More adventures with Unicode. I logged in to my Virgin Media account to see when my promotional discount would end. Here's what their billing PDF said.
Let'S Ignore The Weird Capitalisation Virgin'S System Uses. What's that Â doing there?
Their website says:
No Â symbol, but also no £ sign. Ah, but let's look at the underlying code.
What's that weird character? It is the control character U+009C, the String Terminator, of course...
Well, my discount is nearly finished, so I asked them for a larger discount. "Sure!" they said "How does Ÿ3 sound?"
Amusingly, when I copy the Ÿ from that PDF, it shows up as the character œ.
What's Going On?
I've written extensively about how the £ symbol is encoded - but here's a primer.
- £ in ISO-8859-1 (Latin-1) is decimal 163.
- £ in Unicode is also 163, but it gets stored as two UTF-8 bytes: 194 163. In hex this is 0xC2 0xA3.
- In Windows-1252, the legacy encoding for ancient versions of Microsoft's software, 0xC2 gets rendered as Â.
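Those numbers are easy to check. Here is a minimal sketch using only Python's built-in codecs; the variable names are mine and purely illustrative.

```python
pound = "£"

latin1_bytes = pound.encode("latin-1")  # ISO-8859-1: one byte
utf8_bytes = pound.encode("utf-8")      # UTF-8: two bytes

print(list(latin1_bytes))  # [163]
print(ord(pound))          # 163, i.e. U+00A3
print(list(utf8_bytes))    # [194, 163]
print(utf8_bytes.hex())    # c2a3

# Read on its own as Windows-1252, the first UTF-8 byte is Â (U+00C2).
first_as_cp1252 = bytes([0xC2]).decode("cp1252")
assert first_as_cp1252 == "Â"
```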
So, at some point, Virgin's billing software is seeing 0xC2 0xA3, interpreting it as Â£, and then grabbing the first character to print on the bills.
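I have no idea what Virgin's code actually looks like, but the symptom can be reproduced in a few lines. This is a sketch under that assumption; `mangle_pound` is a hypothetical helper, not anything taken from their systems.

```python
def mangle_pound(text: str) -> str:
    """Hypothetical reproduction: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


mojibake = mangle_pound("£3.00")
assert mojibake == "Â£3.00"

# Keeping only the first character of the mangled symbol drops the £ itself.
symbol = mangle_pound("£")[0]
assert symbol == "Â"
```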
Where do the other characters come from?
- In Code Page 437, an ancient IBM encoding, the £ is 0x9C.
- 0x9C in Windows-1252 is œ.
- The String Terminator is U+009C.
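The same single byte can be read three ways, which a short Python sketch makes concrete. It only uses the standard library codecs; nothing here is taken from Virgin's systems.

```python
import unicodedata

byte_9c = "£".encode("cp437")            # £ in Code Page 437
assert byte_9c == b"\x9c"

as_cp1252 = byte_9c.decode("cp1252")     # same byte read as Windows-1252
assert as_cp1252 == "œ"

as_latin1 = byte_9c.decode("latin-1")    # same byte read as ISO-8859-1
assert ord(as_latin1) == 0x9C            # U+009C, the String Terminator
assert unicodedata.category(as_latin1) == "Cc"  # a control character
```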
The Ÿ character? Not a clue! Inspecting the raw text of the PDF shows the underlying code is:
6m \2343.00 RIV Discount.
PDFs escape characters using octal codes. Octal 234 is decimal 156, which is hex 0x9C. Nearest I can get is the ISO/IEC 8859-15 encoding, where Ÿ is 0xBE. Perhaps a font substitution error?
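The octal arithmetic can be checked the same way. This sketch assumes the escape is a backslash followed by three octal digits, as in the PDF snippet quoted earlier; the slicing is just for illustration.

```python
raw = r"\2343.00"   # as it appears in the PDF's raw text

octal_digits = raw[1:4]            # "234"
value = int(octal_digits, 8)
print(value, hex(value))           # 156 0x9c

# For comparison, Ÿ in ISO/IEC 8859-15 is a different byte entirely.
y_diaeresis = "Ÿ".encode("iso-8859-15")
print(y_diaeresis.hex())           # be
```

So the byte in the PDF is 0x9C, not 0xBE, which is why the Ÿ remains unexplained.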
Everything is awful
This isn't just ugly. It points to the fact that Virgin don't test their software and don't upgrade their systems. What other horrors lie in their technology stack?
And it isn't just a tech issue. It is bad for screenreaders - meaning visually impaired users get a poor experience.
The year is 2018. And we're still battling text encoding issues due to crappy software.