Twitter's Weird Control Character Handling


A little curio for you all. A StackOverflow user has pointed out that certain Twitter profiles contain very odd Unicode characters. What on Earth is going on? Let's take a look at Bill Clinton's profile on Twitter. Ok, that looks pretty normal. But let's take a look at the HTML source. Huh... What are those funny characters? Unicode Character U+0003 is "End of Text" - it's one of the original ASCII Control Characters used to inform a computer to stop processing the received data. In…

Continue reading →
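
As a rough illustration of the excerpt above, here is a minimal Python sketch for spotting control characters such as U+0003 in a string. The function name and the sample string are hypothetical and are not taken from the original post or from Twitter's actual markup.

```python
import unicodedata


def find_control_characters(text):
    """Return (index, code point) pairs for every character in category Cc."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc"
    ]


# Hypothetical sample: a display name with a stray End of Text character.
sample = "Bill Clinton\x03"
print(find_control_characters(sample))
```

Note that tabs and newlines are also in category Cc, so real HTML would need those filtered out before anything is flagged as odd.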

Searching For A Smile


What happens if you search the web for the Unicode character "☺"? On the one hand, it's a symbol just like the letter A or the punctuation mark "!" - on the other, it contains semantic meaning. A smiling, happy face. I decided to look at a few popular search engines to see what they'd return. First up, a surprisingly poor entry from Google. The site at the top by Tim Whitlock is fine - but it's hardly an authoritative site. The rest appear to be random sites with the character in the URL. …

Continue reading →

Facebook Mangles Unicode URLs


2025 Update - Bitly removed the ability to create emoji links, so some of these links are now dead. Facebook rewrite URLs with Unicode in the path - this is not best practice and could be dangerous. It is possible to create a URL like http://bit.ly/😀 - the Unicode characters are valid in the path. The URL-encoded representation is: bit.ly/%F0%9F%98%80 Facebook mangles these URLs in such a way that it might be possible to redirect a user to a malicious site. Here's what's happening. When …

Continue reading →
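
The percent-encoded form quoted in the excerpt above can be reproduced with Python's standard library. This is only a sketch of the encoding step. It says nothing about how Facebook or Bitly process the URL, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# U+1F600 GRINNING FACE, written as an escape so the source stays ASCII.
emoji_path = "\U0001F600"

# quote() encodes the text as UTF-8, then percent-encodes each byte.
encoded = quote(emoji_path)
print("bit.ly/" + encoded)  # bit.ly/%F0%9F%98%80

# Decoding should round-trip back to the original character.
print(unquote(encoded) == emoji_path)  # True
```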

Evading Profanity Filters Using Bi-Directional Text


There are some very sensitive souls on the Internet who object to seeing swear words. To that end, a huge industry has sprung up around "Profanity Filters" - services which claim to be able to detect naughty words and automatically redact them. The approach of dumbly looking for strings of text leads to a range of problems, including false positives (known colloquially as the Scunthorpe Problem). A common way to bypass these filters is to use homoglyphs - substituting a lower-case L for an…

Continue reading →
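
To make the false-positive problem in the excerpt above concrete, here is a deliberately naive substring filter in Python. The block list and function name are hypothetical, and the banned word is a mild stand-in. This is not how any particular filtering product works.

```python
# Hypothetical, deliberately mild block list.
BANNED = {"hell"}


def naive_filter(text):
    """Redact every occurrence of a banned string, even inside other words."""
    result = text
    for word in BANNED:
        result = result.replace(word, "*" * len(word))
    return result


# Innocent words containing the banned string are mangled too.
print(naive_filter("hello from a shell script"))  # ****o from a s**** script
```

The same blind matching also misses any spelling it has not been told about, which is what homoglyph substitution exploits.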

RTL Bugs


Take a look at the following text, looks normal enough doesn't it? "Harry ‮".draziw a si ‭Potter Now, try to select the text and see what happens. WHAT WITCHCRAFT IS THIS?! If you examine the source code for this page, you'll see that I'm using the Unicode Bi-Directional characters. "Harry ‮".draziw a si ‭Potter These characters are useful when writing text that includes, say, English and Arabic - but they can also be used for malicious purposes. On a more mundane level, the…

Continue reading →
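
A minimal Python sketch for finding the Bi-Directional control characters mentioned in the excerpt above. The sample string is modelled on the excerpt (U+202E Right-to-Left Override, then U+202D Left-to-Right Override), and the function name and set of classes checked are my own choices.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(text):
    """Return (index, code point, bidi class) for each bidi control character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


# Escapes are used so the invisible characters are visible in the source.
sample = 'Harry \u202e".draziw a si \u202dPotter'
for hit in find_bidi_controls(sample):
    print(hit)
```

A check like this could run before text is displayed or filtered, although legitimate mixed-direction text may contain these characters too.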

Homoglyph Attacks


Homoglyphs are characters that look strikingly similar to each other. Can you quickly tell the difference between these two - O0? That's the capital letter "o" and the number 0. How about Il1|? Depending on the font used - and your attention to detail, it may be hard to spot the difference between all three. The sites homoglyphs.net and IronGeek are great resources for creating text which uses similar-looking - but not identical - characters. Τһⅰѕ text may loоk lik…

Continue reading →
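
One simple way to see through homoglyphs is to print the Unicode name of every non-ASCII character. The Python sketch below does that for a hypothetical four-character string that resembles "This", in the same spirit as the example in the excerpt above. The function name is illustrative.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (code point, Unicode name) for each non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in text
        if ord(ch) > 127
    ]


# Looks like "This", but contains no ASCII letters at all.
sample = "\u03a4\u04bb\u2170\u0455"
for code_point, name in describe_non_ascii(sample):
    print(code_point, name)
```

Flagging every non-ASCII character is far too blunt for real multilingual text. It is only meant to show that the lookalikes come from different scripts.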

Let's get the IEC Power Symbol into Unicode


I've just launched a campaign to get the IEC Power Symbol into Unicode! A couple of months ago, I asked this question on HackerNews: I was looking for the electrical "standby" symbol - AKA IEC5009 / IEEE1621. You know, the circle with the line through it. The one that's on every single bloody piece of electronic equipment produced since the mid-1970s. It's not in the Unicode standard. I can, if I want, have a snowman ☃ or a reversed rotated floral bullet ☙. What other useful and/or imp…

Continue reading →

Subsetting (Chinese) Fonts


There are loads of really delightful Simplified and Traditional Chinese TrueType fonts available on the web. There's only one issue - the file sizes are really large. In many cases, too large to effectively use as a web-font. For example, this calligraphy style font is 3.4MB. The beautiful Paper Cut Font weighs in at 14MB! That file-size is far too heavy to embed on a web page. Subsetting: Generally speaking, font files like .ttf contain a representation of every single character. 0-9,…

Continue reading →

HOWTO: Make a Doctor Who "Bells of St John" Style WiFi Name


No spoilers, sweetie :-) This evening's Doctor Who - The Bells of St John - revolves around mysterious WiFi signals. Alien SSIDs which, if you connect to them... well, watch the episode to find out! In the show, they look like these: So, can we do the same thing for our home WiFi network? Yup! There are some limitations though. SSIDs can only have a maximum length of 32 bytes. Those are usually interpreted as 8-bit characters, so if you're using multibyte Unicode characters, you're…

Continue reading →
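
The 32-byte limit in the excerpt above is easy to check before typing a name into a router. This Python sketch assumes the SSID will be sent as UTF-8, which is an assumption on my part. The function names and candidate names are hypothetical.

```python
MAX_SSID_BYTES = 32  # limit quoted in the excerpt


def ssid_byte_length(name):
    """Length of the name in bytes, assuming UTF-8 encoding."""
    return len(name.encode("utf-8"))


def fits_in_ssid(name):
    """True if the UTF-8 encoded name fits within the SSID limit."""
    return ssid_byte_length(name) <= MAX_SSID_BYTES


# U+2603 SNOWMAN takes 3 bytes in UTF-8, so 10 fit and 11 do not.
for candidate in ["HomeNetwork", "\u2603" * 10, "\u2603" * 11]:
    print(ssid_byte_length(candidate), fits_in_ssid(candidate))
```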

Usability of mixing LTR and RTL text?


Annoyingly, FourSquare has started to be a source of spam for me. I get friend requests from people who only like certain brands of stores, from recruitment consultants trying to work out who I'm visiting, and from cultists who are desperate for me to visit Scientology centres. I also get friend requests from people I've never met, including from Ahmed al-Najjar (احمد النجار). I've never met Ahmed and I've no wish to taint him as a spammer - I'm sure he's just misclicked in his friend request …

Continue reading →