únicode is hard


Receipts with a missing pound sign

In the last couple of months, I've been seeing the ú symbol on British receipts. Why? 1963 - ASCII In the beginning* was ASCII. A standard way for computers to exchange text. ASCII was originally designed with 7 bits - that means 128 possible symbols. That ought to be enough for everyone, right? Wrong! ASCII […] Read More
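The excerpt cuts off before it answers the "Why?", so the following is only one plausible mechanism, not necessarily the one the full post describes. The pound sign is byte 0xA3 in Latin-1, and that same byte is ú in the old IBM code page 437. A minimal Python sketch, assuming a till that sends Latin-1 and a printer that reads code page 437:

```python
# ASCII is a 7-bit code, so it has room for 2**7 = 128 symbols.
ASCII_SYMBOLS = 2 ** 7

# Assumption (not confirmed by the excerpt): the till encodes "£" as one
# Latin-1 byte, and the printer interprets that byte as code page 437.
pound_byte = "£".encode("latin-1")
misread = pound_byte.decode("cp437")

print(ASCII_SYMBOLS)     # 128
print(pound_byte.hex())  # a3
print(misread)           # ú

# "£" is not in ASCII at all, so a strict ASCII encoder rejects it.
try:
    "£".encode("ascii")
    pound_fits_in_ascii = True
except UnicodeEncodeError:
    pound_fits_in_ascii = False

print(pound_fits_in_ascii)  # False
```

Both `latin-1` and `cp437` are codecs shipped with Python, so this runs without extra packages.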

How Do You Sort Chinese Numbers?


Chinese characters in filenames sorted in Linux - the files are in the wrong order

Imagine you have a series of numbers you wish to sort. Sorting is a well-known computer science problem - generally speaking you compare one value to the next and then move the item either up or down a list. With "English" characters, that's fairly easy. When a computer sees the character 1 it's really […] Read More
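To make the problem concrete, here is a small Python sketch of my own, not taken from the full post. It sorts the Chinese numerals for 1 to 5 twice: once with the default comparison, which uses code points, and once using the numeric value that the Unicode database assigns to each character.

```python
import unicodedata

# The numerals 3, 1, 5, 2, 4, deliberately out of order.
numerals = ["三", "一", "五", "二", "四"]

# Default sorting compares code points, which are unrelated to numeric value.
by_code_point = sorted(numerals)

# unicodedata.numeric() returns the numeric value Unicode records for a character.
by_value = sorted(numerals, key=unicodedata.numeric)

print(by_code_point)  # ['一', '三', '二', '五', '四']
print(by_value)       # ['一', '二', '三', '四', '五']
```

This only handles single-character numerals. A number written with several characters, such as 二十, would need to be parsed before it could be compared.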

How to type Emoji in Ubuntu


New tech site Gadgette has a great article on how to type Emoji on Mac and Windows - but they (understandably) didn't cover Ubuntu. So here I am to show you how. Get The Fonts If your computer doesn't have the requisite font, install the latest version of Symbola. Simply open up the .zip file, […] Read More

Twitter's Weird Control Character Handling


A little curio for you all. A StackOverflow user has pointed out that certain Twitter profiles contain very odd Unicode characters. What on Earth is going on? Let's take a look at Bill Clinton's profile on Twitter. Ok, that looks pretty normal. But let's take a look at the HTML source. Huh... What are those […] Read More

Searching For A Smile


What happens if you search the web for the Unicode character "☺"? On the one hand, it's a symbol just like the letter A or the punctuation mark "!" - on the other, it contains semantic meaning. A smiling, happy face. I decided to look at a few popular search engines to see what they'd […] Read More

Facebook Mangles Unicode URLs


Facebook rewrite URLs with Unicode in the path - this is not best practice and could be dangerous. It is possible to create a URL like http://bit.ly/😀 - the Unicode characters are valid in the path. The URL-encoded representation is: bit.ly/%F0%9F%98%80 Facebook mangles these URLs in such a way that it might be […] Read More
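The percent-encoded form quoted above can be reproduced with Python's standard library. This sketch only shows how the emoji becomes `%F0%9F%98%80`. It does not attempt to reproduce whatever Facebook does to the URL, which the excerpt does not spell out.

```python
from urllib.parse import quote, unquote

path = "😀"  # U+1F600

# quote() percent-encodes the UTF-8 bytes of any non-ASCII character.
encoded = quote(path)
utf8_bytes = path.encode("utf-8").hex(" ").upper()

print(encoded)     # %F0%9F%98%80
print(utf8_bytes)  # F0 9F 98 80

# Decoding the percent-encoded form gets the original character back.
round_trip = unquote(encoded)
print(round_trip == path)  # True
```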

Evading Profanity Filters Using Bi-Directional Text


There are some very sensitive souls on the Internet who object to seeing swear words. To that end, a huge industry has sprung up around "Profanity Filters" - services which claim to be able to detect naughty words and automatically redact them. The approach of dumbly looking for strings of text leads to a range […] Read More
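The excerpt stops before describing the technique, so this is a hedged sketch of the general idea the title suggests rather than the post's actual method. A filter that looks for plain substrings can miss a word that is stored backwards and wrapped in a right-to-left override, even though a renderer that honours the override displays it the right way round. The blocklist, the `naive_filter` function and the `disguise` helper are all hypothetical names invented for this illustration.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Hypothetical blocklist with a deliberately mild word.
BLOCKLIST = {"darn"}


def naive_filter(text):
    """Return True if any blocklisted word appears as a plain substring."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def disguise(word):
    """Store the word reversed, wrapped so that it is displayed right-to-left."""
    return RLO + word[::-1] + PDF


plain = "well darn"
sneaky = "well " + disguise("darn")

print(naive_filter(plain))   # True
print(naive_filter(sneaky))  # False
```

The string in `sneaky` contains "nrad" between two invisible control characters, so the substring check never sees "darn".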

RTL Bugs


Take a look at the following text, looks normal enough doesn't it? "Harry ‮".draziw a si ‭Potter Now, try to select the text and see what happens. WHAT WITCHCRAFT IS THIS?! If you examine the source code for this page, you'll see that I'm using the Unicode Bi-Directional characters. "Harry ‮".draziw a si ‭Potter These […] Read More
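If you would rather not dig through the page source, a few lines of Python can make the invisible characters visible. This is a minimal sketch, assuming the two hidden characters in the example are U+202E (right-to-left override) and U+202D (left-to-right override). The `reveal_bidi` function is my own illustration and covers only the embedding, override and isolate controls.

```python
import unicodedata

# Bidirectional classes of the embedding, override and isolate control characters.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def reveal_bidi(text):
    """Replace each bidi control character with a visible <U+XXXX> marker."""
    out = []
    for ch in text:
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


# The example text, written with explicit escapes for the hidden characters.
sample = 'Harry \u202e".draziw a si \u202dPotter'
revealed = reveal_bidi(sample)

print(revealed)  # Harry <U+202E>".draziw a si <U+202D>Potter
```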