Twitter's Weird Control Character Handling
A little curio for you all.
A StackOverflow user has pointed out that certain Twitter profiles contain very odd Unicode characters. What on Earth is going on?
Let's take a look at Bill Clinton's profile on Twitter.
Ok, that looks pretty normal. But let's take a look at the HTML source.
Huh... What are those funny characters?
Unicode Character U+0003 is "End of Text" - it's one of the original ASCII Control Characters used to inform a computer to stop processing the received data. In this case, it's the Unicode equivalent of ^C.
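If you want to hunt for these characters yourself, here's a minimal sketch in Python. It assumes you already have the bio as a string; the `bio` value below is made up for illustration and is not the real profile text. It flags anything in Unicode's "Cc" (control) category, which also includes everyday characters like tab and newline.

```python
import unicodedata

def find_control_chars(text):
    """Return (index, code point) pairs for every control character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc"  # Cc = control characters
    ]

# Hypothetical bio with an End of Text character hiding in the middle.
bio = "Founder,\u0003 Clinton Foundation"
print(find_control_chars(bio))  # [(8, 'U+0003')]
```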
I'm struggling to think of a legitimate use for including this character in one's Twitter Bio. Don't get me wrong, I don't think there's any great conspiracy here - but I wonder what weird app allowed those characters through in the middle of the text.
Twitter will let users have almost any Unicode string as their bio. I wonder if having control characters in there could cause problems for computers processing the text? It would be mightily unusual for code to come across the control characters when parsing text and treat them as real instructions. Although stranger things have happened.
The Twitter API disallows some of these characters in regular Tweets. But not all are banned!
Which Characters Can We Use?
Of the first 32 control characters, most can be used in a Twitter Bio. None can be used in a Tweet.
Here's a snapshot from Twitter API showing a test Tweet I made.
The characters which can't be used in a Bio are 5,7,8,10,11,12,13,14,15 - the reason why is left as an exercise for the reader.
The range 127-159 (Delete, plus the C1 Control Set at 128-159) is much more useful.
Again, another API snapshot - you may need to view the source of the API response in order to see the characters.
All of the characters are stored in both the bio and the status. Each does count towards the character limit - but that doesn't mean we can't have fun with them!
Potential Uses
Ok, so that's... interesting, I guess. Is there anything useful which can be done with these characters?
Stop Auto-Links
We can stop Twitter autolinking URLs.
A quick look at the source of that Tweet and you'll see the Unicode Delete character between the dot and the com.
Of course, the user can't copy & paste the URL - try it!
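Building such a string is trivial. Here's a rough sketch, assuming a bare domain name as input; `break_autolink` is just a name I've made up for illustration, and whether a given client still declines to autolink the result is something you'd have to test.

```python
DELETE = "\u007f"  # Unicode Delete, the same character as ASCII DEL

def break_autolink(domain):
    """Insert a Delete character straight after the last dot in a domain."""
    head, dot, tail = domain.rpartition(".")
    if not dot:
        return domain  # no dot, so nothing to break
    return head + dot + DELETE + tail

broken = break_autolink("example.com")
print(repr(broken))  # 'example.\x7fcom'
```

The result looks identical on screen, but it is one character longer than the original.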
This is similar to Marcin Wichary's excellent article on esoteric spacing characters in Unicode.
Top Secret!
An interesting use which springs to mind is steganography - hiding messages within messages. There are 33 control characters which can appear in a status. That's enough for an alphabet's worth of letters - or for a basic code book.
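As a toy illustration, here's a sketch which maps a 33 symbol alphabet onto code points 127-159 and tacks the result onto an innocent looking message. The alphabet, the function names, and the choice to append rather than interleave are all my own assumptions - a real code book could look completely different.

```python
# 26 letters plus 7 punctuation symbols = 33, one per code point in 127-159.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,?!'-"
BASE = 0x7F  # 127, the Delete character
TOP = 0x9F   # 159, the last C1 control character

def hide(cover, secret):
    """Append the secret to the cover text as invisible control characters."""
    # Raises ValueError if the secret uses a symbol outside ALPHABET.
    hidden = "".join(chr(BASE + ALPHABET.index(ch)) for ch in secret.lower())
    return cover + hidden

def reveal(text):
    """Pull out every character in 127-159 and map it back to the alphabet."""
    return "".join(
        ALPHABET[ord(ch) - BASE] for ch in text if BASE <= ord(ch) <= TOP
    )

stego = hide("Nice weather today", "meet at noon")
print(reveal(stego))  # meet at noon
```

Remember that every hidden character still counts towards the character limit, so the secret has to be short.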
Of course, not every Twitter client handles these characters gracefully - Twidere, for example, gives us this hot mess.
Breaking Things
There's the possibility of causing all sorts of minor vandalism like the above - most text processing libraries should be able to ingest the data without issue, but there's always going to be one or two which will throw up errors.
For example, using example.com/␡test.html in a Tweet rather messes up Facebook.
The text is passed through, but the hyperlink gets mangled by Facebook - it becomes "https://t.co/WQOcry9nhy%7Ftest.html", which is broken.
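That "%7F" is simply the Delete character (code point 127, hex 7F) after percent-encoding. A quick sketch with Python's standard library shows the same transformation; I'm not claiming this is what Facebook runs, only that it is the standard encoding.

```python
from urllib.parse import quote

path = "\u007ftest.html"  # Delete character followed by the file name
encoded = quote(path)     # percent-encode anything that isn't URL safe
print(encoded)            # %7Ftest.html
```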
Phishing
There's a minor phishing risk:
With no apparent space between one URL and the next, an unwary user may be fooled into clicking on something dodgy. Not helped by the proliferation of weird TLDs!
Incidentally, when Facebook tries to process that Tweet it suffers a major malfunction:
Obfuscation
Suppose you want to talk about a controversial subject, but don't want people to be able to search for what you're saying. You can tweet about "Game␡rGate" and Twitter's search engine will ignore it.
Handy if you want to gripe about your employer, but are worried that someone will be searching out every mention of them on Twitter.
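The trick only works while the searcher matches on the raw text. A determined snoop could strip the control characters before searching, as in this sketch. I have no idea how Twitter's search actually normalises text, so treat this purely as an illustration; `strip_control_chars` is a made-up helper, and note that it also removes tabs and newlines.

```python
import unicodedata

def strip_control_chars(text):
    """Drop every character in Unicode's control (Cc) category."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

tweet = "Game\u007frGate"  # the obfuscated keyword from the example above
print("GamerGate" in tweet)                       # False
print("GamerGate" in strip_control_chars(tweet))  # True
```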
Anything Else?
If you can think of anything interesting / dastardly / amusing to do with this, please stick a comment in the box.
Joe Loughry says:
Hey Terence!
A prosaic but likely explanation for "why are those characters in Bill Clinton's profile?" is "...copy-and-pasted from Microsoft Word".
Terence Eden says:
Ha! Who knows what manner of weird stuff MS Word infests text with?