Twitter's Weird Control Character Handling


A little curio for you all.

A StackOverflow user has pointed out that certain Twitter profiles contain very odd Unicode characters. What on Earth is going on?

Let's take a look at Bill Clinton's profile on Twitter.

[Screenshot: Bill Clinton's Twitter profile]

Ok, that looks pretty normal. But let's take a look at the HTML source.

[Screenshot: HTML source of Clinton's profile]

Huh... What are those funny characters?

Unicode Character U+0003 is "End of Text" - it's one of the original ASCII Control Characters used to inform a computer to stop processing the received data. In this case, it's the Unicode equivalent of ^C.
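If you want to go hunting for these in a bio yourself, here's a minimal sketch in Python. The function names and the sample text are made up for illustration (it is not the real bio), and the caret notation is just the usual trick of XOR-ing the code point with 0x40.

```python
def caret_notation(ch: str) -> str:
    """Return caret notation for a C0 control character or DEL, e.g. U+0003 -> ^C."""
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        return "^" + chr(code ^ 0x40)
    raise ValueError(f"U+{code:04X} is not a C0 control character or DEL")


def find_control_chars(text: str) -> list[tuple[int, str]]:
    """List (index, 'U+XXXX') for every C0, DEL or C1 control character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F
    ]


# Hypothetical sample text with an End of Text character in the middle.
sample_bio = "Founder,\x03 hypothetical foundation"

print(find_control_chars(sample_bio))  # [(8, 'U+0003')]
print(caret_notation("\x03"))          # ^C
```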

I'm struggling to think of a legitimate use for including this character in one's Twitter Bio. Don't get me wrong, I don't think there's any great conspiracy here - but I wonder what weird app allowed those characters through in the middle of the text.

Twitter will let users have almost any Unicode string as their bio. I wonder if having control characters in there could cause problems for computers processing the text? It would be mightily unusual for code to come across the control characters when parsing text and treat them as real instructions. Although stranger things have happened.

The Twitter API disallows some of these characters in regular Tweets. But not all are banned!

Which Characters Can We Use?

Of the first 32 control characters, most can be used in a Twitter Bio. None can be used in a Tweet.

Here's a snapshot from the Twitter API showing a test Tweet I made.

The characters which can't be used in a Bio are 5,7,8,10,11,12,13,14,15 - the reason why is left as an exercise for the reader.
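Taking that list at face value, a quick sketch of the arithmetic: which of the first 32 code points are left over for a bio? The variable names are mine, and the blocked set is simply the list above, not something read from Twitter's documentation.

```python
# Code points reported above as rejected in a bio (decimal).
BLOCKED_IN_BIO = {5, 7, 8, 10, 11, 12, 13, 14, 15}

# Everything else in the C0 range (0-31) was reported as accepted.
allowed_in_bio = [cp for cp in range(32) if cp not in BLOCKED_IN_BIO]

print(len(allowed_in_bio))  # 23
print([f"U+{cp:04X}" for cp in allowed_in_bio])
```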

Delete (127) and the C1 Control Set (128-159) are much more useful.

Again, another API snapshot - you may need to view the source of the API response in order to see the characters.

All of the characters are stored in both the bio and the status. Each does count towards the character limit - but that doesn't mean we can't have fun with them!
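For anyone wanting to repeat the test, a test string can be built in a couple of lines of Python. This is only a sketch of how one might generate the payload, not how the snapshots above were made; the variable names are assumptions.

```python
# DEL (127) plus the C1 controls (128-159), as code points.
CONTROL_RANGE = range(127, 160)

test_string = "".join(chr(cp) for cp in CONTROL_RANGE)

print(len(test_string))                  # 33 code points
print(len(test_string.encode("utf-8")))  # 65 bytes: DEL is 1 byte, the rest are 2
print(repr(test_string[:3]))             # '\x7f\x80\x81'
```

The `repr` is the handy bit - printed directly, the characters are invisible in most terminals.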

Potential Uses

Ok, so that's... interesting, I guess. Is there anything useful which can be done with these characters?

We can stop Twitter autolinking URLs.

A quick look at the source of that Tweet and you'll see the Unicode Delete character between the dot and the com.
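A rough sketch of how such a Tweet might be composed. `break_autolink` is a hypothetical helper of mine, and whether any given client still autolinks the result is something to test rather than assume.

```python
DEL = "\x7f"  # U+007F DELETE


def break_autolink(domain: str) -> str:
    """Insert DEL after the last dot, so 'example.com' becomes 'example.<DEL>com'."""
    head, dot, tail = domain.rpartition(".")
    if not dot:
        return domain  # no dot, so nothing to break
    return f"{head}.{DEL}{tail}"


tweet_text = break_autolink("example.com")
print(repr(tweet_text))  # 'example.\x7fcom'
```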

Of course, the user can't copy & paste the URL - try it!

This is similar to Marcin Wichary's excellent article on esoteric spacing characters in Unicode.

Top Secret!

An interesting use which springs to mind is steganography - hiding messages within messages. There are 33 control characters which can appear in a status. That's enough for an alphabet's worth of letters - or for a basic code book.
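Here's a toy sketch of the idea, assuming the 33 carriers are code points 127-159. The 33-symbol alphabet and the `hide` / `reveal` helpers are entirely my invention; a real code book could map the carriers to anything.

```python
import string

# 33 carrier characters: DEL (127) plus the C1 controls (128-159).
CARRIERS = [chr(cp) for cp in range(127, 160)]

# A hypothetical 33-symbol alphabet: a-z, space, and six punctuation marks.
ALPHABET = string.ascii_lowercase + " .,!?'-"

ENCODE = dict(zip(ALPHABET, CARRIERS))
DECODE = dict(zip(CARRIERS, ALPHABET))


def hide(cover: str, secret: str) -> str:
    """Append the secret, encoded as control characters, to the cover text.

    Raises KeyError if the secret uses a symbol outside ALPHABET (e.g. digits).
    """
    return cover + "".join(ENCODE[ch] for ch in secret.lower())


def reveal(text: str) -> str:
    """Recover the secret by decoding only the carrier characters."""
    return "".join(DECODE[ch] for ch in text if ch in DECODE)


stego = hide("Lovely weather today", "meet at noon")
print(reveal(stego))  # meet at noon
```

Bear in mind that every hidden character still eats into the character limit.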

Of course, not every Twitter client handles these characters gracefully - Twidere, for example, gives us this hot mess.

[Screenshot: corrupted Tweet in Twidere]

Breaking Things

There's the possibility of causing all sorts of minor vandalism like the above - most text processing libraries should be able to ingest the data without issue, but there's always going to be one or two which will throw up errors.

For example, using example.com/␡test.html in a Tweet rather messes up Facebook.

[Screenshot: the Tweet with the control character, as shown on Facebook]

The text is passed through, but the hyperlink gets mangled by Facebook - it becomes "https://t.co/WQOcry9nhy%7Ftest.html", which is broken.
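That %7F is simply the Delete character (0x7F) percent-encoded. A small sketch with Python's standard `urllib.parse` shows the same transformation. It says nothing about what Facebook's code actually does, and the `path` value is just the example from above.

```python
from urllib.parse import quote, unquote

path = "/\x7ftest.html"  # the DEL character sits right after the slash

encoded = quote(path)
print(encoded)                   # /%7Ftest.html
print(unquote(encoded) == path)  # True
```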

Phishing

There's a minor phishing risk:

With no apparent space between one URL and the next, an unwary user may be fooled into clicking on something dodgy. Not helped by the proliferation of weird TLDs!

Incidentally, when Facebook tries to process that Tweet it suffers a major malfunction:

[Screenshot: Facebook invalid URL]

Obfuscation

Suppose you want to talk about a controversial subject, but don't want people to be able to search for what you're saying. You can tweet about "Game␡rGate" and Twitter's search engine will ignore it.

Handy if you want to gripe about your employer, but are worried that someone will be searching out every mention of them on Twitter.
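The flip side, for anyone building a search index: a naive substring match misses the word, but stripping control characters first finds it again. A minimal sketch, assuming you're happy to drop everything in Unicode category Cc - which also includes tabs and newlines, so treat it as an illustration rather than a production sanitiser. `strip_controls` is my own name for it.

```python
import unicodedata


def strip_controls(text: str) -> str:
    """Drop every character in Unicode category Cc (C0, DEL and C1 controls)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


tweet = "Thoughts on Game\x7frGate today"

print("GamerGate" in tweet)                  # False
print("GamerGate" in strip_controls(tweet))  # True
```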

Anything Else?

If you can think of anything interesting / dastardly / amusing to do with this, please stick a comment in the box. 


2 thoughts on “Twitter's Weird Control Character Handling”

  1. says:

    Hey Terence!

    A prosaic but likely explanation for "why are those characters in Bill Clinton's profile?" is "...copy-and-pasted from Microsoft Word".

    1. Terence Eden says:

      Ha! Who knows what manner of weird stuff MS Word infests text with?
