Everything you know about Twitter character counting is wrong


How many characters can a Tweet contain? It used to be 140, back in the good old days. Now it's 280. Unless you're Japanese. Let me explain…

I run OpenBenches - a site which collects memorial benches. When a user adds a bench, the inscription is automatically Tweeted. If the inscription is longer than 280 characters, it is truncated.

The PHP code to truncate text to a specific length is pretty simple:

 $tweet_inscription = mb_substr($inscription, 0, 280);

I use Multibyte String operations because a single character can take up more than one byte, especially if the text contains characters outside of the "Latin" range.
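To make the byte-versus-character difference concrete, here's a small sketch. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are just for illustration.

```php
<?php
// Requires the mbstring extension; assumes this file is saved as UTF-8.
$word = "月曜"; // Two Japanese characters

$byte_count = strlen($word);             // Counts bytes: 6 in UTF-8
$char_count = mb_strlen($word, 'UTF-8'); // Counts characters: 2

// substr() cuts by bytes and can slice a character in half.
// mb_substr() cuts on character boundaries.
$first_char = mb_substr($word, 0, 1, 'UTF-8'); // "月"
$broken     = substr($word, 0, 1);             // A lone byte, not valid UTF-8

echo "$byte_count bytes, $char_count characters, first: $first_char\n";
```

Using substr() here would risk Tweeting half a character.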

But this is not sufficient for Twitter.

My friend, Beck Strickland, recently added some gorgeous photographs of benches in Hiroshima, Japan. The inscriptions were pretty long, and were not being Tweeted.

The error I was getting back from Twitter was "186: Tweet needs to be a bit shorter."

Here's the Japanese text:

まちを緑でいっぱいに ! 広島ゾンタクラブ医療法人あさだ会有設計工房岩重老沼薬品広島市立山田小学校広島県立皆実高校小田耳鼻咽喉科医院翠清会梶川病院広島大学総合科学部片桐眼科広島歯科医院しいのレディースクリニック石橋三千男事務所 Akemi Engish School 高橋内科小児科広島総合病院川村,大迫法律事務所 (株)山本薬品㈲)メディカルサービス老人保険施設ベルローゼスタッフ·トゥー·ワン原田病院広島大学医学部ひろしま通訳·ガイド協会有森信パーキングビル (株)幸房

Go paste that into any character counter. Even after newlines, it's a heck of a lot less than 280. But Twitter disagrees!

[Screenshot: Japanese text pasted into the Twitter compose window, which shows that there are too many characters.]

What this problem is not

I wasted a lot of time looking at esoteric Unicode Normalisation Forms as directed by the Twitter documentation on Counting Characters.

Tweet length is measured by the number of codepoints in the NFC normalized version of the text.

That wasn't the answer.
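For the record, this is roughly what counting codepoints after NFC normalisation looks like. It's a sketch which assumes the intl extension (for the Normalizer class) and mbstring are both available. It shows why normalisation matters for accented Latin text, and also why it was a dead end for my Japanese inscription.

```php
<?php
// Requires the intl extension (Normalizer) and mbstring.
$decomposed = "e\u{0301}"; // "e" followed by a combining acute accent

// NFC squashes the pair into the single codepoint U+00E9 (é)
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

$before = mb_strlen($decomposed, 'UTF-8'); // 2 codepoints
$after  = mb_strlen($composed, 'UTF-8');   // 1 codepoint

echo "Before NFC: $before, after NFC: $after\n";
```

Normalisation can only ever make the count smaller or leave it alone, so it couldn't explain a short text being rejected as too long.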

The documentation did point to a page about the Twitter Text Parsing Library. Which, helpfully, has this to say on the matter:

The Configuration defines Unicode code point ranges, with a weight associated with each of these ranges. This enables language density to be taken into consideration when counting characters.

Aha! Japanese is a fairly dense language. The English word Monday has six characters, the Japanese translation 月曜 has two.

When Twitter announced the increase to 280 characters they published a blog post saying:

We want every person around the world to easily express themselves on Twitter, so we're doing something new: we're going to try out a longer limit, 280 characters, in languages impacted by cramming (which is all except Japanese, Chinese, and Korean).

やった!

When I first wrote this post, there was nothing that I could find in the official Twitter Developer Documentation which explained how these weights are calculated. Aside from the above blog post, I couldn't find anything which explained that CJK characters are counted differently.

The Twitter Text library is open source, but doesn't explicitly spell out how the weightings are calculated. There are implementations in Ruby, JS, Objective C, and Java. Thankfully, Takashi Nojima maintains a PHP version.

Because Tweets can contain a mixture of languages, Twitter has developed a "Weighted Length" algorithm. For example "石橋三千男事務所 Akemi Engish School" looks to have 28 characters but, because some of them are in a higher density script, Twitter gives a weightedLength of 36.
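Here's a rough sketch of how that 36 comes about. This is not the Twitter Text implementation. The function name is made up, and the "light" ranges are my reading of the library's published configuration, so treat them as an assumption and check them against the library. It also ignores URLs, emoji, and normalisation, all of which the real parser handles. It needs PHP 7.4+ with mbstring.

```php
<?php
// Hypothetical helper for illustration only. NOT the twitter-text algorithm.
function sketch_weighted_length(string $text): int
{
    // ASSUMPTION: codepoints in these ranges count as 1.
    // Everything else (including CJK) counts as 2.
    $light_ranges = [
        [0x0000, 0x10FF],
        [0x2000, 0x200D],
        [0x2010, 0x201F],
        [0x2032, 0x2037],
    ];

    $length = 0;
    // Walk the string one codepoint at a time
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $code_point = mb_ord($char, 'UTF-8');
        $weight = 2; // Default: a "dense" character
        foreach ($light_ranges as [$start, $end]) {
            if ($code_point >= $start && $code_point <= $end) {
                $weight = 1;
                break;
            }
        }
        $length += $weight;
    }
    return $length;
}

// 8 kanji at 2 each, plus 20 Latin characters and spaces at 1 each
echo sketch_weighted_length('石橋三千男事務所 Akemi Engish School'); // 36
```

For anything you actually intend to Tweet, use the real library rather than this sketch.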

This makes truncating Tweets automatically somewhat difficult!

10 Is weightedLength 280 or less?
20 Yes - Tweet it.
30 No - remove one character from the end.
40 GOTO 10
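In PHP, that loop might look something like the sketch below. The function name and the toy weighing closure are both invented for this example. The closure simply counts ASCII as 1 and everything else as 2, which is a stand-in rather than Twitter's real rules. In practice you'd pass in something backed by the Twitter Text library. It needs PHP 7.4+ with mbstring.

```php
<?php
// Hypothetical helper: chop characters off the end until the text fits.
function truncate_by_weight(string $text, int $limit, callable $weigh): string
{
    // Lines 10 to 40 of the loop above
    while ($text !== '' && $weigh($text) > $limit) {
        $text = mb_substr($text, 0, mb_strlen($text, 'UTF-8') - 1, 'UTF-8');
    }
    return $text;
}

// Stand-in measure for demonstration only: ASCII = 1, everything else = 2
$toy_weigh = function (string $text): int {
    $total = 0;
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $total += (strlen($char) === 1) ? 1 : 2;
    }
    return $total;
};

// "月曜日 Monday" weighs 13 under the toy measure, so it gets cut down to 8
$short = truncate_by_weight('月曜日 Monday', 8, $toy_weigh);
echo $short; // 月曜日 M
```

It re-weighs the whole string on every pass, which is wasteful but fine for something the size of a Tweet.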

OK, it isn't quite that bad. The PHP implementation lets you find the valid length of the string using:

 $data = \Twitter\Text\Parser::create()->parseTweet($tweet_inscription);
 echo $data->validRangeEnd;

Is everything you know about Twitter character counting wrong?

An inflammatory clickbait title? I don't think so. There was literally no way of knowing how the character count algorithm works from reading the developer documentation.

How do you count the characters in an emoji? Do you count the skin tone and gender modifiers separately? Again, the developer documentation didn't say anything.
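To see why that's a fair question, here's a sketch showing what a couple of emoji are made of. It says nothing about how Twitter counts them. It only shows that a single visible emoji can be several codepoints. The helper function is made up for this example, and it needs PHP 7.4+ with mbstring.

```php
<?php
// One visible emoji each, but built from several codepoints
$thumbs_up = "\u{1F44D}\u{1F3FD}"; // Thumbs up + medium skin tone modifier
$family    = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}"; // Woman, ZWJ, woman, ZWJ, girl

// Hypothetical helper: list the codepoints in a string as U+XXXX
function code_points(string $text): array
{
    return array_map(
        fn(string $char): string => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        mb_str_split($text, 1, 'UTF-8')
    );
}

echo mb_strlen($thumbs_up, 'UTF-8'), "\n";      // 2 codepoints
echo strlen($thumbs_up), "\n";                  // 8 bytes
echo implode(' ', code_points($family)), "\n";  // U+1F469 U+200D U+1F469 U+200D U+1F467
```

So "one character" could plausibly mean 1, 2, 5, or 8, depending on what you decide to count.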

Developer documentation is essential. And it needs to be kept up to date with the latest changes. I'm friends with some of the folk on the Twitter Developer Team (with friends like this, right?!) and I shared my concerns with them.

Since writing this post, I've been working with them to improve the developer documentation. I'm happy to say that the documentation now reflects reality.

The counting characters page now explains how to count CJK characters and complex emoji. Enjoy!
