Everything you know about Twitter character counting is wrong

by @edent | # # #

How many characters can a Tweet contain? It used to be 140, back in the good old days. Now it's 280. Unless you're Japanese. Let me explain…

I run OpenBenches - a site which collects memorial benches. When a user adds a bench, the inscription is automatically Tweeted. If the inscription is longer than 280 characters, it is truncated.

The PHP code to truncate text to a specific length is pretty simple:

$tweet_inscription = mb_substr($inscription, 0, 280);

I use Multibyte String operations, because a single character can take up more than one byte. Especially if the text contains characters outside of the "Latin" range.

But this is not sufficient for Twitter.

My friend, Beck Strickland, recently added some gorgeous photographs of benches in Hiroshima, Japan. The inscriptions were pretty long, and were not being Tweeted.

A large block of Japanese text.

The error I was getting back from Twitter was "186: Tweet needs to be a bit shorter."

Here's the Japanese text:

まち
を 緑でい っ ぱ
い に
!
広島ゾンタクラブ 医療法人あさだ会 有設計工房岩重老沼薬品 広島市立山田小学校広島県立皆実高校 小田耳鼻咽喉科医院 翠清会梶川病院 広島大学総合科学部
片桐眼科 広島歯科医院 しいのレディースクリニック 石橋三千男事務所 Akemi Engish School 高橋内科小児科 広島総合病院 川村,大迫法律事務所 (株)山本薬品
㈲)メディカルサービス 老人保険施設ベルローゼ スタッフ·トゥー·ワン 原田病院 広島大学医学部 ひろしま通訳·ガイド協会 有森信パーキングビル (株)幸房

Go paste that into any character counter. Even after newlines, it's a heck of a lot less than 280. But Twitter disagrees!

Japanese text pasted into the Twitter compose window. It is showing that there are too many characters.

What this problem is not

I wasted a lot of time looking at esoteric Unicode Normalisation Forms as directed by the Twitter documentation on Counting Characters.

Tweet length is measured by the number of codepoints in the NFC normalized version of the text.

That wasn't the answer.

The documentation did point to a page about the Twitter Text Parsing Library. Which, helpfully, has this to say on the matter:

The Configuration defines Unicode code point ranges, with a weight associated with each of these ranges. This enables language density to be taken into consideration when counting characters.

Aha! Japanese is a fairly dense language. The English word Monday has six characters, the Japanese translation 月曜 has two.

When Twitter announced the increase to 280 characters they published a blog post saying:

We want every person around the world to easily express themselves on Twitter, so we're doing something new: we're going to try out a longer limit, 280 characters, in languages impacted by cramming (which is all except Japanese, Chinese, and Korean).

やった!

Loving the algorithm

When I first wrote this post, there was nothing that I could find in the official Twitter Developer Documentation which explained how these weights are calculated. Aside from the above blog post, I couldn't find anything which explained that CJK characters are counted differently.

The Twitter Text library is open source, but doesn't explicitly spell out how the weightings are calculated. There are implementations in Ruby, JS, Objective C, and Java. Thankfully, Takashi Nojima maintains a PHP version.

Because Tweets can contain a mixture of languages, Twitter has developed a "Weighted Length" algorithm. For example "石橋三千男事務所 Akemi Engish School" looks to have 28 characters but, because some of them are in a higher density script, Twitter gives a weightedLength of 36.

This makes truncating Tweets automatically somewhat difficult!

10 Is weightedLength less than 280?
20 Yes - Tweet it.
30 No - remove one character from the end.
40 GOTO 10

OK, it isn't quite that bad. The PHP implementation lets you find the valid length of the string using:

$data = \Twitter\Text\Parser::create()->parseTweet($tweet_inscription);
echo $data->validRangeEnd;

Is everything you know about Twitter character counting wrong?

An inflammatory clickbait title? I don't think so. There was literally no way of knowing how the character count algorithm works from reading the developer documentation.

How do you count the characters in an emoji? Do you count the skin tone and gender modifiers separately? Again, the developer documentation didn't say anything.

Developer documentation is essential. And it needs to be kept up to date with the latest changes.
I'm friends with some of the folk on the Twitter Developer Team (with friends like this, right?!) and I shared my concerns with them.

Since writing this post, I've been working with them to improve the developer documentation. I've happy to say that the documentation now reflects reality.

The counting characters page now explains how to count CJK characters and complex emoji. Enjoy!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.