Counting Invisible Strings


When is a string not a string? When it's a series of control characters! Not a particularly funny riddle, but one I've been wrestling with recently.

Imagine we want to write a program which displays a Twitter user's name. Not their @ handle, but their "real" name.

For example, instead of @POTUS, display "President Obama". Easy, right? Not quite. What happens when a user is named "️"?

Normally, we'd just say

if (null == $name) {
   // ...Do Stuff...
}

Ah! But that's not an empty string, it's ️ AKA %EF%B8%8F AKA variation selector-16.

Yup! Some clever wag has managed to set their Twitter name to a Unicode control character. Interesting and annoying!

That rather puts a spanner in the works. Something like
<a href="https://twitter.com/example">&#xFE0F;</a> won't be clickable because the link text is not a displayable character. It is invisible, so there is nothing on screen to click.

So, how can we test to see if a Unicode string is invisible? I'm using PHP because, hey, that's what I'm using.

Can we count the characters?

print strlen(urldecode("%EF%B8%8F"));
3
print mb_strlen(urldecode("%EF%B8%8F"));
3

Nope. Both functions report a non-zero length, so neither tells us that the name is invisible.
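As a side note on those numbers, here is a minimal sketch of the difference between counting bytes and counting code points. It assumes the mbstring extension is loaded and passes "UTF-8" explicitly, which the calls above do not; the variable names are only for illustration. The 3 reported by mb_strlen above was presumably because the internal encoding on that setup was not UTF-8.

```php
<?php
// U+FE0F (VARIATION SELECTOR-16) as raw UTF-8 bytes.
$name = urldecode("%EF%B8%8F");

// strlen() always counts bytes.
$bytes = strlen($name);

// mb_strlen() counts code points when the encoding is given explicitly.
$codePoints = mb_strlen($name, "UTF-8");

echo $bytes, "\n";       // 3
echo $codePoints, "\n";  // 1
```

Either way the count is above zero, so length alone cannot separate an invisible name from a visible one.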

PHP has some built-in functions, ctype_print and ctype_graph, but they return true only if every character in the string is printable. No good for us, because a name may contain both visible and invisible characters. They also work on single bytes according to the current locale, not on Unicode characters.
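To make that concrete, here is a small sketch of ctype_print applied to three names. It assumes the ctype extension is enabled and pins the locale to "C" so the result does not depend on the server; the sample names other than "President Obama" are made up for illustration.

```php
<?php
// Pin the locale so the result does not depend on the server's environment.
setlocale(LC_CTYPE, "C");

$invisible = urldecode("%EF%B8%8F");  // U+FE0F on its own
$ascii     = "President Obama";
$accented  = "Zo\u{00EB}";            // "Zoë", a perfectly visible name

$results = [
    "invisible" => ctype_print($invisible),  // false
    "ascii"     => ctype_print($ascii),      // true
    "accented"  => ctype_print($accented),   // false
];

var_dump($results);
```

The invisible name and the accented name both come back false, so ctype_print cannot tell them apart.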

Ok, can we use regex? That's what some people suggested to me - but it doesn't seem to deal with the edge-case of non-printing characters.
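For what it's worth, one regex-based idea can be sketched as follows: strip everything that normally renders as nothing and see whether anything is left. This is only a sketch, not a complete answer. It assumes PHP 7 or later with PCRE built with Unicode support, and looks_invisible is a hypothetical helper name, not a built-in function.

```php
<?php
/**
 * Hypothetical helper: true when nothing is left after stripping
 * characters that normally render as nothing.
 */
function looks_invisible(string $name): bool
{
    // \p{C}  control, format, unassigned, private-use and surrogate characters
    // \p{Z}  spaces and line/paragraph separators
    // U+FE00-U+FE0F and U+E0100-U+E01EF are the variation selectors
    $pattern = '/[\p{C}\p{Z}\x{FE00}-\x{FE0F}\x{E0100}-\x{E01EF}]+/u';
    $stripped = preg_replace($pattern, '', $name);

    // preg_replace() returns null on error, e.g. when $name is not valid UTF-8.
    if ($stripped === null) {
        return false;
    }

    return $stripped === '';
}

var_dump(looks_invisible(urldecode("%EF%B8%8F")));  // the lone variation selector
var_dump(looks_invisible("President Obama"));
```

The character list is an assumption and certainly not exhaustive. For example, U+3164 HANGUL FILLER is classified as a letter and would slip through, so treat the result as a hint rather than proof.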

Well, I'm stumped! If anyone knows of a good way to do this - please reveal yourself!
