Counting Invisible Strings


When is a string not a string? When it's a series of control characters! Not a particularly funny riddle, but one I've been wrestling with recently.

Imagine we want to write a program which displays a Twitter user's name. Not their @ handle, but their "real" name.

For example, instead of @POTUS, display "President Obama". Easy, right? Not quite. What happens when a user is named "️"?

Normally, we'd just say

if ($name === null || $name === "") {
   // ...Do Stuff...
}

Ah! But that's not an empty string, it's ️ AKA %EF%B8%8F AKA variation selector-16.

Yup! Some clever wag has managed to set their Twitter name to a Unicode control character. Interesting and annoying!

That rather puts a spanner in the works. Something like <a href="https://twitter.com/example">&#xFE0F;</a> won't be clickable because it is not a displayable character. It is invisible.

So, how can we test to see if a Unicode string is invisible? I'm using PHP because, hey, that's what I'm using.

Can we count the characters?

print strlen(urldecode("%EF%B8%8F"));
3
print mb_strlen(urldecode("%EF%B8%8F"));
3

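Nope. Neither count is zero. (mb_strlen only says 3 when its internal encoding is a single-byte one; told the string is UTF-8, it says 1. Still not zero.)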
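To make the counting problem concrete, here's a minimal sketch that looks at the same three bytes a few different ways. The variable name $name is just for illustration, and it assumes the mbstring extension is loaded and PHP 7.2 or later (for mb_ord).

```php
<?php
// The name from the post: the three UTF-8 bytes of U+FE0F (variation selector-16).
$name = urldecode("%EF%B8%8F");

echo bin2hex($name), "\n";                 // efb88f - the raw bytes, as hex
echo strlen($name), "\n";                  // 3 - strlen always counts bytes
echo mb_strlen($name, '8bit'), "\n";       // 3 - counted as single bytes
echo mb_strlen($name, 'UTF-8'), "\n";      // 1 - counted as UTF-8 code points
echo dechex(mb_ord($name, 'UTF-8')), "\n"; // fe0f - the code point itself
```

Whichever way it is counted, the length is never zero, so a length check can't tell an invisible name from a visible one.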

PHP has a couple of built-in functions, ctype_print and ctype_graph - but they only tell us whether every character in the string is printable. No good for us, because the string may contain a mix of visible and invisible characters.
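A quick sketch of why the ctype functions fall short. The sample names are made up, and I've pinned the locale to "C" as an assumption so the results don't depend on the server's settings. The ctype functions work byte by byte, so any multi-byte UTF-8 character fails the test, visible or not.

```php
<?php
// Pin the locale so the ctype_* results do not depend on the server's settings.
setlocale(LC_CTYPE, 'C');

$invisible = "\xEF\xB8\x8F";        // U+FE0F on its own
$mixed     = "Bob" . $invisible;    // visible letters followed by U+FE0F
$accented  = "Jos\xC3\xA9";         // "José" in UTF-8, entirely visible

var_dump(ctype_graph("Bob"));       // bool(true)
var_dump(ctype_graph($invisible));  // bool(false)
var_dump(ctype_graph($mixed));      // bool(false)
var_dump(ctype_graph($accented));   // bool(false)
```

The invisible name, the mixed name, and a perfectly ordinary accented name all come back false, so the result can't separate "nothing to see" from "something to see".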

OK, can we use regex? That's what some people suggested to me - but it doesn't seem to deal with every edge case of non-printing characters.
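Here's a minimal sketch of the sort of regex check in question, using Unicode property classes. The function name hasVisibleCharacter is made up for illustration, and treating letters, numbers, punctuation and symbols as "visible" is my assumption, not an established rule. It also assumes PHP 7 or later and a PCRE build with Unicode property support.

```php
<?php
/**
 * Hypothetical helper: true if $name has at least one character that
 * Unicode classes as a letter, number, punctuation mark, or symbol.
 */
function hasVisibleCharacter(string $name): bool
{
    // The /u flag makes PCRE treat the pattern and subject as UTF-8.
    // preg_match() returns 1 on a match, 0 on none, and false on error
    // (for example, invalid UTF-8); only 1 counts as "visible" here.
    return preg_match('/[\p{L}\p{N}\p{P}\p{S}]/u', $name) === 1;
}

var_dump(hasVisibleCharacter("President Obama"));          // bool(true)
var_dump(hasVisibleCharacter(urldecode("%EF%B8%8F")));     // bool(false)
var_dump(hasVisibleCharacter("\xE2\x9D\xA4\xEF\xB8\x8F")); // bool(true) - U+2764 then U+FE0F
var_dump(hasVisibleCharacter(" "));                        // bool(false)
```

This copes with a lone variation selector, but some characters that Unicode classes as letters still render as blank, so it isn't a complete answer.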

Well, I'm stumped! If anyone knows of a good way to do this - please reveal yourself!

