Counting Invisible Strings
When is a string not a string? When it's a series of control characters! Not a particularly funny riddle, but one I've been wrestling with recently.
Imagine we want to write a program which displays a Twitter user's name. Not their @ handle, but their "real" name.
For example, instead of @POTUS, display "President Obama". Easy, right? Not quite. What happens when a user is named "️"?
Normally, we'd just say
if (null == $name) { ...Do Stuff... }
Ah! But that's not an empty string, it's ️
AKA %EF%B8%8F
AKA variation selector-16.
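To see for yourself what's hiding in a name, it can help to list its code points. This is only a rough sketch: code_points is a made-up helper name, and it assumes the mbstring extension and PHP 7.4 or later (for mb_str_split).

```php
<?php
// Hypothetical helper (not from any library): list the code points
// in a UTF-8 string as "U+XXXX" labels.
function code_points(string $utf8): array {
    $labels = [];
    // Split into individual characters, treating the input as UTF-8.
    foreach (mb_str_split($utf8, 1, "UTF-8") as $char) {
        $labels[] = sprintf("U+%04X", mb_ord($char, "UTF-8"));
    }
    return $labels;
}

$name = urldecode("%EF%B8%8F");
print implode(" ", code_points($name)); // U+FE0F
```

U+FE0F is the code point that Unicode names VARIATION SELECTOR-16.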
Yup! Some clever wag has managed to set their Twitter name to a Unicode variation selector - a character with no visible form of its own. Interesting and annoying!
That rather puts a spanner in the works. Something like <a href="https://twitter.com/example">️</a>
won't be clickable because it is not a displayable character. It is invisible.
So, how can we test to see if a Unicode string is invisible? I'm using PHP because, hey, that's what I'm using.
Can we count the characters?
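print strlen(urldecode("%EF%B8%8F")); // 3 - strlen counts bytes
print mb_strlen(urldecode("%EF%B8%8F"), "UTF-8"); // 1 - one code point (without the explicit encoding, older setups may report 3)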
Nope.
PHP has some built-in functions, ctype_print and ctype_graph - but they only tell us whether every character in the string is printable, and they look at single bytes rather than Unicode characters. No good for us, because the string may contain visible and invisible characters.
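Here's a small sketch of that limitation. The sample names are made up, and the setlocale call is my assumption to keep the results predictable, since the ctype functions depend on the current locale.

```php
<?php
// Pin the character-type locale so the byte-based checks behave predictably.
setlocale(LC_CTYPE, "C");

// Made-up sample names for illustration.
$samples = [
    "President Obama",              // plain ASCII with a space
    "Zo\u{EB}",                     // visible, but contains a non-ASCII letter
    urldecode("%EF%B8%8F"),         // only variation selector-16
    "Bob" . urldecode("%EF%B8%8F"), // visible text plus an invisible character
];

foreach ($samples as $s) {
    // rawurlencode makes the invisible bytes show up in the output.
    printf(
        "%-25s print=%s graph=%s\n",
        rawurlencode($s),
        var_export(ctype_print($s), true),
        var_export(ctype_graph($s), true)
    );
}
```

In the C locale, only the plain ASCII name should pass ctype_print. Every other sample fails both checks, including the perfectly visible ones, because each contains bytes outside the ASCII range.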
Ok, can we use regex? That's what some people suggested to me - but it doesn't seem to deal with the edge-case of non-printing characters.
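For what it's worth, here is a rough sketch of what a Unicode-aware version of the regex idea might look like, using PCRE's /u mode and Unicode property classes. The function name is_invisible is made up, and the list of "invisible" characters is my own assumption: control characters, format characters, separators and the two variation selector ranges. It is certainly not exhaustive, and it can't know what a given font will actually draw.

```php
<?php
// Hypothetical helper: guess whether a UTF-8 string has no visible characters.
// It strips characters assumed to be invisible and checks whether anything is left.
function is_invisible(string $name): bool {
    $pattern = '/['
        . '\p{Cc}'              // control characters (tab, newline, ...)
        . '\p{Cf}'              // format characters (zero-width joiner, ...)
        . '\p{Z}'               // space, line and paragraph separators
        . '\x{FE00}-\x{FE0F}'   // variation selectors 1-16
        . '\x{E0100}-\x{E01EF}' // variation selectors 17-256
        . ']+/u';

    $stripped = preg_replace($pattern, '', $name);

    // preg_replace returns null on error, e.g. when the input is not valid UTF-8.
    // This sketch treats that case as "not invisible" rather than guessing.
    if ($stripped === null) {
        return false;
    }

    return $stripped === '';
}

var_dump(is_invisible(urldecode("%EF%B8%8F")));         // bool(true)
var_dump(is_invisible("President Obama"));              // bool(false)
var_dump(is_invisible("Bob" . urldecode("%EF%B8%8F"))); // bool(false)
```

Note that an empty string also counts as invisible here, so this could stand in for the plain empty check as well. There are bound to be odd characters it misses.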
Well, I'm stumped! If anyone knows of a good way to do this - please reveal yourself!