Hashtag Standards
This is one of the longest and geekiest posts I've done. It's a work in progress. All comments and abuse welcome.
#hashtag – As long as there has been a way to search Tweets*, people have been adding information to make them easy to find. The #hashtag syntax has become the standard for attaching a succinct tag to Tweets.
That's all well and good, but as I discovered yesterday, without standardisation the ability to search falls apart.
I'm not talking about whether you should use the #LondonFire tag rather than #FireOfLondon or #LDNfire. Rather, how does a computer recognise what a valid tag is?
Why Does This Matter?
Search and tracking quickly break down if they are inconsistent. For example, if you are using #Romeo&Juliet to mark all your conversations about the play you are watching, different Twitter clients will link through to either #Romeo, #Romeo&, or #Romeo&Juliet, and each of those searches may return different conversations.
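To make that concrete, here is a minimal sketch of three clients disagreeing about the same Tweet. The three rules are my own hypothetical stand-ins, not the patterns any real client uses; they are only there to show how each outcome above could arise.

```python
import re

# Three HYPOTHETICAL client rules for what may follow the '#'.
# None of these is taken from a real Twitter client.
CLIENT_RULES = {
    "letters_only": re.compile(r"#[A-Za-z]+"),
    "letters_then_optional_ampersand": re.compile(r"#[A-Za-z]+&?"),
    "anything_but_whitespace": re.compile(r"#\S+"),
}

def extract_tags(text):
    """Return the first tag each hypothetical client would link."""
    results = {}
    for name, pattern in CLIENT_RULES.items():
        match = pattern.search(text)
        results[name] = match.group(0) if match else None
    return results

tweet = "Watching #Romeo&Juliet tonight"
for client, tag in extract_tags(tweet).items():
    print(client, "->", tag)
```

Same Tweet, three different search links, and so three different conversations.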
What's The Convention?
Twitter's website ought to be the definitive source of how hashtags work. This is their main site.
Yet, when we visit their mobile site - we get a completely different experience.
Application Confusion
Because there aren't any widely publicised definitions for what hashtags are, some applications have a significantly different attitude to hashtags.
Standardisation
To be fair, the Twitter team do have a standard. Even if they don't use it themselves.
They even have some limited test cases and libraries in Ruby and Java.
So, given that Twitter, their implementation and apps all disagree on what a hashtag is, let's try to work out what they should be.
Anatomy of a Tag
To begin at the beginning. A hashtag starts with a hash. #. Simple, no? No.
There are two different hash symbols! There's the # we all know and love, and there's ＃. Looks pretty similar, but in fact it's the Unicode symbol [U+FF03].
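A quick way to see that these really are two characters, and one possible way to cope, is to fold the wide one into the ordinary one before looking for tags. This is only a sketch; the function name normalise_hash is my own, and whether a parser should fold the two together is exactly the kind of decision a standard would need to make.

```python
import unicodedata

ASCII_HASH = "\u0023"      # the ordinary '#'
FULLWIDTH_HASH = "\uff03"  # the wide look-alike

def normalise_hash(text):
    """Replace the fullwidth hash with the ordinary one (hypothetical helper)."""
    return text.replace(FULLWIDTH_HASH, ASCII_HASH)

# The Unicode database gives the two symbols different names.
print(unicodedata.name(ASCII_HASH))
print(unicodedata.name(FULLWIDTH_HASH))

print(normalise_hash("\uff03tag") == "#tag")
```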
Actually, that's not the beginning. What comes before the # of the hashtag?
Consider the following examples - which should be hashtags?
- #tag - the # starts off the Tweet
- This is my tweet #test - the # comes after a space.
- This is it.#tag - the # is pushed against some punctuation, perhaps for reasons of space.
- Here we go-#LiftOff - the # is pushed against a -
- I've run out of space#OhNo - the # is pushed against some text
- &#160; - the # is part of an HTML entity
- text　#hashtag - the # comes after a "wide space" (U+3000)
- Should I use #tag/#hashtag? The # comes after a /
- Is this valid ##tag - there are two #s
So, we can see it's a little more complicated than we first thought.
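One possible rule, and I stress it is only my assumption rather than anything Twitter has published, is "a # only starts a tag at the very start of the text or straight after whitespace". Here is a sketch of that rule run against the examples above. The names TAG_START and find_tags are made up for illustration.

```python
import re

# HYPOTHETICAL rule: a '#' only starts a tag at the start of the text or
# straight after whitespace. For str patterns in Python 3, \S is
# Unicode-aware, so the wide space U+3000 counts as whitespace too.
TAG_START = re.compile(r"(?<!\S)#(\w+)")

EXAMPLES = [
    "#tag",
    "This is my tweet #test",
    "This is it.#tag",
    "Here we go-#LiftOff",
    "I've run out of space#OhNo",
    "&#160;",
    "text\u3000#hashtag",
    "Should I use #tag/#hashtag?",
    "Is this valid ##tag",
]

def find_tags(text):
    """Return the tag bodies this rule would accept."""
    return TAG_START.findall(text)

for example in EXAMPLES:
    print(repr(example), "->", find_tags(example))
```

Under this rule the punctuation-adjacent cases, the HTML entity and the double hash are all rejected, and only the first half of #tag/#hashtag is picked up. A different rule would draw the line somewhere else, which is the whole problem.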
The End
Let's skip over what's in a hashtag and ask "how do we know that a tag has finished?"
Consider the following examples -
- New album #OMG! - should the ! be part of the hashtag?
- #BreakingNews: dog bites man - should the : be part of the hashtag?
- (is this a #tag) - should the ) be part of the hashtag?
- I like #tags#
We probably don't want to have any punctuation at the end of our tag. Can you think of any counter examples?
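If we did decide that a tag never ends in punctuation, one way to sketch it is to trim any trailing character that Unicode classifies as punctuation. The function name is my own invention, and the rule is an assumption for illustration, not a published standard.

```python
import unicodedata

def strip_trailing_punctuation(token):
    """Trim punctuation from the end of a '#token' (hypothetical rule).

    Unicode general categories beginning with 'P' cover '!', ':', ')'
    and also '#' itself, so the leading hash is always left in place.
    """
    end = len(token)
    while end > 1 and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[:end]

for token in ["#OMG!", "#BreakingNews:", "#tag)", "#tags#"]:
    print(token, "->", strip_trailing_punctuation(token))
```

Note that this only handles the end of the tag. Punctuation in the middle is a separate question.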
Yummy Filling
Our language is more than just the letters A-Z. We've got punctuation, numbers, symbols and all manner of other glyphs. Which of them count as part of a hashtag?
Take a look at these examples
- Vote Bush! #Don't
- My dog died #:-(
- Einstein #e=mc^2
- I'm on bus #123
- I'm giving #110%
Using Twitter's standards, none of the above render as complete tags.
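To show how a conservative rule chops these up, here is a sketch using a rule I have assumed for illustration: a tag is a # followed by letters, digits or underscores, and a tag made only of digits is not linked. It is not a copy of Twitter's own pattern, and linked_part is a made-up name.

```python
import re

# HYPOTHETICAL rule: '#' followed by word characters only.
TAG = re.compile(r"#(\w+)")

def linked_part(text):
    """Return the part of the first tag that would be linked, or None."""
    match = TAG.search(text)
    # Assumption: an all-digit body is not treated as a tag.
    if match is None or match.group(1).isdigit():
        return None
    return match.group(0)

EXAMPLES = [
    "Vote Bush! #Don't",
    "My dog died #:-(",
    "Einstein #e=mc^2",
    "I'm on bus #123",
    "I'm giving #110%",
]

for example in EXAMPLES:
    print(repr(example), "->", linked_part(example))
```

With this rule #Don't is cut to #Don, #e=mc^2 is cut to #e, and the other three are not linked at all.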
Foreign Languages
We've mentioned accents above. As we can see in the first example, "funny" characters can cause problems. Broadly speaking, there are three issues.
- Accents. Should the é on #Café be linked?
- Accents. Is #Romeo the same as #Ŕöméø?
- Japanese, and some other languages, don't use spaces. Is #tagの valid? What about # 会議中 ?
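For the second accent question, one approach a search engine could take is to decompose each character, throw away the combining accents, and compare what is left. This is a sketch under that assumption, with fold_accents as a made-up name. It also shows why the approach is not a complete answer: ø is a letter in its own right with no decomposition, so it survives the folding.

```python
import unicodedata

def fold_accents(tag):
    """Strip combining accents and ignore case (hypothetical comparison key)."""
    decomposed = unicodedata.normalize("NFD", tag)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(fold_accents("#Café") == fold_accents("#cafe"))
print(fold_accents("#Ŕöméø"))
print(fold_accents("#Ŕöméø") == fold_accents("#Romeo"))
```

So #Café and #cafe can be made to match, but #Ŕöméø and #Romeo still differ unless someone adds a special case for ø, and then for every other letter like it.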
Exhausted
These are a fraction of the possible problems. It's exhausting trying to find all the possible textual combinations and permutations which could lead into a hashtag. No wonder there is confusion!
Search is a complex, profitable, and useful business. It's of vital importance that there is a legitimate, comprehensive standard which all sites and applications can follow.
warpcafe says:
I somehow have the impression that in the early days of Twitter, no one cared. Later, the people behind the hashtag idea seem to have underestimated the complexity of the user requirements. But what would a solution for the current "experience" boil down to? Either a better syntax for "extended" hashtags or better "server-side parsing" of the text sent. Having the latter would again mean hiding control over what is possible from the user... much like it is now. If I were to take the decision, I would opt for the first way: Besides the #tag (which would be supported for, errr... compatibility), there could be an "extended" syntax that goes #[tag], with everything in between the square brackets being the text to "hash". Yes I know -sigh- it will waste another 2 characters from the precious 140 budget, but think of the benefits. (BTW, there's always a way to save 2 chars from a tweet, right? If you can't fit your tweet into 140 characters, you probably should think about rephrasing it anyway...) The funny (or sad) thing about the above idea is that it didn't take months of investigation and reading technical specs to come up with it: It's simply borrowed from Wikis... something that was around before there was Twitter IIRC.
jrü says:
I thought you were joking when you were talking about how this post is nerdy, and then I read the date of your post and then verified you still had a blog. In conclusion, this is still bothering me in 2018, so I’m really glad I’m not the only one. Not as a coder, but as a bilingual individual pushing for bilingual social media in New Mexico, I’d say this is way more important than it seems. We particularly are trying to reach Spanish speakers, those people who politically can sometimes seem elbowed out of the equation, but we are a clinic that really needs to reach them. So, then, I figured the rest of the Spanish speaking world would have done something about this, yet, alas, they can’t seem to unify since Spanish is so idiomatic and full of dialectal discrepancies. So, AT LEAST, it would be cool to think that using accents in common words would be able to happen. And yet, I’m #posting #pósting, double posting tags, and I feel stupid. Did you ever get around to finding more info on this or did we decide #MeToo is the only relevant hashtag for the next five years?