This is one of the longest and geekiest posts I've done. It's a work in progress. All comments and abuse welcome.
#hashtag – As long as there has been a way to search Tweets*, people have been adding information to make them easy to find. The #hashtag syntax has become the standard for attaching a succinct tag to Tweets.
That's all well and good, but as I discovered yesterday, without standardisation the ability to search falls apart.
I'm not talking about whether you should use the #LondonFire tag rather than #FireOfLondon or #LDNfire. Rather, how does a computer recognise what a valid tag is?
Why Does This Matter?
Search and tracking quickly break down if they are inconsistent.
For example, if you are using #Romeo&Juliet to mark all your conversations about the play you are watching, different Twitter clients will link through to either #Romeo, #Romeo&, or #Romeo&Juliet, and each of those searches potentially returns different conversations.
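To make that concrete, here's a rough sketch in Python. The three rules are ones I've made up to stand in for three imaginary clients; they are not taken from any real app, but they show how small differences in a pattern give three different links from the same Tweet.

```python
import re

# Three hypothetical client behaviours; none is taken from a real app.
CLIENT_RULES = {
    "letters only": re.compile(r"#[A-Za-z]+"),
    "letters plus one trailing &": re.compile(r"#[A-Za-z]+&?"),
    "letters and & anywhere": re.compile(r"#[A-Za-z&]+"),
}


def linked_tags(text):
    """Return what each hypothetical client would turn into a link."""
    return {name: rule.findall(text) for name, rule in CLIENT_RULES.items()}


results = linked_tags("Off to see #Romeo&Juliet tonight")
for name, tags in results.items():
    print(name, tags)
```

Each rule is perfectly reasonable on its own. The trouble only appears when you compare them.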
What's The Convention?
Twitter's website ought to be the definitive source of how hashtags work. This is their main site.
Yet when we visit their mobile site, we get a completely different experience.
Because there aren't any widely publicised definitions for what hashtags are, some applications have a significantly different attitude to hashtags.
To be fair, the Twitter team do have a standard. Even if they don't use it themselves.
They even have some limited test cases and libraries in Ruby and Java.
So, given that Twitter, their implementation and apps all disagree on what a hashtag is, let's try to work out what they should be.
Anatomy of a Tag
To begin at the beginning. A hashtag starts with a hash. #. Simple, no? No.
There are two different hash symbols! There's the # we all know and love, and there's ＃. It looks pretty similar, but it is in fact a separate Unicode character: U+FF03, the fullwidth number sign.
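Here's a minimal sketch of two ways a parser could cope with both symbols. The rule "a hash sign followed by word characters" is my own simplifying assumption, and `find_tags` and `fold_hash` are names I've invented for illustration.

```python
import re
import unicodedata

HASH_SIGNS = "#\uFF03"  # ASCII number sign and fullwidth number sign

# Assumed rule for this sketch: either sign followed by word characters.
TAG = re.compile("[" + HASH_SIGNS + r"](\w+)")


def find_tags(text):
    """Option 1: accept both hash signs in the pattern itself."""
    return TAG.findall(text)


def fold_hash(text):
    """Option 2: normalise first. NFKC maps the fullwidth sign onto the
    ASCII one, but it also folds other compatibility characters."""
    return unicodedata.normalize("NFKC", text)


print(find_tags("#tag and \uFF03tag"))
print(fold_hash("\uFF03tag"))
```

Normalising first is simpler, but it changes the rest of the text as well, which may or may not be what you want.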
Actually, that's not the beginning. What comes before the # of the hashtag?
Consider the following examples - which should be hashtags?
- #tag - the # starts off the Tweet
- This is my tweet #test - the # comes after a space.
- This is it.#tag - the # is pushed against some punctuation, perhaps for reasons of space.
- Here we go-#LiftOff - the # is pushed against a -
- I've run out of space#OhNo - the # is pushed against some text
- &#nbsp; - the # is part of an HTML entity
- text #hashtag - the # comes after a "wide space" (U+3000)
- Should I use #tag/#hashtag? - the second # comes after a /
- Is this valid ##tag - there are two #s
So, we can see it's a little more complicated than we first thought.
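One way to explore the question is to pick a rule and run the examples through it. The rule below is purely my assumption for the sake of the experiment: a hash starts a tag unless the character before it is a word character, an ampersand, or another hash. Change the lookbehind and you get different answers.

```python
import re

# Assumed rule: a hash sign starts a tag unless the previous character is a
# word character, an ampersand, or another hash sign.
TAG = re.compile(r"(?<![\w&#\uFF03])[#\uFF03](\w+)")

EXAMPLES = [
    "#tag",
    "This is my tweet #test",
    "This is it.#tag",
    "Here we go-#LiftOff",
    "I've run out of space#OhNo",
    "&#nbsp;",
    "text\u3000#hashtag",  # U+3000 is the "wide space"
    "Should I use #tag/#hashtag?",
    "Is this valid ##tag",
]


def find_tags(text):
    return TAG.findall(text)


for example in EXAMPLES:
    print(repr(example), find_tags(example))
```

Under this particular rule the tag pushed against text, the HTML entity, and the double hash are all rejected, while the tags after punctuation, a slash, or a wide space are accepted. Whether that is the right answer is exactly the question.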
Let's skip over what's in a hashtag and ask "how do we know that a tag has finished?"
Consider the following examples -
- New album #OMG! - should the ! be part of the hashtag?
- #BreakingNews: dog bites man - should the : be part of the hashtag?
- (is this a #tag) - should the ) be part of the hashtag?
- I like #tags# - should the trailing # be part of the hashtag?
We probably don't want to have any punctuation at the end of our tag. Can you think of any counter examples?
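As a sketch of the "no punctuation at the end" idea, the snippet below grabs everything up to the next whitespace and then trims trailing ASCII punctuation. Both steps are assumptions of mine, not anybody's published rule, and `find_tags` is a made-up name.

```python
import re
import string

# Assumed approach: take everything up to the next whitespace, then trim
# trailing ASCII punctuation from the candidate.
CANDIDATE = re.compile(r"#(\S+)")


def find_tags(text):
    tags = []
    for raw in CANDIDATE.findall(text):
        tag = raw.rstrip(string.punctuation)
        if tag:  # drop candidates that were nothing but punctuation
            tags.append(tag)
    return tags


for example in ["New album #OMG!", "#BreakingNews: dog bites man",
                "(is this a #tag)", "I like #tags#", "Learning #C++"]:
    print(repr(example), find_tags(example))
```

The last example is one counter example: trimming turns #C++ into plain C, which is a different topic altogether.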
Our language is more than just the letters A-Z. We've got punctuation, numbers, symbols and all manner of other glyphs. Which of them count as part of a hashtag?
Take a look at these examples:
- Vote Bush! #Don't
- My dog died #:-(
- Einstein #e=mc^2
- I'm on bus #123
- I'm giving #110%
Using Twitter's standards, none of the above render as complete tags.
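You can get a feel for why with a simple experiment. The rule here is an assumption of mine rather than Twitter's actual pattern: a tag is a run of word characters after the #, and a run made up only of digits doesn't count.

```python
import re

# Assumed rule, not Twitter's real pattern: word characters after the hash,
# and a tag made only of digits is rejected.
TAG = re.compile(r"#(\w+)")


def find_tags(text):
    return [tag for tag in TAG.findall(text) if not tag.isdigit()]


# Each Tweet paired with the tag its author presumably intended.
EXAMPLES = {
    "Vote Bush! #Don't": "Don't",
    "My dog died #:-(": ":-(",
    "Einstein #e=mc^2": "e=mc^2",
    "I'm on bus #123": "123",
    "I'm giving #110%": "110%",
}

for text, intended in EXAMPLES.items():
    found = find_tags(text)
    print(repr(text), found, "complete" if found == [intended] else "incomplete")
```

Every example comes out as incomplete under this rule: the tag is either cut short at the first symbol or dropped entirely.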
We've mentioned accents above. As we can see in the first example, "funny" characters can cause problems. Broadly speaking, there are three issues.
- Accents. Should the é on #Café be linked?
- Accents. Is #Romeo the same as #Ŕöméø?
- Japanese, and some other languages, don't use spaces. Is #tagの valid? What about # 会議中 ?
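A quick sketch of how these three issues look to a program. It assumes Python 3's Unicode-aware `\w`, and the helper names `strip_accents` and `same_tag` are my own inventions. The accented text is written with escapes so that it's clear which code points are meant.

```python
import re
import unicodedata

UNICODE_TAG = re.compile(r"#(\w+)")          # \w matches letters in any script
ASCII_TAG = re.compile(r"#(\w+)", re.ASCII)  # \w limited to A-Z, a-z, 0-9, _


def strip_accents(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_tag(a, b):
    """Assumed notion of equality: ignore accents and case."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


cafe = "#Caf\u00e9"                         # precomposed e-acute
fancy_romeo = "\u0154\u00f6m\u00e9\u00f8"   # the accented Romeo from the list

print(UNICODE_TAG.findall(cafe), ASCII_TAG.findall(cafe))
print(strip_accents(fancy_romeo), same_tag("Romeo", fancy_romeo))
print(UNICODE_TAG.findall("#tagの"), UNICODE_TAG.findall("# 会議中"))
```

An ASCII-only pattern chops the é off #Café. Stripping accents still doesn't make the two Romeos equal, because ø is a letter in its own right rather than an o with an accent. And the space after the # in the last example means nothing is matched at all.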
These are a fraction of the possible problems. It's exhausting trying to find all the possible textual combinations and permutations which could lead into a hashtag. No wonder there is confusion!
Search is a complex, profitable, and useful business. It's of vital importance that there is a legitimate, comprehensive standard which all sites and applications can follow.