<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	     xmlns:dc="http://purl.org/dc/elements/1.1/"
	   xmlns:atom="http://www.w3.org/2005/Atom"
	     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>urls &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/urls/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Wed, 25 Feb 2026 08:56:43 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>urls &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[Why your blog URLs should contain dates.]]></title>
		<link>https://shkspr.mobi/blog/2015/02/why-your-blog-urls-should-contain-dates/</link>
					<comments>https://shkspr.mobi/blog/2015/02/why-your-blog-urls-should-contain-dates/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 25 Feb 2015 14:07:35 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[urls]]></category>
		<category><![CDATA[web]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=20635</guid>

					<description><![CDATA[I have a (very minor and polite) disagreement with Matt Gemmell&#039;s argument against dates in URLs.  Before I start, let me be very clear; your blog = your rules.  If you want to write your URLs as a series of Emoji or in Klingon - go right ahead.  There really is no such thing as &#34;best practice&#34; - only personal preference and observed behaviour.  That said...  Here&#039;s my case for keeping dates in…]]></description>
										<content:encoded><![CDATA[<p>I have a (very minor and polite) disagreement with <a href="http://mattgemmell.com/permalinks/" title="Gemmell with two Ls...">Matt Gemmell's argument against dates in URLs</a>.</p>

<p>Before I start, let me be very clear; your blog = your rules.  If you want to write your URLs as a series of Emoji or in Klingon - go right ahead.  There really is no such thing as "best practice" - only personal preference and observed behaviour.</p>

<p>That said...</p>

<p>Here's my case for <em>keeping</em> dates in URLs.</p>

<p>URLs are designed to provide information to humans <strong>and</strong> computers.  That's why we don't just use IP addresses and binary representation of paths.</p>

<ul>
    <li>The "scheme" (http, https, ftp, etc) tells me whether my connection is secure, and what program is likely to try to open the link.</li>
    <li>The domain "example.com" tells me the destination.  I can decide whether I consider it to be trustworthy or not.</li>
    <li>The path "/2015/blog-urls-explained.pdf" gives me further semantic information about the destination.  Is it recent information? What's the page about? Is it a web page or a file?</li>
</ul>

<p>All of these feed into my decision about whether to visit the link.</p>
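
<p>To make that concrete, here is a rough sketch using Python's standard <code>urllib.parse</code> module.  The URL is a made-up one, assembled from the examples in the list above, and splitting the path on slashes is just one assumed way of pulling out the year and the file type.</p>

<pre>
```python
from urllib.parse import urlsplit

# Hypothetical URL, assembled from the examples in the list above.
url = "https://example.com/2015/blog-urls-explained.pdf"

parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.path)      # /2015/blog-urls-explained.pdf

# The path alone hints at the year and the file type.
segments = parts.path.strip("/").split("/")
year = segments[0]
extension = segments[-1].rsplit(".", 1)[-1]
print(year, extension)  # 2015 pdf
```
</pre>

<p>The same three pieces a human skims - scheme, domain, path - are what the computer gets too.</p>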

<p>Matt's <a href="http://mattgemmell.com/permalinks/">arguments</a> are mostly aesthetic.</p>

<blockquote><p>They’re visually ugly. Strings of numbers aren’t nice to look at. They look like they’re made for machines.</p></blockquote>

<p>Well, that's a matter of opinion.  I can't find any evidence that people are somehow offended or alienated by numbers.  The semantic information is useful for humans - people can quickly see that a post about the iPhone is from 2011 and is probably obsolete.</p>

<blockquote><p>They’re unnecessarily lengthy. They’re exactly eleven characters too long, in fact.</p></blockquote>

<p>Personally, I prefer Year/Month format - but I rarely write each day.  For a site which has to publish multiple times per day, it may make sense to give the reader some pre-warning as to how fresh the content is.</p>

<p>Why is brevity a virtue?  What is the perfect length?  What are the consequences of being unnecessarily verbose?  The only explanation is the next point:</p>

<blockquote><p>They push the post’s title off to the right, maybe partially obscuring it in the address bar of the visitor’s browser (or their bookmarks menu, or history list).</p></blockquote>

<p>That's stretching it a bit!  If you're truly worried about the address being obscured, get a shorter domain!
<img src="https://shkspr.mobi/blog/wp-content/uploads/2015/02/Gemel-url-long-fs8.png" alt="Gemel-url-long-fs8" width="400" height="445" class="aligncenter size-full wp-image-20636">
The argument here is that the last few characters of a URL have much greater semantic importance than the date of publication.  I can't agree.  In the future we might be browsing on augmented reality goggles which give us a 360° field of view.  Complaining that small phones might be <em>slightly</em> disadvantaged seems like the sort of "pixel perfect" design the web is meant to eschew.</p>

<p>Worrying about the length of the URL just leads you to waste time crafting a URL which is <em>exactly</em> the right length for one particular device.  Or should you worry about how the URL will display on a Smart Watch?</p>

<blockquote><p>The page itself has the date of the post on it anyway. In the few cases where it doesn’t, that’s a deliberate design choice, and you’re not meant to be focusing on it.</p></blockquote>

<p>I agree, your posts should probably have a date on them.  But they also have a title, so why not remove that from the URL as well?</p>

<p>If your posts really are designed to stand the test of time and remain an immutable opinion - you may be confusing yourself with an infallible deity.  For most bloggers, posts are a product of their time.  Users probably do care that your opinions have evolved.  If I write about the War in Afghanistan, I want it to be fairly obvious <em>which</em> of the many wars I am talking about.  I think the URL helps - as does the date on the page.</p>

<blockquote><p>In most cases, you don’t care about the date. Right now, a tiny subset of humans (technical people, who think of code examples or software tutorials when they read the phrase “blog post”) are going to argue that the date does matter. They are wrong. Any article with time-sensitive information will either mention its vintage explicitly, or is by definition poorly constructed.</p></blockquote>

<p>Hurrah! I'm in a subset of geeks!  Luckily, I am right.</p>

<p>Humans are notoriously bad at thinking in advance.  Most of us have neither the time nor the inclination to make our text adhere to the Platonic Ideal of a blog post.  Should I mention that my review of a book is of the 2013 edition? Probably.  But if I forget, or simply don't consider the consequences, there's a handy guide for the next human in the shape of the URL data.</p>

<p>Dates in URLs help save us from our human failings.</p>

<h2 id="next-steps"><a href="https://shkspr.mobi/blog/2015/02/why-your-blog-urls-should-contain-dates/#next-steps">Next Steps</a></h2>

<p>Taken to its logical conclusion, Matt's idea (human-readable semantic information is ugly and redundant) ends with the <a href="https://web.archive.org/web/20121009232455/https://tommorris.org/posts/2451">proposal put forward by my friend Tom Morris</a> whose post URLs are in the format "/post/1234".</p>

<blockquote><p>not every post can be adequately summarised with a bunch of ASCII characters with hyphens between them. What about just a photo post, without a title? Even my “formatted titles” are a bit of a bad hack I might turn off.</p>

<p>Other than SEO, there’s no particularly good reason why people should prefer a long URL with the title in than one that just has a unique identifier. The title is also not immutable on posts. That is, I can change a title, or even remove a title, after publication. Should I change the URLs? Well, no. URLs, once announced, should stay the same. Okay then, I’ll have URLs that are inaccurate.</p>

<p><cite><a href="https://web.archive.org/web/20121009232455/https://tommorris.org/posts/2451">Just say no to URL stubs</a></cite></p></blockquote>

<p>A plague on <strong>both</strong> your houses!</p>

<p>Humans are fragile and fallible.  We need every bit of help we can get to navigate our way through life.  A fashionable choice now can have unintended consequences in the future.  We should design all aspects of our sites - from URL to content - to withstand changes in devices, formats, browsers, and people.</p>

<p>Would a URL with a date in have helped prevent <a href="https://www.thedrum.com/news/2015/02/16/twitter-confusion-sees-morph-creator-tony-hart-mourned-second-time">confusion over Tony Hart's date of death</a>?</p>

<p>Probably not.  We're all just monkeys letting squiggles of black and white <a href="http://www.goodreads.com/quotes/37611-reading-functions-as-hallucinating-a-meaning-between-letters-and-lines">emotionally manipulate our brains</a>.  Or, to put it crudely:</p>

<blockquote class="social-embed" id="social-embed-542348626711019520" lang="en" itemscope="" itemtype="https://schema.org/SocialMediaPosting"><header class="social-embed-header" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a href="https://twitter.com/KatieOldham" class="social-embed-user" itemprop="url"><img class="social-embed-avatar social-embed-avatar-circle" src="data:image/webp;base64,UklGRh4CAABXRUJQVlA4IBICAADwCQCdASowADAAPrVQoUwnJKMiI4344BaJZAC7M1TSDjdPc5xt2cYNedXoeVcXqXQSOSUYyIq1vVNT3ZjPxEayNqkga2tt3RVnWGaF2fF7P6USXZ6IRmzoAP752vHjUHPcEd+GVT37hCxZAjt/ye3MIylXbIZSseD4+b8bGJYvqgHR7dPI9gioQDjVl5iFU6bsOKz7M9vH5PxMH/AjY4WYsoIDvyySyHJNAp+4XR33trz6vbE/AKZCgWSyJZdAcQ/XBgX/CrV1LvWY7EMihmjLgKMj94ANzfDvYe4SjykXWn/9LD/ybf/sJnzs8A90PZM8HNMwMUYWTLQd5mC4l6tf3lWxyhE10u1KAL3SyPv3iijdeRJR2ng3UcQbQiuHhKNIdSD/8cIrePYy0UIwvEbzLaSm1xVL5H5xPcqPHzT5+9ZH3IPzaVV4X+ILjRIxKX7eKkXgRyEqSt1kXgsRb+b/4Ikf+/iGKApXGd/8OdIzgAcq9jstJg0FEUvaV/IbL0BDW66RlY9X/+BC1GW3eiFGnUO71pnNu+dsVrhfZFsCj8X76h+XuNuyNoaX4wa9QCSXRCtjp3aJ7kbJLnHWY21rNe3hqIGEVchsnzXRq2L9vuHITrxWa1vLHhXDJk01cj9Cos5sOT1fpd5w+tWKm68IX3T/bP9HSShNxj8ow6dNbJyjcgWtG3hZY2gvCDj1dvXgAA==" alt="" itemprop="image"><div class="social-embed-user-names"><p class="social-embed-user-names-name" itemprop="name">Katie</p>@KatieOldham</div></a><img class="social-embed-logo" alt="Twitter" src="data:image/svg+xml,%3Csvg%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%0Aaria-label%3D%22Twitter%22%20role%3D%22img%22%0AviewBox%3D%220%200%20512%20512%22%3E%3Cpath%0Ad%3D%22m0%200H512V512H0%22%0Afill%3D%22%23fff%22%2F%3E%3Cpath%20fill%3D%22%231d9bf0%22%20d%3D%22m458%20140q-23%2010-45%2012%2025-15%2034-43-24%2014-50%2019a79%2079%200%2000-135%2072q-101-7-163-83a80%2080%200%200024%20106q-17%200-36-10s-3%2062%2064%2079q-19%205-36%201s15%2053%2074%2055q-50%2040-117%2033a224%20224%200%2000346-200q23-16%2040-41%22%2F%3E%3C%2Fsvg%3E"></header><section class="social-embed-text" itemprop="articleBody">Ever realised how fucking surreal 
reading a book actually is? You stare at marked slices of tree for hours on end, hallucinating vividly</section><hr class="social-embed-hr"><footer class="social-embed-footer"><a href="https://twitter.com/KatieOldham/status/542348626711019520"><span aria-label="21331 likes" class="social-embed-meta">❤️ 21,331</span><span aria-label="315 replies" class="social-embed-meta">💬 315</span><span aria-label="0 reposts" class="social-embed-meta">🔁 0</span><time datetime="2014-12-09T16:02:43.000Z" itemprop="datePublished">16:02 - Tue 09 December 2014</time></a></footer></blockquote>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2015/02/why-your-blog-urls-should-contain-dates/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[QRpedia - Custom URLs]]></title>
		<link>https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/</link>
					<comments>https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Sat, 26 Nov 2011 09:02:24 +0000</pubDate>
				<category><![CDATA[qrpedia]]></category>
		<category><![CDATA[custom]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[urls]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=4847</guid>

					<description><![CDATA[This blog post is designed to foster a technical and logistical discussion, in much the same way as the earlier QRpedia language discussion did.  One of the most requested features in QRpedia is to have custom URLs.  For example, the British Museum may want a URL of &#34;bm.qrwp.org&#34;.  This has two main advantages.       Better analytics. Although the British Museum is the only place likely to have…]]></description>
										<content:encoded><![CDATA[<p>This blog post is designed to foster a technical and logistical discussion, in much the same way as <a href="https://shkspr.mobi/blog/2011/10/qrpedia-dealing-with-minority-languages/">the earlier QRpedia language discussion</a> did.</p>

<p>One of the most requested features in QRpedia is to have custom URLs.</p>

<p>For example, the British Museum may want a URL of "<strong>bm</strong>.qrwp.org".  This has two main advantages.</p>

<ol>
    <li>Better analytics. Although the British Museum is the only place likely to have the Rosetta Stone, many museums will have exhibits about "Ancient Egypt" or "Gold".  By differentiating museums, their statistics are easier to view.</li>
    <li>Branding opportunities.  A user will know that they've scanned a code belonging to a specific museum.</li>
</ol>

<p>From a technical perspective, this is fairly easy to implement.  Assuming that a museum is only generating codes in one language, we simply map $museum.qrwp to $language.qrwp - and record in the logging database as per usual.</p>
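
<p>A minimal sketch of that mapping, in Python.  The lookup table, its single entry, and the function name are all assumptions for illustration - this isn't QRpedia's actual code.</p>

<pre>
```python
# Hypothetical lookup table: museum prefix -> the one language its codes use.
MUSEUM_LANGUAGE = {
    "bm": "en",
}

def resolve_subdomain(host):
    """Return (language, museum) for a host such as "bm.qrwp.org".

    Unknown prefixes are assumed to be plain language codes, so
    "fr.qrwp.org" resolves to ("fr", None).
    """
    prefix = host.lower().split(".")[0]
    if prefix in MUSEUM_LANGUAGE:
        return MUSEUM_LANGUAGE[prefix], prefix
    return prefix, None

print(resolve_subdomain("bm.qrwp.org"))  # ('en', 'bm')
print(resolve_subdomain("fr.qrwp.org"))  # ('fr', None)
```
</pre>

<p>The museum half of the result is what would be recorded in the logging database; the language half decides which Wikipedia the visitor is sent to.</p>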

<p>However, there are a number of challenges around the naming of museums which mean considerable thought is needed before we implement this.</p>

<h2 id="length"><a href="https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/#length">Length</a></h2>

<p>QR codes work best when the URL inside them is as short as possible.</p>

<p>This means we don't want a URL like "<strong>BritishMuseum</strong>.qrwp.org" or even "<strong>PrestongrangeIndustrialHeritageMuseum</strong>.qrwp.org".</p>

<p>So, we need to choose suitable abbreviations.</p>

<h2 id="language-clashes"><a href="https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/#language-clashes">Language Clashes</a></h2>

<p>We could create a custom URL for the British Museum of "bm".  However, that's also the language code for the <a href="http://en.wikipedia.org/wiki/Bamanankan">Bambara language</a>.</p>

<p>There are several <a href="http://en.wikipedia.org/wiki/Language_codes">Language Codes</a> in use - covering two and three letter combinations.  There are currently <a href="http://meta.wikimedia.org/wiki/List_of_Wikipedias">282 different language versions of Wikipedia</a>.</p>

<p>Those mostly use two or three letters to distinguish between languages - but there is the occasional surprise, like "<a href="http://bat-smg.wikipedia.org/wiki/P%C4%97rms_poslapis">bat-smg</a>".</p>
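
<p>Checking a proposed prefix against the known language prefixes could look something like this.  The set below is a tiny illustrative sample, not the real list, and the function name is hypothetical.</p>

<pre>
```python
# A tiny, illustrative sample of Wikipedia language prefixes.
# A real check would need the full list of language editions.
LANGUAGE_CODES = {"en", "fr", "de", "bm", "bat-smg"}

def clashes_with_language(candidate):
    """True if a proposed museum prefix is already a language prefix."""
    return candidate.lower() in LANGUAGE_CODES

print(clashes_with_language("bm"))    # True
print(clashes_with_language("brit"))  # False
```
</pre>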

<h2 id="abbreviation-clashes"><a href="https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/#abbreviation-clashes">Abbreviation Clashes</a></h2>

<p>Suppose that the British Museum wanted a custom URL of "<strong>brit</strong>.qrwp.org" - that may clash with the (fictitious) <strong>Br</strong>azilian <strong>I</strong>nstitute for <strong>T</strong>echnology.</p>

<h2 id="we-need"><a href="https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/#we-need">We Need...</a></h2>

<p>We need to meet these aims for custom URLs:</p>

<ol>
    <li>Short</li>
    <li>Unique</li>
    <li>Recognisable</li>
    <li>Fairly distributed</li>
</ol>

<p>How on Earth do we do that?</p>

<p>On your marks... Get set... Discuss!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2011/11/qrpedia-custom-urls/feed/</wfw:commentRss>
			<slash:comments>12</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[When is a URL not a URL?]]></title>
		<link>https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/</link>
					<comments>https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 27 Jul 2011 11:37:57 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[usability]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[urls]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=4271</guid>

					<description><![CDATA[Summary  Twitter&#039;s way of linking URLs is broken.  It&#039;s annoying to users, and a pain in the arse to developers.  This quick post talks about the problem and offers a solution.  I&#039;ve raised a bug with Twitter and I hope you&#039;ll star it as important to you.   Preamble  A common trope in programming classes is &#34;how do you detect a valid email address?&#34;  It should be obvious, right?  A string of text,…]]></description>
										<content:encoded><![CDATA[<h2 id="summary"><a href="https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#summary">Summary</a></h2>

<p>Twitter's way of linking URLs is broken.  It's annoying to users, and a pain in the arse to developers.  This quick post talks about the problem and offers a solution.</p>

<p><a href="http://code.google.com/p/twitter-api/issues/detail?id=2240">I've raised a bug with Twitter</a> and I hope you'll star it as important to you.
<span id="more-4271"></span></p>

<h2 id="preamble"><a href="https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#preamble">Preamble</a></h2>

<p>A common trope in programming classes is "<a href="http://www.regular-expressions.info/email.html">how do you detect a valid email address</a>?"</p>

<p>It should be obvious, right?  A string of text, an @, a domain - probably ending in .com.
As it turns out, it's not that simple.  "who+o'toole@invalid.museum" is a potentially valid address, for example.
There are literally thousands of ways to detect the potentially infinite variety of email addresses.</p>

<p>The same is true for URLs - and slavish adherence to guidelines is killing Twitter's usefulness.</p>

<h2 id="the-url-matching-problem"><a href="https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#the-url-matching-problem">The URL Matching Problem</a></h2>

<p>Which of these strings should be turned into hyperlinks?</p>

<pre>www.bbc.co.uk

example.com

http://test

https://test.test

ftp://news.com
</pre>

<p>As it happens, Twitter only matches "https://test.test" and none of the others.</p>

<p><a href="https://twitter.com/edent/status/96172785436590080"><img src="https://shkspr.mobi/blog/wp-content/uploads/2011/07/URL-test-1.jpg" alt="" title="URL test 1" width="514" height="216" class="aligncenter size-full wp-image-4274"></a></p>

<p>Twitter's matching regex is, as far as I can tell, this:</p>

<pre>If it starts with http:// or https:// and has a dot in it - it's a URL</pre>
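
<p>As a sketch, that guessed rule can be written as a Python regular expression.  It is an assumption based on the observed behaviour, not Twitter's actual code.</p>

<pre>
```python
import re

# A guess at the rule, not Twitter's actual code:
# "http://" or "https://", then non-space text containing a dot.
GUESSED_RULE = re.compile(r"^https?://\S*\.\S+$")

candidates = [
    "www.bbc.co.uk",
    "example.com",
    "http://test",
    "https://test.test",
    "ftp://news.com",
]

linked = [c for c in candidates if GUESSED_RULE.match(c)]
print(linked)  # ['https://test.test']
```
</pre>

<p>Run against the five test strings, only "https://test.test" gets linked.</p>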

<p>I think this is a serious weakness.  Twitter users are sharing URLs which their followers can't click on - Twitter is also linking to URLs which don't exist.</p>

<p>I've picked these examples more or less at random.
<a href="https://twitter.com/ianvisits/status/82712842112991232"><img src="https://shkspr.mobi/blog/wp-content/uploads/2011/07/URL-test-2.jpg" alt="" title="URL test 2" width="514" height="216" class="aligncenter size-full wp-image-4275"></a></p>

<p><a href="https://twitter.com/PeakChief/status/82722453767462912"><img src="https://shkspr.mobi/blog/wp-content/uploads/2011/07/URL-test-3.jpg" alt="" title="URL test 3" width="514" height="216" class="aligncenter size-full wp-image-4276"></a></p>

<h2 id="solution"><a href="https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#solution">Solution?</a></h2>

<p>Much like the email regexes, I would take a much more lax approach.  Essentially, if it looks vaguely like a URL - link to it.</p>

<p>I would suggest the following rules:</p>

<ul>
    <li>If it starts with a protocol - http:// ftp:// tel: etc - create a hyperlink.</li>
    <li>If it starts with www. - create a hyperlink.</li>
    <li>If it ends with a "." followed by a <a href="http://data.iana.org/TLD/tlds-alpha-by-domain.txt">valid TLD</a> - create a hyperlink.</li>
    <li>If it contains a <a href="http://data.iana.org/TLD/tlds-alpha-by-domain.txt">valid TLD</a> followed by a slash then some other characters - create a hyperlink.</li>
</ul>
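
<p>Here is a rough Python sketch of those four rules.  The TLD set is a tiny sample standing in for the full IANA list, the set of schemes is an assumption (the "etc" is left open), and the function name is made up.</p>

<pre>
```python
import re

# Tiny sample for illustration; the full list is the IANA file linked above.
SAMPLE_TLDS = {"com", "org", "uk", "museum"}

# Assumed set of schemes; the "etc" in the rules above is left open.
SCHEME = re.compile(r"^(https?://|ftp://|tel:)", re.IGNORECASE)

def looks_like_url(token):
    """Apply the four lax rules, in order, to one whitespace-free token."""
    if SCHEME.match(token):
        return True
    if token.lower().startswith("www."):
        return True
    # Split off any path, then look at the label after the host's last dot.
    host, slash, rest = token.partition("/")
    if "." not in host:
        return False
    tld = host.rsplit(".", 1)[1].lower()
    if tld not in SAMPLE_TLDS:
        return False
    # Rule 3: ends with a TLD. Rule 4: TLD, slash, then more characters.
    return slash == "" or rest != ""

for token in ["www.bbc.co.uk", "example.com", "http://test", "hello.world"]:
    print(token, looks_like_url(token))
```
</pre>

<p>Read literally, the rules reject "example.com/" because nothing follows the slash.  Whether that is desirable is a separate question.</p>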

<p>The "correct" method would then be for Twitter to perform an <a href="http://en.wikipedia.org/wiki/HTTP#Request_methods">HTTP HEAD request</a> to see if the URL is potentially valid.  There are three drawbacks to this.</p>

<ol>
    <li>It may place excessive load on Twitter's servers to process and cache these requests.</li>
    <li>The URL may be that of an Intranet site - and thus inaccessible to Twitter.</li>
    <li>The URL may be valid but temporarily inaccessible.</li>
</ol>
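
<p>For illustration, a HEAD check might be sketched like this with Python's standard <code>urllib</code>.  The function names and the five second timeout are assumptions.  Note that a failure is reported as "unknown" rather than "invalid", because of the second and third drawbacks above.</p>

<pre>
```python
import urllib.error
import urllib.request

def build_head_request(url):
    """Build a HEAD request: headers only, no response body."""
    return urllib.request.Request(url, method="HEAD")

def probe(url, timeout=5):
    """Return the HTTP status code, or None if the host could not be reached.

    None means "unknown", not "invalid": the URL may be on an intranet
    or only temporarily down.
    """
    try:
        request = build_head_request(url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, just not with a 2xx
    except (urllib.error.URLError, OSError, ValueError):
        return None
```
</pre>

<p>A string with no scheme, such as "www.example.com", can't be requested as it stands, so it also comes back as unknown.</p>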

<p>Regardless of the method, surely it's inexcusable that "www.example.com" isn't detected as a URL whereas "http://bork.bork.bork" is?</p>

<h2 id="action"><a href="https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/#action">ACTION!</a></h2>

<p>If you think Twitter's approach to hyperlinks is wrong - please <a href="http://code.google.com/p/twitter-api/issues/detail?id=2240">make your voice heard at the bug report</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2011/07/when-is-a-url-not-a-url/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Bugs in Twitter Text Libraries]]></title>
		<link>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/</link>
					<comments>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 31 Mar 2010 10:27:50 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[usability]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[dabr]]></category>
		<category><![CDATA[parse]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[urls]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=1924</guid>

					<description><![CDATA[The Twitter Engineering Team have a set of text processing classes which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags.  Dabr makes use of them to keep in conformance with Twitter&#039;s style.  One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping…]]></description>
										<content:encoded><![CDATA[<p>The <a href="https://blog.twitter.com/engineering/en_us/a/2010/introducing-the-open-source-twitter-text-libraries">Twitter Engineering Team have a set of text processing classes</a> which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags.  Dabr makes use of them to keep in conformance with Twitter's style.</p>

<p>One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping the "http://" represents a 5% saving on Twitter's 140 character limit for messages, this is great.</p>

<p>So, I was mightily surprised to get <a href="http://twitter.com/schmmuck/status/11352406573">this bug report</a> from user "schmmuck"</p>

<div id="attachment_1927" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1927" class="size-full wp-image-1927" title="Dabr rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_19_22.jpg" alt="Dabr rendering error" width="480" height="320"><p id="caption-attachment-1927" class="wp-caption-text">Dabr rendering error</p></div>

<p>How very odd...  This is how it looks on <a href="http://m.twitter.com/">m.twitter.com</a>.</p>

<div id="attachment_1926" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1926" class="size-full wp-image-1926" title="m.twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_20_48.jpg" alt="m.twitter rendering error" width="480" height="320"><p id="caption-attachment-1926" class="wp-caption-text">m.twitter rendering error</p></div>

<p>Twitter also use <a href="http://mobile.twitter.com/">mobile.twitter.com</a> for smartphones.  Here's how that site renders the text.</p>

<div id="attachment_1925" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1925" class="size-full wp-image-1925" title="mobile.twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_21_54.jpg" alt="mobile.twitter rendering error" width="480" height="320"><p id="caption-attachment-1925" class="wp-caption-text">mobile.twitter rendering error</p></div>

<p>Finally, let's take a look at the "canonical" rendering at Twitter.com</p>

<div id="attachment_1928" style="width: 410px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1928" class="size-full wp-image-1928" title="Twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Twitter-rendering-error.jpg" alt="Twitter rendering error" width="400" height="213"><p id="caption-attachment-1928" class="wp-caption-text">Twitter rendering error</p></div>

<h2 id="the-problems"><a href="https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#the-problems">The Problem(s)</a></h2>

<p>The first issue is inconsistency.&nbsp; Twitter ought to be using the same regex for each of its sites.&nbsp; It doesn't.&nbsp; This means that different developers will get divergent experiences.&nbsp; This leads to confusion, which leads to fear, which, as we all know, leads to anger.... and so forth.</p>

<p>Secondly, and more importantly, parsing is <em>hard</em>.&nbsp; There are so many edge cases that errors inevitably creep in.&nbsp; My post about hashtags explains the problems in defining what <em>should</em> be recognised.</p>

<p>So, based on what we've seen, should Twitter recognise any of the following as URLs?</p>

<p>news.bbc.co.uk - no www there.</p>

<p>invalid.name - a silly URL, but a valid one.</p>

<p>खोज.com - International domains contain more than just ASCII</p>

<p>All the above are valid - yet they're not recognised by Twitter.</p>

<h2 id="a-simple-solution"><a href="https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#a-simple-solution">A (Simple) Solution?</a></h2>

<p>There is a <a href="http://www.iana.org/domains/root/db/">canonical list of TLDs</a> which is also available as a <a href="http://data.iana.org/TLD/tlds-alpha-by-domain.txt">plain text list</a>.</p>

<p>Any string containing a "." followed by a valid TLD, then followed by a space or "/" should be treated as a URL.</p>
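
<p>As a rough sketch, that rule could be expressed as a Python regular expression built from the TLD list.  The three TLDs here are only a sample, and treating the end of the text the same as a space is an added assumption.</p>

<pre>
```python
import re

# Small sample only; the canonical list is the IANA file linked above.
SAMPLE_TLDS = ["com", "name", "uk"]

# A dot, a TLD, then a space, a slash, or (an added assumption) end of text.
tld_group = "|".join(re.escape(tld) for tld in SAMPLE_TLDS)
URL_PATTERN = re.compile(
    r"\S+\.(?:" + tld_group + r")(?=[ /]|$)(?:/\S*)?",
    re.IGNORECASE,
)

text = "Try news.bbc.co.uk or invalid.name or खोज.com/search today"
print(URL_PATTERN.findall(text))
```
</pre>

<p>With the sample list, all three of the examples above are picked out, including the one with no "www" and the one that isn't ASCII.</p>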

<p>Your thoughts?</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
	</channel>
</rss>
