<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	     xmlns:dc="http://purl.org/dc/elements/1.1/"
	   xmlns:atom="http://www.w3.org/2005/Atom"
	     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>parse &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/parse/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Mon, 31 Mar 2025 07:18:00 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>parse &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[Bugs in Twitter Text Libraries]]></title>
		<link>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/</link>
					<comments>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 31 Mar 2010 10:27:50 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[usability]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[dabr]]></category>
		<category><![CDATA[parse]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[urls]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=1924</guid>

					<description><![CDATA[The Twitter Engineering Team have a set of text processing classes which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags.  Dabr makes use of them to keep in conformance with Twitter's style.  One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping…]]></description>
										<content:encoded><![CDATA[<p>The <a href="https://blog.twitter.com/engineering/en_us/a/2010/introducing-the-open-source-twitter-text-libraries">Twitter Engineering Team have a set of text processing classes</a> which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags.  Dabr uses them to stay consistent with Twitter's style.</p>

<p>One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping the "http://" saves 7 characters, which is 5% of Twitter's 140-character limit for messages, this is great.</p>

<p>So, I was mightily surprised to get <a href="http://twitter.com/schmmuck/status/11352406573">this bug report</a> from user "schmmuck":</p>

<div id="attachment_1927" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1927" class="size-full wp-image-1927" title="Dabr rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_19_22.jpg" alt="Dabr rendering error" width="480" height="320"><p id="caption-attachment-1927" class="wp-caption-text">Dabr rendering error</p></div>

<p>How very odd...  This is how it looks on <a href="http://m.twitter.com/">m.twitter.com</a>.</p>

<div id="attachment_1926" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1926" class="size-full wp-image-1926" title="m.twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_20_48.jpg" alt="m.twitter rendering error" width="480" height="320"><p id="caption-attachment-1926" class="wp-caption-text">m.twitter rendering error</p></div>

<p>Twitter also use <a href="http://mobile.twitter.com/">mobile.twitter.com</a> for smartphones.  Here's how that site renders the text.</p>

<div id="attachment_1925" style="width: 490px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1925" class="size-full wp-image-1925" title="mobile.twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Capture8_21_54.jpg" alt="mobile.twitter rendering error" width="480" height="320"><p id="caption-attachment-1925" class="wp-caption-text">mobile.twitter rendering error</p></div>

<p>Finally, let's take a look at the "canonical" rendering at Twitter.com:</p>

<div id="attachment_1928" style="width: 410px" class="wp-caption aligncenter"><img aria-describedby="caption-attachment-1928" class="size-full wp-image-1928" title="Twitter rendering error" src="https://shkspr.mobi/blog/wp-content/uploads/2010/03/Twitter-rendering-error.jpg" alt="Twitter rendering error" width="400" height="213"><p id="caption-attachment-1928" class="wp-caption-text">Twitter rendering error</p></div>

<h2 id="the-problems"><a href="https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#the-problems">The Problem(s)</a></h2>

<p>The first issue is inconsistency. Twitter ought to be using the same regex for each of its sites. It doesn't. This means that different developers will get divergent experiences. This leads to confusion, which leads to fear, which, as we all know, leads to anger... and so forth.</p>

<p>Secondly, and more importantly, parsing is <em>hard</em>. There are so many edge cases that errors inevitably creep in. My post about hashtags explains the problems in defining what <em>should</em> be recognised.</p>

<p>So, based on what we've seen, should Twitter recognise any of the following as URLs?</p>

<p>news.bbc.co.uk - no www there.</p>

<p>invalid.name - a silly URL, but a valid one.</p>

<p>खोज.com - International domains contain more than just ASCII.</p>

<p>All the above are valid - yet they're not recognised by Twitter.</p>

<h2 id="a-simple-solution"><a href="https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/#a-simple-solution">A (Simple) Solution?</a></h2>

<p>There is a <a href="http://www.iana.org/domains/root/db/">canonical list of TLDs</a> which is also available as a <a href="http://data.iana.org/TLD/tlds-alpha-by-domain.txt">plain text list</a>.</p>

<p>Any string containing a "." followed by a valid TLD, then followed by a space or "/" should be treated as a URL.</p>
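<p>As a rough illustration of that rule, here is a minimal Python sketch. It is not how Twitter's libraries work. The function name <code>find_urls</code> and the tiny <code>KNOWN_TLDS</code> set are made up for this example; a real version would load the full IANA list instead. It also treats the end of the message the same as a space, which is an assumption on my part.</p>

```python
# Assumption: a tiny hand-picked subset of TLDs, for illustration only.
# A real implementation would load the full IANA plain text list.
KNOWN_TLDS = {"com", "net", "org", "uk", "name"}


def find_urls(text, tlds=KNOWN_TLDS):
    """Return whitespace-separated tokens whose host part ends in a known TLD."""
    urls = []
    for token in text.split():
        # Everything before the first "/" is treated as the host.
        host = token.split("/", 1)[0]
        # Split the host at its last "." to isolate the candidate TLD.
        name, dot, tld = host.rpartition(".")
        if dot and name and tld.lower() in tlds:
            urls.append(token)
    return urls


if __name__ == "__main__":
    print(find_urls("Try news.bbc.co.uk or invalid.name or खोज.com/search today"))
```

<p>Because the check only looks at the text after the last "." in the host, non-ASCII names such as खोज.com need no special handling. The sketch is deliberately naive: it would also accept an email address like user@example.com, and it misses a URL that is immediately followed by punctuation.</p>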

<p>Your thoughts?</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2010/03/bugs-in-twitter-text-libraries/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
	</channel>
</rss>
