<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	     xmlns:dc="http://purl.org/dc/elements/1.1/"
	   xmlns:atom="http://www.w3.org/2005/Atom"
	     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>mandarin &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/mandarin/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Fri, 08 Nov 2024 07:53:30 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>mandarin &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[How Do You Sort Chinese Numbers?]]></title>
		<link>https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/</link>
					<comments>https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Tue, 08 Nov 2016 11:27:48 +0000</pubDate>
				<category><![CDATA[usability]]></category>
		<category><![CDATA[chinese]]></category>
		<category><![CDATA[mandarin]]></category>
		<category><![CDATA[NaBloPoMo]]></category>
		<category><![CDATA[unicode]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=23428</guid>

					<description><![CDATA[Imagine you have a series of numbers you wish to sort.  Sorting is a well-known computer science problem - generally speaking you compare one value to the next and then move the item either up or down a list.  With &#34;English&#34; characters, that&#039;s fairly easy.  When a computer sees the character 1 it&#039;s really seeing the Unicode character U+0031.  When it sees 2 it&#039;s really seeing the character U+0032…]]></description>
										<content:encoded><![CDATA[<p>Imagine you have a series of numbers you wish to sort.  Sorting is a well-known computer science problem - generally speaking, you compare one value to the next and then move the item either up or down a list.</p>

<p>With "English" characters, that's fairly easy.</p>

<p>When a computer sees the character <code>1</code> it's <em>really</em> seeing the Unicode character <code>U+0031</code>.  When it sees <code>2</code> it's <em>really</em> seeing the character <code>U+0032</code> and so on.</p>

<p>The <a href="https://en.wikipedia.org/wiki/Arabic_numerals">Arabic numerals</a> we use (0 - 9) are stored in Unicode in the same order as their values, from <code>U+0030</code> to <code>U+0039</code>. This makes it very easy for a computer to sort "Western" digits.</p>
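
<p>As a minimal sketch of that point, the Python snippet below sorts a shuffled string of digits purely by codepoint. The variable names are my own, chosen for illustration.</p>

```python
# Each ASCII digit's code point rises with its numeric value,
# so a plain code point sort is also a numeric sort for single digits.
digits = list("7301952864")

sorted_digits = sorted(digits)  # Python compares strings by code point
code_points = [f"U+{ord(d):04X}" for d in sorted_digits]

print(sorted_digits)
print(code_points)
```

<p>This only holds for single digits. Multi-digit numbers stored as text need more care, as the file listings further down show.</p>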

<p>But for Chinese... Well, it's <em>complicated!</em></p>

<h2 id="counting-in-mandarin-chinese"><a href="https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/#counting-in-mandarin-chinese">Counting in Mandarin Chinese</a></h2>

<p>Here's a very quick primer on Chinese numbers.</p>

<p>一 = 1<br>
二 = 2<br>
三 = 3<br>
四 = 4<br>
五 = 5<br>
六 = 6<br>
七 = 7<br>
八 = 8<br>
九 = 9<br>
十 = 10<br>
十一 = 11<br>
十二 = 12<br>
二十 = 20<br>
二十一 = 21<br>
二十二 = 22<br>
一百 = 100<br>
一百零一 = 101<br>
一百二十三 = 123</p>

<p>In <a href="http://www.amathsdictionaryforkids.com/qr/b/base10system.html">Base-10</a> the length of a whole number reflects its size. A 4 digit number is <em>always</em> bigger than a 3 digit number.</p>

<p>In Chinese, a 3 character number like 四十二 (42) is <em>longer</em> than a 2 character number like 九十 (90), yet its value is <em>smaller</em>.</p>

<p>But that's not the worst of it!</p>

<p>Because of the <a href="https://news.ycombinator.com/item?id=8041288">controversial</a> process of <a href="https://en.wikipedia.org/wiki/Han_unification">Han Unification</a>, a whole bunch of Chinese, Japanese, and Korean (CJK) characters are lumped together in the same Unicode code block.  This leaves us with the somewhat weird situation where the numerical order of the number characters doesn't match the order in which they appear in Unicode.</p>

<p>Here's how the characters are represented:</p>

<table>
<thead>
<tr>
  <th align="right">Character</th>
  <th align="left">Number</th>
  <th align="left">Unicode Codepoint</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">一</td>
  <td align="left">1</td>
  <td align="left">U+4E00</td>
</tr>
<tr>
  <td align="right">二</td>
  <td align="left">2</td>
  <td align="left">U+4E8C</td>
</tr>
<tr>
  <td align="right">三</td>
  <td align="left">3</td>
  <td align="left">U+4E09</td>
</tr>
<tr>
  <td align="right">四</td>
  <td align="left">4</td>
  <td align="left">U+56DB</td>
</tr>
<tr>
  <td align="right">五</td>
  <td align="left">5</td>
  <td align="left">U+4E94</td>
</tr>
<tr>
  <td align="right">六</td>
  <td align="left">6</td>
  <td align="left">U+516D</td>
</tr>
<tr>
  <td align="right">七</td>
  <td align="left">7</td>
  <td align="left">U+4E03</td>
</tr>
<tr>
  <td align="right">八</td>
  <td align="left">8</td>
  <td align="left">U+516B</td>
</tr>
<tr>
  <td align="right">九</td>
  <td align="left">9</td>
  <td align="left">U+4E5D</td>
</tr>
<tr>
  <td align="right">十</td>
  <td align="left">10</td>
  <td align="left">U+5341</td>
</tr>
<tr>
  <td align="right">百</td>
  <td align="left">100</td>
  <td align="left">U+767E</td>
</tr>
</tbody>
</table>

<p>Which, if my sorting is correct, gives us a codepoint ordering of:
<code>1 7 3 9 2 5 8 6 10 4 100</code></p>
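
<p>The ordering can be checked mechanically. This is a small sketch in Python, using only the characters from the table above. The dictionary is typed in by hand for illustration.</p>

```python
# Characters and values copied from the table above.
numerals = {
    "一": 1, "二": 2, "三": 3, "四": 4, "五": 5, "六": 6,
    "七": 7, "八": 8, "九": 9, "十": 10, "百": 100,
}

# sorted() on strings compares code points, not meanings.
by_code_point = sorted(numerals)
code_point_order = [numerals[ch] for ch in by_code_point]

for ch in by_code_point:
    print(f"U+{ord(ch):04X}  {ch}  {numerals[ch]}")

# code_point_order is [1, 7, 3, 9, 2, 5, 8, 6, 10, 4, 100]
```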

<p>This makes it <strong>impossible</strong> to perform even a basic sort of a simple list of numbers without first doing some complex fiddling to transform the characters into numbers.</p>

<h2 id="it-gets-even-more-complicated"><a href="https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/#it-gets-even-more-complicated">It gets even more complicated.</a></h2>

<p>Anyone who has tried to sort a list of files with numbers in their names knows that computers don't always see the world in the same way as humans.  It's quite common to see a sorted list which looks like this:</p>

<pre><code>10.mp3
11.mp3
1.mp3
20.mp3
2.mp3
3.mp3
4.mp3
...
</code></pre>

<p>Why? Because sorting by "text" is different to sorting by "value".</p>

<p>How do Chinese file names get sorted?  Here's Ubuntu's file manager trying to sort some files with Chinese numbers in them:
<img src="https://shkspr.mobi/blog/wp-content/uploads/2016/10/Chinese-characters-in-file-names-sorted-in-Linux-fs8.png" alt="Chinese characters in filenames sorted in linux - the files are in the wrong order" width="150" height="478" class="aligncenter size-full wp-image-23432"></p>

<p>Yet another ordering!  Why?  It turns out that <a href="https://en.wikipedia.org/wiki/Chinese_characters#Indexing">there are <em>lots</em> of ways to sort Chinese characters</a>.</p>

<p>In this case, the <a href="https://twitter.com/m13253/status/784726363282415617">characters are sorted according to the alphabetical ("English") order of their pronunciation</a>!  That's the equivalent of sorting the numbers 1 - 10 <em>alphabetically</em>: eight five four nine one seven six ten three two.</p>
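
<p>To make that concrete, here's a small Python sketch. It assumes that "pronunciation order" means the alphabetical order of the toneless pinyin romanisation, which I've typed in by hand. Real collation tables are more involved, so this won't necessarily reproduce the screenshot exactly.</p>

```python
english = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
alphabetical = sorted(english)

# Toneless pinyin entered by hand for this sketch (an assumption,
# not data taken from any real collation table).
pinyin = {
    "一": "yi", "二": "er", "三": "san", "四": "si", "五": "wu",
    "六": "liu", "七": "qi", "八": "ba", "九": "jiu", "十": "shi",
}
by_pronunciation = sorted(pinyin, key=pinyin.get)

print(" ".join(alphabetical))
print(" ".join(by_pronunciation))
```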

<h2 id="can-we-make-it-even-more-complicated"><a href="https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/#can-we-make-it-even-more-complicated">Can we make it even more complicated?</a></h2>

<p>Of course!</p>

<p>Let's add some <a href="https://en.wikipedia.org/wiki/Gujarati_alphabet#Digits">Gujarati digits</a> to the mix.  They look quite similar to our familiar Arabic digits and, like Arabic digits, have a sensible Unicode ordering.</p>

<p>Imagine a folder with the files <code>1</code>, <code>2</code>, <code>3</code>, <code>10</code> - with the numbers in Arabic, Chinese, and Gujarati.  How would you expect the files to be sorted?  Should <code>1</code> and <code>一</code> be grouped with Gujarati's <code>૧</code>?</p>

<p>Naïvely we might expect the order to be 1, 2, 3, 10, ૧, ૨, ૩, ૧૦, 一, 二, 三, 十.</p>
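
<p>Both possible answers can be sketched in Python. The helpers <code>numeric_value</code> and <code>script_rank</code> are names I've made up for illustration, and the script detection is a crude guess based on Unicode character names. It relies on Python's <code>int()</code> accepting any Unicode decimal digits, and on <code>unicodedata.numeric()</code> knowing the value of single characters such as <code>十</code>.</p>

```python
import unicodedata

def numeric_value(name):
    """Best-effort value of a short numeric string (illustrative helper)."""
    if name.isdecimal():
        return int(name)  # int() accepts Arabic and Gujarati digits alike
    if len(name) == 1:
        return int(unicodedata.numeric(name))  # single characters such as 一 or 十
    raise ValueError(f"cannot interpret {name!r}")

# First word of the Unicode character name, e.g. "DIGIT ONE",
# "GUJARATI DIGIT ONE", "CJK UNIFIED IDEOGRAPH-4E00".
SCRIPTS = ["DIGIT", "GUJARATI", "CJK"]

def script_rank(name):
    """Crude script detection from the first character's Unicode name."""
    return SCRIPTS.index(unicodedata.name(name[0]).split()[0])

names = ["十", "૧૦", "3", "二", "૧", "10", "一", "૩", "2", "三", "૨", "1"]

# Grouped by script, then by value: the naive expectation.
by_script = sorted(names, key=lambda n: (script_rank(n), numeric_value(n)))

# Grouped by value, so that 1, ૧ and 一 sit together.
by_value = sorted(names, key=lambda n: (numeric_value(n), script_rank(n)))

print(by_script)
print(by_value)
```

<p>Neither ordering is obviously "right". The point is that a tool has to understand the values before it can offer either of them.</p>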

<p>Ubuntu handles it two different ways.  In the GUI, the files are grouped:
<img src="https://shkspr.mobi/blog/wp-content/uploads/2016/10/Arabic-Chinese-and-Gujarati-numbers-in-filenames-the-ordering-is-inconsistent-fs8.png" alt="Arabic, Chinese, and Gujarati numbers in filenames - the ordering is inconsistent" width="152" height="449" class="aligncenter size-full wp-image-23438"></p>

<p>On the command line, we find yet another weird way to order files:</p>

<pre><code>10.mp3
૧૦.mp3
1.mp3
૧.mp3
2.mp3
૨.mp3
3.mp3
૩.mp3
一.mp3
三.mp3
二.mp3
十.mp3
</code></pre>

<p>Would <em>any</em> human expect an ordering like this?</p>

<h2 id="whats-the-solution"><a href="https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/#whats-the-solution">What's the solution?</a></h2>

<p>I've complained before that <a href="https://shkspr.mobi/blog/2013/06/is-github-racist/">modern computing tools often ignore modern languages</a>.  Usually it's not outright racism - just an ignorance of how the world works and how people interact with machines.</p>

<p>The correct way, in my opinion, is to have <em>context aware</em> tools which empathise with what the user is trying to achieve.</p>

<p>There are several <a href="http://stackoverflow.com/questions/15076443/convert-numbers-in-chinese-characters-to-arabic-numbers">algorithms for converting "Chinese numbers" into "Arabic numbers"</a>.  When a tool encounters a character which represents a number, it should assume that <em>the numerical representation contains semantic meaning</em>.</p>
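
<p>To give a flavour of what such a conversion involves, here's a minimal Python sketch of my own. It is not one of the linked algorithms. It only understands the characters listed, only handles values below ten thousand, and ignores colloquial shortcuts, so treat it as an illustration rather than a complete converter.</p>

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def chinese_to_int(text):
    """Convert a simple Chinese numeral to an int (hypothetical helper)."""
    total = 0
    pending = 0  # the digit still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, like the 十 in 十二, counts as "one ten".
            total += (pending or 1) * UNITS[ch]
            pending = 0
        else:
            raise ValueError(f"unsupported character {ch!r}")
    return total + pending

names = ["九十", "四十二", "一百二十三", "十", "二"]
in_value_order = sorted(names, key=chinese_to_int)
print(in_value_order)
```

<p>With a key function like this, the tool sorts by what the characters <em>mean</em>, so 四十二 (42) lands before 九十 (90) despite being longer.</p>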

<p>Yes, it might be hard work - but that's what computers are here for. They do hard work so humans don't have to. And if your computer can't even sort files in the correct order, what else might it be getting wrong?</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2016/11/how-do-you-sort-chinese-numbers/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
	</channel>
</rss>
