How Do You Sort Chinese Numbers?
Imagine you have a series of numbers you wish to sort. Sorting is a well-known computer science problem - generally speaking, you compare one value to the next and then move the item either up or down a list.
With "English" characters, that's fairly easy.
When a computer sees the character 1 it's really seeing the Unicode character U+0031. When it sees 2 it's really seeing the character U+0032 and so on.
The Arabic numbers we use (0 - 9) are stored in Unicode in the same order as their values. This makes it very easy for a computer to sort "Western" numbers.
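As a small illustration (a Python sketch of my own, not taken from any particular tool), the built-in ord function shows that the code points of the ASCII digits rise in step with their values, so a plain text sort of single digits is also a numeric sort:

```python
# Code points of the ASCII digits rise in step with their numeric values.
digits = "0123456789"
code_points = [ord(ch) for ch in digits]

print([f"U+{cp:04X}" for cp in code_points])  # runs from U+0030 to U+0039

# Because of that, sorting single digits as text also sorts them by value.
shuffled = ["7", "2", "9", "0", "4"]
sorted_digits = sorted(shuffled)
print(sorted_digits)  # ['0', '2', '4', '7', '9']
```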
But for Chinese... Well, it's complicated!
Counting in Mandarin Chinese
Here's a very quick primer on Chinese numbers.
一 = 1
二 = 2
三 = 3
四 = 4
五 = 5
六 = 6
七 = 7
八 = 8
九 = 9
十 = 10
十一 = 11
十二 = 12
二十 = 20
二十一 = 21
二十二 = 22
一百 = 100
一百零一 = 101
一百二十三 = 123
In Base-10 the length of a number reflects its size. A 4 digit number is always bigger than a 3 digit number.
In Chinese, a 3 character number like 四十二 (42) is longer than a 2 character number like 九十 (90), yet its value is smaller.
But that's not the worst of it!
Because of the controversial process of Han Unification, a whole bunch of Chinese, Japanese, and Korean (CJK) characters are lumped together in the same Unicode code block. This leaves us with the somewhat weird situation where the numerical order of these characters doesn't match the order in which they appear in Unicode.
Here's how the characters are represented:
Character | Number | Unicode Codepoint |
---|---|---|
一 | 1 | U+4E00 |
二 | 2 | U+4E8C |
三 | 3 | U+4E09 |
四 | 4 | U+56DB |
五 | 5 | U+4E94 |
六 | 6 | U+516D |
七 | 7 | U+4E03 |
八 | 8 | U+516B |
九 | 9 | U+4E5D |
十 | 10 | U+5341 |
百 | 100 | U+767E |
Which, if my sorting is correct, gives us an ordering of: 1 7 3 9 2 5 8 6 10 4 100
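The table can be checked with a few lines of Python. This sketch copies the characters and values from the table into a dictionary (the variable names are mine) and sorts the characters the way a naive tool would, by code point:

```python
# Characters and values copied from the table above.
numerals = {
    "一": 1, "二": 2, "三": 3, "四": 4, "五": 5, "六": 6,
    "七": 7, "八": 8, "九": 9, "十": 10, "百": 100,
}

# Python compares strings by code point, so this is the "naive" text sort.
by_code_point = sorted(numerals)
values_in_code_point_order = [numerals[ch] for ch in by_code_point]

for ch in by_code_point:
    print(f"U+{ord(ch):04X}  {ch}  {numerals[ch]}")

print(values_in_code_point_order)
# [1, 7, 3, 9, 2, 5, 8, 6, 10, 4, 100]
```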
This makes it impossible to perform even a basic sort of a simple list of numbers without first doing some complex fiddling to transform the characters into numbers.
It gets even more complicated.
Anyone who has tried to sort a list of files with numbers in their names knows that computers don't always see the world in the same way as humans. It's quite common to see a sorted list which looks like this:
10.mp3
11.mp3
1.mp3
20.mp3
2.mp3
3.mp3
4.mp3
...
Why? Because sorting by "text" is different to sorting by "value".
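Here's a rough Python sketch of the difference. The file names are the ones from the listing above; numeric_key is a hypothetical helper of mine and assumes each name starts with ASCII digits. Python compares strings by code point, so its "text" order isn't identical to the locale-aware listing above, but it goes wrong for the same reason:

```python
import re

files = ["1.mp3", "2.mp3", "3.mp3", "4.mp3", "10.mp3", "11.mp3", "20.mp3"]

# Sorting by "text": characters are compared one at a time, left to right.
as_text = sorted(files)
print(as_text)
# ['1.mp3', '10.mp3', '11.mp3', '2.mp3', '20.mp3', '3.mp3', '4.mp3']


def numeric_key(name):
    """Hypothetical helper: sort on the leading number, then on the name."""
    match = re.match(r"[0-9]+", name)
    number = int(match.group()) if match else -1  # names without a number go first
    return (number, name)


# Sorting by "value": the leading digits are parsed into an integer first.
as_value = sorted(files, key=numeric_key)
print(as_value)
# ['1.mp3', '2.mp3', '3.mp3', '4.mp3', '10.mp3', '11.mp3', '20.mp3']
```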
How do Chinese file names get sorted? Here's Ubuntu's File manager trying to sort some files with Chinese numbers in them:
Yet another ordering! Why? It turns out that there are lots of ways to sort Chinese characters.
In this case, the characters are sorted according to the "English" pronunciation order! That's the equivalent of sorting the numbers 1 - 10 alphabetically: eight five four nine one seven six ten three two.
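To see how odd that is, this short Python sketch sorts the numbers 1 - 10 by their English names. The dictionary is just something I typed in for illustration:

```python
# English names for 1 - 10, typed in by hand for this illustration.
names = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
}

# Sort the numbers by how their names are spelled, not by their values.
alphabetical = sorted(names, key=names.get)
print(alphabetical)
# [8, 5, 4, 9, 1, 7, 6, 10, 3, 2]
```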
Can we make it even more complicated?
Of course!
Let's add some Gujarati digits to the mix. They look quite similar to our familiar Arabic digits and, like Arabic digits, have a sensible Unicode ordering.
Imagine a folder with the files 1, 2, 3, 10 - with the numbers in Arabic, Chinese, and Gujarati. How would you expect the files to be sorted? Should 1 and 一 be grouped with Gujarati's ૧?
Naïvely we might expect the order to be 1, 2, 3, 10, ૧, ૨, ૩, ૧૦, 一, 二, 三, 十.
Ubuntu handles it two different ways. In the GUI, the files are grouped:
On the command line, we find yet another weird way to order files:
10.mp3
૧૦.mp3
1.mp3
૧.mp3
2.mp3
૨.mp3
3.mp3
૩.mp3
一.mp3
三.mp3
二.mp3
十.mp3
Would any human expect an ordering like this?
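For comparison, here is a sketch of one ordering a human might prefer: group the files by value, whatever the script. It leans on two things Python's standard library already knows: int() accepts decimal digits from any script, and unicodedata.numeric() returns the value of a single numeric character such as 十. The helper names are hypothetical, and multi-character Chinese numbers like 二十一 are not handled here.

```python
import unicodedata


def stem_value(stem):
    """Hypothetical helper: numeric value of a file name stem."""
    if stem.isdecimal():
        # Decimal digits in any script, e.g. "10" or "૧૦".
        return int(stem)
    if len(stem) == 1:
        # A single numeric character, e.g. "一" or "十".
        return int(unicodedata.numeric(stem))
    raise ValueError(f"cannot read {stem!r} as a number")


def value_key(name):
    stem = name.rsplit(".", 1)[0]
    # Sort by value first, then by name so ties come out in a stable order.
    return (stem_value(stem), name)


files = [
    "10.mp3", "૧૦.mp3", "1.mp3", "૧.mp3", "2.mp3", "૨.mp3",
    "3.mp3", "૩.mp3", "一.mp3", "三.mp3", "二.mp3", "十.mp3",
]

grouped = sorted(files, key=value_key)
print(grouped)
# ['1.mp3', '૧.mp3', '一.mp3', '2.mp3', '૨.mp3', '二.mp3',
#  '3.mp3', '૩.mp3', '三.mp3', '10.mp3', '૧૦.mp3', '十.mp3']
```

Whether 1, ૧ and 一 should sit next to each other like this, or whether each script should keep its own group, is a design choice rather than a fact.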
What's the solution?
I've complained before that modern computing tools often ignore modern languages. Usually it's not outright racism - just an ignorance of how the world works and how people interact with machines.
The correct way, in my opinion, is to have context-aware tools which empathise with what the user is trying to achieve.
There are several algorithms for converting "Chinese numbers" into "Arabic numbers". When a tool encounters a character which represents a number, it should assume that the numerical representation contains semantic meaning.
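Here's a minimal sketch of one such algorithm in Python. It's my own illustration rather than a reference implementation, and it assumes numbers below 10,000 written with 零, 一 to 九, 十, 百 and 千. Larger units such as 万, formal "banker's" numerals, and colloquial shortenings are out of scope.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text):
    """Sketch: convert a Chinese number below 10,000 to an int."""
    total = 0
    pending = 0  # the digit still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, as in 十 or 十一, means "one of that unit".
            total += (pending or 1) * UNITS[ch]
            pending = 0
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return total + pending


print(chinese_to_int("四十二"))      # 42
print(chinese_to_int("九十"))        # 90
print(chinese_to_int("一百二十三"))  # 123

# With the conversion as the sort key, the numbers sort by value.
in_value_order = sorted(["九十", "四十二", "一百二十三", "十一"], key=chinese_to_int)
print(in_value_order)
# ['十一', '四十二', '九十', '一百二十三']
```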
Yes, it might be hard work - but that's what computers are here for. They do hard work so humans don't have to. And if your computer can't even sort files in the correct order, what else might it be getting wrong?
Marcus Downing says:
The correct behaviour isn't to "convert Chinese numbers into Arabic numbers", for several good reasons:
As you point out, Arabic numbers often aren't sorted correctly either - I've seen too many file listings with [1, 10, 2...] over the years and across operating systems.
Arabic numbers aren't inherently more correct than any other encoding. There are ways in which Chinese numbers have an advantage over Arabic ones.
Commas (23,000), decimals (3.92134), exponents (3.04e9), units (3.5 MB), etc. mean it's even more complicated.
The upshot is that sorting numbers by the individual characters in a string is never going to be good enough. The answer is to convert the numbers, whatever their form, into binary numbers and sort those.
For some time now (since at least the days of Windows XP), there have been options for numerical file name sorting. But of course, everything that desktop software learned over decades has to be relearned on the web (even today, Google Docs does this wrong).