Should ₹ be part of the Latin font subset?

by @edent

Some background reading. Skip if you’re familiar with fonts.

A font file contains a list of characters (usually letters, numbers, and punctuation) and glyphs (the drawn representation of each character). It is, of course, a lot more complicated than that.

Each character has a codepoint, which is usually written in hexadecimal. For example, U+0057 is the Latin letter Capital W, U+20AC is the Euro Sign €, and U+1F600 is the Grinning Face emoji 😀. These codepoints are assigned by the Unicode Consortium.
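If you want to see the codepoints for yourself, here is a minimal Python sketch. The helper name `codepoint_label` is made up for this example; it only wraps the built-in `ord`.

```python
def codepoint_label(ch: str) -> str:
    """Format a single character's codepoint in the usual U+XXXX style."""
    # ord() returns the codepoint as an integer; 04X pads it to at least
    # four uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"


for ch in "W€😀₹":
    print(ch, codepoint_label(ch))
```

The last character in that loop is the Rupee sign, which comes out as U+20B9.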

A font which contains thousands of characters can be several megabytes in size. That’s annoying when downloading a font file to display text on a website in a particular font.

It is possible to create a “subset” of a font which only contains the characters that you want. This makes the font file smaller, which makes downloading things quicker for the user.

Again – it is all a lot more complicated than that, but it’s a good approximation of the truth.

Traditionally, it makes sense to subset fonts by human languages. If you are writing in English, you don’t want the Greek set of characters. If you’re writing in Vietnamese, you don’t want Cyrillic characters.

Once a subset has been created, you can refer to it in CSS like this:

@font-face {
  font-family: 'CoolFont';
  src: url(https://example.com/font.woff2) format('woff2');
  unicode-range: U+0000-00FF, U+0131;
}

This tells the web browser that the font covers the characters U+0000 to U+00FF and, additionally, the character U+0131 – “Latin Small Letter Dotless I”.

U+0000 to U+007F are “Basic Latin”. They contain the traditional English letters, numbers, and some symbols.

U+0080 to U+00FF are “Latin-1 Supplement”. They contain “European” symbols like Ñ, å, and ÿ.
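To make the `unicode-range` idea concrete, here is a rough Python sketch which parses a descriptor like the one above and checks whether a character falls inside it. The function names are my own invention, and it only handles the three token shapes CSS allows (single codepoint, explicit range, and `?` wildcards). It is an illustration, not how a browser actually does it.

```python
def parse_unicode_range(descriptor: str) -> list[tuple[int, int]]:
    """Turn a CSS unicode-range value into inclusive (start, end) pairs."""
    ranges = []
    for part in descriptor.split(","):
        token = part.strip()
        if not token:
            continue
        if not token.upper().startswith("U+"):
            raise ValueError(f"not a unicode-range token: {token!r}")
        body = token[2:]
        if "-" in body:
            # Explicit range, e.g. U+0000-00FF
            start, end = body.split("-", 1)
            ranges.append((int(start, 16), int(end, 16)))
        elif "?" in body:
            # Wildcard, e.g. U+4?? means U+0400 to U+04FF
            low = int(body.replace("?", "0"), 16)
            high = int(body.replace("?", "F"), 16)
            ranges.append((low, high))
        else:
            # Single codepoint, e.g. U+0131
            value = int(body, 16)
            ranges.append((value, value))
    return ranges


def covers(ranges: list[tuple[int, int]], ch: str) -> bool:
    """Return True if the character's codepoint is inside any range."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ranges)


cool_font = parse_unicode_range("U+0000-00FF, U+0131")
print(covers(cool_font, "ı"))  # True: dotless i is listed explicitly
print(covers(cool_font, "₹"))  # False: U+20B9 is outside both ranges
```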

As Unicode has added more languages, they have scattered characters across the specification. And that’s where the problem lies.

The Problem

A user in India has complained that Google’s font subsetting ignores Indian users.

Here’s an example.

Google’s Roboto font has the following characters as part of its Latin subset:

  • U+0000-00FF Basic Latin and Latin-1 Supplement
  • U+0131 ı Latin Small Letter Dotless I
  • U+0152-0153 Œ and œ Ligature Oe
  • U+02BB-02BC ʻ and ʼ Modified Punctuation
  • U+02C6 ˆ Modifier Letter Circumflex Accent
  • U+02DA ˚ Ring Above
  • U+02DC ˜ Small Tilde
  • U+2000-206F General Punctuation
  • U+2074 ⁴ Superscript Four
  • U+20AC € euro sign
  • U+2122 ™ Trade Mark Sign
  • U+2191 ↑ Upwards Arrow
  • U+2193 ↓ Downwards Arrow
  • U+2212 − Minus Sign
  • U+2215 ∕ Division Slash

You can argue how useful or not some of these characters are – but what’s interesting is what’s missing.

India has twice as many English speakers as the United Kingdom.

A website written for an English speaking audience in India is likely to want the Latin subset of a font. But it will also want one local character – ₹ – U+20B9. The Rupee is the currency of India and its character is part of the “Currency Symbols” Unicode block.

So should the Rupee be part of the “Latin” subset?

Colonialism In Tech

The original complainant says:

this symbol [₹] is excluded (subsetted out) of many Latin fonts that originally included it due to an American assumption that English is not spoken in India.

I don’t know whether their assumption about Google is correct. But it seems odd to specifically include € in Google’s Latin subset, but not the Rupee. Latin is not synonymous with “European”.

Through a quirk of history, the Dollar symbol – $ – is in Basic Latin. The Yen currency symbol – ¥ – is included in the Latin-1 Supplement, as is the Pound – £.
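You can check this mechanically. The sketch below hard-codes the Roboto Latin ranges exactly as listed earlier in this post (so it is only as accurate as that list) and tests a handful of currency symbols against them. The names `ROBOTO_LATIN` and `in_subset` are mine.

```python
# Inclusive (start, end) codepoint ranges, copied from the list in this post.
ROBOTO_LATIN = [
    (0x0000, 0x00FF),  # Basic Latin and Latin-1 Supplement
    (0x0131, 0x0131),
    (0x0152, 0x0153),
    (0x02BB, 0x02BC),
    (0x02C6, 0x02C6),
    (0x02DA, 0x02DA),
    (0x02DC, 0x02DC),
    (0x2000, 0x206F),  # General Punctuation
    (0x2074, 0x2074),
    (0x20AC, 0x20AC),  # Euro sign
    (0x2122, 0x2122),
    (0x2191, 0x2191),
    (0x2193, 0x2193),
    (0x2212, 0x2212),
    (0x2215, 0x2215),
]


def in_subset(ch: str) -> bool:
    """Return True if the character falls inside any of the listed ranges."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ROBOTO_LATIN)


# Dollar, Pound, Yen, Euro, Rupee
missing = [ch for ch in "$£¥€₹" if not in_subset(ch)]
print(missing)  # ['₹']
```

Of those five symbols, only the Rupee falls outside the listed ranges.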

Roboto’s Latin subset contains Old English characters like þ (Thorn) and Ð (Eth).

Are obsolete characters used only in mediaeval text really more important to include than the currency for a billion people?

Google do include the ₹ in their “Latin Extended” subset. So if an Indian user wants to use their currency, they need to download a separate font which includes 1,011 characters they don’t need.

This is inefficient. It increases download size and energy use for potentially a billion people.

Some Simple Solutions

There are a few things which can be done here.

I think that Google probably should include a popular currency symbol in their Latin subset font. Yes, there’s a risk that the font might grow in size as more useful symbols are added. And, no, Latin doesn’t mean English and English doesn’t mean Indian – but we’re all trying to get along on this crowded planet. So let’s be flexible.

Google could create an “en_IN” subset which includes all the popular and useful characters needed by an Indian audience. It seems like there is sufficient demand for it.

Users should use the Google Fonts API to create a subset which has only the specific characters they want. That way they aren’t reliant on the whims of a megacorp to decide what counts for their language.
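As a rough sketch of that last option: my understanding is that the Google Fonts CSS endpoint accepts a `text=` query parameter which restricts the returned font to the characters you list. Treat that parameter, the `css2` URL, and the helper name `google_fonts_text_url` as assumptions to check against Google’s documentation rather than gospel.

```python
from urllib.parse import urlencode


def google_fonts_text_url(family: str, text: str) -> str:
    """Build a stylesheet URL asking for only the characters in `text`."""
    # Deduplicate and sort so the same set of characters always
    # produces the same (cacheable) URL.
    unique_chars = "".join(sorted(set(text)))
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    query = urlencode({"family": family, "text": unique_chars})
    return f"https://fonts.googleapis.com/css2?{query}"


print(google_fonts_text_url("Roboto", "₹"))
```

The resulting URL would go in a stylesheet link on the page, alongside the normal Latin subset.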

Finally, as developers, we should understand that what is “logical” and “orderly” isn’t always how our users see things. We have a huge range of biases and unexamined assumptions. Some of the earliest foundations of computer science are based on a very rigid and limited set of assumptions about the world. Let’s do our best to be more inclusive.

5 thoughts on “Should ₹ be part of the Latin font subset?”

  1. Ellis says:

    lovely post. let latin be whatever we all need latin to be

  2. Alex says:

    Wouldn’t another sensible approach be to have a currency symbol grouping that Google (and others) include in standard Latin sets? In an increasingly interconnected world we all wind up using currency symbols from many countries, especially in e-commerce. Given there are definitely fewer than 192 currency symbols (probably far, far fewer given the number of $, £ and € users) this is a simple way to bound the problem and be inclusive.

    1. @edent says:

      There is a currency group in Unicode – https://en.wikipedia.org/wiki/Currency_Symbols_(Unicode_block) – but it contains a lot of obsolete currencies like the French Franc and Spanish Peseta.

  3. Both the þorn and the eð are used in modern Icelandic, which is also another Latin-orthographed language – obviously used far less than English is in India, though, and therefore doesn’t affect your argument beyond adding weight to your assertion that we do, indeed, all have biases and unexamined assumptions.
