Accents and eBooks


By and large, the English language doesn't use diacritical marks. Even our loanwords are stripped of them; we drink in a cafe rather than the more pretentious café. This has a consequence for HTML and, by extension, eBooks.

As a quick primer, modern computing gives us two main ways of displaying a letter with an accent. The first is simple - encode every single accented letter as a separate "pre-composed" character. So è (U+00E8), é (U+00E9), ê (U+00EA), and ë (U+00EB) are all stored as different codepoints.

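But this seems a little inefficient, and it can make it hard to search through text for an exact lexical match: a naive search for a plain "e" won't find "é", because the two are entirely different codepoints.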

So there is a second way to add accents. You take the base character - e (U+0065) - and then apply a separate "combining" accent character to it. For example the combining accent ◌́ (U+0301). That means you can add an accent to áńý ĺét́t́éŕ!́
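To make the two approaches concrete, here's a small Python sketch using the standard library's `unicodedata` module. It isn't taken from any eBook toolchain; it just shows that the pre-composed and combining forms of é are different sequences of codepoints until you normalise them to the same form.

```python
import unicodedata

precomposed = "\u00e9"   # é stored as one pre-composed codepoint
combined = "e\u0301"     # plain e followed by the combining acute accent

# They render the same, but they are not the same sequence of codepoints.
same_before = precomposed == combined
lengths = (len(precomposed), len(combined))

# NFC composes where it can; NFD decomposes into base letter plus combining marks.
composed = unicodedata.normalize("NFC", combined)
decomposed = unicodedata.normalize("NFD", precomposed)

print(same_before)                # False
print(lengths)                    # (1, 2)
print(composed == precomposed)    # True
print(decomposed == combined)     # True
```

Normalising both sides to the same form before comparing is the usual way to get an accent-aware match.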

Note, the combining accent ◌́ (U+0301) is a different character from the standalone acute accent ´ (U+00B4). In fact, most accents come in pre-composed, combining, and standalone (spacing) forms. This, understandably, causes much confusion!
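If you want to see which form you're dealing with, Python's `unicodedata` module will tell you. This is only a quick illustration: it prints the official Unicode name and the canonical combining class for each of the three forms of the acute accent. A class of 0 means the character takes up its own space; anything else means it attaches to the preceding character.

```python
import unicodedata

# Pre-composed letter, combining accent, and standalone (spacing) accent.
forms = ["\u00e9", "\u0301", "\u00b4"]

report = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
    for ch in forms
]

for codepoint, name, combining_class in report:
    print(codepoint, name, combining_class)
# U+00E9 LATIN SMALL LETTER E WITH ACUTE 0
# U+0301 COMBINING ACUTE ACCENT 230
# U+00B4 ACUTE ACCENT 0
```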

Here's a good example. I was reading the excellent Fallen Idols, when I noticed this typesetting bug.

The phrase reads "Swords of Qadisiyyah", but the macron that should sit over the letter "a" has been rendered as a separate dash.

It's always hard to transliterate languages. The Victory Arch in Iraq is known as قوس النصر, and usually written in English as the "Swords of Qādisīyah".

Examining the HTML code in the eBook, it was obvious that the publishers had used a macron ¯ (U+00AF) rather than the combining version ◌̄ (U+0304).
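A fix for this kind of bug could look something like the sketch below. It's a rough illustration rather than anything the publisher uses: `fix_spacing_macrons` is a made-up helper, the sample string is my own reconstruction, and it assumes the stray macron always comes directly after the unaccented Latin letter it belongs to.

```python
import re
import unicodedata

SPACING_MACRON = "\u00af"    # ¯ takes up its own space
COMBINING_MACRON = "\u0304"  # attaches to the preceding letter

def fix_spacing_macrons(text: str) -> str:
    """Hypothetical helper: swap a spacing macron that follows a letter
    for the combining macron, then compose the result with NFC."""
    repaired = re.sub(
        "([A-Za-z])" + SPACING_MACRON,
        r"\1" + COMBINING_MACRON,
        text,
    )
    return unicodedata.normalize("NFC", repaired)

# Assumed sample input, not the book's actual markup.
broken = "Swords of Qa\u00afdisi\u00afyah"
fixed = fix_spacing_macrons(broken)
print(fixed)  # Swords of Qādisīyah
```

The NFC step at the end turns each letter-plus-combining-macron pair into its pre-composed character where one exists, which tends to be the safer choice for fonts with patchy support for combining marks.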

I've reported it to the publisher. I've no idea if they'll fix it in a subsequent re-issue.



2 thoughts on “Accents and eBooks”

  1. EB says:

    Yes, the pre-composed (normalized) characters may be trickier to find in text, but I think that's why the Unicode docs devote a whole section/chapter to sorting (which involves text comparison just like searching.) I believe that the normalized canonical representation of characters (https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) is likely the most concise, but it's probably best to always rely on library code for text comparisons than just byte-by-byte comparisons (or strncmp).

  2. said on dataare.cool:

    @Edent I read a fair amount of stuff with Vietnamese characters in — in particular character names in @aliettedb’s work. It blows my mind how often people don't account for Viet-only Latin-script characters.

    This (https://dataare.cool/@owenblacker/112247743990634454) is a recent example, where the Viet-only characters i-tilde, a-dot-below and y-dot-below were in roman-type because they didn't load Bookerly Italic for Latin Extended Additional. In a book that only uses 2 fonts and makes copious use of Vietnamese names.

