HTML Ruby and Bidirectional Text
The HTML <ruby> element and its companions let us add pronunciation above text. For example:
"When you visit the zoo, be sure to see the panda - 熊猫"
That is, the word or character which needs text above it is wrapped in <ruby>. The pronunciation is wrapped in <rt>. The <rp> element indicates the presence of a parenthesis - which isn't usually displayed, but will be shown if the browser doesn't support <ruby> syntax.
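As a minimal sketch, the markup for that sentence might look like the following. The pinyin reading "xióngmāo" is my own choice of annotation; the structure simply follows the description above.

```html
<!-- The base text sits directly inside <ruby>; the reading goes in <rt>. -->
<!-- The <rp> parentheses only appear in browsers without ruby support. -->
<p>When you visit the zoo, be sure to see the panda -
  <ruby>熊猫<rp>(</rp><rt>xióngmāo</rt><rp>)</rp></ruby>
</p>
```

In a browser without ruby support, this should fall back to something like 熊猫(xióngmāo).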
That's fairly easy for scripts written left-to-right. But how does it work for scripts like Arabic where the text is written right-to-left, but the user may want the pronunciations left-to-right?
Let's take the phrase "Hello World" in Arabic: مرحبا بالعالم. Google Translate tells me this is pronounced "marhaban bialealami".
For a single word, the directionality can be ignored. The browser should be smart enough to place the pronunciation above the word:
<p>Hello is: <ruby>مرحبا<rp>(</rp><rt>marhaban</rt><rp>)</rp></ruby>. What a useful word!</p>
Hello is: مرحبا. What a useful word!
What about if we have a few words - or a whole sentence - which is entirely RTL?
<p dir="rtl">مرحبا بالعالم</p>
Is displayed aligned to the right side of the screen:
مرحبا بالعالم
There are a few ways to add pronunciation.
Separate The Words
The first is to write each word separately. For example <ruby>1st word</ruby> <ruby>2nd word</ruby>. Obviously, this isn't normally how you'd write an RTL language! But it does work:
<p dir="rtl"><ruby>مرحبا<rp>(</rp><rt>marhaban</rt><rp>)</rp></ruby> <ruby>بالعالم<rp>(</rp><rt>bialealami</rt><rp>)</rp></ruby></p>
Which displays as:
مرحبا بالعالم
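It helps to think of the way the characters of the script are stored in memory. Text is stored in reading order, so for an RTL script the rightmost character on screen comes first. A word that displays as ABC is stored as C B A.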
So the above is written "correctly" - even though it looks odd in the source-code view.
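One rough way to see the stored order for yourself is to force the browser to lay the characters out left-to-right. This is only an illustrative sketch; the two-paragraph layout is my own, not part of the examples above.

```html
<!-- Normal rendering: the first stored letter (م) appears on the right. -->
<p>مرحبا</p>

<!-- Forced left-to-right: characters follow storage order, so م is now on the left. -->
<p><bdo dir="ltr">مرحبا</bdo></p>
```

The second paragraph will look reversed to anyone who reads Arabic, which is exactly the point: it shows the order the characters sit in the source.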
All At Once
But there is an alternative if you want the source text to look natural - i.e. [2nd word] [1st word].
It's a bit messy, but you can write the LTR text in <rt> "backwards"!
<p dir="rtl"><ruby>مرحبا بالعالم<rt>bialealami marhaban</rt></ruby></p>
مرحبا بالعالم
But, again, that doesn't seem very satisfying! It also divorces the pronunciation from the original word - which is unfortunate for screen readers.
The Ruby layout algorithm is usually clever enough to group words separated by spaces:
مرحبا بالعالم
Although, if the pronunciations have a significantly different length than each other, it can get a bit messy:
مرحبا بالعالم
In which case, you probably need to go for the first technique and wrap each word in its own <ruby> element:
مرحبا بالعالم
BDO
It's tempting to think that simply using the <bdo> element can help us here. It can't!
Using the bidirectional override will display characters RTL, rather than words.
<p dir="rtl"><ruby>مرحبا بالعالم<rt><bdo dir="rtl">marhaban bialealami</bdo></rt></ruby></p>
Becomes:
مرحبا بالعالم
I guess you could spell each word backwards. Which would be extremely annoying for everyone and a complete nightmare for screen readers!
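Purely to illustrate why that is a bad idea, a sketch of the backwards-spelling approach might look like this. The reversed strings are my own reversal of the romanisation used above.

```html
<!-- Each romanised word is typed backwards. <bdo dir="rtl"> then lays the
     characters out right-to-left, so they read correctly on screen. -->
<p dir="rtl"><ruby>مرحبا بالعالم<rt><bdo dir="rtl">nabahram imalaelaib</bdo></rt></ruby></p>
```

On screen the annotation should read "bialealami marhaban" from left to right, which puts each pronunciation roughly above its word, but the source is unreadable and a screen reader would be handed gibberish.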
Instead, it can be fixed if each word is then given an explicit LTR direction:
<p dir="rtl"><ruby>مرحبا بالعالم<rt>
<bdo dir="rtl">
<span dir="ltr">marhaban</span> <span dir="ltr">bialealami</span>
</bdo></rt></ruby></p>
مرحبا بالعالم
Is that it?
So, I think those are the only ways to mix pronunciation with bidirectional text. But I'd welcome any corrections and suggestions!
Paul Battley says:
I'd suggest that you're overcomplicating this. If you want to gloss a romanisation for each word, wrap each word in its own ruby element. It will always appear above the correct word, even if there's line wrapping. If, on the other hand, you want to add a semantic gloss, for example to write حَوّامتي مُمْتِلئة بِأَنْقَلَيْسون with "my hovercraft is full of eels" above, then you'd wrap the whole sentence in a ruby element.
@edent says:
Me, overcomplicate things?!?! 🙂
If I wrap the whole sentence in ruby, the text above goes the wrong way.
Displays as حَوّامتي مُمْتِلئة بِأَنْقَلَيْسون
Whereas the pronunciation should be c b a - so the pronunciation is above the right word. Does that make sense?
Paul Battley says:
Yes, but that's exactly why ruby should be used to gloss only the exact thing to which it corresponds. In your example, you have another problem: even if you reverse the English word order, if the length of romanisation and Arabic words differs significantly (which, in my experience, is quite often true, because of the unmarked or diacritic vowels, hamzah, etc.) then you still don't have the pronunciation above the correct word. I accept that it looks weird to have the Arabic sentence with visibly reversed word order in your editor, but that's an artefact of the display in your editor. The semantic ordering of codepoints remains correct.
If what you really want is to have a sentence in Arabic in the normal order, with the words above, then you can mark the whole section RTL with U+202E/U+202C (RIGHT-TO-LEFT OVERRIDE and POP DIRECTIONAL FORMATTING). The non-Arabic elements are then in a pretty confusing order, though, compounded by the fact that punctuation doesn't change direction (and angle brackets are visibly mirrored) but Latin letters do! But this is a more general problem: how do you handle mixed editing of HTML and content in right-to-left languages? I don't know how people normally deal with that.