HTML Ruby and Bidirectional Text
The set of HTML <ruby> elements allows us to add pronunciation above text. For example:
"When you visit the zoo, be sure to see the panda - 熊猫."
This is written as:
HTML
<ruby>熊<rp>(</rp><rt>Xióng</rt><rp>)</rp></ruby><ruby>猫<rp>(</rp><rt>māo</rt><rp>)</rp></ruby>.
That is, the word or character which needs text above it is wrapped in <ruby>. The pronunciation is wrapped in <rt>. The <rp> element holds a fallback parenthesis, which isn't usually displayed but will be shown if the browser doesn't support <ruby> syntax.
That's fairly easy for scripts written left-to-right. But how does it work for scripts like Arabic where the text is written right-to-left, but the user may want the pronunciations left-to-right?
Let's take the phrase "Hello World" in Arabic: مرحبا بالعالم. Google Translate tells me this is pronounced "marhaban bialealami".
For a single word, the directionality can be ignored. The browser should be smart enough to place the pronunciation above the word:
HTML
<p>Hello is: <ruby>مرحبا<rp>(</rp><rt>marhaban</rt><rp>)</rp></ruby>. What a useful word!</p>
Hello is: مرحبا. What a useful word!
What about if we have a few words - or a whole sentence - which is entirely RTL?
HTML
<p dir="rtl">مرحبا بالعالم</p>
Is displayed aligned to the right side of the screen:
مرحبا بالعالم
There are a few ways to add pronunciation.
Separate The Words
The first is to write each word separately. For example <ruby>1st word</ruby> <ruby>2nd word</ruby>. Obviously, this isn't normally how you'd write an RTL language! But it does work:
HTML
<p dir="rtl"><ruby>مرحبا<rp>(</rp><rt>marhaban</rt><rp>)</rp></ruby> <ruby>بالعالم<rp>(</rp><rt>bialealami</rt><rp>)</rp></ruby></p>
Which displays as:
مرحبا بالعالم
It helps to think of the way the characters of the script are stored in memory.
A word that displays as ABC is stored as C B A.
So the above is written "correctly" - even though it looks odd in the source-code view.
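One way to see the stored order for yourself is to force the browser to lay the characters out left-to-right. This is only a sketch: it relies on the <bdo> element (discussed further down), whose dir attribute overrides the normal bidirectional ordering.

```html
<!-- Normal rendering: the first word in the source appears on the right. -->
<p dir="rtl">مرحبا بالعالم</p>

<!-- Forced left-to-right: the characters are drawn in the order they are
     stored, so the first stored letter (م) ends up on the left. -->
<p><bdo dir="ltr">مرحبا بالعالم</bdo></p>
```

The second paragraph should read as the first one reversed letter by letter, which is the stored order made visible.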
All At Once
But there is an alternative if you want the source text to look natural - i.e. [2nd word] [1st word].
It's a bit messy, but you can write the LTR text in <rt> "backwards"!
HTML
<p dir="rtl"><ruby>مرحبا بالعالم<rt>bialealami marhaban</rt></ruby></p>
مرحبا بالعالم
But, again, that doesn't seem very satisfying! It also divorces the pronunciation from the original word - which is unfortunate for screenreaders.
The Ruby layout algorithm is usually clever enough to group words separated by spaces:
مرحبا بالعالم
مرحبا بالعالم
Although, if the pronunciations have a significantly different length than each other, it can get a bit messy:
مرحبا بالعالم
مرحبا بالعالم
In which case, you probably need to go for the first technique and wrap each word in its own <ruby> element:
مرحبا بالعالم
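As a rough sketch of that fallback, here is the per-word markup again with one deliberately over-long annotation. The long <rt> text is a made-up placeholder rather than a real transliteration; it is only there to show that each annotation stays attached to its own word.

```html
<p dir="rtl">
  <!-- First word in reading order, with its usual pronunciation. -->
  <ruby>مرحبا<rp>(</rp><rt>marhaban</rt><rp>)</rp></ruby>
  <!-- Second word, with a placeholder annotation much longer than the word. -->
  <ruby>بالعالم<rp>(</rp><rt>bialealami-placeholder-long-annotation</rt><rp>)</rp></ruby>
</p>
```

Because each <rt> belongs to exactly one <ruby>, a long annotation should only widen the space around its own word rather than drift over its neighbour.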
BDO
It's tempting to think that simply using the <bdo> element can help us here. It can't!
Using the bidirectional override will display characters RTL, rather than words.
HTML
<p dir="rtl"><ruby>مرحبا بالعالم<rt><bdo dir="rtl">marhaban bialealami</bdo></rt></ruby></p>
Becomes:
مرحبا بالعالم
I guess you could spell each word backwards. Which would be extremely annoying for everyone and a complete nightmare for screen readers!
Instead, it can be fixed if each word is then given an explicit LTR direction:
HTML
<p dir="rtl"><ruby>مرحبا بالعالم<rt> <bdo dir="rtl"> <span dir="ltr">marhaban</span> <span dir="ltr">bialealami</span> </bdo></rt></ruby></p>
مرحبا بالعالم
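Spread over several lines, the same markup is easier to follow. The comments give one reading of why it works, assuming the browser applies the default HTML bidi styles to <bdo> and to elements carrying a dir attribute.

```html
<p dir="rtl">
  <ruby>مرحبا بالعالم<rt>
    <!-- The override orders everything directly inside it right-to-left. -->
    <bdo dir="rtl">
      <!-- Each span is laid out internally left-to-right, so its letters
           are not reversed; the spans themselves are still placed RTL. -->
      <span dir="ltr">marhaban</span>
      <span dir="ltr">bialealami</span>
    </bdo>
  </rt></ruby>
</p>
```

The result should put "marhaban" on the right, above the first Arabic word, while the source keeps the pronunciations in reading order.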
Is that it?
So, I think those are the only ways to mix bidirectional text and its pronunciation. But I'd welcome any corrections and suggestions!
@edent says:
If I wrap the whole sentence in ruby, the text above goes the wrong way.
Displays as ŲŁŁŁŲ§Ł ŲŖŁ Ł ŁŁ ŁŲŖŁŁŲ¦Ų© ŲØŁŲ£ŁŁŁŁŁŁŁŁŁŲ³ŁŁ
Whereas the pronunciation should be c b a - so the pronunciation is above the right word. Does that make sense?
If what you really want is to have a sentence in Arabic in the normal order, with the words above, then you can mark the whole section RTL with U+202E/U+202C (RIGHT-TO-LEFT OVERRIDE, closed by POP DIRECTIONAL FORMATTING). The non-Arabic elements are then in a pretty confusing order, though, compounded by the fact that punctuation doesn't change direction (and angle brackets are visibly mirrored) but Latin letters do! But this is a more general problem: how do you handle mixed editing of HTML and content in right-to-left languages? I don't know how people normally deal with that.