tts – Terence Eden’s Blog

Better TTS on Linux

@edent — Tue, 21 Apr 2026 11:34:07 +0000

The venerable eSpeak is a mainstay of Linux distributions. It is a clever Text-To-Speech (TTS) program which will read aloud the written word using a phenomenally wide variety of languages and accents.

The only problem is that it sounds robotic. It has the same vocal fidelity as a 1980s Speak 'n' Spell toy. Monotonous, clipped, and painful to listen to. For some people, this is a feature, not a bug. I have blind friends who are so used to eSpeak that they can crank it up to hundreds of words per minute and navigate through complex documents with ease.

For the rest of us, it is a steep and unpleasant learning curve.

There are lots of modern TTS programs using all sorts of advanced AI. Many of them are paywalled or require you to post your text to a webserver - with all the privacy and latency problems that causes. Some are restricted to high-powered GPUs or other expensive equipment.

Piper is different. It is local first, runs quickly on modest hardware, and is open source.

The easiest way to install it on Linux is to use Pied - a simple GUI which allows you to select languages, listen to accents, and then install them.

It will change your speech-dispatcher to use the new Piper voice. That means it is immediately available to your Linux DE's accessibility service and to apps like Firefox.

I now have a reassuring Scottish lady speaking out everything on my computer.

1KB JS Numbers Station

@edent — Sun, 20 Jul 2025 11:34:53 +0000

Code Golf is the art/science of creating wonderful little demos in an artificially constrained environment. This year the js1024 competition was looking for entries with the theme of "Creepy".

I am not a serious bit-twiddler. I can't create JS shaders which produce intricate 3D worlds in a scrap of code. But I can use slightly obscure JavaScript APIs!

There's something deliciously creepy about Numbers Stations - the weird radio frequencies which broadcast seemingly random numbers and words. Are they spies communicating? Commands for nuclear missiles? Long range radio propagation tests? Who knows!

So I decided to build one. Play with the demo.

Obviously, even the most extreme Opus compression can't fit much audio into 1KB. Luckily, JavaScript has you covered! Most modern browsers have a built-in Text-To-Speech (TTS) API.

Here's the most basic example:

m = new SpeechSynthesisUtterance;
m.text = "Hello";
speechSynthesis.speak(m);

Run that JS and your computer will speak to you!
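Not every environment exposes the speech API, so a slightly more defensive version of that snippet might look like the sketch below. The `say` helper and its injectable parameters are hypothetical names for illustration, not part of the API, and far too verbose for a 1KB entry.

```javascript
// Hypothetical helper: speak `text` if the Web Speech API is available.
// `synth` and `Utterance` default to the browser globals; they are
// parameters only so the function can be exercised outside a browser.
function say(text,
             synth = globalThis.speechSynthesis,
             Utterance = globalThis.SpeechSynthesisUtterance) {
  if (!synth || !Utterance) {
    return false; // No TTS support here (for example, in Node.js)
  }
  const m = new Utterance(text); // The constructor accepts the text directly
  synth.speak(m);
  return true;
}
```

In a browser, `say("Hello")` should behave like the three-line version; elsewhere it quietly returns `false`.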

In order to make it creepy, I played about with the rate (how fast or slow it speaks) and the pitch (how high or low).

m.rate=Math.random();
m.pitch=Math.random()*2;
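One caveat: the documented range for `rate` is 0.1 to 10 and for `pitch` is 0 to 2, so a raw `Math.random()` can occasionally ask for a rate below the minimum. A minimal sketch of a clamped version, assuming you have bytes to spare (the `creepySettings` name is made up for this example):

```javascript
// Sketch: keep random values inside the documented ranges
// (rate 0.1 to 10, pitch 0 to 2). `rand` is injectable for testing.
function creepySettings(rand = Math.random) {
  const rate = Math.max(0.1, rand());    // slow drawl, never below 0.1
  const pitch = Math.min(2, rand() * 2); // anywhere from rumble to squeak
  return { rate, pitch };
}
```

The result can be copied straight onto the utterance with `Object.assign(m, creepySettings())`.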

It worked disturbingly well! High pitched drawls, rumbling gabbling, the languid cadence of a chattering friend. All rather creepy.

But what could I make it say? Getting it to read out numbers is pretty easy - this will generate a random integer:

s = Math.ceil( Math.random()*1000 );
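That line yields a whole number from 1 to 1000 (and, in the vanishingly rare case that `Math.random()` returns exactly 0, a 0). If that edge case bothers you, a sketch using `Math.floor` avoids it; `randomInt` is a hypothetical helper name.

```javascript
// Sketch: a whole number from 1 to `max` inclusive.
// `rand` is injectable so the function can be tested deterministically.
function randomInt(max, rand = Math.random) {
  return Math.floor(rand() * max) + 1;
}
```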

But a list of words would be tricky. There's not much space in 1,024 bytes for anything complex. The rules say I can't use any external resources; so are there any internal sources of words? Yes!

Object.getOwnPropertyNames( globalThis );

That gets all the properties of the global object which are available to the browser! Depending on your browser, that's over 1,000 words!

But there's a slight problem. Many of them are quite "computery" words like "ReferenceError", "URIError", "Float16Array". I wanted all the single words - that is, anything which only has one capital letter and that's at the start.

const l = (n) => {
    return ((n.match(/[A-Z]/g) || []).length === 1 && (n.charAt(0).match(/[A-Z]/g) || []).length === 1);
};

//   Get a random result from the filter
s = Object.getOwnPropertyNames( globalThis ).filter( l ).sort( ()=>.5-Math.random() )[0]

Rather pleasingly, that brings back creepy words like "Event", "Atomics", and "Geolocation".
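Outside the byte limit, the same filter can be written as a single regular expression, and the word picked by index rather than by a random sort. This is only a sketch of the idea; `isSingleWord`, `pick`, and `randomWord` are names invented for the example.

```javascript
// True for names with exactly one capital letter, at the start:
// "Event" passes; "URIError", "Float16Array", and "name" do not.
const isSingleWord = (name) => /^[A-Z][^A-Z]*$/.test(name);

// Pick one element uniformly at random. `rand` is injectable for testing.
function pick(list, rand = Math.random) {
  return list[Math.floor(rand() * list.length)];
}

// Choose a random single-word property name from the global object.
function randomWord(names = Object.getOwnPropertyNames(globalThis),
                    rand = Math.random) {
  return pick(names.filter(isSingleWord), rand);
}
```

Calling `randomWord()` in a browser returns something like "Event"; the exact pool depends on what the browser exposes.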

Of course, Numbers Stations don't just broadcast in English. The TTS system can vocalise in multiple languages.

//   Set the language to Russian
m.lang = "ru-RU";

OK, but where do we get all those language strings from? Again, they're built in and can be retrieved randomly.

var e = window.speechSynthesis.getVoices();
m.lang = e[ (Math.random()*e.length) |0 ].lang;
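Two things are worth knowing here: `getVoices()` returns voice objects (each with a `lang` property) rather than plain strings, and in some browsers the list is empty until the `voiceschanged` event has fired. A hedged sketch that handles both, with the hypothetical helper names `randomLang` and `onVoices`:

```javascript
// Sketch: choose a random language tag (such as "ru-RU") from a list of
// voices. Returns undefined for an empty list. `rand` is injectable.
function randomLang(voices, rand = Math.random) {
  if (voices.length === 0) return undefined;
  return voices[Math.floor(rand() * voices.length)].lang;
}

// Call `callback` with the voice list, waiting for `voiceschanged`
// if the browser has not populated the list yet.
function onVoices(synth, callback) {
  const voices = synth.getVoices();
  if (voices.length > 0) {
    callback(voices);
    return;
  }
  synth.addEventListener("voiceschanged",
    () => callback(synth.getVoices()), { once: true });
}
```

In a browser this would be wired up as `onVoices(speechSynthesis, (v) => { m.lang = randomLang(v); })`.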

If you pass the TTS the number 555 and ask it to speak German, it will read out fünfhundertfünfundfünfzig.

And, if you tell the TTS to speak an English word like "Worker" in a foreign language, it will pronounce it with an accent.

Randomly altering the pitch, speed, and voice to read out numbers and dissociated words produces, I think, a rather creepy effect.

If you want to test it out, you can press this button. I find that it works best in browsers with a good TTS engine - let me know how it sounds on your machine.

With the remaining few bytes at my disposal, I produced a quick-and-dirty random pattern using Unicode drawing blocks. It isn't very sophisticated, but it does have a little random animation to it.
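For a flavour of how little code such a pattern needs, here is a hypothetical sketch. It is not the code from the entry; it assumes the Unicode "Block Elements" range (U+2580 to U+259F), and `randomBlocks` is a made-up name.

```javascript
// Hypothetical sketch, not the competition entry's code: build one row of
// random characters from the Unicode "Block Elements" range.
function randomBlocks(width, rand = Math.random) {
  let row = "";
  for (let i = 0; i < width; i++) {
    // 32 characters in the range, starting at U+2580
    row += String.fromCodePoint(0x2580 + Math.floor(rand() * 32));
  }
  return row;
}
```

Redrawing a few rows on a timer is enough to give the impression of animation.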

You can play with all the js1024 entries - I would be delighted if you voted for mine.

Unicode Roman Numerals and Screen Readers

@edent — Wed, 15 Mar 2023 12:34:02 +0000

How would you read this sentence out aloud?

"In Hamlet, Act Ⅳ, Scene Ⅸ..."

Most people with a grasp of the interplay between English and Latin would say "In Hamlet, Act four, scene nine". And they'd be right! But screen-readers - computer programs which convert text into speech - often get this wrong.

Why? Well, because I didn't just type "Uppercase Letter i, Uppercase Letter v". Instead, I used the Unicode symbol for the Roman numeral 4 - Ⅳ. And, it turns out, lots of screen-readers have a problem with those characters.
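The difference is easy to see from code. This short sketch (with a hypothetical `codePoints` helper) prints the code points of each form, and shows that Unicode's compatibility normalisation, NFKC, flattens the numeral back into ordinary letters:

```javascript
const numeral = "Ⅳ";  // U+2163 ROMAN NUMERAL FOUR, a single character
const letters = "IV";  // U+0049 then U+0056, two ordinary Latin letters

// List the code points in a string as "U+XXXX" labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Compatibility normalisation (NFKC) maps the numeral to plain letters.
const flattened = numeral.normalize("NFKC");
```

Here `codePoints(numeral)` gives `["U+2163"]`, `codePoints(letters)` gives `["U+0049", "U+0056"]`, and `flattened === letters` is `true`.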

Don't Know Much About History

Unicode contains the range of Roman numbers from 1 - 10, plus a couple of compound numbers, 50, 100, 500, and 1000 - in a variety of forms.

Why does Unicode contain these numbers which, to most people, are just squashed together Latin letters? As ever with Unicode, it is a mix of legacy and practicality.

The Unicode standard says:

Roman Numerals. For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters. However, the uppercase and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded for compatibility with East Asian standards. Unlike sequences of Latin letters, these symbols remain upright in vertical layout. Additionally, in certain locales, compact date formats use Roman numerals for the month, but may expect the use of a single character.

Far be it from me to disagree with the learned authors of the spec, but I think they may have erred slightly on this one. While it may be preferable to re-use Latin letters, it leads to ambiguity which can be confusing for a screen-reader.

Practical Examples

Let's write out the numbers using regular letters. Suppose you were talking about "Romeo and Juliet, Act III, Scene I". Most screen readers will see the "III" and correctly speak aloud "Roman three" or similar. But when they get to the "I" it becomes ambiguous. Most will read out "Eye".

Screen-readers rarely look at the whole sentence for context. Which means they get confused. It's fairly obvious that XIV should be "fourteen" as there's no English word "xiv"⁰. But what about "MIX" - is that 1009 or the word "mix"?

Anyone who has watched the BBC knows about their fondness for displaying in Latin the year a programme was made. MCMXCVI is particularly challenging for a screen-reader!
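To make the ambiguity concrete, here is a minimal sketch of a Roman numeral parser. It is an illustration rather than how any particular screen-reader works, and `romanToInt` is a made-up name. Note that it happily treats "MIX" as a number, because nothing in the letters themselves says otherwise.

```javascript
const ROMAN_VALUES = { I: 1, V: 5, X: 10, L: 50, C: 100, D: 500, M: 1000 };

// Sketch: convert a Roman numeral to an integer. Accepts Latin letters or
// the dedicated Unicode numerals (flattened via NFKC). Returns NaN for any
// other character; it does not reject malformed numerals such as "IIII".
function romanToInt(text) {
  const letters = text.normalize("NFKC").toUpperCase();
  let total = 0;
  for (let i = 0; i < letters.length; i++) {
    const value = ROMAN_VALUES[letters[i]];
    const next = ROMAN_VALUES[letters[i + 1]];
    if (value === undefined) return NaN;
    // A smaller value before a larger one is subtracted (IV = 4, CM = 900)
    total += next > value ? -value : value;
  }
  return total;
}
```

With this, `romanToInt("MCMXCVI")` is 1996, `romanToInt("XIV")` is 14, and `romanToInt("MIX")` is 1009.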

Testing It

I took the following sample sentence - using both letters and Roman numerals.

Text. In Hamlet, Act I, Scene XI the year is MCMXCVI and they are watching Rocky V.

Roman. In Hamlet, Act Ⅰ, Scene Ⅺ the year is ⅯⅭⅯⅩⅭⅥ and they are watching Rocky Ⅴ.

Here's how various services coped:

Amazon Polly

First, the good news. Amazon's Polly read the Roman numerals perfectly. It even pronounced ⅯⅭⅯⅩⅭⅥ as "nineteen ninety six".

🔊

💾 Download this audio file.

But it gets rather confused with the ambiguous English text.

Microsoft Edge Read Aloud

I tried with Microsoft Edge's Read Aloud TTS.

🔊

💾 Download this audio file.

It makes a bit of a hash of the English and just skips the Roman numerals.

Google Text To Speech

The same was also true with Google's TTS products.

🔊

💾 Download this audio file.

Espeak NG

The venerable Linux utility came out with this.

🔊

💾 Download this audio file.

It gets the "Capital i" incorrect, and reads the Roman numerals as their Unicode code points.

Jaws

My good friend Léonie Watson, who writes extensively about accessibility, was kind enough to record some other samples for me.

Here are Jaws' "Expressive":

🔊

💾 Download this audio file.

And Jaws' "Eloquence":

🔊

💾 Download this audio file.

NVDA

Léonie also provided a recording of NVDA Microsoft One Core:

🔊

💾 Download this audio file.

Narrator

And here's Narrator making a right mess of it.

🔊

💾 Download this audio file.

Others

If you know of any other screen-readers, or text-to-speech engines which can cope with this, please let me know!

Fixing it

On Linux, I raised a Pull Request to fix espeak-ng.

The rest of the services don't seem to have a way to easily report bugs to them. If you know a way to raise issues with these screen readers - please do so!

⁰ I'm sure there's some obscure Scrabble word, but we're talking everyday use here.

Blog To Speech

@edent — Sun, 02 Oct 2022 11:34:41 +0000

I've noticed an interesting trend on some of the blogs I follow. More of them - though by no means the majority - are including audio versions of the content. They usually look something like this: or The ones which have this are mostly using commercial Text-To-Speech (TTS) engines.…

Is it faster to read or to listen?

@edent — Sat, 29 Jan 2022 12:34:39 +0000

Fourteen years ago, I blogged about the future of voice. In the post, I asked these two questions - which I'd nicked from someone else:

Are you faster at speaking or typing?
Are you faster at reading or listening?

Lots of us now use Siri, Alexa, Bixby, and the like because it is quicker to speak than type. For long-form wordsmithing - it's still probably easier to type-and-edit than it is to speak-then-edit. And the way humans speak is markedly different from how they write.

But the bottleneck has always been that listening to speech is slower than reading text.

The average reading speed is around 238 words per minute. Obviously there are a lot of caveats around the age of the reader, the difficulty of the material, whether one is reading for leisure or work. But it will do as a comparator.

The average speaking speed is around 150 words per minute. Again, that depends on the age of the speaker, urgency of their talk, familiarity with the language, and so on.

Therefore it is faster to read academic papers rather than to listen to academic lectures. Case closed!

Except…

There's a fascinating new paper out - Learning in double time: The effect of lecture video speed on immediate and delayed comprehension.

Here's the quote I found most interesting - with emphasis added:

Collectively, the present experiments indicate that increased video speed (up to 2x) does not negatively impact learning outcomes and watching at faster speeds can be a more efficient use of study time.
Thus, as long as to-be-remembered information can be effectively perceived and encoded, learning outcomes may not be affected by playback speed.
However, previous work has indicated that speech comprehension begins to decline at around 275 words per minute (Foulke & Sticht, 1969; see also Goldhaber, 1970; Pastore & Ritzhaupt, 2015; Vemuri et al., 2004) and the videos in the current study exceeded this threshold when played at 2x speed.
Although the elevated speech rates at 2x speed may initially be less comprehensible to students, researchers have been able to train participants to understand speech at rates up to 475 WPM (Orr et al., 1965).
Therefore, with practice, higher rates of speech may not be completely incomprehensible and since 85% of students reported watching lecture videos at quicker than normal speeds (see Figure 3a), they may be better able to process the material as a result of experience.

I guess this shouldn't come as a surprise to me. I tend to watch my MSc lectures at 1.75x with subtitles - and have been doing the same with podcasts and tutorial videos for years. Looks like I am in the majority.

If the average person speaks at ~150 Words Per Minute, increasing playback speed to 1.5x gives a listening rate of ~225 WPM. That's about the same as reading speed.

Going to 475 WPM means listening at a little over 3x normal speed.

My mate Léonie Watson is blind and has written extensively about the use of text-to-speech technology. Because she listens to a synthetic voice, with predictable and consistent pronunciation, she's able to listen at about 520 WPM! That's 3.5x faster than the speech of a biological human.
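The arithmetic behind those figures is simple enough to sketch in a few lines. The constants are the averages quoted in this post; the function names are invented for the example.

```javascript
const SPEAKING_WPM = 150; // average speaking speed quoted in this post
const READING_WPM = 238;  // average reading speed quoted in this post

// Words per minute heard when speech is played back at `speed`x.
function listeningRate(speed, baseWpm = SPEAKING_WPM) {
  return baseWpm * speed;
}

// Playback speed needed to hear `targetWpm` words per minute.
function speedFor(targetWpm, baseWpm = SPEAKING_WPM) {
  return targetWpm / baseWpm;
}
```

So `listeningRate(1.5)` is 225 WPM, `speedFor(475)` is roughly 3.17x, and `speedFor(520)` is roughly 3.47x. Matching the average reading speed, `speedFor(READING_WPM)`, takes about 1.59x.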

I'm not suggesting that you can speed-listen your way through any complicated topic and retain perfect understanding of subject and nuance. But it is becoming clear that synchronous teaching has limitations when it comes to efficiently teaching people. There's no substitute for being able to stop an expert mid-lecture and say "sorry Prof, I don't get that - could you please help me understand?" But the reality is, most people never stick their hand up in class. So listening to lectures on playback - at double speed - is simply a better "user experience" for the student.

Learning, of course, isn't just listening to people drone on in front of a blackboard. The student still needs to do the exercises, write their essays, consolidate their knowledge, reflect on what they've learned, and so on.

But the ability to "speed" your way through a (well edited and professionally recorded) lecture is something to be welcomed. It gives students more time to spend on their studies with, apparently, no ill effects.

TTSF (Text To Shipping Forecast)

@edent — Wed, 13 Oct 2021 11:34:05 +0000

The BBC Shipping Forecast is one of those strange bits of national tradition which, somehow, bridges the gap between infrastructure and folklore.

You can listen to the latest forecast on the BBC - read by professional newscasters.

But what if we wanted a robot to read it? If our speaker is sick, bored, or too expensive - how would we automate the audio version of the Shipping Forecast?

The BBC publishes the general forecast - but it's important to note that this is not what is read out on air. Instead, they use this compressed version published by the Met Office.

The Met's version doesn't have an API - or any other way to get structured information out of it - but the HTML is relatively basic and easy to extract the data from.

Once done, it can be passed to a TTS (Text To Speech) service like Amazon Polly.

Here are the (quick and dirty) results:

Female

https://shkspr.mobi/blog/wp-content/uploads/2021/10/sf2.mp4

Male

https://shkspr.mobi/blog/wp-content/uploads/2021/10/sf1.mp4

Thoughts

I've previously experimented with Synthetic Poetry. Robots aren't great at reading out verse - they lack emphasis and emotion. But something like the Shipping Forecast is perfect for them. It requires a calm, even tone. No particular need for words or phrases to be stressed. Each syllable needs to be clearly and well enunciated. When dealing with life-and-death matters, there's no room for error.

Text to speech is - for some very specific use-cases - indistinguishable from organic speech. Although, amusingly, Amazon's system was unable to correctly pronounce "Utsire" - so a little manual intervention was needed on that!
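One way to do that sort of intervention is SSML's `<sub>` element, which tells the engine to read an alias in place of the written word. The sketch below is an assumption about how it could be done, not a record of what was actually used, and the respelling "oot-seer-a" is a guess for illustration only. It assumes the words being swapped are plain letters.

```javascript
// Escape characters that are special in XML so plain text is safe in SSML.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;")
             .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Sketch: wrap text in <speak>, swapping each listed word for an SSML
// <sub alias="..."> so the engine reads the alias instead.
function toSsml(text, aliases) {
  let body = escapeXml(text);
  for (const [word, alias] of Object.entries(aliases)) {
    const tag = `<sub alias="${escapeXml(alias)}">${word}</sub>`;
    body = body.split(word).join(tag); // replace every occurrence
  }
  return `<speak>${body}</speak>`;
}
```

For example, `toSsml("North Utsire", { Utsire: "oot-seer-a" })` produces `<speak>North <sub alias="oot-seer-a">Utsire</sub></speak>`.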

Synthetic Poetry

@edent — Wed, 21 Jul 2021 11:48:51 +0000

I've been experimenting with Amazon's Polly service. It's their fancy text-to-sort-of-human-style-speech system. Think "Alexa" but with a variety of voices, genders, and accents.

Here's "Brian" - their English, male, received pronunciation voice - reading John Betjeman's poem "Slough":

https://shkspr.mobi/blog/wp-content/uploads/2021/07/slough.mp4

The pronunciation of all the words is incredibly lifelike. If you heard it on the radio, it might sound like a half-familiar BBC presenter. It has a calm, even tone which suits the poem splendidly.

The rhythm is also spot on. That's mostly a function of the short lines and helpful punctuation the poem contains. Much like iambic pentameter, or a limerick, the syllables lend themselves to a specific and identifiable cadence.

But the emphasis is all wrong. The poem just... ends. There's no sense of finality in the tone. You'd expect a competent reader to recognise "tinned minds" as being worthy of stressing. Polly does have some capability to mark specific words for emphasis, but it's all very manual.

There's no synthetic emotion. Do you feel the rage, desperation, sadness, hopelessness of the poem? While Polly has some SSML (Speech Synthesis Markup Language) support - the range of emotions it can express is severely limited. And, again, must be applied manually.
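To show just how manual that is, here is a minimal sketch which wraps a chosen phrase in SSML's `<emphasis>` element. The `emphasise` helper is hypothetical, and whether a given voice honours the tag is something to check against the engine's documentation.

```javascript
// Sketch: mark the first occurrence of `phrase` for emphasis.
// `level` may be "strong", "moderate", or "reduced" in SSML.
function emphasise(line, phrase, level = "strong") {
  // A function replacement avoids special "$" patterns in the phrase
  return line.replace(phrase,
    () => `<emphasis level="${level}">${phrase}</emphasis>`);
}
```

Every phrase has to be picked out by a human first: `emphasise(line, "tinned minds")` only helps once someone has decided that "tinned minds" is the bit which matters.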

"I used to be an adventurer like you, but then i took an arrow in the knee!"

One of the reasons stock phrases pop up so often in video games is that it is expensive to write and record thousands of different lines of dialogue.

We're almost at a stage where a computer can procedurally generate lines for background characters to speak, and then "record" an audio version in an array of styles. No more expensive voice actors, no more memetic references for in-group homophily. Each player of a game will have a completely different dialogue experience.

But the bit that we're still missing is the automation of emphasis and emotion and comic timing and understatement and... all the things which trained actors spend years learning how to do successfully.

The film critic Roger Ebert lost his voice after surgery. In 2011, he proposed the following "Ebert Test" for synthetic voices:

If the computer can successfully tell a joke, and do the timing and delivery, as well as Henny Youngman, then that’s the voice I want.

We're so close, I can taste it. The Turing Test for realistic voices is whether they can move the audience to tears with poetry.