I've been experimenting with Amazon's Polly service. It's their fancy text-to-sort-of-human-style-speech system. Think "Alexa" but with a variety of voices, genders, and accents.
Here's "Brian" - their English, male, received pronunciation voice - reading John Betjeman's poem "Slough":
The pronunciation of all the words is incredibly lifelike. If you heard it on the radio, it might sound like a half-familiar BBC presenter. It has a calm, even tone which suits the poem splendidly.
The rhythm is also spot on. That's mostly a function of the short lines and helpful punctuation the poem contains. Much like iambic pentameter, or a limerick, the syllables lend themselves to a specific and identifiable cadence.
But the emphasis is all wrong. The poem just... ends. There's no sense of finality in the tone. You'd expect a competent reader to recognise "tinned minds" as being worthy of stressing. Polly does have some capability to mark specific words for emphasis, but it's all very manual.
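To show just how manual that marking is, here's a minimal sketch in Python. It wraps a chosen phrase in SSML's `<emphasis>` tag, which Polly accepts for its standard voices. The helper names `emphasise` and `synthesise` are my own invention, not part of any Polly API. The `synthesise` function assumes you have `boto3` installed and AWS credentials configured, so it is defined but never called here.

```python
from xml.sax.saxutils import escape


def emphasise(text: str, phrase: str, level: str = "strong") -> str:
    """Wrap the first occurrence of `phrase` in an SSML <emphasis> tag.

    Hypothetical helper: the human still has to decide which phrase matters.
    """
    if level not in {"strong", "moderate", "reduced"}:
        raise ValueError(f"unsupported emphasis level: {level}")
    before, found, after = text.partition(phrase)
    if not found:
        raise ValueError(f"{phrase!r} does not appear in the text")
    return (
        "<speak>"
        + escape(before)
        + f'<emphasis level="{level}">'
        + escape(found)
        + "</emphasis>"
        + escape(after)
        + "</speak>"
    )


def synthesise(ssml: str, voice: str = "Brian", path: str = "out.mp3") -> None:
    """Send SSML to Polly and save the audio (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml, TextType="ssml", VoiceId=voice, OutputFormat="mp3"
    )
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


# Stress "Tinned minds" in one line of the poem.
ssml = emphasise("Tinned minds, tinned breath.", "Tinned minds")
print(ssml)
```

Nothing in there knows *why* "tinned minds" deserves the stress. A person has to pick the phrase, every time.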
There's no synthetic emotion. Do you feel the rage, desperation, sadness, hopelessness of the poem? While Polly has some SSML (Speech Synthesis Markup Language) support, the range of emotions it can express is severely limited. And, again, it must be applied manually.
"I used to be an adventurer like you, but then I took an arrow in the knee!"
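The nearest thing to "emotion" I can see in SSML is faking it with speed, pitch, and volume. Here's a rough sketch of that idea in Python. The `MOODS` presets and the `with_mood` helper are entirely my own assumptions, not anything Polly provides. It only builds the markup, using the standard `<prosody>` and `<break>` tags, with a pause before the last line as a crude stand-in for finality.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical presets: my guesses at prosody settings, not an official mapping.
MOODS = {
    "weary": {"rate": "slow", "pitch": "-10%", "volume": "soft"},
    "angry": {"rate": "fast", "pitch": "+5%", "volume": "loud"},
}


def with_mood(lines: list[str], mood: str, final_pause_ms: int = 800) -> str:
    """Wrap lines in a <prosody> tag and pause before the final line."""
    if not lines:
        raise ValueError("need at least one line")
    attrs = " ".join(f"{name}={quoteattr(value)}" for name, value in MOODS[mood].items())
    *body, last = lines
    parts = [escape(line) for line in body]
    parts.append(f'<break time="{final_pause_ms}ms"/>')  # hand-placed dramatic pause
    parts.append(escape(last))
    return f"<speak><prosody {attrs}>" + " ".join(parts) + "</prosody></speak>"


weary = with_mood(["The first line.", "The last line."], "weary")
print(weary)
```

That's the whole problem in miniature: one blanket setting for the entire reading, chosen by hand. An actor changes all three, word by word, without thinking about it.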
One of the reasons stock phrases pop up so often in video games is that it is expensive to write and record thousands of different lines of dialogue.
We're almost at a stage where a computer can procedurally generate lines for background characters to speak, and then "record" an audio version in an array of styles. No more expensive voice actors, no more memetic references for in-group homophily. Each player of a game will have a completely different dialogue experience.
But the bit that we're still missing is the automation of emphasis and emotion and comic timing and understatement and... all the things which trained actors spend years learning how to do successfully.
The film critic Roger Ebert lost his voice after cancer surgery. In 2011, he proposed the following "Ebert Test" for synthetic voices:
If the computer can successfully tell a joke, and do the timing and delivery, as well as Henny Youngman, then that’s the voice I want.
We're so close, I can taste it. The Turing Test for realistic voices is whether they can move the audience to tears with poetry.