The pronunciation of words by computers has gotten a lot better — at least in the movies. One of the latest futurist films — Ex Machina — has actress Alicia Vikander as the voice of the humanoid robot Ava.
Meanwhile, in the real world, computer voices such as those used for Siri, Cortana and Google Now don’t seem to be able to say some words correctly.
At first, I wondered if it had something to do with the fact that a lot of research on computer voicing is done in England, where they pronounce common words like “schedule” very differently.
But Andrew Breen, the research director for text to speech at Nuance Communications, says the location is not relevant. Nuance helped create Siri for Apple.
Breen says the computer uses dictionaries to determine the correct way to say a word in a particular dialect or accent.
“We have a dictionary, and that’s the first port of call,” he says. But text-to-speech systems don’t play back recordings of whole words; they assemble words from smaller recorded sound units, and mispronunciations can ensue. We reached out to our listeners for some of their favorite computer-voice flubs:
Susan Bennett, the original voice of Siri, spent four hours a day for five weeks laying down voice tracks. “Ninety percent of the phrases I recorded were nonsensical, created solely to get the sound combinations in the language,” she says. “So I had to read things like: ‘Say the shroding again. Say the shreeding again. Say the shrading again.’ ”
Breen says the computer draws from these sounds to assemble words based on pronunciations in a dictionary of American speech.
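The dictionary-first lookup Breen describes, with a spelling-based fallback when a word is missing, can be sketched roughly like this. Everything here is an illustrative assumption: the tiny lexicon, the letter-to-sound rules and the function name are invented for the example, not taken from Nuance's actual system.

```python
# Hypothetical sketch: dictionary-first pronunciation with a crude
# letter-to-sound fallback. Phoneme symbols are ARPAbet-style.

# Pronunciation dictionary: word -> list of phoneme symbols.
LEXICON = {
    "schedule": ["S", "K", "EH", "JH", "UW", "L"],   # American English entry
    "mobile":   ["M", "OW", "B", "AH", "L"],
}

# Naive spelling rules used only when the dictionary misses.
# This fallback path is where the flubs described above can creep in.
NAIVE_RULES = {"sh": ["SH"], "ch": ["CH"], "a": ["AE"], "e": ["EH"],
               "i": ["IH"], "o": ["AA"], "u": ["AH"]}

def phonemes_for(word):
    """First port of call: the dictionary. Otherwise, guess from spelling."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    phones, i = [], 0
    while i < len(word):
        two, one = word[i:i + 2], word[i]
        if two in NAIVE_RULES:           # try two-letter patterns first
            phones += NAIVE_RULES[two]
            i += 2
        elif one in NAIVE_RULES:
            phones += NAIVE_RULES[one]
            i += 1
        else:                            # unknown letter: pass it through
            phones.append(one.upper())
            i += 1
    return phones

print(phonemes_for("schedule"))   # dictionary hit
print(phonemes_for("shroding"))   # fallback guess, which may be wrong
```

In a real system, the phoneme string would then be matched against the inventory of sounds recorded by a voice talent like Bennett and stitched into audio.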
“You’ve got to go to that unique individual, that voice talent that’s given you their basic sound system,” he says. “And you’ve got to try and map this representation into something that you can speak back.”
And this all happens in a fraction of a second.
Unfortunately, the computer can’t always distinguish between words that have the same spelling but are pronounced differently depending on the meaning. For example, Mobile is a place and an auto is mobile.
“There is never really a hundred percent guarantee that when you’ve got a word that’s in common language use and is also a place name that you’re not going to choose the wrong one, unless you’ve got full context,” Breen says.
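The guessing game Breen describes, picking between the place name "Mobile" and the adjective "mobile," amounts to scoring the surrounding context. This is a minimal sketch under stated assumptions: the cue lists, the pronunciations and the scoring rule are invented for illustration, not any product's real logic.

```python
# Hypothetical homograph disambiguation: choose a pronunciation for
# "mobile" by counting context words that suggest the place name
# (moh-BEEL) versus the adjective (MOH-buhl).

PLACE_CUES = {"alabama", "city", "in", "near", "to"}           # suggests the place
ADJ_CUES   = {"phone", "home", "device", "app", "is", "very"}  # suggests the adjective

def pronounce_mobile(sentence):
    """Pick a pronunciation for 'mobile' by counting context cues."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    place_score = sum(w in PLACE_CUES for w in words)
    adj_score = sum(w in ADJ_CUES for w in words)
    # Without full context the tally can tie or mislead, which is
    # exactly the failure mode Breen describes.
    return "moh-BEEL" if place_score > adj_score else "MOH-buhl"

print(pronounce_mobile("I drove to Mobile, Alabama."))   # place reading
print(pronounce_mobile("My mobile phone is very old."))  # adjective reading
```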
Bennett says that when she was recording for Nuance, they did record some place names. But there was a lot of guesswork.
“We’d say well, this street is in New Mexico, it has a Spanish name so I bet they pronounce it correctly the Spanish way,” she says. “One day I said, ‘Well, why are we guessing? Don’t you guys have interns or someone that can look this information up and get it right?’ ”
She says that never happened.
Breen says pronunciations will improve as devices have better connections to the Internet where they can retrieve information more quickly.
But, he says, “Even with that context you have to take into account the situation where for whatever reason — maybe the device is in a tunnel or maybe it’s in a room where it just can’t get online — it can’t just say no.”
And, in case you were wondering, Bennett does have an iPhone — but she never uses Siri. “It’s difficult to hear my voice saying certain things that I would never in a million years say.”
Besides, Bennett says, she talks to herself enough already.
Copyright 2016 NPR. To see more, visit NPR.