Talk Acme to Me: How Alexa and Siri Learn to Understand the Philly Accent

Why your new device may understand you better than your cousin from Connecticut does.

Illustration by GLUEKIT

After a couple of weeks of our casually getting to know each other, my Amazon Echo is really starting to understand me. She just learned a new word: “wudder.” Now Alexa can tell me what year Waterworld came out and how long the spaghetti pot will take to boil. (That’s not to say she gets me completely — at least once a week she plays “the news” when I ask her to “snooze.”) Alexa, Siri, Cortana, your TV, even your car — in recent years, there’s been a remarkable rise in technology you can talk to, with more on the way. And considering that our local brand of English is so, um, distinctive, conversing with these systems can be a challenge. Philly English (perhaps even more so than Boston or Chicago English) is full of complex rules that are picked up on the schoolyard rather than in textbooks, which is why “mad” doesn’t rhyme with “sad” here and why so many movies feature Philadelphians who sound more like New Yorkers. To make our extraordinary accent work with this new genre of talky tech, researchers have had to create complex AI technology that can understand people better than humans can.

“The quote-unquote ‘dirty secret’ of speech recognition is, there’s a wide variety of accuracies,” says Marsal Gavaldà, director of engineering, machine intelligence at Yik Yak, an Atlanta-based social media app. “For some users, you get to almost 100 percent accuracy, whereas for other users it’s almost unusable.” That’s because speech recognition is powered by neural networks — computer systems modeled on the way the human brain processes information. To help these networks understand, researchers “teach” them much the way we learn: by presenting them with thousands of hours of speech. Only these networks are also fed detailed transcripts of that speech, so the sounds can be matched to the correct words.
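To make that training idea concrete, here’s a minimal sketch (not any vendor’s actual system) of how a small neural network might be shown audio alongside its transcript so that sounds get tied to the right words. It assumes the PyTorch library, and the “recordings” below are just random numbers standing in for real speech.

```python
# A toy illustration of training on paired speech and transcripts.
# Assumes PyTorch; the data below is random noise standing in for real audio.
import torch
import torch.nn as nn

VOCAB = 30        # characters the model can output (letters, space, a "blank", etc.)
FEATURE_DIM = 40  # acoustic features per time step (e.g., mel filterbank energies)

class TinyRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEATURE_DIM, 128, batch_first=True)
        self.out = nn.Linear(128, VOCAB)

    def forward(self, feats):                    # feats: (batch, time, FEATURE_DIM)
        hidden, _ = self.rnn(feats)
        return self.out(hidden).log_softmax(-1)  # per-frame character scores

model = TinyRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ctc = nn.CTCLoss(blank=0)  # a standard loss that lines audio frames up with transcript characters

# Eight fake clips of 100 frames each, paired with 20-character "transcripts".
feats = torch.randn(8, 100, FEATURE_DIM)
transcripts = torch.randint(1, VOCAB, (8, 20))
feat_lengths = torch.full((8,), 100, dtype=torch.long)
text_lengths = torch.full((8,), 20, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)  # the loss expects (time, batch, vocab)
loss = ctc(log_probs, transcripts, feat_lengths, text_lengths)
loss.backward()        # nudge the weights so the sounds better match the transcript
optimizer.step()
print(f"training loss: {loss.item():.2f}")
```

In a real system, the random tensors would be replaced by those thousands of hours of recorded, transcribed speech, and the loop would run over them many times.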

While this allows networks to “learn,” a device’s ability to recognize speech is limited to the examples it’s been given, which is why colloquial phrases require special attention. Some speech-recognition software can quickly adapt to the quirks of, say, a 215 twang: after hearing a short sample of speech, it identifies the subgroup the speaker belongs to, so a Philadelphian might be routed to a mid-Atlantic accent model. In addition, by working with linguists to identify local vocabulary, engineers can program the system so it knows you mean “sprinkles” when you say “jimmies.” (Oddly enough, Philadelphians might have a leg up here: early speech-recognition software was informed by samples that included interviews conducted at Penn.)
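For a flavor of those two tricks, routing a speaker to a regional model and translating local vocabulary, here is a simplified Python sketch. The marker words, model names and lexicon are invented for illustration; a real recognizer would score acoustic features rather than spellings.

```python
# A simplified sketch of accent adaptation: pick a regional model from a short
# sample, then map local words onto the standard terms the rest of the system
# expects. Everything here (marker words, model names) is illustrative only.
ACCENT_MODELS = {
    "mid_atlantic": "model tuned on Philly/Baltimore-area speech",
    "general_american": "default model",
}

LOCAL_LEXICON = {"wudder": "water", "jimmies": "sprinkles"}  # Philly-isms -> standard terms

PHILLY_MARKERS = {"wudder", "jawn", "youse"}  # cues that suggest a 215 speaker

def pick_accent_model(sample_transcript: str) -> str:
    """Route the speaker to a subgroup model based on a short speech sample.
    (A real system would use acoustic features, not spellings.)"""
    words = set(sample_transcript.lower().split())
    return "mid_atlantic" if words & PHILLY_MARKERS else "general_american"

def normalize(transcript: str) -> str:
    """Rewrite recognized local vocabulary so downstream commands still work."""
    return " ".join(LOCAL_LEXICON.get(word.lower(), word) for word in transcript.split())

print(pick_accent_model("how long will the wudder take to boil"))  # -> mid_atlantic
print(normalize("put jimmies on my ice cream"))  # -> put sprinkles on my ice cream
```

In practice both the routing and the lexicon would be learned from data rather than hard-coded, but the division of labor is the same.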

Some scientists, however, are quick to point out that the stakes of continually improving recognition of regional nuances go beyond just giving Philadelphians directions to the Acme. Samples, words and accents must be collected in large swaths lest the technology leave out entire age and socioeconomic groups. “Sometimes it’s said that linguistic discrimination is sort of the last acceptable form of bigotry,” says Meredith Tamminga, director of the Language Variation and Cognition Lab at Penn. “If we set up something as standard, we say, ‘This is the good way of speaking, because it’s spoken by the good kind of people.’ I’m speaking in broad strokes, but essentially, we assign social prestige to a variety based on the social prestige of the variety’s speakers.”

Developers are working on avoiding that by requesting samples from people with specific accents — and, in the case of Google, traveling to Scotland to get samples of brogue. Thanks to this wide-ranging but highly specific research, technology could understand the variety of language even better than people do. May we suggest those developers visit Mayfair next?

Published as “Talk Acme to Me” in the May 2017 issue of Philadelphia magazine.