Introducing Amazon Nova Sonic: A New Gen AI Model for Building Voice Applications and Agents
“From the invention of the world’s best personal AI assistant with Alexa, to developing AWS services like Connect, Lex, and Polly that are used across a wide range of industries,
Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models, such as speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio. This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.
Nova Sonic solves these challenges through a unified model architecture that delivers speech understanding and generation, without requiring a separate model for each of these steps. This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input, resulting in more natural dialog. Nova Sonic even understands the nuances of human conversation, including the speaker’s natural pauses and hesitations, waiting to speak until the appropriate time, and gracefully handling barge-ins. It also generates a text transcript for the user’s speech, enabling developers to use that text to call specific tools and APIs for building voice-enabled AI agents (e.g., an AI-powered travel agent that can book flights by retrieving up-to-date flight information). These capabilities, along with its lightning-fast inference, make voice applications powered by Nova Sonic more natural and useful.
State-of-the-art accuracy and quality
Nova Sonic has been rigorously tested against a wide range of industry standard benchmarks for speech understanding and generation, demonstrating exceptional quality and accuracy for human-like, real-time voice conversations.
The model excels in natural dialog handling, seamlessly understanding and adapting to pauses, hesitations, and interruptions while maintaining conversational context throughout the interaction. This capability contributed to strong performance for overall quality and accuracy in turn-taking tests.
Nova Sonic demonstrates strong performance on overall conversation quality compared to other models in the industry, which at this time include a select few with similar real-time conversational speech capabilities, such as
Since recognizing spoken words is critical in generating accurate responses, measuring Nova Sonic's speech recognition accuracy in terms of word error rate (WER) across a wide range of languages, dialects, and accents is also critical. On the Multilingual LibriSpeech, Nova Sonic achieved a WER of 4.2%, which is 36.4% relative lower than
On English utterances of the Multilingual LibriSpeech (MLS) dataset, Nova Sonic's WER is 24.2% lower in relative terms than that of OpenAI’s GPT-4o Transcribe model.
Nova Sonic is also robust to noisy conditions, with a WER for English that is 46.7% lower in relative terms than that of OpenAI’s GPT-4o Transcribe model, measured on the Augmented Multi-Party Interaction (AMI) meeting benchmark, which consists of real-world noisy, multi-speaker interactions.
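For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a "relative lower" comparison are commonly computed. It is not the evaluation code used for the figures above: the function names and the sample sentences are illustrative assumptions, and real benchmarks normally apply text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


def implied_baseline(candidate: float, reduction: float) -> float:
    """Baseline value implied by a candidate value and a relative reduction."""
    return candidate / (1.0 - reduction)


if __name__ == "__main__":
    # Illustrative sentences only: one deleted word and one substituted word.
    print(word_error_rate("book a flight to boston", "book flight to austin"))  # 0.4
    # A 4.2% WER that is 36.4% relatively lower implies a baseline near 6.6%.
    print(round(implied_baseline(4.2, 0.364), 1))
```

Read this way, a relative figure describes the gap as a share of the comparison model's WER, not as a difference in percentage points.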
Tool-use for function calling and agentic workflows
Nova Sonic also supports tool-use for applications—like customer service call automation—that require the responses to be factually grounded in enterprise data, such as pricing plans, available inventory, and schedule availability. Nova Sonic’s native tool-use also enables the model to resolve complex customer queries and complete tasks on behalf of customers, for example, “make a reservation” or “find alternate flights.”
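As a rough illustration of the application side of tool-use, the sketch below shows one way a developer might route a model's tool request to business logic. Everything in it is a hypothetical assumption for illustration: the request shape (`name`, `arguments`), the `TOOLS` registry, the `find_alternate_flights` function, and the sample flight data are not part of the Nova Sonic or Amazon Bedrock APIs.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("find_alternate_flights")
def find_alternate_flights(origin: str, destination: str) -> List[str]:
    # Stand-in for a lookup against enterprise data; the values are made up.
    sample_inventory = {("SEA", "BOS"): ["HY101", "HY205"]}
    return sample_inventory.get((origin, destination), [])


def dispatch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Run the requested tool and wrap the outcome for return to the model."""
    name = request.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return {"status": "error", "message": f"unknown tool: {name}"}
    try:
        result = fn(**request.get("arguments", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return {"status": "error", "message": str(exc)}
    return {"status": "ok", "result": result}


if __name__ == "__main__":
    print(dispatch({
        "name": "find_alternate_flights",
        "arguments": {"origin": "SEA", "destination": "BOS"},
    }))
```

Returning a structured error instead of raising lets the application hand the failure back to the model, so the spoken response can stay grounded in what the tool actually returned.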
Multiple native voices and speaking styles
Nova Sonic supports three expressive voices, including both masculine-sounding and feminine-sounding voices now generally available in English, and supports speech generation in different English accents including American and British. Support for additional languages and accents will be coming soon.
Industry-leading speed and price performance
Nova Sonic delivers an average customer-perceived latency of 1.09 seconds from the time the customer is done talking to the time the system generates the first speech response. This compares with 1.18 seconds for OpenAI’s GPT-4o (Realtime) and 1.41 seconds for Google’s Gemini Flash 2.0 (available via Gemini’s experimental live API), per benchmarking by Artificial Analysis.
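To put the latency figures on a common scale, this short sketch computes how much lower 1.09 seconds is relative to each comparison value quoted above. The dictionary labels and the helper name are illustrative assumptions; only the three latency values come from the text.

```python
def percent_lower(baseline_s: float, candidate_s: float) -> float:
    """Percentage by which candidate_s is lower than baseline_s."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


NOVA_SONIC_S = 1.09
COMPARISONS_S = {
    "GPT-4o (Realtime)": 1.18,
    "Gemini Flash 2.0": 1.41,
}

if __name__ == "__main__":
    for label, latency_s in COMPARISONS_S.items():
        gap = percent_lower(latency_s, NOVA_SONIC_S)
        print(f"{label}: {latency_s - NOVA_SONIC_S:.2f} s slower, "
              f"Nova Sonic is {gap:.1f}% lower")
```

Under these figures the gaps work out to roughly 7.6% and 22.7%, respectively.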
Nova Sonic is the most cost-efficient model in the industry, when compared to models that have similar functionality of real-time speech conversations and have public pricing available. For example, it is nearly 80% less expensive than OpenAI’s GPT-4o (Realtime).
ASAPP empowers enterprise customers’ contact centers to deliver unmatched customer service through GenerativeAgent, a fully conversational generative AI voice agent. “At ASAPP, we are focused on using generative AI to deliver reliable, secure, and high-performing solutions for improving customer service in contact centers. We’ve been particularly impressed by
Education First (EF) is a leader in international education through its networks of schools and offices in over 50 countries. “Amazon Nova Sonic enables EF students to practice new vocabulary and refine their pronunciation in a dynamic learning environment, while the interactive nature of the model allows students to receive immediate feedback on their pronunciation attempts, contributing to a more efficient and effective learning process. The model is capable of accurately understanding non-native English speakers with a variety of accents. We were also impressed with the barge-in feature of Nova Sonic, where the model quickly reacts to interruptions,” said
Stats Perform is a sports data and AI technology provider, serving global media organizations, betting operators, and professional sports teams. “At Stats Perform, our goal is to empower the world’s top sports broadcasters, media, federations and teams with magic in the detail of our vast live and historical Opta sports dataset, to help them win audiences, customers and trophies. With the Opta AI Chat they can generate unique, accurate, and contextual responses, driven by live data insights with remarkable speed, in multiple formats and languages, to find a winning analytical or storytelling edge,” said
To get started with
To learn more, visit: About
View source version on businesswire.com: https://www.businesswire.com/news/home/20250408227167/en/
Media Hotline
Amazon-pr@amazon.com
www.amazon.com/pr