The architecture is based on the Whisper model, with some modifications to allow real-time streaming.
In addition to transcribing speech, the listening head outputs audio embeddings from the encoder that are fed back to the backbone.
The listening head is also responsible for predicting when the model should speak, acting as a form of smart VAD (Voice Activity Detection).
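To make the "smart VAD" idea concrete, here is a minimal sketch of how a downstream loop could consume the listening head's output. Everything here is an assumption for illustration: the `ListenFrame` structure, the `speak_prob` field, and the debouncing logic are hypothetical, not the model's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class ListenFrame:
    """Hypothetical per-chunk output of the listening head."""
    tokens: list = field(default_factory=list)  # newly transcribed tokens
    speak_prob: float = 0.0                     # estimate that it's our turn to speak

def should_speak(frames, threshold=0.8, patience=3):
    """Decide to speak only after `patience` consecutive frames above
    `threshold` -- simple debouncing on the head's speak probability,
    so a single spike during a pause does not interrupt the user."""
    streak = 0
    for f in frames:
        streak = streak + 1 if f.speak_prob >= threshold else 0
        if streak >= patience:
            return True
    return False
```

The debounce keeps the agent from barging in on momentary silences; tuning `threshold` and `patience` trades responsiveness against interruptions.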
The listening head works in a way surprisingly similar to how humans understand speech.
If you provide it with a new speaker with a specific accent and no prior context, there is a high chance it will not get the transcription right for the first couple of words.
However, as more of the transcription is collected, the model understands the speaker's accent and pronunciation better, and the transcription becomes very accurate.
Since the model handles long context well, the more context provided to the listening head, the better it gets at understanding the speaker.
Furthermore, because the audio embeddings are fed back to the backbone, the model can utilize them to detect mismatches between a transcribed word and the actual audio input. This allows the model to understand that a given word was misheard and potentially ask for clarification.
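One simple way to operationalize this mismatch detection, assuming per-word audio and text embeddings are available (an assumption; the actual mechanism is learned inside the backbone), is to flag words whose text embedding disagrees with the aligned audio embedding:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flag_misheard(words, text_embs, audio_embs, threshold=0.5):
    """Hypothetical heuristic: return words whose transcribed-text embedding
    is poorly aligned with the corresponding audio embedding, i.e. candidates
    the agent might want to ask for clarification about."""
    return [w for w, t, a in zip(words, text_embs, audio_embs)
            if cosine(t, a) < threshold]
```

In practice the model performs this comparison implicitly via attention over the fed-back embeddings; the explicit similarity check above is just a sketch of the intuition.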
Existing dedicated ASR models do not support prompting and do not have access to the conversation history, which makes it difficult to achieve high accuracy in real-world applications.
The input audio is translated into a token probability distribution, but because human speech is often ambiguous without extra context, this leads to transcription errors.
For example, a license plate "YUN 811" might get transcribed as "Why you in 811".
This issue is especially pronounced in voice agent applications, where the model might receive very short audio clips each turn, without any surrounding context.
Because the model is trained on mixed-modality data, it can handle text and audio input simultaneously.
We can leverage this with some prompt engineering to further improve transcription accuracy.
For example, if we expect the audio to contain a spoken license plate, we can prompt the model like this:
<text>Expect a license plate (3 letters, 3 numbers). For example ABC123.</text><|start|>
<text>Expect a person's name. It could also contain last name. For example John Doe.</text><|start|>
<text>Boost words: [Australian cities, food names, TV shows]</text><|start|>
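A thin helper for assembling such prompts could look like the sketch below. The `<text>` and `<|start|>` tokens come from the examples above; the function name and signature are my own illustration, not the model's actual API.

```python
def build_asr_prompt(hint: str, start_token: str = "<|start|>") -> str:
    """Wrap a textual hint in the prompt format shown above:
    a <text>...</text> block followed by the start-of-audio token."""
    return f"<text>{hint}</text>{start_token}"

prompt = build_asr_prompt(
    "Expect a license plate (3 letters, 3 numbers). For example ABC123."
)
```

The returned string would be prepended to the audio input for that turn, biasing decoding toward the expected vocabulary.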
Prompting with text is very useful when using the model as a standalone ASR, although it is usually unnecessary for voice agents, since the model has access to the whole conversation history and can pick up on relevant information without explicit prompting. For example, if the AI agent asks for specific information, it is essentially "self-prompting" to pay attention to that information in the subsequent audio input.
In the next post, I'll be sharing some of the experiments with the model trained on a conversational dataset.