MichiAI: A Low Latency, Full Duplex Speech LLM

Part 1: Speech Generation

Damian Krystkiewicz
Architecture
Figure 1: MichiAI architecture
  • Model size: 530M params
  • Latency (TTFA): ~75 ms
  • Architecture: Flow Matching
  • Key innovation: No coherence loss

I'd like to present MichiAI, a speech LLM designed for full-duplex interaction, allowing it to listen and speak simultaneously just like a human, while needing only a fraction of the compute for training and inference. In this post I'll explain the speech generation component of MichiAI.

I am presenting this as a proof-of-concept. The GPT backbone is small by design to validate the architecture quickly, without a massive cluster of H100s.
It's not like I have that lying around in my garage anyway :)

Requirements

  • Multimodal Input: It cannot be audio-only. The ability to prompt via text as well as audio is essential. Almost all existing LLM applications leverage RAG and text prompting, which is impossible with pure audio models.
  • Coherence: The audio modality cannot degrade the overall coherence of the model.
  • Low Time-to-First-Audio: Anything over 500ms breaks the flow of conversation. To feel natural, I targeted sub-100ms response times.
  • Compatibility with Existing Text LLMs: Pretraining text models is prohibitively expensive. Since many high-quality open-weight LLMs already exist, utilizing their pretrained text knowledge is a massive benefit.
  • End-to-End Processing: The model must "hear" the user and itself directly rather than relying on transcripts. Speech conveys more than just text; prosody, subtle timbre changes, and emotions provide valuable context for the model to interpret.
  • Full Duplex: It must handle interjections and backchanneling implicitly, learned directly from the dataset.
  • Voice Cloning: The model should be capable of cloning a voice and its specific speaking style from a short audio prompt.
  • Long Context: It must support a long context window, allowing for at least 15 minutes of speech.
  • Paralinguistics: It should support paralinguistic effects like breathing or laughing implicitly.

State of the Art

The Model Pipeline Approach

The standard approach uses three distinct models: ASR -> LLM -> TTS.

Although popular and easy to implement, this architecture suffers from several significant issues:

  1. Latency: The LLM requires the full input text to be transcribed before it can begin processing.
  2. Turn Detection: Detecting turns is problematic. Most solutions use a simple VAD (Voice Activity Detection) model with a 1-2 second delay. This is naive, fails to model natural conversational flow, and adds to the overall latency.
  3. Information Loss: By converting audio entirely to the text domain, you lose all of the paralinguistic information that the model could otherwise utilize.
  4. Short Context Windows: ASR models usually have very short context windows (e.g., 30 seconds for Whisper). This causes transcription quality to suffer in longer interactions.
    Example: If you introduce a specific name early on, the ASR model will forget it later, leading to transcription errors.
  5. TTS Limitations: Standard TTS models typically require at least a full sentence to model speech correctly. Streaming TTS models often produce lower-quality output because text and audio generation are disjointed. The TTS model might "anticipate" different text than what the LLM provides, leading to discrepancies. The common solution is to generate speech in small "chunks" and stitch the audio together. This unfortunately introduces artifacts and increases latency because the TTS lacks the context of previous generations. Some TTS models might require a lookahead text buffer, also increasing latency.

Current Speech-to-Speech Models and Their Problems

Examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B).

There are a few key issues with these existing architectures:

  1. Slow Decoding and High Inference Cost: These models generally rely on RVQ (Residual Vector Quantization) audio embeddings. For models like Moshi and Sesame CSM, the system must predict 32 codebook entries for every 80ms audio frame. To decode a single spoken word, the model might require as many as 400 forward passes (a back-of-the-envelope comparison follows this list). Reducing the number of codebook entries to increase speed drastically reduces audio quality.
  2. Coherence Degradation: Adding speech context into the LLM input inherently introduces degradation into the model if not handled carefully. Recent research (Cuervo & Marxer, 2024) indicates that audio models with this architecture scale up to three orders of magnitude more slowly than pure text models. This suggests that to achieve the same level of syntactic and semantic proficiency, speech models require roughly three orders of magnitude more compute or data than text-based models.
  3. Modality Fusion Issues: Despite being trained on speech input, existing models struggle to utilize the audio modality effectively. They often fail the "Am I whispering?" or "Am I screaming?" test, indicating they are processing the text content while ignoring the vocal delivery.
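
To put the decoding cost into perspective, here is a rough back-of-the-envelope comparison using the frame length and codebook count quoted above; the one-second word duration is an illustrative assumption.

    # Rough decoding-cost comparison, using the frame length and codebook counts quoted above.
    # The ~1 s word duration is an illustrative assumption.

    FRAME_MS = 80             # one audio frame covers 80 ms
    WORD_MS = 1000            # assume a long spoken word lasts about one second
    CODEBOOKS_RVQ = 32        # discrete codebook entries predicted per frame (RVQ-style models)
    CODEBOOKS_CONTINUOUS = 1  # continuous embedding: one prediction per frame

    frames_per_word = WORD_MS / FRAME_MS                        # 12.5 frames
    rvq_passes = frames_per_word * CODEBOOKS_RVQ                # ~400 forward passes per word
    continuous_passes = frames_per_word * CODEBOOKS_CONTINUOUS  # ~13 forward passes per word

    print(f"RVQ: ~{rvq_passes:.0f} passes/word, continuous: ~{continuous_passes:.0f} passes/word")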

My Model

I opted for continuous rather than quantized (RVQ) embeddings to represent audio frames. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially for every single frame, continuous embeddings allow the model to predict a complete acoustic representation in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.

I employed a Rectified Flow Matching head to predict these continuous embeddings.
This approach ensures the generation is not only high-quality and diverse but also computationally efficient.
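
As a concrete illustration, below is a minimal sketch of how a rectified-flow head over continuous audio embeddings can be trained and sampled. This is not MichiAI's actual code: the dimensions, network layout, and conditioning interface are assumptions made for the example.

    import torch
    import torch.nn as nn

    class FlowMatchingHead(nn.Module):
        """Minimal rectified-flow head: it predicts a velocity field that transports
        Gaussian noise into the target continuous audio embedding, conditioned on the
        backbone's hidden state. Sizes and layout are illustrative, not MichiAI's."""

        def __init__(self, audio_dim: int = 128, cond_dim: int = 768, hidden: int = 512):
            super().__init__()
            self.audio_dim = audio_dim
            self.net = nn.Sequential(
                nn.Linear(audio_dim + cond_dim + 1, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, audio_dim),
            )

        def velocity(self, x_t, t, cond):
            # The timestep t is appended as one extra scalar feature per frame.
            return self.net(torch.cat([x_t, cond, t], dim=-1))

        def loss(self, target, cond):
            noise = torch.randn_like(target)
            t = torch.rand(target.shape[:-1] + (1,), device=target.device)
            x_t = (1 - t) * noise + t * target        # straight-line path from noise to data
            v_target = target - noise                 # constant velocity along that path
            return ((self.velocity(x_t, t, cond) - v_target) ** 2).mean()

        @torch.no_grad()
        def sample(self, cond, steps: int = 8):
            # A few Euler steps suffice because rectified-flow trajectories are nearly straight.
            x = torch.randn(cond.shape[:-1] + (self.audio_dim,), device=cond.device)
            for i in range(steps):
                t = torch.full(cond.shape[:-1] + (1,), i / steps, device=cond.device)
                x = x + self.velocity(x, t, cond) / steps
            return x

The appeal of the rectified formulation is that the learned trajectories are close to straight lines, so a handful of cheap Euler steps through a small head is enough per frame.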

The audio embeddings are decoded into a raw audio waveform using a HiFi-GAN vocoder.
I designed the vocoder to be lightweight and causal to allow streaming, which is critical for speech language models. Causal models are notoriously harder to train than non-causal ones and take longer to converge, so the vocoder isn't quite at ground-truth parity yet; closing that gap is mainly a matter of additional training iterations.
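
The property that makes streaming possible is that every convolution in the vocoder only looks at past samples. The sketch below shows the standard left-padding trick for such a causal convolution; it is a generic illustration, not the actual vocoder code.

    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1-D convolution that only sees past samples, so audio can be generated
        chunk by chunk without waiting for future context (illustrative sketch)."""

        def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
            super().__init__()
            self.left_pad = (kernel_size - 1) * dilation   # pad on the left only
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

        def forward(self, x):                              # x: (batch, channels, time)
            return self.conv(F.pad(x, (self.left_pad, 0)))

In a streaming setup, the left padding can be replaced by a small cache holding the tail of the previous chunk, so consecutive chunks join without boundary artifacts.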

The 'Listen' head functions as a multi-modal encoder, mapping raw audio into continuous embeddings while simultaneously generating corresponding text tokens.

Predicted audio and text embeddings are looped back into the model. The system operates in full-duplex mode; at any time, the speak head and listen head can add more text embeddings and corresponding audio embeddings to the backbone.
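
Conceptually, one full-duplex step looks something like the sketch below. All the interfaces (listen_head, speak_head, backbone.append, backbone.next) are hypothetical names chosen to illustrate the data flow, not MichiAI's actual API.

    # One conceptual duplex step (hypothetical interfaces, for illustration only).
    # Every audio frame, the backbone consumes what it just heard and what it just
    # said, then decides whether to emit the next text token / audio frame or stay quiet.

    def duplex_step(backbone, listen_head, speak_head, vocoder, mic_frame, state):
        heard_audio, heard_text = listen_head(mic_frame)          # encode incoming speech
        state = backbone.append(heard_audio, heard_text, state)   # user stream enters the context

        next_text, cond = backbone.next(state)                    # may be a "keep silent" token
        own_audio = speak_head.sample(cond)                       # continuous audio embedding
        state = backbone.append(own_audio, next_text, state)      # the model hears itself too

        return vocoder(own_audio), state                          # waveform chunk to play back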

The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.

Solving the Degradation Problem

I addressed degradation issues through several strategies:

  1. To ensure beneficial modality fusion, I specifically shaped the audio embeddings and trained the model end-to-end, allowing the encoder to produce high-quality outputs that don't confuse the LLM backbone.
  2. Unlike other speech LLMs, user input isn't encoded solely as audio embeddings; the corresponding text tokens are added as well. This saves parameters and increases coherence, since the model doesn't have to decode audio embeddings into semantic meaning: text embeddings are information-dense compared to audio embeddings alone (a simplified interleaving sketch follows this list).
  3. High-Quality Labels: A precise match between text labels and spoken speech is crucial. I fine-tuned a Whisper model on a hand-transcribed dataset to ensure high transcription precision and formatting.
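
To make point 2 concrete, the sketch below shows one simple way transcript-token embeddings could be interleaved with the audio-frame embeddings they correspond to before entering the backbone. The uniform token-to-frame alignment and the assumption that both inputs are already projected to the backbone width are simplifications for illustration.

    import torch

    def build_backbone_inputs(text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        """Interleave transcript-token embeddings with the audio-frame embeddings that
        carry the same speech, so the backbone never has to recover semantics from
        audio alone. Uniform token-to-frame alignment is an illustrative simplification.

        text_emb:  (n_tokens, d_model)  embeddings of the transcript tokens
        audio_emb: (n_frames, d_model)  continuous audio embeddings, already projected
        """
        frames_per_token = max(1, audio_emb.shape[0] // max(1, text_emb.shape[0]))
        chunks = []
        for i, tok in enumerate(text_emb):
            chunks.append(tok.unsqueeze(0))                                  # the token itself
            chunks.append(audio_emb[i * frames_per_token:(i + 1) * frames_per_token])
        chunks.append(audio_emb[text_emb.shape[0] * frames_per_token:])      # leftover frames
        return torch.cat(chunks, dim=0)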

Voice Cloning

Voice cloning is "zero-shot": the model needs only a few seconds of audio to capture the vocal timbre with good precision. To capture the speaker's specific way of speaking and other personal quirks, a longer audio prompt is preferred.

Coherence Degradation

So far I'm not seeing any noticeable dips in reasoning or perplexity compared to the original text-only SmolLM.
Normally for speech LLMs of that size you wouldn't get any intelligible speech at all, let alone maintain the reasoning capabilities of the underlying LLM.
These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals like MMLU and HellaSwag to confirm this empirically.

Summary

Feature | Standard (RVQ/Quantized) | MichiAI (Continuous)
Forward passes per frame | 32 | 1
Audio fidelity | Bottlenecked by codebook size | High (floating-point precision)
Inference latency | High | Ultra-low
Modality fusion | Often disjointed | Utilizes both modalities fully
Coherence degradation | Significant | None observed so far

Pretraining Dataset

  • LibriTTS (200 hours)
  • LibriVox (4.8k hours)
  • SmolLM-corpus, Cosmopedia-v2

Dataset Preparation

I implemented the dataset generation pipeline using Luigi and transcribed the speech using a fine-tuned Whisper Large model.
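
To illustrate the shape of the pipeline, here is a hypothetical Luigi task for the transcription stage; the task names, output paths, and the transcribe() helper are placeholders rather than the actual pipeline code.

    import luigi

    class TranscribeShard(luigi.Task):
        """Hypothetical transcription stage: run the fine-tuned Whisper model over one
        audio shard and write a JSONL transcript next to it."""
        shard_path = luigi.Parameter()

        def requires(self):
            return ResampleShard(shard_path=self.shard_path)     # upstream task (assumed)

        def output(self):
            return luigi.LocalTarget(str(self.shard_path) + ".transcript.jsonl")

        def run(self):
            with self.output().open("w") as f:
                for segment in transcribe(self.input().path):    # fine-tuned Whisper (assumed)
                    f.write(segment.to_json() + "\n")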

To maintain the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. I do this not only to preserve text-only comprehension but also because I want the model to handle "mixed" prompting. For example, a prompt might be half pure text and half audio. Furthermore, I want the model to be capable of generating and accepting text-only input for downstream applications. This approach is also helpful for Function Calling or Chain of Thought (CoT) generation.

For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."
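
Below is a minimal sketch of how text-only and speech samples could be mixed into a single training stream; the 30% text ratio is a placeholder, not the mix actually used for MichiAI.

    import random

    def mixed_sample_stream(speech_samples, text_samples, text_ratio=0.3, seed=0):
        """Interleave paired speech samples with pure text samples so the backbone keeps
        seeing the text-only distribution it was pretrained on (illustrative sketch)."""
        rng = random.Random(seed)
        while True:
            if rng.random() < text_ratio:
                yield {"modality": "text", "sample": rng.choice(text_samples)}
            else:
                yield {"modality": "speech", "sample": rng.choice(speech_samples)}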

Model | Parameters | Audio Training Data
Hertz-dev | 8.5B | 20,000,000 hours
Qwen-Omni | 7B+ | 8,000,000+ hours
Moshi | 7B | 7,000,000 hours
MichiAI | 530M | 5,000 hours

Training

I trained most of the model on 1x RTX 4090. Some parts required more memory, so I used 2x RTX A6000s. The model wasn't trained to full convergence.

The vocoder and speech head were pretrained separately; the full model was then trained jointly. During this joint stage, the LLM backbone learned how to use the text and audio modalities together.

Only 5k hours? Other models use millions.

One of the most exciting findings was how well the model 'recycles' its pretrained knowledge. During joint training, MichiAI learns the mechanics of pronunciation and vocalization and then maps those patterns back onto the text-only knowledge. Since speech patterns are fundamentally less complex than the nuances of human logic, the model doesn't need millions of hours of audio to learn how to speak fluently.

Another benefit is that text-only datasets are much easier to obtain than high-quality speech datasets.
This matters when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are practically nonexistent.
Some final tuning on the target speaker is still beneficial for the best quality, but for most use cases a dataset of under one hour should suffice.

Samples

I am presenting results from the base model, which has not been fine-tuned for instruction following.

Both the prompt audio and the generated audio are sampled at 24 kHz and decoded with a single forward pass of the decoder.

Forced text prompt, then text and audio generation

For this sample, the model was initialized with a short text prompt.
It begins in causal TTS mode to process the prompt, then transitions into autonomous generation, producing both the subsequent text and its corresponding audio simultaneously.
The original text prompt is indicated in bold. All following content was generated by MichiAI.

Cats are fascinating creatures known for their independence, curiosity, and unique personalities.
One aspect of cat behavior that often sparks curiosity is their ability to climb trees.
While some cats may find it challenging, others might enjoy it.
This section will delve into the reasons behind a cat's desire to climb, its benefits, and how to encourage this behavior.
**The urge to climb**
Cats are naturally curious animals, which means they are always exploring their surroundings.
As they navigate through their surroundings, they may discover new things, such as

Audio prompt:

In this sample, the model was provided with an audio prompt, from which it predicted the continuation.
The end of the user's input is marked by a beep added post-generation.
Everything following that signal was generated autonomously by MichiAI, capturing the original speaker's style and intent.

I am here only to pay my respects as a messenger from Great Britain to the people of the United States of America for the purpose of discussing certain matters of great public importance with you and to express my deep and sincere sympathy with your country and its people in the present state of its affairs and in the circumstances of its future development and

Text prompt and audio continuation (RAG)

For this sample, the model was provided with a pure text knowledge base (RAG input), then prompted via text to respond.
The model successfully synthesized the external information to generate a factually grounded audio response, demonstrating its ability to bridge text-based retrieval with spoken output.

RAG input:
In the city of Oakhaven, the sky is always a bright neon purple. Because of the purple sky, all the grass grows in a shade of silver.

The color of the sky in Oakhaven is a shade of purple.

Homographs

The model correctly pronounces homographs based on context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs. "She lives in New York".

She lives with her mother in a small cottage in the heart of the

Cats have nine lives. But what about the rest of us? Let me tell you about the

Numbers, abbreviations, and units

The model learns to pronounce numbers, abbreviations, and units correctly based on context.
Input text normalization is not needed, since the model picks up the correct pronunciations directly from the dataset.

To make 18 buns, prepare 2 lbs of flour, 1 cup of sugar and 1 tbsp of honey.
Then
mix the flour and sugar together. Add the water and stir until the dough is smooth.
Roll it out and cut into strips. Place them on a baking sheet, and bake for 3 hours.
Follow these instructions to make a delicious and healthy breakfast.
Step 1: Gather Your Ingredients

Mr. and Mrs. Miller met with the doctor in the drawing room of the house and found that

Longform generation

This 5-minute sample demonstrates the model's stability over long durations. While smaller models often exhibit vocal drift or robotic artifacts in extended clips, MichiAI maintains a consistent vocal style and pacing from start to finish.


Elephants are one of the most majestic, powerful animals, known for their incredible strength, intelligence, and resilience. However, they can also be quite social, living in groups called herds. A group of elephants is called a herd, and it's the primary unit that makes up a herd. Imagine being part of a huge, intelligent family with your own leader, calves, and lots of friends! That's what life is like for an elephant. Let's explore the fascinating world of elephant herds and see how they interact with each other. Section 1: Understanding Elephant Groups** Just like human families, elephant herds consist of several females. These female elephants are often the leaders, protecting the group and making important decisions. Male elephants usually leave their families early in life, looking for new groups. When they do, they become leaders within the group, called bulls. But it's not just the females who do most of the work. Older, male, and female elephants help out. They lead the herd, guide the calves, and play a crucial role in maintaining the group's social structure. Section 2: Living Together** Life in an elephant herd revolves around routines, which are shared rules. For example, when it's time to eat, all the elephants gather in a large enclosure called a mud. This allows them to graze together, ensuring everyone gets enough food. As the herd grows, so does its size! New members join the group, learning essential skills and taking on more responsibilities. Another critical aspect of life in an elephant herd is caring for the calves. Baby elephants, called calves, are incredibly vulnerable at this stage. Adult females take turns caring for the babies, teaching them essential survival skills and protecting them from predators. Younger calves, called calves, are also cared for by older females, learning valuable lessons from their elders. Section 3: Communication and Interactions** Communication plays a significant role in elephant society. Each member of the herd uses unique sounds, body language, and even smells! For instance, elephants have a complex system of sounds, including trumpeting, rumbles, and clicks. They use these noises to warn others, greet one another, or express emotions. Body language is equally vital. Elephants communicate through subtle movements, such as head tilts, ear positions, and foot stomping. By observing these subtle cues, elephants can understand each other better, helping the herd function smoothly. Life in an elephant herd offers many opportunities for bonding, learning, and cooperation. Despite their size, elephants remain incredibly social beings, forming deep connections that last a lifetime. In conclusion, elephant herds are remarkable communities where age, gender, and age come together. Through cooperation, communication, and mutual support, elephants create a thriving ecosystem filled with wonder, curiosity, and endless adventures. Middle school students, next time, let's appreciate our incredible friends the elephants! Activity: Design Your Own Elephant Group** Imagine you're part of an elephant herd! Draw or describe your ideal elephant's community, including details about its members, roles, and interactions. Consider how they would live, learn, and interact. By understanding elephant societies, we gain insight into these magnificent creatures, allowing us to better protect and conserve them. Together, let's celebrate the beauty, power, and complexity of elephant life! 
Extension Activity: Create a Poster About Elephant Social Structures Design a poster highlighting key aspects of elephant social structure, emphasizing themes, activities, and interactions. Use colorful images, symbols, and illustrations to convey your message. For example, you could highlight: 1. Describe how elephants communicate, interact, and care for each other. 2. Identify different roles within an elephant herd, such as leaders, calves, and young ones. 3. Explain why certain species of elephants may be more social than others, such as elephants in the wild. By completing this project, you'll develop a deeper appreciation for the wonders of elephant social structures, inspiring curiosity and compassion towards these majestic animals. In conclusion, exploring elephant social structures provides valuable insights into the lives of these majestic animals. Middle school students, now you know all about elephant herds, their roles, and the importance of their complex societies. Continue expanding your knowledge, and perhaps someday, you'll become ambassadors for elephant conservation efforts, championing their protection and preservation! Happy studying! (Hug) to the wonderful world of elephants! User: Please refer to the attached file. Source: Animated Videos This video was made by a young adult, showing an elephant group's daily activities. Source: Nature: A Documentary About Animal Life This documentary was made by a young adult, showing pictures of various animals, including elephants, along with information. Source: A Nature's World This documentary was made by

Demo

(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.

Roadmap

  • Larger LLM backbone.
  • Train on a conversational dataset.
  • Launch a Hugging Face Space.
  • Integrate a multilingual dataset.

In the next post, I will focus on the listening component.

If you would like to contribute in any way or have questions, please contact me using the form here.

References

  1. Cuervo, S., & Marxer, R. (2024). Scaling Properties of Speech Language Models. arXiv preprint arXiv:2404.00685. https://arxiv.org/abs/2404.00685