MichiAI: A Low Latency, Full Duplex Speech LLM

Part 1: Speech Generation

Damian Krystkiewicz
February 4, 2026
Architecture

Figure 1: MichiAI architecture

  • Model Size: 530M params
  • Latency (TTFA): ~75ms
  • Architecture: Flow Matching
  • Key Innovation: No coherence loss

I'd like to present MichiAI, a speech LLM designed for full-duplex interaction, allowing it to listen and speak simultaneously just like a human, while needing only a fraction of compute for training and inference. In this post I'll explain the speech generation component of MichiAI.

I am presenting this as a proof-of-concept. The GPT backbone is small by design to validate the architecture quickly, without a massive cluster of H100s.
It's not like I have that lying around in my garage anyway :)

Requirements

  • Multimodal Input: It cannot be audio-only. The ability to prompt via text as well as audio is essential. Almost all existing LLM applications leverage RAG and text prompting, which is impossible with pure audio models.
  • Coherence: The audio modality cannot degrade the overall coherence of the model.
  • Low Time-to-First-Audio: Anything over 500ms breaks the flow of conversation. To feel natural, I targeted sub-100ms response times.
  • Compatibility with Existing Text LLMs: Pretraining text models is prohibitively expensive. Since many high-quality open-weight LLMs already exist, utilizing their pretrained text knowledge is a massive benefit.
  • End-to-End Processing: The model must "hear" the user and itself directly rather than relying on transcripts. Speech conveys more than just text; prosody, subtle timbre changes, and emotions provide valuable context for the model to interpret.
  • Full Duplex: It must handle interjections and backchanneling implicitly, learned directly from the dataset.
  • Voice Cloning: The model should be capable of cloning a voice and its specific speaking style from a short audio prompt.
  • Paralinguistics: It should support paralinguistic effects like breathing or laughing implicitly.

State of the Art

The Model Pipeline Approach

The standard approach uses three distinct models: ASR -> LLM -> TTS.

Although popular and easy to implement, this architecture suffers from several significant issues:

  1. Latency: The LLM requires the full input text to be transcribed before it can begin processing.
  2. Turn Detection: Detecting turns is problematic. Most solutions use a simple VAD (Voice Activity Detection) model with a 1-2 second delay. This is naive, fails to model natural conversational flow, and adds to the overall latency.
  3. Information Loss: By converting audio entirely to the text domain, you lose all of the paralinguistic information that the model could otherwise utilize.
  4. Short Context Windows: ASR models usually have very short context windows (e.g., 30 seconds for Whisper). This causes transcription quality to suffer in longer interactions.
    Example: If you introduce a specific name early on, the ASR model will forget it later, leading to transcription errors.
  5. TTS Limitations: Standard TTS models typically require at least a full sentence to model speech correctly. Streaming TTS models often produce lower-quality output because text and audio generation are disjointed. The TTS model might "anticipate" different text than what the LLM provides, leading to discrepancies. The common solution is to generate speech in small "chunks" and stitch the audio together. This unfortunately introduces artifacts and increases latency because the TTS lacks the context of previous generations. Some TTS models might require a lookahead text buffer, also increasing latency.
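To make the latency problem concrete, here is a back-of-the-envelope budget for a cascaded pipeline. All stage numbers are illustrative assumptions, not measurements of any particular system:

```python
# Rough latency budget for a cascaded ASR -> LLM -> TTS pipeline.
# Every figure below is an illustrative assumption.
pipeline_ms = {
    "vad_endpointing": 1000,    # VAD waits for silence before closing the turn
    "asr_final": 300,           # finalizing the transcript
    "llm_first_sentence": 500,  # LLM must emit enough text for TTS to start
    "tts_first_audio": 300,     # TTS synthesis of the first chunk
}

total = sum(pipeline_ms.values())
print(total)  # 2100 ms before the user hears anything
```

A full-duplex model skips the endpointing and text hand-offs entirely, so its time-to-first-audio collapses to a single model latency.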

Current Speech-to-Speech Models and Their Problems

Examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B).

There are a few key issues with these existing architectures:

  1. Slow Decoding and High Inference Cost: These models generally rely on RVQ (Residual Vector Quantization) audio embeddings. For models like Moshi and Sesame CSM, the system must predict 32 codebook entries for every 80ms audio frame. To decode a single spoken word, the model might require as many as 400 forward passes. Reducing the number of codebook entries to increase speed drastically reduces audio quality.
  2. Coherence Degradation: Adding speech context into the LLM input inherently introduces degradation into the model if not handled carefully. Recent research (Cuervo & Marxer, 2024) indicates that audio models with this architecture scale up to three orders of magnitude more slowly than pure text models. This suggests that to achieve the same level of syntactic and semantic proficiency, speech models require roughly three orders of magnitude more compute or data than text-based models.
  3. Modality Fusion Issues: Despite being trained on speech input, existing models struggle to utilize the audio modality effectively. They often fail the "Am I whispering?" or "Am I screaming?" test, indicating they are processing the text content while ignoring the vocal delivery.
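The decode-cost claim above can be sanity-checked with simple arithmetic, assuming 32 codebooks per 80 ms frame; the one-second word duration is my own assumption:

```python
# Back-of-the-envelope decode cost for RVQ-based speech models.
frame_ms = 80     # one audio frame per 80 ms (Moshi-style framing)
codebooks = 32    # RVQ codebook entries predicted per frame
word_ms = 1000    # assumed average duration of a spoken word

frames_per_word = word_ms / frame_ms        # 12.5 frames per word
rvq_passes = frames_per_word * codebooks    # sequential codebook predictions
continuous_passes = frames_per_word         # one pass per frame instead

print(int(rvq_passes))  # 400 -- matches the "as many as 400" figure above
```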

My Model

I opted for continuous rather than quantized (RVQ) embeddings to represent audio frames. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially, continuous embeddings allow the model to predict the same information in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.

I employed a Rectified Flow Matching head to predict these continuous embeddings.
This approach ensures the generation is not only high-quality and diverse but also computationally efficient.
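As a sketch of the rectified flow matching idea (not MichiAI's actual head), the network is trained to regress the constant velocity between a noise sample and a target embedding, and sampling is a short Euler integration. Here the "network" is replaced by the exact velocity so the example stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(size=(4,))  # target audio embedding (toy stand-in)
x0 = rng.normal(size=(4,))  # Gaussian noise sample

# Rectified flow training pair: for x_t = (1 - t) * x0 + t * x1,
# the regression target for the velocity network is v = x1 - x0.
t = 0.3
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

def sample(x_start, velocity_fn, steps=8):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1."""
    x = x_start.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# With the exact (constant) velocity, the straight-line flow recovers
# the target embedding exactly -- this is what makes rectified flows
# cheap to sample: few Euler steps suffice.
x_hat = sample(x0, lambda x, t: v_target)
print(np.allclose(x_hat, x1))  # True
```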

The audio embeddings are decoded into a raw audio waveform using a HiFi-GAN vocoder.
I designed the vocoder to be lightweight and causal to allow streaming, which is critical for speech language models. Causal models are notoriously harder to train than non-causal ones and take longer to converge, so the vocoder isn't quite at ground-truth parity yet; closing that gap is mainly a matter of training for more iterations.
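The defining property of a causal (streaming) vocoder is that each output sample depends only on past inputs. A minimal NumPy sketch of a causal 1-D convolution via left-padding (illustrative only, not the HiFi-GAN architecture):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution where y[t] depends only on x[:t+1] (left padding)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.convolve(padded, kernel, mode="valid")

kernel = np.array([0.5, 0.3, 0.2])
x = np.arange(10.0)
y = causal_conv1d(x, kernel)

# Causality check: changing future inputs must not change past outputs.
x_mod = x.copy()
x_mod[6:] = 99.0
y_mod = causal_conv1d(x_mod, kernel)
print(np.array_equal(y[:6], y_mod[:6]))  # True -- outputs up to t=5 unchanged
```

A non-causal (centered) convolution would fail this check, which is why it cannot emit audio frame-by-frame without lookahead delay.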

The 'Listen' head functions as a multi-modal encoder, mapping raw audio into continuous embeddings while simultaneously generating corresponding text tokens.

Predicted audio and text embeddings are looped back into the model. The system operates in full-duplex mode; at any time, the speak head and listen head can add more text embeddings and corresponding audio embeddings to the backbone.

The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.

Solving the Degradation Problem

I addressed degradation issues through several strategies:

  1. To ensure beneficial modality fusion, I specifically shaped the audio embeddings and trained the model end-to-end, allowing the encoder to produce high-quality outputs that don't confuse the LLM backbone.
  2. Unlike other speech LLMs, user input isn't encoded solely as audio embeddings; text tokens are added as well. This saves parameters and improves coherence, since the model doesn't have to decode audio embeddings into semantic meaning: text embeddings are information-dense compared to audio embeddings alone.
  3. High-Quality Labels: A precise match between text labels and spoken speech is crucial. I fine-tuned a Whisper model on a hand-transcribed dataset to ensure high transcription precision and formatting.
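Strategy 2 can be pictured as handing the backbone a fused stream rather than audio alone. A toy sketch, where the fusion op (a simple sum) and the dimensions are my assumptions, not MichiAI's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, d_model = 5, 8

# Hypothetical aligned per-frame inputs: continuous audio embeddings from
# the listen head, plus embeddings of the text tokens aligned to each frame.
audio_emb = rng.normal(size=(n_frames, d_model))
text_emb = rng.normal(size=(n_frames, d_model))

# Fusing both streams gives the backbone semantically dense text
# information alongside the acoustic signal, instead of forcing it to
# recover semantics from audio embeddings alone.
fused = audio_emb + text_emb
print(fused.shape)  # (5, 8)
```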

Voice Cloning

Voice cloning is "zero-shot". The model needs only a few seconds of audio to capture the vocal timbre with good precision. To capture the specific "way of speaking" and other personal quirks, longer input audio is preferred.

Coherence Degradation

So far I haven't seen any noticeable dips in reasoning or perplexity compared to the original text-only SmolLM.
Normally, speech LLMs of this size produce no intelligible speech at all, let alone maintain the reasoning capabilities of the underlying LLM.
These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals like MMLU and HellaSwag to confirm them empirically.

Summary

| Feature | Standard (RVQ/Quantized) | MichiAI (Continuous) |
| --- | --- | --- |
| Forward passes | 32 | 1 |
| Audio fidelity | Bottlenecked by codebook size | High (floating-point precision) |
| Inference latency | High | Ultra-low |
| Modality fusion | Often disjointed | Utilizes modalities fully |
| Coherence degradation | Significant | None |

Pretraining Dataset

  • LibriTTS (200 hours)
  • LibriVox (4.8k hours)
  • SmolLM-corpus, Cosmopedia-v2

Dataset Preparation

I implemented the dataset generation pipeline using Luigi and transcribed the speech using a fine-tuned Whisper Large model.

To maintain the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. I do this not only to preserve text-only comprehension but also because I want the model to handle "mixed" prompting. For example, a prompt might be half pure text and half audio. Furthermore, I want the model to be capable of generating and accepting text-only input for downstream applications. This approach is also helpful for Function Calling or Chain of Thought (CoT) generation.

For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."

| Model | Parameters | Audio Training Data |
| --- | --- | --- |
| Hertz-dev | 8.5B | 20,000,000 hours |
| Qwen-Omni | 7B+ | 8,000,000+ hours |
| Moshi | 7B | 7,000,000 hours |
| MichiAI | 530M | 5,000 hours |

Training

I trained most of the model on 1x RTX 4090. Some parts required more memory, so I used 2x RTX A6000s for these. The model wasn't trained to full convergence.

The vocoder and speech head were pretrained separately; the full model was then trained jointly. During this joint training, the LLM backbone learned to utilize the text and audio modalities together.

Only 5k hours? Other models use millions.

One of the most exciting findings was how well the model 'recycles' its pretrained knowledge. During joint training, MichiAI learns the mechanics of pronunciation and vocalization and then maps those patterns back onto the text-only knowledge. Since speech patterns are fundamentally less complex than the nuances of human logic, the model doesn't need millions of hours of audio to learn how to speak fluently.

Another benefit is that text-only datasets are much easier to obtain than high-quality speech datasets.
This is useful when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are practically nonexistent.
Some final tuning on the target speaker will still be beneficial for the best quality, but for most use cases a dataset of under one hour should suffice.

Samples

I am presenting results from the base model, which has not been fine-tuned for instruction following.

The prompt audio, as well as the generated audio, is sampled at 24kHz using one forward pass of the decoder.

Forced text prompt then text and audio generation

For this sample, the model was initialized with a short text prompt.
It begins in causal TTS mode to process the prompt, then transitions into autonomous generation, producing both the subsequent text and its corresponding audio simultaneously.
The original text prompt is indicated in bold. All following content was generated by MichiAI.

Cats are fascinating creatures known for their independence, curiosity, and unique personalities.
One aspect of cat behavior that often sparks curiosity is their ability to climb trees.
While some cats may find it challenging, others might enjoy it.
This section will delve into the reasons behind a cat's desire to climb, its benefits, and how to encourage this behavior.
**The urge to climb**
Cats are naturally curious animals, which means they are always exploring their surroundings.
As they navigate through their surroundings, they may discover new things, such as

Audio prompt:

In this sample, the model was provided with an audio prompt, from which it predicted the continuation.
The end of the user's input is marked by a beep added post-generation.
Everything following that signal was generated autonomously by MichiAI, capturing the original speaker's style and intent.

I am here only to pay my respects as a messenger from Great Britain to the people of the United States of America for the purpose of discussing certain matters of great public importance with you and to express my deep and sincere sympathy with your country and its people in the present state of its affairs and in the circumstances of its future development and

Text prompt and audio continuation (RAG)

For this sample, the model was provided with a pure text knowledge base (RAG input), then prompted via text to respond.
The model successfully synthesized the external information to generate a factually grounded audio response, demonstrating its ability to bridge text-based retrieval with spoken output.

RAG input:
In the city of Oakhaven, the sky is always a bright neon purple. Because of the purple sky, all the grass grows in a shade of silver.

The color of the sky in Oakhaven is a shade of purple.

Homographs

Model correctly pronounces homographs based on the context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs "She lives in New York".

She lives with her mother in a small cottage in the heart of the

Cats have nine lives. But what about the rest of us? Let me tell you about the

Numbers, abbreviations, and units

Model learns to pronounce numbers, abbreviations, and units correctly based on the context.
Input text normalization is not needed as the model learns how to pronounce them directly from the dataset.

To make 18 buns, prepare 2 lbs of flour, 1 cup of sugar and 1 tbsp of honey.
Then
mix the flour and sugar together. Add the water and stir until the dough is smooth.
Roll it out and cut into strips. Place them on a baking sheet, and bake for 3 hours.
Follow these instructions to make a delicious and healthy breakfast.
Step 1: Gather Your Ingredients

Mr. and Mrs. Miller met with the doctor in the drawing room of the house and found that

Demo

(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.

Roadmap

  • Larger LLM backbone.
  • Train on a conversational dataset.
  • Launch a Hugging Face Space.
  • Integrate a multilingual dataset.

In the next post, I will focus on the listening component.

If you would like to contribute in any way or have questions, please contact me using the form here.

References

  1. Cuervo, S., & Marxer, R. (2024). Scaling Properties of Speech Language Models. arXiv preprint arXiv:2404.00685. https://arxiv.org/abs/2404.00685
©2026 KetsuiLabs