I'd like to present MichiAI, a speech LLM designed for full-duplex interaction, allowing it to listen and speak simultaneously just like a human, while needing only a fraction of compute for training and inference. In this post I'll explain the speech generation component of MichiAI.
I am presenting this as a proof-of-concept. The GPT backbone is small by design to validate the architecture quickly, without a massive cluster of H100s.
It's not like I have that lying around in my garage anyway :)
The standard approach chains three separate models: ASR -> LLM -> TTS. Although popular and easy to implement, this cascaded design has significant drawbacks, which is why recent systems have moved to end-to-end speech LLMs.
Examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B).
Even these end-to-end architectures share a few key issues, which shaped the design choices below.
I opted for continuous rather than quantized (RVQ) embeddings to represent audio frames. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially, continuous embeddings allow the model to predict the same information in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.
I employed a Rectified Flow Matching head to predict these continuous embeddings.
This approach ensures the generation is not only high-quality and diverse but also computationally efficient.
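As a rough illustration of how a flow-matching head produces a continuous embedding, the sketch below integrates a rectified-flow ODE from Gaussian noise to an audio-frame embedding with a few Euler steps. The `velocity_model(x, t, cond)` interface, the embedding size, and the 8-step schedule are all illustrative assumptions, not MichiAI's actual configuration.

```python
import torch

def sample_flow_matching(velocity_model, cond, dim=512, steps=8):
    """Euler integration of a rectified-flow ODE from noise (t = 0)
    to an audio-frame embedding (t = 1). `velocity_model(x, t, cond)`
    is assumed to predict the velocity field dx/dt."""
    x = torch.randn(1, dim)                      # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_model(x, t, cond)  # one Euler step toward t = 1
    return x                                     # continuous embedding at t = 1
```

Because rectified flow learns nearly straight transport paths, a small step count like this can already give usable samples, which is part of what keeps per-frame inference cheap.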
The audio embeddings are decoded into a raw audio waveform using a HiFi-GAN vocoder.
I designed the vocoder to be lightweight and causal to allow streaming which is critical for speech language models.
Causal models are notoriously harder to train than their non-causal counterparts and take longer to converge, so the vocoder is not yet at ground-truth parity. Closing that gap is mainly a matter of training for more iterations.
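The key ingredient of a streaming-friendly vocoder is convolutions that never look at future samples. The minimal sketch below shows the standard left-padding trick; it is a generic illustration, not MichiAI's actual vocoder code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees past samples, so the vocoder can
    emit audio frame-by-frame while generation is still running."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1               # left-pad only, no lookahead
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                        # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))              # zeros on the past side only
        return self.conv(x)
```

With every layer built this way, the receptive field extends only backwards in time, so output sample `t` is fixed as soon as input sample `t` arrives.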
The 'Listen' head functions as a multi-modal encoder, mapping raw audio into continuous embeddings while simultaneously generating corresponding text tokens.
Predicted audio and text embeddings are looped back into the model. The system operates in full duplex: at any moment, both the speak head and the listen head can append new text embeddings, with their corresponding audio embeddings, to the backbone's context.
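Conceptually, one full-duplex tick interleaves both streams into a single context: the listen head's latest embedding is appended, the backbone predicts the next speak-side text and audio embeddings, and those predictions are looped straight back in. The sketch below is a toy illustration of that loop; the `backbone(state)` interface and tensor shapes are hypothetical.

```python
import torch

def duplex_step(backbone, listen_frame, state):
    """One full-duplex tick (hypothetical interface): append the listen
    head's latest embedding, let the backbone emit the next text + audio
    embeddings for the speak side, and feed those back into the context."""
    state = torch.cat([state, listen_frame], dim=1)             # hear while speaking
    speak_text, speak_audio = backbone(state)                   # predict both streams
    state = torch.cat([state, speak_text, speak_audio], dim=1)  # loop outputs back in
    return speak_audio, state
```

Because listening and speaking share one context, the model can react mid-utterance instead of waiting for turn boundaries.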
The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.
I used several strategies to keep the audio training from degrading the backbone's text abilities; the most important one, mixing text-only data into the training set, is described below.
So far I'm not seeing any noticeable dip in reasoning or perplexity compared to the original text-only SmolLM.
Normally, speech LLMs at this scale produce no intelligible speech at all, let alone preserve the reasoning capabilities of the underlying LLM.
These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals such as MMLU and HellaSwag to confirm them empirically.
| Feature | Standard (RVQ/Quantized) | MichiAI (Continuous) |
|---|---|---|
| Forward passes per frame | 32 (one per codebook) | 1 |
| Audio Fidelity | Bottlenecked by codebook size | High (Floating point precision) |
| Inference Latency | High | Ultra-Low |
| Modality Fusion | Often disjointed | Unified |
| Coherence Degradation | Significant | None observed |
I implemented the dataset generation pipeline using Luigi and transcribed the speech using a fine-tuned Whisper Large model.
To preserve the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. This not only protects text-only comprehension but also lets the model handle "mixed" prompting, where a prompt might be half pure text and half audio. It also keeps the model capable of accepting and producing text-only content for downstream applications, which is useful for function calling and chain-of-thought (CoT) generation.
For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."
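The mixing itself can be as simple as blending the two pools at a fixed ratio before shuffling. The sketch below is a generic illustration; the 30% text ratio is an assumption for the example, not the ratio actually used.

```python
import random

def mix_datasets(audio_samples, text_samples, text_ratio=0.3, seed=0):
    """Blend pure-text samples into the speech dataset so that roughly
    `text_ratio` of the final set is text-only. The ratio here is an
    illustrative assumption, not the actual training value."""
    rng = random.Random(seed)
    n_text = int(len(audio_samples) * text_ratio / (1 - text_ratio))
    mixed = list(audio_samples) + rng.choices(text_samples, k=n_text)
    rng.shuffle(mixed)
    return mixed
```

Keeping the ratio fixed across training makes it easy to sweep it as a hyperparameter when balancing speech quality against text-domain retention.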
| Model | Parameters | Audio Training Data |
|---|---|---|
| Hertz-dev | 8.5B | 20,000,000 hours |
| Qwen-Omni | 7B+ | 8,000,000+ hours |
| Moshi | 7B | 7,000,000 hours |
| MichiAI | 530M | 5,000 hours |
I trained most of the model on a single RTX 4090; the stages that needed more memory ran on 2x RTX A6000s. The model was not trained to full convergence.
The vocoder and speech head were pretrained separately; the full model was then trained jointly, and during this joint training the LLM backbone learned to utilize the text and audio modalities together.
One of the most exciting findings was how well the model 'recycles' its pretrained knowledge. During joint training, MichiAI learns the mechanics of pronunciation and vocalization and then maps those patterns back onto the text-only knowledge. Since speech patterns are fundamentally less complex than the nuances of human logic, the model doesn't need millions of hours of audio to learn how to speak fluently.
Another benefit is that text-only datasets are much easier to obtain than high-quality speech datasets.
This matters when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are practically nonexistent.
Some final tuning on the target speaker will still help ensure the best quality, but for most use cases a dataset of under one hour of audio should suffice.
I am presenting results from the base model, which has not been fine-tuned for instruction following.
The prompt audio as well as the generated audio are sampled at 24 kHz, produced with a single forward pass of the decoder.
For this sample, the model was initialized with a short text prompt.
It begins in causal TTS mode to process the prompt, then transitions into autonomous generation, producing both the subsequent text and its corresponding audio simultaneously.
The original text prompt is indicated in bold. All following content was generated by MichiAI.
**Cats are fascinating creatures known for their independence, curiosity, and unique personalities.**
One aspect of cat behavior that often sparks curiosity is their ability to climb trees.
While some cats may find it challenging, others might enjoy it.
This section will delve into the reasons behind a cat's desire to climb, its benefits, and how to encourage this behavior.
**The urge to climb**
Cats are naturally curious animals, which means they are always exploring their surroundings.
As they navigate through their surroundings, they may discover new things, such as
In this sample, the model was provided with an audio prompt, from which it predicted the continuation.
The end of the user's input is marked by a beep added post-generation.
Everything following that signal was generated autonomously by MichiAI, capturing the original speaker's style and intent.
I am here only to pay my respects as a messenger from Great Britain to the people of the United States of America for the purpose of discussing certain matters of great public importance with you and to express my deep and sincere sympathy with your country and its people in the present state of its affairs and in the circumstances of its future development and
For this sample, the model was provided with a pure text knowledge base (RAG input), then prompted via text to respond.
The model successfully synthesized the external information to generate a factually grounded audio response, demonstrating its ability to bridge text-based retrieval with spoken output.
RAG input:
In the city of Oakhaven, the sky is always a bright neon purple. Because of the purple sky, all the grass grows in a shade of silver.
The color of the sky in Oakhaven is a shade of purple.
The model correctly pronounces homographs based on context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs "She lives in New York".
She lives with her mother in a small cottage in the heart of the
Cats have nine lives. But what about the rest of us? Let me tell you about the
The model learns to pronounce numbers, abbreviations, and units correctly based on context.
No input text normalization is needed, as the model picks up these pronunciations directly from the dataset.
To make 18 buns, prepare 2 lbs of flour, 1 cup of sugar and 1 tbsp of honey.
Then mix the flour and sugar together. Add the water and stir until the dough is smooth.
Roll it out and cut into strips. Place them on a baking sheet, and bake for 3 hours.
Follow these instructions to make a delicious and healthy breakfast.
Step 1: Gather Your Ingredients
Mr. and Mrs. Miller met with the doctor in the drawing room of the house and found that
(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.
In the next post, I will focus on the listening component.
If you would like to contribute in any way or have questions, please contact me using the form here.