This is the first part of a series detailing MichiAI, a speech LLM designed for full-duplex interaction, allowing it to listen and speak simultaneously just like a human. In this opening post, I’ll be diving deep into the speech generation component.
I am presenting this as a proof of concept. The GPT backbone is small by design so the architecture could be validated quickly without a massive cluster of H100s; my goal was to see how much performance I could squeeze out of a lightweight, efficient setup in a reasonable timeframe.
My goal was to develop a novel, built-from-the-ground-up, low-latency speech LLM that is lightweight enough to run in real time on consumer-grade hardware or mobile devices while staying intelligent. Potential applications include voice agents, AI companions, NPCs in games, language-learning apps, and more.
The standard approach uses three distinct models: ASR -> LLM -> TTS.
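As a rough sketch of how such a cascade is usually wired together (the three model objects and their methods below are placeholders, not any specific library's API):

```python
# Hypothetical cascade; `asr`, `llm`, and `tts` stand in for any three
# off-the-shelf models.
def cascaded_turn(audio_in, asr, llm, tts):
    text_in = asr.transcribe(audio_in)    # speech -> text (prosody is discarded here)
    text_out = llm.generate(text_in)      # text -> text
    audio_out = tts.synthesize(text_out)  # text -> speech
    return audio_out                      # the three latencies add up serially
```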
Although popular and easy to implement, this architecture suffers from several significant issues:
The alternative is an end-to-end speech LLM that processes and generates audio directly within a single model. Examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B).
There are a few key issues with these existing architectures:
To represent audio frames, I opted for continuous embeddings instead of quantized (RVQ) ones. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially for every single frame, continuous embeddings allow the model to predict a complete acoustic representation in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.
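To make the cost difference concrete, here is a small sketch; the frame rate, codebook count, and layer sizes are illustrative assumptions rather than MichiAI's actual configuration:

```python
import torch

FRAMES_PER_SECOND = 50   # assumed acoustic frame rate
NUM_CODEBOOKS = 8        # assumed RVQ depth

def rvq_predictions_per_second() -> int:
    # RVQ decoding predicts one discrete entry per codebook, sequentially,
    # for every frame, so the cost scales with frames * codebooks.
    return FRAMES_PER_SECOND * NUM_CODEBOOKS          # 50 * 8 = 400

def continuous_prediction(speak_head: torch.nn.Module, hidden: torch.Tensor):
    # A continuous head emits one complete acoustic vector per frame,
    # with no inner loop over codebooks.
    return speak_head(hidden)                         # (batch, d_audio)

speak_head = torch.nn.Linear(512, 128)                # toy stand-in for the speak head
frame = continuous_prediction(speak_head, torch.randn(1, 512))
print(rvq_predictions_per_second(), frame.shape)      # 400 torch.Size([1, 128])
```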
I employed a Rectified Flow Matching head to predict these continuous embeddings. This approach ensures the generation is not only high-quality and diverse but also computationally efficient. Finally, these embeddings, which represent a compressed mel representation are decoded into a raw audio waveform using a HiFi-GAN vocoder.
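Below is a minimal sketch of what rectified flow matching looks like in training and sampling; the head's signature, dimensions, and step count are my assumptions, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(flow_head, cond, x1):
    # x1: target continuous audio embedding (B, d_audio); cond: backbone state.
    x0 = torch.randn_like(x1)                          # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device)    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                       # straight-line interpolation
    v_target = x1 - x0                                 # constant target velocity
    return F.mse_loss(flow_head(xt, t, cond), v_target)

@torch.no_grad()
def sample_frame(flow_head, cond, d_audio, steps=4):
    # Euler integration from noise to a clean embedding; a handful of
    # steps keeps the per-frame inference cost low.
    x = torch.randn(cond.size(0), d_audio, device=cond.device)
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i / steps, device=cond.device)
        x = x + flow_head(x, t, cond) / steps
    return x
```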
I designed the vocoder to be lightweight and causal to allow streaming, which is critical for speech language models. Causal models are notoriously harder to train than non-causal ones and take longer to converge, so the vocoder is not quite at ground-truth parity yet; the main way to close the gap is simply to train it for more iterations. This is low-hanging fruit: the vocoder is a big factor in perceived audio quality, essentially the 'finish' that makes a voice sound like a real human rather than a robot.
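To illustrate what 'causal' means here, this is the kind of convolution a streaming vocoder is built from; the channel counts and kernel size below are arbitrary examples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Conv1d whose output at time t depends only on inputs up to t,
    so audio can be generated chunk by chunk without future context."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))  # pad on the left only

layer = CausalConv1d(80, 80, kernel_size=7)
print(layer(torch.randn(1, 80, 120)).shape)            # torch.Size([1, 80, 120])
```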
The 'Listen' head functions as a multi-modal encoder, mapping raw audio into a continuous embedding while simultaneously generating corresponding text tokens.
Predicted audio and text embeddings are looped back into the model. The system operates in full-duplex mode; at any time, the speak head and listen head can add more text embeddings and corresponding audio embeddings to the backbone.
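Conceptually, one full-duplex step looks something like the sketch below; every name here (encode_frame, step, generate, the cache object) is illustrative rather than the real interface:

```python
def full_duplex_step(backbone, listen_head, speak_head, mic_frame, cache):
    # 1. The listen head maps the incoming audio frame to a continuous
    #    embedding plus any text tokens it recognizes.
    listen_emb, heard_tokens = listen_head.encode_frame(mic_frame)

    # 2. The backbone consumes the new embeddings alongside what it
    #    emitted on the previous step, updating its KV cache.
    hidden, cache = backbone.step([listen_emb, *heard_tokens], cache)

    # 3. The speak head may emit a text token and a matching audio
    #    embedding; both are fed back into the context on the next step,
    #    so listening and speaking interleave frame by frame.
    spoken_token, audio_emb = speak_head.generate(hidden)
    return spoken_token, audio_emb, cache
```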
The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.
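For reference, the measurement itself is straightforward; a sketch like the one below (generate_first_frame is a stand-in for the model call, not a real function) is roughly how such a number is obtained:

```python
import time
import torch

def time_to_first_audio_ms(generate_first_frame, prompt):
    # Synchronize so previously queued GPU work doesn't distort the timing.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    first_frame = generate_first_frame(prompt)   # prompt in -> first audio frame out
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0, first_frame
```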
I addressed degradation issues through several strategies:
So far it doesn't seem like the model has lost its 'brain'. I'm seeing no noticeable dip in reasoning or perplexity compared to the original text-only SmolLM.
Normally, for a speech LLM of this size, you wouldn't get any intelligible speech at all, let alone preserve the reasoning capabilities of the underlying LLM.
These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals like MMLU and HellaSwag to confirm this empirically.
| Feature | Standard (RVQ/Quantized) | MichiAI (Continuous) |
|---|---|---|
| Forward Passes (per 1s of audio) | 400 | 1 |
| Audio Fidelity | Bottlenecked by codebook size | High (Floating point precision) |
| Inference Latency | High | Ultra-Low |
| Modality Fusion | Often disjointed | Fully integrated |
| Coherence Degradation | Significant | None |
I implemented the dataset generation pipeline using Luigi and transcribed the speech using a fine-tuned Whisper Large model.
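As a rough illustration of what one such pipeline stage can look like (the checkpoint path, file layout, and parameters here are made up for the example, not the actual pipeline):

```python
import json
import luigi
from transformers import pipeline

class TranscribeClip(luigi.Task):
    """Transcribe a single audio clip and write the text next to it."""
    audio_path = luigi.Parameter()
    model_path = luigi.Parameter(default="./whisper-large-finetuned")  # hypothetical checkpoint

    def output(self):
        return luigi.LocalTarget(str(self.audio_path) + ".json")

    def run(self):
        asr = pipeline("automatic-speech-recognition", model=str(self.model_path))
        result = asr(str(self.audio_path))
        with self.output().open("w") as f:
            json.dump({"audio": str(self.audio_path), "text": result["text"]}, f)

# luigi.build([TranscribeClip(audio_path=p) for p in clip_paths], workers=4)
```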
To maintain the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. I do this not only to preserve text-only comprehension but also because I want the model to handle "mixed" prompting. For example, a prompt might be half pure text and half audio. Furthermore, I want the model to be capable of generating and accepting text-only input for downstream applications. This approach is also helpful for Function Calling or Chain of Thought (CoT) generation.
For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."
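A simple way to get this mixing is to sample from the two pools with a fixed ratio; the sketch below is one possible take, with the ratio being an illustrative number rather than the one actually used:

```python
import random

def mixed_sample_stream(audio_samples, text_samples, text_ratio=0.3, seed=0):
    # Yields training samples tagged by modality so the collator knows
    # whether audio embeddings need to be attached.
    rng = random.Random(seed)
    while True:
        if rng.random() < text_ratio:
            yield {"modality": "text", "sample": rng.choice(text_samples)}
        else:
            yield {"modality": "audio+text", "sample": rng.choice(audio_samples)}
```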
| Model | Parameters | Audio Training Data |
|---|---|---|
| Hertz-dev | 8.5B | 20,000,000 hours |
| Qwen-Omni | 7B+ | 8,000,000+ hours |
| Moshi | 7B | 7,000,000 hours |
| MichiAI | 540M | 5,000 hours |
I trained most of the model on a single RTX 4090, switching to 2x RTX A6000s for the parts requiring more memory. Since my available compute and budget are highly limited, I did not train the model to full convergence.
The vocoder and speech head were pretrained separately; the full model was then trained jointly. During this joint training, the LLM backbone learned how to use the text and audio modalities together.
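In code terms, the staging can be expressed as changing which parameters the optimizer sees; the learning rates and freezing choices below are illustrative assumptions, not the exact schedule:

```python
import itertools
import torch

def stage1_optimizer(backbone, speak_head):
    # Pretraining: the speech head learns to predict audio embeddings
    # while the frozen text backbone keeps its language skills intact.
    for p in backbone.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(speak_head.parameters(), lr=1e-4)

def stage2_optimizer(backbone, speak_head, listen_head):
    # Joint training: everything is unfrozen so the backbone learns to
    # route information between the text and audio modalities.
    for p in backbone.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(
        itertools.chain(backbone.parameters(),
                        speak_head.parameters(),
                        listen_head.parameters()),
        lr=2e-5,
    )
```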
The model uses its pretrained knowledge surprisingly well: during joint training it learns how to pronounce and vocalize words from the speech data and transfers those patterns to the text-only portion of the dataset. Since speech patterns are not as complex as natural-language patterns, I found that the model can learn to speak from a relatively small speech dataset. This is great news, because it means the model doesn't need to see millions of hours of speech to learn to speak coherently.
Another benefit is that text-only datasets are much easier to obtain than high-quality speech datasets. This is especially useful when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are pretty much nonexistent.
I am presenting results from the base model, which has not been fine-tuned for instruction following.
The input audio and the generated audio are both sampled at 24 kHz, and the output is decoded in a single forward pass of the decoder.
The end of the prompt is marked with a beep added post-generation.
After the beep, the model generates the speech autonomously.
No text is provided!
Cats are fascinating creatures that have been companions to humans for thousands of years.
Cats are fascinating creatures that have been companions to humans for thousands of years.
For these samples, the model was provided with the prompt text for which the audio was generated.
Put simply, it starts in causal TTS mode and then generates the speech in full.
The text prompt is marked in bold.
Cats are fascinating creatures that have been companions to humans for thousands of years.
Cats are fascinating creatures that have been companions to humans for thousands of years.
For these samples, the model was provided with a pure text prompt and the audio continuation was generated.
The model correctly pronounces homographs based on context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs. "She lives in New York".
(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.
In the next post, I will focus on the listening component.
If you would like to contribute in any way or have questions, please contact me using the form here.