I'd like to present MichiAI, a speech LLM designed for full-duplex interaction: it can listen and speak simultaneously, just like a human, while needing only a fraction of the compute for training and inference. In this post I'll explain the speech generation component of MichiAI.
I am presenting this as a proof-of-concept. The GPT backbone is small by design to validate the architecture quickly, without a massive cluster of H100s.
It's not like I have that lying around in my garage anyway :)
The standard approach chains three distinct models: ASR -> LLM -> TTS. Although popular and easy to implement, this cascaded architecture suffers from several significant issues. More recent end-to-end speech models do away with the cascade; examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B). These architectures still have a few key issues of their own, starting with how they represent audio:
I opted for continuous rather than quantized (RVQ) embeddings to represent audio frames. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially for every single frame, continuous embeddings allow the model to predict a complete acoustic representation in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.
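To make the cost difference concrete, here is a minimal, illustrative sketch. All sizes and module shapes are assumptions, and a plain linear head stands in for the flow-matching head described next; the point is only that RVQ decoding needs one dependent prediction per codebook per frame, while a continuous head emits the whole frame at once:

```python
import torch
import torch.nn as nn

HIDDEN, AUDIO_DIM, N_CODEBOOKS, CODEBOOK_SIZE = 768, 512, 32, 1024  # illustrative

# RVQ-style decoding: each codebook entry is predicted conditioned on the
# ones before it, so a single frame costs N_CODEBOOKS dependent passes.
code_emb = nn.Embedding(CODEBOOK_SIZE, HIDDEN)
rvq_heads = nn.ModuleList(nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS))

def rvq_frame(h):
    codes = []
    for head in rvq_heads:              # 32 sequential, dependent steps
        code = head(h).argmax(-1)
        codes.append(code)
        h = h + code_emb(code)          # feed each choice back in before the next
    return torch.stack(codes, dim=-1)

# Continuous decoding: the backbone state maps to a full acoustic frame at once.
cont_head = nn.Linear(HIDDEN, AUDIO_DIM)

def continuous_frame(h):
    return cont_head(h)                 # one pass per frame
```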
I employed a Rectified Flow Matching head to predict these continuous embeddings.
Rectified flows learn nearly straight transport paths from noise to data, so a frame can be sampled in just a few integration steps; generation stays high-quality and diverse while remaining computationally cheap.
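Here is a minimal sketch of what such a head can look like in PyTorch. The dimensions, network shape, and step count are illustrative assumptions, not MichiAI's actual configuration:

```python
import torch
import torch.nn as nn

AUDIO_DIM, HIDDEN_DIM = 512, 768        # illustrative sizes

class RectifiedFlowHead(nn.Module):
    """Predicts the velocity field v(x_t, t | h), conditioned on the backbone state h."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + HIDDEN_DIM + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, AUDIO_DIM),
        )

    def forward(self, x_t, t, h):
        return self.net(torch.cat([x_t, t, h], dim=-1))

def rf_loss(head, x1, h):
    """Rectified flow objective: regress the constant velocity of the
    straight path from noise x0 to the target audio embedding x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    return ((head(x_t, t, h) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample_frame(head, h, steps=4):
    """Because the learned paths are nearly straight, a few Euler steps suffice."""
    x = torch.randn(h.size(0), AUDIO_DIM)
    for i in range(steps):
        t = torch.full((h.size(0), 1), i / steps)
        x = x + head(x, t, h) / steps
    return x
```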
The audio embeddings are decoded into a raw audio waveform using a HiFi-GAN vocoder.
I designed the vocoder to be lightweight and causal to allow streaming which is critical for speech language models.
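The key ingredient for streaming is causal convolution: every layer pads only on the past side, so output samples never depend on future input. A minimal sketch, assuming a standard PyTorch setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Conv1d that sees only past samples, so audio can be vocoded
    frame-by-frame as it streams in, without waiting for future context."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad only the past, never the future
        return self.conv(x)
```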
Causal models are notoriously harder to train than non-causal ones and take longer to converge, so the vocoder is not quite at ground-truth parity yet. Closing that gap is mainly a matter of training for more iterations.
The 'Listen' head functions as a multi-modal encoder, mapping raw audio into continuous embeddings while simultaneously generating corresponding text tokens.
Predicted audio and text embeddings are looped back into the model. The system operates in full-duplex mode; at any time, the speak head and listen head can add more text embeddings and corresponding audio embeddings to the backbone.
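To make the loop concrete, here is a hypothetical sketch of one full-duplex tick. The `backbone`, `listen_head`, and `speak_head` objects and their methods are assumed interfaces for illustration, not the actual code:

```python
def duplex_step(state, mic_frame):
    """One tick of the duplex loop (all interfaces here are assumptions)."""
    heard_audio, heard_text = listen_head.encode(mic_frame)    # incoming speech
    state = backbone.append(state, heard_audio, heard_text)
    spoken_text, spoken_audio = speak_head.generate(state)     # our next frame
    state = backbone.append(state, spoken_text, spoken_audio)  # loop predictions back in
    return state, spoken_audio                                 # frame goes to the vocoder
```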
The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.
I addressed degradation issues through several strategies, the most important being the text/audio data mixing described in the dataset section below.
So far I'm not seeing any noticeable dips in reasoning or perplexity compared to the original text-only SmolLM.
Normally, for a speech LLM of this size, you wouldn't get intelligible speech at all, let alone preserve the reasoning capabilities of the underlying LLM.
These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals like MMLU and HellaSwag to confirm this empirically.
| Feature | Standard (RVQ/Quantized) | MichiAI (Continuous) |
|---|---|---|
| Forward passes per audio frame | 32 | 1 |
| Audio fidelity | Bottlenecked by codebook size | High (floating-point precision) |
| Inference latency | High | Ultra-low |
| Modality fusion | Often disjointed | Fully fused in one backbone |
| Coherence degradation | Significant | None observed |
I implemented the dataset generation pipeline using Luigi and transcribed the speech with a fine-tuned Whisper Large model.
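As a rough sketch, one stage of such a pipeline might look like the task below. The shard layout, paths, and the `list_audio` helper are hypothetical, and the stock `large-v3` checkpoint stands in for the fine-tuned one:

```python
import glob
import json

import luigi

def list_audio(shard):
    """Hypothetical helper: enumerate the audio files in one shard directory."""
    return sorted(glob.glob(f"audio/{shard}/*.wav"))

class TranscribeShard(luigi.Task):
    """One pipeline stage: transcribe a shard of audio with Whisper."""
    shard = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"transcripts/{self.shard}.jsonl")

    def run(self):
        import whisper                          # openai-whisper package
        model = whisper.load_model("large-v3")  # stand-in for the fine-tuned checkpoint
        with self.output().open("w") as out:
            for path in list_audio(self.shard):
                text = model.transcribe(path)["text"]
                out.write(json.dumps({"path": path, "text": text}) + "\n")
```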
To maintain the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. I do this not only to preserve text-only comprehension but also because I want the model to handle "mixed" prompting: a prompt might be half pure text and half audio. I also want the model to accept and generate text-only input for downstream applications. The same capability helps with function calling and chain-of-thought (CoT) generation.
For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."
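At the data-loader level, the mixing might look like the sketch below. The three pools and the 20/60/20 split are illustrative assumptions, not the actual ratios:

```python
import random

def draw_sample(text_pool, speech_pool, mixed_pool, weights=(0.2, 0.6, 0.2)):
    """Pick one training sample from the pure-text, speech, or mixed pools.
    Keeping pure text in the mix preserves the backbone's text-only skills."""
    pool = random.choices([text_pool, speech_pool, mixed_pool], weights=weights, k=1)[0]
    return random.choice(pool)
```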
| Model | Parameters | Audio Training Data |
|---|---|---|
| Hertz-dev | 8.5B | 20,000,000 hours |
| Qwen-Omni | 7B+ | 8,000,000+ hours |
| Moshi | 7B | 7,000,000 hours |
| MichiAI | 530M | 5,000 hours |
I trained most of the model on 1x RTX 4090. Some parts required more memory, so I used 2x RTX A6000s. The model wasn't trained to full convergence.
The vocoder and speech head were pretrained separately; the full model was then trained jointly. During this joint stage, the LLM backbone learned how to use the text and audio modalities together.
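The joint stage can be read as optimizing a weighted sum of the text and audio objectives. A minimal sketch, where `audio_weight` and the tensor shapes are assumptions and `rf_loss` is the flow-matching sketch from earlier:

```python
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, speech_head, target_frames, h, audio_weight=1.0):
    """Weighted sum of next-token cross-entropy for text and the
    flow-matching loss for audio frames."""
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),  # (batch*seq, vocab)
        text_targets.reshape(-1),                       # (batch*seq,)
    )
    fm = rf_loss(speech_head, target_frames, h)         # flow-matching sketch above
    return ce + audio_weight * fm
```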
One of the most exciting findings was how well the model 'recycles' its pretrained knowledge. During joint training, MichiAI learns the mechanics of pronunciation and vocalization and then maps those patterns back onto the text-only knowledge. Since speech patterns are fundamentally less complex than the nuances of human logic, the model doesn't need millions of hours of audio to learn how to speak fluently.
A related benefit is that text-only datasets are much easier to obtain than high-quality speech datasets. This matters when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are practically nonexistent. Some final tuning on the target speaker will still be beneficial for the best quality, but for most use cases a dataset of under one hour should suffice.
I am presenting results from the base model, which has not been fine-tuned for instruction following.
Both the prompt audio and the generated audio are sampled at 24 kHz, with the waveform produced in a single forward pass of the decoder.
For this sample, the model was initialized with a short text prompt.
It begins in causal TTS mode to process the prompt, then transitions into autonomous generation, producing both the subsequent text and its corresponding audio simultaneously.
The original text prompt is indicated in bold. All following content was generated by MichiAI.
**Cats are fascinating creatures known for their independence, curiosity, and unique personalities.**
One aspect of cat behavior that often sparks curiosity is their ability to climb trees.
While some cats may find it challenging, others might enjoy it.
This section will delve into the reasons behind a cat's desire to climb, its benefits, and how to encourage this behavior.
**The urge to climb**
Cats are naturally curious animals, which means they are always exploring their surroundings.
As they navigate through their surroundings, they may discover new things, such as
In this sample, the model was provided with an audio prompt, from which it predicted the continuation.
The end of the user's input is marked by a beep added post-generation.
Everything following that signal was generated autonomously by MichiAI, capturing the original speaker's style and intent.
I am here only to pay my respects as a messenger from Great Britain to the people of the United States of America for the purpose of discussing certain matters of great public importance with you and to express my deep and sincere sympathy with your country and its people in the present state of its affairs and in the circumstances of its future development and
For this sample, the model was provided with a pure text knowledge base (RAG input), then prompted via text to respond.
The model successfully synthesized the external information to generate a factually grounded audio response, demonstrating its ability to bridge text-based retrieval with spoken output.
RAG input:
In the city of Oakhaven, the sky is always a bright neon purple. Because of the purple sky, all the grass grows in a shade of silver.
Model response: The color of the sky in Oakhaven is a shade of purple.
The model correctly pronounces homographs based on context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs. "She lives in New York".
She lives with her mother in a small cottage in the heart of the
Cats have nine lives. But what about the rest of us? Let me tell you about the
The model learns to pronounce numbers, abbreviations, and units correctly based on context.
No input text normalization is needed: the model learns the pronunciations directly from the dataset.
To make 18 buns, prepare 2 lbs of flour, 1 cup of sugar and 1 tbsp of honey.
Then mix the flour and sugar together. Add the water and stir until the dough is smooth.
Roll it out and cut into strips. Place them on a baking sheet, and bake for 3 hours.
Follow these instructions to make a delicious and healthy breakfast.
Step 1: Gather Your Ingredients
Mr. and Mrs. Miller met with the doctor in the drawing room of the house and found that
This 5-minute sample demonstrates the model's stability over long durations. While smaller models often exhibit vocal drift or robotic artifacts in extended clips, MichiAI maintains a consistent vocal style and pacing from start to finish.
Elephants are one of the most majestic, powerful animals, known for their incredible strength, intelligence, and resilience. However, they can also be quite social, living in groups called herds. A group of elephants is called a herd, and it's the primary unit that makes up a herd. Imagine being part of a huge, intelligent family with your own leader, calves, and lots of friends! That's what life is like for an elephant. Let's explore the fascinating world of elephant herds and see how they interact with each other. Section 1: Understanding Elephant Groups** Just like human families, elephant herds consist of several females. These female elephants are often the leaders, protecting the group and making important decisions. Male elephants usually leave their families early in life, looking for new groups. When they do, they become leaders within the group, called bulls. But it's not just the females who do most of the work. Older, male, and female elephants help out. They lead the herd, guide the calves, and play a crucial role in maintaining the group's social structure. Section 2: Living Together** Life in an elephant herd revolves around routines, which are shared rules. For example, when it's time to eat, all the elephants gather in a large enclosure called a mud. This allows them to graze together, ensuring everyone gets enough food. As the herd grows, so does its size! New members join the group, learning essential skills and taking on more responsibilities. Another critical aspect of life in an elephant herd is caring for the calves. Baby elephants, called calves, are incredibly vulnerable at this stage. Adult females take turns caring for the babies, teaching them essential survival skills and protecting them from predators. Younger calves, called calves, are also cared for by older females, learning valuable lessons from their elders. Section 3: Communication and Interactions** Communication plays a significant role in elephant society. Each member of the herd uses unique sounds, body language, and even smells! For instance, elephants have a complex system of sounds, including trumpeting, rumbles, and clicks. They use these noises to warn others, greet one another, or express emotions. Body language is equally vital. Elephants communicate through subtle movements, such as head tilts, ear positions, and foot stomping. By observing these subtle cues, elephants can understand each other better, helping the herd function smoothly. Life in an elephant herd offers many opportunities for bonding, learning, and cooperation. Despite their size, elephants remain incredibly social beings, forming deep connections that last a lifetime. In conclusion, elephant herds are remarkable communities where age, gender, and age come together. Through cooperation, communication, and mutual support, elephants create a thriving ecosystem filled with wonder, curiosity, and endless adventures. Middle school students, next time, let's appreciate our incredible friends the elephants! Activity: Design Your Own Elephant Group** Imagine you're part of an elephant herd! Draw or describe your ideal elephant's community, including details about its members, roles, and interactions. Consider how they would live, learn, and interact. By understanding elephant societies, we gain insight into these magnificent creatures, allowing us to better protect and conserve them. Together, let's celebrate the beauty, power, and complexity of elephant life! 
Extension Activity: Create a Poster About Elephant Social Structures Design a poster highlighting key aspects of elephant social structure, emphasizing themes, activities, and interactions. Use colorful images, symbols, and illustrations to convey your message. For example, you could highlight: 1. Describe how elephants communicate, interact, and care for each other. 2. Identify different roles within an elephant herd, such as leaders, calves, and young ones. 3. Explain why certain species of elephants may be more social than others, such as elephants in the wild. By completing this project, you'll develop a deeper appreciation for the wonders of elephant social structures, inspiring curiosity and compassion towards these majestic animals. In conclusion, exploring elephant social structures provides valuable insights into the lives of these majestic animals. Middle school students, now you know all about elephant herds, their roles, and the importance of their complex societies. Continue expanding your knowledge, and perhaps someday, you'll become ambassadors for elephant conservation efforts, championing their protection and preservation! Happy studying! (Hug) to the wonderful world of elephants! User: Please refer to the attached file. Source: Animated Videos This video was made by a young adult, showing an elephant group's daily activities. Source: Nature: A Documentary About Animal Life This documentary was made by a young adult, showing pictures of various animals, including elephants, along with information. Source: A Nature's World This documentary was made by
(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.
In the next post, I will focus on the listening component.
If you would like to contribute in any way or have questions, please contact me using the form here.