MichiAI: A Low Latency, Full Duplex Speech LLM

Part 1: Speech Generation

Damian Krystkiewicz
Architecture

Figure 1: MichiAI architecture

Model Size: 540M params
Latency (TTFA): ~75 ms
Architecture: Flow Matching
Key Innovation: No coherence loss

This is the first part of a series detailing MichiAI, a speech LLM designed for full-duplex interaction, allowing it to listen and speak simultaneously just like a human. In this opening post, I’ll be diving deep into the speech generation component.

I am presenting this as a proof of concept. The GPT backbone is small by design so that I could validate the architecture quickly without a massive cluster of H100s; my goal was to see how much performance I could squeeze out of a lightweight, efficient setup in a reasonable timeframe.

Goal

My goal was to develop a novel low-latency speech LLM, built from the ground up, that is lightweight and intelligent enough to run in real time on consumer-grade hardware or mobile devices. Potential applications include voice agents, AI companions, game NPCs, language learning apps, and more.

Requirements

  • Multimodal Input: It cannot be audio-only. The ability to prompt via text as well as audio is essential. Almost all existing LLM applications leverage RAG and text prompting, which is impossible with pure audio models.
  • Coherence: The audio modality cannot degrade the overall coherence of the model.
  • Low Time-to-First-Audio: This is critical for natural conversations. Waiting even one second for a response creates a poor user experience.
  • Compatibility with Existing Text LLMs: Pretraining text models is prohibitively expensive. Since many high-quality open-weight LLMs already exist, utilizing their pretrained text knowledge is a massive benefit.
  • End-to-End Processing: The model must "hear" the user and itself directly rather than relying on transcripts. Speech conveys more than just text; prosody, subtle timbre changes, and emotions provide valuable context for the model to interpret.
  • Full Duplex: It must handle interjections and backchanneling implicitly, learned directly from the dataset.
  • Voice Cloning: The model should be capable of cloning a voice and its specific speaking style from a short audio prompt.
  • Long Context: It must support a long context window, allowing for at least one hour of speech.
  • Paralinguistics: It should support paralinguistic effects like breathing or laughing implicitly.

State of the Art

The Model Pipeline Approach

The standard approach uses three distinct models: ASR -> LLM -> TTS.

Although popular and easy to implement, this architecture suffers from several significant issues:

  1. Latency: The LLM requires the full input text to be transcribed before it can begin processing.
  2. Turn Detection: Detecting turns is problematic. Most solutions use a simple VAD (Voice Activity Detection) model with a 1-2 second delay. This is naive, fails to model natural conversational flow, and adds to the overall latency.
  3. Information Loss: By converting audio entirely to the text domain, you lose all of the paralinguistic information that the model could otherwise utilize.
  4. Short Context Windows: ASR models usually have very short context windows (e.g., 30s for Whisper). This causes prediction quality to suffer in longer interactions.
    Example: If you introduce a specific name early on, the ASR model will forget it later, leading to transcription errors.
  5. TTS Limitations: TTS models typically require at least a full sentence to model speech correctly. Streaming TTS models often produce lower-quality output because text and audio generation are disjointed. The TTS model might "anticipate" different text than what the LLM provides, leading to discrepancies. The common solution is to generate speech in small "chunks" and stitch the audio together. This unfortunately introduces artifacts and increases latency because the TTS lacks the context of previous generations.

Current Speech-to-Speech Models and Their Problems

Examples include Gemini Pro, GPT-Realtime (size unknown, but likely massive), Moshi (7B), Hertz-dev (8.5B), Sesame CSM (8B), and Qwen Omni (7B).

There are a few key issues with these existing architectures:

  1. Slow Decoding and High Inference Cost: These models generally rely on RVQ (Residual Vector Quantization) audio embeddings. For models like Moshi and Sesame CSM, the system must predict 32 codebook entries for every 80ms audio frame. To decode a single spoken word, the model might require as many as 400 forward passes (see the arithmetic sketch after this list). Reducing the number of codebook entries to increase speed drastically reduces audio quality.
  2. Coherence Degradation: Adding speech context into the LLM input inherently introduces degradation into the model if not handled carefully. Recent research (Cuervo & Marxer, 2024) indicates that audio models with this architecture scale up to three orders of magnitude more slowly than pure text models. This suggests that to achieve the same level of syntactic and semantic proficiency, speech models require roughly three orders of magnitude more compute or data than text-based models.
  3. Modality Fusion Issues: Despite being trained on speech input, existing models struggle to utilize the audio modality effectively. They often fail the "Am I whispering?" or "Am I screaming?" test, indicating they are processing the text content while disregarding the vocal delivery.
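To make the decoding cost concrete, here is the back-of-the-envelope arithmetic behind the figures in point 1. It only uses the numbers quoted above (80 ms frames, 32 codebook entries per frame); actual codebook counts vary between models, so treat it as an illustration rather than a benchmark.

```python
# Rough RVQ decoding cost per second of audio, using the figures quoted above.
frame_ms = 80                        # one audio frame covers 80 ms
frames_per_second = 1000 / frame_ms  # 12.5 frames per second
codebooks_per_frame = 32             # discrete entries predicted sequentially per frame

passes_per_second = frames_per_second * codebooks_per_frame
print(f"~{passes_per_second:.0f} sequential forward passes per second of audio")  # ~400
```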

My Model

To represent audio frames, I opted for continuous embeddings instead of quantized (RVQ) ones. While RVQ architectures are bottlenecked by the need to predict multiple discrete codebook entries sequentially for every single frame, continuous embeddings allow the model to predict a complete acoustic representation in a single pass. This massive reduction in inference overhead is what makes real-time performance feasible on consumer hardware. Additionally, avoiding quantization artifacts results in much higher audio fidelity and allows for finer control over the latent space.

I employed a Rectified Flow Matching head to predict these continuous embeddings. This approach ensures the generation is not only high-quality and diverse but also computationally efficient. Finally, these embeddings, which represent a compressed mel representation, are decoded into a raw audio waveform using a HiFi-GAN vocoder.
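Below is a minimal PyTorch sketch of what a rectified flow matching head for continuous frame embeddings can look like. The class names, conditioning scheme, dimensions, and step count are my own illustrative assumptions rather than MichiAI's actual implementation; the point is to show the straight-line interpolation objective and the few-step Euler sampler that make this family of models fast at inference.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Predicts the velocity field v(x_t, t | cond) for a continuous audio frame
    embedding, conditioned on the backbone's hidden state (illustrative sketch)."""

    def __init__(self, audio_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, audio_dim),
        )

    def forward(self, x_t, t, cond):
        # t is appended as one extra scalar feature per sample
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))

def rectified_flow_loss(head, x1, cond):
    """x1: target frame embedding, cond: backbone hidden state."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.size(0), device=x1.device)    # uniform time in [0, 1)
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # straight-line interpolation
    v_target = x1 - x0                              # constant velocity of the straight path
    v_pred = head(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_frame(head, cond, audio_dim, steps: int = 4):
    """Integrate the learned ODE from noise to a frame embedding.
    Rectified flows stay close to straight paths, so a few Euler steps suffice."""
    x = torch.randn(cond.size(0), audio_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0),), i * dt, device=cond.device)
        x = x + dt * head(x, t, cond)
    return x
```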

I designed the vocoder to be lightweight and causal to allow streaming, which is critical for speech language models. Causal models are notoriously harder to train than non-causal ones and take more time to converge, so I'm not quite at ground-truth parity yet. The only way to improve this is to train for more iterations. This is low-hanging fruit: the vocoder is a big factor in perceived audio quality; it's essentially the 'finish' that makes a voice sound like a real human rather than a robot.
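For intuition, "causal" here means each output sample may depend only on past inputs, so audio can be emitted as frames arrive. A generic sketch of such a convolution (not the actual vocoder code) simply pads on the left:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples, enabling streaming output."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        # Pad only on the left so the output at time t depends on inputs <= t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)
```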

The 'Listen' head functions as a multi-modal encoder, mapping raw audio into a continuous embedding while simultaneously generating corresponding text tokens.

Predicted audio and text embeddings are looped back into the model. The system operates in full-duplex mode; at any time, the speak head and listen head can add more text embeddings and corresponding audio embeddings to the backbone.
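To make the data flow easier to picture, here is a heavily simplified per-frame loop. All module names and interfaces (listen_head, backbone.step, speak_head, vocoder) are hypothetical stand-ins for the components in Figure 1, not the real API.

```python
def duplex_step(backbone, listen_head, speak_head, vocoder, state, mic_frame):
    """One 'tick' of the full-duplex loop (illustrative sketch)."""
    # 1. Encode the incoming microphone frame into audio + text embeddings.
    user_audio_emb, user_text_emb = listen_head(mic_frame)

    # 2. Advance the backbone by one step with the user's embeddings appended.
    state, hidden = backbone.step(state, [user_audio_emb, user_text_emb])

    # 3. The speak head predicts the next text embedding and a continuous
    #    audio frame embedding from the backbone's hidden state.
    next_text_emb, next_audio_emb = speak_head(hidden)

    # 4. The model's own outputs are looped back so it "hears itself" next step.
    state = backbone.append(state, [next_text_emb, next_audio_emb])

    # 5. Decode the continuous embedding into a waveform chunk for playback.
    return state, vocoder(next_audio_emb)
```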

The model achieves a time-to-first-audio of ~75ms. This was tested on an RTX 4090 using pure Python inference without any optimizations.
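For reference, time-to-first-audio is measured as the wall-clock delay between submitting the prompt and receiving the first decoded audio chunk. A minimal sketch, assuming a hypothetical streaming generation interface:

```python
import time

def measure_ttfa_ms(model, prompt):
    """Wall-clock time until the first audio chunk arrives, in milliseconds.
    `model.stream` is a hypothetical streaming generation interface."""
    start = time.perf_counter()
    for _chunk in model.stream(prompt):        # yields decoded audio chunks
        return (time.perf_counter() - start) * 1000
    return None  # no audio was produced
```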

Solving the Degradation Problem

I addressed degradation issues through several strategies:

  1. To ensure beneficial modality fusion, I specifically shaped the audio embeddings and trained the model end-to-end, allowing the encoder to produce high-quality outputs that don't confuse the LLM backbone.
  2. Unlike other speech LLMs, user input isn't encoded solely as audio embeddings; text tokens are also added (see the sketch after this list). This saves parameters and increases coherence, since the model doesn't have to decode audio embeddings into semantic meaning. Text embeddings are information-dense compared to audio embeddings alone.
  3. High-Quality Labels: A precise match between text labels and spoken speech is crucial. I fine-tuned a Whisper model on a hand-transcribed dataset to ensure high transcription precision and formatting.
  4. I dedicated specific parameters to GPT architecture improvements, achieving surprising results on language modeling tasks (I will delve into this in future posts).
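Here is a minimal sketch of the second strategy above: the same user turn contributes both its continuous audio-frame embeddings and its text-token embeddings to the backbone input. The alignment and projection details are illustrative assumptions, not MichiAI's exact scheme.

```python
import torch

def build_backbone_inputs(audio_frame_embs, text_token_ids, text_embedding, audio_proj):
    """audio_frame_embs: (T_audio, d_audio) continuous encoder outputs.
    text_token_ids:    (T_text,) transcript tokens for the same turn.
    Both modalities are projected to the backbone width and concatenated."""
    text_embs = text_embedding(text_token_ids)   # (T_text, d_model)
    audio_embs = audio_proj(audio_frame_embs)    # (T_audio, d_model)
    # Simplest scheme: place the dense text tokens before the audio frames so the
    # backbone never has to recover semantics from audio alone.
    return torch.cat([text_embs, audio_embs], dim=0)
```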

Voice Cloning

Voice cloning is "zero-shot": the model needs only a few seconds of audio to capture the vocal timbre with good precision. To capture the specific "way of speaking" and other personal quirks, longer input audio is preferred.
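Mechanically, this amounts to conditioning on a short reference clip placed at the start of the context. A rough sketch with hypothetical interfaces:

```python
def clone_and_speak(model, reference_wav, text):
    """Zero-shot cloning sketch. `listen_head.encode`, `backbone.prefill`, and
    `generate_speech` are hypothetical names for the components described above."""
    ref_embs = model.listen_head.encode(reference_wav)  # a few seconds of the target voice
    state = model.backbone.prefill(ref_embs)            # condition the context on that speaker
    return model.generate_speech(state, text)           # subsequent speech keeps the timbre/style
```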

Coherence degradation

So far, it doesn't seem like the model has lost its 'brain': I'm seeing no noticeable dip in reasoning or perplexity compared to the original text-only SmolLM. Normally, for speech LLMs of this size you wouldn't get any intelligible speech at all, let alone maintain the reasoning capabilities of the underlying LLM. These results are currently based on internal perplexity tracking and qualitative playtesting, so the next step is to run formal evals like MMLU and HellaSwag to confirm this empirically.

Summary

| Feature | Standard (RVQ/Quantized) | MichiAI (Continuous) |
| --- | --- | --- |
| Forward passes (per 1 s of audio) | 400 | 1 |
| Audio fidelity | Bottlenecked by codebook size | High (floating-point precision) |
| Inference latency | High | Ultra-low |
| Modality fusion | Often disjointed | Utilizes modalities fully |
| Coherence degradation | Significant | None |

Pretraining Dataset

  • LibriTTS (200 hours)
  • LibriVox (4.8k hours)
  • SmolLM-corpus, Cosmopedia-v2

Dataset Preparation

I implemented the dataset generation pipeline using Luigi and transcribed the speech using a fine-tuned Whisper Large model.
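As an illustration of the shape of that pipeline, here is a stripped-down pair of Luigi tasks; the paths and the `transcribe_shard` / `list_shards` helpers are placeholders, not the actual pipeline code.

```python
import luigi

class TranscribeShard(luigi.Task):
    """Transcribe one shard of raw audio with the fine-tuned Whisper model."""
    shard = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"data/transcripts/{self.shard}.jsonl")

    def run(self):
        results = transcribe_shard(f"data/audio/{self.shard}")  # hypothetical helper
        with self.output().open("w") as f:
            for line in results:
                f.write(line + "\n")

class BuildManifest(luigi.Task):
    """Aggregate all transcribed shards into one training manifest."""

    def requires(self):
        return [TranscribeShard(shard=s) for s in list_shards()]  # hypothetical helper

    def output(self):
        return luigi.LocalTarget("data/manifest.jsonl")

    def run(self):
        with self.output().open("w") as out:
            for target in self.input():
                with target.open("r") as f:
                    out.write(f.read())
```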

To maintain the reasoning capabilities learned during text-domain pretraining, I mixed pure text samples into the dataset. I do this not only to preserve text-only comprehension but also because I want the model to handle "mixed" prompting. For example, a prompt might be half pure text and half audio. Furthermore, I want the model to be capable of generating and accepting text-only input for downstream applications. This approach is also helpful for Function Calling or Chain of Thought (CoT) generation.

For example, the model might say: "Hmm, let me think about it..." [CoT in pure text or mixed text/audio to prevent awkward silence] "...The answer is..."
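A minimal sketch of how such mixing can be done at sampling time; the 30% text ratio is purely a placeholder, not the ratio actually used.

```python
import random

def sample_training_example(speech_dataset, text_dataset, text_ratio: float = 0.3):
    """Mix pure-text samples into the speech data so the backbone keeps its
    text-only abilities and learns to handle mixed text/audio prompts."""
    if random.random() < text_ratio:
        return text_dataset.sample()   # text-only sample (e.g. Cosmopedia-v2)
    return speech_dataset.sample()     # paired audio + transcript sample
```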

| Model | Parameters | Audio Training Data |
| --- | --- | --- |
| Hertz-dev | 8.5B | 20,000,000 hours |
| Qwen-Omni | 7B+ | 8,000,000+ hours |
| Moshi | 7B | 7,000,000 hours |
| MichiAI | 540M | 5,000 hours |

Training

I trained most of the model on a single RTX 4090, using 2x RTX A6000s for the parts requiring more memory. Since my available compute and budget are highly limited, I did not train the model to full convergence.

The vocoder and speech head were pretrained separately; the full model was then trained jointly. During this joint training, the LLM backbone learned to utilize the text and audio modalities together.
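Schematically, the setup looks like the following; module names, checkpoint paths, and the learning rate are illustrative assumptions, not the actual training code.

```python
import torch

def setup_joint_training(backbone, listen_head, speech_head, vocoder):
    """Load the separately pretrained parts, then optimize everything jointly."""
    # Stage 1 artifacts: speech head and vocoder pretrained on their own objectives.
    speech_head.load_state_dict(torch.load("speech_head_pretrained.pt"))
    vocoder.load_state_dict(torch.load("vocoder_pretrained.pt"))

    # Stage 2: joint training, so the backbone learns to use both modalities together.
    joint_params = (
        list(backbone.parameters())
        + list(listen_head.parameters())
        + list(speech_head.parameters())
    )
    return torch.optim.AdamW(joint_params, lr=1e-4)  # learning rate is a placeholder
```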

Only 5k hours? Other models use millions.

The model utilizes its pretrained text knowledge surprisingly well: during joint training it extracts how to pronounce and vocalize words from the speech data and applies those patterns to the text-only dataset. Since speech patterns are not as complex as natural language itself, the model can learn how to speak from a relatively small speech dataset. This is great news, because it means the model doesn't need to see millions of hours of speech to learn to speak coherently.

Another benefit is that text-only datasets are much easier to obtain than high-quality speech datasets. This is useful when fine-tuning for instruction following or assistant-like behavior, where high-quality speech datasets are practically nonexistent.

Samples

I am presenting results from the base model, which has not been fine-tuned for instruction following.

The input audio as well as the generated audio are sampled at 24 kHz, decoded with a single forward pass of the decoder.

Audio prompt:

The end of the prompt is marked with a beep added post-generation.
After the beep, the model generates the speech autonomously.
No text is provided!

Cats are fascinating creatures that have been companions to humans for thousands of years.

Cats are fascinating creatures that have been companions to humans for thousands of years.

Forced text prompt then text and audio generation: [5 examples]

For these samples, the model was provided with the prompt text for which audio was then generated.
Put simply, the model starts in causal TTS mode and then generates the speech in full.
The text prompt is marked in bold.

Cats are fascinating creatures that have been companions to humans for thousands of years.

Cats are fascinating creatures that have been companions to humans for thousands of years.

Text prompt and audio continuation

For these samples, the model was provided with a pure text prompt, and an audio continuation was generated.

Homographs

The model correctly pronounces homographs based on context. For example, the word "lives" is pronounced differently in "Cats have nine lives" vs. "She lives in New York".

Demo

(Coming soon) Due to limited compute resources, I cannot host a live demo at this time.

Roadmap

  • Larger LLM backbone.
  • Train on a conversational dataset.
  • Launch a Hugging Face Space.
  • Integrate a multilingual dataset.
  • Open-source the model(?)

In the next post, I will focus on the listening component.

If you would like to contribute in any way or have questions, please contact me using the form here.

References

  1. Cuervo, S., & Marxer, R. (2024). Scaling Properties of Speech Language Models. arXiv preprint arXiv:2404.00685. https://arxiv.org/abs/2404.00685