Code-Along With Mega-ASR: Robust In-the-Wild Speech Recognition on Top of Qwen3-ASR-1.7B
Mega-ASR was published on Hugging Face as zhifeixie/Mega-ASR on May 19, 2026, with the companion paper at arXiv:2605.19833 (Xie, Pang, Zhang, Ye, Hu, Yan, Miao, 2026). It is licensed under Apache 2.0.
What problem it targets
Mega-ASR is built for what the authors call in-the-wild squared audio: real-world recordings that combine multiple distortions at once, including noise, reverberation, clipping, band limiting, and overlapping speakers. These conditions are where standard ASR models tend to fail in the worst possible way, producing empty outputs, omissions, repetitions, or hallucinated text rather than just degraded accuracy.
Architecture
The system has three components: the Qwen3-ASR-1.7B foundation model as the backbone, a set of Mega-ASR adaptation weights on top, and an audio quality router that decides per input whether to use the robust Mega-ASR path or the cleaner base recognition path. The router threshold defaults to 0.5 and is configurable. The training pipeline uses acoustic-to-semantic supervised fine-tuning, exposing the model to progressively harder examples so it learns to recover both local acoustic detail and sentence-level semantics under degradation.
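To make the routing step concrete, here is a minimal sketch of a threshold-based decision. It is not the Mega-ASR implementation: the function name, the path labels, and the idea that the router emits a score between 0 and 1 where higher means more degraded audio are all assumptions made for illustration. The only details taken from the post are the 0.5 default and that a lower threshold sends more inputs down the robust path.

```python
from dataclasses import dataclass

# Hypothetical labels for the two recognition paths (not names from the repo).
ROBUST_PATH = "mega-asr"
BASE_PATH = "base"


@dataclass(frozen=True)
class RouteDecision:
    path: str
    score: float
    threshold: float


def choose_route(degradation_score: float, threshold: float = 0.5) -> RouteDecision:
    """Pick a recognition path from a router score.

    Assumption: the score lies in [0, 1] and a higher value means the audio
    looks more degraded, so lowering the threshold routes more inputs to the
    robust path.
    """
    if not 0.0 <= degradation_score <= 1.0:
        raise ValueError("degradation_score must be in [0, 1]")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    path = ROBUST_PATH if degradation_score >= threshold else BASE_PATH
    return RouteDecision(path=path, score=degradation_score, threshold=threshold)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_route(score).path)
```

Under these assumptions, a clip scored 0.3 stays on the base path at the default threshold and moves to the robust path once the threshold drops to 0.2.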
Install
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR
conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
Four-line inference call
The official quickstart from the model card:
from MegaASR.model.megaASR import MegaASR
model = MegaASR(
model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
routing_enabled=True,
)
result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
Defaults to know
Max tokens default to 256
Decoding defaults to greedy
Router threshold defaults to 0.5; a lower value sends more inputs through the robust path
Evaluation uses WER for English and whitespace-tokenized languages, CER for Chinese and other character-based languages
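For readers who have not computed these metrics by hand, the sketch below shows one common way to define WER and CER with a plain Levenshtein distance. It is an illustration, not the Mega-ASR evaluation script: the function names are hypothetical, and it applies no text normalization (casing, punctuation, number formatting), which real evaluations usually do and which can change the numbers noticeably.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, deletions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    hyp = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("turn the lights off", "turn lights of"))
    print(cer("你好世界", "你好世"))
```

Note that both rates can exceed 1.0 when the hypothesis contains many insertions, which is exactly what repetition and hallucination failures tend to produce.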
Why this matters
Most ASR benchmarks evaluate on relatively clean speech. The Mega-ASR contribution is a system that explicitly trains for compound-degradation audio and ships an adaptive router so you do not pay the robustness tax on clean inputs. If you have ever shipped voice into a product and discovered that the failure modes are dominated by ten percent of recordings made in cars, kitchens, or call centers, this is the kind of framework worth piloting.
Things to verify before relying on it
The model card mainly summarizes evaluation setup and includes result figures, while the arXiv paper carries the detailed benchmark tables and methodology. Read the paper before claiming any specific WER or CER number.
The audio_quality_router checkpoint and Qwen3-ASR-1.7B weights must be downloaded into the ckpt directory before inference works
The framework is research-grade; expect to vendor it as code rather than as a pinned package
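A small pre-flight check can catch the missing-weights case before the model constructor fails. The sketch below assumes the two checkpoint locations used in the quickstart call, relative to the directory you run from; the helper name is hypothetical and nothing here comes from the Mega-ASR codebase.

```python
from pathlib import Path

# Relative locations taken from the quickstart call; adjust if your layout differs.
REQUIRED_CHECKPOINTS = (
    "Qwen3-ASR-1.7B",
    "audio_quality_router/best_acc_model.pt",
)


def missing_checkpoints(ckpt_root: str = "ckpt/Mega-ASR") -> list[str]:
    """Return the required checkpoint paths that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    return [str(root / rel) for rel in REQUIRED_CHECKPOINTS
            if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Download these before running inference:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoints are present.")
```

Running this before the first inference call turns a potentially opaque load error into an explicit list of what still needs downloading.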



