Code-Along With Mega-ASR: Robust In-the-Wild Speech Recognition on Top of Qwen3-ASR-1.7B

Mega-ASR was published on Hugging Face as zhifeixie/Mega-ASR on May 19, 2026, with the companion paper at arXiv:2605.19833 (Xie, Pang, Zhang, Ye, Hu, Yan, Miao, 2026). It is licensed under Apache 2.0.

What problem it targets

Mega-ASR is built for what the authors call in-the-wild squared audio: real-world recordings that combine multiple distortions at once, including noise, reverberation, clipping, band limiting, and overlapping speakers. Under these conditions, standard ASR models tend to fail outright rather than merely lose accuracy: they produce empty outputs, omissions, repetitions, or hallucinated text.

Architecture

The system has three components: the Qwen3-ASR-1.7B foundation model as the backbone, a set of Mega-ASR adaptation weights on top, and an audio quality router that decides per input whether to use the robust Mega-ASR path or the cleaner base recognition path. The router threshold defaults to 0.5 and is configurable. The training pipeline uses acoustic-to-semantic supervised fine-tuning, exposing the model to progressively harder examples so it learns to recover both local acoustic detail and sentence-level semantics under degradation.
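The routing decision can be pictured with a minimal sketch. It assumes the router emits a score in [0, 1] where a higher value means more degraded audio. The source states only that the threshold defaults to 0.5 and that a lower threshold sends more inputs through the robust path, so the score semantics, the function name choose_route, and the route labels here are hypothetical and not the Mega-ASR API.

```python
def choose_route(degradation_score: float, threshold: float = 0.5) -> str:
    """Pick a recognition path from a router score.

    Assumption (not confirmed by the source): the score lies in [0, 1]
    and a higher score means the audio looks more degraded.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    # Scores at or above the threshold go to the robust path, so lowering
    # the threshold routes more inputs there.
    return "robust" if degradation_score >= threshold else "base"


print(choose_route(0.7))                 # uses the default threshold of 0.5
print(choose_route(0.3))                 # below the default threshold
print(choose_route(0.3, threshold=0.2))  # a lower threshold flips the decision
```

Under this assumption, tuning the threshold is a trade-off: a lower value protects more borderline recordings at the cost of sending some clean audio through the adapted weights.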

Install

git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt

Minimal inference call

The official quickstart from the model card:

from MegaASR.model.megaASR import MegaASR

model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)

result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
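To run the same call over many recordings, a small wrapper is enough. The sketch below is not part of Mega-ASR: transcribe_many is a hypothetical helper that accepts any callable with the same call shape as model.infer in the quickstart, and it makes no assumption about the structure of the returned result.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def transcribe_many(
    infer: Callable[..., Any],
    paths: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call `infer` on each path and collect results and errors separately."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            # Same call shape as the quickstart: infer(path, return_route=True)
            results[path] = infer(path, return_route=True)
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With the model loaded as above, the call would look like results, errors = transcribe_many(model.infer, ["a.wav", "b.wav"]). Keeping errors separate makes it easy to see which recordings never produced a transcript at all.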

Defaults to know

  • Max tokens default to 256

  • Decoding defaults to greedy

  • Router threshold defaults to 0.5; a lower value sends more inputs through the robust path

  • Evaluation uses WER for English and whitespace-tokenized languages, CER for Chinese and other character-based languages
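For readers who want to score their own transcripts, here is a minimal sketch of the two metrics named above, using the standard definition: edit distance between reference and hypothesis divided by the reference length. It is not the project's evaluation script; any text normalization Mega-ASR applies before scoring (casing, punctuation) is not covered here, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference contains no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits, whitespace ignored."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference contains no characters")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("the cat sat", "the cat sit"))  # one substitution over three words
print(cer("你好世界", "你好世"))            # one deletion over four characters
```

Note that an empty hypothesis scores a WER of 1.0 under this definition, while a hallucinated or repetitive one can exceed 1.0, because insertions are unbounded.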

Why this matters

Most ASR benchmarks evaluate on relatively clean speech. The Mega-ASR contribution is a system that explicitly trains for compound-degradation audio and ships an adaptive router so you do not pay the robustness tax on clean inputs. If you have ever shipped voice into a product and discovered that the failure modes are dominated by ten percent of recordings made in cars, kitchens, or call centers, this is the kind of framework worth piloting.

Things to verify before relying on it

  • The model card mainly summarizes evaluation setup and includes result figures, while the arXiv paper carries the detailed benchmark tables and methodology. Read the paper before claiming any specific WER or CER number.

  • The audio_quality_router checkpoint and the Qwen3-ASR-1.7B weights must be downloaded into the ckpt directory before inference will work; the quickstart expects both under ckpt/Mega-ASR/

  • The framework is research-grade; expect to vendor it as code rather than as a pinned package
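A quick preflight check can catch missing weights before the model is constructed. This is a minimal sketch, assuming the directory layout implied by the quickstart paths (ckpt/Mega-ASR/Qwen3-ASR-1.7B and ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt); missing_checkpoints is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the quickstart arguments; adjust if your layout differs.
REQUIRED = (
    "Mega-ASR/Qwen3-ASR-1.7B",
    "Mega-ASR/audio_quality_router/best_acc_model.pt",
)


def missing_checkpoints(ckpt_root: str = "ckpt") -> List[str]:
    """Return the required checkpoint paths that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    return [str(root / rel) for rel in REQUIRED if not (root / rel).exists()]


missing = missing_checkpoints()
if missing:
    print("Download these before running inference:", missing)
```

The check only confirms that the paths exist; it does not validate the contents of the checkpoints.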
