Speech Recognition / Low-Resource ASR
Fine-tuning Nepali ASR on 157 hours for $7
I fine-tuned Qwen3-ASR-1.7B on a single public Nepali speech dataset, evaluated against 8 open-source models across 3 held-out benchmarks, and found a dtype bug that invalidated prior Whisper evaluations for Nepali.
The problem with single-dataset ASR benchmarks
Most Nepali ASR models report WER on a single dataset. That number can look good while hiding severe weaknesses on other speech styles. A model that scores 5% on one dataset but 70% on another is not a 5% WER model.
I evaluated every model on three datasets with different recording conditions, speaker counts, and speech styles. None of these datasets were used during training. Cross-dataset consistency is the metric that matters.
Macro-average WER across all 3 datasets (lower is better; anomalous OpenSLR-43 scores from contaminated models excluded):

- IndicVoices-R: 2,060 speakers, natural conversational audio. The most realistic test of ASR quality. We beat MMS-1B by 6.6 WER points.
- OpenSLR-43: TTS-generated speech. We beat MMS-1B by 9.1 WER points. Models with anomalously low scores here are excluded from this comparison.
- FLEURS: Studio-quality recordings. MMS-1B wins by 3.4 WER points. This is MMS's strongest domain, given its massive multilingual pretraining.

Averaged across all 3 datasets, our model achieves the lowest WER among all tested models. Cross-dataset consistency matters more than any single benchmark.
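Concretely, the macro-average is the unweighted mean of the three per-dataset WERs, not a pooled WER over all samples. A minimal sketch using our model's numbers from the table below:

```python
# Per-dataset WERs for our model, taken from the table below (in %)
per_dataset = {"FLEURS": 37.0, "IndicVoices-R": 55.8, "OpenSLR-43": 31.4}

# Macro-average = unweighted mean across datasets, so no single
# benchmark dominates the headline number
macro_wer = sum(per_dataset.values()) / len(per_dataset)
print(f"Macro-average WER: {macro_wer:.1f}%")  # 41.4%
```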
Cross-dataset evaluation
100 samples per dataset, WER % (lower is better)
| Model | FLEURS ↓ | IndicVoices-R ↓ | OpenSLR-43 ↓ |
|---|---|---|---|
| Qwen3-ASR-Nepali (ours) | 37.0% | 55.8% | 31.4% |
| Meta MMS-1B (npi) | 33.6% | 62.4% | 40.5% |
| Whisper large-v3 | 94.0% | 96.7% | 105.8% |
| Whisper-small-Nepali (amitpant7) | 64.5% | 77.7% | 2.3%* |
| wav2vec2-xlsr-300m (shniranjan) | 43.3% | 59.5% | 33.9%* |
| wav2vec2-nepali (anish) | 54.3% | 73.7% | 4.6%* |
| wav2vec2-xlsr (gagan) | 70.8% | 86.1% | 5.0%* |
| Qwen3-ASR-0.6B Base | 116.0% | 112.5% | 100.4% |
*Models marked with * show anomalously low OpenSLR-43 WER despite high WER on FLEURS and IndicVoices-R, suggesting dataset-specific overfitting or training-set overlap. See contamination analysis below.
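Every cell above comes from the same protocol: sample 100 utterances, transcribe, score with WER. A minimal sketch assuming the jiwer library and a hypothetical `transcribe()` wrapper around each model's greedy-decoding inference (the fixed seed here is illustrative):

```python
import random
from jiwer import wer  # pip install jiwer

def evaluate(model, dataset, n=100, seed=0):
    """WER (%) on a fixed random subset of `dataset`, a sequence of
    dicts with 'audio' and 'text' keys."""
    subset = random.Random(seed).sample(list(dataset), n)
    refs = [ex["text"] for ex in subset]
    # transcribe() is a hypothetical per-model wrapper; all models in
    # the table used greedy decoding with default settings
    hyps = [transcribe(model, ex["audio"]) for ex in subset]
    return 100 * wer(refs, hyps)
```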
Evaluation datasets
Each dataset tests a different failure mode. A model that only works on clean read speech will fail on real conversational audio. A model that only works on synthetic speech may have memorized its training data.
FLEURS
Studio-quality recordings with clear pronunciation. The easiest benchmark and the most commonly reported in ASR papers.
IndicVoices-R
Natural speech with hesitations, interruptions, background noise, and diverse accents. The hardest and most realistic benchmark.
OpenSLR-43
Machine-generated speech. Tests whether models generalize beyond human recordings. Several models show anomalously low WER here.
OpenSLR-43 contamination analysis
Three models show WER below 5% on OpenSLR-43 while scoring 54-86% on FLEURS and IndicVoices-R. A model cannot legitimately generalize at 2-5% WER on one speech dataset while failing at 54-86% on others of comparable difficulty.
OpenSLR-43 is TTS-generated speech, making it more likely these models were trained on the same synthetic data or near-identical distributions. The cross-dataset performance gap itself is the evidence. Including these scores in a generalization comparison would be more misleading than excluding them.
| Model | FLEURS | IVR | SLR-43 | Gap (FLEURS / SLR-43) |
|---|---|---|---|---|
| amitpant7 whisper-small | 64.5% | 77.7% | 2.3% | 28x |
| wav2vec2-nepali (anish) | 54.3% | 73.7% | 4.6% | 12x |
| wav2vec2-xlsr (gagan) | 70.8% | 86.1% | 5.0% | 14x |
A 12-28x gap between OpenSLR-43 and other datasets indicates these scores reflect dataset memorization, not general Nepali ASR capability.
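The gap column is simply the ratio of each model's FLEURS WER to its OpenSLR-43 WER, as sketched below; the 10x cutoff is an illustrative threshold, not a statistical test:

```python
# (FLEURS WER, OpenSLR-43 WER) pairs from the table above, in %
suspects = {
    "amitpant7 whisper-small": (64.5, 2.3),
    "wav2vec2-nepali (anish)": (54.3, 4.6),
    "wav2vec2-xlsr (gagan)": (70.8, 5.0),
}

for name, (fleurs, slr43) in suspects.items():
    gap = fleurs / slr43  # how many times worse the model gets off-distribution
    flag = "contamination suspect" if gap > 10 else "plausible"  # illustrative cutoff
    print(f"{name}: {gap:.0f}x -> {flag}")
```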
Finding: Whisper float16 dtype bug in Nepali evaluation
Prior benchmarks reported Whisper large-v3 scoring 100% WER on Nepali, suggesting the model could not produce Nepali output at all. This turned out to be a bug, not a real result.
The standard HuggingFace Whisper inference pattern loads models in float16 for GPU efficiency. But the WhisperProcessor returns float32 input features. This causes a dtype mismatch RuntimeError in the encoder's conv1d layer.
If the evaluation script catches exceptions silently (a common pattern: `except: text = ""`), every sample produces an empty prediction. Empty predictions against non-empty references give exactly 100% WER and 100% CER.
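A minimal reproduction of the failure mode, assuming the standard transformers Whisper API (the placeholder audio clip stands in for real evaluation data):

```python
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,  # the GPU-efficiency choice that triggers the bug
).to("cuda")

audio = np.zeros(16000, dtype=np.float32)  # placeholder 1 s clip at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
features = inputs.input_features.to("cuda")  # still float32: the processor ignores model dtype

try:
    ids = model.generate(features)  # RuntimeError: float32 input vs float16 conv1d weights
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]
except Exception:
    text = ""  # silent fallback: every sample decodes to "", so WER is exactly 100%
```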
The fix
Loading the model in float32 instead of float16 resolves the dtype mismatch. After fixing this, Whisper large-v3 produces Nepali output but still has high WER (94% on FLEURS) due to word boundary and Devanagari spelling issues.
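The fix, continuing the snippet above (alternatively, keep float16 weights and cast the features to the model's dtype):

```python
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float32,  # match the processor's float32 features
).to("cuda")
ids = model.generate(features)  # decodes normally; no more silent empty strings

# Alternative: keep float16 weights and downcast the features instead
# features = inputs.input_features.to("cuda", dtype=torch.float16)
```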
Any Nepali ASR benchmark that reported Whisper results using float16 loading likely has invalid numbers for that model.
Data efficiency
Meta MMS-1B was pretrained on 500K+ hours across 1,100+ languages. This model was fine-tuned on 157 hours of a single language on one A100 GPU for approximately $7.
On spontaneous speech (IndicVoices-R), the focused single-language fine-tune outperformed the multilingual model by 6.6 WER points. Domain-matched training data and language-specific fine-tuning can compensate for orders of magnitude less compute.
The training data (OpenSLR-54) is entirely read speech. Despite never seeing spontaneous conversational audio during training, the model generalized well enough to beat MMS on IndicVoices-R. This suggests the Qwen3-ASR architecture handles domain shift better than expected.
Results (ours vs MMS-1B)
- IndicVoices-R: 55.8% vs 62.4% (ours wins by 6.6 pts)
- OpenSLR-43: 31.4% vs 40.5% (ours wins by 9.1 pts)
- FLEURS: 37.0% vs 33.6% (MMS wins by 3.4 pts)
Training setup
The training data is OpenSLR-54: 157 hours of Nepali read speech with transcripts, approximately 37,000 utterances. A 95/5 train/validation split was used with seed 42. Common Voice Nepali was originally planned but was removed from HuggingFace by Mozilla in October 2025, so the final training used OpenSLR-54 only.
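A sketch of the split, assuming the corpus is loaded as a HuggingFace `datasets` object (the Hub path and loading details are illustrative; OpenSLR-54 may need a local loading step):

```python
from datasets import load_dataset, Audio

# Illustrative loading path for OpenSLR-54 (~37k Nepali read-speech utterances)
ds = load_dataset("openslr", "SLR54", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# 95/5 train/validation split with the seed reported above
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```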
IndicVoices-R, OpenSLR-43, and FLEURS were used exclusively for evaluation. No samples from any evaluation dataset appeared in training.
Limitations
- Sample size: 100 samples per dataset gives directionally valid but noisy numbers. Full test-set evaluation would strengthen the results.
- Text normalization: WER can vary with punctuation, numeral formatting, Unicode normalization, and whitespace handling. No cross-model normalization was applied (a minimal sketch of such a pipeline follows this list).
- Decoding config: All models used greedy decoding with default settings. Beam search or temperature tuning could change individual results.
- Missing baselines: IndicConformer (AI4Bharat) errored during evaluation. Other Nepali-specific models may exist that were not tested.
- Base model contamination: The Qwen3-ASR-1.7B base model was pretrained on large-scale data that may include some evaluation data. This applies equally to MMS and Whisper.
- Error rate: 37-56% WER is not production-ready. This model needs more diverse training data, code-switching coverage, and noise robustness before real-world deployment.
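For reference, a minimal sketch of the kind of cross-model normalization that was not applied, assuming Devanagari text (the exact rules are an illustrative design choice, not this project's pipeline):

```python
import re
import unicodedata

def normalize_nepali(text: str) -> str:
    # Canonicalize Unicode: some Devanagari matra/nukta sequences have
    # multiple encodings, which inflates WER differences across models
    text = unicodedata.normalize("NFC", text)
    # Drop danda (U+0964), double danda (U+0965), and common punctuation
    text = re.sub(r"[\u0964\u0965,.;:!?\"'()\u2018\u2019\u201C\u201D-]", " ", text)
    # Collapse whitespace so token boundaries are stable
    return re.sub(r"\s+", " ", text).strip()
```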
Published model
The fine-tuned model is published on HuggingFace. The benchmark script, training code, and evaluation pipeline are open-source.
What would improve this
- Training on all three datasets (OpenSLR-54 + OpenSLR-43 + IndicVoices-R = ~267 hours) instead of just one.
- Full test-set evaluation instead of 100-sample subsets.
- A text normalization pipeline standardized across all models.
- Additional baselines: IndicConformer, faster-whisper, WhisperX.
- Noise augmentation and code-switching data for robustness.