LLM Tokenizers / Low-Resource NLP
Measuring and reducing the Nepali token tax in modern LLMs
I benchmarked 17 model tokenizers, assembled a 7.49GB Nepali tokenizer-training corpus, trained a Devanagari-optimized tokenizer, extended production model tokenizers, and ran a Qwen3-4B CPT/SFT proof of concept.
The problem
Tokenizers are part of the cost model of LLMs. If Nepali takes more tokens than English for the same amount of content, Nepali users pay more, fit less into context windows, and spend more compute on prompt processing.
The benchmark measured Nepali and English tokens per word, then reported the ratio as a Nepali token tax. A 3x tax means the same content takes roughly three times as many tokens in Nepali as it does in English.
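The measurement itself is simple. Here is a minimal sketch of the tokens-per-word ratio using tiktoken's o200k_base; the actual benchmark aggregates over the full CC-100 split rather than single sentences.

```python
# Minimal sketch of the token-tax measurement: tokens per whitespace word, Nepali vs. English.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the OpenAI o200k tokenizer from the table below

def tokens_per_word(text: str) -> float:
    words = text.split()
    return len(enc.encode(text)) / max(len(words), 1)

nepali = "नेपालको राजधानी काठमाडौं हो"            # "Kathmandu is the capital of Nepal"
english = "Kathmandu is the capital of Nepal"
tax = tokens_per_word(nepali) / tokens_per_word(english)
print(f"NE {tokens_per_word(nepali):.2f} tok/word, EN {tokens_per_word(english):.2f} tok/word, tax {tax:.1f}x")
```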
Nepali token tax by tokenizer
Lower is better.

| Model | Tokenizer | Tax | NE tok/word | EN tok/word | Devanagari tokens |
|---|---|---|---|---|---|
| Gemma 4 | Gemma BPE | 2.0x | 2.52 | 1.26 | 13,754 |
| OpenAI o200k | o200k_base | 2.2x | 2.68 | 1.22 | 3,985 |
| OpenAI harmony | o200k_harmony | 2.2x | 2.68 | 1.22 | 3,985 |
| LLaMA 4 | BPE | 2.4x | 2.99 | 1.23 | 2,696 |
| Qwen3-4B Nepali extended | Qwen2 BPE + 15K Nepali tokens | 2.5x | 3.18 | 1.25 | 15,256 |
| Mistral Small 4 | BPE | 2.6x | 3.30 | 1.27 | 1,569 |
| Gemma 2 | Gemma BPE | 2.6x | 3.21 | 1.25 | 1,516 |
| LLaMA 3 | BPE | 3.0x | 3.72 | 1.24 | 1,018 |
| DeepSeek V4 | BPE | 3.3x | 4.10 | 1.24 | 303 |
| Kimi K2.6 | tiktoken model | 3.3x | 3.99 | 1.23 | 318 |
| Kimi K2 | tiktoken model | 3.3x | 3.99 | 1.23 | 318 |
| Qwen 3.5 tokenizer | Qwen2 BPE | 3.7x | 4.65 | 1.25 | 959 |
| GLM-5 | BPE | 4.8x | 5.89 | 1.24 | 138 |
| Qwen 3 base | Qwen2 BPE | 4.9x | 6.10 | 1.25 | 71 |
| Mistral v0.3 | Metaspace BPE | 5.0x | 6.79 | 1.35 | 44 |
| Phi-3.5 | ByteLevel BPE | 5.0x | 7.05 | 1.40 | 39 |
| GLM-4 | SentencePiece model | 5.3x | 6.57 | 1.24 | 27 |
| Phi-4 | GPT2/ByteLevel BPE | 5.7x | 7.17 | 1.25 | 27 |
One sentence, three tokenizers
नेपालको राजधानी काठमाडौं हो ("Kathmandu is the capital of Nepal")
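A comparison like this comes from a small loop over model tokenizers. Here is a hedged sketch with Hugging Face tokenizers; the model IDs are illustrative, and some of these repos are gated.

```python
# Sketch: encode the same sentence with several model tokenizers and compare token counts.
from transformers import AutoTokenizer

sentence = "नेपालको राजधानी काठमाडौं हो"
model_ids = ["google/gemma-2-2b", "meta-llama/Meta-Llama-3-8B", "Qwen/Qwen3-4B"]  # illustrative
for model_id in model_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(sentence, add_special_tokens=False)
    print(f"{model_id}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")
```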
What I built
I assembled, cleaned, filtered, and deduplicated a 7.49GB Nepali tokenizer-training corpus from CulturaX, Sangraha, CC-100, and publicly available Nepali book/document text. Then I trained SentencePiece BPE tokenizers at 32K, 48K, and 64K vocab sizes. The 32K model reached 1.34 tokens per Nepali word on the benchmark split.
From that tokenizer, I selected high-value Nepali pieces and extended several existing model tokenizers. Tokenizer extension alone improves encoding efficiency, but the model still needs continued pretraining to learn the new token embeddings.
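A sketch of the extension step, assuming the 32K SentencePiece model from above and a Hugging Face target tokenizer; the file names, the Devanagari filter, and the piece-selection heuristic here are mine, not the exact pipeline.

```python
# Sketch: pull Devanagari pieces from our SentencePiece model and add the missing ones
# to an existing model tokenizer. The model must later resize and retrain its embeddings.
import sentencepiece as spm
from transformers import AutoTokenizer

def has_devanagari(s: str) -> bool:
    return any("\u0900" <= ch <= "\u097F" for ch in s)

sp = spm.SentencePieceProcessor(model_file="nepali_bpe_32k.model")
pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
candidates = [p.replace("\u2581", " ") for p in pieces if has_devanagari(p)]  # metaspace -> space

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
added = tok.add_tokens([p for p in candidates if p not in tok.get_vocab()])
print(f"added {added} tokens; call model.resize_token_embeddings(len(tok)) before training")
```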
Data journey
The corpus was not hand-authored. It was assembled from existing Nepali sources, then normalized into one tokenizer-training file. That pipeline matters because tokenizer quality is mostly decided before training starts: source mix, script filtering, Unicode handling, and deduplication all affect the final vocabulary.
- CulturaX Nepali: streamed from Hugging Face and capped around 800M cleaned characters (a minimal streaming sketch follows this list).
- Sangraha verified Nepali: streamed from ai4bharat/sangraha and capped around 500M cleaned characters.
- CC-100 Nepali: loaded from a local CC-100 Nepali archive and filtered for usable documents.
- Nepali books/documents: publicly available Nepali book and document text from the local corpus folder.
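The streaming sketch below is an assumption about how the web sources could be pulled; the dataset IDs, configs, and character cap are based on the public hubs, while the local CC-100 archive and book text were read from disk instead.

```python
# Sketch: stream a Hugging Face corpus and stop once a rough character budget is reached.
from datasets import load_dataset

def take_chars(stream, char_budget: int, text_key: str = "text"):
    total = 0
    for row in stream:
        yield row[text_key]
        total += len(row[text_key])
        if total >= char_budget:
            break

# CulturaX Nepali, streamed and capped at ~800M characters; the Sangraha pull is analogous.
culturax = load_dataset("uonlp/CulturaX", "ne", split="train", streaming=True)
with open("culturax_ne_raw.txt", "w", encoding="utf-8") as out:
    for doc in take_chars(culturax, 800_000_000):
        out.write(doc.replace("\n", " ") + "\n")
```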
Cleaning and filtering
Before tokenizer training, every source passed through the same normalization and filtering path. The goal was to keep real Nepali text, preserve Devanagari-specific joiners, and remove repeated paragraphs before the tokenizer saw the corpus.
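A sketch of that per-document path follows; the thresholds and dedup granularity are illustrative rather than the exact values used.

```python
# Sketch: NFC-normalize, require a minimum Devanagari share, and drop exact-duplicate paragraphs.
# ZWJ/ZWNJ (U+200D/U+200C) survive NFC, so Devanagari conjunct joiners are preserved.
import hashlib
import unicodedata

seen: set[str] = set()

def clean_document(text: str, min_devanagari_ratio: float = 0.6) -> str | None:
    text = unicodedata.normalize("NFC", text)
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None
    devanagari = sum(1 for c in chars if "\u0900" <= c <= "\u097F")
    if devanagari / len(chars) < min_devanagari_ratio:
        return None                       # mostly non-Nepali content: drop the document
    kept = []
    for para in text.split("\n"):
        para = para.strip()
        digest = hashlib.md5(para.encode("utf-8")).hexdigest()
        if para and digest not in seen:   # exact paragraph-level dedup across the corpus
            seen.add(digest)
            kept.append(para)
    return "\n".join(kept) or None
```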
Tokenizer training sweep
I trained multiple SentencePiece BPE sizes instead of assuming a single vocabulary size. The 32K tokenizer was already close to the larger 48K and 64K variants, so I used it as the source of candidate Nepali pieces for model-tokenizer extension.
| Tokenizer | Vocab | Nepali tok/word |
|---|---|---|
| Nepali BPE (ours) | 32K | 1.34 |
| Nepali BPE (ours) | 48K | 1.29 |
| Nepali BPE (ours) | 64K | 1.26 |
| NepaliBPE baseline | 50K | 1.28 |
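The sweep above can be reproduced with a small loop over vocabulary sizes; the SentencePiece flags in this sketch are plausible defaults rather than the exact training configuration.

```python
# Sketch: train SentencePiece BPE tokenizers at three vocabulary sizes on the cleaned corpus.
import sentencepiece as spm

for vocab_size in (32000, 48000, 64000):
    spm.SentencePieceTrainer.train(
        input="nepali_corpus_clean.txt",
        model_prefix=f"nepali_bpe_{vocab_size // 1000}k",
        model_type="bpe",
        vocab_size=vocab_size,
        character_coverage=0.9995,      # keep rare Devanagari conjuncts
        byte_fallback=True,             # unseen characters fall back to bytes
        input_sentence_size=2_000_000,  # sample sentences instead of loading 7.49GB at once
        shuffle_input_sentence=True,
    )
```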
Tokenizer extension results
Measured on the local Nepali CC-100 benchmark split:

- Phi-4: 51.9% fewer tokens per word
- Qwen 3.5 tokenizer: 44.3% fewer tokens per word
- DeepSeek V4: 37.3% fewer tokens per word
- Kimi K2.6: 35.5% fewer tokens per word
End-to-end model experiment
For the full model experiment, I used Qwen3-4B. I added Nepali tokens, initialized the new embeddings from their base-token decompositions, ran LoRA continued pretraining, then followed with Nepali supervised fine-tuning.
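The embedding initialization follows the usual decomposition heuristic; the sketch below assumes mean-pooling over the base pieces, and untied output embeddings would need the same treatment.

```python
# Sketch: initialize each new token's embedding as the mean of the embeddings of the
# base-tokenizer pieces that the new token decomposes into.
import torch

def init_new_token_embeddings(model, base_tok, extended_tok):
    old_size = len(base_tok)
    model.resize_token_embeddings(len(extended_tok))
    emb = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for new_id in range(old_size, len(extended_tok)):
            text = extended_tok.convert_tokens_to_string(
                [extended_tok.convert_ids_to_tokens(new_id)]
            )
            piece_ids = base_tok.encode(text, add_special_tokens=False)
            if piece_ids:
                emb[new_id] = emb[piece_ids].mean(dim=0)
    # If input and output embeddings are untied, mirror the same init into get_output_embeddings().
```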
The continued-pretraining mix included Nepali text plus an English component to reduce catastrophic forgetting. The goal was to teach the model the new Nepali token embeddings without wiping out the base model's general-language behavior.
Verified local result
On the 2,000-document Nepali benchmark split, the saved final tokenizer reduced the token count from 2,724,999 to 1,415,539 (a 48.1% reduction) compared with the local Qwen3 tokenizer.
CPT setup
- 3,000 steps on packed 2,048-token sequences.
- 217,769 packed sequences, 445,990,912 training tokens.
- LoRA rank 64, alpha 128, dropout 0.05 (config sketched below).
- 132.1M LoRA params plus 38.4M new-token params.
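A peft configuration consistent with those numbers might look roughly like this. The target modules are typical for Qwen3-style blocks, and modules_to_save is a simpler stand-in for training only the newly added embedding rows.

```python
# Sketch of the LoRA CPT setup: rank 64, alpha 128, dropout 0.05, with embeddings kept trainable
# so the new Nepali token rows can learn. Module names are assumptions for a Qwen3-style model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # model: the resized Qwen3-4B from the previous step
model.print_trainable_parameters()
```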
CPT loss
Training loss dropped sharply early, then stabilized around the low 1.5s as the cosine schedule decayed.
SFT run
- 1,500 steps on 50K Nepali instruction examples.
- Finished in 2h 46m at roughly 6.6s per step.
- Observed loss moved from about 1.50 to the low 1.1s.
- Final aggregate train loss reported as 1.16.
SFT loss progression
SFT started from the CPT adapter and moved the model toward instruction-following behavior in Nepali. The loss curve is not a downstream benchmark, but it is useful evidence that the run converged instead of diverging.
Quantitative evaluation
Evaluated on 500 held-out CC-100 documents not used in training. BPC (bits-per-character) is the fair comparison metric across different tokenizers because it normalizes by actual text length, not token count.
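For reference, BPC is just total cross-entropy in bits divided by character count; a minimal sketch (the example numbers are illustrative, not results):

```python
# Sketch: bits-per-character from a model's summed token-level negative log-likelihood (in nats).
import math

def bits_per_character(total_nll_nats: float, total_chars: int) -> float:
    return total_nll_nats / (total_chars * math.log(2))

# Example: mean loss of 1.5 nats/token over 1,400,000 tokens covering 5,600,000 characters.
print(bits_per_character(1.5 * 1_400_000, 5_600_000))   # ~0.54 BPC
```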
Generation comparison
Greedy decoding on the same prompts for both models; in each pair below, the base model's output comes first, then the tuned model's.

Base: नेपालको राजधानी केको छ? यो प्रश्न छोटो छ। तपाईंको अनुसार, नेपालको राजधानी लखनऊ हो। ("What is the capital of Nepal? This question is short. According to you, the capital of Nepal is Lucknow.")
Tuned: नेपालको राजधानी काठमाडौँ हो। काठमाडौँ उपत्यका नेपालको राजधानी काठमाडौँ उपत्यकाको केन्द्र हो। ("The capital of Nepal is Kathmandu. The Kathmandu Valley is the centre of the Kathmandu Valley, Nepal's capital.")

Base: सगरमाथा जम्मू-काश्मीरको एक उच्च तालाको नाम हो। ("Sagarmatha is the name of a high lake in Jammu and Kashmir.")
Tuned: सगरमाथाको उचाइ 8,848 मिटर (29,029 फिट) रहेको आधिकारिक रूपमा मान्यता प्राप्त गरिएको छ। ("Sagarmatha's height of 8,848 metres (29,029 feet) is officially recognized.")

Base: दशैं छोटकरीमा वर्णन गरेको छ जसमा छोटकरी छोटकरी छोटकरी छोटकरी... ("Dashain is described in brief, in which briefly briefly briefly briefly...")
Tuned: दशैं हिन्दु धर्मका अनुयायीहरूले मनाउने महान् उत्सव हो। यो वर्षभरिको दुःख र सङ्घर्षबाट छुटकारा पाउन र आनन्द र शान्ति प्राप्त गर्ने उत्सव हो। ("Dashain is a great festival celebrated by followers of the Hindu religion. It is a festival for finding relief from the year's sorrows and struggles and attaining joy and peace.")
Downstream task scoring
Nepali script ratio measures what fraction of the response is Devanagari. Higher means the model stays in Nepali instead of drifting to English or Hindi.
| Task | Base NE ratio | Tuned NE ratio | Token F1 |
|---|---|---|---|
| QA (Constitution) | 1.00 | 0.997 | 0.488 |
| QA (Sagarmatha) | 0.00 | 0.972 | 0.722 |
| Summarization | 0.848 | 0.987 | — |
| Translation (1) | 0.211 | 0.764 | — |
| Translation (2) | 0.112 | 0.719 | — |
| Instructions | 0.955 | 0.975 | — |
| Creative writing | 0.421 | 0.972 | — |
Evaluated on a fixed 9-task eval set with greedy decoding. The base model produced an empty response for the Sagarmatha QA task (NE ratio 0.00). Full results saved to results/model_eval_results.json.
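The script-ratio metric itself is straightforward; a sketch (whether digits and punctuation count toward the denominator is an assumption):

```python
# Sketch: fraction of non-whitespace characters in a response that are Devanagari (U+0900-U+097F).
def nepali_script_ratio(text: str) -> float:
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if "\u0900" <= c <= "\u097F") / len(chars)
```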