LLM Tokenizers / Low-Resource NLP

Measuring and reducing the Nepali token tax in modern LLMs

I benchmarked 17 model tokenizers, assembled a 7.49GB Nepali tokenizer-training corpus, trained a Devanagari-optimized tokenizer, extended production model tokenizers, and ran a continued-pretraining and supervised fine-tuning (CPT/SFT) proof of concept on Qwen3-4B.

17 tokenizers benchmarked · 45.6% Devanagari BPC improvement · 48.1% token count reduction · $7 total compute cost

The problem

Tokenizers are part of the cost model of LLMs. If Nepali takes more tokens than English for the same amount of content, Nepali users pay more, fit less into context windows, and spend more compute on prompt processing.

The benchmark measured Nepali and English tokens per word for each tokenizer, then reported the Nepali-to-English ratio as a Nepali token tax. A 3x tax means the same content takes roughly three times as many tokens in Nepali as in English.
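In code, the tax is just a ratio of two tokens-per-word measurements. A minimal sketch of that measurement, with one-document samples standing in for the real benchmark data:

```python
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, docs):
    """Average number of tokens per whitespace-separated word across docs."""
    total_tokens = sum(len(tokenizer.encode(d, add_special_tokens=False)) for d in docs)
    total_words = sum(len(d.split()) for d in docs)
    return total_tokens / total_words

# Tiny illustrative samples; the real benchmark uses CC-100 documents.
nepali_docs = ["नेपालको राजधानी काठमाडौं हो"]
english_docs = ["The capital of Nepal is Kathmandu"]

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # any HF tokenizer works here
tax = tokens_per_word(tok, nepali_docs) / tokens_per_word(tok, english_docs)
print(f"Nepali token tax: {tax:.1f}x")
```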

Nepali token tax by tokenizer (lower is better; full numbers in the table below)
Model | Tokenizer | Tax | NE tok/word | EN tok/word | Devanagari tokens
Gemma 4 | Gemma BPE | 2.0x | 2.52 | 1.26 | 13,754
OpenAI o200k | o200k_base | 2.2x | 2.68 | 1.22 | 3,985
OpenAI harmony | o200k_harmony | 2.2x | 2.68 | 1.22 | 3,985
LLaMA 4 | BPE | 2.4x | 2.99 | 1.23 | 2,696
Qwen3-4B Nepali extended | Qwen2 BPE + 15K Nepali tokens | 2.5x | 3.18 | 1.25 | 15,256
Mistral Small 4 | BPE | 2.6x | 3.30 | 1.27 | 1,569
Gemma 2 | Gemma BPE | 2.6x | 3.21 | 1.25 | 1,516
LLaMA 3 | BPE | 3.0x | 3.72 | 1.24 | 1,018
DeepSeek V4 | BPE | 3.3x | 4.10 | 1.24 | 303
Kimi K2.6 | tiktoken model | 3.3x | 3.99 | 1.23 | 318
Kimi K2 | tiktoken model | 3.3x | 3.99 | 1.23 | 318
Qwen 3.5 tokenizer | Qwen2 BPE | 3.7x | 4.65 | 1.25 | 959
GLM-5 | BPE | 4.8x | 5.89 | 1.24 | 138
Qwen 3 base | Qwen2 BPE | 4.9x | 6.10 | 1.25 | 71
Mistral v0.3 | Metaspace BPE | 5.0x | 6.79 | 1.35 | 44
Phi-3.5 | ByteLevel BPE | 5.0x | 7.05 | 1.40 | 39
GLM-4 | SentencePiece model | 5.3x | 6.57 | 1.24 | 27
Phi-4 | GPT2/ByteLevel BPE | 5.7x | 7.17 | 1.25 | 27

One sentence, three tokenizers

नेपालको राजधानी काठमाडौं हो ("The capital of Nepal is Kathmandu")

Qwen3 base tokenizer: 26 tokens
Extended tokenizer: 7 tokens
Our 32K Nepali BPE: 4 tokens
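Those counts come from encoding the sentence directly. A sketch of the comparison for two of the three tokenizers, assuming an illustrative local path for the trained SentencePiece model:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

sentence = "नेपालको राजधानी काठमाडौं हो"

# Qwen3 base tokenizer from the Hugging Face hub.
base = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
print("Qwen3 base:", len(base.encode(sentence, add_special_tokens=False)))

# Our 32K Nepali BPE (illustrative path to the trained SentencePiece model).
sp = spm.SentencePieceProcessor(model_file="tokenizers/nepali_bpe_32k.model")
print("32K Nepali BPE:", len(sp.encode(sentence)))
```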

What I built

I assembled, cleaned, filtered, and deduplicated a 7.49GB Nepali tokenizer-training corpus from CulturaX, Sangraha, CC-100, and publicly available Nepali book/document text. Then I trained SentencePiece BPE tokenizers at 32K, 48K, and 64K vocab sizes. The 32K model reached 1.34 tokens per Nepali word on the benchmark split.

From that tokenizer, I selected high-value Nepali pieces and extended several existing model tokenizers. Tokenizer extension alone improves encoding efficiency, but the model still needs continued pretraining to learn the new token embeddings.
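A sketch of how that selection-and-extension step can work, assuming pieces are ranked by frequency on a corpus sample; the paths, sample size, and scoring rule here are illustrative, not the exact pipeline:

```python
from collections import Counter
from itertools import islice

import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="tokenizers/nepali_bpe_32k.model")  # illustrative path
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Count piece frequencies on a capped corpus sample.
counts = Counter()
with open("data/corpus/nepali_training_corpus.txt", encoding="utf-8") as f:
    for line in islice(f, 1_000_000):
        counts.update(sp.id_to_piece(i) for i in sp.encode(line))

# Keep frequent multi-character pieces the base vocabulary lacks.
existing = set(tok.get_vocab())
candidates = [p for p, _ in counts.most_common()
              if p not in existing and len(p.lstrip("▁")) > 1][:15_000]

tok.add_tokens(candidates)  # model.resize_token_embeddings(len(tok)) must follow
```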

Data journey

The corpus was not hand-authored. It was assembled from existing Nepali sources, then normalized into one tokenizer-training file. That pipeline matters because tokenizer quality is mostly decided before training starts: source mix, script filtering, Unicode handling, and deduplication all affect the final vocabulary.

CulturaX Nepali

Streamed from Hugging Face and capped around 800M cleaned characters.

Sangraha verified Nepali

Streamed from ai4bharat/sangraha and capped around 500M cleaned characters.

CC-100 Nepali

Loaded from a local CC-100 Nepali archive and filtered for usable documents.

Nepali books/documents

Publicly available Nepali book and document text from the local corpus folder.

Cleaning and filtering

Before tokenizer training, every source passed through the same normalization and filtering path. The goal was to keep real Nepali text, preserve Devanagari-specific joiners, and remove repeated paragraphs before the tokenizer saw the corpus.

Unicode NFC normalization
Control-character removal while preserving newline, tab, ZWJ, and ZWNJ
Length filtering for short or unusable documents
At least 50% Devanagari ratio per document
Paragraph-level hash deduplication
Single combined training file at data/corpus/nepali_training_corpus.txt
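A minimal sketch of that per-document path, with the Devanagari-block check and an assumed minimum-length threshold (the exact cutoff is illustrative):

```python
import hashlib
import unicodedata

KEEP = {"\n", "\t", "\u200d", "\u200c"}  # newline, tab, ZWJ, ZWNJ

def clean(doc: str) -> str | None:
    doc = unicodedata.normalize("NFC", doc)
    # Drop control/format characters except the whitelisted ones.
    doc = "".join(c for c in doc if c in KEEP or unicodedata.category(c)[0] != "C")
    if len(doc) < 100:  # illustrative minimum-length threshold
        return None
    dev = sum("\u0900" <= c <= "\u097f" for c in doc)  # Devanagari block
    if dev / len(doc) < 0.5:
        return None
    return doc

seen: set[bytes] = set()

def dedup_paragraphs(doc: str) -> str:
    """Drop paragraphs whose hash has already been seen anywhere in the corpus."""
    kept = []
    for para in doc.split("\n\n"):
        h = hashlib.sha1(para.strip().encode("utf-8")).digest()
        if h not in seen:
            seen.add(h)
            kept.append(para)
    return "\n\n".join(kept)
```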
7,487,717,516 bytes · 2,825,450,662 characters · 1,658,663 paragraphs · 18,709,110 lines

Tokenizer training sweep

I trained multiple SentencePiece BPE sizes instead of assuming a single vocabulary size. The 32K tokenizer was already close to the larger 48K and 64K variants, so I used it as the source of candidate Nepali pieces for model-tokenizer extension.

Tokenizer | Vocab | Nepali tok/word
Nepali BPE (ours) | 32K | 1.34
Nepali BPE (ours) | 48K | 1.29
Nepali BPE (ours) | 64K | 1.26
NepaliBPE baseline | 50K | 1.28
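The sweep is a loop over SentencePiece training runs. A sketch with assumed options; the character coverage and sentence-sampling settings are my guesses, not the exact flags used:

```python
import sentencepiece as spm

for vocab_size in (32_000, 48_000, 64_000):
    spm.SentencePieceTrainer.train(
        input="data/corpus/nepali_training_corpus.txt",
        model_prefix=f"tokenizers/nepali_bpe_{vocab_size // 1000}k",
        model_type="bpe",
        vocab_size=vocab_size,
        character_coverage=0.9995,      # assumed; Devanagari needs near-full coverage
        input_sentence_size=10_000_000,  # assumed sampling cap for a 7.49GB corpus
        shuffle_input_sentence=True,
    )
```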

Tokenizer extension results

Measured on the local Nepali CC-100 benchmark split

Phi-4: 7.10 → 3.41 tokens per word (51.9% fewer)

Qwen 3.5 tokenizer: 4.49 → 2.50 tokens per word (44.3% fewer)

DeepSeek V4: 4.01 → 2.52 tokens per word (37.3% fewer)

Kimi K2.6: 3.89 → 2.51 tokens per word (35.5% fewer)

End-to-end model experiment

For the full model experiment, I used Qwen3-4B. I added Nepali tokens, initialized the new embeddings from their base-token decompositions, ran LoRA continued pretraining, then followed with Nepali supervised fine-tuning.

The continued-pretraining mix included Nepali text plus an English component to reduce catastrophic forgetting. The goal was to teach the model the new Nepali token embeddings without wiping out the base model's general-language behavior.
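Decomposition-based initialization means encoding each new token's surface string with the base tokenizer and combining those embedding rows. A minimal sketch, assuming the mean-of-subtoken-embeddings variant (the run may use a different weighting):

```python
import torch

def init_new_embeddings(model, base_tokenizer, new_tokens, new_ids):
    """Initialize each new token's row as the mean of its base-piece rows."""
    # model.resize_token_embeddings(...) must already have added the new rows.
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for token, new_id in zip(new_tokens, new_ids):
            base_ids = base_tokenizer.encode(token, add_special_tokens=False)
            if base_ids:
                emb[new_id] = emb[base_ids].mean(dim=0)
    # If the LM head is untied, its new rows can be initialized the same way.
```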

Verified local result

On the 2,000-document Nepali benchmark split, the saved final tokenizer encoded the text in 1,415,539 tokens versus 2,724,999 for the local Qwen3 tokenizer.

48.1%
fewer tokens on the benchmark split

CPT setup

3,000 steps on packed 2,048-token sequences.

217,769 packed sequences, 445,990,912 training tokens.

LoRA rank 64, alpha 128, dropout 0.05.

132.1M LoRA params plus 38.4M new-token params.
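In PEFT terms, that setup corresponds roughly to the config below; the target-module list and the modules_to_save choice are my assumptions, since they are the standard way to keep new-token embeddings trainable alongside LoRA adapters:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    # Assumed attention/MLP targets; the exact module list is not in this writeup.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Keep the resized embedding and LM head fully trainable so the
    # 38.4M new-token parameters can learn during CPT.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` is the resized Qwen3-4B base
```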

CPT loss

Training loss dropped sharply early, then stabilized around the low 1.5s as the cosine schedule decayed.

early 3.861 → warmup 2.220 → middle 1.656 → late 1.528 (train average 1.673)

SFT run

1,500 steps on 50K Nepali instruction examples.

Finished in 2h 46m at roughly 6.6s per step.

Observed loss moved from about 1.50 to the low 1.1s.

Final aggregate train loss reported as 1.16.

SFT loss progression

SFT started from the CPT adapter and moved the model toward instruction-following behavior in Nepali. The loss curve is not a downstream benchmark, but it is useful evidence that the run converged instead of diverging.

early 1.500 → step 1K 1.090 → late low 1.084 (train average 1.160)

Quantitative evaluation

Evaluated on 500 held-out CC-100 documents not used in training. BPC (bits-per-character) is the fair comparison metric across different tokenizers because it normalizes by actual text length, not token count.

Devanagari BPC: 1.960 → 1.066 (-45.6%)
English BPC: 1.213 → 1.123 (-7.4%)
New token usage: 17.6%
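BPC falls out of summed token negative log-likelihood normalized by character count, which is exactly why it is tokenizer-agnostic. A sketch of that computation for a Hugging Face causal LM:

```python
import math
import torch

def bits_per_character(model, tokenizer, docs):
    """Tokenizer-agnostic eval: total token NLL in bits over total characters."""
    total_nll, total_chars = 0.0, 0
    for doc in docs:
        ids = tokenizer(doc, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        # out.loss is the mean NLL (nats) over the shape[1]-1 predicted tokens;
        # multiply back to recover the sum.
        total_nll += out.loss.item() * (ids.shape[1] - 1)
        total_chars += len(doc)
    return total_nll / (total_chars * math.log(2))  # nats -> bits, per character
```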

Generation comparison

Greedy decoding on the same prompts, both models.

नेपालको राजधानी के हो? (What is the capital of Nepal?)
Base Qwen3-4B

नेपालको राजधानी केको छ? यो प्रश्न छोटो छ। तपाईंको अनुसार, नेपालको राजधानी लखनऊ हो। (roughly: "What is the capital of Nepal? This question is short. According to you, the capital of Nepal is Lucknow.")

Says Lucknow (India)
CPT + SFT

नेपालको राजधानी काठमाडौँ हो। काठमाडौँ उपत्यका नेपालको राजधानी काठमाडौँ उपत्यकाको केन्द्र हो। (roughly: "The capital of Nepal is Kathmandu. The Kathmandu Valley is the center of the capital, the Kathmandu Valley.")

Correct: Kathmandu
सगरमाथाको उचाइ कति हो? (What is the height of Everest?)
Base Qwen3-4B

सगरमाथा जम्मू-काश्मीरको एक उच्च तालाको नाम हो। (roughly: "Sagarmatha is the name of a high lake in Jammu-Kashmir.")

Says it is in Jammu-Kashmir
CPT + SFT

सगरमाथाको उचाइ 8,848 मिटर (29,029 फिट) रहेको आधिकारिक रूपमा मान्यता प्राप्त गरिएको छ। ("The height of Sagarmatha is officially recognized as 8,848 meters (29,029 feet).")

Correct: 8,848m
दशैं भनेको के हो? (What is Dashain?)
Base Qwen3-4B

दशैं छोटकरीमा वर्णन गरेको छ जसमा छोटकरी छोटकरी छोटकरी छोटकरी... (roughly: "Dashain is described in brief, in which brief brief brief brief…")

Degenerates into repetition
CPT + SFT

दशैं हिन्दु धर्मका अनुयायीहरूले मनाउने महान् उत्सव हो। यो वर्षभरिको दुःख र सङ्घर्षबाट छुटकारा पाउन र आनन्द र शान्ति प्राप्त गर्ने उत्सव हो। ("Dashain is a great festival celebrated by followers of the Hindu faith. It is a festival for finding relief from the year's sorrows and struggles and attaining joy and peace.")

Coherent cultural description

Downstream task scoring

Nepali script ratio measures what fraction of the response is Devanagari. Higher means the model stays in Nepali instead of drifting to English or Hindi.
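The ratio itself is a simple Unicode-range count; a minimal sketch, measuring Devanagari characters against all non-whitespace characters:

```python
def nepali_script_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # empty responses score 0, as with the base Sagarmatha QA
    dev = sum("\u0900" <= c <= "\u097f" for c in chars)
    return dev / len(chars)
```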

Task | Base NE ratio | Tuned NE ratio | Token F1
QA (Constitution) | 1.00 | 0.997 | 0.488
QA (Sagarmatha) | 0.00 | 0.972 | 0.722
Summarization | 0.848 | 0.987 | —
Translation (1) | 0.211 | 0.764 | —
Translation (2) | 0.112 | 0.719 | —
Instructions | 0.955 | 0.975 | —
Creative writing | 0.421 | 0.972 | —

Evaluated on a fixed 9-task eval set with greedy decoding. The base model produced an empty response for the Sagarmatha QA task (NE ratio 0.00). Full results saved to results/model_eval_results.json.