LLM Tokenizers / Low-Resource NLP

Measuring and reducing the Nepali token tax in modern LLMs

I benchmarked 17 model tokenizers, assembled a 7.49GB Nepali tokenizer-training corpus, trained a Devanagari-optimized tokenizer, extended production model tokenizers, and ran a continued-pretraining and supervised fine-tuning (CPT/SFT) proof of concept on Qwen3-4B.

17 tokenizers benchmarked · 45.6% Devanagari BPC improvement · 48.1% token count reduction · $7 total compute cost

The problem

Tokenizers are part of the cost model of LLMs. If Nepali takes more tokens than English for the same amount of content, Nepali users pay more, fit less into context windows, and spend more compute on prompt processing.

The benchmark measured Nepali and English tokens per word for each tokenizer, then reported the Nepali-to-English ratio as a Nepali token tax. A 3x tax means the same content takes roughly three times as many tokens in Nepali as in English.
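In code, the tax is just a ratio of two tokens-per-word measurements. A minimal sketch of that measurement, with one-document samples standing in for the real benchmark data:

```python
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, docs):
    """Average number of tokens per whitespace-separated word across docs."""
    total_tokens = sum(len(tokenizer.encode(d, add_special_tokens=False)) for d in docs)
    total_words = sum(len(d.split()) for d in docs)
    return total_tokens / total_words

# Tiny illustrative samples; the real benchmark uses CC-100 documents.
nepali_docs = ["नेपालको राजधानी काठमाडौं हो"]
english_docs = ["The capital of Nepal is Kathmandu"]

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # any HF tokenizer works here
tax = tokens_per_word(tok, nepali_docs) / tokens_per_word(tok, english_docs)
print(f"Nepali token tax: {tax:.1f}x")
```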

Nepali token tax by tokenizer (lower is better; full numbers in the table below)
Model | Tokenizer | Tax | NE tok/word | EN tok/word | Devanagari tokens
Gemma 4 | Gemma BPE | 2.0x | 2.52 | 1.26 | 13,754
OpenAI o200k | o200k_base | 2.2x | 2.68 | 1.22 | 3,985
OpenAI harmony | o200k_harmony | 2.2x | 2.68 | 1.22 | 3,985
LLaMA 4 | BPE | 2.4x | 2.99 | 1.23 | 2,696
Qwen3-4B Nepali extended | Qwen2 BPE + 15K Nepali tokens | 2.5x | 3.18 | 1.25 | 15,256
Mistral Small 4 | BPE | 2.6x | 3.30 | 1.27 | 1,569
Gemma 2 | Gemma BPE | 2.6x | 3.21 | 1.25 | 1,516
LLaMA 3 | BPE | 3.0x | 3.72 | 1.24 | 1,018
DeepSeek V4 | BPE | 3.3x | 4.10 | 1.24 | 303
Kimi K2.6 | tiktoken model | 3.3x | 3.99 | 1.23 | 318
Kimi K2 | tiktoken model | 3.3x | 3.99 | 1.23 | 318
Qwen 3.5 tokenizer | Qwen2 BPE | 3.7x | 4.65 | 1.25 | 959
GLM-5 | BPE | 4.8x | 5.89 | 1.24 | 138
Qwen 3 base | Qwen2 BPE | 4.9x | 6.10 | 1.25 | 71
Mistral v0.3 | Metaspace BPE | 5.0x | 6.79 | 1.35 | 44
Phi-3.5 | ByteLevel BPE | 5.0x | 7.05 | 1.40 | 39
GLM-4 | SentencePiece model | 5.3x | 6.57 | 1.24 | 27
Phi-4 | GPT2/ByteLevel BPE | 5.7x | 7.17 | 1.25 | 27

One sentence, three tokenizers

नेपालको राजधानी काठमाडौं हो ("The capital of Nepal is Kathmandu")

Qwen3 base tokenizer: 26 tokens
Extended tokenizer: 7 tokens
Our 32K Nepali BPE: 4 tokens
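Those counts come from encoding the sentence directly. A sketch of the comparison for two of the three tokenizers, assuming an illustrative local path for the trained SentencePiece model:

```python
import sentencepiece as spm
from transformers import AutoTokenizer

sentence = "नेपालको राजधानी काठमाडौं हो"

# Qwen3 base tokenizer from the Hugging Face hub.
base = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
print("Qwen3 base:", len(base.encode(sentence, add_special_tokens=False)))

# Our 32K Nepali BPE (illustrative path to the trained SentencePiece model).
sp = spm.SentencePieceProcessor(model_file="tokenizers/nepali_bpe_32k.model")
print("32K Nepali BPE:", len(sp.encode(sentence)))
```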

What I built

I assembled, cleaned, filtered, and deduplicated a 7.49GB Nepali tokenizer-training corpus from CulturaX, Sangraha, CC-100, and publicly available Nepali book/document text. Then I trained SentencePiece BPE tokenizers at 32K, 48K, and 64K vocab sizes. The 32K model reached 1.34 tokens per Nepali word on the benchmark split.

From that tokenizer, I selected high-value Nepali pieces and extended several existing model tokenizers. Tokenizer extension alone improves encoding efficiency, but the model still needs continued pretraining to learn the new token embeddings.
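A sketch of how that selection-and-extension step can work, assuming pieces are ranked by frequency on a corpus sample; the paths, sample size, and scoring rule here are illustrative, not the exact pipeline:

```python
from collections import Counter
from itertools import islice

import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="tokenizers/nepali_bpe_32k.model")  # illustrative path
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Count piece frequencies on a capped corpus sample.
counts = Counter()
with open("data/corpus/nepali_training_corpus.txt", encoding="utf-8") as f:
    for line in islice(f, 1_000_000):
        counts.update(sp.id_to_piece(i) for i in sp.encode(line))

# Keep frequent multi-character pieces the base vocabulary lacks.
existing = set(tok.get_vocab())
candidates = [p for p, _ in counts.most_common()
              if p not in existing and len(p.lstrip("▁")) > 1][:15_000]

tok.add_tokens(candidates)  # model.resize_token_embeddings(len(tok)) must follow
```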

Data journey

The corpus was not hand-authored. It was assembled from existing Nepali sources, then normalized into one tokenizer-training file. That pipeline matters because tokenizer quality is mostly decided before training starts: source mix, script filtering, Unicode handling, and deduplication all affect the final vocabulary.

CulturaX Nepali

Streamed from Hugging Face and capped around 800M cleaned characters.

Sangraha verified Nepali

Streamed from ai4bharat/sangraha and capped around 500M cleaned characters.

CC-100 Nepali

Loaded from a local CC-100 Nepali archive and filtered for usable documents.

Nepali books/documents

Publicly available Nepali book and document text from the local corpus folder.

Cleaning and filtering

Before tokenizer training, every source passed through the same normalization and filtering path. The goal was to keep real Nepali text, preserve Devanagari-specific joiners, and remove repeated paragraphs before the tokenizer saw the corpus.

Unicode NFC normalization
Control-character removal while preserving newline, tab, ZWJ, and ZWNJ
Length filtering for short or unusable documents
At least 50% Devanagari ratio per document
Paragraph-level hash deduplication
Single combined training file at data/corpus/nepali_training_corpus.txt
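A minimal sketch of that per-document path, with the Devanagari-block check and an assumed minimum-length threshold (the exact cutoff is illustrative):

```python
import hashlib
import unicodedata

KEEP = {"\n", "\t", "\u200d", "\u200c"}  # newline, tab, ZWJ, ZWNJ

def clean(doc: str) -> str | None:
    doc = unicodedata.normalize("NFC", doc)
    # Drop control/format characters except the whitelisted ones.
    doc = "".join(c for c in doc if c in KEEP or unicodedata.category(c)[0] != "C")
    if len(doc) < 100:  # illustrative minimum-length threshold
        return None
    dev = sum("\u0900" <= c <= "\u097f" for c in doc)  # Devanagari block
    if dev / len(doc) < 0.5:
        return None
    return doc

seen: set[bytes] = set()

def dedup_paragraphs(doc: str) -> str:
    """Drop paragraphs whose hash has already been seen anywhere in the corpus."""
    kept = []
    for para in doc.split("\n\n"):
        h = hashlib.sha1(para.strip().encode("utf-8")).digest()
        if h not in seen:
            seen.add(h)
            kept.append(para)
    return "\n\n".join(kept)
```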
7,487,717,516 bytes · 2,825,450,662 characters · 1,658,663 paragraphs · 18,709,110 lines

Tokenizer training sweep

I trained multiple SentencePiece BPE sizes instead of assuming a single vocabulary size. The 32K tokenizer was already close to the larger 48K and 64K variants, so I used it as the source of candidate Nepali pieces for model-tokenizer extension.

Tokenizer | Vocab | Nepali tok/word
Nepali BPE (ours) | 32K | 1.34
Nepali BPE (ours) | 48K | 1.29
Nepali BPE (ours) | 64K | 1.26
NepaliBPE baseline | 50K | 1.28
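The sweep is a loop over SentencePiece training runs. A sketch with assumed options; the character coverage and sentence-sampling settings are my guesses, not the exact flags used:

```python
import sentencepiece as spm

for vocab_size in (32_000, 48_000, 64_000):
    spm.SentencePieceTrainer.train(
        input="data/corpus/nepali_training_corpus.txt",
        model_prefix=f"tokenizers/nepali_bpe_{vocab_size // 1000}k",
        model_type="bpe",
        vocab_size=vocab_size,
        character_coverage=0.9995,      # assumed; Devanagari needs near-full coverage
        input_sentence_size=10_000_000,  # assumed sampling cap for a 7.49GB corpus
        shuffle_input_sentence=True,
    )
```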

Tokenizer extension results

Measured on the local Nepali CC-100 benchmark split

Phi-4: 7.10 → 3.41 tokens per word (51.9% fewer)

Qwen 3.5 tokenizer: 4.49 → 2.50 tokens per word (44.3% fewer)

DeepSeek V4: 4.01 → 2.52 tokens per word (37.3% fewer)

Kimi K2.6: 3.89 → 2.51 tokens per word (35.5% fewer)

End-to-end model experiment

For the full model experiment, I used Qwen3-4B. I added Nepali tokens, initialized the new embeddings from their base-token decompositions, ran LoRA continued pretraining, then followed with Nepali supervised fine-tuning.

The continued-pretraining mix included Nepali text plus an English component to reduce catastrophic forgetting. The goal was to teach the model the new Nepali token embeddings without wiping out the base model's general-language behavior.
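Decomposition-based initialization means encoding each new token's surface string with the base tokenizer and combining those embedding rows. A minimal sketch, assuming the mean-of-subtoken-embeddings variant (the run may use a different weighting):

```python
import torch

def init_new_embeddings(model, base_tokenizer, new_tokens, new_ids):
    """Initialize each new token's row as the mean of its base-piece rows."""
    # model.resize_token_embeddings(...) must already have added the new rows.
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for token, new_id in zip(new_tokens, new_ids):
            base_ids = base_tokenizer.encode(token, add_special_tokens=False)
            if base_ids:
                emb[new_id] = emb[base_ids].mean(dim=0)
    # If the LM head is untied, its new rows can be initialized the same way.
```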

Verified local result

On the 2,000-document Nepali benchmark split, the saved final tokenizer encoded the text in 1,415,539 tokens versus 2,724,999 for the local Qwen3 tokenizer.

48.1%
fewer tokens on the benchmark split

CPT setup

3,000 steps on packed 2,048-token sequences.

217,769 packed sequences, 445,990,912 training tokens.

LoRA rank 64, alpha 128, dropout 0.05.

132.1M LoRA params plus 38.4M new-token params.
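In PEFT terms, that setup corresponds roughly to the config below; the target-module list and the modules_to_save choice are my assumptions, since they are the standard way to keep new-token embeddings trainable alongside LoRA adapters:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    # Assumed attention/MLP targets; the exact module list is not in this writeup.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Keep the resized embedding and LM head fully trainable so the
    # 38.4M new-token parameters can learn during CPT.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # `model` is the resized Qwen3-4B base
```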

CPT loss

Training loss dropped sharply early, then stabilized around the low 1.5s as the cosine schedule decayed.

early 3.861 → warmup 2.220 → middle 1.656 → late 1.528 (train average 1.673)

SFT run

1,500 steps on 50K Nepali instruction examples.

Finished in 2h 46m at roughly 6.6s per step.

Observed loss moved from about 1.50 to the low 1.1s.

Final aggregate train loss reported as 1.16.

SFT loss progression

SFT started from the CPT adapter and moved the model toward instruction-following behavior in Nepali. The loss curve is not a downstream benchmark, but it is useful evidence that the run converged instead of diverging.

early 1.500 → step 1K 1.090 → late low 1.084 (train average 1.160)

Quantitative evaluation

Evaluated on 500 held-out CC-100 documents not used in training. BPC (bits-per-character) is the fair comparison metric across different tokenizers because it normalizes by actual text length, not token count.

Devanagari BPC: 1.960 → 1.066 (-45.6%)
English BPC: 1.213 → 1.123 (-7.4%)
New token usage: 17.6%
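BPC falls out of summed token negative log-likelihood normalized by character count, which is exactly why it is tokenizer-agnostic. A sketch of that computation for a Hugging Face causal LM:

```python
import math
import torch

def bits_per_character(model, tokenizer, docs):
    """Tokenizer-agnostic eval: total token NLL in bits over total characters."""
    total_nll, total_chars = 0.0, 0
    for doc in docs:
        ids = tokenizer(doc, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        # out.loss is the mean NLL (nats) over the shape[1]-1 predicted tokens;
        # multiply back to recover the sum.
        total_nll += out.loss.item() * (ids.shape[1] - 1)
        total_chars += len(doc)
    return total_nll / (total_chars * math.log(2))  # nats -> bits, per character
```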

Generation comparison

Greedy decoding on the same prompts, both models.

नेपालको राजधानी के हो? (What is the capital of Nepal?)
Base Qwen3-4B

नेपालको राजधानी केको छ? यो प्रश्न छोटो छ। तपाईंको अनुसार, नेपालको राजधानी लखनऊ हो। (roughly: "What is the capital of Nepal? This question is short. According to you, the capital of Nepal is Lucknow.")

Says Lucknow (India)
CPT + SFT

नेपालको राजधानी काठमाडौँ हो। काठमाडौँ उपत्यका नेपालको राजधानी काठमाडौँ उपत्यकाको केन्द्र हो। (roughly: "The capital of Nepal is Kathmandu. The Kathmandu Valley is the center of the capital, the Kathmandu Valley.")

Correct: Kathmandu
सगरमाथाको उचाइ कति हो? (What is the height of Everest?)
Base Qwen3-4B

सगरमाथा जम्मू-काश्मीरको एक उच्च तालाको नाम हो। (roughly: "Sagarmatha is the name of a high lake in Jammu-Kashmir.")

Says it is in Jammu-Kashmir
CPT + SFT

सगरमाथाको उचाइ 8,848 मिटर (29,029 फिट) रहेको आधिकारिक रूपमा मान्यता प्राप्त गरिएको छ। ("The height of Sagarmatha is officially recognized as 8,848 meters (29,029 feet).")

Correct: 8,848m
दशैं भनेको के हो? (What is Dashain?)
Base Qwen3-4B

दशैं छोटकरीमा वर्णन गरेको छ जसमा छोटकरी छोटकरी छोटकरी छोटकरी... (roughly: "Dashain is described in brief, in which brief brief brief brief…")

Degenerates into repetition
CPT + SFT

दशैं हिन्दु धर्मका अनुयायीहरूले मनाउने महान् उत्सव हो। यो वर्षभरिको दुःख र सङ्घर्षबाट छुटकारा पाउन र आनन्द र शान्ति प्राप्त गर्ने उत्सव हो। ("Dashain is a great festival celebrated by followers of the Hindu faith. It is a festival for finding relief from the year's sorrows and struggles and attaining joy and peace.")

Coherent cultural description

Downstream task scoring

Nepali script ratio measures what fraction of the response is Devanagari. Higher means the model stays in Nepali instead of drifting to English or Hindi.
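The ratio itself is a simple Unicode-range count; a minimal sketch, measuring Devanagari characters against all non-whitespace characters:

```python
def nepali_script_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # empty responses score 0, as with the base Sagarmatha QA
    dev = sum("\u0900" <= c <= "\u097f" for c in chars)
    return dev / len(chars)
```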

Task | Base NE ratio | Tuned NE ratio | Token F1
QA (Constitution) | 1.00 | 0.997 | 0.488
QA (Sagarmatha) | 0.00 | 0.972 | 0.722
Summarization | 0.848 | 0.987 | —
Translation (1) | 0.211 | 0.764 | —
Translation (2) | 0.112 | 0.719 | —
Instructions | 0.955 | 0.975 | —
Creative writing | 0.421 | 0.972 | —

Evaluated on a fixed 9-task eval set with greedy decoding. The base model produced an empty response for the Sagarmatha QA task (NE ratio 0.00). Full results saved to results/model_eval_results.json.