Training a Tokenizer for Llama Model

The Llama family is a series of large language models released by Meta (formerly Facebook). These decoder-only transformer models are used for generation tasks. Almost all decoder-only models nowadays use the Byte-Pair Encoding (BPE) algorithm for tokenization. In this article, you will learn about BPE. In particular, you will learn:

What BPE is compared to other tokenization algorithms
How to prepare a dataset and train a BPE tokenizer
How to use the tokenizer

Photo by Joss Woodhead. Some rights reserved.

Let’s get started.

Overview

This article is divided into four parts; they are:

Understanding BPE
Training a BPE tokenizer with Hugging Face tokenizers library
Training a BPE tokenizer with SentencePiece library
Training a BPE tokenizer with tiktoken library

Understanding BPE

Byte-Pair Encoding (BPE) is a tokenization algorithm used to tokenize text into sub-word units. Instead of splitting text into only words and punctuation, BPE can further split the prefixes and suffixes of words so that prefixes, stems, and suffixes can each be associated with meaning in the language model. Without sub-word tokenization, a language model would find it difficult to learn that “happy” and “unhappy” are antonyms of each other.

BPE is not the only sub-word tokenization algorithm. WordPiece, which is the default for BERT, is another one. A well-implemented BPE does not need an “unknown” token in the vocabulary, and nothing is OOV (out of vocabulary) in BPE. This is because BPE can start with the 256 possible byte values as its initial vocabulary (hence the name byte-level BPE) and then repeatedly merge the most frequent pair of adjacent tokens into a new token until the desired vocabulary size is reached.
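The merge loop described above can be sketched in a few lines of pure Python. This is a toy illustration only, not how the libraries below implement it: the function name train_toy_bpe is made up for this example, and for simplicity it counts pairs across the whole text, whereas real trainers usually split the text into words first.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from `text`, starting from raw bytes."""
    # Start from single bytes, so every possible input is representable
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging would not help
        merges.append((left, right))
        # Replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges


merges = train_toy_bpe("happy unhappy happier unhappiness", num_merges=5)
for left, right in merges:
    print(left, "+", right, "->", left + right)
```

On this tiny corpus, the shared stem is built up step by step from single bytes, which is how a sub-word vocabulary lets “happy” and “unhappy” share tokens.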

Nowadays, BPE is the tokenization algorithm of choice for most decoder-only models. However, you do not want to implement your own BPE tokenizer from scratch. Instead, you can use tokenizer libraries such as Hugging Face’s tokenizers, OpenAI’s tiktoken, or Google’s sentencepiece.

Training a BPE tokenizer with Hugging Face tokenizers library

To train a BPE tokenizer, you need to prepare a dataset so the tokenizer algorithm can determine the most frequent pair of tokens to merge. For decoder-only models, a subset of the model’s training data is usually appropriate.

Training a tokenizer is time-consuming, especially for large datasets. However, unlike a language model, a tokenizer does not need to learn the language context of the text, only how often tokens appear in a typical text corpus. While you may need trillions of tokens to train a good language model, you only need a few million tokens to train a good tokenizer.

As mentioned in a previous article, there are several well-known text datasets for language model training. For a toy project, you may want a smaller dataset for faster experimentation. The HuggingFaceFW/fineweb dataset is a good choice for this purpose. In its full size, it is a 15 trillion token dataset, but it also has 10B, 100B, and 350B sizes for smaller projects. The dataset is derived from Common Crawl and filtered by Hugging Face to improve data quality.

Below is how you can print a few samples from the dataset:

import datasets

dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
count = 0
for sample in dataset:
    print(sample)
    count += 1
    if count >= 5:
        break

Running this code will print the following:

{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||F...',
 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053',
 'date': '2013-05-18T05:48:59Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learne...',
 'id': '<urn:uuid:ba819eb7-e6e6-415a-87f4-0347b6a4f017>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://endogenousretrovirus.blogspot.com/2007/11/if-you-have-set-yourself-on...',
 'date': '2013-05-18T06:43:03Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.9737711548805237, 'token_count': 703}
...

For training a tokenizer (and even a language model), you only need the text field of each sample.

To train a BPE tokenizer using the tokenizers library, you simply feed the text samples to the trainer. Below is the complete code:

from typing import Iterator

import datasets
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, normalizers

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.IterableDataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Initialize a BPE model: either byte_fallback=True or set unk_token="[UNK]"
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False)
tokenizer.decoder = decoders.ByteLevel()

# Trainer: sets the target vocabulary size and the special tokens to reserve
trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=True,
)

# Train and save the tokenizer to disk
texts = get_texts(dataset, limit=10_000)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Reload the tokenizer from disk
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
enc = tokenizer.encode(text)
print("Token IDs:", enc.ids)
print("Decoded:", tokenizer.decode(enc.ids))

When you run this code, you will see:

Resolving data files: 100%|███████████████████████| 27468/27468 [00:03<00:00, 7792.97it/s]
[00:00:01] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:02] Tokenize words ████████████████████████████ 10000 / 10000
[00:00:00] Count pairs ████████████████████████████ 10000 / 10000
[00:00:38] Compute merges ████████████████████████████ 24799 / 24799
Token IDs: [3548, 277, 396, 1694, 14414, 227, 12060, 715, 9814, 180, 188]
Decoded: Let's have a pizza party! 🍕

To avoid loading the entire dataset at once, use the streaming=True argument in the load_dataset() function. The tokenizers library expects only text for training BPE, so the get_texts() function yields text samples one by one. The loop terminates when the limit is reached since the entire dataset is not needed to train a tokenizer.
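The early-exit loop in get_texts() can also be written with itertools.islice from the standard library. The sketch below is only an illustration of that pattern: take_texts() and fake_stream() are hypothetical names, and the endless generator stands in for a streaming dataset so the example runs without downloading anything.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_texts(samples: Iterable[dict], limit: int) -> Iterator[str]:
    """Lazily yield the "text" field of at most `limit` samples."""
    for sample in islice(samples, limit):
        yield sample["text"]


def fake_stream() -> Iterator[dict]:
    """Hypothetical endless stream standing in for a streaming dataset."""
    i = 0
    while True:
        yield {"text": f"document {i}", "id": i}
        i += 1


texts = list(take_texts(fake_stream(), limit=3))
print(texts)  # ['document 0', 'document 1', 'document 2']
```

Because both the stream and the helper are generators, only the first three samples are ever produced, even though the source never ends.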

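To create byte-level BPE, set the byte_fallback=True argument in the BPE model and configure the ByteLevel pre-tokenizer and decoder. If you want to guarantee that all 256 byte symbols are in the vocabulary even when some never appear in the training sample, you can additionally pass initial_alphabet=pre_tokenizers.ByteLevel.alphabet() to the trainer. Adding an NFKC normalizer is also recommended to clean Unicode text for better tokenization.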
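To see what NFKC normalization does to text before tokenization, you can try it with Python's built-in unicodedata module. This is a small standalone illustration of the normalization form itself, not of the tokenizers library's internals; the sample strings are chosen for this example.

```python
import unicodedata

samples = [
    "\ufb01ne",                          # "ﬁne" written with the single-character "fi" ligature
    "\uff2c\uff4c\uff41\uff4d\uff41",    # "Ｌｌａｍａ" in full-width letters
    "x\u00b2",                           # "x²" with a superscript two
]
normalized = [unicodedata.normalize("NFKC", s) for s in samples]
for before, after in zip(samples, normalized):
    print(repr(before), "->", repr(after))
# 'ﬁne' -> 'fine'
# 'Ｌｌａｍａ' -> 'Llama'
# 'x²' -> 'x2'
```

Visually similar strings collapse into one canonical form, so the tokenizer does not spend vocabulary entries on several variants of the same text.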

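For a decoder-only model, you will also need special tokens such as <PAD>, <EOT>, and <MASK>, which you register through the special_tokens argument of the trainer. The code above uses BERT-style placeholders ([PAD], [CLS], [SEP], [MASK]) for demonstration; replace them with the tokens your model expects. The <EOT> token signals the end of a text sequence, allowing the model to declare when sequence generation is complete.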
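As a rough illustration of how an end-of-text token is typically used, the sketch below appends it after each tokenized document when building a training stream and cuts generated output at its first occurrence. The ID value, the function names, and the token lists are all hypothetical; in practice you would look up the ID that your trained tokenizer assigned to the special token.

```python
EOT_ID = 0  # hypothetical ID reserved for the <EOT> special token


def pack_documents(docs: list[list[int]], eot_id: int = EOT_ID) -> list[int]:
    """Concatenate tokenized documents, marking each boundary with <EOT>."""
    stream = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eot_id)
    return stream


def truncate_at_eot(generated: list[int], eot_id: int = EOT_ID) -> list[int]:
    """Keep only the tokens generated before the first <EOT>."""
    if eot_id in generated:
        return generated[: generated.index(eot_id)]
    return generated


packed = pack_documents([[11, 12, 13], [21, 22]])
print(packed)                              # [11, 12, 13, 0, 21, 22, 0]
print(truncate_at_eot([31, 32, 0, 33]))    # [31, 32]
```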

Once the tokenizer is trained, save it to a file for later use. To use a tokenizer, call the encode() method to convert text into a sequence of token IDs, or the decode() method to convert token IDs back to text.

Note that the code above sets a small vocabulary size of 25,000 and limits the training dataset to 10,000 samples for demonstration purposes, so that training completes in a reasonable time. In practice, use a larger vocabulary size and training dataset so the language model can capture the diversity of the language. As a reference, the vocabulary size of Llama 2 is 32,000 and that of Llama 3 is 128,256.

Training a BPE tokenizer with SentencePiece library

As an alternative to Hugging Face’s tokenizers library, you can use Google’s sentencepiece library. The library is written in C++ and is fast, though its API and documentation are less refined than those of the tokenizers library.

The previous code rewritten using the sentencepiece library is as follows:

from typing import Iterator

import datasets
import sentencepiece as spm

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.IterableDataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Train a BPE model; the special tokens are assigned fixed IDs 0 to 3
spm.SentencePieceTrainer.Train(
    sentence_iterator=get_texts(dataset, limit=10_000),
    byte_fallback=True,
    model_prefix="sp_bpe",
    vocab_size=32_000,
    model_type="bpe",
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,  # set to -1 to disable
    character_coverage=1.0,
    input_sentence_size=10_000,
    shuffle_input_sentence=False,
)

# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
ids = sp.encode(text, out_type=int, enable_sampling=False)  # default: no special tokens
tokens = sp.encode(text, out_type=str, enable_sampling=False)
print("Tokens:", tokens)
print("Token IDs:", ids)
decoded = sp.decode(ids)
print("Decoded:", decoded)

When you run this code, you will see:

...
Tokens: ['▁Let', "'", 's', '▁have', '▁a', '▁pizza', '▁party', '!', '▁', '<0xF0>',
 '<0x9F>', '<0x8D>', '<0x95>']
Token IDs: [2703, 31093, 31053, 422, 261, 10404, 3064, 31115, 31046, 244, 163, 145, 153]
Decoded: Let's have a pizza party! 🍕

The trainer in SentencePiece is more verbose than the one in tokenizers, both in code and output. The key is to set byte_fallback=True in the SentencePieceTrainer; otherwise, the tokenizer may require an unknown token. The emoji in the test text serves as a corner case to verify that the tokenizer can handle unseen Unicode characters, which byte-level BPE should handle gracefully.
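The byte pieces in the output above can be reproduced by hand, which helps explain what byte fallback does. The sketch below encodes the emoji as UTF-8 and formats each byte in the <0xNN> style seen in the output. The helper name is made up for this example, and the ID offset is an assumption inferred from the output shown here (the byte pieces appear to directly follow the four reserved IDs 0 to 3), not a documented guarantee of the library.

```python
def byte_fallback_pieces(ch: str) -> list[str]:
    """Format the UTF-8 bytes of `ch` as SentencePiece-style byte pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


pizza = "\U0001F355"  # the pizza emoji used in the test text
pieces = byte_fallback_pieces(pizza)
print(pieces)  # ['<0xF0>', '<0x9F>', '<0x8D>', '<0x95>']

# Assumption: byte pieces are numbered right after unk, bos, eos, and pad
NUM_RESERVED = 4
ids = [b + NUM_RESERVED for b in pizza.encode("utf-8")]
print(ids)  # [244, 163, 145, 153]
```

Since any character can be written as UTF-8 bytes, a tokenizer with 256 byte pieces always has a way to represent text it never saw in training.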

Training a BPE tokenizer with tiktoken library

The third library you can use for BPE tokenization is OpenAI’s tiktoken library. While it is easy to load pre-trained tokenizers, training with this library is not recommended.

The code in the previous sections can be rewritten using the tiktoken library as follows:

from typing import Iterator

import datasets
import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.IterableDataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if count >= limit:
            break

# Collect texts up to some manageable limit for tokenizer training
limit = 1_000
texts = "\n".join(get_texts(dataset, limit=limit))

# Train a simple BPE tokenizer; pat_str defines how text is split into "words"
pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
enc_simple = SimpleBytePairEncoding.train(training_data=texts, vocab_size=300, pat_str=pat_str)

# Convert to a real tiktoken encoding (kept in memory only; nothing is saved to disk)
enc = tiktoken.Encoding(
    name="my_bpe",
    pat_str=enc_simple.pat_str,  # same regex used during training
    mergeable_ranks=enc_simple.mergeable_ranks,
    special_tokens={},
)

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
tok_ids = enc.encode(text)
print("Token IDs:", tok_ids)
print("Decoded:", enc.decode(tok_ids))

When you run this code, you will see:

...
Token IDs: [76, 101, 116, 39, 115, 293, 97, 118, 101, 257, 278, 105, 122, 122, 97, 278,
 286, 116, 121, 33, 32, 240, 159, 141, 149]
Decoded: Let's have a pizza party! 🍕

The tiktoken library does not have an optimized trainer. The only available module is a Python implementation of the BPE algorithm via the SimpleBytePairEncoding class. To train a tokenizer, you need to define how the input text should be split into words using the pat_str argument, which defines a “word” using a regular expression.
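To get a feel for what pat_str does, you can run a simplified version of the pattern with Python's built-in re module. The pattern in the code above uses Unicode classes such as \p{L}, which the standard re module does not support, so the sketch below swaps in ASCII-only classes. SIMPLE_PAT is therefore an approximation made for this example, not the pattern the tokenizer was trained with.

```python
import re

# ASCII-only approximation: [A-Za-z] replaces \p{L} and [0-9] replaces \p{N}
SIMPLE_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""

text = "Let's have a pizza party! \U0001F355"
words = re.findall(SIMPLE_PAT, text)
print(words)
# ['Let', "'s", ' have', ' a', ' pizza', ' party', '!', ' 🍕']
```

Each element is a “word” in the sense used by the trainer: pairs are only merged within a word, never across word boundaries, and a leading space stays attached to the word that follows it.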

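The training output is a dictionary called mergeable ranks, which maps each token (as a byte string) to an integer rank. The rank serves both as the token ID and as the merge priority: when encoding, the pair of adjacent tokens whose merged form has the lowest rank is merged first. To create a tokenizer, simply pass the pat_str and mergeable_ranks arguments, together with a name and the special tokens, to the Encoding class.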
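How the ranks drive encoding can be sketched in pure Python. The function below follows the general idea of rank-based BPE encoding for a single word: start from single bytes and keep merging the adjacent pair whose merged form has the lowest rank. It is a simplified illustration rather than tiktoken's optimized implementation, and the two learned merges in the ranks table are invented for this example.

```python
def bpe_encode(mergeable_ranks: dict[bytes, int], word: bytes) -> list[int]:
    """Encode one word by repeatedly applying the lowest-rank merge."""
    parts = [bytes([b]) for b in word]
    while True:
        best_idx, best_rank = None, None
        # Find the adjacent pair whose merged form has the lowest rank
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_rank is None:
            break  # no adjacent pair can be merged any further
        parts[best_idx : best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [mergeable_ranks[p] for p in parts]


# Hypothetical ranks: the 256 single bytes plus two learned merges
ranks = {bytes([i]): i for i in range(256)}
ranks[b" p"] = 256
ranks[b"zz"] = 257

token_ids = bpe_encode(ranks, b" pizza")
print(token_ids)  # [256, 105, 257, 97]
```

Bytes that take part in no merge, such as “i” (105) and “a” (97), keep their byte value as the token ID, which matches the many IDs below 256 in the output above.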

Note that the tokenizer in tiktoken does not have a save function. Instead, save the pat_str and mergeable_ranks arguments if needed.
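One possible way to persist these two pieces is a small JSON file. Since the keys of mergeable_ranks are byte strings, which JSON cannot store directly, the sketch below encodes them as base64 text. The function names, the file layout, and the stand-in values are all choices made for this example, not a format defined by tiktoken.

```python
import base64
import json
import os
import tempfile


def save_encoding(path: str, pat_str: str, mergeable_ranks: dict[bytes, int]) -> None:
    """Write pat_str and mergeable_ranks to a JSON file."""
    payload = {
        "pat_str": pat_str,
        # bytes are not valid JSON keys, so store them as base64 text
        "mergeable_ranks": {
            base64.b64encode(token).decode("ascii"): rank
            for token, rank in mergeable_ranks.items()
        },
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def load_encoding(path: str) -> tuple[str, dict[bytes, int]]:
    """Read back pat_str and mergeable_ranks written by save_encoding()."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    ranks = {base64.b64decode(k): v for k, v in payload["mergeable_ranks"].items()}
    return payload["pat_str"], ranks


# Stand-in values; in practice use enc_simple.pat_str and enc_simple.mergeable_ranks
pat_str = r"\s+|\S+"
ranks = {bytes([i]): i for i in range(256)}
ranks[b"th"] = 256

path = os.path.join(tempfile.mkdtemp(), "my_bpe.json")
save_encoding(path, pat_str, ranks)
loaded_pat, loaded_ranks = load_encoding(path)
print(loaded_pat == pat_str and loaded_ranks == ranks)  # True
```

After loading, the two values can be passed to tiktoken.Encoding in the same way as in the training code above.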

Since training is done in pure Python, it is very slow. Training your own tokenizer this way is not recommended.

Summary

In this article, you learned about byte-level BPE and how to train a BPE tokenizer. Specifically, you learned how to train a BPE tokenizer with the tokenizers, sentencepiece, and tiktoken libraries. You also learned that a tokenizer can encode text into a list of integer token IDs and decode them back to text.
