HuggingFace Tokenizers

Behind each of the pipeline examples that we've seen in this chapter is a tokenization step that splits the raw text into smaller pieces called tokens. We'll see how this works in detail in Chapter 2, but for now it's enough to understand that tokens may be words, parts of words, or just characters like punctuation. Transformer models are trained on numerical representations of these tokens, so getting this step right is pretty important for the whole NLP project!

Tokenizers provides many tokenization strategies and is extremely fast at tokenizing text thanks to its Rust backend. It also takes care of all the pre- and postprocessing steps, such as normalizing the inputs and transforming the model outputs to the required format. With Tokenizers, we can load a tokenizer in the same way we can load pretrained model weights with HugginFace Transformers.

We need a dataset and metrics to train and evaluate models, so let's take a look at - Datasets, which is in charge of that aspect.