Large language models (LLMs), such as OpenAI's GPT-3 and GPT-4, process text using tokens, the fundamental units of text for these models. Tokens are small pieces of text that often, but not always, correspond to whole words. A tokenizer algorithm breaks text down into tokens based on rules involving spaces, punctuation, and special characters; emojis and other symbols also map to one or more tokens.
Think of tokens as building blocks for text. LLMs learn the statistical relationships between these tokens, allowing them to predict the next token in a sequence. This is how they generate coherent and contextually relevant text.
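For a concrete look at what a tokenizer produces, here is a minimal sketch using OpenAI's open-source tiktoken library (an assumption of this example; install it with `pip install tiktoken`). It encodes a string into integer token IDs and decodes each ID back into the piece of text it stands for:

```python
import tiktoken

# Load the encoding used by GPT-3.5 and GPT-4 models (cl100k_base).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens are the building blocks of text for LLMs."
token_ids = enc.encode(text)                       # one integer ID per token
pieces = [enc.decode([tid]) for tid in token_ids]  # the text fragment behind each ID

print(token_ids)   # the sequence of integers the model actually works with
print(pieces)      # the string pieces those integers correspond to
print(len(token_ids), "tokens for", len(text), "characters")
```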
The exact tokenization process can vary between different models. Newer models like GPT-3.5 and GPT-4 use different tokenizers than older models, resulting in different tokens for the same input text.
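As a rough illustration of that difference, the sketch below (again assuming tiktoken) encodes the same sentence with the `r50k_base` encoding used by the original GPT-3 models and the `cl100k_base` encoding used by GPT-3.5 and GPT-4; the token counts and splits will generally differ:

```python
import tiktoken

text = "Tokenization changed between model generations."

# r50k_base: GPT-3 era encoding; cl100k_base: GPT-3.5 / GPT-4 encoding.
for name in ("r50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")
```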
As a rough rule of thumb, one token corresponds to about 4 characters of English text, which gives these useful guidelines for estimating how much text a given number of tokens represents (a quick way to check these numbers yourself is sketched after the list):
- 100 tokens translate to about 75 words.
- Two sentences equal about 30 tokens.
- A typical paragraph is about 100 tokens.
- A 1500-word article totals around 2048 tokens.
- "The quick brown fox jumps over the lazy dog."
This sentence would be tokenized as: "The", " quick", " brown", " fox", " jumps", " over", " the", " lazy", " dog", ".". It contains 10 tokens and 44 characters.
- An emoji may be split into multiple tokens, each representing part of the underlying bytes that make up the character.
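If you want to sanity-check these rules of thumb on your own text, a sketch along these lines works; the sample strings are only illustrative, and the exact counts depend on the encoding you load:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "A rough rule of thumb is that one token covers about four characters of English text.",
]

for text in samples:
    ids = enc.encode(text)
    print(f"{len(ids):3d} tokens, {len(text):3d} chars, "
          f"{len(text) / len(ids):.1f} chars per token: {text!r}")
```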
The way words are split into tokens also varies with the language. For example, a word that is a single token in English may be broken into several smaller tokens in another language.
These examples demonstrate that the tokenizer doesn't necessarily treat whole words as single tokens. It can break down words into smaller units based on its internal rules. This can lead to a higher token-to-character ratio for languages other than English, potentially making API usage more expensive.
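The following sketch compares the token-to-character ratio of roughly equivalent sentences in a few languages (the translations are only illustrative); a higher ratio means more tokens, and therefore higher cost, for the same amount of text:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences; the translations are illustrative only.
samples = {
    "English": "How are you doing today?",
    "Spanish": "¿Cómo estás hoy?",
    "Japanese": "今日の調子はどうですか？",
}

for language, text in samples.items():
    ids = enc.encode(text)
    ratio = len(ids) / len(text)
    print(f"{language:9s}: {len(ids):2d} tokens / {len(text):2d} chars "
          f"= {ratio:.2f} tokens per character")
```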
When selecting a model, you also need to consider its token limits: the context window, which caps how many tokens the model can take into account, and the maximum output tokens, which caps how many tokens it can generate in a single response.
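A simple budget check might look like the sketch below. The `CONTEXT_WINDOW` and `MAX_OUTPUT_TOKENS` values are placeholder assumptions, not real limits for any particular model; substitute the numbers published for the model you plan to use:

```python
import tiktoken

# Placeholder limits -- look up the real values for your chosen model.
CONTEXT_WINDOW = 8_192       # assumed total token budget (input + output)
MAX_OUTPUT_TOKENS = 1_024    # assumed cap on generated tokens

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_budget(prompt: str, desired_output_tokens: int) -> bool:
    """Return True if the prompt plus the requested completion fits the assumed limits."""
    prompt_tokens = len(enc.encode(prompt))
    if desired_output_tokens > MAX_OUTPUT_TOKENS:
        return False
    return prompt_tokens + desired_output_tokens <= CONTEXT_WINDOW

print(fits_in_budget("Summarize the article above in three bullet points.", 300))
```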