Tokenization is the process of breaking text into smaller units called tokens, which can be words, sentences, or even characters. It is a core step in Natural Language Processing (NLP) because it helps machines understand and process text.
- Converts raw text into a format that AI models can understand
- Helps in analyzing and processing text efficiently
- Useful for improving the accuracy of NLP models
In this article, we will look at the different types of tokenization techniques.

1. Word Tokenization
It splits the text into individual words.
- Separates text based on spaces
- Each word is treated as one unit
- Does not break words further
Example:
Input: “Machine learning is powerful”
Output: [“Machine”, “learning”, “is”, “powerful”]
Advantages
- Simple to implement
- Efficient for basic text processing
Disadvantages
- Cannot handle unseen or complex words
- Ignores context within words
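Whitespace-based word tokenization can be sketched with Python's built-in `str.split`; the `word_tokenize` helper name here is just for illustration:

```python
def word_tokenize(text):
    # Split on runs of whitespace; each word becomes one token
    return text.split()

print(word_tokenize("Machine learning is powerful"))
# → ['Machine', 'learning', 'is', 'powerful']
```

Note that this simple version keeps punctuation attached to words (e.g. "powerful." would be one token); libraries like NLTK handle such cases with more elaborate rules.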
2. Sentence Tokenization
This splits the text into individual sentences.
- Breaks text using punctuation (like . ? !)
- Helps in dividing large text into smaller parts
- Useful for summarization and analysis
Example:
Input: "AI is transforming industries. It is used everywhere."
Output: [“AI is transforming industries.”, “It is used everywhere.”]
Advantages
- Helps in organizing text clearly
- Useful for understanding context between sentences
Disadvantages
- Can make mistakes with complex punctuation (e.g. abbreviations like "Dr.")
- Rules may differ across languages
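A minimal punctuation-based sentence splitter can be written with a regular expression; this naive rule splits at whitespace following `.`, `?`, or `!`, so abbreviations like "Dr." would be split incorrectly:

```python
import re

def sentence_tokenize(text):
    # Split at whitespace that follows a sentence-ending mark (., ?, !)
    return re.split(r"(?<=[.?!])\s+", text.strip())

print(sentence_tokenize("AI is transforming industries. It is used everywhere."))
# → ['AI is transforming industries.', 'It is used everywhere.']
```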
3. Subword Tokenization
It works by splitting words into smaller meaningful parts.
- Breaks long or complex words into smaller pieces
- Helps handle unknown words
- Balances word and character levels
Example:
Input: “playing”
Output: [“play”, “ing”]
Advantages
- Handles unseen words effectively
- Reduces vocabulary size
Disadvantages
- More complex than word tokenization
- Can break words unnaturally
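The idea can be sketched with a greedy longest-match segmenter over a toy hand-picked vocabulary. This is only an illustration of the splitting principle, not a real algorithm like BPE or WordPiece, which learn their vocabularies from data:

```python
def subword_tokenize(word, vocab):
    # Greedily match the longest known subword at the start of the word;
    # unmatched characters fall back to single-character tokens
    tokens = []
    while word:
        for i in range(len(word), 0, -1):
            if word[:i] in vocab:
                tokens.append(word[:i])
                word = word[i:]
                break
        else:
            tokens.append(word[0])
            word = word[1:]
    return tokens

vocab = {"play", "ing", "power", "ful"}
print(subword_tokenize("playing", vocab))  # → ['play', 'ing']
```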
4. Character Tokenization
It splits text into individual characters instead of words.
- Breaks every word into letters
- Works the same for all languages
- Does not depend on words or vocabulary
Example:
Input: “Data”
Output: [“D”, “a”, “t”, “a”]
Advantages
- Never faces out-of-vocabulary (unknown word) issues
- Language-independent
Disadvantages
- Increases sequence length
- Slower to process
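In Python, character tokenization is simply converting the string to a list of its characters:

```python
def char_tokenize(text):
    # Each character becomes its own token
    return list(text)

print(char_tokenize("Data"))  # → ['D', 'a', 't', 'a']
```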
5. N-gram Tokenization
This splits text into groups of consecutive words.
- Groups words together instead of splitting them alone
- Helps capture relationships between words
- Can be bigrams (2 words), trigrams (3 words), etc.
Example:
Input: "Deep learning models"
Output: Bigrams: ["Deep learning", "learning models"]
Advantages
- Captures context better than single words
- Useful for prediction tasks
Disadvantages
- Increases data size
- Needs more system resources (memory and CPU)
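Word-level n-grams can be produced by sliding a window of `n` words over the word-tokenized text; the `ngram_tokenize` name is just for illustration:

```python
def ngram_tokenize(text, n):
    # Slide a window of n consecutive words over the text
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngram_tokenize("Deep learning models", 2))
# → ['Deep learning', 'learning models']
```

Passing `n=3` would yield trigrams instead; with fewer than `n` words the result is an empty list.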
Difference Between Tokenization Techniques
| Technique | Unit of Split | Example Output | Best Use Case | Limitation |
|---|---|---|---|---|
| Word Tokenization | Words | ["Machine", "learning"] | Basic text processing | Cannot handle unknown words |
| Sentence Tokenization | Sentences | ["AI is good.", "It helps."] | Text summarization | Issues with complex punctuation |
| Subword Tokenization | Sub-parts of words | ["play", "ing"] | Handling rare/unseen words | Slightly complex |
| Character Tokenization | Characters | ["D", "a", "t", "a"] | Language-independent tasks | Longer sequences, slower |
| N-gram Tokenization | Word groups | ["Deep learning", "learning models"] | Context-based predictions | High memory usage |
When to Use Which Tokenization Technique
- Word Tokenization: Use when working on simple tasks like basic text analysis, counting words, or preprocessing.
- Sentence Tokenization: Use when you need to split large text for summarization, sentiment analysis, or paragraph understanding.
- Subword Tokenization: Use in modern NLP models (like transformers) where handling unknown or rare words is important.
- Character Tokenization: Use when working with multiple languages, misspellings, or when vocabulary is not fixed.
- N-gram Tokenization: Use when capturing context between words is important, like in text prediction or language modeling.