BLOG POST ON NATURAL LANGUAGE PROCESSING

Tokenisation in Natural Language Processing

Discover the essentials of tokenisation in Natural Language Processing (NLP) with practical examples and real-world applications. Learn how to break down text into tokens for better text analysis, sentiment analysis, and machine translation. Includes a simple PHP code example for text tokenisation.
Author: John Adeyemi
Date Posted: Sun 19th May, 2024

Tokenisation is the process of breaking text into smaller units called tokens. Depending on the level of granularity the task requires, these tokens can be words, punctuation marks, subword units, or even individual characters. It is a foundational step in Natural Language Processing (NLP), underpinning applications such as text analysis, sentiment analysis, machine translation, and information retrieval.


Types of Tokenisation

1. Word Tokenisation: Splits text into individual words. For example:

a. Input: "The quick brown fox jumps over the lazy dog".

b. Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]


2. Sentence Tokenisation: Splits text into sentences. For example:

a. Input: "Hello world! How are you today?"

b. Tokens: ["Hello world!", "How are you today?"]


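3. Subword Tokenisation: Splits text into subwords, which is useful for handling out-of-vocabulary words and morphologically rich languages. Techniques include Byte Pair Encoding (BPE) and WordPiece; the exact split depends on the vocabulary learned from the training data. For example, one possible segmentation: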

a. Input: "unhappiness"

b. Tokens: ["un", "happiness"]


4. Character Tokenisation: Splits text into individual characters. For example:

a. Input: "hello"

b. Tokens: ["h", "e", "l", "l", "o"]
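
The four types above can be sketched in a few lines of PHP. This is a minimal illustration rather than a production tokeniser: the function names and the sample vocabulary are hypothetical, the sentence splitter is naive (it will break on abbreviations such as "Dr."), and the subword function uses a simple greedy longest-match lookup against a fixed vocabulary, loosely in the spirit of WordPiece, not a trained BPE or WordPiece model.

```php
<?php
// Illustrative helpers for the four tokenisation types.
// All function names and the sample vocabulary below are hypothetical.

// Word tokenisation: split on runs of whitespace.
function wordTokens(string $text): array {
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Sentence tokenisation: split on whitespace that follows ".", "!" or "?".
// Naive: abbreviations and decimal numbers are not handled.
function sentenceTokens(string $text): array {
    return preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Character tokenisation: split between every character (the u flag makes it UTF-8 aware).
function charTokens(string $text): array {
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Subword tokenisation: greedy longest-match against a fixed vocabulary.
// Assumes single-byte (ASCII) input; unknown text falls back to single characters.
function subwordTokens(string $word, array $vocab): array {
    $tokens = [];
    $start = 0;
    $length = strlen($word);
    while ($start < $length) {
        $match = $word[$start]; // fallback: a single character
        for ($end = $length; $end > $start + 1; $end--) {
            $piece = substr($word, $start, $end - $start);
            if (in_array($piece, $vocab, true)) {
                $match = $piece; // longest vocabulary entry starting here
                break;
            }
        }
        $tokens[] = $match;
        $start += strlen($match);
    }
    return $tokens;
}

print_r(sentenceTokens("Hello world! How are you today?"));
print_r(charTokens("hello"));
print_r(subwordTokens("unhappiness", ["un", "happy", "happiness"]));
```

With the sample vocabulary shown, "unhappiness" comes out as "un" and "happiness"; a different vocabulary would give a different split, which is why real subword tokenisers learn their vocabulary from a corpus.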


Real-World Scenarios

1. Search Engines: Tokenisation helps search engines break down user queries into keywords, enabling more effective search results.

Example: Query "best Italian restaurants near me" is tokenised into ["best", "Italian", "restaurants", "near", "me"].


2. Sentiment Analysis: Tokenisation gives sentiment analysis models the individual words or subwords from which they judge the sentiment of a sentence.

Example: Sentence "I love this product!" is tokenised into ["I", "love", "this", "product", "!"] so that the model can weigh the sentiment conveyed by each token.


3. Machine Translation: Tokenisation is used to convert sentences into tokens that a machine translation model can understand and translate.

Example: Sentence "Bonjour le monde" is tokenised into ["Bonjour", "le", "monde"] before being translated to "Hello world".
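
The sentiment example above keeps the exclamation mark as a token of its own. One way to get that behaviour in PHP is to match tokens rather than split on delimiters. The sketch below is an assumption about how this could be done, not the method used by any particular search engine or model; the function name tokeniseWithPunctuation is hypothetical.

```php
<?php
// Hypothetical helper: returns words and punctuation marks as separate tokens.
function tokeniseWithPunctuation(string $text): array {
    // First alternative: a run of Unicode letters or digits (a word).
    // Second alternative: any single character that is neither whitespace,
    // a letter, nor a digit (a punctuation mark or symbol).
    preg_match_all('/[\p{L}\p{N}]+|[^\s\p{L}\p{N}]/u', $text, $matches);
    return $matches[0];
}

print_r(tokeniseWithPunctuation("best Italian restaurants near me"));
print_r(tokeniseWithPunctuation("I love this product!"));
print_r(tokeniseWithPunctuation("Bonjour le monde"));
```

Because the pattern treats every non-letter, non-digit symbol as its own token, a contraction such as "don't" would be split into three tokens; whether that is acceptable depends on the downstream task.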



Tokenisation in PHP

Below is a simple PHP example demonstrating how to tokenise text from a form input into words:

------------------------------------------------

<!DOCTYPE html>
<html>
<head>
    <title>Text Tokenisation</title>
</head>
<body>
    <h1>Tokenise Your Text</h1>
    <form method="post" action="">
        <textarea name="input_text" rows="10" cols="50" placeholder="Enter your text here..."></textarea><br>
        <input type="submit" name="submit" value="Tokenise">
    </form>

<?php
// Process the form only when it was submitted with non-empty text
if ($_SERVER["REQUEST_METHOD"] == "POST" && !empty($_POST["input_text"])) {
    $inputText = $_POST["input_text"];
    $tokens = tokeniseText($inputText);

    echo "<h2>Tokenised Text:</h2>";
    echo "<pre>";
    // Escape the output so that any HTML in the submitted text is shown, not rendered
    echo htmlspecialchars(print_r($tokens, true), ENT_QUOTES, 'UTF-8');
    echo "</pre>";
}

function tokeniseText($text) {
    // Tokenisation: split text into words, using whitespace (\s) and
    // common punctuation marks as delimiters; empty tokens are dropped
    $tokens = preg_split('/[\s,.!?]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $tokens;
}
?>

</body>
</html>

------------------------------------------------


Explanation of PHP Code

1. HTML Form: Provides a text area for users to input text and a submit button to process the text.

2. Form Handling: Checks if the form is submitted and if the input text is not empty.

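3. Tokenise Function: Uses ‘preg_split’ with a regular expression to split the text into tokens wherever it finds whitespace or common punctuation marks (commas, full stops, exclamation marks, and question marks). The delimiters themselves are discarded.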

4. Display Tokens: Prints the resulting tokens in a formatted way.
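
To see what the splitting step does outside the form, the short script below runs ‘preg_split’ on a sample sentence from the command line. It assumes the pattern /[\s,.!?]+/ (whitespace plus the punctuation marks listed above); the sample text and variable names are only for illustration. It also shows why the PREG_SPLIT_NO_EMPTY flag matters: without it, the question mark at the end of the text leaves an empty string as the last token.

```php
<?php
// Sample input and pattern (assumed for illustration)
$text = "Hello, world! Isn't tokenisation fun?";
$pattern = '/[\s,.!?]+/';

// Without the flag: the trailing "?" produces an empty final element
$withEmpty = preg_split($pattern, $text);

// With the flag: empty strings are removed from the result
$withoutEmpty = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($withEmpty);    // 6 elements, the last one is ""
print_r($withoutEmpty); // 5 elements: Hello, world, Isn't, tokenisation, fun
```

Note that "Isn't" stays in one piece because the apostrophe is not in the delimiter list. Adding more punctuation to the character class changes which tokens come out, so the pattern should be chosen to suit the text being processed.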


Tokenisation is an essential step in NLP that transforms text into manageable units for further processing. Whether for search engines, sentiment analysis, or machine translation, the quality of the tokenisation affects the performance and accuracy of the models that consume the tokens. The PHP example illustrates a basic approach to word tokenisation and shows that a simple tokeniser can be added to a web application in only a few lines of code.