lemmatization

By Alexander S. Gillis, Technical Writer and Editor

What is lemmatization?

Lemmatization is the process of grouping together the different inflected forms of the same word so they can be analyzed as a single item. It's used in computational linguistics, natural language processing (NLP) and chatbots. Because it treats forms such as "walks" and "walked" as one word, lemmatization makes tools such as chatbots and search engines more effective and accurate.

The goal of lemmatization is to reduce a word to its base or dictionary form, also called a lemma. For example, the verb "running" would be identified as "run." Lemmatization relies on morphological, or structural, and contextual analysis of words.

To identify the correct lemma, tools analyze a word's meaning and intended part of speech, using the context of the surrounding sentence, neighboring sentences or even the entire document. With this understanding, tools that use lemmatization can better interpret the meaning of a sentence.

How does lemmatization work?

Lemmatization takes a word and reduces it to its lemma. For example, the verb "walk" might appear as "walking," "walks" or "walked." Inflectional endings such as "s," "ed" and "ing" are removed, and lemmatization groups all of these forms under their lemma, "walk."

The word "saw" might be interpreted differently, depending on the sentence. As a verb, "saw" has the lemma "see"; as a noun, its lemma is "saw." In these cases, lemmatization attempts to select the right lemma based on the word's part of speech, the surrounding words and the sentence as a whole. Irregular forms are handled as well: "better," for example, is reduced to the lemma "good."
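One way to picture this part-of-speech dependence is a lookup keyed on both the word and its tag. The following is a toy sketch rather than how any particular library works: the `LEMMAS` table, the tag names and the `lemmatize` function are hypothetical, and a real system would get the tag from a part-of-speech tagger instead of from the caller.

```python
# Toy lemma table keyed on (word, part of speech). Entries are illustrative only.
LEMMAS = {
    ("saw", "VERB"): "see",     # "I saw the film."
    ("saw", "NOUN"): "saw",     # "Hand me the saw."
    ("better", "ADJ"): "good",  # "a better plan"
}


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for a word given its part-of-speech tag.

    Falls back to the lowercased word when the pair is not in the table.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("saw", "VERB"))    # see
print(lemmatize("saw", "NOUN"))    # saw
print(lemmatize("better", "ADJ"))  # good
```

The same surface form yields different lemmas only because the tag differs, which is why lemmatizers generally need part-of-speech information to resolve ambiguous words.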

A basic way to perform lemmatization is to use an algorithm based on dictionary lookups. This process requires a detailed dictionary so the algorithm can find a specific word and link it back to the word's lemma. More complicated word forms or languages can require a rule-based system for lemmatization.
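A minimal sketch of the dictionary-lookup approach, with a small rule-based fallback, might look like the following. Everything here is an assumption made for illustration: the `IRREGULAR` table, the `SUFFIX_RULES` list and the `VOCABULARY` set are tiny hypothetical stand-ins for the detailed dictionary a real lemmatizer would need.

```python
# Hypothetical exception dictionary for irregular forms.
IRREGULAR = {"went": "go", "better": "good", "mice": "mouse", "was": "be"}

# Ordered (suffix, replacement) rules, tried only when the dictionary has no entry.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]

# Small vocabulary used to check that a rule produced a known word.
VOCABULARY = {"walk", "study", "router", "go", "good", "mouse", "be"}


def lemmatize(word: str) -> str:
    """Look the word up first; fall back to suffix rules; otherwise return it unchanged."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept a rule only if it yields a word the vocabulary knows.
            if candidate in VOCABULARY:
                return candidate
    return word


for form in ["walking", "walks", "walked", "studies", "went", "bus"]:
    print(form, "->", lemmatize(form))
```

Checking each candidate against the vocabulary is what keeps a word like "bus" from being trimmed to "bu," at the cost of needing a vocabulary that covers the text being processed.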

Applications of lemmatization

Lemmatization is commonly applied in the following areas:

Artificial intelligence (AI).
Big data analytics.
Chatbots.
Machine learning (ML).
NLP.
Search queries.
Sentiment analysis.

Lemmatization can be applied in a number of different circumstances. For example, because search engine algorithms use lemmatization, end users can query any inflectional form of a word and still get relevant results. If the user queries the plural form of a word such as "routers," the search engine knows to also return relevant content that uses the singular form of the same word, "router."
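The "routers" example can be sketched as a small inverted index that stores lemmas instead of raw words. This is a simplified illustration, not a description of how any real search engine is built; the `LEMMAS` table, the sample documents and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lemma table; a real system would use a full lemmatizer.
LEMMAS = {
    "routers": "router",
    "switches": "switch",
    "configuring": "configure",
    "configured": "configure",
}


def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lemma to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[lemmatize(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return documents containing the lemma of every query term."""
    lemmas = [lemmatize(t) for t in tokenize(query)]
    if not lemmas:
        return set()
    result = set(index.get(lemmas[0], set()))
    for lemma in lemmas[1:]:
        result &= index.get(lemma, set())
    return result


docs = {
    1: "How to configure a router",
    2: "Routers and switches explained",
    3: "Choosing a firewall",
}
index = build_index(docs)
print(sorted(search(index, "routers")))      # [1, 2]
print(sorted(search(index, "configuring")))  # [1]
```

Because both the documents and the query pass through the same `lemmatize` step, the plural query matches the document that only contains the singular form.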

Lemmatization is an important part of natural language understanding and NLP, and also plays an important role in big data analytics and AI. For example, in big data analytics, lemmatization is used to normalize text documents.
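As a rough illustration of what normalizing text means in practice, the sketch below counts word frequencies after mapping each token to its lemma. The `LEMMAS` table and the sample sentences are made up for the example; without the normalization step, the three forms of "crash" would be counted as three separate words.

```python
from collections import Counter

# Hypothetical lemma table covering only the forms used in this example.
LEMMAS = {"crashed": "crash", "crashes": "crash", "crashing": "crash"}


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and replace each token with its lemma."""
    tokens = (w.strip(".,;:!?").lower() for w in text.split())
    return [LEMMAS.get(t, t) for t in tokens if t]


reviews = [
    "The app crashed twice.",
    "It crashes on startup.",
    "Still crashing!",
]

counts = Counter(lemma for review in reviews for lemma in normalize(review))
print(counts["crash"])  # 3
```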

Likewise, in NLP, lemmatization helps an AI or machine learning tool understand and converse with end users accurately. For example, sentiment analysis, which is used to identify the emotional tone behind a body of text, can use lemmatization to better determine meaning and emotional tone.

AI chatbots use lemmatization to help understand user inputs. Specifically, lemmatization helps a chatbot recognize which form a word takes in context, leading to a better understanding of sentences.

Lemmatization vs. stemming

In linguistics, lemmatization is closely related to stemming, as both reduce an inflected word to a base form by removing the prefixes and suffixes that have been added to it.

Stemming algorithms cut off the beginning or end of a word using a list of common prefixes and suffixes that might be part of an inflected word. This process is generally indiscriminate and can result in base forms of a word with incorrect spelling or meaning. Stemming operates without any contextual knowledge, meaning that it can't discern between similar words with different meanings.

For example, a stemmer might reduce both "studies" and "studying" to "studi," which is not a word, while in lemmatization the base form would be "study" for both. But lemmatization and stemming would produce the same base form, "walk," for the word "walking." While being less accurate, stemming is easier to implement and runs faster. An example of stemming and lemmatization is shown as follows:

Stemming:

Study → Studi

Studying → Studi

Studies → Studi

Studied → Studi

Studier → Studier

Lemmatization:

Study → Study

Studying → Study

Studies → Study

Studied → Study

Studier → Study

With stemming, most inflections of the word "study" become "studi," whereas with lemmatization most outputs become "study."
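The contrast can be reproduced with a few lines of code. The sketch below is a deliberately crude illustration: `crude_stem` is a hypothetical suffix stripper written to mimic the kind of output shown above, not an implementation of any published stemming algorithm, and `LEMMAS` is a hand-written lookup table standing in for a real lemmatizer.

```python
# Ordered (suffix, replacement) pairs for the toy stemmer; first match wins.
STEM_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hand-written lemma table standing in for a dictionary-based lemmatizer.
LEMMAS = {
    "study": "study",
    "studying": "study",
    "studies": "study",
    "studied": "study",
    "walking": "walk",
}


def crude_stem(word: str) -> str:
    """Strip one suffix without consulting a dictionary, then turn a final 'y' into 'i'."""
    word = word.lower()
    for suffix, replacement in STEM_RULES:
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word


def lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for form in ["study", "studying", "studies", "studied", "walking"]:
    print(f"{form:10} stem={crude_stem(form):8} lemma={lemmatize(form)}")
```

The stemmer never checks whether its output is a real word, which is how it arrives at "studi," while the lookup can only return forms that appear in its table.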

Lemmatization is more complex than stemming, as lemmatization requires words to be categorized by part of speech as well as by inflected form. English inflection is largely limited to singular and plural nouns, verb tense, and comparative and superlative forms of adverbs and adjectives, so the task can become considerably more complicated in languages with richer inflection.

Lemmatization advantages and disadvantages

Lemmatization offers the following benefits:

Accuracy. Lemmatization is much more accurate than stemming, as it's able to more precisely determine the lemma of a word.
Understanding text. Lemmatization helps NLP tools such as AI chatbots understand full-sentence input from end users. It also helps search tools return relevant results for specific queries.
Contextual understanding. Word by word, lemmatization can interpret a term based on how that word is used in context.

But lemmatization does have some drawbacks when compared to stemming. The morphological analysis that lemmatization conducts on each inflected word requires more computing resources, so lemmatization algorithms run slower than stemming algorithms.

This was last updated in March 2023
