Priyanka Suduge

Stemming vs Lemmatization - Which is better?

Stemming is the process of removing or replacing suffixes to reduce a word to its root form, which is called the 'stem'. For example, the two words 'walked' and 'walking' share the same stem, which is 'walk'. The two words 'wolf' and 'wolves', however, do not end up with the same stem, because a typical stemmer reduces 'wolves' to 'wolv'. A stem does not have to be a valid dictionary word.
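To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy written for this post, not the Porter algorithm or any library's stemmer; the suffix list and the `toy_stem` name are assumptions chosen only to reproduce the examples above.

```python
# Toy suffix-stripping stemmer: illustrative only, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("walked", "walking", "wolf", "wolves"):
    print(w, "->", toy_stem(w))
```

Because the rules only look at the end of the string, 'wolf' is left alone while 'wolves' loses its 'es', so the two words are never merged.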


Lemmatization is a somewhat more advanced mechanism than stemming. It performs morphological analysis and consults a vocabulary before returning the base form or dictionary form of a word, which is called the 'lemma'. So with lemmatization, the two words 'wolf' and 'wolves' both end up with a single lemma, which is 'wolf'.
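The role of the vocabulary can be sketched in a few lines. This is a toy under stated assumptions: the `VOCABULARY` set, the `RULES` table, and the `toy_lemmatize` name are all made up for illustration, and a real lemmatizer relies on a full dictionary and usually on part-of-speech information as well.

```python
# Toy dictionary-backed lemmatizer: illustrative only.
VOCABULARY = {"wolf", "walk", "leaf", "cat"}  # hypothetical mini-dictionary

# (suffix to remove, replacement) pairs, tried in order
RULES = (("ves", "f"), ("ing", ""), ("ed", ""), ("s", ""))


def toy_lemmatize(word: str) -> str:
    """Return a rewritten form only if it is a known dictionary word."""
    word = word.lower()
    if word in VOCABULARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown words are returned unchanged


for w in ("wolf", "wolves", "walked", "walking"):
    print(w, "->", toy_lemmatize(w))
```

The key difference from plain suffix stripping is the vocabulary check: a candidate is accepted only when it is a real word, which is why 'wolves' maps to 'wolf' rather than 'wolv'.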


Both stemming and lemmatization are token normalization techniques. We don't use both at the same time, as it makes no sense to apply two different techniques to achieve the same goal. In general, lemmatization provides better token normalization and leads to models with higher accuracy. But that is not always the case. Sometimes stemming can outperform lemmatization, though it's not very common.


So which one should you pick for your NLP task? I recommend starting with lemmatization and building your model. Once you have completed the entire process, including model training and testing, you can measure the performance of your model. Then consider replacing only the lemmatization step with stemming. You don't have to change the other parts of your pipeline, as this only affects text pre-processing. If you achieve better performance with stemming, then go for it.
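One way to keep that swap cheap is to pass the normalizer into the pipeline as a parameter. The sketch below is only an illustration: `toy_stem`, `toy_lemmatize`, the overlap-based classifier in `evaluate`, and the four example sentences are all hypothetical stand-ins for your real normalizers, model, and data.

```python
from typing import Callable, Dict, List, Set, Tuple

Normalizer = Callable[[str], str]
Dataset = List[Tuple[str, str]]  # (text, label) pairs

# Hypothetical stand-ins for a real lemmatizer and stemmer.
LEMMAS = {"wolves": "wolf", "walked": "walk", "walking": "walk",
          "hunting": "hunt", "hunted": "hunt"}


def toy_lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)


def toy_stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str, normalize: Normalizer) -> List[str]:
    """Tokenize on whitespace, then apply the chosen normalizer."""
    return [normalize(token) for token in text.lower().split()]


def evaluate(train: Dataset, test: Dataset, normalize: Normalizer) -> float:
    """Accuracy of a toy model: predict the label sharing the most tokens."""
    label_tokens: Dict[str, Set[str]] = {}
    for text, label in train:
        label_tokens.setdefault(label, set()).update(preprocess(text, normalize))
    correct = 0
    for text, label in test:
        tokens = set(preprocess(text, normalize))
        predicted = max(label_tokens, key=lambda lab: len(tokens & label_tokens[lab]))
        correct += predicted == label
    return correct / len(test)


train = [("wolves hunting in packs", "animals"), ("walking shoes on sale", "shopping")]
test = [("a wolf hunted alone", "animals"), ("she walked to the sale", "shopping")]

# Everything stays fixed except the normalizer argument.
for name, normalize in (("lemmatization", toy_lemmatize), ("stemming", toy_stem)):
    print(name, evaluate(train, test, normalize))
```

With data this small the scores say nothing about which technique is better; the point is only that the comparison changes a single argument and leaves the rest of the pipeline untouched.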


There are many different algorithms for stemming and lemmatization. So make sure that you try out at least a few from each category before coming to a conclusion.

