Build Beautiful NLP Applications

Welcome to my blog! I initially started this blog as a way to document my Ph.D. research work and the things I learned along the way. Now, after 13 years of working in Text Mining, Applied NLP and Search, I use it as a platform to teach software engineers and data scientists how to implement NLP systems that work. I try to cut through the clutter on the Web by making my tutorials easy to consume and by marrying theory with practice, so that you learn the intuition behind the design decisions. These articles have helped developers and data scientists across the globe, and I hope they help you too! Browse the resources below to start building NLP systems that work!

Links to bookmark

Hands-on tutorials

How to Correctly Use CountVectorizer? An In-Depth Look

Scikit-learn’s CountVectorizer is used to generate counts of terms / tokens from a corpus of text while also providing the capability to preprocess the text data simultaneously. For example, given t…

Build Your First Text Classifier in Python with Logistic Regression

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I …

Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why? Pre-trained word embeddings are vector representations of words trained on a large dataset. With pre-trained embeddings, you will essentially be using the weigh…

How to Use Tfidftransformer & Tfidfvectorizer?

Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two modules …

Word Cloud in Python for Jupyter Notebooks and Web Apps

About a year ago, I looked high and low for a Python word cloud library that I could use from within my Jupyter notebook that was flexible enough to use counts or TF-IDF when needed or just accept a se…

Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn

In this era of using Deep Learning for everything, you may be wondering why you would even use TF-IDF for any task at all. The truth is TF-IDF is easy to understand, easy to compute and is one of the…

Tips from first hand experience

All you need to know about Text Preprocessing for Machine Learning & NLP

Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realiz…

Industrial Strength Natural Language Processing

Having spent a big part of my career as a graduate student researcher and now a Data Scientist in the industry, I have come to realize that a vast majority of solutions proposed both in academic resea…

Tips for Constructing Custom Stop Word Lists

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop w…

Foundational concepts

All you need to know about Text Preprocessing for Machine Learning & NLP

Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realiz…

What is Inverse Document Frequency (IDF)?

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the wor…
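
As a rough illustration of the idea that more widespread usage means a lower score, here is a minimal sketch. It assumes the plain textbook variant idf = log(N / df) and naive whitespace tokenization; the function name `inverse_document_frequency` and the sample documents are made up for this example, and libraries such as scikit-learn use a smoothed formula instead.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: idf} using idf = log(N / df) (an assumed, unsmoothed variant)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, no matter how often it repeats.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf = inverse_document_frequency(docs)
```

With this toy corpus, "the" appears in every document and gets the lowest score, while "bird" appears in only one and gets the highest.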

What is Term-Frequency?

Term frequency (TF), often used in Text Mining, NLP and Information Retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms corres…
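
To make "how frequently a term occurs in a document" concrete, here is a small sketch. It assumes terms are lowercase whitespace-separated words and normalizes counts by document length, which is one common convention among several; `term_frequency` is a hypothetical helper name.

```python
from collections import Counter


def term_frequency(document):
    """Return {term: count / total_terms} for a single document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # An empty document yields an empty dict, so no division by zero occurs.
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
```

Here "the" occurs twice among six tokens, so its relative frequency is twice that of every other term in the sentence.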

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window and when computing the n-grams you typi…
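
A minimal sketch of word-level n-grams, assuming whitespace tokenization and a window that slides one word at a time. The helper name `word_ngrams` and the sample sentence are illustrative only.

```python
def word_ngrams(text, n):
    """Return the list of n-word tuples found by sliding a window over the text."""
    tokens = text.split()
    # A text with fewer than n tokens produces an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("the cow jumps over the moon", 2)
trigrams = word_ngrams("the cow jumps over the moon", 3)
```

A sentence of six words yields five bigrams and four trigrams, since each step moves the window forward by one word.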

What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by …
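
As a hedged sketch of one metric in the family, the snippet below computes ROUGE-N recall as it is commonly described: the number of overlapping n-grams (with clipped counts) divided by the number of n-grams in the reference summary. It assumes lowercase whitespace tokenization and a single reference; `rouge_n_recall` and the example sentences are illustrative, not an official implementation.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""

    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngram_counts(candidate)
    ref = ngram_counts(reference)
    if not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values())


system_summary = "the cat was found under the bed"
reference_summary = "the cat was under the bed"
rouge_1 = rouge_n_recall(system_summary, reference_summary, n=1)
rouge_2 = rouge_n_recall(system_summary, reference_summary, n=2)
```

In this example every reference unigram appears in the system summary, while only four of the five reference bigrams do, so the bigram score is lower.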

What is text similarity?

When talking about text similarity, different people have slightly different notions of what text similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in …
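
One simple way to quantify how 'close' two texts are is Jaccard similarity over their word sets. This is only one assumed notion of closeness, based on surface word overlap; the function name `jaccard_similarity` and the sample sentences are made up for illustration.

```python
def jaccard_similarity(text_a, text_b):
    """Shared unique words divided by all unique words across both texts."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        # Treat two empty texts as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


sim = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
```

The two sentences share three of seven unique words. A score of 1.0 means identical vocabularies and 0.0 means no words in common.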