Welcome to my blog! I initially started this blog as a way to document my Ph.D. research and the things I learned along the way. After 13 years of working in Text Mining, Natural Language Processing, Machine Learning and Search, I now use it as a platform to teach engineers, leaders, entrepreneurs and data scientists how to implement NLP systems that deliver.

I try to cut through the clutter on the Web: my tutorials are easy to consume and marry theory with practice, so you learn the intuition behind the design decisions. These articles have helped readers from around the world, and I hope they help you too! Browse the resources below to start building NLP systems that work!

To keep up with new articles, subscribe to my blog and follow me on Twitter.

Resources to bookmark

Hands-on tutorials

HashingVectorizer vs. CountVectorizer

Previously, we learned how to use CountVectorizer for text processing. In place of CountVectorizer, you also have the option of using HashingVectorizer. In this tutorial, we will learn how HashingVect…
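
To give a feel for the difference, here is a minimal sketch, using a made-up two-document corpus, of how the two scikit-learn vectorizers are typically used:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Toy corpus purely for illustration
docs = ["the cat sat on the mat", "the dog sat on the log"]

# CountVectorizer learns a vocabulary and produces interpretable counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())  # learned vocabulary (scikit-learn 1.0+)
print(X_counts.toarray())                 # one row of counts per document

# HashingVectorizer skips the vocabulary and hashes tokens into a fixed
# number of columns -- no fit needed, but you lose the token-to-column mapping
hash_vec = HashingVectorizer(n_features=16, alternate_sign=False)
X_hashed = hash_vec.transform(docs)
print(X_hashed.shape)  # (2, 16) regardless of vocabulary size
```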

10+ Examples for Using CountVectorizer

Scikit-learn’s CountVectorizer is used to transform a corpus of text into a vector of term/token counts. It also provides the capability to preprocess your text data prior to generating the vector re…
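
As a quick taste of the kind of examples covered there, this is a minimal sketch (toy documents, invented for illustration) showing counts with a couple of the built-in preprocessing options:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox", "The lazy dog", "The quick dog"]

# Lowercasing is on by default; here we also drop English stop words
# and generate unigrams plus bigrams
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # terms and bigrams kept
print(X.toarray())                         # one row per document
```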

Build Your First Text Classifier in Python with Logistic Regression

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I do…
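
At its core, such a classifier is just a vectorizer feeding a linear model. Here is a minimal sketch with a made-up spam/legit toy dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up examples: 1 = spam, 0 = legit
texts = ["win a free prize now", "meeting moved to 3pm",
         "claim your free reward", "lunch tomorrow?"]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["free prize waiting for you"]))  # likely [1] (spam) on this toy data
```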

Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why? Pre-trained word embeddings are vector representations of words trained on a large dataset. With pre-trained embeddings, you will essentially be using the weigh…
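
For a rough idea of how little code this takes, here is a minimal sketch using Gensim's downloader API; the model name below ("glove-wiki-gigaword-100") is one of the pre-packaged options and downloads on first use:

```python
import gensim.downloader as api

# Downloads and caches the vectors the first time it runs (can take a while)
model = api.load("glove-wiki-gigaword-100")

print(model["coffee"][:5])                   # first few dimensions of the word vector
print(model.most_similar("coffee", topn=3))  # nearest neighbours in embedding space
```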

How to Use Tfidftransformer & Tfidfvectorizer?

Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two modules …
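
A minimal sketch of the two routes, on a made-up three-document corpus, to show that with default settings they end up at the same TF-IDF matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Route 1: raw counts first, then TF-IDF weighting on top of the counts
counts = CountVectorizer().fit_transform(docs)
tfidf_a = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer does both steps in one shot
tfidf_b = TfidfVectorizer().fit_transform(docs)

# With default settings the results match
print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))  # True
```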

Word Cloud in Python for Jupyter Notebooks and Web Apps

About a year ago, I looked high and low for a Python word cloud library that I could use from within my Jupyter notebook, one flexible enough to use counts or TF-IDF when needed or just accept a se…
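
In case it helps to picture the workflow, here is a minimal sketch using the wordcloud package with a hand-made frequency dictionary (the counts could just as well come from CountVectorizer or TF-IDF scores):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Made-up frequencies purely for illustration
word_freqs = {"python": 50, "nlp": 35, "tfidf": 20, "tokens": 10}

wc = WordCloud(width=600, height=400, background_color="white")
wc.generate_from_frequencies(word_freqs)

# In a Jupyter notebook this renders inline
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```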

Tips from first-hand experience

Text Classification: Best Practices for Real World Applications

Most text classification examples that you see on the Web or in books focus on demonstrating techniques. This will help you build a pseudo-usable prototype. If you want to take your classifier to the …

All you need to know about Text Preprocessing for Machine Learning & NLP

Learn what text preprocessing is, the different techniques for text preprocessing and a way to estimate…

Industrial Strength Natural Language Processing

Having spent a big part of my career as a graduate student researcher and now a Data Scientist in the industry, I have come to realize that a vast majority of solutions proposed both in academic resea…

Foundational concepts

All you need to know about Text Preprocessing for Machine Learning & NLP

Learn what text preprocessing is, the different techniques for text preprocessing and a way to estimate…

What is Inverse Document Frequency (IDF)?

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word…
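
A tiny worked example, using one common formulation idf(t) = log(N / df(t)) and made-up document frequencies:

```python
import math

# Toy collection of N = 10 documents with made-up document frequencies
N = 10
doc_freq = {"the": 10, "model": 4, "hashing": 1}

for term, df in doc_freq.items():
    idf = math.log(N / df)  # idf(t) = log(N / df(t)); other variants add smoothing
    print(f"{term}: idf = {idf:.2f}")
# 'the' -> 0.00 (appears everywhere), 'hashing' -> 2.30 (rare, hence more informative)
```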

What is Term-Frequency?

Term frequency (TF), often used in Text Mining, NLP and Information Retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms corresp…
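
A minimal sketch of the idea on a single made-up document, showing both raw and length-normalized counts:

```python
from collections import Counter

# One toy document, already tokenized by whitespace
doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)
total = len(doc)

for term, count in counts.items():
    # Raw TF is the count; a common variant normalizes by document length
    print(f"{term}: raw tf = {count}, normalized tf = {count / total:.2f}")
```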

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typi…
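
A minimal sketch of the sliding-window idea (the ngrams helper below is just an illustrative function, not from any particular library):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(ngrams(tokens, 2))  # bigrams: ('natural', 'language'), ('language', 'processing'), ...
print(ngrams(tokens, 3))  # trigrams
```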

What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metri…
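
To make the recall-oriented flavour concrete, here is a bare-bones sketch of ROUGE-N recall (overlapping n-grams divided by the n-grams in the reference); real ROUGE implementations add stemming, multiple references, precision and F-scores, and more:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Bare-bones ROUGE-N recall: clipped n-gram overlap / n-grams in reference."""
    def ngram_counts(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat was under the bed",
                     "the cat was found under the bed"))  # about 0.86
```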

What is text similarity?

When talking about text similarity, different people have a slightly different notion of what text similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in …
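
One common, simple way to make that notion concrete is to represent each text as a TF-IDF vector and compare vectors with cosine similarity; the toy sentences below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["How do I reset my password?",
         "I forgot my password and need to reset it",
         "What is the weather like today?"]

# Each text becomes a TF-IDF vector; cosine similarity compares every pair
vectors = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(vectors)

# Expect a relatively high score for the two password questions
# and low scores against the weather question
print(sims.round(2))
```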