Kavita’s Articles

These are the most recent articles that I’ve written. To receive notifications of new articles, you can subscribe to my blog.

HashingVectorizer vs. CountVectorizer

Previously, we learned how to use CountVectorizer for text processing. In place of CountVectorizer, you also have the option of using HashingVectorizer. In this tutorial, we will learn how HashingVectorizer differs from CountVectorizer and when to use…

What is Term-Frequency?

Term frequency (TF), often used in text mining, NLP, and information retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since…
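As a quick illustration of term frequency, here is a minimal sketch in plain Python. It assumes simple lowercasing and whitespace tokenization, and the `term_frequency` helper and the sample sentence are made up for this example rather than taken from the article.

```python
from collections import Counter


def term_frequency(document):
    """Count how often each term occurs in a single document.

    Assumption: terms are lowercased, whitespace-separated words.
    """
    tokens = document.lower().split()
    return Counter(tokens)


# Hypothetical sample document, used only for illustration.
tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2
print(tf["cat"])  # 1
```

Raw counts like these are often normalized by document length, but that step is left out to keep the sketch short.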

What are N-Grams?

N-grams of text are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward…
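The sliding window idea can be sketched in a few lines of Python. This is only an illustration: it assumes whitespace tokenization, and the `word_ngrams` helper and the sample sentence are hypothetical names chosen for this example.

```python
def word_ngrams(text, n):
    """Return word-level n-grams, moving the window one word forward each step.

    Assumption: words are separated by whitespace.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, used only for illustration.
bigrams = word_ngrams("the cow jumps over the moon", 2)
print(bigrams)
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
```

A sentence with 6 words yields 5 bigrams, and in general a text with k words yields k - n + 1 n-grams when k is at least n.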

Industrial Strength Natural Language Processing

Having spent a big part of my career as a graduate student researcher and now a Data Scientist in the industry, I have come to realize that a vast majority of solutions proposed both in academic research…

What is Document Frequency (DF)?

Document frequency is the number of documents containing a particular term. Based on Figure 1, the word cent has a document frequency of 1. Even though it appeared 3 times, it appeared 3 times in only one…
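To make the distinction between counting occurrences and counting documents concrete, here is a small Python sketch. The three-document corpus and the `document_frequency` helper are invented for this example (they are not the data behind Figure 1), and the code assumes lowercased, whitespace-separated tokens.

```python
def document_frequency(term, documents):
    """Count the documents that contain the term at least once.

    Assumption: documents are lowercased and split on whitespace.
    """
    return sum(1 for doc in documents if term in doc.lower().split())


# Hypothetical corpus: "cent" occurs 3 times, but only in the first document.
docs = [
    "a cent saved is a cent earned every cent counts",
    "a dollar saved is a dollar earned",
    "spare change adds up",
]

print(document_frequency("cent", docs))    # 1
print(document_frequency("dollar", docs))  # 1
print(document_frequency("a", docs))       # 2
```

The word "cent" appears three times overall, yet its document frequency is 1 because all three occurrences sit in the same document.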

Tips for Constructing Custom Stop Word Lists

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop words are used to…

How to Use Rouge 2.0?

ROUGE 2.0 is an easy-to-use evaluation toolkit for automatic summarization tasks. It uses the ROUGE system of metrics, which works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). ROUGE…