
10+ Examples for Using CountVectorizer

Scikit-learn’s CountVectorizer is used to transform a corpus of text into a vector of term / token counts. It also lets you preprocess your text data before generating the vector representation, making it a highly flexible feature representation module for text.

In this article, we are going to go in-depth into the different ways you can use CountVectorizer such that you are not just computing counts of words, but also preprocessing your text data appropriately as well as extracting additional features from your text dataset.

Example of How CountVectorizer Works

To show you an example of how CountVectorizer works, let’s take the book title below (for context: this is part of a book series that kids love) :

This text is transformed to a sparse matrix as shown in Figure 1(b) below:

Figure 1: CountVectorizer sparse matrix representation of words. (a) is how you visually think about it. (b) is how it is really represented in practice.

Notice that here we have 9 unique words. So 9 columns. Each column in the matrix represents a unique word in the vocabulary, while each row represents the document in our dataset. In this case, we only have one book title (i.e. the document), and therefore we have only 1 row. The values in each cell are the word counts. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

While visually it’s easy to think of a word matrix representation as Figure 1 (a), in reality, these words are transformed to numbers and these numbers represent positional index in the sparse matrix as seen in Figure 1(b).

Why the sparse matrix format?

With CountVectorizer we are converting raw text to a numerical vector representation of words and n-grams. This makes it easy to use this representation directly as features (signals) in Machine Learning tasks such as text classification and clustering.

Note that these algorithms only understand numerical features, irrespective of the underlying data type (text, image pixels, numbers, categories, etc.), which is what allows us to perform complex machine learning tasks on different types of data.

Side Note: If all you are interested in are word counts, then you can get away with using Python’s collections.Counter. There is no real need to use CountVectorizer. If you still want to do it, here’s the example for extracting counts with CountVectorizer.
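As a rough illustration of the Counter route mentioned in the side note, here is a minimal sketch. The title string and the simple regex tokenization are assumptions made for this example, not the article’s own data.

```python
import re
from collections import Counter

# Hypothetical book title used only for illustration.
title = "One Fish, Two Fish, Red Fish, Blue Fish"

# Lowercase the text and split it into word tokens.
tokens = re.findall(r"\w+", title.lower())

# Counter maps each token to the number of times it occurs.
counts = Counter(tokens)
print(counts.most_common(1))
```

This gives you plain word counts, but none of the vocabulary handling, stop word filtering or sparse matrix output that CountVectorizer provides.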

Dataset & Imports

In this tutorial, we will be using the titles of 5 Cat in the Hat books (as seen below).

I have intentionally kept it to a handful of short texts so that you can see how to put CountVectorizer to full use in your applications. Keep in mind that each title above is considered a document.

CountVectorizer Plain and Simple

What happens above is that the 5 book titles are preprocessed, tokenized and represented as a sparse matrix as explained in the introduction. By default, CountVectorizer does the following:

  • lowercases your text (set lowercase=False if you don’t want lowercasing)
  • uses utf-8 encoding
  • performs tokenization (converts raw text to smaller units of text)
  • uses word level tokenization (meaning each word is treated as a separate token)
  • ignores single characters during tokenization (say goodbye to words like ‘a’ and ‘I’)

Now, let’s look at the vocabulary (collection of unique words from our documents):

As we are using all the defaults, these are all word level tokens, lowercased. Note that the numbers here are not counts, they are the position in the sparse vector. Now, let’s check the shape:

We have 5 (rows) documents and 43 unique words (columns)!
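Since the original snippets are not reproduced here, the following is a minimal sketch of the default usage. The five short titles are hypothetical stand-ins (an assumption for this example), so the vocabulary size differs from the 43 words reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

# All defaults: lowercasing, word-level tokens, single characters ignored.
cv = CountVectorizer()
count_matrix = cv.fit_transform(docs)

# vocabulary_ maps each word to its column index (not to a count).
print(cv.vocabulary_)

# One row per document, one column per unique word.
print(count_matrix.shape)
```

Note that the single-character word "a" in the last title does not make it into the vocabulary, which is the default tokenization behavior described in the list above.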

CountVectorizer and Stop Words

Now, the first thing you may want to do is eliminate stop words from your text, as they have limited predictive power and may not help with downstream tasks such as text classification. Stop word removal is a breeze with CountVectorizer and it can be done in several ways:

  1. Use a custom stop word list that you provide
  2. Use sklearn’s built in English stop word list (not recommended)
  3. Create corpus-specific stop words using max_df and min_df (highly recommended and will be covered later in this tutorial)

Let’s look at the 3 ways of using stop words.

Custom Stop Word List

In this example, we provide a list of words that act as our stop words. Notice that the shape has gone from (5,43) to (5,40) because of the stop words that were removed. Note that we can actually load stop words directly from a file into a list and supply that as the stop word list.

To check the stop words that are being used (when explicitly specified), simply access cv.stop_words.

While cv.stop_words gives you the stop words that you explicitly specified as shown above, cv.stop_words_ (note: with underscore suffix) gives you the stop words that CountVectorizer inferred from your min_df and max_df settings as well as those that were cut off during feature selection (through the use of max_features). So far, we have not used the three settings, so cv.stop_words_ will be empty.
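A minimal sketch of the custom stop word list, using hypothetical titles and a hypothetical three-word stop list (both are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

# Explicit stop word list; in practice this could be loaded from a file.
my_stop_words = ["the", "in", "and"]

cv = CountVectorizer(stop_words=my_stop_words)
count_matrix = cv.fit_transform(docs)

# The stop words we passed in are available on the stop_words parameter.
print(cv.stop_words)

# Three fewer columns than with the defaults on the same documents.
print(count_matrix.shape)
```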

Stop Words using MIN_DF

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, in your text you may have names of people that appear in only 1 or 2 documents. In some applications, this may qualify as noise and could be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, MIN_DF looks at how many documents contained a term, better known as document frequency. The MIN_DF value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.25, meaning ignore words that appeared in fewer than 25% of the documents).

Eliminating words that appeared in fewer than 2 documents:

Now, to see which words have been eliminated, you can use cv.stop_words_ as this was internally inferred by CountVectorizer (see output below).

Yikes! We removed everything? Not quite. However, most of our words have become stop words and that’s because we have only 5 book titles.

To see what’s remaining, all we need to do is check the vocabulary again with cv.vocabulary_ (see output below):

Sweet! These are the words that met the min_df threshold of 2 documents. In this dataset, they happen to be the words that appeared in all 5 book titles.
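Here is a minimal sketch of min_df on hypothetical titles (an assumption for this example). To stay independent of the library version, the removed words are computed by comparing against an unrestricted vocabulary; the article’s cv.stop_words_ attribute is the more direct route where it is available.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

# Vocabulary without any document frequency restriction, for comparison.
full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# Keep only words that appear in at least 2 documents.
cv = CountVectorizer(min_df=2)
cv.fit(docs)

kept = set(cv.vocabulary_)
removed = full_vocab - kept

print(sorted(kept))
print(sorted(removed))
```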

Stop Words using MAX_DF

Just as we ignored words that were too rare with MIN_DF, we can ignore words that are too common with MAX_DF. MAX_DF looks at how many documents contained a term, and if that number exceeds the MAX_DF threshold, the term is eliminated from consideration. The MAX_DF value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.85, meaning ignore words that appeared in more than 85% of the documents as they are too common).

I’ve typically used a value from 0.75-0.85 depending on the task and for more aggressive stop word removal you can even use a smaller value.

Now, to see which words have been eliminated, you can use cv.stop_words_ (see output below):

In this example, all words that appeared in all 5 book titles have been eliminated.
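A minimal sketch of max_df, again on hypothetical titles (an assumption). With 5 documents and max_df=0.75, any word found in more than 3.75 documents, that is in 4 or more, is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

# Vocabulary without any document frequency restriction, for comparison.
full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# Ignore words that appear in more than 75% of the documents.
cv = CountVectorizer(max_df=0.75)
cv.fit(docs)

too_common = full_vocab - set(cv.vocabulary_)
print(sorted(too_common))
```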

Why document frequency for eliminating words?

Document frequency is sometimes a better way of inferring stop words than term frequency, as term frequency can be misleading. For example, let’s say 1 document out of the 250,000 documents in your dataset contains 500 occurrences of the word catnthehat. If you use term frequency for eliminating rare words, the count is so high that the word may never fall below your threshold for elimination. Yet the word is still rare, as it appears in only one document.
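The difference between the two counts can be sketched in plain Python. The three tiny tokenized documents below are made up for illustration, scaled down from the 250,000-document scenario above.

```python
from collections import Counter

# Made-up tokenized documents; the first one repeats a single word 500 times.
docs = [
    ["catnthehat"] * 500 + ["story"],
    ["story", "about", "cats"],
    ["another", "story"],
]

# Term frequency: total occurrences of each word across the whole dataset.
term_freq = Counter(token for doc in docs for token in doc)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(token for doc in docs for token in set(doc))

print(term_freq["catnthehat"], doc_freq["catnthehat"])
print(term_freq["story"], doc_freq["story"])
```

By term frequency, catnthehat looks like the most important word in the dataset; by document frequency, it is as rare as a word can be.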

On several occasions, such as in building topic recommendation systems, I’ve found that using document frequency for eliminating rare and common terms gives far better results than relying on just overall term frequency.

Custom Tokenization

The default tokenization in CountVectorizer removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.

In the example below, we provide a custom tokenizer using tokenizer=my_tokenizer where my_tokenizer is a function that attempts to keep all punctuation, and special characters and tokenizes only based on whitespace.

Fantastic, now we have our punctuation, single characters and special characters!
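A minimal sketch of a whitespace-only tokenizer. The function body and the two example strings are assumptions; the article’s own my_tokenizer is not shown here. Setting token_pattern=None simply tells CountVectorizer that the default pattern is intentionally unused.

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_tokenizer(text):
    # Split on whitespace only, so punctuation and single characters survive.
    return text.split()

# Hypothetical example strings.
docs = ["Oh, the Things You Can Do!", "A cat & a hat?"]

cv = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)
count_matrix = cv.fit_transform(docs)

# Lowercasing still happens before the tokenizer is called.
print(sorted(cv.vocabulary_))
```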

Custom Preprocessing

In many cases, we want to preprocess our text prior to creating a sparse matrix of terms. As I’ve explained in my text preprocessing article, preprocessing helps reduce noise and sparsity, resulting in a more accurate analysis.

Here is an example of how you can achieve custom preprocessing with CountVectorizer by setting preprocessor=<some_preprocessor>.

In the example above, my_cool_preprocessor is a predefined function where we perform the following steps:

  1. lowercase the text (note: this is done by default if a custom preprocessor is not specified)
  2. remove special characters
  3. normalize certain words
  4. use stems of words instead of the original form (see: preprocessing article on stemming)

You can introduce your very own preprocessing steps such as lemmatization, adding parts-of-speech and so on to make this preprocessing step even more powerful.
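The four steps above can be sketched as follows. This is a hypothetical version of my_cool_preprocessor, not the article’s own: the normalization table is made up, and the suffix stripping is a crude stand-in for a real stemmer.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Made-up normalization table for illustration.
NORMALIZE = {"kitty": "cat", "kitten": "cat"}

def crude_stem(word):
    # Crude stand-in for a real stemmer: strip a trailing "ing" or "s"
    # as long as at least 3 characters remain.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def my_cool_preprocessor(text):
    text = text.lower()                               # 1. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # 2. remove special characters
    words = [NORMALIZE.get(w, w) for w in text.split()]  # 3. normalize certain words
    words = [crude_stem(w) for w in words]            # 4. use (crude) stems
    return " ".join(words)

# Hypothetical example strings.
docs = ["The Kitty's HATS!", "Cats wearing hats"]

cv = CountVectorizer(preprocessor=my_cool_preprocessor)
cv.fit(docs)
print(sorted(cv.vocabulary_))
```

Keep in mind that once you pass your own preprocessor, you are responsible for lowercasing inside it.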

Working With N-Grams

One way to enrich the representation of your features for tasks like text classification is to use n-grams where n > 1. The intuition here is that bi-grams and tri-grams can capture contextual information that unigrams alone cannot. In addition, for tasks like keyword extraction, unigrams alone, while useful, provide limited information. For example, good food carries more meaning than good and food observed independently.

Working with n-grams is a breeze with CountVectorizer. You can use word level n-grams or even character level n-grams (very useful in some text classification tasks). Here are a few examples:

Word level – bigrams only

Word level – unigrams and bigrams

Character level – bigrams only
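The three configurations listed above can be sketched with the ngram_range and analyzer parameters. The two example titles are hypothetical, and the choice of analyzer="char_wb" (character n-grams that do not cross word boundaries) is an assumption; analyzer="char" is the other option.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example titles.
docs = ["the cat in the hat", "the cat in the hat comes back"]

# Word level, bigrams only.
cv_bigrams = CountVectorizer(ngram_range=(2, 2))
cv_bigrams.fit(docs)

# Word level, unigrams and bigrams.
cv_uni_bi = CountVectorizer(ngram_range=(1, 2))
cv_uni_bi.fit(docs)

# Character level, bigrams only, built inside word boundaries.
cv_char = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
cv_char.fit(docs)

print(sorted(cv_bigrams.vocabulary_))
print(sorted(cv_uni_bi.vocabulary_))
print(sorted(cv_char.vocabulary_))
```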

Limiting Vocabulary Size

When your feature space gets too large, you can limit its size by putting a restriction on the vocabulary size. Say you want a max of 10,000 n-grams. CountVectorizer will keep the top 10,000 most frequent n-grams and drop the rest.

Since we have a toy dataset, in the example below, we will limit the number of features to 10.

Notice that the shape now is (5,10) as we asked for a limit of 10 on the vocabulary size. You can check the removed words using cv.stop_words_.
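A minimal sketch of max_features on hypothetical titles (an assumption). The stand-in dataset has only 12 unique words, so the limit is set to 3 here rather than 10 to make the effect visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

# Keep only the 3 terms with the highest total count across the corpus.
cv = CountVectorizer(max_features=3)
count_matrix = cv.fit_transform(docs)

print(count_matrix.shape)
print(sorted(cv.vocabulary_))
```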

Ignore Counts and Use Binary Values

By default, CountVectorizer uses the counts of terms/tokens. However, you can choose to just use presence or absence of a term instead of the raw counts. This is useful in some tasks such as certain features in text classification where the frequency of occurrence is insignificant. To get binary values instead of counts all you need to do is set binary=True.

If you set binary=True then CountVectorizer no longer uses the counts of terms/tokens. If a token is present in a document, it is 1, if absent it is 0 regardless of its frequency of occurrence. By default, binary=False.
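To make the difference concrete, here is a small sketch comparing the two settings on a single hypothetical title in which "the" occurs twice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example title; "the" appears twice.
docs = ["the cat in the hat"]

count_cv = CountVectorizer()             # binary=False by default
binary_cv = CountVectorizer(binary=True)

counts = count_cv.fit_transform(docs)
flags = binary_cv.fit_transform(docs)

the_idx = count_cv.vocabulary_["the"]
print(counts[0, the_idx])   # raw count
print(flags[0, binary_cv.vocabulary_["the"]])  # presence flag
```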

Using CountVectorizer to Extract N-Gram / Term Counts

Finally, you may want to use CountVectorizer to obtain counts of your n-grams. This is slightly tricky to do with CountVectorizer, but achievable as shown below:

The counts are first ordered in descending order. Then from this list, each feature name is extracted and returned with corresponding counts.
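One way to do this, sketched under the assumption of hypothetical titles, is to sum the columns of the count matrix and pair each total with its term through vocabulary_. The article’s own helper is not shown here, so treat this as an illustration rather than the original code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in documents, not the titles used in the article.
docs = [
    "the cat in the hat",
    "the cat in the hat comes back",
    "green eggs and ham",
    "the cat and the fish",
    "a hat for the cat",
]

cv = CountVectorizer()
count_matrix = cv.fit_transform(docs)

# Sum each column to get the total count of every term across all documents.
totals = np.asarray(count_matrix.sum(axis=0)).ravel()

# Invert vocabulary_ so that a column index leads back to its term.
index_to_term = {idx: term for term, idx in cv.vocabulary_.items()}

# Pair terms with counts, then sort by descending count (ties by term).
term_counts = [(index_to_term[i], int(totals[i])) for i in range(len(totals))]
term_counts.sort(key=lambda pair: (-pair[1], pair[0]))

print(term_counts[:3])
```

The same approach works for n-gram counts if you set ngram_range on the vectorizer.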


CountVectorizer provides a powerful way to extract and represent features from your text data. It allows you to control your n-gram size, perform custom preprocessing, custom tokenization, eliminate stop words and limit vocabulary size.

While counts of words can be useful signals by themselves, in some cases, you will have to use alternative schemes such as TF-IDF to represent your features. For some applications, a binary bag of words representation may also be more effective than counts. For a more sophisticated feature representation, people use word, sentence and paragraph embeddings trained using algorithms like word2vec, BERT and ELMo where each textual unit is encoded using a fixed length vector.

To rerun some of the examples in this tutorial, get the Jupyter notebook for this article. If there is anything that I missed out here, do feel free to leave a comment below.

See Also: Learning How To Build Your First Text Classifier


Recommended reading


Build Your First Text Classifier in Python with Logistic Regression

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I don’t see or hear from spammy emails!

Spam classification

Other than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles and more.

Text classifiers work by leveraging signals in the text to “guess” the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like slow, problem, wouldn't and not can bias the classifier to predict negative sentiment.

The nice thing about text classification is that you have a range of options in terms of what approaches you could use, from unsupervised rules-based approaches to supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.

In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem. The problem while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).

Here’s the full source code with accompanying dataset for this tutorial. Note that this is a fairly long tutorial and I would suggest that you break it down to several sessions so that you completely grasp the concepts. 

HuffPost Dataset

The dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018 as seen below. This dataset has about 125,000 articles and 31 different categories.

Figure 1: Articles distribution from 2014-2018

Now let’s look at the category distribution of these articles (Figure 2). Notice that politics has the largest number of articles, while education has the fewest, numbering only in the hundreds.

So, nothing surprising in the category distribution, other than that we have far fewer articles to learn from in categories outside POLITICS.

Figure 2: Number of articles per category

Now, let’s take a quick peek at the dataset (Figure 3).

Figure 3: Sneak peek of the news dataset

Notice that the fields we have in order to learn a classifier that predicts the category include headline, short_description, link and authors.

The Challenge

As mentioned earlier, the problem that we are going to be tackling is to predict the category of news articles (as seen in Figure 3), using only the description, headline and the url of the articles. 

Without the actual content of the article itself, the data that we have for learning is actually pretty sparse – a problem you may encounter in the real world. But let’s see if we can still learn from it reasonably well. We will not use the author field because we want to test it on articles from a different news organization, specifically from CNN. 

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, Logistic Regression is very effective on text data, and the underlying algorithm is also fairly easy to understand. More importantly, in the NLP world, it’s generally accepted that Logistic Regression is a great starter algorithm for text-related classification.

Feature Representation

Features are attributes (signals) that help the model learn. This can be specific words from the text itself (e.g. all words, top occurring terms, adjectives) or additional information inferred based on the original text (e.g. parts-of-speech, contains specific phrase patterns, syntactic tree structure).

For this task, we have text fields that are fairly sparse to learn from. Therefore, we will try to use all words from several of the text fields. This includes the description, headline and tokens from the url. The more advanced feature representation is something you should try as an exercise.

Feature Weighting

Not all words are equally important to a particular document / category. For example, while words like ‘murder’, ‘knife’ and ‘abduction’ are important to a crime related document, words like ‘news’ and ‘reporter’ may not be quite as important. 

In this tutorial, we will be experimenting with 3 feature weighting approaches. The most basic form of feature weighting, is binary weighting. Where if a word is present in a document, the weight is ‘1’ and if the word is absent the weight is ‘0’. 

Another way to assign weights is using the term-frequency of words (the counts). If a word like ‘knife’ appears 5 times in a document, that can become its corresponding weight. 

We will also be using TF-IDF weighting where words that are unique to a particular document would have higher weights compared to words that are used commonly across documents. 

There are of course many other methods for feature weighting. The approaches that we will experiment with in this tutorial are the most common ones and are usually sufficient for most classification tasks. 
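As a rough sketch of the three weighting schemes, the snippet below encodes the same made-up documents with binary, count and TF-IDF weights. The documents are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration.
docs = [
    "knife knife knife attack reported",
    "reporter covers local news",
    "news reporter finds knife",
]

binary_cv = CountVectorizer(binary=True)  # 1 if present, 0 if absent
count_cv = CountVectorizer()              # raw term counts
tfidf = TfidfVectorizer()                 # counts reweighted by rarity across documents

X_binary = binary_cv.fit_transform(docs)
X_count = count_cv.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)

knife = count_cv.vocabulary_["knife"]
print(X_binary[0, binary_cv.vocabulary_["knife"]])
print(X_count[0, knife])
print(X_tfidf[0, tfidf.vocabulary_["knife"]])

# In the second document every word occurs once, so TF-IDF differences
# come purely from how many documents contain each word.
covers_weight = X_tfidf[1, tfidf.vocabulary_["covers"]]
news_weight = X_tfidf[1, tfidf.vocabulary_["news"]]
print(covers_weight, news_weight)
```

Here "covers" appears in only one document while "news" appears in two, so "covers" receives the higher TF-IDF weight even though both occur once in that document.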


One of the most important components in developing a supervised text classifier is the ability to evaluate it. We need to understand if the model has learned sufficiently based on the examples that it saw in order to make correct predictions.

If the performance is rather laughable, then we know that more work needs to be done. We may need to improve the features, add more data, tweak the model parameters, etc.

For this particular task, even though the HuffPost dataset lists one category per article, in reality, an article can actually belong to more than one category. For example, the article in Figure 4 could belong to COLLEGE (the primary category) or EDUCATION.

Figure 4: Example of article that can fit into multiple categories

If the classifier predicts EDUCATION as its first guess instead of COLLEGE, that doesn’t mean it’s wrong. As this is bound to happen to various other categories, instead of looking at the first predicted category, we will look at the top 3 categories predicted to compute (a) accuracy and (b) mean reciprocal rank (MRR)


Accuracy evaluates the fraction of correct predictions. In our case, it is the number of times the PRIMARY category appeared in the top 3 predicted categories divided by the total number of categorization tasks. 


Unlike accuracy, MRR takes the rank of the first correct answer into consideration (in our case rank of the correctly predicted PRIMARY category). The formula for MRR is as follows:

Figure 5: MRR formula

where Q here refers to all the classification tasks in our test set and rank_{i} is the position of the correctly predicted category. The closer the correctly predicted category is to position 1, the higher the MRR.

Since we are using the top 3 predictions, MRR will give us a sense of where the PRIMARY category sits in the ranks. If the rank of the PRIMARY category is on average 2, then the MRR would be ~0.5 and at 3, it would be ~0.33. We want to get the PRIMARY category higher up in the ranks.
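Both metrics can be sketched in a few lines of plain Python. MRR here is the average of 1 / rank_i over all tasks, with a contribution of 0 when the PRIMARY category is missing from the top 3; treating misses as 0 is an assumption of this sketch, as are the function names and the example labels.

```python
def reciprocal_rank(true_label, ranked_predictions):
    # 1 / position of the true label, or 0.0 if it is not in the list.
    for rank, label in enumerate(ranked_predictions, start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0

def top_k_accuracy(true_labels, ranked_lists):
    # Fraction of tasks where the true label appears anywhere in the top k.
    hits = sum(1 for t, preds in zip(true_labels, ranked_lists) if t in preds)
    return hits / len(true_labels)

def mean_reciprocal_rank(true_labels, ranked_lists):
    # Average reciprocal rank over all tasks.
    total = sum(reciprocal_rank(t, p) for t, p in zip(true_labels, ranked_lists))
    return total / len(true_labels)

# Hypothetical PRIMARY categories and top 3 predictions for four articles.
true_labels = ["COLLEGE", "POLITICS", "CRIME", "SCIENCE"]
top3 = [
    ["EDUCATION", "COLLEGE", "STYLE"],      # correct at rank 2
    ["POLITICS", "CRIME", "ENTERTAINMENT"],  # correct at rank 1
    ["POLITICS", "STYLE", "CRIME"],          # correct at rank 3
    ["STYLE", "ENTERTAINMENT", "POLITICS"],  # not in the top 3
]

accuracy = top_k_accuracy(true_labels, top3)
mrr = mean_reciprocal_rank(true_labels, top3)
print(accuracy, mrr)
```

Three of the four articles count as correct for accuracy, while MRR additionally rewards the second article for getting the PRIMARY category at rank 1.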

Building the classifier

Now it’s finally time to build the classifier! Note that we will be using the LogisticRegression module from sklearn.

Make Necessary Imports

Start with the imports.


Read dataset and create text field variations

Next, we will be creating different variations of the text we will use to train the classifier. This is to see how adding more content to each field, helps with the classification task. Notice that we create a field using only the description, description + headline, and description + headline + url (tokenized). 
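A minimal sketch of how the three text variations could be assembled for one record. The helper tokenize_url, the example record and the variable names are all hypothetical; only the field names headline, short_description and link come from the dataset description above.

```python
import re

def tokenize_url(url):
    # Drop the scheme and domain, then split the path on non-letter characters.
    path = re.sub(r"^https?://[^/]+/", "", url)
    tokens = re.split(r"[^a-zA-Z]+", path)
    return " ".join(t.lower() for t in tokens if len(t) > 1)

# Hypothetical record with the fields described earlier.
record = {
    "headline": "Cat Adopts Hat",
    "short_description": "A cat was seen wearing a striped hat.",
    "link": "https://www.example.com/entry/cat-adopts-hat",
}

# Three variations with progressively more text to learn from.
text_desc = record["short_description"]
text_desc_headline = text_desc + " " + record["headline"]
text_desc_headline_url = text_desc_headline + " " + tokenize_url(record["link"])

print(text_desc_headline_url)
```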


Split dataset for training and testing

Next, we will create a train / test split of our dataset, where 25% of the dataset will be used for testing based on our evaluation strategy and remaining will be used for training the classifier. 

Prepare features

Earlier, we talked about feature representation and different feature weighting schemes. The extract_features(...) function from above is where we extract the different types of features based on the weighting schemes.

First, note that cv.fit_transform(...) from the above code snippet creates a vocabulary based on the training set. Next, cv.transform(...) takes in any text (test or unseen texts) and transforms it according to the vocabulary of the training set, limiting the words by the specified count restrictions (min_df, max_df) and applying necessary stop words if specified. It returns a document-term matrix where each column represents a word in the vocabulary and each row represents a document in the dataset. The values could either be binary or counts. The same concept also applies to tfidf_vectorizer.fit_transform(...) and tfidf_vectorizer.transform(...).
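The fit_transform / transform distinction can be sketched on a tiny made-up example. The training and test texts are assumptions; the point is that words unseen during training are silently ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test texts.
train_texts = ["the cat in the hat", "green eggs and ham"]
test_texts = ["the cat likes green ham and cheese"]

cv = CountVectorizer()

# Learns the vocabulary from the training set and encodes it.
X_train = cv.fit_transform(train_texts)

# Re-uses the training vocabulary; "likes" and "cheese" are dropped.
X_test = cv.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on X_train make predictions on X_test.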

You can read my article on TfidfVectorizer and TfidfTransformer to learn how to use these libraries correctly. Read this article if you want more information on how to use CountVectorizer.

Train your Logistic Regression model

The code below shows how we start the training process. When you instantiate the LogisticRegression module, you can vary the solver, the penalty, the C value and also specify how it should handle the multi-class classification problem (one-vs-all or multinomial). In the scikit-learn version this tutorial was written against, a one-vs-all approach is used by default and that’s what we’re using below (newer releases may pick a multinomial formulation depending on the solver, so check the documentation for your version):

In the one-vs-all approach that we are using above, a binary classifier is fit for each of our 31 labels. Since we are selecting the top 3 categories predicted by the classifier (see below), we will leverage the estimated probabilities instead of the binary predictions. Behind the scenes, we are actually collecting the probability of each news category being positive.
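Here is a minimal sketch of ranking categories by estimated probability. The training texts, labels and the helper top_k_predictions are hypothetical, and the model uses the library defaults rather than the article’s exact settings, so the multi-class formulation may differ from one-vs-all depending on your scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data with three categories.
train_texts = [
    "senate passes new election bill",
    "president signs election law",
    "police arrest robbery suspect",
    "robbery suspect charged by police",
    "team wins championship game",
    "coach praises team after game",
]
train_labels = ["POLITICS", "POLITICS", "CRIME", "CRIME", "SPORTS", "SPORTS"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

def top_k_predictions(texts, k=3):
    # Estimated probability of every category for every text.
    probs = model.predict_proba(vectorizer.transform(texts))
    # Column indices sorted by descending probability, truncated to k.
    top = np.argsort(probs, axis=1)[:, ::-1][:, :k]
    return [[str(model.classes_[i]) for i in row] for row in top]

predictions = top_k_predictions(["police arrest suspect"], k=2)
print(predictions)
```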


Evaluate performance

In this section, we will look at the results for different variations of our model. First, we train a model using only the description of articles with binary feature weighting. 

Figure 6: Accuracy and MRR using description of text and binary feature weighting

You can see that the accuracy is 0.59 and MRR is 0.48.  This means that only about 59% of the PRIMARY categories are appearing within the top 3 predicted labels. The MRR also tells us that the rank of the PRIMARY category is between position 2 and 3. Let’s see if we can do better. Let’s try a different feature weighting scheme. 

Figure 7: Accuracy and MRR using description of text and tf-idf feature weighting

This second model uses tf-idf weighting instead of binary weighting, on the same description field. You can see that the accuracy is 0.63 and MRR is 0.51, a slight improvement. This is a good indicator that tf-idf weighting works better than binary weighting for this particular task.

How else can we improve our classifier? Remember, we are only using the description field and it is fairly sparse. What if we used the description, headline and tokenized URL, would this help? Let’s try it. 

Figure 8: Accuracy and MRR using all text fields and tf-idf feature weighting

Now look! As you can see in Figure 8, the accuracy is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news! 

In Figure 9, you will see how well the model performs on different feature weighting methods and use of text fields.

Figure 9: Experimentation with different combination of feature weighting and text fields

There are several observations that can be made from the results in Figure 9:

  1. tf-idf based weighting outperforms binary & count based schemes
  2. count based feature weighting is no better than binary weighting
  3. Sparsity has a lot to do with how poorly the model performs. The richer the text field, the better the overall performance of the classifier.  

Prediction on CNN articles

Now, the fun part! Let’s test it on articles from a different news source than HuffPost. Let’s eyeball how the classifier does on articles from CNN. We will predict the top 2 categories.

A crime related story [ see article ]

Predicted: politics, crime

Entertainment related story [ see article ]

Predicted: entertainment, style

Another entertainment related story [ see article ]

Predicted: entertainment, style

Exercise in space [ see article ]

Predicted: science, healthy living

Overall, not bad. The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the url tokens and description. 

Saving Logistic Regression Model 

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to save the transformer so that we can later encode / vectorize any unseen document. Next, we also need to save the trained model so that it can make predictions using the weight vectors. Here’s how you do it:

Saving SKLearn Model & Transformer

Loading Model & Transformer for Reuse
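Saving and reloading can be sketched with Python’s pickle module. The tiny training set, the file names and the use of a temporary directory are assumptions for this example; the article’s own snippet may use a different mechanism.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data, just enough to have something to save.
texts = ["police arrest suspect", "police charge suspect",
         "team wins game", "team loses game"]
labels = ["CRIME", "CRIME", "SPORTS", "SPORTS"]

transformer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(transformer.fit_transform(texts), labels)

unseen = ["police question suspect"]
original_prediction = model.predict(transformer.transform(unseen))[0]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    transformer_path = os.path.join(tmp_dir, "transformer.pkl")

    # Save both the fitted transformer and the trained model.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(transformer_path, "wb") as f:
        pickle.dump(transformer, f)

    # Load them back, as you would in a separate process later on.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(transformer_path, "rb") as f:
        loaded_transformer = pickle.load(f)

reloaded_prediction = loaded_model.predict(loaded_transformer.transform(unseen))[0]
print(original_prediction, reloaded_prediction)
```

Pickled models should only be loaded from sources you trust, and ideally with the same scikit-learn version that saved them.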

Over to you

Here’s the full source code with accompanying dataset for this tutorial. I hope this article has given you the confidence in implementing your very own high-accuracy text classifier.

Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task, may very well be inadequate for something like bug detection in source code.

An exercise for you:

Right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried, and how well it worked. Aim for a 90-95% accuracy and let us all know what worked! 


  • Curate additional features
  • Perform feature selection 
  • Tweak model parameters
  • Try balancing number of articles per category

See Also: How to extract keywords with TF-IDF?


Recommended Reading

Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why?

Pre-trained word embeddings are vector representations of words trained on a large dataset. With pre-trained embeddings, you will essentially be using the weights and vocabulary from the end result of a training process done by... someone else! (It could also be you.)

One benefit of using pre-trained embeddings is that you can hit the ground running without having to find a large text corpus, which you would then have to preprocess and train on with the appropriate settings. Another benefit is the savings in training time. Training on a large corpus could demand high computation power and long training times, which may not be something that you want to afford for quick experimentation.

If you want to avoid all of these logistics but still have access to good quality embeddings, you could use pre-trained word embeddings trained on a dataset that fits the domain you are working in. For example, if you are working with news articles, it may be perfectly fine to use embeddings trained on a Twitter dataset, as there is ongoing discussion about current issues as well as a constant stream of news-related Tweets.

Accessing pre-trained embeddings is extremely easy with Gensim, as it allows you to use pre-trained GloVe and Word2Vec embeddings with minimal effort. The code snippets below show you how. Here’s the working notebook for this tutorial.

Accessing pre-trained Twitter GloVe embeddings

Here, we are trying to access GloVe embeddings trained on a Twitter dataset. This first step downloads the pre-trained embeddings and loads them for re-use. These vectors are based on 2B tweets, 27B tokens, 1.2M vocab, uncased. The original source of the embeddings can be found here:

The 25 in the model name below refers to the dimensionality of the vectors.

Once you have loaded the pre-trained model, just use it as you would with any Gensim Word2Vec model. Here are a few examples:

This next example prints the word vectors for trump and obama. Notice that it prints only 25 values for each word. This is because our vector dimensionality is 25.

For vectors of other dimensionality, use the appropriate model names from here or reference the gensim-data GitHub repo:
  • glove-twitter-25 (104 MB)
  • glove-twitter-50 (199 MB)
  • glove-twitter-100 (387 MB)
  • glove-twitter-200 (758 MB)
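The loading and lookup steps described above can be sketched with Gensim’s downloader API. The wrapper functions are hypothetical helpers for this example; nothing is downloaded until you call demo(), which fetches the roughly 104 MB glove-twitter-25 model on first use.

```python
def load_pretrained(name="glove-twitter-25"):
    # Imported lazily so that defining this helper does not require gensim.
    import gensim.downloader as api
    # Downloads the model on first use, then loads it from the local cache.
    return api.load(name)

def dimensionality_from_name(name):
    # Hypothetical helper: the trailing number in these model names is the
    # vector dimensionality, e.g. "glove-twitter-25" -> 25.
    return int(name.rsplit("-", 1)[-1])

def demo(name="glove-twitter-25"):
    model = load_pretrained(name)
    # Words most similar to a query word.
    print(model.most_similar("obama", topn=3))
    # A word vector has as many values as the model's dimensionality.
    print(model["trump"].shape, dimensionality_from_name(name))
```

Swapping in any other model name from the list above changes only the download size and the vector dimensionality.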

Accessing pre-trained Wikipedia GloVe embeddings

The GloVe embeddings below were trained on an English Wikipedia dump and the English Gigaword 5th Edition dataset. Their dimensionality is 100 and they are based on 6B tokens (uncased). The original source of the embeddings can be found here:

Once you have loaded the pre-trained model, just use it as usual. Here is a similarity example:

For vectors of other dimensionality, you can use the appropriate model names from below or reference the gensim-data repository:
  • glove-wiki-gigaword-50 (65 MB)
  • glove-wiki-gigaword-100 (128 MB)
  • glove-wiki-gigaword-200 (252 MB)
  • glove-wiki-gigaword-300 (376 MB)

Accessing pre-trained Word2Vec embeddings

So far, you have looked at a few examples using GloVe embeddings. In the same way, you can also load pre-trained Word2Vec embeddings. Here are some of your options for Word2Vec:
  • word2vec-google-news-300 (1662 MB) (dimensionality: 300)
  • word2vec-ruscorpora-300 (198 MB) (dimensionality: 300)
Be warned that the Google News embeddings are sizable, so ensure that you have sufficient disk space before using them.

What can you use pre-trained word embeddings for?

You can use pre-trained word embeddings for a variety of tasks including:
  • Finding word or phrase similarities
  • As feature weights for text classification
  • For creating an embedding layer for neural network based text classification
  • For machine translation
  • Query expansion for search enhancements
  • To create sentence embeddings through vector averaging
The possibilities are actually endless, but you may not always get better results than just a bag-of-words approach. For example, I’ve tried sentence embeddings for a search reranking task and the rankings actually deteriorated. The only way to know if it helps, is to try it and see if it improves your evaluation metrics!

Example of using GloVe embeddings to rank phrases by similarity

Here is an example of using the glove-twitter-25 GloVe embeddings to find the phrases that are most similar to a query phrase. Let’s say we have the following phrases and a corresponding query phrase with several misspellings (a missing ‘r’ in barack and ‘a’ instead of ‘e’ in hussein). Given the query phrase, the goal is to rank all other phrases by semantic similarity (using the GloVe Twitter embeddings) and compare that with surface-level similarity using the Jaccard similarity index. Jaccard has no notion of semantics, so it sees a token as is.

The code above splits each candidate phrase as well as the query into a set of tokens (words). n_similarity(tokens_1, tokens_2) takes the average of the word vectors for the query (tokens_2) and the phrase (tokens_1) and computes the cosine similarity between the resulting averaged vectors. The results are then sorted in descending order of cosine similarity. This method of vector averaging assumes that the words within tokens_1 share a common concept which is amplified through word averaging. The same is the case for tokens_2. As pointed out by Radim (creator of Gensim), this crude method works surprisingly well.

Below, you will see the ranking of phrases using the word embeddings method vs. the surface similarity method with Jaccard. Notice that even with misspellings, we are able to produce a decent ranking of the most similar phrases using the GloVe vectors. This is because misspellings are common in tweets. If a misspelled word is present in the vocabulary, then it will have a corresponding weight vector. In comparison, Jaccard similarity does slightly worse (visually speaking), as all it knows are the tokens given to it: it is ignorant of misspellings and has no notion of semantics.

I hope this article and accompanying notebook will give you a quick start in using pre-trained word embeddings. Can you think of any other use cases for how you would use these embeddings?
Leave a comment below with your ideas!
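
To make the comparison concrete, here is a minimal pure-Python sketch of the two scoring ideas: cosine similarity over averaged word vectors, and Jaccard similarity over token sets. The three-dimensional vectors, the phrases and the function names are all made up for illustration; they stand in for the real glove-twitter-25 vectors and are not Gensim's implementation.

```python
import math

# Hypothetical 3-d vectors standing in for real GloVe embeddings.
toy_vectors = {
    "barack": [0.9, 0.1, 0.0],
    "barak": [0.85, 0.15, 0.05],  # a misspelling that still has its own vector
    "obama": [0.8, 0.2, 0.1],
    "president": [0.7, 0.3, 0.2],
    "pizza": [0.0, 0.1, 0.9],
    "recipe": [0.1, 0.0, 0.8],
}


def mean_vector(tokens):
    """Average the vectors of all tokens that are in the vocabulary."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def averaged_similarity(tokens_1, tokens_2):
    """Cosine similarity between the averaged vectors of two token lists."""
    return cosine(mean_vector(tokens_1), mean_vector(tokens_2))


def jaccard(tokens_1, tokens_2):
    """Size of the token intersection divided by the size of the union."""
    a, b = set(tokens_1), set(tokens_2)
    return len(a & b) / len(a | b)


query = "barak obama".split()  # misspelled query
phrases = ["pizza recipe", "president obama", "barack obama"]

ranked = sorted(
    phrases,
    key=lambda p: averaged_similarity(p.split(), query),
    reverse=True,
)
print(ranked)
print({p: round(jaccard(p.split(), query), 2) for p in phrases})
```

With these toy vectors the averaged-vector score separates "barack obama" from "president obama", while Jaccard gives both the same score because each shares exactly one token with the query.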

See Also: A step-by-step Word2Vec tutorial with Gensim


How to Use TfidfTransformer & TfidfVectorizer?

Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to produce a matrix of TF-IDF features for a collection of documents. TfidfVectorizer starts from the raw text, while TfidfTransformer starts from a matrix of word counts. The differences between the two can be quite confusing and it’s hard to know when to use which. This article shows you how to correctly use each one, the differences between the two and some guidelines on what to use when.

TfidfTransformer Usage

1. Dataset and Imports

Below we have 5 toy documents, all about my cat and my mouse who live happily together in my house. We are going to use this toy dataset to compute the tf-idf scores of words in these documents.

We also import the necessary modules here which include TfidfTransformer and CountVectorizer.

2. Initialize CountVectorizer

In order to start using TfidfTransformer, you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, and so on. The code below does just that.

Now, let’s check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words, minus single character words):
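
A minimal sketch of steps 1 and 2, assuming scikit-learn is installed. Only the first document is quoted in this article; the other four are illustrative stand-ins, chosen so that the vocabulary comes to 16 words and ‘the’ and ‘mouse’ appear in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# First document is quoted in the article; the remaining four are assumptions.
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Default settings: lowercasing on, no stop word list, no vocabulary limit.
cv = CountVectorizer()

# Learn the vocabulary and build the sparse document-term count matrix.
word_count_vector = cv.fit_transform(docs)

print(word_count_vector.shape)  # (5, 16)
```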

Sweet, this is what we want! Now it’s time to compute the IDFs. Note that in this example, we are using all the defaults with CountVectorizer. You can actually specify a custom stop word list, enforce minimum word count, etc. See this article on how to use CountVectorizer.

3. Compute the IDF values

Now we are going to compute the IDF values by calling on the word counts we computed earlier.

To get a glimpse of how the IDF values look, we are going to print them by placing the IDF values in a pandas DataFrame. The values will be sorted in ascending order.

Resulting IDF values

Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.

Important Note: In practice, your IDF should be based on a large corpus of text.

4. Compute the TFIDF score for your documents

Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.

The first line above gets the word counts for the documents in sparse matrix form. We could have actually used word_count_vector from above. However, in practice, you may be computing tf-idf scores on a set of new, unseen documents. When you do that, you will first have to call cv.transform(your_new_docs) to generate the matrix of word counts.

Then, by invoking tfidf_transformer.transform(count_vector) you will finally be computing the tf-idf scores for your docs. Internally this is computing the tf * idf  multiplication where your term frequency is weighted by its IDF values.

Now, let’s print the tf-idf values of the first document to see if it makes sense. What we are doing below is, placing the tf-idf scores from the first document into a pandas data frame and sorting it in descending order of scores.

Tf-idf scores of first document:

tf-idf values using TfidfTransformer

Notice that only certain words have scores. This is because our first document is “the house had a tiny little mouse”: all the words in this document have a tf-idf score and everything else shows up as zero. Notice that the word “a” is missing from this list. This is due to CountVectorizer’s default tokenization (its token_pattern parameter), which only keeps tokens of two or more characters, so single-character words are dropped.

The scores above make sense. The more common the word across documents, the lower its score, and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher the score. So it’s working as expected, apart from the a that was dropped during tokenization.

TfidfVectorizer Usage

Now, we are going to use the same 5 documents from above to do the same thing as we did for TfidfTransformer, which is to get the tf-idf scores of a set of documents. But notice how this is much shorter.

With TfidfVectorizer you compute the word counts, IDF values and tf-idf values all at once. It’s really simple.

Now let’s print the tf-idf values for the first document from our collection. Notice that these values are identical to the ones from TfidfTransformer; the only difference is that it’s done in just two steps.

tf-idf values using TfidfVectorizer

Here’s another way to do it by calling fit and transform separately and you’ll end up with the same results.
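
Both routes can be sketched as follows, again with stand-in documents (only the first is quoted in this article). The final check compares the result against the CountVectorizer plus TfidfTransformer route to show that, with default settings, they agree.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# First document is quoted in the article; the remaining four are assumptions.
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Route 1: counts, IDF and tf-idf in a single call.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
one_call = tfidf_vectorizer.fit_transform(docs)

# Route 2: fit and transform called separately.
tfidf_vectorizer_2 = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_2.fit(docs)
two_calls = tfidf_vectorizer_2.transform(docs)

# Reference: the longer CountVectorizer + TfidfTransformer route.
counts = CountVectorizer().fit_transform(docs)
via_transformer = TfidfTransformer().fit_transform(counts)

print(np.allclose(one_call.toarray(), two_calls.toarray()))
print(np.allclose(one_call.toarray(), via_transformer.toarray()))
```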

TfidfTransformer vs. TfidfVectorizer

In summary, the main differences between the two modules are as follows:

With TfidfTransformer you will systematically compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

With TfidfVectorizer, on the contrary, you will do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.

When to use what?

So now you may be wondering why you should use more steps than necessary if you can get everything done in two steps. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:

  • If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
  • If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
  • If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.

See Also: How to extract keywords with TF-IDF?


Recommended Reading


All you need to know about Text Preprocessing for Machine Learning & NLP

Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realize that they were not  preprocessing their text or were using the wrong kind of text preprocessing for their project. With that in mind, I thought of shedding some light around what text preprocessing really is, the different techniques of text preprocessing and a way to estimate how much preprocessing you may need. For those interested, I’ve also made some text preprocessing code snippets in python for you to try. Now, let’s get started!

What is text preprocessing?

To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with tfidf (approach) from Tweets (domain) is an example of a Task.
Task = approach + domain
One task’s ideal preprocessing can become another task’s worst nightmare. So take note: text preprocessing is not directly transferable from task to task. Let’s take a very simple example. Say you are trying to discover commonly used words in a news dataset. If your pre-processing step involves removing stop words because some other task used it, then you are probably going to miss out on some of the common words as you have ALREADY eliminated them. So really, it’s not a one-size-fits-all approach.

Types of text preprocessing techniques

There are different ways to preprocess your text. Here are some of the approaches that you should know about and I will try to highlight the importance of each.


Lowercasing

Lowercasing ALL your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, can help in cases where your dataset is not very large, and significantly helps with consistency of expected output.

Quite recently, one of my blog readers trained a word embedding model for similarity lookups. He found that different variations in input capitalization (e.g. ‘Canada’ vs. ‘canada’) gave him different types of output or no output at all. This was probably happening because the dataset had mixed-case occurrences of the word ‘Canada’ and there was insufficient evidence for the neural network to effectively learn the weights for the less common version. This type of issue is bound to happen when your dataset is fairly small, and lowercasing is a great way to deal with sparsity issues.

Here is an example of how lowercasing solves the sparsity issue, where the same words with different cases map to the same lowercase form:

Another example where lowercasing is very useful is search. Imagine you are looking for documents containing “usa”. However, no results show up because “usa” was indexed as “USA”. Now, who should we blame? The U.I. designer who set up the interface or the engineer who set up the search index?

While lowercasing should be standard practice, I’ve also had situations where preserving the capitalization was important. For example, in predicting the programming language of a source code file. The word System in Java is quite different from system in python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable for all tasks.
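
The sparsity effect of mixed casing can be sketched in a few lines. The token list is made up for illustration: seven surface forms collapse to two entries once lowercased, so each entry has more evidence behind it.

```python
from collections import Counter

# Illustrative tokens: the same two words written with different capitalization.
tokens = ["Canada", "CANADA", "canada", "CanadA", "Tomcat", "TOMCAT", "tomcat"]

raw_counts = Counter(tokens)
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts))     # 7 distinct forms before lowercasing
print(dict(lower_counts))  # {'canada': 4, 'tomcat': 3}
```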


Stemming

Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word. Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming words into their root form. So the words “trouble”, “troubled” and “troubles” might actually be converted to troubl instead of trouble because the ends were just chopped off (ughh, how crude!).

There are different algorithms for stemming. The most common algorithm, which is also known to be empirically effective for English, is Porter’s algorithm. Here is an example of stemming in action with Porter Stemmer:

Stemming is useful for dealing with sparsity issues as well as standardizing vocabulary. I’ve had success with stemming in search applications in particular. The idea is that if, say, you search for “deep learning classes”, you also want to surface documents that mention “deep learning class” as well as “deep learn classes”, although the latter doesn’t sound right. But you get where we are going with this. You want to match all variations of a word to bring up the most relevant documents.

In most of my previous text classification work, however, stemming only marginally helped improve classification accuracy, as opposed to using better engineered features and text enrichment approaches such as word embeddings.
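
A minimal sketch of Porter stemming, assuming NLTK is installed (its PorterStemmer needs no extra data downloads). The word list is the one used in the text above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The inflected forms mentioned in the text.
words = ["trouble", "troubled", "troubles"]
stems = {word: stemmer.stem(word) for word in words}

# Every form maps to the same truncated stem, which is not a dictionary word.
print(stems)
```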


Lemmatization

Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that lemmatization tries to do it the proper way. It doesn’t just chop things off; it actually transforms words to the actual root. For example, the word “better” would map to “good”. It may use a dictionary such as WordNet for mappings or some special rule-based approaches. Here is an example of lemmatization in action using a WordNet-based approach:

In my experience, lemmatization provides no significant benefit over stemming for search and text classification purposes. In fact, depending on the algorithm you choose, it could be much slower compared to using a very basic stemmer, and you may have to know the part-of-speech of the word in question in order to get a correct lemma. This paper finds that lemmatization has no significant impact on accuracy for text classification with neural architectures.

I would personally use lemmatization sparingly. The additional overhead may or may not be worth it. But you could always try it to see the impact it has on your performance metric.

Stop-word removal

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are” and so on. The intuition behind using stop words is that, by removing low information words from text, we can focus on the important words instead. For example, in the context of a search system, if your search query is “what is text preprocessing?”, you want the search system to focus on surfacing documents that talk about text preprocessing over documents that talk about what is. This can be done by preventing all words from your stop word list from being analyzed.

Stop words are commonly applied in search systems, text classification applications, topic modeling, topic extraction and others. In my experience, stop word removal, while effective in search and topic extraction systems, proved to be non-critical in classification systems. However, it does help reduce the number of features in consideration, which helps keep your models decently sized.

Here is an example of stop word removal in action. All stop words are replaced with a dummy character, W:

Stop word lists can come from pre-established sets or you can create a custom one for your domain. Some libraries (e.g. sklearn, via the max_df parameter of its vectorizers) allow you to remove words that appear in more than X% of your documents, which can also give you a stop word removal effect.
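
Masking stop words with a dummy character can be sketched as below. The stop word list is a tiny illustrative one and the sentence is made up; a real list would come from a library or from your own domain.

```python
# A tiny illustrative stop word list, not a complete one.
STOP_WORDS = {"a", "the", "is", "are", "this", "and", "what", "of"}


def mask_stop_words(text, placeholder="W"):
    """Replace every stop word in a whitespace-tokenized text with a placeholder."""
    return " ".join(
        placeholder if token.lower() in STOP_WORDS else token
        for token in text.split()
    )


sentence = "this is a text full of content and we need to clean it up"
masked = mask_stop_words(sentence)
print(masked)  # W W W text full W content W we need to clean it up
```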


Normalization

A highly overlooked preprocessing step is text normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping of near identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

Text normalization is important for noisy texts such as social media comments, text messages and comments to blog posts, where abbreviations, misspellings and use of out-of-vocabulary (OOV) words are prevalent. This paper showed that by using a text normalization strategy for Tweets, they were able to improve sentiment classification accuracy by ~4%.

Here’s an example of words before and after normalization:

Notice how the variations map to the same canonical form.

In my experience, text normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. I’ve also found it useful for topic extraction where near synonyms and spelling differences are common (e.g. topic modelling, topic modeling, topic-modeling, topic-modelling).

Unfortunately, unlike stemming and lemmatization, there isn’t a standard way to normalize texts. It typically depends on the task. For example, the way you would normalize clinical texts would arguably be different from how you normalize SMS text messages.

Some common approaches to text normalization include dictionary mappings (easiest), statistical machine translation (SMT) and spelling-correction based approaches. This interesting article compares the use of a dictionary based approach and an SMT approach for normalizing text messages. Interestingly, I’m also seeing more and more papers related to text normalization in the research world.
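
The dictionary-mapping approach, the easiest of the three, might be sketched like this. The mapping only contains the examples mentioned above, and the helper name normalize is hypothetical; a real mapping would be built for your task.

```python
import re

# Variant -> canonical form, using only the examples from the text.
NORMALIZATION_MAP = {
    "gooood": "good",
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}

# Longest variants first, so multi-word entries are matched before shorter ones.
_variants = sorted(NORMALIZATION_MAP, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in _variants) + r")\b")


def normalize(text):
    """Lowercase the text and replace known variants with their canonical form."""
    return _pattern.sub(lambda m: NORMALIZATION_MAP[m.group(1)], text.lower())


print(normalize("That movie was gooood"))             # that movie was good
print(normalize("Remove stop-words and stop words"))  # remove stopwords and stopwords
```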

Noise Removal

Noise removal is about removing characters, digits and pieces of text that can interfere with your text analysis. Noise removal is one of the most essential text preprocessing steps. It is also highly domain dependent. For example, in Tweets, noise could be all special characters except hashtags, as they signify concepts that can characterize a Tweet.

The problem with noise is that it can produce results that are inconsistent in your downstream tasks. Let’s take the example below:

Notice that all the raw words above have some surrounding noise in them. If you stem these words, you can see that the stemmed result does not look very pretty. None of them have a correct stem. However, with some cleaning as applied in this notebook, the results now look much better:

Noise removal is one of the first things you should be looking into when it comes to Text Mining and NLP. There are various ways to remove noise. This includes punctuation removal, special character removal, numbers removal, HTML formatting removal, domain specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal and more. It all depends on which domain you are working in and what entails noise for your task. The code snippet in my notebook shows how to do some basic noise removal.
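
As a rough sketch of basic noise removal (not the code from the notebook mentioned above), two regular expressions are enough to strip HTML-style tags and then drop everything except letters, hashtags and whitespace. The noisy tokens are made up for illustration.

```python
import re


def remove_noise(token):
    """Strip HTML-style tags, then drop anything that is not a letter, '#' or space."""
    token = re.sub(r"<[^>]+>", "", token)       # remove complete tags such as <a> or </a>
    token = re.sub(r"[^a-zA-Z#\s]", "", token)  # remove digits, punctuation, stray symbols
    return token.strip()


# Illustrative noisy tokens.
raw_words = ["..trouble..", "trouble<", "trouble!", "<a>trouble</a>", "1.trouble", "#nlp!!"]
cleaned = [remove_noise(word) for word in raw_words]
print(cleaned)  # ['trouble', 'trouble', 'trouble', 'trouble', 'trouble', '#nlp']
```

Keeping the ‘#’ character reflects the Tweet example above; for another domain you would adjust the character class to whatever counts as signal there.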

Text Enrichment / Augmentation

Text enrichment involves augmenting your original text data with information that you did not previously have. Text enrichment provides more semantics to your original text, thereby improving its predictive power and the depth of analysis you can perform on your data.

In an information retrieval example, expanding a user’s query to improve the matching of keywords is a form of augmentation. A query like text mining could become text document mining analysis. While this doesn’t make sense to a human, it can help fetch documents that are more relevant.

You can get really creative with how you enrich your text. You can use part-of-speech tagging to get more granular information about the words in your text. For example, in a document classification problem, the appearance of the word book as a noun could result in a different classification than book as a verb, as one is used in the context of reading and the other is used in the context of reserving something. This article talks about how Chinese text classification is improved with a combination of nouns and verbs as input features.

With the availability of large amounts of text, however, people have started using embeddings to enrich the meaning of words, phrases and sentences for classification, search, summarization and text generation in general. This is especially true in deep learning based NLP approaches where a word level embedding layer is quite common. You can either start with pre-established embeddings or create your own and use them in downstream tasks.

Other ways to enrich your text data include phrase extraction, where you recognize compound words as one (aka chunking), expansion with synonyms and dependency parsing.

Do you need all the text preprocessing types?

Not really, but you do have to do some of it for sure if you want good, consistent results. To give you an idea of what the bare minimum should be, I’ve broken it down to Must Do, Should Do and Task Dependent. Everything that falls under task dependent can be quantitatively or qualitatively tested before deciding you actually need it. Remember, less is more and you want to keep your approach as elegant as possible. The more overhead you add, the more layers  you will have to peel back when you run into issues.

Must Do:

  • Noise removal
  • Lowercasing (can be task dependent in some cases)

Should Do:

  • Simple normalization – (e.g. standardize near identical words)

Task Dependent:

  1. Advanced normalization (e.g. addressing out-of-vocabulary words)
  2. Stop-word removal
  3. Stemming / lemmatization
  4. Text enrichment / augmentation
So, for any task, the minimum you should do is try to lowercase your text and remove noise. What entails noise depends on your domain (see section on Noise Removal). You can also do some basic normalization steps for more consistency and then systematically add other layers as you see fit.

General Rule of Thumb

Not all tasks need the same level of preprocessing. For some tasks, you can get away with the minimum. However, for others, the dataset is so noisy that, if you don’t preprocess enough, it’s going to be garbage-in-garbage-out.

Here’s a general rule of thumb. This will not always hold true, but works for most cases. If you have a lot of well written texts to work with in a fairly general domain, then preprocessing is not extremely critical; you can get away with the bare minimum (e.g. training a word embedding model using all of Wikipedia texts or Reuters news articles). However, if you are working in a very narrow domain (e.g. Tweets about health foods) and data is sparse and noisy, you could benefit from more preprocessing layers, although each layer you add (e.g. stop word removal, stemming, normalization) needs to be quantitatively or qualitatively verified as a meaningful layer.

Here’s a table that summarizes how much preprocessing you should be performing on your text data:

I hope the ideas here would steer you towards the right preprocessing steps for your projects. Remember, less is more. A friend of mine once mentioned to me how he made a large e-commerce search system more efficient and less buggy just by throwing out layers of unneeded preprocessing.



What is Inverse Document Frequency (IDF)?

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes.

For example, the word the appears in almost all English texts and would thus have a very low IDF score as it carries very little “topic” information. In contrast, if you take the word coffee, while it is common, it’s not used as widely as the word the. Thus, coffee would have a higher IDF score than the. Traditionally IDF is computed as:

$$\mathrm{IDF}_t = \log \frac{N}{DF_t}$$

where N is the total number of documents in your text collection, DF_t is the number of documents containing the term t, and t is any word in your vocabulary.

IDF is typically used to boost the scores of words that are unique to a document with the hope that you surface high information words that characterize your document and suppress words that don’t carry much weight in a document.

Let’s take an example. In a given document, if the word the appeared 10 times and its IDF weight is 0.1, its resulting score would be 1 (since 10*0.1=1). Now if the word coffee also appeared 10 times and its IDF weight is 0.5 the resulting score would be 5. When you rank the words by the resulting scores (in descending order of course!), coffee would appear before the, indicating that coffee is more important than the word the.
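
The same idea can be sketched in code, assuming the common form log(N / DF_t) with a natural logarithm. The toy collection and the helper name idf are made up for illustration, so the actual weights differ from the 0.1 and 0.5 used above, but the ranking works the same way.

```python
import math

# Toy collection: each document is represented by its set of words.
documents = [
    {"the", "coffee", "was", "hot"},
    {"the", "market", "fell"},
    {"the", "cat", "sat"},
    {"the", "end"},
]


def idf(term, docs):
    """log(N / DF_t); assumes the term occurs in at least one document."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)


# Both words appear 10 times in some document of interest.
term_counts = {"the": 10, "coffee": 10}
scores = {term: count * idf(term, documents) for term, count in term_counts.items()}

# Rank words by score, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)  # ['coffee', 'the']
```

Because the appears in every document here, its IDF is log(4/4) = 0, which is the extreme case of a word carrying no topic information.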

In summary, IDF is a useful little formula that you can use to create a stop-word list, for feature weighting in text classifiers, for keyword extraction and more.



What is Term-Frequency?

Term Frequency (TF)

Term frequency (TF), often used in Text Mining, NLP and Information Retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since every document is different in length, it is possible that a term would appear more often in longer documents than in shorter ones. Thus, term frequency is often divided by the total number of terms in the document as a way of normalization.

There are other ways to normalize term frequencies including using the maximum term frequency in a document as well as average term frequency.
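
Here is a small sketch of raw term frequency next to two of the normalizations mentioned above (by document length and by the maximum term frequency). The sentence and the helper name term_frequencies are made up for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Raw counts plus length-normalized and max-normalized term frequencies."""
    counts = Counter(tokens)
    total_terms = len(tokens)
    max_count = max(counts.values())
    return {
        term: {
            "raw": count,
            "by_length": count / total_terms,  # divide by number of terms in the document
            "by_max": count / max_count,       # divide by the most frequent term's count
        }
        for term, count in counts.items()
    }


tokens = "the dow fell as the market slid and the dow closed lower".split()
tf = term_frequencies(tokens)
print(tf["the"])  # {'raw': 3, 'by_length': 0.25, 'by_max': 1.0}
print(tf["dow"]["raw"])  # 2
```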

Term Frequency in Practice

Term frequencies are often used to characterize documents. In theory, the more frequently a term appears in a document, the more the term characterizes your document. However, there is a limitation to this assumption. Let’s take the following news article about the Dow.

The top occurring terms are the ones that appear in large fonts below. Notice that common words such as ‘the’ and ‘and’ with low information tend to dominate the counts. This is inevitable, since in every spoken language, you will inherently have determiners, connectors and conjunctions to make sentences flow.

Generated using word-cloud library

There are two ways you can improve the ranking of these words such that topic words appear more prominently. The first approach is to eliminate all stop words (common words) such as ‘the’, ‘is’, ‘are’ and so on before computing the term frequencies. Here is an example with some of the stop words removed where the larger fonts indicate high term frequencies:

Word cloud with stop-words removed

Notice that now it becomes much clearer that the document in question actually talks about economic recession. Another way to suppress common words and surface topic words is to multiply the term frequencies with what’s called Inverse Document Frequencies (IDF). IDF is a weight indicating how widely a word is used. The more frequent its usage across documents, the lower its score. For example, the word the would appear in almost all English texts and thus would have a very low inverse document frequency. Multiplying term frequencies with the IDFs dampens the frequencies of highly occurring words and improves the prominence of important topic words and this is the basis of the commonly talked about TF-IDF weighting.


What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams), then the n-grams would be:
  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram. If N=3, the n-grams would be:
  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3 this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so on.

How many N-grams in a sentence?

If X is the number of words in a given sentence K, the number of n-grams for sentence K would be:

$$\text{Ngrams}_K = X - (N - 1)$$

What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization. Here is a publicly available web scale n-gram model by Microsoft:

Here is a paper that uses Web N-gram models for text summarization: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions

Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But please be warned that, from my personal experience and various research papers that I have reviewed, the use of bigrams and trigrams in your feature space may not necessarily yield any significant improvement. The only way to know this is to try it!

Java for N-gram Generation

This code block generates n-grams at a sentence level. The input consists of N (the size of n-gram), sent the sentence and ngramList a place to store the n-grams generated.

Python code for N-gram Generation

Similar to the example above, the code below generates n-grams in python.
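
A minimal sketch of sentence-level n-gram generation in Python. The function name generate_ngrams and the pad flag are illustrative choices; with pad=True the sentence is wrapped in _start_ and _end_ tokens before the window slides over it.

```python
def generate_ngrams(sentence, n, pad=False):
    """Return the word n-grams of a sentence, moving one word forward each time."""
    tokens = sentence.lower().split()
    if pad:
        tokens = ["_start_"] + tokens + ["_end_"]
    # A sentence with X tokens yields X - (n - 1) n-grams (none if n > X).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The cow jumps over the moon"
bigrams = generate_ngrams(sentence, 2)
trigrams = generate_ngrams(sentence, 3)
padded_bigrams = generate_ngrams(sentence, 2, pad=True)

print(bigrams)   # ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
print(trigrams)  # ['the cow jumps', 'cow jumps over', 'jumps over the', 'over the moon']
print(padded_bigrams[0], "|", padded_bigrams[-1])  # _start_ the | moon _end_
```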

Example Output

Here is an example of n-grams generated using the python code above run from a Jupyter notebook.

The start and end tokens are added to maximize the use of the n-grams. Some phrases tend to occur only at the end and some tend to occur at the very beginning. The _start_ and _end_ tokens help capture this pattern.

If you’re using Python, here’s another way to do it using NLTK:
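
A short sketch of the NLTK route, assuming NLTK is installed. nltk.util.ngrams works on any token sequence and yields tuples, so whitespace splitting is used here to avoid needing a tokenizer download.

```python
from nltk.util import ngrams

tokens = "the cow jumps over the moon".split()

# ngrams() returns a generator of tuples; list() materializes it.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams[0])     # ('the', 'cow')
print(len(bigrams))   # 5
print(len(trigrams))  # 4
```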

Industrial Strength Natural Language Processing

Having spent a big part of my career as a graduate student researcher and now a Data Scientist in the industry, I have come to realize that a vast majority of solutions proposed both in academic research papers and in the workplace are just not meant to ship: they just don’t scale! And when I say scale, I mean handling real world use cases, the ability to handle large amounts of data and ease of deployment in a production environment. Some of these approaches either work on extremely narrow use cases or have a tough time generating results in a timely manner. In some cases, the model takes days to train even though the problem could be as simple as finding similar documents from a set of 50,000 documents.

More often than not, the problem lies in the approach that was used. Remember, there will always be more than one way to solve an NLP or Data Science problem, and optimizing your choices will increase your chance of success in deploying your models to production. Over the past decade, having shipped solutions that serve real users, I now follow a set of best practices that maximizes my chance of success every time I start a new project. I swear by these principles and I hope they come in handy for you as well.

1. KISS please!

KISS (Keep it simple, stupid). When it comes to choice of techniques for solving NLP problems, this seems like common sense, but I can’t say this enough: choose techniques and pipelines that are easy to understand and maintain instead of complex ones that only you understand, sometimes only partially. In a lot of NLP applications, you would typically notice one of two things: (1) Deep pre-processing layers or  (2) Complex neural network architectures that are just hard to grasp, let alone train, maintain and improve on iteratively.

The first question to ask yourself is whether you need all the layers of pre-processing. Do you really need part-of-speech tagging, chunking, entity resolution, lemmatization, etc.? What if you strip out a few layers? How does this affect the performance of your models? With access to massive amounts of data, in a lot of applications you can actually let the evidence in data guide your model. Think Word2Vec. The success of Word2Vec is in its simplicity. You use large amounts of data to draw meaning using the data itself. Layers? What layers?

When it comes to Deep Learning, use it wisely. Not all problems benefit from Deep Learning and for the problems that do, use the architectures that are easy to understand and improve on. For example, for a programming language classification task, I just used a two-layer Artificial Neural Network and realized big wins in terms of training speed and accuracy. In addition, adding a new programming language is pretty seamless as long as you have data to feed into the model. I could have complicated the model to gain some social currency by using a really complex RNN architecture straight from a research paper. But I ended up starting simple just to see how far this would get me, and now I’m at the point where I can say, what’s the need to add more complexity?

2. When in doubt, use a time-tested approach

With every NLP or text mining problem, your options are plenty. There will always be more than one way to accomplish the same task. For example, in finding similar documents, you could use a simple bag-of-words approach and compute document similarities using the resulting tf-idf vectors. Alternatively, you could do something fancier by generating embeddings of each document and computing similarities using the document embeddings. Which should you use? It actually depends on several things:

a. Which of these methods has seen a higher chance of success in practice? (Hint: We see tf-idf being used all the time for information retrieval and it’s super fast. How about the latter?)

b. Which of these do I understand better? Remember the more you understand something, the better your chance of tuning it and getting it to work the way you expect it to.

c. Do I have the necessary tools/data to implement either of these?

Some of these questions can be easily answered with some literature search. But you could also reach out to experts such as University Professors or other Data Scientists who have worked on similar problems to give you a recommendation. Occasionally, I run my ideas by my peers who are in the same field to make sure I am thinking about problems and potential solutions correctly, before diving right in. As you get more and more projects under your belt, the intuition factor kicks in and you would just have a very strong sense about what’s going to work and what’s not.

3. Understand your end-point extremely well

My work on topics for GitHub initially started off as topics for the purpose of repository recommendations. Those topics would never have been exposed to the user and were only intended to be used internally to compute repo-to-repo similarity. During development, people got really excited and suggested that these should be exposed to users directly. My immediate response was “Heck, no!” But people wondered, why not?

Very simple: that was not the intended use of those topics. The level of noise tolerance for something you would use only internally is much higher than for what you show to users as suggestions, externally. So in the case of topics, I actually spent three additional months improving the work so that it could be exposed to users. I can’t say this enough: you need to know what your end goal is so that you are actually working towards a solution that addresses the problem. Fuzziness in the end goal you are trying to achieve would result in either a complete redo or months of extra work tuning and tweaking your models to do the right thing.

4. Pay attention to your data quality

Garbage in, garbage out is true in every sense of the word when it comes to Machine Learning and NLP. If you are trying to make predictions of sentiment classes (positive vs. negative) and your positive examples contain a large number of negative comments and vice versa, your classifier is going to be confused. Imagine if I told you 1+2=3 and the next time I tell you 1+2=4 and the next time I tell you again 1+2=3. Ugh, wouldn’t you be so confused? It’s the same for your classifier.

Also, if you have 90% positive examples and 10% negative ones, how well do you think your classifier is going to perform on negative comments? It’s probably going to say every comment is a positive comment. Class imbalance and lack of diversity in your data can be a real problem. The more diverse your training data, the better your model will generalize. This was very evident in one of my projects on clinical text segmentation (table iv), where the results clearly improved when we consciously forced variety in the training examples.
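To see why the 90/10 split is a problem, here is a small sketch with made-up labels and a stand-in "classifier" that always predicts the majority class. Nothing here comes from a real project; it only illustrates how a high accuracy can hide a useless model.

```python
from collections import Counter

# Hypothetical labels: 90 positive and 10 negative comments.
y_true = ["pos"] * 90 + ["neg"] * 10

# A "classifier" that always predicts the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)


def accuracy(truth, pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def recall(truth, pred, label):
    """Of the examples that truly have `label`, the fraction predicted as such."""
    relevant = [p for t, p in zip(truth, pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


overall_accuracy = accuracy(y_true, y_pred)        # 0.9
negative_recall = recall(y_true, y_pred, "neg")    # 0.0
```

The accuracy looks respectable at 0.9, yet not a single negative comment is found. Looking at per-class recall instead of overall accuracy makes that failure visible.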

While over-preprocessing your data may be unnecessary, under-preprocessing it may also be detrimental. Let’s take Tweets for example. Tweets are highly noisy. You may have out-of-vocabulary words like looooooove and abbreviations like lgtm. To make sense of any of this, you would probably need to bring these back to their normal form first. Without that, you would fall right into the trap of garbage-in-garbage-out, especially if you are dealing with a fairly small dataset.
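As a rough idea of what bringing Tweets back to their normal form can look like, here is a minimal sketch. The abbreviation table and the rule for squeezing repeated characters are assumptions made for this example, not a complete normalizer.

```python
import re

# Hypothetical lookup table; a real project would need a much larger one.
ABBREVIATIONS = {
    "lgtm": "looks good to me",
    "idk": "i do not know",
}

# Three or more repeats of the same character, e.g. the "ooooooo" in "looooooove".
ELONGATION = re.compile(r"(.)\1{2,}")


def normalize_tweet(text):
    """Lowercase, squeeze elongated characters, and expand known abbreviations."""
    tokens = []
    for token in text.lower().split():
        token = ELONGATION.sub(r"\1", token)   # "looooooove" -> "love"
        tokens.append(ABBREVIATIONS.get(token, token))
    return " ".join(tokens)


normalized = normalize_tweet("I looooooove this patch LGTM")
# normalized == "i love this patch looks good to me"
```

Squeezing a run down to a single character is a blunt heuristic: it turns "coool" into "col". A more careful version would check the candidates against a dictionary, but even this much removes a lot of noise from a small dataset.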

5. Don’t completely believe your quantitative results.

Numbers can sometimes lie. For example, in a text summarization project, the overlap between your system summary and the human curated summary may be 100%. However, when you actually visually inspect the machine and human summaries, you might find something astonishing. Human says: this is a great example of a bad summary. Machine says: example great this is summary a bad a of. And your overlap score=1.0. See my point? Quantitative evaluation alone is NOT ENOUGH. You need to visually inspect your results, and lots of them. Try to intuitively understand the problems that you are seeing. That’s one excellent way of getting more ideas on how to tweak your algorithm or ditch it altogether. In the summarization example, the problem was obvious: the word arrangement needs A LOT of work!
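The two summaries above can be scored with a simple n-gram overlap to show how this happens. The function below is a sketch in the spirit of recall-style overlap metrics; the exact metric is not specified here, so treat the definition as an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_recall(reference, candidate, n):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / sum(ref_counts.values())


human = "this is a great example of a bad summary"
machine = "example great this is summary a bad a of"

unigram_score = overlap_recall(human, machine, 1)   # 1.0: same bag of words
bigram_score = overlap_recall(human, machine, 2)    # 0.25: word order is mostly lost
```

Counting single words gives a perfect 1.0 because both summaries contain exactly the same words. Counting word pairs drops the score to 0.25, which is much closer to what you see when you read the machine summary yourself.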

6. Think about cost and scalability.

Have you ever thought about what it would take to deploy your model in a production environment? What are your data dependencies, how long does your model take to run, how about time to predict or generate results? Also, what are the memory and computation requirements of your approach when you scale up to the real number of data points that it would be handling? All of this has a direct impact on whether you can afford, budget-wise, to use your proposed approach and whether you will be able to handle a production load. If your model is GPU-bound, make sure that you are able to afford the cost of serving such a model.

The earlier you think about cost and scalability, the higher your chance of success in getting your models deployed. In my projects, I always instrument time to train, classify and process different loads to approximate how well the solutions that I am developing would hold up in a production environment.
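Instrumenting this does not need much machinery. Below is a minimal sketch of timing the train and predict steps at a few load sizes; the `train` and `predict` functions are hypothetical stand-ins for a real model, and the load sizes are arbitrary.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for a real model's train and predict steps.
def train(examples):
    return {word for text in examples for word in text.split()}


def predict(vocabulary, text):
    return sum(word in vocabulary for word in text.split())


timings = {}
for load in (100, 1_000, 10_000):
    examples = ["some example text to process"] * load
    model, train_seconds = timed(train, examples)
    _, predict_seconds = timed(lambda: [predict(model, t) for t in examples])
    timings[load] = {"train_s": train_seconds, "predict_s": predict_seconds}
```

Comparing the numbers across load sizes gives a first approximation of how the approach scales, well before anything is deployed.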

In summary, the prototypes that you develop don’t have to be throwaway prototypes. They can be the start of some really powerful production-level solutions if you plan ahead. Think about your end point and how the output from your model will be consumed and used, and don’t over-complicate your solution. You will not go wrong if you KISS and pick a technique that fits the problem instead of forcing your problem to fit your chosen technique!

Word Cloud in Python for Jupyter Notebooks and Web Apps

About a year ago, I looked high and low for a Python word cloud library that I could use from within my Jupyter notebook and that was flexible enough to use counts or tfidf when needed, or just accept a set of words and corresponding weights. I was a bit surprised that something like that did not already exist within libraries like plotly. All I wanted to do was to get a quick understanding of my text data and word vectors. That’s probably not too much to ask?

Here I am a year later, using my own word_cloud visualization library. It’s not the prettiest or the most sophisticated, but it works for my use cases. I decided to share it, so that others could use it as well. After installation, here are a few ways you could use it:

Generate word clouds with a single text document

This example showcases how you can generate word clouds with just one document. While the colors can be randomized, in this example, the colors are based on the default color settings. By default, the words are weighted by word counts unless you explicitly ask for tfidf weighting. Tfidf weighting makes sense only if you have a lot of documents to start with.

Generate word clouds from multiple documents

Let’s say you have 100 documents from one news category, and you just want to see what the common mentions are.

Generate word clouds from existing weights

Let’s say you have a set of words with corresponding weights, and you just want to visualize them. All you need to do is make sure that the weights are normalized to the range [0, 1].
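If your weights are raw counts or scores, one simple way to get them into that range is to divide by the largest value. The sketch below does only that preparation step; the function name and the example scores are made up, and it does not call the library itself.

```python
def normalize_weights(weights):
    """Scale non-negative weights so that the largest one becomes 1.0."""
    top = max(weights.values())
    if top <= 0:
        # Nothing to scale by; give every word zero weight.
        return {word: 0.0 for word in weights}
    return {word: value / top for word, value in weights.items()}


# Hypothetical raw scores, e.g. counts or tf-idf values from another step.
raw = {"python": 12, "notebook": 6, "cloud": 3}
scaled = normalize_weights(raw)
# scaled == {"python": 1.0, "notebook": 0.5, "cloud": 0.25}
```

Dividing by the maximum keeps the relative proportions between words. Min-max scaling is another option, but it pushes the smallest weight down to 0, which may make that word disappear from the cloud.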

Please feel free to propose changes to prettify the output. Just open a pull request with your changes.