Hands-On NLP

These are tutorials with code samples that teach you how to accomplish specific goals with text data, for example, extracting topics from text or using vectorizers correctly.


10+ Examples for Using CountVectorizer

Scikit-learn’s CountVectorizer is used to transform a corpus of text to a vector of term / token counts. It also provides the capability to preprocess your text data prior to generating the vector representation, making it a highly flexible feature representation module for text.

In this article, we are going to go in-depth into the different ways you can use CountVectorizer such that you are not just computing counts of words, but also preprocessing your text data appropriately as well as extracting additional features from your text dataset.

Example of How CountVectorizer Works

To show you an example of how CountVectorizer works, let’s take the book title below (for context: this is part of a book series that kids love):

This text is transformed to a sparse matrix as shown in Figure 1(b) below:

Figure 1: CountVectorizer sparse matrix representation of words. (a) is how you visually think about it. (b) is how it is really represented in practice.

Notice that here we have 9 unique words. So 9 columns. Each column in the matrix represents a unique word in the vocabulary, while each row represents the document in our dataset. In this case, we only have one book title (i.e. the document), and therefore we have only 1 row. The values in each cell are the word counts. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

While visually it’s easy to think of a word matrix representation as Figure 1 (a), in reality, these words are transformed to numbers and these numbers represent positional index in the sparse matrix as seen in Figure 1(b).
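As a minimal sketch of the two views described above, the snippet below vectorizes a single title and prints both the sparse and the dense form. The title is a placeholder chosen for illustration, not the one used in the article, so the number of columns differs from the 9 mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder title (an assumption, not the article's example).
title = ["The cat in the hat comes back"]

cv = CountVectorizer()
matrix = cv.fit_transform(title)  # SciPy sparse matrix with one row

print(cv.vocabulary_)    # word -> column index
print(matrix)            # sparse view: only non-zero (row, column) cells are stored
print(matrix.toarray())  # dense view, convenient for inspection only
```

Printing the sparse matrix lists positions and counts rather than words, which is the representation sketched in Figure 1(b).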

Why the sparse matrix format?

With CountVectorizer we are converting raw text to a numerical vector representation of words and n-grams. This makes it easy to directly use this representation as features (signals) in Machine Learning tasks such as for text classification and clustering.

Note that these algorithms only understand the concept of numerical features, irrespective of the underlying type (text, image pixels, numbers, categories and so on), allowing us to perform complex machine learning tasks on different types of data.

Side Note: If all you are interested in are word counts, then you can get away with using Python’s collections.Counter. There is no real need to use CountVectorizer. If you still want to do it, here’s the example for extracting counts with CountVectorizer.
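For plain word counts, a rough sketch with collections.Counter could look like this. The text and the simple regex tokenization are assumptions made for illustration.

```python
import re
from collections import Counter

text = "The cat in the hat comes back"  # placeholder text

# Lowercase, then split into runs of word characters.
tokens = re.findall(r"\w+", text.lower())
counts = Counter(tokens)

print(counts.most_common(3))
```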

Dataset & Imports

In this tutorial, we will be using the titles of 5 Cat in the Hat books (as seen below).

I had intentionally made it a handful of short texts so that you can see how to put CountVectorizer to full use in your applications. Keep note that each title above is considered a document.

CountVectorizer Plain and Simple

What happens above is that the 5 book titles are preprocessed, tokenized and represented as a sparse matrix as explained in the introduction. By default, CountVectorizer does the following:

  • lowercases your text (set lowercase=False if you don’t want lowercasing)
  • uses utf-8 encoding
  • performs tokenization (converts raw text to smaller units of text)
  • uses word level tokenization (meaning each word is treated as a separate token)
  • ignores single characters during tokenization (say goodbye to words like ‘a’ and ‘I’)

Now, let’s look at the vocabulary (collection of unique words from our documents):

As we are using all the defaults, these are all word level tokens, lowercased. Note that the numbers here are not counts, they are the position in the sparse vector. Now, let’s check the shape:

We have 5 (rows) documents and 43 unique words (columns)!
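A minimal sketch of the default behavior follows. The four titles are placeholders standing in for the article's five book titles, so the shape printed here will not match the (5, 43) above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

cv = CountVectorizer()                 # all defaults
count_vector = cv.fit_transform(docs)  # learn the vocabulary, return sparse counts

print(cv.vocabulary_)      # word -> column index (not a count)
print(count_vector.shape)  # (number of documents, number of unique words)
```

Note that the single-character word "a" never makes it into the vocabulary, which is the default tokenization at work.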

CountVectorizer and Stop Words

Now, the first thing you may want to do is eliminate stop words from your text, as they have limited predictive power and may not help with downstream tasks such as text classification. Stop word removal is a breeze with CountVectorizer and it can be done in several ways:

  1. Use a custom stop word list that you provide
  2. Use sklearn’s built in English stop word list (not recommended)
  3. Create corpora specific stop words using max_df and min_df (highly recommended and will be covered later in this tutorial)

Let’s look at the 3 ways of using stop words.

Custom Stop Word List

In this example, we provide a list of words that act as our stop words. Notice that the shape has gone from (5,43) to (5,40) because of the stop words that were removed. Note that we can actually load stop words directly from a file into a list and supply that as the stop word list.

To check the stop words that are being used (when explicitly specified), simply access cv.stop_words.

While cv.stop_words gives you the stop words that you explicitly specified as shown above, cv.stop_words_ (note: with underscore suffix) gives you the stop words that CountVectorizer inferred from your min_df and max_df settings as well as those that were cut off during feature selection (through the use of max_features). So far, we have not used the three settings, so cv.stop_words_ will be empty.
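Here is a small sketch of supplying a custom stop word list, again on placeholder titles. The stop words chosen are arbitrary and only meant to show the effect on the shape.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

# Arbitrary custom stop word list for illustration.
cv = CountVectorizer(stop_words=["the", "in", "and"])
count_vector = cv.fit_transform(docs)

print(count_vector.shape)  # three fewer columns than with the defaults
print(cv.stop_words)       # the list that was explicitly supplied
```

To load stop words from a file instead, read one word per line into a list and pass that list as stop_words.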

Stop Words using MIN_DF

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, in your text you may have names of people that appear in only 1 or 2 documents. In some applications, this may qualify as noise and could be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, MIN_DF looks at how many documents contained a term, better known as document frequency. The MIN_DF value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.25, meaning ignore words that have appeared in fewer than 25% of the documents).

Eliminating words that appeared in fewer than 2 documents:

Now, to see which words have been eliminated, you can use cv.stop_words_ as this was internally inferred by CountVectorizer (see output below).

Yikes! We removed everything? Not quite. However, most of our words have become stop words and that’s because we have only 5 book titles.

To see what’s remaining, all we need to do is check the vocabulary again with cv.vocabulary_ (see output below):

Sweet! These are the words that met the MIN_DF threshold of 2. In this dataset, they appeared in all 5 book titles.
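The effect of min_df can be sketched as follows on placeholder titles. To stay independent of any particular attribute, the removed words are computed here as a set difference against a default vocabulary, which is a choice made for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# Keep only words that appear in at least 2 documents.
cv = CountVectorizer(min_df=2)
cv.fit(docs)

kept = set(cv.vocabulary_)
removed = full_vocab - kept
print(sorted(kept))
print(sorted(removed))
```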

Stop Words using MAX_DF

Just as we ignored words that were too rare with MIN_DF, we can ignore words that are too common with MAX_DF. MAX_DF looks at how many documents contained a term, and if it exceeds the MAX_DF threshold, then the term is eliminated from consideration. The MAX_DF value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.85, meaning ignore words that appeared in more than 85% of the documents as they are too common).

I’ve typically used a value from 0.75-0.85 depending on the task and for more aggressive stop word removal you can even use a smaller value.

Now, to see which words have been eliminated, you can use cv.stop_words_ (see output below):

In this example, all words that appeared in all 5 book titles have been eliminated.
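A comparable sketch for max_df, on the same placeholder titles. The 0.5 threshold is chosen only so that the toy corpus shows an effect; it is not the value used in the article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# Drop words that appear in more than 50% of the documents.
cv = CountVectorizer(max_df=0.5)
cv.fit(docs)

removed = full_vocab - set(cv.vocabulary_)
print(sorted(removed))
```

Here only "cat" is dropped, since it is the only word present in more than half of the four titles.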

Why document frequency for eliminating words?

Document frequency is sometimes a better way of inferring stop words than term frequency, as term frequency can be misleading. For example, let’s say 1 document out of 250,000 documents in your dataset contains 500 occurrences of the word catnthehat. If you use term frequency for eliminating rare words, the count is so high that the word may never fall under your threshold for elimination. Yet the word is still rare, as it appears in only one document.

On several occasions, such as in building topic recommendation systems, I’ve found that using document frequency for eliminating rare and common terms gives far better results than relying on just overall term frequency.
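The difference between the two statistics can be illustrated with a tiny, made-up corpus in which one document repeats a word 500 times. The corpus is an assumption built to mirror the example above at a much smaller scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: one document repeats the same word 500 times.
docs = ["catnthehat " * 500, "a short note", "another short note"]

cv = CountVectorizer()
dense = cv.fit_transform(docs).toarray()  # dense is fine for a toy corpus

col = cv.vocabulary_["catnthehat"]
term_freq = int(dense[:, col].sum())       # total occurrences across the corpus
doc_freq = int((dense[:, col] > 0).sum())  # number of documents containing the word

print(term_freq, doc_freq)
```

The term frequency is large while the document frequency is 1, so only a document-frequency threshold such as min_df=2 would treat the word as rare.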

Custom Tokenization

The default tokenization in CountVectorizer removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.

In the example below, we provide a custom tokenizer using tokenizer=my_tokenizer where my_tokenizer is a function that attempts to keep all punctuation, and special characters and tokenizes only based on whitespace.

Fantastic, now we have our punctuation, single characters and special characters!
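A whitespace-only tokenizer along the lines described above might look like the sketch below. The function body and the sample sentence are assumptions. Setting token_pattern=None simply signals that the default pattern is not used when a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_tokenizer(text):
    """Split on whitespace only, keeping punctuation and single characters."""
    return text.split()

cv = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)
cv.fit(["Oh, the things you can do: a cat & a hat!"])  # placeholder sentence

print(sorted(cv.vocabulary_))
```

Lowercasing still happens before the tokenizer runs, because lowercase=True remains the default.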

Custom Preprocessing

In many cases, we want to preprocess our text prior to creating a sparse matrix of terms. As I’ve explained in my text preprocessing article, preprocessing helps reduce noise and sparsity, resulting in a more accurate analysis.

Here is an example of how you can achieve custom preprocessing with CountVectorizer by setting preprocessor=<some_preprocessor>.

In the example above, my_cool_preprocessor is a predefined function where we perform the following steps:

  1. lowercase the text (note: this is done by default if a custom preprocessor is not specified)
  2. remove special characters
  3. normalize certain words
  4. use stems of words instead of the original form (see: preprocessing article on stemming)

You can introduce your very own preprocessing steps such as lemmatization, adding parts-of-speech and so on to make this preprocessing step even more powerful.
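The following is a simplified stand-in for such a preprocessor, not the article's my_cool_preprocessor. It covers lowercasing, special character removal and one hand-written normalization rule. Stemming is left out here to avoid an extra dependency.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

def my_preprocessor(text):
    """Hypothetical preprocessor: lowercase, strip special characters, normalize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\bcents\b", "cent", text)  # normalize one plural form
    return text

cv = CountVectorizer(preprocessor=my_preprocessor)
cv.fit(["One Cent, Two Cents: All About Money!"])  # placeholder title

print(sorted(cv.vocabulary_))
```

Because a custom preprocessor replaces the built-in one, lowercasing has to be done inside the function.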

Working With N-Grams

One way to enrich the representation of your features for tasks like text classification is to use n-grams where n > 1. The intuition here is that bi-grams and tri-grams can capture contextual information that unigrams alone cannot. In addition, for tasks like keyword extraction, unigrams alone, while useful, provide limited information. For example, good food carries more meaning than just good and food when observed independently.

Working with n-grams is a breeze with CountVectorizer. You can use word level n-grams or even character level n-grams (very useful in some text classification tasks). Here are a few examples:

Word level – bigrams only

Word level – unigrams and bigrams

Character level – bigrams only
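The three variants listed above can be sketched with the ngram_range and analyzer parameters. The titles are placeholders, not the article's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

# Word level, bigrams only
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

# Word level, unigrams and bigrams
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# Character level, bigrams only
char_bigrams = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(docs)

print(sorted(bigrams.vocabulary_))
print(len(uni_and_bi.vocabulary_), len(char_bigrams.vocabulary_))
```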

Limiting Vocabulary Size

When your feature space gets too large, you can limit its size by putting a restriction on the vocabulary size. Say you want a max of 10,000 n-grams. CountVectorizer will keep the top 10,000 most frequent n-grams and drop the rest.

Since we have a toy dataset, in the example below, we will limit the number of features to 10.

Notice that the shape now is (5,10) as we asked for a limit of 10 on the vocabulary size. You can check the removed words using cv.stop_words_.
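A sketch of limiting the vocabulary with max_features, using placeholder titles and a limit of 2 so the effect is visible on such a small corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

# Keep only the 2 terms with the highest total count across the corpus.
cv = CountVectorizer(max_features=2)
count_vector = cv.fit_transform(docs)

print(count_vector.shape)
print(sorted(cv.vocabulary_))
```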

Ignore Counts and Use Binary Values

By default, CountVectorizer uses the counts of terms/tokens. However, you can choose to just use presence or absence of a term instead of the raw counts. This is useful in some tasks such as certain features in text classification where the frequency of occurrence is insignificant. To get binary values instead of counts all you need to do is set binary=True.

If you set binary=True then CountVectorizer no longer uses the counts of terms/tokens. If a token is present in a document, it is 1, if absent it is 0 regardless of its frequency of occurrence. By default, binary=False.
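The difference between counts and binary values shows up on any document that repeats a word, as in this sketch with placeholder titles:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
]

counts = CountVectorizer().fit_transform(docs).toarray()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)  # "the" is counted twice in each title
print(flags)   # every present term is recorded as 1
```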

Using CountVectorizer to Extract N-Gram / Term Counts

Finally, you may want to use CountVectorizer to obtain counts of your n-grams. This is slightly tricky to do with CountVectorizer, but achievable as shown below:

The counts are first ordered in descending order. Then from this list, each feature name is extracted and returned with corresponding counts.
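One way to do this, sketched under the assumption that summing the columns of the count matrix is acceptable for your corpus size, is shown below with placeholder titles. Feature names are recovered from vocabulary_ so the snippet does not depend on a specific scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder titles (an assumption, not the article's dataset).
docs = [
    "The Cat in the Hat",
    "The Cat in the Hat Comes Back",
    "Green Eggs and Ham",
    "A Cat and a Fish",
]

cv = CountVectorizer(ngram_range=(1, 2))
count_vector = cv.fit_transform(docs)

# Sum each column to get the total count of every n-gram.
totals = np.asarray(count_vector.sum(axis=0)).ravel().tolist()

# Order the terms by column index so they line up with the totals.
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

ranked = sorted(zip(terms, totals), key=lambda pair: pair[1], reverse=True)
print(ranked[:5])
```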


CountVectorizer provides a powerful way to extract and represent features from your text data. It allows you to control your n-gram size, perform custom preprocessing, custom tokenization, eliminate stop words and limit vocabulary size.

While counts of words can be useful signals by themselves, in some cases, you will have to use alternative schemes such as TF-IDF to represent your features. For some applications, a binary bag of words representation may also be more effective than counts. For a more sophisticated feature representation, people use word, sentence and paragraph embeddings trained using algorithms like word2vec, BERT and ELMo, where each textual unit is encoded using a fixed length vector.

To rerun some of the examples in this tutorial, get the Jupyter notebook for this article. If there is anything that I missed out here, do feel free to leave a comment below.

See Also: Learning How To Build Your First Text Classifier


Recommended reading


Build Your First Text Classifier in Python with Logistic Regression

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I don’t see or hear from spammy emails!

Spam classification

Other than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles and more.

Text classifiers work by leveraging signals in the text to “guess” the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like slow, problem, wouldn't and not, can bias the classifier to predict negative sentiment.

The nice thing about text classification is that you have a range of options in terms of what approaches you could use, from unsupervised rules-based approaches to supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.

In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem. The problem, while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).

Here’s the full source code with accompanying dataset for this tutorial. Note that this is a fairly long tutorial and I would suggest that you break it down to several sessions so that you completely grasp the concepts. 

HuffPost Dataset

The dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018 as seen below. This data set has about 125,000 articles and 31 different categories.

Figure 1: Articles distribution from 2014-2018

Now let’s look at the category distribution of these articles (Figure 2). Notice that politics has the largest number of articles, while education has the fewest, with counts only in the hundreds.

So, nothing surprising in the category distribution other than that we have far fewer articles to learn from in categories outside POLITICS.

Figure 2: Number of articles per category

Now, let’s take a quick peek at the dataset (Figure 3).

Figure 3: Sneak peek of the news dataset

Notice that the fields we have in order to learn a classifier that predicts the category include headline, short_description, link and authors.

The Challenge

As mentioned earlier, the problem that we are going to be tackling is to predict the category of news articles (as seen in Figure 3), using only the description, headline and the url of the articles. 

Without the actual content of the article itself, the data that we have for learning is actually pretty sparse – a problem you may encounter in the real world. But let’s see if we can still learn from it reasonably well. We will not use the author field because we want to test it on articles from a different news organization, specifically from CNN. 

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, I have found Logistic Regression to be very effective on text data and the underlying algorithm is also fairly easy to understand. More importantly, in the NLP world, it’s generally accepted that Logistic Regression is a great starter algorithm for text related classification.

Feature Representation

Features are attributes (signals) that help the model learn. This can be specific words from the text itself (e.g. all words, top occurring terms, adjectives) or additional information inferred based on the original text (e.g. parts-of-speech, contains specific phrase patterns, syntactic tree structure).

For this task, we have text fields that are fairly sparse to learn from. Therefore, we will try to use all words from several of the text fields. This includes the description, headline and tokens from the url. The more advanced feature representation is something you should try as an exercise.

Feature Weighting

Not all words are equally important to a particular document / category. For example, while words like ‘murder’, ‘knife’ and ‘abduction’ are important to a crime related document, words like ‘news’ and ‘reporter’ may not be quite as important. 

In this tutorial, we will be experimenting with 3 feature weighting approaches. The most basic form of feature weighting is binary weighting, where the weight is ‘1’ if a word is present in a document and ‘0’ if it is absent.

Another way to assign weights is using the term-frequency of words (the counts). If a word like ‘knife’ appears 5 times in a document, that can become its corresponding weight. 

We will also be using TF-IDF weighting where words that are unique to a particular document would have higher weights compared to words that are used commonly across documents. 

There are of course many other methods for feature weighting. The approaches that we will experiment with in this tutorial are the most common ones and are usually sufficient for most classification tasks. 


One of the most important components in developing a supervised text classifier is the ability to evaluate it. We need to understand if the model has learned sufficiently based on the examples that it saw in order to make correct predictions.

If the performance is rather laughable, then we know that more work needs to be done. We may need to improve the features, add more data, tweak the model parameters and so on.

For this particular task, even though the HuffPost dataset lists one category per article, in reality, an article can actually belong to more than one category. For example, the article in Figure 4 could belong to COLLEGE (the primary category) or EDUCATION.

Figure 4: Example of article that can fit into multiple categories

If the classifier predicts EDUCATION as its first guess instead of COLLEGE, that doesn’t mean it’s wrong. As this is bound to happen with various other categories, instead of looking only at the first predicted category, we will look at the top 3 predicted categories to compute (a) accuracy and (b) mean reciprocal rank (MRR).


Accuracy evaluates the fraction of correct predictions. In our case, it is the number of times the PRIMARY category appeared in the top 3 predicted categories divided by the total number of categorization tasks. 


Unlike accuracy, MRR takes the rank of the first correct answer into consideration (in our case rank of the correctly predicted PRIMARY category). The formula for MRR is as follows:

Figure 5: MRR formula

where Q here refers to all the classification tasks in our test set and rank_{i} is the position of the correctly predicted category. The closer the correctly predicted category is to position 1, the higher the MRR.

Since we are using the top 3 predictions, MRR will give us a sense of where the PRIMARY category is at in the ranks. If the rank of the PRIMARY category is on average 2, then the MRR would be ~0.5 and at 3, it would be ~0.3. We want to get the PRIMARY category higher up in the ranks.
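In its standard form, MRR is the average of 1/rank_{i} over all tasks in Q. A minimal sketch of both metrics over top-3 predictions is shown below. The function name, the sample labels and the choice to score a miss (PRIMARY category outside the top 3) as 0 are assumptions made for this illustration.

```python
def top_k_accuracy_and_mrr(true_labels, ranked_predictions, k=3):
    """Return (accuracy, MRR) where only the top k predictions are considered."""
    hits = 0
    reciprocal_rank_total = 0.0
    for truth, predictions in zip(true_labels, ranked_predictions):
        top_k = list(predictions)[:k]
        if truth in top_k:
            hits += 1
            # Ranks start at 1, list positions start at 0.
            reciprocal_rank_total += 1.0 / (top_k.index(truth) + 1)
        # A miss contributes 0 to both totals (an assumption of this sketch).
    n = len(true_labels)
    return hits / n, reciprocal_rank_total / n

# Made-up PRIMARY categories and ranked predictions.
truths = ["COLLEGE", "POLITICS", "CRIME", "SCIENCE"]
predictions = [
    ["EDUCATION", "COLLEGE", "PARENTS"],  # correct at rank 2
    ["POLITICS", "WORLD NEWS", "CRIME"],  # correct at rank 1
    ["POLITICS", "WORLD NEWS", "CRIME"],  # correct at rank 3
    ["TECH", "TRAVEL", "STYLE"],          # miss
]

accuracy, mrr = top_k_accuracy_and_mrr(truths, predictions)
print(round(accuracy, 3), round(mrr, 3))  # 0.75 0.458
```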

Building the classifier

Now it’s finally time to build the classifier! Note that we will be using the LogisticRegression module from sklearn.

Make Necessary Imports

Start with the imports.


Read dataset and create text field variations

Next, we will be creating different variations of the text we will use to train the classifier. This is to see how adding more content to each field helps with the classification task. Notice that we create a field using only the description, description + headline, and description + headline + url (tokenized).
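The three variations could be built roughly as follows. The record, the helper tokenize_url and the variable names are hypothetical; only the field names headline, short_description and link come from the dataset description above.

```python
import re

def tokenize_url(url):
    """Hypothetical helper: turn a URL into space-separated word tokens."""
    url = re.sub(r"^https?://(www\.)?", "", url)
    return " ".join(t for t in re.split(r"[^A-Za-z0-9]+", url) if t)

# Made-up record using the dataset's field names.
article = {
    "headline": "Cat Returns Home",
    "short_description": "A missing cat came back after two weeks.",
    "link": "https://www.example.com/entry/cat-returns-home_us_123abc",
}

text_desc = article["short_description"]
text_desc_headline = f"{text_desc} {article['headline']}"
text_desc_headline_url = f"{text_desc_headline} {tokenize_url(article['link'])}"

print(text_desc_headline_url)
```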


Split dataset for training and testing

Next, we will create a train / test split of our dataset, where 25% of the dataset will be used for testing based on our evaluation strategy and remaining will be used for training the classifier. 

Prepare features

Earlier, we talked about feature representation and different feature weighting schemes. The extract_features(...) function from above is where we extract the different types of features based on the weighting schemes.

First, note that cv.fit_transform(...) from the above code snippet creates a vocabulary based on the training set. Next, cv.transform(...) takes in any text (test or unseen texts) and transforms it according to the vocabulary of the training set, limiting the words by the specified count restrictions (min_df, max_df) and applying necessary stop words if specified. It returns a document-term matrix where each column in the matrix represents a word in the vocabulary while each row represents a document in the dataset. The values could either be binary or counts. The same concept also applies to tfidf_vectorizer.fit_transform(...) and tfidf_vectorizer.transform(...).
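The fit-on-train, transform-on-test pattern, together with a 25% test split, can be sketched like this. The texts and labels are made up, and this is not the article's extract_features function.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up texts and labels for illustration.
texts = [
    "senate passes budget bill", "president signs new law",
    "governor vetoes tax plan", "court blocks election rule",
    "team wins championship game", "striker scores late goal",
    "coach praises young players", "runner breaks marathon record",
]
labels = ["POLITICS"] * 4 + ["SPORTS"] * 4

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# Binary weighting: vocabulary is learned from the training set only.
cv = CountVectorizer(binary=True)
X_train_binary = cv.fit_transform(train_texts)
X_test_binary = cv.transform(test_texts)

# TF-IDF weighting follows the same fit / transform pattern.
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts)
X_test_tfidf = tfidf_vectorizer.transform(test_texts)

print(X_train_binary.shape, X_test_binary.shape)
```

Words that appear only in the test texts are ignored by transform, which is why the train and test matrices always have the same number of columns.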

You can read my article on TfidfVectorizer and TfidfTransformer to learn how to use these classes correctly. Read this article if you want more information on how to use CountVectorizer.

Train your Logistic Regression model

The code below shows how we start the training process. When you instantiate LogisticRegression, you can vary the solver, the penalty, the C value and also specify how it should handle the multi-class classification problem (one-vs-all or multinomial). Note that newer scikit-learn releases default to a multinomial formulation for most solvers, so check the documentation for your version. In the version used for this tutorial, a one-vs-all approach is the default and that’s what we’re using below:

In a one-vs-all approach that we are using above, a binary classification problem is fit for each of our 31 labels. Since we are selecting the top 3 categories predicted by the classifier (see below), we will leverage the estimated probabilities instead of the binary predictions. Behind the scenes, we are actually collecting the probability of each news category being positive.
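As a rough sketch of selecting the top 3 categories from estimated probabilities, the snippet below makes the one-vs-all setup explicit by wrapping LogisticRegression in OneVsRestClassifier. That wrapper, the helper top_k_labels and the training data are choices made for this illustration rather than the article's code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up training data with three categories.
train_texts = [
    "election senate vote", "president congress bill",
    "football goal striker", "tennis match champion",
    "planet telescope orbit", "physics experiment particle",
]
train_labels = ["POLITICS", "POLITICS", "SPORTS", "SPORTS", "SCIENCE", "SCIENCE"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# One binary logistic regression is fit per label.
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, train_labels)

def top_k_labels(model, X, k=3):
    """Return the k most probable labels per row, best first."""
    probabilities = model.predict_proba(X)
    best = np.argsort(probabilities, axis=1)[:, ::-1][:, :k]
    return [[str(model.classes_[i]) for i in row] for row in best]

preds = top_k_labels(model, vectorizer.transform(["election senate vote"]))
print(preds)
```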


Evaluate performance

In this section, we will look at the results for different variations of our model. First, we train a model using only the description of articles with binary feature weighting. 

Figure 6: Accuracy and MRR using description of text and binary feature weighting

You can see that the accuracy is 0.59 and MRR is 0.48.  This means that only about 59% of the PRIMARY categories are appearing within the top 3 predicted labels. The MRR also tells us that the rank of the PRIMARY category is between position 2 and 3. Let’s see if we can do better. Let’s try a different feature weighting scheme. 

Figure 7: Accuracy and MRR using description of text and tf-idf feature weighting

This second model uses tf-idf weighting instead of binary weighting on the same description field. You can see that the accuracy is 0.63 and MRR is 0.51, a slight improvement. This is a good indicator that tf-idf weighting works better than binary weighting for this particular task.

How else can we improve our classifier? Remember, we are only using the description field and it is fairly sparse. What if we used the description, headline and tokenized URL, would this help? Let’s try it. 

Figure 8: Accuracy and MRR using all text fields and tf-idf feature weighting

Now look! As you can see in Figure 8, the accuracy is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news! 

In Figure 9, you will see how well the model performs on different feature weighting methods and use of text fields.

Figure 9: Experimentation with different combination of feature weighting and text fields

There are several observations that can be made from the results in Figure 9:

  1. tf-idf based weighting outperforms binary & count based schemes
  2. count based feature weighting is no better than binary weighting
  3. Sparsity has a lot to do with how poorly the model performs. The richer the text field, the better the overall performance of the classifier.  

Prediction on CNN articles

Now, the fun part! Let’s test it on articles from a different news source than HuffPost. Let’s see how the classifier does, judging by eye, on articles from CNN. We will predict the top 2 categories.

A crime related story [ see article ]

Predicted: politics, crime

Entertainment related story [ see article ]

Predicted: entertainment, style

Another entertainment related story [ see article ]

Predicted: entertainment, style

Exercise in space [ see article ]

Predicted: science, healthy living

Overall, not bad. The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the url tokens and description. 

Saving Logistic Regression Model 

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to save the transformer to later encode / vectorize any unseen document. Next, we also need to save the trained model so that it can make predictions using the weight vectors. Here’s how you do it:

Saving SKLearn Model & Transformer

Loading Model & Transformer for Reuse
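A minimal sketch of saving and reloading both objects with pickle follows. The file names, the temporary directory and the tiny training set are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data.
texts = ["senate passes budget bill", "team wins championship game",
         "president signs new law", "striker scores late goal"]
labels = ["POLITICS", "SPORTS", "POLITICS", "SPORTS"]

transformer = TfidfVectorizer()
model = LogisticRegression().fit(transformer.fit_transform(texts), labels)

# Hypothetical file locations inside a temporary directory.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "model.pkl")
transformer_path = os.path.join(model_dir, "transformer.pkl")

# Save the trained model and the fitted transformer.
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(transformer_path, "wb") as f:
    pickle.dump(transformer, f)

# Load both back and predict on an unseen text.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
with open(transformer_path, "rb") as f:
    loaded_transformer = pickle.load(f)

print(loaded_model.predict(loaded_transformer.transform(["new tax bill"])))
```

Pickled scikit-learn objects are best loaded with the same library version that created them, and only from sources you trust.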

Over to you

Here’s the full source code with accompanying dataset for this tutorial. I hope this article has given you the confidence in implementing your very own high-accuracy text classifier.

Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task, may very well be inadequate for something like bug detection in source code.

An exercise for you:

Right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried, and how well it worked. Aim for a 90-95% accuracy and let us all know what worked! 


  • Curate additional features
  • Perform feature selection 
  • Tweak model parameters
  • Try balancing number of articles per category

See Also: How to extract keywords with TF-IDF?


Recommended Reading

Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why?

Pre-trained word embeddings are vector representations of words trained on a large dataset. With pre-trained embeddings, you will essentially be using the weights and vocabulary from the end result of the training process done by... someone else! (It could also be you.)

One benefit of using pre-trained embeddings is that you can hit the ground running without having to find a large text corpus, preprocess it and train with the appropriate settings.

Another benefit is the savings in training time. Training on a large corpus can demand high computation power and long training times, which may not be something that you can afford for quick experimentation.

If you want to avoid all of these logistics but still have access to good quality embeddings, you could use pre-trained word embeddings trained on a dataset that fits the domain you are working in. For example, if you are working with news articles, it may be perfectly fine to use embeddings trained on a Twitter dataset as there is ongoing discussion about current issues as well as a constant stream of news related Tweets.

Accessing pre-trained embeddings is extremely easy with Gensim as it allows you to use pre-trained GloVe and Word2Vec embeddings with minimal effort. The code snippets below show you how. Here’s the working notebook for this tutorial.

Accessing pre-trained Twitter GloVe embeddings

Here, we are trying to access GloVe embeddings trained on a Twitter dataset. This first step downloads the pre-trained embeddings and loads them for re-use. These vectors are based on 2B tweets, 27B tokens, 1.2M vocab, uncased. The original source of the embeddings can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name below refers to the dimensionality of the vectors.

Once you have loaded the pre-trained model, just use it as you would with any Gensim Word2Vec model. Here are a few examples:

This next example prints the word vectors for trump and obama. Notice that it prints only 25 values for each word. This is because our vector dimensionality is 25.

For vectors of other dimensionality use the appropriate model names from here or reference the gensim-data GitHub repo:
  • glove-twitter-25 (104 MB)
  • glove-twitter-50 (199 MB)
  • glove-twitter-100 (387 MB)
  • glove-twitter-200 (758 MB)
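Loading any of the models listed above goes through gensim's downloader module. The sketch below wraps the calls in functions so that nothing is downloaded until you call them. The function names are hypothetical.

```python
def load_pretrained_vectors(name="glove-twitter-25"):
    """Download (on first use) and load a pre-trained model by name."""
    import gensim.downloader as api  # imported lazily; requires gensim
    return api.load(name)

def show_examples(model):
    """Print a few lookups on a loaded model."""
    print(model.most_similar("obama", topn=5))  # nearest neighbours by cosine similarity
    print(model["trump"])                       # raw word vector
    print(model["obama"].shape)                 # (25,) for glove-twitter-25
```

Calling show_examples(load_pretrained_vectors()) triggers the download on the first run and reuses the cached copy afterwards.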

Accessing pre-trained Wikipedia GloVe embeddings

The GloVe embeddings below were trained on an English Wikipedia dump and the English Gigaword 5th Edition dataset. Their dimensionality is 100 and they are based on 6B tokens (uncased). The original source of the embeddings can be found here: https://nlp.stanford.edu/projects/glove/

Once you have loaded the pre-trained model, just use it as usual. Here is a similarity example:

For vectors of other dimensionality you can use the appropriate model names from below or reference the gensim-data repository:
  • glove-wiki-gigaword-50 (65 MB)
  • glove-wiki-gigaword-100 (128 MB)
  • glove-wiki-gigaword-200 (252 MB)
  • glove-wiki-gigaword-300 (376 MB)

Accessing pre-trained Word2Vec embeddings

So far, you have looked at a few examples using GloVe embeddings. In the same way, you can also load pre-trained Word2Vec embeddings. Here are some of your options for Word2Vec:
  • word2vec-google-news-300 (1662 MB) (dimensionality: 300)
  • word2vec-ruscorpora-300 (198 MB) (dimensionality: 300)
Be warned that the Google News embeddings are sizable, so ensure that you have sufficient disk space before using them.

What can you use pre-trained word embeddings for?

You can use pre-trained word embeddings for a variety of tasks including:
  • Finding word or phrase similarities
  • As feature weights for text classification
  • For creating an embedding layer for neural network based text classification
  • For machine translation
  • Query expansion for search enhancements
  • To create sentence embeddings through vector averaging
The possibilities are actually endless, but you may not always get better results than just a bag-of-words approach. For example, I’ve tried sentence embeddings for a search reranking task and the rankings actually deteriorated. The only way to know if it helps, is to try it and see if it improves your evaluation metrics!

Example of using GloVe embeddings to rank phrases by similarity

Here is an example of using the glove-twitter-25 GloVe embeddings to find phrases that are most similar to the query phrase. Let’s say we have the following phrases and a corresponding query phrase with several misspellings (missing ‘r’ in barack and ‘a’ instead of ‘e’ in hussein).

The goal here is, given the query phrase, to rank all other phrases by semantic similarity (using the GloVe Twitter embeddings) and compare that with surface level similarity using the Jaccard similarity index. Jaccard has no notion of semantics so it sees a token as is.

The code above splits each candidate phrase as well as the query into a set of tokens (words). The n_similarity(tokens_1, tokens_2) call takes the average of the word vectors for the query (tokens_2) and the phrase (tokens_1) and computes the cosine similarity using the resulting averaged vectors. The results are later sorted by descending order of cosine similarity scores.

This method of vector averaging assumes that the words within tokens_1 share a common concept which is amplified through word averaging. The same is the case for tokens_2. As pointed out by Radim (creator of Gensim), this crude method works surprisingly well.

Below, you will see the ranking of phrases using the word embeddings method vs. the surface similarity method with Jaccard. Notice that even with misspellings, we are able to produce a decent ranking of most similar phrases using the GloVe vectors. This is because misspellings are common in tweets. If a misspelled word is present in the vocabulary, then it will have a corresponding weight vector. In comparison, Jaccard similarity does slightly worse (visually speaking) as all it knows are the tokens given to it. It is unaware of misspellings and has no notion of semantics.

I hope this article and accompanying notebook will give you a quick start in using pre-trained word embeddings. Can you think of any other use cases for how you would use these embeddings?
Leave a comment below with your ideas!
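To make the contrast between the two similarity measures concrete, here is a minimal sketch in plain Python. The two-dimensional vectors in TOY_VECTORS are invented for illustration (they are not real GloVe values), and averaged_similarity is a hypothetical helper that only mimics the idea of averaging word vectors and taking the cosine; it is not Gensim's implementation.

```python
import math

# Toy 2-d vectors, invented for illustration only (not real GloVe values).
TOY_VECTORS = {
    "barack": [0.90, 0.10],
    "barak": [0.85, 0.15],   # a misspelling that sits close to "barack"
    "obama": [0.80, 0.20],
    "weather": [0.10, 0.90],
    "today": [0.20, 0.80],
}

def jaccard(tokens_1, tokens_2):
    """Surface similarity: shared tokens divided by all distinct tokens."""
    a, b = set(tokens_1), set(tokens_2)
    return len(a & b) / len(a | b)

def mean_vector(tokens):
    """Average the vectors of the known tokens (assumes at least one is known)."""
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def averaged_similarity(tokens_1, tokens_2):
    """Cosine similarity between the two averaged vectors."""
    return cosine(mean_vector(tokens_1), mean_vector(tokens_2))

query = "barak obama".split()
phrases = ["barack obama", "weather today"]

# Rank candidate phrases by descending embedding-based similarity.
ranked = sorted(phrases, key=lambda p: averaged_similarity(p.split(), query), reverse=True)

for phrase in ranked:
    tokens = phrase.split()
    print(phrase, round(averaged_similarity(tokens, query), 3), round(jaccard(tokens, query), 3))
```

In this toy setup the misspelled query still lands right next to the correctly spelled phrase in vector space, while Jaccard only gets credit for the one token that matches exactly.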

See Also: A step-by-step Word2Vec tutorial with Gensim


How to Use Tfidftransformer & Tfidfvectorizer?

Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing, which is to produce a matrix of TF-IDF features for a collection of documents. The differences between the two modules can be quite confusing, and it’s hard to know when to use which. This article shows you how to correctly use each module, the differences between the two, and some guidelines on what to use when.

Tfidftransformer Usage

1. Dataset and Imports

Below we have 5 toy documents, all about my cat and my mouse who live happily together in my house. We are going to use this toy dataset to compute the tf-idf scores of words in these documents.

We also import the necessary modules here which include TfidfTransformer and CountVectorizer.

2. Initialize CountVectorizer

In order to start using TfidfTransformer you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, etc. The code below does just that.

Now, let’s check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words, minus single character words):

Sweet, this is what we want! Now it’s time to compute the IDFs. Note that in this example, we are using all the defaults with CountVectorizer. You can actually specify a custom stop word list, enforce minimum word count, etc. See this article on how to use CountVectorizer.
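Steps 1 and 2 can be sketched roughly as follows. Only the first document is quoted in this article; the other four documents here are illustrative stand-ins that I chose to match the description (5 docs, 16 unique words), so treat them as an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the house had a tiny little mouse",   # first document, as quoted in this article
    "the cat saw the mouse",               # the remaining four are illustrative stand-ins
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# All defaults: lowercasing, no stop word list, tokens of 2+ characters.
cv = CountVectorizer()

# Learn the vocabulary and return the sparse matrix of word counts.
word_count_vector = cv.fit_transform(docs)

print(word_count_vector.shape)
```

With these documents the shape works out to 5 rows and 16 columns, because the single-character word "a" never makes it into the vocabulary.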

3. Compute the IDF values

Now we are going to compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts we computed earlier.

To get a glimpse of how the IDF values look, we are going to print them by placing the IDF values in a pandas DataFrame. The values will be sorted in ascending order.

Resulting IDF values

Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.

Important note: In practice, your IDF should be based on a large corpus of text.
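If you are curious where these numbers come from, here is a small plain-Python sketch of the smoothed IDF formula that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. As before, only the first document is taken from this article; the other four are illustrative stand-ins.

```python
import math
import re

docs = [
    "the house had a tiny little mouse",   # quoted in this article
    "the cat saw the mouse",               # illustrative stand-ins below
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

def tokenize(doc):
    # Mimics CountVectorizer's default behaviour: lowercase, keep tokens of 2+ characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in docs:
    for term in set(tokenize(doc)):
        df[term] = df.get(term, 0) + 1

# Smoothed IDF, as in scikit-learn's default (smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

for term, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {value:.4f}")
```

Words that appear in every document end up with the minimum possible value of 1.0, and words that appear in only one document get the highest value.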

4. Compute the TFIDF score for your documents

Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.

The first line above gets the word counts for the documents in sparse matrix form. We could have actually used word_count_vector from above. However, in practice, you may be computing tf-idf scores on a set of new, unseen documents. When you do that, you will first have to call cv.transform(your_new_docs) to generate the matrix of word counts.

Then, by invoking tfidf_transformer.transform(count_vector) you will finally be computing the tf-idf scores for your docs. Internally, this computes the tf * idf multiplication, where each term frequency is weighted by its IDF value. With the default settings, each document’s vector is then normalized to unit length (L2 norm).

Now, let’s print the tf-idf values of the first document to see if it makes sense. What we are doing below is, placing the tf-idf scores from the first document into a pandas data frame and sorting it in descending order of scores.

Tf-idf scores of first document:

tf-idf values using Tfidftransformer

Notice that only certain words have scores. This is because our first document is “the house had a tiny little mouse”: all the words in this document have a tf-idf score and everything else shows up as zero. Notice that the word “a” is missing from this list. This is because CountVectorizer’s default token_pattern only keeps tokens of two or more alphanumeric characters, so single-character words are dropped.

The scores above make sense. The more common the word across documents, the lower its score, and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher the score. So it’s working as expected; the missing “a” is simply a side effect of the default tokenization.

Tfidfvectorizer Usage

Now, we are going to use the same 5 documents from above to do the same thing as we did for TfidfTransformer, which is to get the tf-idf scores of a set of documents. But notice how this is much shorter.

With TfidfVectorizer you compute the word counts, idf and tf-idf values all at once. It’s really simple.

Now let’s print the tf-idf values for the first document from our collection. Notice that these values are identical to the ones from TfidfTransformer; the only difference is that it’s done in just two steps.

tf-idf values using Tfidfvectorizer

Here’s another way to do it by calling fit and transform separately and you’ll end up with the same results.
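Here is a short sketch that checks the claim that both routes give the same numbers, using default settings on both sides. The toy documents are the same assumed stand-ins as in the earlier sections (only the first one is quoted in this article).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",   # quoted in this article
    "the cat saw the mouse",               # illustrative stand-ins below
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Route 1: counts first, then tf-idf.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: everything at once.
one_step = TfidfVectorizer().fit_transform(docs)

# Route 2 again, calling fit and transform separately.
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
separate = vectorizer.transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical scores:", same)
```

The comparison only holds when both routes use the same tokenization and weighting options, which is the case here because everything is left at its default.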

Tfidftransformer vs. Tfidfvectorizer

In summary, the main differences between the two modules are as follows:

With TfidfTransformer you will systematically compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

With TfidfVectorizer, on the contrary, you will do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.

When to use what?

So now you may be wondering why you should use more steps than necessary if you can get everything done in two steps. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:

  • If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
  • If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
  • If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.

See Also: How to extract keywords with TF-IDF?


Recommended Reading

Word Cloud in Python for Jupyter Notebooks and Web Apps

About a year ago, I looked high and low for a Python word cloud library that I could use from within my Jupyter notebook that was flexible enough to use counts or tfidf when needed or just accept a set of words and corresponding weights. I was a bit surprised that something like that did not already exist within libraries like plotly. All I wanted to do was to get a quick understanding of my text data and word vectors. That’s probably not too much to ask? Here I am a year later, using my own word_cloud visualization library. It’s not the prettiest or the most sophisticated, but it works for my use cases. I decided to share it, so that others could use it as well. After installation, here are a few ways you could use it:

Generate word clouds with a single text document

This example showcases how you can generate word clouds with just one document. While the colors can be randomized, in this example, the colors are based on the default color settings. By default, the words are weighted by word counts unless you explicitly ask for tfidf weighting. Tfidf weighting makes sense only if you have a lot of documents to start with.

Generate word clouds from multiple documents

Let’s say you have 100 documents from one news category, and you just want to see what the common mentions are.

Generate word clouds from existing weights

Let’s say you have a set of words with corresponding weights, and you just want to visualize it. All you need to do is make sure that the weights are normalized to the range [0, 1].

Please feel free to propose changes to prettify the output; just open a pull request with your changes.


Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn

In this era of using Deep Learning for everything, you may be wondering why you would even use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and is one of the most versatile statistics for showing the relative importance of a word or phrase in a document or a set of documents in comparison to the rest of your corpus.

Keywords are descriptive words or phrases that characterize your documents. For example, keywords from this article would be tf-idf, scikit-learn, keyword extraction, extract and so on. These keywords are also referred to as topics in some applications.

TF-IDF can be used for a wide range of tasks including text classification, clustering / topic-modeling, search, keyword extraction and a whole lot more.

In this article, you will learn how to use TF-IDF from the scikit-learn package to extract keywords from documents.

Let’s Get Started…

I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is. For a more academic explanation I would recommend my Ph.D advisor’s explanation. If you just need access to my Jupyter Notebook with full code samples, please head over to my repo, otherwise please read on.


In this keyword extraction tutorial, we’ll be using a stack overflow dataset which is a bit noisy and simulates what you could be dealing with in real life. You will find this dataset in my tutorial repo. Notice that there are two files in this repo, the larger file, stackoverflow-data-idf.json has 20,000 posts and is used to compute the Inverse Document Frequency (IDF) and the smaller file, stackoverflow-test.json has 500 posts and we would use that as a test set for us to extract keywords from. This dataset is based on the publicly available stack overflow dump from Google’s Big Query.

The first thing we’ll do is to take a peek at our dataset. The code below reads a one per line json string from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate json string.
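To show what lines=True does without needing the real data file, here is a tiny sketch that reads two made-up records from an in-memory string. The records and their three fields are invented for illustration; the real file has 19 fields per post.

```python
import io
import pandas as pd

# Two made-up posts, one JSON object per line (the layout that lines=True expects).
raw = (
    '{"id": 1, "title": "How to parse JSON?", "body": "<p>Example body one</p>"}\n'
    '{"id": 2, "title": "Sorting a list", "body": "<p>Example body two</p>"}\n'
)

# With the real dataset you would pass the file path instead of a StringIO object.
df = pd.read_json(io.StringIO(raw), lines=True)

print(df.dtypes)
print("Number of questions,columns=", df.shape)
```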


accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)

Notice that this stack overflow dataset contains 19 fields including post title, body, tags, dates and other metadata which we don’t quite need for this tutorial. What we are mostly interested in for this tutorial, is the body and title which will become our source of text for keyword extraction. We will now create a field that combines both body and title so we have it in one field. We will also print the second text entry in our new field just to see what the text looks like.

The text above is essentially a combination of the title and body of a stack overflow post. Hmmm, this doesn’t look very readable, does it? Well, that’s because we are cleaning the text after we concatenated the two fields (line 18). All of the cleaning happens in pre_process(..). You can do a lot more in pre_process(..), such as eliminating all code sections, normalizing words to their roots, etc., but for simplicity we perform only some mild pre-processing.
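A mild pre_process(..) along these lines might look like the sketch below. The exact cleaning rules (lowercasing, dropping HTML-like tags, collapsing digits and special characters) are my assumption of what "mild" means here, and the sample title and body are made up.

```python
import re

def pre_process(text):
    """Mild cleaning: lowercase, drop HTML-like tags, remove digits and special characters."""
    text = text.lower()
    text = re.sub(r"</?.*?>", " ", text)     # strip anything that looks like a tag
    text = re.sub(r"(\d|\W)+", " ", text)    # collapse digits and non-word characters
    return text.strip()

# Made-up post, combining title and body into one text field.
title = "Integrate WAR plugin?"
body = "<p>I use Eclipse 3.5 with Maven.</p>"
text = pre_process(title + " " + body)

print(text)
```

The result is a single lowercased string with the markup and punctuation gone, which is why the cleaned text is not very pleasant to read.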

Creating Vocabulary and Word Counts for IDF

We now need to create the vocabulary and start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our df_idf['text'] followed by the counts of words in the vocabulary (see: usage examples for CountVectorizer).

While cv.fit(...) would only create the vocabulary, cv.fit_transform(...) creates the vocabulary and returns a document-term matrix, which is what we want. With this, each column in the matrix represents a word in the vocabulary, while each row represents a document in our dataset, and the values are the word counts. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

Notice that in the code above, we are passing two parameters to CountVectorizer: max_df and stop_words. The first says to ignore all words that appear in more than 85% of the documents, since those may be unimportant. The latter is a custom stop word list. You can also use the stop words that are native to sklearn by setting stop_words='english', but I personally find this list to be quite limited. The stop word list used for this tutorial can be found here.

The resulting shape of word_count_vector is (20000,124901) since we have 20,000 documents in our dataset (the rows) and the vocabulary size is 124,901. In some text mining applications such as clustering and text classification we typically limit the size of the vocabulary. It’s really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer. For this tutorial let’s limit our vocabulary size to 10,000.
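Here is a small sketch of how these three settings behave on a toy corpus. The four documents and the three-word stop list are made up for illustration; the tutorial itself uses 20,000 posts and a much longer stop word file.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; "how" appears in every document.
docs = [
    "how to sort the list in python",
    "how to merge the dict in python",
    "how to read the file in java",
    "how to parse the json in java",
]

# Tiny illustrative stop list (the tutorial loads a much longer one from a file).
stop_words = ["the", "in", "to"]

cv = CountVectorizer(
    max_df=0.85,            # drop words found in more than 85% of the documents
    stop_words=stop_words,  # drop custom stop words
    max_features=10000,     # cap the vocabulary size
)
word_count_vector = cv.fit_transform(docs)

print(word_count_vector.shape)
print(sorted(cv.vocabulary_))
```

Here "how" is removed by max_df because it occurs in all four documents, while "the", "in" and "to" are removed by the stop list. The max_features cap has no effect on a vocabulary this small.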

Now, let’s look at 10 words from our vocabulary.


Sweet, these are mostly programming related.

TfidfTransformer to Compute Inverse Document Frequency (IDF)

It’s now time to compute the IDF values. In the code below, we are essentially taking the sparse matrix from CountVectorizer (word_count_vector) and using it to generate the IDF when we invoke tfidf_transformer.fit(...) (see: basic usage example of TfidfTransformer and TfidfVectorizer).

An extremely important point to note here is that the IDF should always be based on a large corpus that is representative of the texts you will be extracting keywords from. This is why we are using texts from 20,000 stack overflow posts to compute the IDF instead of just a handful. I’ve seen several articles on the Web that compute the IDF using a handful of documents. You will defeat the whole purpose of IDF weighting if it’s not based on a large corpus, as (a) your vocabulary becomes too small and (b) you have limited ability to observe the behavior of words that you do know about.

Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are now ready to compute TF-IDF and then extract top keywords from the TF-IDF vectors. In this example, we will extract top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to that of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over to extract the top-n keywords. In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) function essentially sorts the values in the vector while preserving the column index. Once you have the column index, it’s really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...), where we do feature_vals.append(feature_names[idx]).
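The two helpers can be sketched in plain Python as shown below. To keep the example self-contained, this version of sort_coo takes the column indices and scores as two plain sequences, which is an assumption on my part; with a SciPy sparse vector you would get them from the .col and .data attributes after calling tocoo(). The feature names and scores are made up.

```python
def sort_coo(col_indices, values):
    """Pair each column index with its score and sort by score, highest first."""
    return sorted(zip(col_indices, values), key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Map the top-n (column index, score) pairs back to their words."""
    results = {}
    for idx, score in sorted_items[:topn]:
        results[feature_names[idx]] = round(score, 3)
    return results

# Made-up vocabulary and tf-idf scores for one document.
feature_names = ["eclipse", "maven", "project", "tomcat"]
col_indices = [0, 1, 3]              # columns with non-zero scores
values = [0.5931, 0.4212, 0.6874]    # the corresponding tf-idf scores

sorted_items = sort_coo(col_indices, values)
keywords = extract_topn_from_vector(feature_names, sorted_items, topn=2)

for word, score in keywords.items():
    print(word, score)
```

Because the column index travels with each score through the sort, looking up the word at the end is just an index into feature_names.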

Example Results

In this section, you will see some of the stack overflow questions followed by the top-10 keywords generated using the code above. Note that these questions are from the stackoverflow-test.json data file.

Question about Eclipse Plugin integration

From the keywords above, the top keywords actually make sense: the question talks about eclipse, maven, integrate, war and tomcat, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can further fine-tune what shows up on top by adding more common words to your stop list, and you can even create your own stop list that is very specific to your domain.

Now let’s look at another example.

Question about SQL Import

Even with all the html tags, because of the pre-processing, we are able to extract some pretty nice keywords here. The last word appropriately would qualify as a stop word. You can keep running different examples to get ideas of how to fine-tune the results.

Voilà! Now you can extract important keywords from any type of text! To play around with this entire code, please head over to my repo to re-run the full example using my TF-IDF Jupyter Notebook.

Some tips and tricks

  1. You can easily save the resulting CountVectorizer and TfidfTransformer and load them back for use at a later time.
  2. Instead of using CountVectorizer followed by TfidfTransformer, you can directly use TfidfVectorizer by itself. This is equivalent to CountVectorizer followed by TfidfTransformer.
  3. In this example, we computed the tf-idf matrix for each document of interest and then extracted top terms from it. What you could also do is first apply tfidf_transformer.transform(cv.transform(docs_test)), which will generate a tf-idf matrix for all documents in docs_test in one go, and then iterate over the resulting vectors to extract top keywords. The first approach is useful if you have one document coming in at a time. The second approach is more suitable when you want keywords from a fairly large set of documents.

See Also: How to correctly use TFIDFTransformer and TFIDFVectorizer?


How to incorporate phrases into Word2Vec – a text mining approach

Training a Word2Vec model with phrases is very similar to training a Word2Vec model with single words. The difference: you would need to add a layer of intelligence in processing your text data to pre-discover phrases. In this tutorial, you will learn how to create embeddings with phrases without explicitly specifying the number of words that should make up a phrase (i.e. the n-gram size). This means that you could have phrases with 2 words, 3 words and in some rare cases even 4 or 5.

At a high level, the steps would include:

  • Step 1:  Discovering common phrases in your corpora
  • Step 2: Tagging your corpora with phrases
  • Step 3: Training a Word2Vec model with the newly found phrases

Step 1: Discovering common phrases in your corpora

The first step towards generating embeddings for phrases is recognizing groups of words that make up a phrase. There are many ways to recognize phrases. One way is to use a linguistically heavy approach called “chunking” to detect phrases. NLTK, for example, has a chunking capability that you could use.

For this task, I will show you how you can use a text data mining approach with Spark, where you leverage the volume and evidence from your corpora for phrase detection. I like this approach because it’s lightweight, speedy and scales to the amount of data that you need to process.

So here’s how it works. At a high level, the entire corpus of text is segmented using a set of delimiter tokens. These can be special characters, stop words and other terms that can indicate a phrase boundary. I specifically used some special characters and a very basic set of English stop words.

Stop words are excellent for splitting text into a set of phrases as they usually consist of connector and filler words used to connect ideas, details, or clauses together in order to make one clear, detailed sentence. You can get creative and use a more complete stop word list or you can even over-simplify this list to make it a minimal stop word list.

The code below shows you how you can use both special characters and stop words to break text into a set of candidate phrases. Check the phrase-at-scale repo for the full source code.

In the code above, we are first splitting text into coarse-grained units using some special characters like comma, period and semi-colon. This is then followed by more fine-grained boundary detection using stop words. When you repeat this process for all documents or sentences in your corpus, you will end up with a huge set of phrases. You can then surface the top phrases using frequency counts and other measures such as Pointwise Mutual Information, which can measure the strength of association between the words in your phrase. For the phrase embedding task, we naturally have to use lots and lots of data, so frequency counts alone would suffice for this task. In some other tasks, I have combined frequency counts with Pointwise Mutual Information to get a better measure of phrase quality.
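As a rough single-machine illustration of this idea (not the code from the phrase-at-scale repo), the sketch below splits text on a few special characters, then on stop words, and counts the resulting candidates. The stop word set, the boundary characters, the sample reviews and the choice to keep only multi-word candidates are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop word set; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "but", "the", "of", "with", "was", "were",
              "is", "we", "had", "i", "to", "in", "it"}

# Special characters used as coarse-grained boundaries.
BOUNDARY_CHARS = r"[.,;:!?()]"

def candidate_phrases(text):
    """Split on special characters, then on stop words; keep multi-word runs."""
    phrases = []
    for chunk in re.split(BOUNDARY_CHARS, text.lower()):
        current = []
        for token in chunk.split():
            if token in STOP_WORDS:
                if len(current) > 1:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if len(current) > 1:
            phrases.append(" ".join(current))
    return phrases

# Made-up reviews for illustration.
reviews = [
    "We had fried chicken and onion rings.",
    "The fried chicken was great, but the onion rings were cold.",
]

counts = Counter(p for review in reviews for p in candidate_phrases(review))

# Keep only the candidates that meet a minimum frequency.
min_freq = 2
top_phrases = {phrase for phrase, count in counts.items() if count >= min_freq}
print(sorted(top_phrases))
```

On a real corpus the counting step is where a framework like Spark earns its keep, since the candidate generation for each document is independent of all the others.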

To ensure scalability, I really like using Spark since you can leverage its built-in multi-threading capability on a single machine or use multiple machines to get more CPU power if you really have massive amounts of data to process. The code below shows you the PySpark method that reads your text files, cleans them up, generates candidate phrases, counts the frequency of the phrases and filters them down to a set of phrases that satisfy a minimum frequency count. On a 450 MB dataset, run locally, this takes about a minute to discover top phrases and 7 minutes to annotate the entire text corpus with phrases. You can follow the instructions in the phrase-at-scale repo to use this PySpark code to discover phrases for your data.

Here is a tiny snapshot of phrases found using the code above on a restaurant review dataset.

Step 2: Tagging your corpora with phrases

There are two ways you can mark certain words as phrases in your corpora. One approach is to pre-annotate your entire corpora and generate a new “annotated corpora”. The other way is to annotate your sentences or documents during the pre-processing phase prior to learning the embeddings. It’s much cleaner to have a separate layer for annotation which does not interfere with the training phase. Otherwise, it will be harder to gauge if your model is slow due to training or annotation.

In annotating your corpora, all you need to do is to somehow join the words that make up a phrase. For this task, I just use an underscore to join the individual words. So, “…ate fried chicken and onion rings…” would become “…ate fried_chicken and onion_rings…”
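A minimal way to do this joining in plain Python is sketched below. The annotate function is a hypothetical helper, and handling longer phrases first is a design choice of this example so that a phrase like "pineapple fried rice" is not broken up by a shorter phrase it contains.

```python
import re

def annotate(text, phrases):
    """Join the words of each known phrase with underscores."""
    # Longest phrases first, so longer matches win over shorter ones they contain.
    for phrase in sorted(phrases, key=len, reverse=True):
        joined = phrase.replace(" ", "_")
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, lambda match: joined, text)
    return text

phrases = ["fried chicken", "onion rings", "fried rice", "pineapple fried rice"]

print(annotate("ate fried chicken and onion rings", phrases))
print(annotate("the pineapple fried rice was great", phrases))
```

For a large corpus and a large phrase list you would want something faster than one regular expression pass per phrase, but the output format is the same.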

Step 3: Training a Phrase2Vec model using Word2Vec

Once you have phrases explicitly tagged in your corpora, the training phase is quite similar to training any Word2Vec model with Gensim or any other library. You can follow my Word2Vec Gensim Tutorial for a full example on how to train and use Word2Vec.

Example Usage of Phrase Embeddings

The examples below show you the power of phrase embeddings when used to find similar concepts.  These are concepts from the restaurant domain, trained on 450 MB worth of restaurant reviews using Gensim.

Similar and related unigrams, bigrams and trigrams

Notice below that we are able to capture highly related concepts that are unigrams, bigrams and higher order n-grams.

Most similar to 'green_curry':
('panang_curry', 0.8900948762893677)
('yellow_curry', 0.884008526802063)
('panang', 0.8525004386901855)
('drunken_noodles', 0.850254237651825)
('basil_chicken', 0.8400430679321289)
('coconut_soup', 0.8296557664871216)
('massaman_curry', 0.827597975730896)
('pineapple_fried_rice', 0.8266736268997192)

Most similar to 'singapore_noodles':
('shrimp_fried_rice', 0.7932361960411072)
('drunken_noodles', 0.7914629578590393)
('house_fried_rice', 0.7901676297187805)
('mongolian_beef', 0.7796567678451538)
('crab_rangoons', 0.773795485496521)
('basil_chicken', 0.7726351022720337)
('crispy_beef', 0.7671589255332947)
('steamed_dumplings', 0.7614079117774963)

Most similar to 'chicken_tikka_masala':
('korma', 0.8702514171600342)
('butter_chicken', 0.8668922781944275)
('tikka_masala', 0.8444720506668091)
('garlic_naan', 0.8395442962646484)
('lamb_vindaloo', 0.8390569686889648)
('palak_paneer', 0.826908528804779)
('chicken_biryani', 0.8210495114326477)
('saag_paneer', 0.8197864294052124)

Most similar to 'breakfast_burrito':
('huevos_rancheros', 0.8463341593742371)
('huevos', 0.789624035358429)
('chilaquiles', 0.7711247801780701)
('breakfast_sandwich', 0.7659544944763184)
('rancheros', 0.7541004419326782)
('omelet', 0.7512155175209045)
('scramble', 0.7490915060043335)
('omlet', 0.747859001159668)

Most similar to 'little_salty':
('little_bland', 0.745500385761261)
('little_spicy', 0.7443351149559021)
('little_oily', 0.7373550534248352)
('little_overcooked', 0.7355216145515442)
('kinda_bland', 0.7207454442977905)
('slightly_overcooked', 0.712611973285675)
('little_greasy', 0.6943882703781128)
('cooked_nicely', 0.6860566139221191)

Most similar to 'celiac_disease':
('celiac', 0.8376057744026184)
('intolerance', 0.7442486882209778)
('gluten_allergy', 0.7399739027023315)
('celiacs', 0.7183824181556702)
('intolerant', 0.6730632781982422)
('gluten_free', 0.6726624965667725)
('food_allergies', 0.6587174534797668)
('gluten', 0.6406026482582092)

Similar concepts expressed differently

Here you will see that similar concepts that are expressed differently can also be captured.

Most similar to 'reasonably_priced':
('fairly_priced', 0.8588327169418335)
('affordable', 0.7922118306159973)
('inexpensive', 0.7702735066413879)
('decently_priced', 0.7376087307929993)
('reasonable_priced', 0.7328246831893921)
('priced_reasonably', 0.6946456432342529)
('priced_right', 0.6871092915534973)
('moderately_priced', 0.6844340562820435)

Most similar to 'highly_recommend':
('definitely_recommend', 0.9155156016349792)
('strongly_recommend', 0.86533123254776)
('absolutely_recommend', 0.8545517325401306)
('totally_recommend', 0.8534528017044067)
('recommend', 0.8257364630699158)
('certainly_recommend', 0.785507082939148)
('highly_reccomend', 0.7751532196998596)
('highly_recommended', 0.7553941607475281)


In summary, to generate embeddings of phrases, you would need to add a layer for phrase discovery before training a Word2Vec model. If you have lots of data, a text data mining approach has the benefit of being lightweight and scalable, without compromising on quality. In addition, you wouldn’t have to specify a phrase size in advance or be limited by a specific vocabulary. A linguistically heavy approach gives you a lot more specificity in terms of parts of speech and the types of phrases (e.g. noun phrase vs. verb phrase) that you are dealing with. If you really need that information, then you can consider a chunking approach over a text mining approach.




Gensim Word2Vec Tutorial – Full Working Example

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are”.

If you have two words that have very similar neighbors (meaning: the context in which it’s used is about the same), then these words are probably quite similar in meaning or are at least related. For example, the words shocked, appalled and astonished are usually used in a similar context.

“The meaning of a word can be inferred by the company it keeps”

Using this underlying assumption, you can use Word2Vec to:

  • Surface similar concepts
  • Find unrelated concepts
  • Compute similarity between two words and more!

Down to business

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec (in Python) and actually get it to work! I’ve long heard complaints about poor performance, but it really comes down to a combination of two things: (1) your input data and (2) your parameter settings. Check out the Jupyter Notebook if you want direct access to the working example, or read on to get more context.

Side note: The training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

Imports and logging

First, we start with our imports and get logging established:


Next, is finding a really good dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data in the relevant domain. For example, if your goal is to build a sentiment lexicon, then using a dataset from the medical domain or even wikipedia may not be effective. So, choose your dataset wisely. As Matei Zaharia says,

It’s your data, stupid

That was said in the context of data quality, but it’s not just quality; it’s also about using the right data for the task.

For this tutorial, I am going to use data from the OpinRank dataset from some of my Ph.D work. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97 MB compressed and 229 MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

Now, let’s take a closer look at this data below by printing the first line.

You should see the following:

You can see that this is a pretty good full review with many words and that’s what we want. We have approximately 255,000 such reviews in this dataset.

To avoid confusion: the Gensim Word2Vec tutorial says that you need to pass a list of tokenized sentences as the input to Word2Vec. However, if you have a lot of data, you can actually pass in a whole review as a sentence (i.e. a much larger piece of text), and it should not make much of a difference. In the end, all we are using the dataset for is to get all neighboring words (the context) for a given target word.

Read files into a list

Now that we’ve had a sneak peek of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below that I am directly reading the compressed file. I’m also doing some mild pre-processing of the reviews using gensim.utils.simple_preprocess(line). This does some basic pre-processing such as tokenization, lowercasing, etc. and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.

Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass in the reviews that we read in the previous step. So, we are essentially passing in a list of lists, where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

The step above builds the vocabulary and starts training the Word2Vec model. We will get to what these parameters actually mean later in this article. Behind the scenes, what’s happening here is that we are training a neural network with a single hidden layer, where we train the model to predict the current word based on the context (using the default neural architecture). However, we are not going to use the neural network after training! Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. The resulting learned vectors are also known as embeddings. You can think of these embeddings as features that describe the target word. For example, the word king may be described by the gender, age, the type of people the king associates with, etc.

Training Word2Vec on the OpinRank dataset takes several minutes, so sip a cup of tea and wait patiently.

Some results!

Let’s get to the fun stuff already! Since we trained on user reviews, it would be nice to see similarity on some adjectives. This first example shows a simple look up of words similar to the word ‘dirty’. All we need to do here is to call the most_similar function and provide the word ‘dirty’ as the positive example. This returns the top 10 similar words.

Gensim most similar to the word "dirty"

Ooh, that looks pretty good. Let’s look at more.

Similar to polite:

Similar to france:

Similar to shocked:

Overall, the results actually make sense. All of the related words tend to be used in similar contexts.

You can also use Word2Vec to compute the similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.

Under the hood, the above three snippets compute the cosine similarity between the two specified words using the word vectors (embeddings) of each. From the scores above, it makes sense that dirty is highly similar to smelly but dirty is dissimilar to clean. If you compute the similarity between two identical words, the score will be 1.0, which is the maximum: cosine similarity ranges from -1 to 1, although it is sometimes bounded to [0, 1] depending on how it’s being computed. You can read more about cosine similarity scoring here.
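To make the scoring concrete, here is a small plain-Python sketch of cosine similarity. The three vectors are made-up 3-dimensional stand-ins for embeddings, hand-picked for illustration (real Word2Vec vectors have many more dimensions), and the helper name cosine_similarity is my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, not values learned by a model
dirty = [0.9, 0.1, 0.0]
smelly = [0.8, 0.2, 0.1]
clean = [-0.7, 0.3, 0.2]

print(cosine_similarity(dirty, dirty))   # identical vectors
print(cosine_similarity(dirty, smelly))  # vectors pointing in a similar direction
print(cosine_similarity(dirty, clean))   # vectors pointing in opposing directions
```

Because only the angle between the vectors matters, two identical vectors always score 1.0 (up to floating-point rounding), and vectors pointing in opposing directions score below zero.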

You will find more examples of how you could use Word2Vec in my Jupyter Notebook.

A closer look at the parameter settings

To train the model earlier, we had to set some parameters. Now, let’s try to understand what some of them mean. For reference, this is the command that we used to train the model.


size

The size of the dense vector that represents each token or word. If you have limited data, then size should be a much smaller value, since you would only have so many unique neighbors for a given word. If you have lots of data, it’s good to experiment with various sizes. A value of 100–150 has worked well for me for similarity lookups.


window

The maximum distance between the target word and its neighboring word. If a neighbor lies further than the window width to the left or the right of the target word, it is not considered to be related to the target word. In theory, a smaller window should give you terms that are more related. Again, if your data is not sparse, then the window size should not matter too much, as long as it’s not overly narrow or overly broad. If you are not too sure about this, just use the default value.


min_count

Minimum frequency count of words. The model ignores words that occur fewer than min_count times. Extremely infrequent words are usually unimportant, so it’s best to get rid of those. Unless your dataset is really tiny, this does not really affect the model in terms of your final results. The setting here probably has more of an effect on the memory usage and storage requirements of the model files.


workers

The number of threads to use for training behind the scenes.


iter

Number of iterations (epochs) over the corpus. 5 is a good starting point, though I always use a minimum of 10 iterations.

When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine that you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that, and you end up with a lexicon not just for sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find related tags and recommend those for exploration. You can do this by treating each set of co-occurring tags as a “sentence” and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.

See Also: How to use pre-trained embeddings with Gensim?



How to read CSV & JSON files in Spark – word count example

One of the really nice things about Spark is the ability to read input files of different formats right out of the box. Though this is a nice-to-have feature, reading files in Spark is not always consistent and seems to keep changing with different Spark releases. This article will show you how to read CSV and JSON files to compute word counts on selected fields. This example assumes that you are using Spark 2.0+ with Python 3.0 and above. Full working code can be found in this repository.

Data files

To illustrate by example, let’s make some assumptions about the data files. Let’s assume that we have data files containing a title field and a corresponding text field. The toy example format in JSON is as follows:

And, the format in CSV is as follows:

Assume that we want to compute word counts based on the text field.
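If you want to follow along without the original data, here is a minimal sketch that writes a pair of toy files with these two fields using only the Python standard library. The records, file names and temporary directory are all hypothetical. I write the JSON file with one object per line, since that is the layout Spark’s JSON reader expects by default.

```python
import csv
import json
import os
import tempfile

# Hypothetical toy records with the two assumed fields
records = [
    {"title": "doc one", "text": "spark makes word counts easy"},
    {"title": "doc two", "text": "word counts with spark"},
]

data_dir = tempfile.mkdtemp()
json_path = os.path.join(data_dir, "toy.json")
csv_path = os.path.join(data_dir, "toy.csv")

# JSON file: one JSON object per line
with open(json_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# CSV file: a header row followed by one row per record
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "text"])
    writer.writeheader()
    writer.writerows(records)

with open(json_path, encoding="utf-8") as f:
    print(f.read())
with open(csv_path, encoding="utf-8") as f:
    print(f.read())
```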

Reading JSON File

Reading the JSON file is actually pretty straightforward. First, you create an SQLContext from the Spark context. This gives you the capability of querying the JSON file with regular SQL-style syntax.

In this next step, you use the sqlContext to read the JSON file and select only the text field. Remember that we have two fields, title and text, and in this case we are only going to process the text field. This step returns a Spark data frame where each entry is a Row object. In order to access the text field in each row, you would have to use row.text. Note that the select here is conceptually the same as in traditional SQL, where you would do: select text from .....

To view what you have just read, you can use df.show()

You should see something like this:

SQL Query to Read JSON file

Note that you can achieve the same results by issuing an actual SQL query on the dataset. For this, you first register the dataset as a view, and then you issue the query. This returns the same DataFrame as above.

Reading CSV File

Reading the CSV file is similar to reading JSON, with a small twist: you use sqlContext.read.load(...) and provide a format to it as below. Note that this method of reading works for different file types, including JSON, Parquet and CSV, and probably others as well.

Since the CSV data file in this example has a header row, that row can be used to infer the column names of the schema, hence header='true' as seen above. In this example, we are again selecting only the text field. This method of reading a file also returns a data frame identical to the one in the previous example on reading a JSON file.

Generating Word Counts

Now that we know that reading the CSV file or the JSON file returns identical data frames, we can use a single method to compute the word counts on the text field. The idea here is to break the text of each row entry in the data frame into tokens, and return a count of 1 for each token. This function returns a list of lists where each internal list contains just the word and a count of 1 ([w, 1]). The tokenized words serve as the key and the corresponding count is the value. Then, when you reduce by key, you can add up all counts on a per word (key) basis to get the total count for each word. Note that add here is a Python function from the operator module.

As you can see below, accessing the text field is pretty simple if you are dealing with data frames.

And voilà, now you know how to read files with PySpark and use them for some basic processing! For the full source code, please see the links below.

Source Code