Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn

In this era of use Deep Learning for everything, one may be wondering why you would even use TF-IDF for any task at all ?!! The truth is TF-IDF is easy to understand, easy to compute and is one of the most versatile statistic that shows the relative importance of a word or phrase in a document or a set of documents in comparison to the rest of your corpus.

Keywords are descriptive words or phrases that characterize your documents. For example, keywords from this article would be tf-idf,   scikit-learn, keyword extraction, extract and so on. These keywords are also referred to as topics in some applications.

TF-IDF can be used for a wide range of tasks including text classification, clustering / topic-modeling, search, keyword extraction and a whole lot more.

In this article, you will learn how to use TF-IDF from the scikit-learn package to extract keywords from documents.

Let’s Get Started…

I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is. For a more academic explanation I would recommend my Ph.D advisor’s explanation. If you just need access to my Jupyter Notebook with full code samples, please head over to my repo, otherwise please read on.


In this keyword extraction tutorial, we’ll be using a stack overflow dataset which is a bit noisy and simulates what you could be dealing with in real life. You will find this dataset in my tutorial repo. Notice that there are two files in this repo, the larger file, stackoverflow-data-idf.json has 20,000 posts and is used to compute the Inverse Document Frequency (IDF) and the smaller file, stackoverflow-test.json has 500 posts and we would use that as a test set for us to extract keywords from. This dataset is based on the publicly available stack overflow dump from Google’s Big Query.

The first thing we’ll do is to take a peek at our dataset. The code below reads a one per line json string from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate json string.


accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)

Notice that this stack overflow dataset contains 19 fields including post title, body, tags, dates and other metadata which we don’t quite need for this tutorial. What we are mostly interested in for this tutorial, is the body and title which will become our source of text for keyword extraction. We will now create a field that combines both body and title so we have it in one field. We will also print the second text entry in our new field just to see what the text looks like.

The text above is essentially a combination of the title and body of a stack overflow post. Hmmm, this doesn’t look very readable, does it? Well, that’s because we are cleaning the text after we concatenated the two fields (line 18). All of the cleaning happens in pre_process(..). You can do a lot more stuff in pre_process(..), such as eliminate all code sections, normalize the words to its root, etc, but for simplicity we perform only some mild pre-processing.

Creating Vocabulary and Word Counts for IDF

We now need to create the vocabulary and start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our df_idf['text'] followed by the counts of words in the vocabulary (see: usage examples for CountVectorizer).

While would only create the vocabulary, cv.fit_transform(...) creates the vocabulary and returns a term-document matrix which is what we want. With this, each column in the matrix represents a word in the vocabulary while each row represents the document in our dataset where the values in this case are the word counts. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

Notice that in the code above, we are passing two parameters to CountVectorizer, max_df and stop_words. The first is just to say ignore all words that have appeared in 85% of the documents, since those may be unimportant. The later, is a custom stop words list. You can also use stop words that are native to sklearn by setting stop_words='english', but I personally find this to be quite limited. The stop word list used for this tutorial can be found here.

The resulting shape of word_count_vector is (20000,124901) since we have 20,000 documents in our dataset (the rows) and the vocabulary size is 124,901. In some text mining applications such as clustering and text classification we typically limit the size of the vocabulary. It’s really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer. For this tutorial let’s limit our vocabulary size to 10,000.

Now, let’s look at 10 words from our vocabulary.


Sweet, these are mostly programming related.

TfidfTransformer to Compute Inverse Document Frequency (IDF)

It’s now time to compute the IDF values. In the code below, we are essentially taking the sparse matrix from CountVectorizer (word_count_vector) to generate the IDF when you invoke basic usage example of tfidftransformer and tfidfvectorizer)

An extremely important point to note here is that the IDF should always be based on a large corpora and should be representative of texts you would be using to extract keywords. This is why we are using texts from 20,000 stack overflow posts to compute the IDF instead of just a handful. I’ve seen several articles on the Web that compute the IDF using a handful of documents. You will defeat the whole purpose of IDF weighting if its not based on a large corpora as (a) your vocabulary becomes too small and (b) you have limited ability to observe the behavior of words that you do know about.

Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are now ready to compute TF-IDF and then extract top keywords from the TF-IDF vectors. In this example, we will extract top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to that of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over to extract the top-n keywords. In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index then its really easy to look-up the corresponding word value as you would see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).

Example Results

In this section, you will see some of the stack overflow questions followed by the top-10 keywords generated using the code above. Note that these questions are from the stackoverflow-test.json data file.

Question about Eclipse Plugin integration

From the keywords above, the top keywords actually make sense, it talks about eclipse, maven, integrate, war and tomcat which are all unique to this specific question. There are a couple of keywords that could have been eliminated such as possibility and perhaps even project and you can further fine-tune what shows up on top by adding more common words to your stop list and you can even create your own set of stop list, very specific to your domain.

Now let’s look at another example.

Question about SQL Import

Even with all the html tags, because of the pre-processing, we are able to extract some pretty nice keywords here. The last word appropriately would qualify as a stop word. You can keep running different examples to get ideas of how to fine-tune the results.

Whoala! Now you can extract important keywords from any type of text!  To play around with this entire code, please head over to my repo to re-run the full example using my TF-IDF Jupyter Notebook.

Some tips and tricks

  1. You can easily save the resulting CountVectorizer and TfidfTransformer and load them back for use at a later time.
  2. Instead of using CountVectorizer followed by TfidfTransformer, you can directly use TfidfVectorizer by itself. This is equivalent to CountVectorizer followed by TfidfTransformer.
  3. In this example, we computed the tf-idf matrix for each document of interest and then extracted top terms from it. What you could also do is first applytfidf_transformer.transform(docs_test) which will generate a tf-idf matrix for all documents in docs_test at one go and then iterate over the resulting vectors to extract top keywords. The first approach is useful if you have one document coming in at a time. The second approach is more suitable when you want keywords from a fairly large set of documents.

See Also: How to correctly use TFIDFTransformer and TFIDFVectorizer?


Tips for Constructing Custom Stop Word Lists

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.

While it is fairly easy to use a published set of stop words, in many cases, using such stop words is completely insufficient for certain applications. For example, in clinical texts, terms like “mcg” “dr.” and “patient” occur almost in every document that you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like “#” “RT”, “@username” can be potentially regarded as stop words. The common language specific stop word list generally DOES NOT cover such domain specific terms.

The good news is that it is actually fairly easy to construct your own domain specific stop word list. Here are a few ways of doing it assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to figure out your stop words:

1. Most frequent terms as stop words

Sum the term frequencies of each unique word, w across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a publish stop list) prior to sorting so that you are sure that you target the domain specific stop words. Another option is to treat words occurring in more X% of your documents as stop words. I have personally found eliminating words that appear in 85% of documents to be effective in several text mining tasks.  The benefit of this approach is that it is really easy implement, the downside however is if you have a particularly long document, the raw term frequency from just a few documents can dominate and cause the term to be at the top. One way to resolve this is to normalize the raw term frequency using a normalizer such as the document length (i.e. number of words in a given document).

2. Least frequent terms as stop words

Just as terms that are extremely frequent could be distracting terms rather than discriminating terms, terms that are extremely infrequent may also not be useful for text mining and retrieval. For example the username “@username” that occurs only once in a collection of tweets, may not be very useful. Other terms like “yoMateZ!” which could be just made-up terms by people again may not be useful for text mining applications. Note that certain terms like “yaaaaayy!!” can often be normalized to standard forms such as “yay”. However, despite all the normalization if terms still have a term frequency count of one you could remove it. This could significantly reduce your overall feature space.

3. Low IDF terms as stop words

Inverse document frequency (IDF) basically refers to the inverse fraction of documents in your collection that contains a specific term ti. Let us say you have N documents. And term ti occurred in M of the N documents. The IDF of ti is thus computed as:

IDF(ti)=Log N/M

So the more documents ti appears in, the lower the IDF score. This means terms that appear in each and every document will have an IDF score of 0. If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores to be your stop words. Again, you can also eliminate common English words (using a published stop list) prior to sorting so that you are sure that you target the domain specific low IDF words. This is not necessary really if your K is large enough such that it will prune both general stop words as well as domain specific stop words. You will find more information about IDFs here.

So, would stop words help my task?

So how would you know if removing domain specific stop words would be helpful in your case? Easy, test it on a subset of your data. See if whatever measure of accuracy and performance improves, stays constant or degrades. If it degrades, needless to say, don’t do it unless the degradation is negligible and you see gains in other forms such as decrease in size of model, ability to process things in memory, and etc.

How to incorporate phrases into Word2Vec – a text mining approach

Training a Word2Vec model with phrases is very similar to training a Word2Vec model with single words. The difference: you would need to add a layer of intelligence in processing your text data to pre-discover phrases. In this tutorial, you will learn how to create embeddings with phrases without explicitly specifying the number of words that should make-up a phrase (i.e. the n-gram size). This means that you could have phrases with 2 words, 3 words and in some rare cases even 4 or 5.

At a high level, the steps would include:

  • Step 1:  Discovering common phrases in your corpora
  • Step 2: Tagging your corpora with phrases
  • Step 3: Training a Word2Vec model with the newly found phrases

Step 1: Discovering common phrases in your corpora

The first step towards generating embeddings for phrases is recognizing groups of words that make up a phrase. There are many ways to recognize phrases. One way is to use a linguistic heavy approach called “chunking” to detect phrases. NLTK for example, has a chunk capability that you could use.

For this task, I will show you how you can use a text data mining approach with Spark, where you leverage the volume and evidence from your corpora for phrase detection. I like this approach because it’s lightweight, speedy and scales to the amount of data that you need to process.

So here’s how it works. At a high level, the entire corpora of text is segmented using a set of delimiter tokens. This can be special characters, stop words and other terms that can indicate phrase boundary. I specifically used some special characters and a very basic set of English stop words.

Stop words are excellent for splitting text into a set of phrases as they usually consist of connector and filler words used to connect ideas, details, or clauses together in order to make one clear, detailed sentence. You can get creative and use a more complete stop word list or you can even over-simplify this list to make it a minimal stop word list.

The code below shows you how you can use both special characters and stop words to break text into a set of candidate phrases. Check the phrase-at-scale repo for the full source code.

In the code above, we are first splitting text into coarse-grained units using some special characters like comma, period and semi-colon. This is then followed by more fine-grained boundary detection using stop words. When you repeat this process for all documents or sentences in your corpora, you will end up with a huge set of phrases. You can then surface the top phrases using frequency counts and other measures such as Pointwise Mutual Information which can measure strength of association between words in your phrase. For the phrase embedding task, we naturally have to use lots and lots of data, so frequency counts alone would suffice for this task. In some other tasks, I have combined frequency counts with Pointwise Mutual Information to get a better measure phrase quality.

To ensure scalability, I really like using Spark since you can leverage its built-in multi-threading capability on a single machine or use multiple machines to get more CPU power if you really have massive amounts of data to process. The code below shows you the PySpark method that reads your text files, cleans it up, generates candidate phrases, counts frequency of the phrases and filters it down to a set of phrases that satisfy a minimum frequency count. On a 450 MB dataset, run locally, this takes about a minute to discover top phrases and 7 minutes to annotate the entire text corpora with phrases. You can follow instructions in the phrase-at-scale repo to use this PySpark code to discover phrases for your data.

Here is a tiny snapshot of phrases found using the code above on a restaurant review dataset.

Step 2: Tagging your corpora with phrases

There are two ways you can mark certain words as phrases in your corpora. One approach is to pre-annotate your entire corpora and generate a new “annotated corpora”. The other way is to annotate your sentences or documents during the pre-processing phase prior to learning the embeddings. It’s much cleaner to have a separate layer for annotation which does not interfere with the training phase. Otherwise, it will be harder to gauge if your model is slow due to training or annotation.

In annotating your corpora, all you need to do is to somehow join the words that make-up a phrase. For this task, I just use an underscore to join the individual words. So, “…ate fried chicken and onion rings…” would become “…ate fried_chicken and onion_rings…”

Step 3: Training a Phrase2Vec model using Word2Vec

Once you have phrases explicitly tagged in your corpora the training phase is quite similar to any Word2Vec model with Gensim or any other library. You can follow my Word2Vec Gensim Tutorial for a full example on how to train and use Word2Vec.

Example Usage of Phrase Embeddings

The examples below show you the power of phrase embeddings when used to find similar concepts.  These are concepts from the restaurant domain, trained on 450 MB worth of restaurant reviews using Gensim.

Similar and related unigrams, bigrams and trigrams

Notice below that we are able to capture highly related concepts that are unigrams, bigrams and higher order n-grams.

Most similar to 'green_curry':
('panang_curry', 0.8900948762893677)
('yellow_curry', 0.884008526802063)
('panang', 0.8525004386901855)
('drunken_noodles', 0.850254237651825)
('basil_chicken', 0.8400430679321289)
('coconut_soup', 0.8296557664871216)
('massaman_curry', 0.827597975730896)
('pineapple_fried_rice', 0.8266736268997192)

Most similar to 'singapore_noodles':
('shrimp_fried_rice', 0.7932361960411072)
('drunken_noodles', 0.7914629578590393)
('house_fried_rice', 0.7901676297187805)
('mongolian_beef', 0.7796567678451538)
('crab_rangoons', 0.773795485496521)
('basil_chicken', 0.7726351022720337)
('crispy_beef', 0.7671589255332947)
('steamed_dumplings', 0.7614079117774963)

Most similar to 'chicken_tikka_masala':
('korma', 0.8702514171600342)
('butter_chicken', 0.8668922781944275)
('tikka_masala', 0.8444720506668091)
('garlic_naan', 0.8395442962646484)
('lamb_vindaloo', 0.8390569686889648)
('palak_paneer', 0.826908528804779)
('chicken_biryani', 0.8210495114326477)
('saag_paneer', 0.8197864294052124)

Most similar to 'breakfast_burrito':
('huevos_rancheros', 0.8463341593742371)
('huevos', 0.789624035358429)
('chilaquiles', 0.7711247801780701)
('breakfast_sandwich', 0.7659544944763184)
('rancheros', 0.7541004419326782)
('omelet', 0.7512155175209045)
('scramble', 0.7490915060043335)
('omlet', 0.747859001159668)

Most similar to 'little_salty':
('little_bland', 0.745500385761261)
('little_spicy', 0.7443351149559021)
('little_oily', 0.7373550534248352)
('little_overcooked', 0.7355216145515442)
('kinda_bland', 0.7207454442977905)
('slightly_overcooked', 0.712611973285675)
('little_greasy', 0.6943882703781128)
('cooked_nicely', 0.6860566139221191)

Most similar to 'celiac_disease':
('celiac', 0.8376057744026184)
('intolerance', 0.7442486882209778)
('gluten_allergy', 0.7399739027023315)
('celiacs', 0.7183824181556702)
('intolerant', 0.6730632781982422)
('gluten_free', 0.6726624965667725)
('food_allergies', 0.6587174534797668)
('gluten', 0.6406026482582092)

Similar concepts expressed differently

Here you will see that similar concepts that are expressed differently can also be captured.

Most similar to 'reasonably_priced':
('fairly_priced', 0.8588327169418335)
('affordable', 0.7922118306159973)
('inexpensive', 0.7702735066413879)
('decently_priced', 0.7376087307929993)
('reasonable_priced', 0.7328246831893921)
('priced_reasonably', 0.6946456432342529)
('priced_right', 0.6871092915534973)
('moderately_priced', 0.6844340562820435)

Most similar to 'highly_recommend':
('definitely_recommend', 0.9155156016349792)
('strongly_recommend', 0.86533123254776)
('absolutely_recommend', 0.8545517325401306)
('totally_recommend', 0.8534528017044067)
('recommend', 0.8257364630699158)
('certainly_recommend', 0.785507082939148)
('highly_reccomend', 0.7751532196998596)
('highly_recommended', 0.7553941607475281)


In summary, to generate embeddings of phrases, you would need to add a layer for phrase discovery before training a Word2Vec model. If you have lots of data, a text data mining approach has the benefit of being lightweight and scalable, without compromising on quality. In addition, you wouldn’t have to specify a phrase size in advance or be limited by a specific vocabulary. A linguistic heavy approach gives you a lot more specificity in terms of parts of speech and the types of phrases (e.g. noun phrase vs. verb phrase) that you are dealing with. If you really need that information, then you can consider a chunking approach over a text mining approach.


Here are some resources that might come handy to you:


Gensim Word2Vec Tutorial – Full Working Example

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell who you are”.

If you have two words that have very similar neighbors (meaning: the context in which it’s used is about the same), then these words are probably quite similar in meaning or are at least related. For example, the words shocked, appalled and astonished are usually used in a similar context.

“The meaning of a word can be inferred by the company it keeps”

Using this underlying assumption, you can use Word2Vec to:

  • Surface similar concepts
  • Find unrelated concepts
  • Compute similarity between two words and more!

Down to business

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec (in python) and actually get it to work! I‘ve long heard complaints about poor performance, but it really is a combination of two things: (1) your input data and (2) your parameter settings. Check out the Jupyter Notebook if you want direct access to the working example, or read on to get more context.

Side note: The training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

Imports and logging

First, we start with our imports and get logging established:


Next, is finding a really good dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data in the relevant domain. For example, if your goal is to build a sentiment lexicon, then using a dataset from the medical domain or even wikipedia may not be effective. So, choose your dataset wisely. As Matei Zaharia says,

It’s your data, stupid

That was said in the context of data quality, but it’s not just quality it’s also using the right data for the task.

For this tutorial, I am going to use data from the OpinRank dataset from some of my Ph.D work. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97 MB compressed and 229 MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

Now, let’s take a closer look at this data below by printing the first line.

You should see the following:

You can see that this is a pretty good full review with many words and that’s what we want. We have approximately 255,000 such reviews in this dataset.

To avoid confusion, the Gensim’s Word2Vec tutorial says that you need to pass a list of tokenized sentences as the input to Word2Vec. However, you can actually pass in a whole review as a sentence (i.e. a much larger size of text), if you have a lot of data and it should not make much of a difference. In the end, all we are using the dataset for is to get all neighboring words (the context) for a given target word.

Read files into a list

Now that we’ve had a sneak peak of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below, that I am directly reading the compressed file. I’m also doing a mild pre-processing of the reviews using gensim.utils.simple_preprocess (line). This does some basic pre-processing such as tokenization, lowercasing, etc. and returns back a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.

Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step. So, we are essentially passing on a list of lists. Where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

The step above, builds the vocabulary, and starts training the Word2Vec model. We will get to what these parameters actually mean later in this article. Behind the scenes, what’s happening here is that we are training a neural network with a single hidden layer where we train the model to predict the current word based on the context (using the default neural architecture). However, we are not going to use the neural network after training! Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. The resulting learned vector is also known as the embeddings. You can think of these embeddings as some features that describe the target word. For example, the word king may be described by the gender, age, the type of people the king associates with, etc.

Training on the Word2Vec OpinRank dataset takes several minutes so sip a cup of tea, and wait patiently.

Some results!

Let’s get to the fun stuff already! Since we trained on user reviews, it would be nice to see similarity on some adjectives. This first example shows a simple look up of words similar to the word ‘dirty’. All we need to do here is to call the most_similar function and provide the word ‘dirty’ as the positive example. This returns the top 10 similar words.

Gensim most similar to the word "dirty"

Ooh, that looks pretty good. Let’s look at more.

Similar to polite:

Similar to france:

Similar to shocked:

Overall, the results actually make sense. All of the related words tend to be used in similar contexts.

Now you could even use Word2Vec to compute similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.

Under the hood, the above three snippets compute the cosine similarity between the two specified words using word vectors (embeddings) of each. From the scores above, it makes sense that dirty is highly similar to smelly but dirty is dissimilar to clean. If you do a similarity between two identical words, the score will be 1.0 as the range of the cosine similarity can go from [-1 to 1] and sometimes bounded between [0,1] depending on how it’s being computed. You can read more about cosine similarity scoring here.

You will find more examples of how you could use Word2Vec in my Jupyter Notebook.

A closer look at the parameter settings

To train the model earlier, we had to set some parameters. Now, let’s try to understand what some of them mean. For reference, this is the command that we used to train the model.


The size of the dense vector to represent each token or word (i.e. the context or neighboring words). If you have limited data, then size should be a much smaller value since you would only have so many unique neighbors for a given word. If you have lots of data, it’s good to experiment with various sizes. A value of 100–150 has worked well for me for similarity lookups.


The maximum distance between the target word and its neighboring word. If your neighbor’s position is greater than the maximum window width to the left or the right, then, some neighbors would not be considered as being related to the target word. In theory, a smaller window should give you terms that are more related. Again, if your data is not sparse, then the window size should not matter too much, as long as it’s not overly narrow or overly broad. If you are not too sure about this, just use the default value.


Minimium frequency count of words. The model would ignore words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so its best to get rid of those. Unless your dataset is really tiny, this does not really affect the model in terms of your final results. The settings here probably has more of an effect on memory usage and storage requirements of the model files.


How many threads to use behind the scenes?


Number of iterations (epochs) over the corpus. 5 is a good starting point. I always use a minimum of 10 iterations.

When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million stackoverflow questions and answers, you could find related tags and recommend those for exploration. You can do this by treating each set of co-occuring tags as a “sentence” and train a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.

See Also: How to use pre-trained embeddings with Gensim?


Papers to read

These are some of the recommended readings:

How to read CSV & JSON files in Spark – word count example

One of the really nice things about spark is the ability to read input files of different formats right out of the box. Though this is a nice to have feature, reading files in spark is not always consistent and seems to keep changing with different spark releases. This article will show you how to read files in csv and json to compute word counts on selected fields. This example assumes that you would be using spark 2.0+ with python 3.0 and above. Full working code can be found in this repository.

Data files

To illustrate by example let’s make some assumptions about data files. Let’s assume that we have data files containing a title field and a corresponding text field. The toy example format in json is as follows:

And, the format in csv is as follows:

Assume that we want to compute word counts based on the textfield.

Reading JSON File

Reading the json file is actually pretty straightforward, first you create an SQLContext from the spark context. This gives you the capability of querying the json file in regular SQL type syntax.

In this next step, you use the sqlContext to read the json file and select only the text field. Remember that we have two fields, title and text and in this case we are only going to process the text field. This step returns a spark data frame where each entry is a Row object. In order to access the text field in each row, you would have to use row.text. Note that the select here is conceptually the same as traditional SQL where you would do: select text from .....

To view what you have just read, you can use

You should see something like this:

SQL Query to Read JSON file

Note that you can achieve the same results, by issuing an actual SQL query on the dataset. For this, you first register the dataset as a view, then you issue the query. This also returns the same DataFrame as above.

Reading CSV File

Reading the csv file is similar to json, with a small twist to it, you would use and provide a format to it as below. Note that this method of reading is also applicable to different file types including json, parquet and csv and probably others as well.

Since the csv data file in this example has a header row, this can be used to infer schema and thus header='true' as seen above. In this example, we are again selecting only the text field. This method of reading a file also returns a data frame identical to the previous example on reading a json file.

Generating Word Counts

Now that we know that reading the csv file or the json file returns identical data frames, we can use a single method to compute the word counts on the text field. The idea here is to break words into tokens for each row entry in the data frame, and return a count of 1 for each token (line 4). This function returns a list of lists where each internal list contains just the word and a count of 1 ([w, 1]). The tokenized words would serve as the key and the corresponding count would be the value. Then when you reduce by key, you can add up all counts on a per word (key) basis to get total counts for each word (see line 8). Note that add here is a python function from the operator module.

As you can see below, accessing the text field is pretty simple if you are dealing with data frames.

And whoala, now you know how to read files with pyspark and use it for some basic processing! For the full source code please see links below.

Source Code



What is text similarity?

When talking about text similarity, different people have a slightly different notion on what text similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity and the latter is referred to as lexical similarityAlthough the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity as it has the most use from a practical stand-point and then I briefly introduce semantic similarity.

Lexical or Word Level Similarity

For the most part, when referring to text similarity, people actually refer to how similar two pieces of text are at the surface level. For example, how similar are the phrases “the cat ate the mouse” with “the mouse ate the cat food” by just looking at the words?  On the surface, if you consider only word level similarity, these two phrases (with determiners disregarded) appear very similar as 3 of the 4 unique words are an exact overlap.


This notion of similarity is often referred to as lexical similarity. It typically does not take into account the actual meaning behind words or the entire phrase in context. While the actual meaning of the phrases is often disregarded this does not mean that computing similarity in this way is ineffective. You can actually come up with creative ways of expanding the scope of each word or a set of related words (lexical chains) to improve the similarity between two pieces of text being compared. For instance, if you are comparing similarity between phrases from newspaper articles. You could potentially use N non-determiner words to the left and to the right of the current word in the phrase for a simple scope expansion. Instead of doing a word for word comparison,  you are essentially providing more context. This is analogous to expanding a search query. Imagine if you were to compare the similarity between your search query, and all the documents on the Web so that you can get the best results matching your query. How would you expand your query? The same thought process can be applied in improving lexical level similarity measures.


Another point to note is that lexical similarity can be computed at various granularity. You can compute lexical similarity at the character level, word level (as shown earlier) or at a phrase  level (or lexical chain level) where you break a piece of text into a group of related words prior to computing similarity. Character level similarity is also known as string similarity/matching and is commonly used to determine how close two strings are. For example how close are the names ‘Kavita Ganesan’ and ‘Kavita A Ganesan’ ? Pretty close! You can use the common metrics outlined below for string similarity or you can use edit distance  type of measures to quantify how dissimilar two strings are. In essence, you are trying to compute the minimum number of operations required to transform one string into the other.

Common Metrics

Some of the most common metrics for computing similarity between two pieces of text are the Jaccard coefficient, Dice and Cosine similarity all of which have been around for a very long time. Jaccard and Dice are actually really simple as you are just dealing with sets. Here is how you can compute Jaccard:

Simply put, this is the intersection of the sets divided by the union of the sets. Your resulting value will be between [0,1], so you can set a threshold as needed. I have found that a threshold of 0.6 and above is pretty effective in detecting phrases that are similar (maximum length of phrase is 10 words). For longer texts this value could be smaller or if you only care for marginal overlap, again this value could be much smaller. The more you normalize, pre-process and filter your text (e.g. stem, remove noise, remove stop words), the better the outcome of your text similarity measure using simple measures such as Jaccard.

Where is lexical similarity used?

Clustering – if you want to group similar texts together how can you tell if two groups of text are even similar?

Redundancy removal – if two pieces of texts are so similar, why do you need both? You can always eliminate the redundant one. Think of duplicate product listings, or the same person in your database, with slight variation in the name or even html pages that are near duplicates.

Information Retrieval – you could use the more established information retrieval measures like BM25, PL2, etc. But you could also use a measure like cosine (for longer texts) or jaccard and dice for (shorter texts).

Semantic Similarity

So far, we have talked about lexical similarity. Another notion of similarity mostly explored by the NLP research community is how similar in meaning are any two phrases?  If we look at the phrases, “the cat ate the mouse” and “the mouse ate the cat food”, we know that while the words significantly overlap, these two phrases actually have different meaning. Getting the meaning out of the phrases is often a more difficult task as it requires deeper level of analysis. In this example, we can actually look at simple aspects like order of words: “cat==>ate==>mouse” and “mouse==>ate==>cat food”. Although the words overlap in this case, the order of occurrence is different and from that we can tell that these two phrases actually have different meaning. This is just one simple example. Most people use syntactic parsing to help with semantic similarity. Let’s look at the parse trees for these two phrases. What can you get from it?

You can get phrases out of the parse (e.g. “cat food”), dependency structure (e.g. mouse is the object of ate in the first case and food is the object of ate in the second case) as well as parts of speech (nouns, verbs, adjectives and etc.)  – all of which can be used in different ways to estimate semantic similarity. Semantic similarity is often used to address NLP tasks such as paraphrase identification and automatic question answering. To get a better understanding of semantic similarity and paraphrasing you can refer to some of the articles below.

Related Articles:

Leveraging large amounts of opinions for decision making

Opinion Driven Decision Support System (ODSS) refers to the use of large amounts of online opinions to facilitate business and consumer decision making. The idea is to combine the strengths of search technologies with opinion mining and analysis tools to provide a synergistic decision making platform. The research and engineering problems related to developing such a system include :
  1. opinion acquisition
  2. opinion based search
  3. opinion summarization
  4. presentation of results
Opinions in this case can be aggregation of user reviews, blog comments, facebook status updates, Tweets and so on. Essentially any opinion containing texts on specific topics or entities qualify as candidates for building an ODSS platform. Here’s a description of some of the research and engineering problems towards developing an ODSS platform:

1. Search Capabilities Based on Opinions

The goal of opinion-based search is to help users find entities of interest based on their key requirements. Since a user is often interested in choosing an entity based on opinions on that entity, a system that ranks entities based on a user’s personal preferences would provide a more direct support for a user’s decision-making task. For example, in the case of finding hotels at a destination, a user may only want to consider hotels where other people thought was clean. By finding and ranking hotels based on how well it satisfies such a requirement would significantly reduce the number of entities in consideration, facilitating decision making. Unlike traditional search, the query in this case is a set of preferences and the results is a set of entities that match these preferences. The challenge is to accurately match the user’s preferences with existing opinions in order to recommend the best entities. This special ranking problem is referred to as Opinion-Based Entity Ranking. Many of the existing opinion mining techniques can be potentially used for this new ranking task. I have explored information retrieval based techniques to specifically solve this ranking problem  and there has been a few follow-up works (from other groups) trying other approaches.

2. Opinion Summarization (i.e. Sentiment Analysis + Text Summarization)

Opinion summaries play a critical role in helping users better analyze entities in consideration (e.g. product, physician, cars, politican). Users are often looking out for major concerns or advantages in selecting a specific entity. Thus, a summary that can quickly highlight the key opinions about the entity would significantly help exploration of entities and aid decision making. The field of opinion summarization has been long explored with most techniques being focused on generating structured summaries on a fixed set of topics. These are referred to as stuctured summaries. In the last few years, textual summaries of opinions have been gaining more and more popularity. Bing Liu’s Opinion Mining Tutorial covers some of these recent works or you can refer to this article point (5).

3. Opinion Acquisition (i.e. Opinion or Sentiment Crawling)

To support accurate search and analysis based on opinions, opinionated content is imperative. Relying on opinions from just one specific source not only makes the information unreliable, but also incomplete due to variations in opinions as well as potential bias present in a specific source. Although many applications rely on large amounts of opinions, there has been very limited work on collecting and integrating a complete set of opinions. I recently explored a very simple method to collecting large amounts of opinions on arbitrary entities.

The idea of an Opinion Driven Decision Support (ODSS) was developed as part of my thesis. For more information on this please see Kavita’s thesis.