Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn

In this era of using Deep Learning for everything, one may wonder why you would even use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and one of the most versatile statistics for showing the relative importance of a word or phrase in a document, or a set of documents, compared to the rest of your corpus.

Keywords are descriptive words or phrases that characterize your documents. For example, keywords from this article would be tf-idf, scikit-learn, keyword extraction, extract and so on. These keywords are also referred to as topics in some applications.

TF-IDF can be used for a wide range of tasks including text classification, clustering / topic-modeling, search, keyword extraction and a whole lot more.

In this article, you will learn how to use TF-IDF from the scikit-learn package to extract keywords from documents.

Let’s Get Started…

I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is. For a more academic explanation I would recommend my Ph.D. advisor’s explanation. If you just need access to my Jupyter Notebook with full code samples, please head over to my repo; otherwise, please read on.

Dataset

In this keyword extraction tutorial, we’ll be using a Stack Overflow dataset which is a bit noisy and simulates what you could be dealing with in real life. You will find this dataset in my tutorial repo. Notice that there are two files in this repo: the larger file, stackoverflow-data-idf.json, has 20,000 posts and is used to compute the Inverse Document Frequency (IDF), while the smaller file, stackoverflow-test.json, has 500 posts that we will use as a test set to extract keywords from. This dataset is based on the publicly available Stack Overflow dump from Google’s BigQuery.

The first thing we’ll do is take a peek at our dataset. The code below reads the JSON strings, one per line, from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and the total number of posts. Here, lines=True simply means we treat each line in the text file as a separate JSON string.
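Here is a minimal sketch of that code (the path data/stackoverflow-data-idf.json assumes the repo layout described above):

import pandas as pd

# read Stack Overflow posts, one JSON record per line
df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

# inspect the schema and the (rows, columns) shape
print("Schema:\n\n", df_idf.dtypes)
print("Number of questions,columns=", df_idf.shape)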

Schema:

accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)

Notice that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates and other metadata which we don’t quite need for this tutorial. What we are mostly interested in for this tutorial is the body and title, which will become our source of text for keyword extraction. We will now create a field that combines both body and title so we have it in one place. We will also print the second text entry in our new field just to see what the text looks like.
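Here is a minimal sketch of that step; the body of pre_process(..) below (lowercasing, stripping HTML tags and non-letter characters) is one reasonable implementation, not necessarily the exact cleaning used in the notebook:

import re

def pre_process(text):
    # lowercase everything
    text = text.lower()
    # strip html tags
    text = re.sub(r"<.*?>", " ", text)
    # remove special characters and digits
    text = re.sub(r"[^a-z]+", " ", text)
    return text

# combine title and body into a single text field, then clean it
df_idf['text'] = df_idf['title'] + " " + df_idf['body']
df_idf['text'] = df_idf['text'].apply(pre_process)

# show the second entry of the new field
print(df_idf['text'][1])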


The resulting text is essentially a combination of the title and body of a Stack Overflow post. Hmmm, this doesn’t look very readable, does it? Well, that’s because we clean the text after concatenating the two fields. All of the cleaning happens in pre_process(..). You can do a lot more in pre_process(..), such as eliminating all code sections and normalizing the words to their roots, but for simplicity we perform only some mild pre-processing.

Creating Vocabulary and Word Counts for IDF

We now need to create the vocabulary and start the counting process. We can use CountVectorizer to create a vocabulary from all the text in df_idf['text'] and then count the occurrences of each word in the vocabulary.
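In code, this looks roughly as follows, assuming the custom stop word list (discussed just below) has already been read into a Python list called stopwords:

from sklearn.feature_extraction.text import CountVectorizer

docs = df_idf['text'].tolist()

# ignore terms that appear in more than 85% of the documents,
# and drop the custom stop words
cv = CountVectorizer(max_df=0.85, stop_words=stopwords)
word_count_vector = cv.fit_transform(docs)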

While cv.fit(...) would only create the vocabulary, cv.fit_transform(...) creates the vocabulary and returns a term-document matrix, which is what we want. Each column in the matrix represents a word in the vocabulary, while each row represents a document in our dataset; the values, in this case, are the word counts. Note that with this representation, the count of a word is 0 if it did not appear in the corresponding document.

Notice that in the code above, we are passing two parameters to CountVectorizer: max_df and stop_words. The first says to ignore all words that have appeared in 85% of the documents, since those may be unimportant. The latter is a custom stop words list. You can also use stop words that are native to sklearn by setting stop_words='english', but I personally find this list to be quite limited. The stop word list used for this tutorial can be found here.

The resulting shape of word_count_vector is (20000, 124901) since we have 20,000 documents in our dataset (the rows) and a vocabulary size of 124,901. In some text mining applications, such as clustering and text classification, we typically limit the size of the vocabulary. It’s really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer. For this tutorial, let’s limit our vocabulary size to 10,000.
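For example:

cv = CountVectorizer(max_df=0.85, stop_words=stopwords, max_features=10000)
word_count_vector = cv.fit_transform(docs)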

Now, let’s look at 10 words from our vocabulary.
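One way to do that is to peek at the keys of the fitted vocabulary:

# the first 10 words in the vocabulary (in insertion order, not by frequency)
list(cv.vocabulary_.keys())[:10]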

['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

Sweet, these are mostly programming related.

TfidfTransformer to Compute Inverse Document Frequency (IDF)

It’s now time to compute the IDF values. In the code below, we essentially take the sparse matrix from CountVectorizer (word_count_vector) and generate the IDF by invoking tfidf_transformer.fit(...) (see this basic usage example of TfidfTransformer and TfidfVectorizer).
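A minimal sketch (smooth_idf=True and use_idf=True are scikit-learn’s defaults, spelled out here for clarity):

from sklearn.feature_extraction.text import TfidfTransformer

# learn the idf weights from the word counts of all 20,000 posts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)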

An extremely important point to note here is that the IDF should always be based on a large corpus that is representative of the texts you would be using to extract keywords. This is why we are using texts from 20,000 Stack Overflow posts to compute the IDF instead of just a handful. I’ve seen several articles on the Web that compute the IDF using a handful of documents. This defeats the whole purpose of IDF weighting, as (a) your vocabulary becomes too small and (b) you have limited ability to observe the behavior of the words you do know about.

Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are ready to compute TF-IDF and then extract the top keywords from the TF-IDF vectors. In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to those of data/stackoverflow-data-idf.json, as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.
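A sketch of that step, reusing the pre_process(..) function from earlier:

# read the test posts and prepare their text the same way as before
df_test = pd.read_json("data/stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + " " + df_test['body']
df_test['text'] = df_test['text'].apply(pre_process)

docs_test = df_test['text'].tolist()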

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf value and then iterate over it to extract the top-n keywords. In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it’s really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...), where we do feature_vals.append(feature_names[idx]).
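Here is a sketch of both helpers plus the driver code, close in spirit to the notebook (on scikit-learn versions older than 1.0, use cv.get_feature_names() instead of cv.get_feature_names_out()):

def sort_coo(coo_matrix):
    # pair each column index with its score and sort by score, descending
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # keep only the top-n items
    sorted_items = sorted_items[:topn]
    feature_vals = []
    score_vals = []
    for idx, score in sorted_items:
        # look up the word behind each column index
        feature_vals.append(feature_names[idx])
        score_vals.append(round(score, 3))
    return dict(zip(feature_vals, score_vals))

# column index -> word mapping
feature_names = cv.get_feature_names_out()

# compute the tf-idf vector for the first document in the test set
doc = docs_test[0]
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# sort the scores and extract the top 10 keywords
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
print(keywords)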

Example Results

In this section, you will see some of the Stack Overflow questions followed by the top-10 keywords generated using the code above. Note that these questions are from the stackoverflow-test.json data file.

Question about Eclipse Plugin integration


The top keywords for this question actually make sense: it talks about eclipse, maven, integrate, war and tomcat, which are all unique to this specific question. A couple of keywords could have been eliminated, such as possibility and perhaps even project. You can further fine-tune what shows up on top by adding more common words to your stop list; you can even create your own stop list, very specific to your domain.

Now let’s look at another example.

Question about SQL Import


Even with all the HTML tags, because of the pre-processing we are able to extract some pretty nice keywords here. The last keyword, appropriately, would qualify as a stop word. You can keep running different examples to get ideas for how to fine-tune the results.

Voilà! Now you can extract important keywords from any type of text! To play around with the entire code, please head over to my repo and re-run the full example using my TF-IDF Jupyter Notebook.

Some tips and tricks

  1. You can easily save the resulting CountVectorizer and TfidfTransformer and load them back for use at a later time (see the sketch after this list).
  2. Instead of using CountVectorizer followed by TfidfTransformer, you can directly use TfidfVectorizer by itself. This is equivalent to CountVectorizer followed by TfidfTransformer.
  3. In this example, we computed the tf-idf matrix for each document of interest and then extracted the top terms from it. What you could also do is first apply tfidf_transformer.transform(docs_test), which will generate a tf-idf matrix for all documents in docs_test in one go, and then iterate over the resulting vectors to extract the top keywords. The first approach is useful if you have one document coming in at a time. The second approach is more suitable when you want keywords from a fairly large set of documents.
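A quick sketch of tips 1 and 2 (the file names are just examples):

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# tip 1: persist the fitted objects and load them back later
joblib.dump(cv, "count_vectorizer.pkl")
joblib.dump(tfidf_transformer, "tfidf_transformer.pkl")
cv = joblib.load("count_vectorizer.pkl")

# tip 2: TfidfVectorizer does CountVectorizer + TfidfTransformer in one step
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)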

12 thoughts on “Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn”

  1. smithplusplus

    Not sure if it’s just me, but I had to change a few lines in the code for extract_topn_from_vector to get the right number of results back. My sorted_items list was not unique, so I cast it to a set and then back to a list so that it would give me the top-N unique values.

    1. Kavita Ganesan

      It should be unique, since each word in the vocabulary occupies exactly one column in the vector. Did you try it on the Stack Overflow dataset, or is this another dataset? The only thing I can think of is surrounding whitespace, which can cause the same word to occupy different positions in the vector.

  2. This is a great walkthrough, thank you very much! Incredibly concise and clearly written.

    However, this approach only yields single keywords. How would you generate longer keywords, like bigrams or trigrams? E.g. to get phrases like “private class” or “public static method”.

    1. That’s a great question. In the past I’ve tried using the average tf-idf of the individual words in a phrase, which actually worked pretty well. In fact, this approach uses that: https://githubengineering.com/topics/. Alternatively, you can also compute tf and idf for n-grams > 1 and use those counts/weights.
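      For example, CountVectorizer can count n-grams directly via its ngram_range parameter (reusing docs and stopwords from the tutorial above):

      # count unigrams, bigrams and trigrams instead of single words only
      cv_ngrams = CountVectorizer(max_df=0.85, stop_words=stopwords, ngram_range=(1, 3))
      word_count_vector_ngrams = cv_ngrams.fit_transform(docs)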
