Python Keyword Extraction Tutorial using TF-IDF

In this era of using deep learning for everything, one may wonder why you would use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and is one of the most versatile statistics for showing the relative importance of a word or phrase in a document, or a set of documents, compared to the rest of your corpus.

Keywords are descriptive words or phrases that characterize your documents. For example, keywords from this article would be tf-idf, scikit-learn, keyword extraction, extract and so on. These keywords are also referred to as topics in some applications.

TF-IDF can be used for a wide range of tasks including text classification, clustering / topic-modeling, search, keyword extraction and a whole lot more.

In this article, you will learn how to perform keyword extraction using Python, specifically using TF-IDF from the scikit-learn package to extract keywords from documents.

Let’s get started with Python keyword extraction.

I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with it before reading on. There are a couple of videos online that give an intuitive explanation of what it is; for a more academic explanation, I would recommend my Ph.D. advisor’s explanation.
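If you just need a quick refresher, here is a minimal sketch of the smoothed weighting that scikit-learn's TfidfTransformer applies by default (the function name is mine, purely for illustration; the library computes this over sparse matrices and then L2-normalizes each document vector):

import math

def tf_idf_weight(term_count, n_docs, doc_freq):
    """Illustrative smoothed tf-idf, mirroring TfidfTransformer(smooth_idf=True)."""
    # rare words (low doc_freq) get a higher idf, common words a lower one
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return term_count * idf

Intuitively, a word that appears a few times in a post but in only a handful of the posts in the corpus gets a much higher weight than one that appears in most of them.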

Dataset

In this keyword extraction tutorial, we’ll be using a Stack Overflow dataset, which is a bit noisy and simulates what you could be dealing with in real life.

We will be using two files: stackoverflow-data-idf.json, which has 20,000 posts and is used to compute the Inverse Document Frequency (IDF), and stackoverflow-test.json, which has 500 posts and serves as the test set from which we extract keywords. This dataset is based on the publicly available Stack Overflow dump from Google’s BigQuery.

The first thing we’ll do is take a peek at our dataset. The code below reads one JSON string per line from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and the total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate JSON string.

import pandas as pd

# read json into a dataframe
df_idf=pd.read_json("data/stackoverflow-data-idf.json",lines=True)

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)
Schema:

accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)

Notice that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates, and other metadata that we don’t need for this tutorial. What we are mostly interested in is the body and title, which will be our source of text for keyword extraction. We will now create a field that combines body and title so we have both in one place. We will also print the second text entry in our new field just to see what the text looks like.


import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("<.*?>"," ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))

#show the second 'text' just for fun
df_idf['text'][2]


The text above is essentially a combination of the title and body of a Stack Overflow post. Hmmm, this doesn’t look very readable, does it? Well, that’s because we are cleaning the text after we concatenate the two fields. All of the cleaning happens in pre_process(..). You can do a lot more in pre_process(..), such as eliminating all code sections and normalizing words to their root form (a rough sketch of that follows), but for simplicity we perform only some mild pre-processing.
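If you wanted to go further, here is one possible sketch of heavier pre-processing (pre_process_heavy is a hypothetical helper, not part of this tutorial; it assumes NLTK is installed and that the post body still carries Stack Overflow's HTML markup, where code lives inside <pre> blocks):

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def pre_process_heavy(text):
    # lowercase
    text = text.lower()
    # drop code sections entirely before stripping the remaining tags
    text = re.sub(r"<pre>.*?</pre>", " ", text, flags=re.DOTALL)
    text = re.sub(r"<.*?>", " ", text)
    # remove special characters and digits
    text = re.sub(r"(\d|\W)+", " ", text)
    # normalize each word to its root form
    return " ".join(stemmer.stem(word) for word in text.split())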

Creating Vocabulary and Word Counts for IDF

We now need to create the vocabulary and start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our df_idf['text'] followed by the counts of words in the vocabulary (see: usage examples for CountVectorizer).


from sklearn.feature_extraction.text import CountVectorizer
import re

def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

#load a set of stop words
stopwords=get_stop_words("resources/stopwords.txt")

#get the text column 
docs=df_idf['text'].tolist()

#create a vocabulary of words, 
#ignore words that appear in 85% of documents, 
#eliminate stop words
cv=CountVectorizer(max_df=0.85,stop_words=stopwords)
word_count_vector=cv.fit_transform(docs)

While cv.fit(...) would only create the vocabulary, cv.fit_transform(...) creates the vocabulary and returns a term-document matrix, which is what we want. With this, each column in the matrix represents a word in the vocabulary while each row represents a document in our dataset, and the values in this case are the word counts. Note that with this representation, the count of a word will be 0 if it did not appear in the corresponding document.
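To make that concrete, here is a small optional check on the fitted objects from the code above (the word "serialize" is just an example term; swap in anything you expect to be in the vocabulary):

# one row per document, one column per vocabulary word (a scipy sparse matrix)
print(word_count_vector.shape)

# raw count of a particular word in the first document (0 if it never appears there)
word_index = cv.vocabulary_.get("serialize")
if word_index is not None:
    print(word_count_vector[0, word_index])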

Notice that in the code above, we are passing two parameters to CountVectorizer: max_df and stop_words. The first says to ignore all words that have appeared in 85% of the documents, since those may be unimportant. The latter is a custom stop words list. You can also use stop words that are native to sklearn by setting stop_words='english', but I personally find this quite limited. I used a custom stop word list for this tutorial.

The resulting shape of word_count_vector is (20000,124901) since we have 20,000 documents in our dataset (the rows) and the vocabulary size is 124,901. In some text mining applications such as clustering and text classification we typically limit the size of the vocabulary. It’s really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer. For this tutorial let’s limit our vocabulary size to 10,000.


cv=CountVectorizer(max_df=0.85,stop_words=stopwords,max_features=10000)
word_count_vector=cv.fit_transform(docs)

Now, let’s look at 10 words from our vocabulary.


list(cv.vocabulary_.keys())[:10]
['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

Sweet, these are mostly programming related.

TfidfTransformer to Compute Inverse Document Frequency (IDF)

It’s now time to compute the IDF values. In the code below, we essentially take the sparse matrix from CountVectorizer (word_count_vector) and generate the IDF when we invoke tfidf_transformer.fit(...) (see: basic usage example of TfidfTransformer and TfidfVectorizer).


from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

An extremely important point to note here is that the IDF should always be based on a large corpus that is representative of the texts you will be extracting keywords from. This is why we are using the text of 20,000 Stack Overflow posts to compute the IDF instead of just a handful. I’ve seen several articles on the Web that compute the IDF using a handful of documents. You defeat the whole purpose of IDF weighting if it’s not based on a large corpus, as (a) your vocabulary becomes too small and (b) you have limited ability to observe the behavior of the words you do know about.
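As an optional sanity check on the fitted transformer, you can peek at the learned IDF weights: very common words should end up with the lowest values. A minimal sketch using the objects from above:

import numpy as np

# one idf weight per vocabulary word, aligned with the vectorizer's feature names
names = cv.get_feature_names()
idf_values = tfidf_transformer.idf_

# the 10 most common (lowest-idf) words that survived the stop word filtering
most_common = np.argsort(idf_values)[:10]
print([names[i] for i in most_common])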

Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are ready to compute TF-IDF and then extract the top keywords from the TF-IDF vectors. In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to those of data/stackoverflow-data-idf.json, as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.


# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("data/stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] =df_test['text'].apply(lambda x:pre_process(x))

# get test docs into a list
docs_test=df_test['text'].tolist()

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over them to extract the top-n keywords, using the two helper functions sort_coo(...) and extract_topn_from_vector(...) defined right after the snippet below. In the example below, we are extracting keywords for the first document in our test set.


# you only need to do this once; this is a mapping of index to feature name
feature_names=cv.get_feature_names()

# get the document that we want to extract keywords from
doc=docs_test[0]

#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# now print the results
print("\n=====Doc=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a dict of feature -> score
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it’s really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).

Example Results

In this section, you will see some of the Stack Overflow questions followed by the top-10 keywords generated using the code above. Note that these questions are from the stackoverflow-test.json data file.

Question about Eclipse Plugin integration


The top keywords for this question actually make sense: it talks about eclipse, maven, integrate, war and tomcat, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can further fine-tune what shows up on top by adding more common words to your stop list, and you can even create your own stop list, very specific to your domain.

Now let’s look at another example.

Question about SQL Import

Even with all the HTML tags, because of the pre-processing we are able to extract some pretty nice keywords here. The last word would appropriately qualify as a stop word. You can keep running different examples to get ideas of how to fine-tune the results.

Wow! Now you can extract important keywords from any type of text!

Some tips and tricks

  1. You can easily save the resulting CountVectorizer and TfidfTransformer and load them back for use at a later time (see the sketch after this list).
  2. Instead of using CountVectorizer followed by TfidfTransformer, you can use TfidfVectorizer by itself. It is equivalent to CountVectorizer followed by TfidfTransformer.
  3. In this example, we computed the tf-idf matrix for each document of interest and then extracted top terms from it. What you could also do is first apply cv.transform(docs_test) followed by tfidf_transformer.transform(...), which generates a tf-idf matrix for all documents in docs_test in one go, and then iterate over the resulting rows to extract the top keywords. The first approach is useful if you have one document coming in at a time. The second approach is more suitable when you want keywords from a fairly large set of documents.
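To illustrate the tips above, here is a rough sketch (the file names are placeholders, and the objects are the ones built earlier in this tutorial) of persisting the fitted objects with joblib, the equivalent TfidfVectorizer setup, and the batch approach from tip 3:

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer

# tip 1: persist the fitted objects, then load them back later
dump(cv, "count_vectorizer.joblib")
dump(tfidf_transformer, "tfidf_transformer.joblib")
cv = load("count_vectorizer.joblib")
tfidf_transformer = load("tfidf_transformer.joblib")

# tip 2: a single TfidfVectorizer fitted on the same docs is equivalent to
# CountVectorizer followed by TfidfTransformer
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, stop_words=stopwords,
                                   max_features=10000, smooth_idf=True, use_idf=True)
tfidf_vectorizer.fit(docs)

# tip 3: compute tf-idf for all test docs in one go, then take top keywords per row
tf_idf_matrix = tfidf_transformer.transform(cv.transform(docs_test))
for i in range(tf_idf_matrix.shape[0]):
    sorted_items = sort_coo(tf_idf_matrix[i].tocoo())
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)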

See Also: How to correctly use TFIDFTransformer and TFIDFVectorizer?


25 thoughts on “Python Keyword Extraction Tutorial using TF-IDF”

  1. Hi Kavita

    Thank you for such an intuitive article. One question: I want to store the final keywords along with their weights in a new column corresponding to the description that I have. What would be the best way to do that? My data set is similar to yours and I can see that keywords are getting generated correctly. I just want to store them in a new column separated by commas. For example: Java 0.29, SQL 0.84 and so on

    1. Thanks for the question Ishan.

      Here’s what you can do:

      Let’s say the documents you are extracting keywords from are in a data frame called `df` and your keywords for each document are in a list of lists, e.g. `keywords=[['Java 0.29','SQL 0.84'],['Python 0.39','ML 0.85']]`. In this example, we’re assuming 2 documents, and thus 2 sets of extracted keywords.

      To store the results in a separate column:

      `df['keywords']=keywords`

      This should create a new column with the keywords for each document.

      1. Apologies that I am writing so much but I am in dire need to get this resolved. So if I pass my column where all the documents are present in this way:

        tf_idf_vector=tfidf_transformer.transform(cv.transform(df1.CHARACTERISTICS))

        #sort the tf-idf vectors by descending order of scores
        sorted_items=sort_coo(tf_idf_vector.tocoo())

        #extract only the top n; n here is 20
        keywords=extract_topn_from_vector(feature_names,sorted_items,20)

        df1.CHARACTERISTICS has 876 rows. Now I need to create a corresponding column in front of it that will store the keywords and the scores.
        A simple assignment would not work. I am not sure what would let me create this column?

  2. That’s a nice tutorial, thanks a lot for it. However, what should you do if you need to extract keywords from a single document composed of around 5,000 words? How would you use tf-idf for that? What other possible solutions would you use?

    1. Great question. You have several options for a single document.

      (1) Train your TF-IDF model on a similar domain with several hundred documents. Then use those vectors to compute TF-IDF for your one-off documents. This should work provided your domain doesn’t keep changing; even if it does, it could still work if your IDF scores are based on a generalized corpus like Wikipedia.

      (2) Experiment with word counts. Use stop words to eliminate the common words.

  3. The article is good; however, if we want the output in bigram or trigram format, what should be the approach? Because in most keyword extraction projects, it’s difficult to interpret the output of unigrams.

    1. If you want bigrams/trigrams, then you can use `ngram_range` in CountVectorizer. However, if you limit yourself to only trigrams and bigrams and your data is sparse, the results will not make sense. So I would try an ngram_range from 1 to 3 and see how it performs on your dataset.

    1. Kavita Ganesan

      I would first check the vocabulary `list(cv.vocabulary_.keys())` to see if there are duplicates in it.

      One way to eliminate the duplicates is to do a `set(list_of_items)`, however there should not be any duplicates. It’s either a bug in the code, or there are surrounding white spaces or special characters that are causing this.

  4. Smith, can you please show me how you went about the code, like a snippet of the casting to a set and back to a list? Thanks
