Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing: convert a collection of raw documents into a matrix of TF-IDF features. The differences between the two classes can be confusing, and it is hard to know when to use which. This article shows how to use each one correctly, how they differ, and some guidelines on which to use when.
1. Dataset and Imports
Below we have 5 toy documents, all about my cat and my mouse who live happily together in my house. We are going to use this toy dataset to compute the tf-idf scores of words in these documents.
We also import the necessary modules here, which include pandas, TfidfTransformer, and CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# this is a very toy example, do not try this at home unless you want to understand the usage differences
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]
2. Initialize CountVectorizer
To start using TfidfTransformer, you first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, and so on. The code below does just that.
# instantiate CountVectorizer()
cv = CountVectorizer()

# this step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)
Now, let’s check the shape of word_count_vector. We should have 5 rows (5 docs) and 16 columns (16 unique words, not counting single-character words), so word_count_vector.shape should be (5, 16).
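The shape check can be reproduced with a short, self-contained sketch. It repeats the toy documents and uses CountVectorizer with all defaults, as assumed throughout this section:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)

# (number of documents, number of unique tokens kept by the default tokenizer)
print(word_count_vector.shape)  # (5, 16)
```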
Sweet, this is what we want! Now it’s time to compute the IDFs. Note that in this example, we are using all the defaults with CountVectorizer. You can actually specify a custom stop word list, enforce minimum word count, etc. See this article on how to use CountVectorizer.
3. Compute the IDF values
Now we are going to compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts we computed earlier, where tfidf_transformer is a TfidfTransformer instance.
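The code that creates tfidf_transformer is not shown in this section, so here is a minimal sketch of that step. It assumes the scikit-learn defaults (smooth_idf=True and use_idf=True are spelled out only for readability) and repeats the earlier setup so that it runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)

# Assumption: default settings; both arguments below are already the defaults.
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

# fit() only learns the IDF weights; it does not compute any tf-idf scores yet.
tfidf_transformer.fit(word_count_vector)

# One IDF weight per vocabulary term.
print(tfidf_transformer.idf_.shape)  # (16,)
```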
To get a glimpse of how the IDF values look, we are going to print them by placing them in a pandas DataFrame. The values will be sorted in ascending order.
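# print idf values
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=["idf_weights"])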
Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.
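To see where these numbers come from, the sketch below recomputes the IDF values by hand and compares them with idf_. It assumes the default smooth_idf=True, for which scikit-learn documents the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing term t:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)
tfidf_transformer = TfidfTransformer().fit(word_count_vector)

n_docs = word_count_vector.shape[0]

# Document frequency: in how many documents does each term occur at least once?
# toarray() is fine for a toy corpus; avoid it on large sparse matrices.
doc_freq = (word_count_vector.toarray() > 0).sum(axis=0)

# Smoothed IDF, assuming the default smooth_idf=True.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, tfidf_transformer.idf_))  # True
```

Words that occur in all 5 documents get ln(6 / 6) + 1 = 1, which is the smallest value this formula can produce.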
Important Note: In practice, your IDF should be based on a large corpus of text.
4. Compute the TFIDF score for your documents
Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.
# count matrix
count_vector = cv.transform(docs)

# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)
The first line above gets the word counts for the documents in sparse matrix form. We could have used word_count_vector from above. However, in practice, you may be computing tf-idf scores on a set of new, unseen documents. When you do that, you first have to call cv.transform(your_new_docs) to generate the matrix of word counts.
Then, by invoking tfidf_transformer.transform(count_vector), you finally compute the tf-idf scores for your docs. Internally, this computes the tf * idf multiplication, where each term frequency is weighted by its IDF value. With the default settings, each document’s vector is then scaled to unit length (L2 normalization).
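The tf * idf step can be checked by hand. The sketch below assumes TfidfTransformer's default settings (raw counts as term frequency, smoothed IDF, L2 normalization of each row) and compares a manual computation with the transformer's output:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)
tfidf_transformer = TfidfTransformer().fit(word_count_vector)

# Library result.
tf_idf_vector = tfidf_transformer.transform(cv.transform(docs))

# Manual result: weight each count by its IDF, then scale each row to unit L2 norm.
counts = word_count_vector.toarray()
weighted = counts * tfidf_transformer.idf_
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(manual_tfidf, tf_idf_vector.toarray()))  # True
```

If you pass norm=None or sublinear_tf=True to TfidfTransformer, the manual computation above no longer matches as written.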
Now, let’s print the tf-idf values of the first document to see if they make sense. Below, we place the tf-idf scores from the first document into a pandas data frame and sort it in descending order of scores.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
feature_names = cv.get_feature_names_out()

# get tfidf vector for first document (row 0 of the tf-idf matrix)
first_document_vector = tf_idf_vector[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)
Tf-idf scores of first document:
Notice that only certain words have scores. This is because our first document is “the house had a tiny little mouse”: all the words in this document have a tf-idf score and everything else shows up as zero. Notice that the word “a” is missing from this list. This is because CountVectorizer’s default token pattern only keeps tokens of two or more characters, so single-character words are dropped.
The scores above make sense. The more common a word is across documents, the lower its score, and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher its score. So it’s working as expected, except for the a that was chopped off.
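If you do want to keep single-character words such as “a”, one option is to override token_pattern. The sketch below relaxes the default pattern r"(?u)\b\w\w+\b" to accept tokens of one or more characters; the variable names are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Default pattern: tokens must have at least two word characters.
cv_default = CountVectorizer().fit(docs)

# Relaxed pattern: tokens of one or more word characters are kept.
cv_single_chars = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

print("a" in cv_default.vocabulary_)        # False
print("a" in cv_single_chars.vocabulary_)   # True
print(len(cv_single_chars.vocabulary_))     # 17
```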
Now, we are going to use the same 5 documents from above to do the same thing as we did for TfidfTransformer, which is to get the tf-idf scores of a set of documents. But notice how this is much shorter.
With TfidfVectorizer you compute the word counts, idf and tf-idf values all at once. It’s really simple.
from sklearn.feature_extraction.text import TfidfVectorizer

# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)
Now let’s print the tf-idf values for the first document from our collection. Notice that these values are identical to the ones from TfidfTransformer; the only difference is that it’s done in just two steps.
# get the first vector out (for the first document)
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

# place tf-idf values in a pandas data frame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)
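Rather than comparing the two tables by eye, you can check the equivalence numerically. This is a minimal sketch assuming default settings on both sides; if you customise one path (stop words, n-grams, normalization), the other needs the same settings to match:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Path 1: count first, then reweight the counts.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Path 2: do everything in one object.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```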
Here’s another way to do it, by calling fit and transform separately. You’ll end up with the same results.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
fitted_vectorizer = tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(docs)
TfidfTransformer vs. TfidfVectorizer
In summary, the main differences between the two classes are as follows:
With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.
With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.
When to use what?
So now you may be wondering why you should use more steps than necessary if you can get everything done in two. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:
- If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
- If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
- If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
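The guidelines above can be sketched in code. In the example below, new_docs is a hypothetical set of unseen documents made up for illustration. Both approaches score them identically (assuming default settings), but only the two-step path leaves you with the raw count matrix to reuse for other tasks:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Hypothetical unseen documents; words outside the fitted vocabulary are ignored.
new_docs = ["the cat ran away", "a tiny story about a dog"]

# Two-step path: the raw counts stay available for other uses.
cv = CountVectorizer().fit(docs)
tfidf_transformer = TfidfTransformer().fit(cv.transform(docs))
new_counts = cv.transform(new_docs)
new_tfidf_two_step = tfidf_transformer.transform(new_counts)

# One-step path: counts are computed internally and not exposed.
tfidf_vectorizer = TfidfVectorizer().fit(docs)
new_tfidf_one_step = tfidf_vectorizer.transform(new_docs)

print(np.allclose(new_tfidf_two_step.toarray(), new_tfidf_one_step.toarray()))  # True
```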