Scikit-learn’s TfidfTransformer and TfidfVectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two can be quite confusing, and it’s hard to know when to use which. This article shows you how to use each module correctly, what the differences between the two are, and some guidelines on when to use which.
TfidfTransformer Usage
1. Dataset and Imports
Below we have 5 toy documents, all about my cat and my mouse, who live happily together in my house. We are going to use this toy dataset to compute the tf-idf scores of the words in these documents.
We also import the necessary modules here, which include TfidfTransformer and CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# this is a very toy example, do not try this at home
# unless you want to understand the usage differences
docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]
2. Initialize CountVectorizer
In order to start using TfidfTransformer, you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, and so on. The code below does just that.
# instantiate CountVectorizer()
cv = CountVectorizer()

# this step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)
Now, let’s check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words, not counting single-character words such as “a”):
word_count_vector.shape
(5, 16)
Sweet, this is what we want! Now it’s time to compute the IDFs. Note that in this example we are using all the defaults with CountVectorizer. You can actually specify a custom stop word list, enforce a minimum word count, and more. See this article on how to use CountVectorizer.
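For instance, a customized CountVectorizer might look like the sketch below. The specific values (min_df=2, max_features=1000) are purely illustrative, not recommendations:

# illustrative settings, not recommendations
cv_custom = CountVectorizer(stop_words="english",  # built-in English stop word list
                            min_df=2,              # drop words seen in fewer than 2 docs
                            max_features=1000)     # cap the vocabulary at 1000 words
word_counts_custom = cv_custom.fit_transform(docs)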
3. Compute the IDF values
Now we are going to compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts we computed earlier.
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
To get a glimpse of how the IDF values look, we are going to print them by placing the IDF values in a pandas DataFrame. The values will be sorted in ascending order.
# print idf values
# note: scikit-learn 1.0+ uses get_feature_names_out();
# older versions used get_feature_names()
df_idf = pd.DataFrame(tfidf_transformer.idf_,
                      index=cv.get_feature_names_out(),
                      columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=["idf_weights"])
(Output: a table of idf_weights per word, sorted ascending; “the” and “mouse” sit at 1.0, “cat” and “house” at roughly 1.69, and all remaining words at roughly 2.10.)
Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.
Important note: in practice, your IDF should be based on a large corpus of text.
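As a minimal sketch of that workflow, you would fit the vocabulary and IDF values on the large background corpus and then score smaller document sets against them. Here, large_corpus is a hypothetical stand-in (we reuse docs only so the snippet runs):

# large_corpus is a hypothetical stand-in for a big background collection
large_corpus = docs  # replace with your own large corpus

cv_large = CountVectorizer()
counts_large = cv_large.fit_transform(large_corpus)

idf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
idf_transformer.fit(counts_large)  # IDF values now reflect the background corpus

# score any new, smaller set of documents against those IDF values
new_tfidf = idf_transformer.transform(cv_large.transform(docs))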
4. Compute the TFIDF score for your documents
Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.
# count matrix
count_vector = cv.transform(docs)

# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)
The first line above gets the word counts for the documents in sparse matrix form. We could have actually reused word_count_vector from above. However, in practice you may be computing tf-idf scores on a set of new, unseen documents; in that case you would first call cv.transform(your_new_docs) to generate the matrix of word counts.
Then, by invoking tfidf_transformer.transform(count_vector), you finally compute the tf-idf scores for your docs. Internally, this is the tf * idf multiplication, where each term frequency is weighted by the term’s IDF value.
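As a sketch, scoring a batch of unseen documents would look like this; new_docs is a hypothetical example list:

# new_docs is a hypothetical list of unseen documents
new_docs = ["the mouse hid from the cat"]

new_count_vector = cv.transform(new_docs)  # reuse the fitted vocabulary
new_tf_idf_vector = tfidf_transformer.transform(new_count_vector)  # reuse the fitted IDFs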
Now, let’s print the tf-idf values of the first document to see if they make sense. Below, we place the tf-idf scores from the first document into a pandas DataFrame and sort it in descending order of scores.
feature_names = cv.get_feature_names_out()

# get the tfidf vector for the first document
first_document_vector = tf_idf_vector[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(),
                  index=feature_names,
                  columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)
Tf-idf scores of first document:
(Output: “had”, “little”, and “tiny” score about 0.49; “house” about 0.40; “mouse” and “the” about 0.24; every other word is 0.)
Notice that only certain words have scores. Our first document is “the house had a tiny little mouse”, so all the words in this document have a tf-idf score and everything else shows up as zero. Notice also that the word “a” is missing from this list. That is due to CountVectorizer’s default tokenization: its token_pattern only keeps tokens of two or more word characters, so single-character words are dropped.
The scores above make sense. The more common a word is across documents, the lower its score, and the more unique a word is to our first document (e.g. “had” and “tiny”), the higher its score. So it’s working as expected, except for the mysterious “a” that was chopped off.
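If you actually want to keep single-character tokens like “a”, you can relax the default token_pattern, which is r"(?u)\b\w\w+\b" (two or more word characters). A sketch:

# this variant also accepts single-character tokens such as "a"
cv_keep_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
word_counts_with_a = cv_keep_short.fit_transform(docs)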
TfidfVectorizer Usage
Now, we are going to use the same 5 documents from above to do the same thing we did with TfidfTransformer, which is to get the tf-idf scores of a set of documents. Notice how much shorter this is.
With TfidfVectorizer you compute the word counts, IDF values, and tf-idf values all at once. It’s really simple.
from sklearn.feature_extraction.text import TfidfVectorizer

# settings that you would use for CountVectorizer go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)
Now let’s print the tf-idf values for the first document from our collection. Notice that these values are identical to the ones from TfidfTransformer; the only difference is that it’s done in just two steps (instantiate and fit_transform).
# get the first vector out (for the first document)
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

# place tf-idf values in a pandas DataFrame
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(),
                  index=tfidf_vectorizer.get_feature_names_out(),
                  columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)
(Output: the same scores as before, with “had”, “little”, and “tiny” at about 0.49, “house” at about 0.40, and “mouse” and “the” at about 0.24.)
Here’s another way to do it, calling fit and transform separately; you’ll end up with the same results.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
fitted_vectorizer = tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(docs)
TfidfTransformer vs. TfidfVectorizer
In summary, the main difference between the two modules is as follows:
With TfidfTransformer, you systematically compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.
With TfidfVectorizer, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores using the same dataset.
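You can verify that the two routes produce the same matrix with a quick sketch like this; given matching settings, it should print True:

import numpy as np

# route 1: CountVectorizer followed by TfidfTransformer
counts = CountVectorizer().fit_transform(docs)
route1 = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# route 2: TfidfVectorizer in one shot (same defaults)
route2 = TfidfVectorizer(smooth_idf=True, use_idf=True).fit_transform(docs)

print(np.allclose(route1.toarray(), route2.toarray()))  # True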
When to use what?
So now you may be wondering why you should use more steps than necessary if you can get everything done in two. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:
- If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer (see the sketch after this list).
- If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
- If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
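As a sketch of the first guideline, here the raw count matrix feeds one task while the tf-idf matrix feeds another; LatentDirichletAllocation stands in for any count-based model, and its settings are illustrative:

from sklearn.decomposition import LatentDirichletAllocation

cv = CountVectorizer()
counts = cv.fit_transform(docs)  # raw term counts, reusable as-is

# task 1: a count-based model such as LDA (illustrative settings)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# task 2: tf-idf scores derived from the very same counts
tfidf = TfidfTransformer().fit_transform(counts)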
Thanks Kavita. Yes, I understand: for a holistic feature set to be used for vectorization, I need to use both closed and new tickets every time.
Hello Kavita, thanks for the detailed explanation. I have a use case: system downtime incident management tickets (incidents).
1. Use past data (2-3 years of closed tickets), where the problem description is the input for building the feature set.
2. Use the vectorized output of this feature set to map to the specific solution options taken to fix the problem (basically bucket/categorize the solution options).
3. Build a logit model and use #2 as the training set (in fact, split it into train and test).
4. When a new ticket comes in, use it as unseen data to predict the potential solution option for the problem.
The question is: do you recommend using the new ticket descriptions in building the feature set too, or are the past closed ticket descriptions good enough?
Hi Raj, you need both: the past tickets for training, and the new ticket descriptions for inference.
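In code, that split might look like this sketch; past_descriptions and new_description are hypothetical stand-ins for real ticket text:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical ticket text, for illustration only
past_descriptions = ["server down after patch", "database connection timeout"]
new_description = "application server not responding"

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(past_descriptions)  # fit on closed tickets (training)
X_new = vectorizer.transform([new_description])        # score the new ticket (inference)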
Thank you very much for this document. It will help me a lot.
Hey Kavita,
Thanks for the article. I’m running into an interesting case. Using both approaches on a test dataset of 10 support comments yields completely different IDF scores. The ones computed with the transformer are higher and always non-zero, whereas the ones from the all-in-one vectorizer approach are much lower, and a lot of the words have zero scores. What could be the reason for that? All the preprocessing on the text before that is identical (using the same code).
Alex
I’m not sure why that may be the case. But I would start with a few sentences from that data and represent them exactly as in this example. See if there are discrepancies in the values. If there aren’t, there may just be a bug in the code.
As a Python learner, this question might be foolish: is there a way you would recommend converting a text document or PDF into a string that is appropriate to use with CountVectorizer? The examples I have seen use small strings written directly into the code.
Thanks,
Chris
Hi Chris,
You can extract the text from the PDF and have it as a long string. Or you can have it as a list of strings, where each string is an extracted line from the PDF. Both will work.
—Kavita
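For instance, here is a minimal sketch using the pypdf library (one of several options for PDF text extraction; the filename is hypothetical):

from pypdf import PdfReader

reader = PdfReader("tickets.pdf")  # hypothetical file
# one long string with all pages concatenated
full_text = " ".join(page.extract_text() or "" for page in reader.pages)
# or: a list of strings, one per page
page_texts = [page.extract_text() or "" for page in reader.pages]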