Previously, we learned how to use CountVectorizer for text processing. In place of CountVectorizer, you also have the option of using HashingVectorizer.
In this tutorial, we will learn how HashingVectorizer differs from CountVectorizer and when to use which.
CountVectorizer vs. HashingVectorizer
HashingVectorizer and CountVectorizer are meant to do the same thing: convert a collection of text documents into a matrix of token occurrences. The difference is that HashingVectorizer does not store the resulting vocabulary (i.e., the unique tokens).
With HashingVectorizer, each token maps directly to a column position in a matrix whose size is pre-defined. For example, if your matrix has 10,000 columns, each token maps to one of those 10,000 columns. This mapping happens via hashing, using a hash function called MurmurHash3.
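To build intuition for this mapping, here is a rough sketch using scikit-learn's murmurhash3_32 utility. This is only an approximation of what HashingVectorizer does internally (the real implementation also handles hash signs), but it shows how a token lands in a fixed-size column space:

from sklearn.utils import murmurhash3_32

n_features = 10000

# each token is hashed, and the hash value (modulo the matrix size)
# becomes its column position -- no vocabulary is stored anywhere
for token in ["cent", "cat", "hat"]:
    column = murmurhash3_32(token, positive=True) % n_features
    print(token, "->", column)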
See figure 1 for a visual example of how HashingVectorizer works.
The benefit of not storing the vocabulary (dictionary of tokens) is twofold. First, it is very memory-efficient for large datasets.
Holding a 300-million-token vocabulary in memory could be a challenge in certain computing environments, as the tokens are essentially strings, which demand more memory than their integer counterparts.
Second, by not having to store the vocabulary, the resulting HashingVectorizer object is much smaller when saved, and thus faster to load back into memory when needed.
The downside is that it becomes impossible to retrieve the actual token from a column position. This matters especially in tasks like keyword extraction, where you want to retrieve and use the actual tokens.
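To make the contrast concrete, here is a small illustrative check (the example sentence is arbitrary): CountVectorizer exposes the learned vocabulary through its vocabulary_ attribute, while HashingVectorizer has no such attribute at all.

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

cv = CountVectorizer().fit(["the cat in the hat"])
print(cv.vocabulary_)              # token -> column index, e.g. {'cat': 0, 'hat': 1, 'in': 2, 'the': 3}

hv = HashingVectorizer(n_features=10000).fit(["the cat in the hat"])
print(hasattr(hv, "vocabulary_"))  # False -- there is nothing to map a column back to a token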
Now let’s look at how to use HashingVectorizer to generate raw term counts without any type of normalization.
HashingVectorizer Usage
Toy Dataset and Imports
Here we are using the same five Cat in the Hat book titles that we used in the CountVectorizer tutorial.
from sklearn.feature_extraction.text import HashingVectorizer

# dataset
cat_in_the_hat_docs = [
    "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
    "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
    "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
    "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
    "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)"
]
Generate Raw Term Counts
Now, we will use HashingVectorizer to compute token counts using our toy dataset.
# Compute raw counts using hashing vectorizer
# Small numbers of n_features can cause hash collisions
hvectorizer = HashingVectorizer(n_features=10000, norm=None, alternate_sign=False)
# compute counts without any term frequency normalization
X = hvectorizer.fit_transform(cat_in_the_hat_docs)
Notice that the matrix size has to be pre-specified. Also, if your vocabulary is huge and your matrix size is small, you are going to end up with hash collisions, meaning two different tokens (e.g. `coffee` and `caffe`) could map to the same column position, distorting your counts. So you want to be careful during initialization. We are also turning off normalization with norm=None.
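To see collisions in action, you can shrink n_features to an unrealistically small value. Which tokens collide depends on the hash function, but with only a handful of columns, collisions are guaranteed:

# deliberately tiny matrix: distinct tokens are forced to share columns
tiny_vectorizer = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
Y = tiny_vectorizer.fit_transform(cat_in_the_hat_docs)

# each document has far more than 4 distinct tokens, so each of the
# 4 columns now holds the summed counts of several colliding tokens
print(Y.toarray())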
Now if you check the shape, you should see:
(5, 10000)
5 documents and a 10,000-column matrix. In this example, most of the columns are empty because the toy dataset is so small.
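If you want to check the shape in code, the sparse matrix returned by fit_transform exposes it directly:

# rows = documents, columns = n_features
print(X.shape)  # (5, 10000)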
Let’s print the columns that are populated for the first document.
# print populated columns of first document
# format: (doc id, pos_in_matrix) raw_count
print(X[0])
(0, 93)    3.0
(0, 689)   1.0
(0, 717)   1.0
(0, 1664)  1.0
(0, 2759)  1.0
(0, 3124)  1.0
(0, 4212)  1.0
(0, 4380)  1.0
(0, 5044)  1.0
(0, 7353)  1.0
(0, 8903)  1.0
(0, 8958)  1.0
(0, 9376)  1.0
(0, 9402)  1.0
(0, 9851)  1.0
There are 15 unique tokens: one with a count of 3 and the rest with a count of 1. Notice that the column positions range from 0 to 9999. All other arguments that you use with CountVectorizer, such as stop words and n-gram size, also apply here.
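For example, removing English stop words and adding bigrams works exactly as it does with CountVectorizer; only n_features is specific to hashing. This is just an illustrative configuration:

# same preprocessing options as CountVectorizer, but still no stored vocabulary
hvectorizer_ngrams = HashingVectorizer(
    n_features=10000,
    norm=None,
    alternate_sign=False,
    stop_words="english",
    ngram_range=(1, 2),   # unigrams and bigrams
)
X_ngrams = hvectorizer_ngrams.fit_transform(cat_in_the_hat_docs)
print(X_ngrams.shape)  # still (5, 10000) -- the matrix size is fixed by n_features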
Same Results With CountVectorizer
Now we can achieve the same results with CountVectorizer.
Generate Raw Term Counts
from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer()
# compute counts without any term frequency normalization
X = cvectorizer.fit_transform(cat_in_the_hat_docs)
If you print the shape, you will see:
(5, 43)
Notice that instead of (5, 10000) as in the HashingVectorizer example, you see (5, 43). This is because we did not force a matrix size with CountVectorizer; the matrix size is determined by how many unique tokens were found in the vocabulary, which in this case is 43.
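Because CountVectorizer stores the vocabulary, you can inspect those 43 tokens directly. Depending on your scikit-learn version, the accessor is get_feature_names_out (1.0 and later) or the older get_feature_names:

# the fitted vocabulary maps each token to its column index
print(len(cvectorizer.vocabulary_))         # 43
print(cvectorizer.get_feature_names_out())  # the tokens, in column order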
Now if we print the counts for the first document, this is what we would see:
# print populated columns of first document
# format: (doc id, pos_in_matrix) raw_count
print(X[0])
(0, 28)  1
(0, 8)   3
(0, 40)  1
(0, 9)   1
(0, 26)  1
(0, 23)  1
(0, 1)   1
(0, 0)   1
(0, 22)  1
(0, 7)   1
(0, 16)  1
(0, 37)  1
(0, 13)  1
(0, 19)  1
(0, 20)  1
As in the HashingVectorizer example, we have 15 populated columns, but here the column positions range from 0 to 42.
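And unlike with HashingVectorizer, you can map a populated column back to its token. For example, column 8, the one with a raw count of 3 in the first document, can be looked up by index; for this toy dataset it should be cent, which appears three times in the first title:

# map a column position back to the token it represents
feature_names = cvectorizer.get_feature_names_out()
print(feature_names[8])   # the token stored at column 8 -- 'cent' for this dataset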
When to use HashingVectorizer?
If you are using a large dataset for your machine learning tasks and you have no use for the resulting dictionary of tokens, then HashingVectorizer would be a good candidate.
However, if you are worried about hash collisions (which are bound to happen if your matrix size is too small relative to the vocabulary), you might want to stick with CountVectorizer until you feel you have maxed out your computing resources and it's time to optimize. Also, if you need access to the actual tokens, then CountVectorizer is again the more appropriate choice.
Recommended reading
- How to use CountVectorizer for text processing
- How to correctly use scikit-learn’s tfidftransformer and tfidfvectorizer?