scikit-learn

10+ Examples for Using CountVectorizer

Scikit-learn’s CountVectorizer is used to transform a corpora of text to a vector of term / token counts. It also provides the capability to preprocess your text data prior to generating the vector representation making it a highly flexible feature representation module for text.

In this article, we are going to go in-depth into the different ways you can use CountVectorizer such that you are not just computing counts of words, but also preprocessing your text data appropriately as well as extracting additional features from your text dataset.

Example of How CountVectorizer Works

To show you an example of how CountVectorizer works, let’s take the book title below (for context: this is part of a book series that kids love) :

This text is transformed to a sparse matrix as shown in Figure 1(b) below:

Notice that here we have 9 unique words. So 9 columns. Each column in the matrix represents a unique word in the vocabulary, while each row represents the document in our dataset. In this case, we only have one book title (i.e. the document), and therefore we have only 1 row. The values in each cell are the word counts. Note that with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

While visually it’s easy to think of a word matrix representation as Figure 1 (a), in reality, these words are transformed to numbers and these numbers represent positional index in the sparse matrix as seen in Figure 1(b).

Why the sparse matrix format?

With CountVectorizer we are converting raw text to a numerical vector representation of words and n-grams. This makes it easy to directly use this representation as features (signals) in Machine Learning tasks such as for text classification and clustering.

Note that these algorithms only understand the concept of numerical features irrespective of its underlying type (text, image pixels, numbers, categories and etc.) allowing us to perform complex machine learning tasks on different types of data.

Side Note: If all you are interested in are word counts, then you can get away with using the python Counter. There is no real need to use CountVectorizer. If you still want to do it, here’s the example for extracting counts with CountVectorizer.

Dataset & Imports

In this tutorial, we will be using titles of 5 cat in the hat books (as seen below).

I had intentionally made it a handful of short texts so that you can see how to put CountVectorizer to full use in your applications. Keep note that each title above is considered a document.

CountVectorizer Plain and Simple

What happens above is that the 5 books titles are preprocessed, tokenized and represented as a sparse matrix as explained in the introduction. By default, CountVectorizer does the following:

• lowercases your text (set `lowercase=false` if you don’t want lowercasing)
• uses utf-8 encoding
• performs tokenization (converts raw text to smaller units of text)
• uses word level tokenization (meaning each word is treated as a separate token)
• ignores single characters during tokenization (say goodbye to words like ‘a’ and ‘I’)

Now, let’s look at the vocabulary (collection of unique words from our documents):

As we are using all the defaults, these are all word level tokens, lowercased. Note that the numbers here are not counts, they are the position in the sparse vector. Now, let’s check the shape:

We have 5 (rows) documents and 43 unique words (columns)!

CountVectorizer and Stop Words

Now, the first thing you may want to do, is to eliminate stop words from your text as it has limited predictive power and may not help with downstream tasks such as text classification. Stop word removal is a breeze with CountVectorizer and it can be done in several ways:

1. Use a custom stop word list that you provide
2. Use sklearn’s built in English stop word list (not recommended)
3. Create corpora specific stop words using `max_df` and `min_df` (highly recommended and will be covered later in this tutorial)

Let’s look at the 3 ways of using stop words.

Custom Stop Word List

In this example, we provide a list of words that act as our stop words. Notice that the shape has gone from `(5,43)` to `(5,40)` because of the stop words that were removed. Note that we can actually load stop words directly from a file into a list and supply that as the stop word list.

To check the stop words that are being used (when explicitly specified), simply access `cv.stop_words`.

While `cv.stop_words` gives you the stop words that you explicitly specified as shown above, `cv.stop_words_` (note: with underscore suffix) gives you the stop words that CountVectorizer inferred from your `min_df` and `max_df` settings as well as those that were cut off during feature selection (through the use of `max_features`). So far, we have not used the three settings, so `cv.stop_words_` will be empty.

Stop Words using MIN_DF

The goal of `MIN_DF` is to ignore words that have very few occurrences to be considered meaningful. For example, in your text you may have names of people that may appear in only 1 or two documents. In some applications, this may qualify as noise and could be eliminated from further analysis.

Instead of using a minimum term frequency (total occurrences of a word) to eliminate words, `MIN_DF` looks at how many documents contained a term, better known as document frequency. The `MIN_DF` value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing proportion of documents (e.g. 0.25 meaning, ignore words that have appeared in 25% of the documents) .

Eliminating words that appeared in less than 2 documents:

Now, to see which words have been eliminated, you can use `cv.stop_words_` as this was internally inferred by CountVectorizer (see output below).

Yikes! We removed everything? Not quite. However, most of our words have become stop words and that’s because we have only 5 book titles.

To see what’s remaining, all we need to do is check the vocabulary again with `cv.vocabulary_` (see output below):

Sweet! These are words that appeared in all 5 book titles.

Stop Words using MAX_DF

Just as we ignored words that were too rare with `MIN_DF`, we can ignore words that are too common with `MAX_DF`. `MAX_DF` looks at how many documents contained a term, and if it exceeds the `MAX_DF` threshold, then it is eliminated from consideration. The `MAX_DF` value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing proportion of documents (e.g. 0.85 meaning, ignore words appeared in 85% of the documents as they are too common).

I’ve typically used a value from `0.75-0.85` depending on the task and for more aggressive stop word removal you can even use a smaller value.

Now, to see which words have been eliminated, you can use `cv.stop_words_` (see output below):

In this example, all words that appeared in all 5 book titles have been eliminated.

Why document frequency for eliminating words?

Document frequency is sometimes a better way for inferring stop words compared to term frequency as term frequency can be misleading. For example, let’s say 1 document out of 250,000 documents in your dataset, contains 500 occurrences of the word `catnthehat`. If you use term frequency for eliminating rare words, the counts are so high that it may never pass your threshold for elimination. The word is still rare as it appears in only one document.

On several occasions, such as in building topic recommendation systems, I’ve found that using document frequency for eliminating rare and common terms gives far better results than relying on just overall term frequency.

Custom Tokenization

The default tokenization in CountVectorizer removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.

In the example below, we provide a custom tokenizer using `tokenizer=my_tokenizer` where `my_tokenizer` is a function that attempts to keep all punctuation, and special characters and tokenizes only based on whitespace.

Fantastic, now we have our punctuation, single characters and special characters!

Custom Preprocessing

In many cases, we want to preprocess our text prior to creating a sparse matrix of terms. As I’ve explained in my text preprocessing article, preprocessing helps reduce noise and improves sparsity issues resulting in a more accurate analysis.

Here is an example of how you can achieve custom preprocessing with CountVectorizer by setting `preprocessor=<some_preprocessor>`.

In the example above, `my_cool_preprocessor` is a predefined function where we perform the following steps:

1. lowercase the text (note: this is done by default if a custom preprocessor is not specified)
2. remove special characters
3. normalize certain words
4. use stems of words instead of the original form (see: preprocessing article on stemming)

You can introduce your very own preprocessing steps such as lemmatization, adding parts-of-speech and so on to make this preprocessing step even more powerful.

Working With N-Grams

One way to enrich the representation of your features for tasks like text classification, is to use n-grams where n > 1. The intuition here is that bi-grams and tri-grams can capture contextual information compared to just unigrams. In addition, for tasks like keyword extraction, unigrams alone while useful, provides limited information. For example, `good food` carries more meaning than just `good` and `food` when observed independently.

Working with n-grams is a breeze with CountVectorizer. You can use word level n-grams or even character level n-grams (very useful in some text classification tasks). Here are a few examples:

Limiting Vocabulary Size

When your feature space gets too large, you can limit its size by putting a restriction on the vocabulary size. Say you want a max of `10,000 `n-grams. CountVectorizer will keep the top `10,000` most frequent n-grams and drop the rest.

Since we have a toy dataset, in the example below, we will limit the number of features to `10`.

Notice that the shape now is `(5,10)` as we asked for a limit of 10 on the vocabulary size. You can check the removed words using `cv.stop_words_`.

Ignore Counts and Use Binary Values

By default, CountVectorizer uses the counts of terms/tokens. However, you can choose to just use presence or absence of a term instead of the raw counts. This is useful in some tasks such as certain features in text classification where the frequency of occurrence is insignificant. To get binary values instead of counts all you need to do is set `binary=True`.

If you set `binary=True` then CountVectorizer no longer uses the counts of terms/tokens. If a token is present in a document, it is 1, if absent it is `0` regardless of its frequency of occurrence. By default, `binary=False`.

Using CountVectorizer to Extract N-Gram / Term Counts

Finally, you may want to use CountVectorizer to obtain counts of your n-grams. This is slightly tricky to do with CountVectorizer, but achievable as shown below:

The counts are first ordered in descending order. Then from this list, each feature name is extracted and returned with corresponding counts.

Summary

CountVectorizer provides a powerful way to extract and represent features from your text data. It allows you to control your n-gram size, perform custom preprocessing, custom tokenization, eliminate stop words and limit vocabulary size.

While counts of words can be useful signals by themselves, in some cases, you will have to use alternative schemes such as TF-IDF to represent your features. For some applications, a binary bag of words representation may also be more effective than counts. For a more sophisticated feature representation, people use word, sentence and paragraph embeddings trained using algorithms like word2vec, Bert and ELMo where each textual unit is encoded using a fixed length vector.

To rerun some of the examples in this tutorial, get the Jupyter notebook for this article. If there is anything that I missed out here, do feel free to leave a comment below.

Build Your First Text Classifier in Python with Logistic Regression

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I don’t see or hear from spammy emails!

Other than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles and more.

Text classifiers work by leveraging signals in the text to “guess” the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like `slow`,`problem`,`wouldn't` and `not` can bias the classifier to predict negative sentiment.

The nice thing about text classification is that you have a range of options in terms of what approaches you could use. From unsupervised rules-based approaches to more supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.

In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem. The problem while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).

Here’s the full source code with accompanying dataset for this tutorial. Note that this is a fairly long tutorial and I would suggest that you break it down to several sessions so that you completely grasp the concepts.

HuffPost Dataset

The dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018 as seen below. This data set has about ~125,000 articles and 31 different categories

Now let’s look at the category distribution of these articles (Figure 2). Notice that politics has the most number of articles and education has the lowest number of articles ranging in the hundreds.

So, nothing surprising in the category distribution other than we have much fewer articles to learn from categories outside `POLITICS`.

Now, let’s take a quick peek at the dataset (Figure 3).

Notice that the fields we have in order to learn a classifier that predicts the category include headline, short_description, link and authors

The Challenge

As mentioned earlier, the problem that we are going to be tackling is to predict the category of news articles (as seen in Figure 3), using only the description, headline and the url of the articles.

Without the actual content of the article itself, the data that we have for learning is actually pretty sparse – a problem you may encounter in the real world. But let’s see if we can still learn from it reasonably well. We will not use the author field because we want to test it on articles from a different news organization, specifically from CNN.

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, I have found Logistic Regression to be  very effective on text data and the underlying algorithm is also fairly easy to understand. More importantly, in the NLP world, it’s generally accepted that Logistic Regression is a great starter algorithm for text related classification

Feature Representation

Features are attributes (signals) that help the model learn. This can be specific words from the text itself (e.g. all words, top occurring terms, adjectives) or additional information inferred based on the original text (e.g. parts-of-speech, contains specific phrase patterns, syntactic tree structure).

For this task, we have text fields that are fairly sparse to learn from. Therefore, we will try to use all words from several of the text fields. This includes the description, headline and tokens from the url. The more advanced feature representation is something you should try as an exercise.

Feature Weighting

Not all words are equally important to a particular document / category. For example, while words like ‘murder’, ‘knife’ and ‘abduction’ are important to a crime related document, words like ‘news’ and ‘reporter’ may not be quite as important.

In this tutorial, we will be experimenting with 3 feature weighting approaches. The most basic form of feature weighting, is binary weighting. Where if a word is present in a document, the weight is ‘1’ and if the word is absent the weight is ‘0’.

Another way to assign weights is using the term-frequency of words (the counts). If a word like ‘knife’ appears 5 times in a document, that can become its corresponding weight.

We will also be using TF-IDF weighting where words that are unique to a particular document would have higher weights compared to words that are used commonly across documents.

There are of course many other methods for feature weighting. The approaches that we will experiment with in this tutorial are the most common ones and are usually sufficient for most classification tasks.

Evaluation

One of the most important components in developing a supervised text classifier is the ability to evaluate it. We need to understand if the model has learned sufficiently based on the examples that it saw in order to make correct predictions.

If the performance is rather laughable, then we know that more work needs to be done. We may need to improve the features, add more data, tweak the model parameters and etc.

For this particular task, even though the HuffPost dataset lists one category per article, in reality, an article can actually belong to more than one category. For example, the article in Figure 4 could belong to COLLEGE (the primary category) or EDUCATION.

If the classifier predicts EDUCATION as its first guess instead of COLLEGE, that doesn’t mean it’s wrong. As this is bound to happen to various other categories, instead of looking at the first predicted category, we will look at the top 3 categories predicted to compute (a) accuracy and (b) mean reciprocal rank (MRR)

Accuracy

Accuracy evaluates the fraction of correct predictions. In our case, it is the number of times the PRIMARY category appeared in the top 3 predicted categories divided by the total number of categorization tasks.

MRR

Unlike accuracy, MRR takes the rank of the first correct answer into consideration (in our case rank of the correctly predicted PRIMARY category). The formula for MRR is as follows:

where Q here refers to all the classification tasks in our test set and rank_{i} is the position of the correctly predicted category. The higher the rank of the correctly predicted category, the higher the MRR.

Since we are using the top 3 predictions, MRR will give us a sense of where the PRIMARY category is at in the ranks. If the rank of the PRIMARY category is on average 2, then the MRR would be ~0.5 and at 3, it would be ~0.3. We want to get the PRIMARY category higher up in the ranks.

Building the classifier

Now it’s finally time to build the classifier! Note that we will be using the LogisticRegression module from sklearn.

Read dataset and create text field variations

Next, we will be creating different variations of the text we will use to train the classifier. This is to see how adding more content to each field, helps with the classification task. Notice that we create a field using only the description, description + headline, and description + headline + url (tokenized).

Split dataset for training and testing

Next, we will create a train / test split of our dataset, where 25% of the dataset will be used for testing based on our evaluation strategy and remaining will be used for training the classifier.

Prepare features

Earlier, we talked about feature representation and different feature weighting schemes. In `extract_features(...)` from above, is where we extract the different types of features based on the weighting schemes.

First, note that `cv.fit_transform(...)` from the above code snippet creates a vocabulary based on the training set.  Next, `cv.transform(...)` takes in any text (test or unseen texts) and transforms it according to the vocabulary of the training set, limiting the words by the specified count restrictions (`min_df`, `max_df`) and applying necessary stop words if specified. It returns a term-document matrix where each column in the matrix represents a word in the vocabulary while each row represents the documents in the dataset. The values could either be binary or counts.  The same concept also applies to `tfidf_vectorizer.fit_transform(...)` and `tfidf_vectorizer.transform(...)`.

The code below shows how we start the training process. When you instantiate the LogisticRegression module, you can vary the `solver`, the `penalty`, the `C` value and also specify how it should handle the multi-class classification problem (one-vs-all or multinomial). By default, a one-vs-all approach is used and that’s what we’re using below:

In a one-vs-all approach that we are using above, a binary classification problem is fit for each of our 31 labels. Since we are selecting the top 3 categories predicted by the classifier (see below), we will leverage the estimated probabilities instead of the binary predictions. Behind the scenes, we are actually collecting the probability of each news category being positive.

Evaluate performance

In this section, we will look at the results for different variations of our model. First, we train a model using only the description of articles with binary feature weighting.

You can see that the accuracy is 0.59 and MRR is 0.48.  This means that only about 59% of the PRIMARY categories are appearing within the top 3 predicted labels. The MRR also tells us that the rank of the PRIMARY category is between position 2 and 3. Let’s see if we can do better. Let’s try a different feature weighting scheme.

This second model uses tf-idf weighting instead of binary weighting using the same description field. You can see that the accuracy is 0.63 and MRR is 0.51 a slight improvement. This is a good indicator that the tf-idf weighting works better than binary weighting for this particular task.

How else can we improve our classifier? Remember, we are only using the description field and it is fairly sparse. What if we used the description, headline and tokenized URL, would this help? Let’s try it.

Now look! As you can see in Figure 8, the accuracy is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news!

In Figure 9, you will see how well the model performs on different feature weighting methods and use of text fields.

There are several observations that can be made from the results in Figure 9:

1. tf-idf based weighting outperforms binary & count based schemes
2. count based feature weighting is no better than binary weighting
3. Sparsity has a lot to do with how poorly the model performs. The richer the text field, the better the overall performance of the classifier.

Prediction on CNN articles

Now, the fun part! Let’s test it on articles from a different news source than HuffPost. Let’s see how the classifier visually does on articles from CNN. We will predict the top 2 categories.

A crime related story [ see article ]

Predicted: politics, crime

Entertainment related story [ see article ]

Predicted: entertainment, style

Another entertainment related story [ see article ]

Predicted: entertainment, style

Exercise in space [ see article ]

Predicted: science, healthy living

Overall, not bad. The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the url tokens and description.

Saving Logistic Regression Model

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to  save the transformer to later encode / vectorize any unseen document. Next,  we also need to save the trained model so that it can make predictions using the weight vectors. Here’s how you do it:

Over to you

Here’s the full source code with accompanying dataset for this tutorial. I hope this article has given you the confidence in implementing your very own high-accuracy text classifier.

Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task, may very well be inadequate for something like bug detection in source code.

An exercise for you:

Right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried, and how well it worked. Aim for a 90-95% accuracy and let us all know what worked!

Hints:

• Perform feature selection
• Tweak model parameters
• Try balancing number of articles per category

How to Use Tfidftransformer & Tfidfvectorizer?

Scikit-learn’s Tfidftransformer and Tfidfvectorizer aim to do the same thing, which is to convert a collection of raw documents to a matrix of TF-IDF features. The differences between the two modules can be quite confusing and it’s hard to know when to use which. This article shows you how to correctly use each module, the differences between the two and some guidelines on what to use when.

Tfidftransformer Usage

1. Dataset and Imports

Below we have 5 toy documents, all about my cat and my mouse who live happily together in my house. We are going to use this toy dataset to compute the tf-idf scores of words in these documents.

We also import the necessary modules here which include `TfidfTransformer` and `CountVectorizer`.

2. Initialize CountVectorizer

In order to start using `TfidfTransformer` you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words and etc. The code below does just that.

Now, let’s check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words, minus single character words):

Sweet, this is what we want! Now it’s time to compute the IDFs. Note that in this example, we are using all the defaults with CountVectorizer. You can actually specify a custom stop word list, enforce minimum word count, etc. See this article on how to use CountVectorizer.

3. Compute the IDF values

Now we are going to compute the IDF values by calling `tfidf_transformer.fit(word_count_vector)` on the word counts we computed earlier.

To get a glimpse of how the IDF values look, we are going to print it by placing the IDF values in a python DataFrame. The values will be sorted in ascending order.

Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.

Import Note: In practice, your IDF should be based on a large corpora of text.

4. Compute the TFIDF score for your documents

Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.

The first line above, gets the word counts for the documents in a sparse matrix form. We could have actually used `word_count_vector` from above. However, in practice, you may be computing tf-idf scores on a set of new unseen documents. When you do that, you will first have to do `cv.transform(your_new_docs)` to generate the matrix of word counts.

Then, by invoking `tfidf_transformer.transform(count_vector)` you will finally be computing the tf-idf scores for your docs. Internally this is computing the `tf * idf`  multiplication where your term frequency is weighted by its IDF values.

Now, let’s print the tf-idf values of the first document to see if it makes sense. What we are doing below is, placing the tf-idf scores from the first document into a pandas data frame and sorting it in descending order of scores.

Tf-idf scores of first document:

Notice that only certain words have scores. This is because our first document is “the house had a tiny little mouse”  all the words in this document have a tf-idf score and everything else show up as zeroes. Notice that the word “a” is missing from this list. This is possibly due to internal pre-processing of CountVectorizer where it removes single characters.

The scores above make sense. The more common the word across documents, the lower its score and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’) the higher the score. So it’s working as expected except for the mysterious `a` that was chopped off.

Tfidfvectorizer Usage

Now, we are going to use the same 5 documents from above to do the same thing as we did for Tfidftransformer – which is to get the tf-idf scores of a set of documents. But, notice how this is much shorter.

With Tfidfvectorizer you compute the word counts, idf and tf-idf values all at once. It’s really simple.

Now let’s print the tfidf values for the first document from our collection. Notice that these values are identical to the ones from Tfidftransformer, only thing is that it’s done in just two steps.

Here’s another way to do it by calling `fit` and `transform` separately and you’ll end up with the same results.

Tfidftransformer vs. Tfidfvectorizer

In summary, the main difference between the two modules are as follows:

With Tfidftransformer you will systematically compute word counts using CountVectorizer and then compute the Inverse Document Frequency (IDF) values and only then compute the Tf-idf scores.

With Tfidfvectorizer on the contrary, you will do all three steps at once. Under the hood, it computes the word counts, IDF values, and Tf-idf scores all using the same dataset.

When to use what?

So now you may be wondering, why you should use more steps than necessary if you can get everything done in two steps. Well, there are cases where you want to use Tfidftransformer over Tfidfvectorizer and it is sometimes not that obvious. Here is a general guideline:

• If you need the term frequency (term count) vectors for different tasks, use Tfidftransformer.
• If you need to compute tf-idf scores on documents within your “training” dataset, use Tfidfvectorizer
• If you need to compute tf-idf scores on documents outside your “training” dataset, use either one, both will work.