What is Inverse Document Frequency (IDF)?

Inverse Document Frequency (IDF) is a weight that reflects how widely a word is used across a collection of documents. The more documents a word appears in, the lower its IDF score, and the lower the score, the less informative the word is considered to be.

For example, the word “the” appears in almost all English texts and thus has a very low IDF score, since it carries very little “topic” information. In contrast, the word “coffee”, while common, is not used nearly as widely as “the”. Thus, “coffee” has a higher IDF score than “the”. Traditionally, IDF is computed as:

$$\mathrm{IDF}_t = \log\left(\frac{N}{DF_t}\right)$$

where N is the total number of documents in your text collection, DF_t is the number of documents containing the term t, and t is any word in your vocabulary. Learn more about document frequency in this article.
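To make the definition concrete, here is a minimal sketch that computes IDF as log(N / DF_t) over a tiny corpus. The function name `compute_idf`, the toy documents, the use of the natural logarithm, and the absence of any smoothing are all assumptions made for illustration; real libraries often use smoothed variants.

```python
import math

def compute_idf(documents):
    """Return {term: log(N / DF_t)} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() ensures a term is counted once per document, not once per occurrence
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Natural log and no smoothing: assumptions for this sketch
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus, already lowercased and split on whitespace
corpus = [
    "the coffee was hot".split(),
    "the tea was cold".split(),
    "the weather is cold".split(),
]

idf = compute_idf(corpus)
for term in ("the", "cold", "coffee"):
    print(term, round(idf[term], 3))
```

In this toy corpus “the” occurs in every document, so its IDF is log(3/3) = 0, while “coffee” occurs in only one document and receives the highest weight of the three terms printed.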

IDF is typically used to boost the scores of words that are distinctive to a document, in the hope of surfacing high-information words that characterize the document and suppressing words that carry little weight in it.

Let’s take an example. In a given document, if the word “the” appears 10 times and its IDF weight is 0.1, its resulting score is 1 (since 10 × 0.1 = 1). Now, if the word “coffee” also appears 10 times and its IDF weight is 0.5, its resulting score is 5. When you rank the words by the resulting scores (in descending order, of course!), “coffee” appears before “the”, indicating that “coffee” is more important to the document than “the”.
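The arithmetic in this example can be sketched in a few lines. The counts and IDF weights are the illustrative numbers from the text, not values computed from a real corpus, and the variable names are hypothetical.

```python
# Term counts within one document and their IDF weights (illustrative values from the text)
term_counts = {"the": 10, "coffee": 10}
idf_weights = {"the": 0.1, "coffee": 0.5}

# Score each term as count * IDF weight
scores = {term: count * idf_weights[term] for term, count in term_counts.items()}

# Rank terms from highest to lowest score
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['coffee', 'the']
```

Both words occur equally often in the document, so the ranking is decided entirely by the IDF weight.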

In summary, IDF is a useful little formula that you can use to create a stop-word list, to weight features in text classifiers, for keyword extraction and more.
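As one illustration of the stop-word use case, the sketch below collects the terms whose IDF falls at or below a cutoff. The function name `low_idf_terms`, the toy corpus, and the `max_idf=0.3` threshold are assumptions chosen for this example; a sensible cutoff depends on your own collection.

```python
import math

def low_idf_terms(documents, max_idf):
    """Return terms whose IDF, log(N / DF_t), is at or below max_idf, sorted alphabetically."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if math.log(n_docs / df) <= max_idf
    )

# Hypothetical toy corpus, already lowercased and split on whitespace
corpus = [
    "the coffee is hot".split(),
    "the tea is cold".split(),
    "the weather is nice".split(),
    "the dog barked".split(),
]

# The threshold is an arbitrary choice for this sketch
stop_words = low_idf_terms(corpus, max_idf=0.3)
print(stop_words)  # ['is', 'the']
```

Here “the” appears in all four documents and “is” in three, so both fall under the cutoff, while every other word appears in a single document and is kept.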
