Text Mining / NLP Cheat Sheet

This is a work-in-progress cheat sheet.

Inverse Document Frequency (IDF)

What is it?

IDF is a weight indicating how widely a word is used across a collection of documents. The more documents a word appears in, the lower its score. For example, the word "the" appears in almost all English texts and thus has a very low inverse document frequency. In contrast, the word "kavita" appears in far fewer documents, so its IDF weight is much higher. Traditionally IDF is computed as:

    IDF(x) = log(N / DF_x)

where N is the total number of documents in your text collection, DF_x is the number of documents containing the word x, and x is any word in your vocabulary.
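As a minimal sketch of this computation, assuming each document is already tokenized into a list of lowercase words: the function name compute_idf and the toy corpus are made up for illustration, and the natural logarithm is an assumption, since the base is not fixed by the definition.

```python
import math


def compute_idf(documents):
    """Return {word: log(N / DF_word)} for a list of tokenized documents."""
    n_docs = len(documents)  # N: total number of documents
    doc_freq = {}            # DF_x: number of documents containing word x
    for tokens in documents:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "kavita wrote the cheat sheet".split(),
]
idf = compute_idf(docs)

# "the" is in every document, "cat" in two, "kavita" in one,
# so their IDF weights increase in that order.
for word in ("the", "cat", "kavita"):
    print(word, round(idf[word], 3))
```

With this plain formula, a word that occurs in every document gets a weight of exactly zero. Libraries often use smoothed variants to avoid that, so expect different numbers from an off-the-shelf implementation.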

How is IDF used?

  1. Term weighting. IDF is typically used to boost the scores of words that are distinctive to a document (rare across the collection) and to suppress words that appear everywhere and carry little weight. For example, in a given document, if the word "the" appears 10 times and its IDF weight is 0.1, its resulting score is 1 (10 * 0.1). If the word "kavita" also appears 10 times and its IDF weight is 0.5, its resulting score is 5 (10 * 0.5). When you rank the words by their scores (in descending order of course!), "kavita" appears before "the".
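  2. Source of stop words. We can even use IDFs to construct a stop word list: words with very low IDF appear in nearly every document, which makes them natural stop word candidates.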
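Both uses can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: the IDF weights for "the" and "kavita" are the ones from the term weighting example, while the other weights, the helper names rank_terms and stop_word_candidates, and the 0.2 threshold are hypothetical.

```python
from collections import Counter

# IDF weights: "the" and "kavita" come from the term weighting example;
# "of" and "mining" are made-up values for illustration.
idf = {"the": 0.1, "kavita": 0.5, "of": 0.12, "mining": 0.4}


def rank_terms(tokens, idf):
    """Score each word as count * IDF and sort by score, highest first."""
    counts = Counter(tokens)
    scores = {word: count * idf[word] for word, count in counts.items() if word in idf}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def stop_word_candidates(idf, threshold):
    """Return words whose IDF falls below a (hypothetical) threshold."""
    return sorted(word for word, weight in idf.items() if weight < threshold)


# A document in which "the" and "kavita" each appear 10 times.
tokens = ["the"] * 10 + ["kavita"] * 10
ranked = rank_terms(tokens, idf)
print(ranked)  # "kavita" (10 * 0.5) ranks ahead of "the" (10 * 0.1)

stop_words = stop_word_candidates(idf, threshold=0.2)
print(stop_words)  # the low-IDF words: "of" and "the"
```

The threshold is a tuning choice. In practice you would inspect the lowest-IDF words in your own collection before settling on a cutoff.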