Document frequency is the number of documents containing a particular term. Based on Figure 1, the word
cent has a document frequency of 1. Even though it appeared 3 times, it appeared 3 times in only one document.
all on the other hand, has a document frequency of 5. Even though it appeared once in every document, it appeared in 5 documents.
Why Document Frequency?
Document frequency has several uses. First, it can be used to eliminate unimportant words from analysis. For example, you can enforce that words that appeared in at least 80% of your documents, can be removed from further analysis. This doesn’t just remove common words such as
are but also domain specific words that are too frequent.
Take a clinical text corpora for example. Words like
old are likely to appear in almost all clinical notes. Since these words are so common, they will have low predictive power for downstream machine learning tasks and can thus be eliminated with the use of document frequency.
Document frequency can also be used to assign weights to boost / scale down the strength of words based on relative importance. This is done using the inverse of the frequency, known as inverse document frequency (IDF).
The idea here is that the higher the document frequency, the lower the IDF weight. With this, less weight is assigned to frequent terms and more to infrequent ones. This is commonly used in information retrieval scoring algorithms to weight terms such that the topic words get a higher score than non topic words.
Figure 2 shows you an example of IDF weights generated using titles from a stack overflow dataset. Notice that terms that are used frequently in a question have the lowest weights.
With this, we can even use the IDF weights to curate a stop word list.
Document Frequency vs. Term Frequency
While document frequency is number of documents containing a term, term frequency is the number of occurrences of a term within a document.
The term frequency of
cent in document 1 is
3 and in documents 2, 3, 4, 5 is 0 (see Figure 1).
Corpus level frequency is the sum of the term frequencies across all documents. I like to refer to this as
overall term frequency.The overall term frequency of
cent is 3
In eliminating unimportant words for analysis, one can use overall term frequency or the document frequency. Document frequency is sometimes a better way to do it as term frequency can be misleading.
For example, let’s say 1 out of 10,000 documents in your clinical notes dataset, contains 500 occurrences of the word
leukemia2000. If you use term frequency to eliminate rare words, the counts are so high that it may never pass your threshold for elimination. The word
leukemia2000 is still rare as it appears in only one document.
In contrast, if you use document frequency and you want to enforce that words to keep should at least be used in 25% of your documents. You can easily enforce this.
In summary, document frequency, while being a very simple concept, is extremely powerful in text mining and NLP. You can use it to eliminate rare and low information words, curate stop words and boost and scale down scores of words.