Your article shows that term frequency is computed as: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). But other articles on the Web simply use TF as the number of times a term occurs in the document, without the denominator. Why do you need to do this? And what is the source?
Dividing the term frequency by the total number of words in the document is called document length normalization. Normalization matters because it reduces the effect of document length: a long document naturally repeats its terms more often than a short one, so raw counts can overstate how important a keyword is to it. You can think of this as a weighting scheme for words and phrases.
Let’s look at an example.
Say document A has 500 words and document B has 5,000 words. If the phrase 'text mining' appears 100 times in document A and 100 times in document B, is the phrase equally important to both documents? Probably not: relative to its length, the phrase makes up a much larger share of document A. One way to quantify this importance is through normalization.
Now, if you normalize the term frequency by the number of words in each document, you get:
TF_A('text mining') = 100/500 = 0.2
TF_B('text mining') = 100/5000 = 0.02
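To make the arithmetic concrete, here is a minimal Python sketch of length-normalized term frequency. The function name normalized_tf and the toy documents are made up for illustration, and the phrase is treated as a single token ("text_mining"), which assumes the text has already been tokenized that way.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return {term: count / total number of tokens} for one document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Toy documents mirroring the example: the phrase appears 100 times in each,
# padded with a placeholder token to reach 500 and 5,000 tokens.
doc_a = ["text_mining"] * 100 + ["filler"] * 400
doc_b = ["text_mining"] * 100 + ["filler"] * 4900

tf_a = normalized_tf(doc_a)["text_mining"]
tf_b = normalized_tf(doc_b)["text_mining"]
print(f"TF_A = {tf_a:.2f}, TF_B = {tf_b:.2f}")
```

The raw count is 100 in both documents, but the normalized values differ by a factor of ten, which is exactly the difference in document length.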
Notice that these values immediately show that the phrase 'text mining' carries more weight in document A than in document B. There are other ways to normalize term frequencies, including dividing by the maximum term frequency or by the average term frequency in the document.
It will take some experimentation to decide which of the normalization techniques to use.
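As a starting point for that experimentation, the sketch below computes a few variants side by side for a single document. The function name tf_variants and the sample sentence are hypothetical. "by_avg" here divides each count by the mean count over the document's distinct terms, which is one plausible reading of average-term-frequency normalization rather than a standard definition.

```python
from collections import Counter


def tf_variants(tokens):
    """Compute raw and normalized term frequencies for one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    total = sum(counts.values())      # document length in tokens
    max_count = max(counts.values())  # count of the most frequent term
    avg_count = total / len(counts)   # mean count per distinct term (assumption)
    return {
        term: {
            "raw": count,
            "by_length": count / total,
            "by_max": count / max_count,
            "by_avg": count / avg_count,
        }
        for term, count in counts.items()
    }


doc = "text mining makes text analysis easier".split()
variants = tf_variants(doc)
for term, scores in variants.items():
    print(term, scores)
```

With maximum-frequency normalization, the most frequent term in each document always scores 1.0, so scores stay comparable across documents of different lengths without depending on the total word count.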
Depending on your use case, you can still use the raw term frequency without normalization. But in most text mining and NLP tasks, you'll typically see some form of normalization.