# Why do you need to divide term frequency by a number?

Dividing the term frequency by the total number of words in a document is called document length normalization. Normalization matters because it reduces the effect of document length (long vs. short documents), so the score better reflects how important a keyword is to a given document. You can think of it as a weighting scheme for words and phrases.

Let’s look at an example.

Say document A has 500 words and document B has 5000 words. If the phrase `text mining` appears 100 times in document A and 100 times in document B, is it equally important to both documents? Relative to its length, the phrase makes up a much larger share of document A, so it is arguably more important there. One way to quantify this importance is through normalization.

Now, if you normalize the term frequency by the number of words in each document, you get:

```
TF_A('text mining') = 100 / 500  = 0.2
TF_B('text mining') = 100 / 5000 = 0.02
```

Notice that these values immediately show that the phrase `text mining` carries more weight in document A than in document B. There are other ways to normalize term frequencies, such as dividing by the maximum term frequency or the average term frequency in the document.
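As a rough illustration of the length normalization worked through above, here is a minimal Python sketch. The function names (`count_phrase`, `length_normalized_tf`) and the synthetic documents padded with a `filler` token are assumptions made for this example, not part of any library; the documents are built only to reproduce the 500-word and 5000-word case.

```python
from typing import List


def count_phrase(tokens: List[str], phrase: List[str]) -> int:
    """Count how many times `phrase` occurs as consecutive tokens."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase
    )


def length_normalized_tf(tokens: List[str], phrase: List[str]) -> float:
    """Phrase count divided by the total number of words in the document."""
    if not tokens:
        return 0.0
    return count_phrase(tokens, phrase) / len(tokens)


# Hypothetical documents: 100 occurrences of the phrase, padded with a
# placeholder word so that A has 500 words and B has 5000 words.
phrase = ["text", "mining"]
doc_a = phrase * 100 + ["filler"] * 300
doc_b = phrase * 100 + ["filler"] * 4800

tf_a = length_normalized_tf(doc_a, phrase)
tf_b = length_normalized_tf(doc_b, phrase)
print(tf_a)  # 0.2
print(tf_b)  # 0.02
```

The raw count is 100 in both documents, but the normalized score for document A is ten times larger, which matches the hand calculation.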

It will take some experimentation to decide which of the normalization techniques to use.
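One way to start that experimentation is to compute several variants side by side for the same term. The sketch below is only an illustration: the function `normalized_tf` and its `scheme` names are hypothetical, and the exact definitions of maximum and average normalization vary between sources. Here "max" divides by the count of the most frequent term in the document, and "average" divides by the mean count over the document's distinct terms.

```python
from collections import Counter
from typing import List


def normalized_tf(tokens: List[str], term: str, scheme: str = "length") -> float:
    """Term frequency of a single word under a chosen normalization scheme."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    raw = counts[term]
    if scheme == "raw":
        return float(raw)
    if scheme == "length":
        # Divide by the total number of words in the document.
        return raw / len(tokens)
    if scheme == "max":
        # Divide by the count of the most frequent term (assumed definition).
        return raw / max(counts.values())
    if scheme == "average":
        # Divide by the mean count per distinct term (assumed definition).
        return raw / (len(tokens) / len(counts))
    raise ValueError(f"unknown scheme: {scheme}")


tokens = "the cat sat on the mat the end".split()
for scheme in ("raw", "length", "max", "average"):
    print(scheme, normalized_tf(tokens, "cat", scheme))
```

Running the same comparison on your own documents and checking which scores rank terms most sensibly for your task is a reasonable first step before committing to one scheme.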

Depending on your use case, you can also use the raw term frequency without normalization. In most text mining and NLP tasks, however, you will see some form of normalization.
