How do I deal with an imbalanced dataset?

I am trying to build a classification model on an imbalanced dataset of messages. Could I take the set of keywords from the minority-labeled records and boost the TF-IDF values of those particular keywords? I'm looking for a way to improve recall and precision for the minority-labeled records.

Instead of trying to manipulate scores for the minority class, you can repeat (oversample) examples from the minority class to make the classes more balanced. That way you avoid having to adjust the scores of any individual examples by hand.
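
As a rough illustration of repeating minority examples, here is a minimal sketch using only the Python standard library. The function name `oversample_minority` and its parameters are made up for this example, and it assumes your data is a list of message strings with a parallel list of labels.

```python
import random


def oversample_minority(texts, labels, seed=0):
    """Repeat randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)

    # Group the messages by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    pairs = []
    for label, items in by_label.items():
        # Draw the missing examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)

    # Shuffle so that the repeated examples are not clustered together.
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]
```

If you try this, apply it to the training split only; duplicating examples before splitting would leak copies of training messages into the test set and inflate your precision and recall.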

Alternatively, you can augment the repeated minority-class samples by replacing topic words with Word2Vec-generated synonyms.
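
One way the synonym replacement could look is sketched below. The function `augment_with_synonyms` and the `topic_words` set are hypothetical names for this example. It assumes `vectors` is any object with a `most_similar(word, topn=...)` method that returns `(word, score)` pairs and raises `KeyError` for unknown words, which is how gensim's `KeyedVectors` (for example `model.wv` of a trained `Word2Vec` model) behaves.

```python
import random


def augment_with_synonyms(text, vectors, topic_words, topn=3, seed=0):
    """Return a copy of `text` in which each topic word is replaced by
    one of its nearest neighbours in the embedding space."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        key = token.lower()
        if key in topic_words:
            try:
                # List of (neighbour, similarity) pairs, best match first.
                neighbours = vectors.most_similar(key, topn=topn)
            except KeyError:
                # Word is not in the embedding vocabulary: keep it as is.
                neighbours = []
            if neighbours:
                token = rng.choice(neighbours)[0]
        out.append(token)
    return " ".join(out)


# Assumed usage with gensim 4.x (not run here):
#   from gensim.models import Word2Vec
#   model = Word2Vec(tokenized_messages, vector_size=100, min_count=1)
#   new_text = augment_with_synonyms(message, model.wv, {"refund", "invoice"})
```

Nearest neighbours in a Word2Vec space are related words rather than guaranteed synonyms (antonyms often sit close together), so it is worth spot-checking a sample of the augmented messages before training on them.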

Also, if you have more majority-class examples than you need, consider downsampling the majority class to match the size of your minority class.
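
A comparable sketch for downsampling, again with only the standard library. The name `downsample_majority` is made up for this example, and it assumes the same layout as before: a list of messages plus a parallel list of labels.

```python
import random


def downsample_majority(texts, labels, seed=0):
    """Randomly drop examples from the larger classes until every class
    has as many examples as the smallest class."""
    rng = random.Random(seed)

    # Group the messages by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is cut down to the size of the smallest class.
    target = min(len(items) for items in by_label.values())

    pairs = []
    for label, items in by_label.items():
        # Sample without replacement so that no message is duplicated.
        kept = rng.sample(items, target)
        pairs.extend((text, label) for text in kept)

    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]
```

The trade-off is that the discarded majority messages are lost to the model, so this fits best when the majority class is large enough that a random subset still covers its variety.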

In the end, keep in mind that nothing beats acquiring more high-quality data to address the sparsity issue.
