I’m trying to classify source code into different categories. The problem with the TF-IDF and bag-of-words (BOW) approaches is that they strip operators like ++, --, and == from the data, and these play a crucial role in my classification. Is there any way to include them?
Yes, of course, you can retain those. By default, most machine learning packages remove special characters and single-letter words during tokenization. To retain them, you’ll have to change the internal pre-processing, which you can usually customize. If you’re using CountVectorizer, for example, my CountVectorizer tutorial shows how to plug in custom preprocessing, and you can write it so that the desired characters are retained.
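As a rough illustration of what such custom pre-processing could look like, here is a minimal sketch of a tokenizer that keeps operators. The regular expression, the names `CODE_TOKEN_RE`, `tokenize_code`, and `bag_of_words`, and the sample snippet are all assumptions made for this example, and `Counter` only stands in for the vectorizer's counting step.

```python
import re
from collections import Counter

# Hypothetical token pattern for C-like source code (an assumption; adjust it
# to the languages and operators that matter for your classes).
CODE_TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*"                  # identifiers and keywords
    r"|\d+"                          # integer literals
    r"|\+\+|--|==|!=|<=|>=|&&|\|\|"  # multi-character operators, tried before single symbols
    r"|[^\w\s]"                      # any other single non-space symbol
)


def tokenize_code(source):
    """Split source code into tokens, keeping the operators that default text tokenizers drop."""
    return CODE_TOKEN_RE.findall(source)


def bag_of_words(source):
    """Count token occurrences; a stand-in for what a count vectorizer does per document."""
    return Counter(tokenize_code(source))


snippet = "i++; if (a == b) { j--; }"  # made-up sample input
tokens = tokenize_code(snippet)
counts = bag_of_words(snippet)
print(tokens)
print(counts["++"], counts["=="], counts["--"])
```

With scikit-learn, a function like this can be passed in as `CountVectorizer(tokenizer=tokenize_code, lowercase=False)`; `TfidfVectorizer` accepts the same two parameters. Setting `lowercase=False` is a design choice here, since case is often meaningful in source code.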
How do I deal with an imbalanced dataset?
I am attempting to develop a classification model using an unbalanced dataset of messages. Could…
How do I retain stop words/characters when doing machine learning?
I’m trying to classify source code into different categories. But the problem…
Why do you need to divide term frequency with a number?
Your article shows that term frequency is counted as: TF(t) = (Number of times term t…