|I’m trying to classify source code into different categories. But the problem with the TF-IDF and BOW approaches is that they remove operators like ++, --, and == from the data, and these play a crucial role in my classification. Is there any way that we could include those?|
Yes, of course, you can retain those. Most text vectorizers strip punctuation and single-character tokens by default. scikit-learn’s CountVectorizer and TfidfVectorizer, for example, only keep tokens made of two or more word characters, which is why ++, --, and == disappear. To retain them, you’ll have to override that default pre-processing, and most libraries let you customize it. If you’re using CountVectorizer, for example, my CountVectorizer tutorial shows how to use custom preprocessing, and you can code it so that the desired characters are retained.
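As a rough illustration of the idea, here is a minimal sketch that swaps CountVectorizer's default token pattern for one that keeps operators. The regular expression, the name CODE_TOKEN_PATTERN, the helper tokenize_code, and the sample snippets are all assumptions made up for this example, not something taken from the tutorial, so you would adjust the operator list to whatever your source language uses.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pattern (an assumption for this sketch): try multi-character
# operators first, then runs of word characters (identifiers, keywords,
# numbers), then any single remaining non-space symbol.
CODE_TOKEN_PATTERN = r"\+\+|--|==|!=|<=|>=|&&|\|\||\w+|[^\w\s]"


def tokenize_code(source):
    """Split a code snippet into tokens using the same pattern (for inspection)."""
    return re.findall(CODE_TOKEN_PATTERN, source)


# Made-up sample documents.
snippets = [
    "i++; if (a == b) count--;",
    "x = y + 1;",
]

# lowercase=False because identifier case is usually meaningful in source code.
vectorizer = CountVectorizer(token_pattern=CODE_TOKEN_PATTERN, lowercase=False)
counts = vectorizer.fit_transform(snippets)

print(tokenize_code(snippets[0]))
print(sorted(vectorizer.vocabulary_))
```

TfidfVectorizer accepts the same token_pattern and lowercase parameters, so the same pattern should carry over if you want TF-IDF weights instead of raw counts. Order matters in the pattern: listing ++ before the single-symbol fallback is what keeps it from being split into two + tokens.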