Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project.
With that in mind, I thought of shedding some light on what text preprocessing really is, the different techniques of text preprocessing and a way to estimate how much preprocessing you may need. For those interested, I’ve also made some text preprocessing code snippets in Python for you to try. Now, let’s get started!
What is text preprocessing?
To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with tf-idf (approach) from Tweets (domain) is an example of a task.
Task = approach + domain
One task’s ideal preprocessing can become another task’s worst nightmare. So take note: text preprocessing is not directly transferable from task to task.
Let’s take a very simple example. Say you are trying to discover commonly used words in a news dataset. If your preprocessing step involves removing stop words because some other task used it, then you are probably going to miss out on some of the common words, as you have ALREADY eliminated them. So really, it’s not a one-size-fits-all approach.
Types of text preprocessing techniques
There are different ways to preprocess your text. Here are some of the approaches that you should know about and I will try to highlight the importance of each.
Lowercasing ALL your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, can help in cases where your dataset is not very large, and significantly helps with consistency of expected output.
Quite recently, one of my blog readers trained a word embedding model for similarity lookups. He found that different variations in input capitalization (e.g. ‘Canada’ vs. ‘canada’) gave him different types of output or no output at all. This was probably happening because the dataset had mixed-case occurrences of the word ‘Canada’ and there was insufficient evidence for the neural network to effectively learn the weights for the less common version. This type of issue is bound to happen when your dataset is fairly small, and lowercasing is a great way to deal with sparsity issues.
Here is an example of how lowercasing solves the sparsity issue, where the same words with different cases map to the same lowercase form:
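A minimal sketch of that mapping follows. The word list is made up purely for illustration, and `lowercase_all` is a hypothetical helper name, not something from a library.

```python
# Hypothetical mixed-case tokens, chosen only for illustration.
raw_words = ["Canada", "CanadA", "CANADA", "canada", "TOMCAT", "Tomcat", "toMcat"]


def lowercase_all(words):
    """Map every token to its lowercase form."""
    return [word.lower() for word in words]


lowered = lowercase_all(raw_words)
for raw, low in zip(raw_words, lowered):
    print(f"{raw:10} -> {low}")

# Seven surface forms collapse into two vocabulary entries.
vocabulary = sorted(set(lowered))
print(vocabulary)  # -> ['canada', 'tomcat']
```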
Another example where lowercasing is very useful is search. Imagine you are looking for documents containing “usa”. However, no results show up because “usa” was indexed as “USA”. Now, who should we blame? The U.I. designer who set up the interface or the engineer who set up the search index?
While lowercasing should be standard practice, I’ve also had situations where preserving the capitalization was important, for example, in predicting the programming language of a source code file. The word System in Java is quite different from system in Python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable for all tasks.
Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word.
Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming words into their root form. So the words “trouble”, “troubled” and “troubles” might actually be converted to troubl instead of trouble because the ends were just chopped off (ughh, how crude!).
There are different algorithms for stemming. The most common algorithm, which is also known to be empirically effective for English, is Porter’s algorithm. Here is an example of stemming in action with the Porter stemmer:
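To make the "chop off the ends" idea concrete, here is a toy suffix stripper. It is NOT Porter's algorithm, which applies a longer, ordered set of rewrite rules. The suffix list and the `crude_stem` helper are assumptions made up for this sketch. In practice you would more likely reach for a library implementation such as NLTK's `PorterStemmer`.

```python
# Toy suffix list, checked in this order. This is an illustrative assumption,
# not the rule set used by Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["trouble", "troubled", "troubles", "troubling"]
print([crude_stem(w) for w in words])
# -> ['troubl', 'troubl', 'troubl', 'troubl']
```

All four inflections land on the same non-word stem, which is exactly the behavior described above: consistent, but not a real root.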
Stemming is useful for dealing with sparsity issues as well as standardizing vocabulary. I’ve had success with stemming in search applications in particular. The idea is that, if say you search for “deep learning classes”, you also want to surface documents that mention “deep learning class” as well as “deep learn classes”, although the latter doesn’t sound right. But you get where we are going with this. You want to match all variations of a word to bring up the most relevant documents.
In most of my previous text classification work, however, stemming only marginally helped improve classification accuracy compared to using better engineered features and text enrichment approaches such as word embeddings.
Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that lemmatization tries to do it the proper way. It doesn’t just chop things off; it actually transforms words to the actual root. For example, the word “better” would map to “good”. It may use a dictionary such as WordNet for mappings or some special rule-based approaches. Here is an example of lemmatization in action using a WordNet-based approach:
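The dictionary-lookup idea can be sketched with a tiny hand-written table. The `LEMMAS` table and the `lemmatize` helper below are hypothetical stand-ins for a real lexical resource such as WordNet, which covers far more words and usually needs part-of-speech information.

```python
# Tiny hand-written lemma table; an illustrative stand-in for a real
# dictionary such as WordNet.
LEMMAS = {
    "better": "good",
    "troubled": "trouble",
    "troubles": "trouble",
    "troubling": "trouble",
    "geese": "goose",
}


def lemmatize(word):
    """Return the dictionary lemma when known, otherwise the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)


for word in ["better", "troubled", "troubles", "geese", "table"]:
    print(f"{word:10} -> {lemmatize(word)}")
```

Unlike the stemmer's chopped output, every result here is a real word, but only because someone built the dictionary first.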
In my experience, lemmatization provides no significant benefit over stemming for search and text classification purposes. In fact, depending on the algorithm you choose, it could be much slower compared to using a very basic stemmer and you may have to know the part-of-speech of the word in question in order to get a correct lemma. This paper finds that lemmatization has no significant impact on accuracy for text classification with neural architectures.
I would personally use lemmatization sparingly. The additional overhead may or may not be worth it. But you could always try it to see the impact it has on your performance metric.
Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are”, etc. The intuition behind using stop words is that, by removing low-information words from text, we can focus on the important words instead. For example, in the context of a search system, if your search query is “what is text preprocessing?”, you want the search system to focus on surfacing documents that talk about text preprocessing over documents that talk about what is. This can be done by preventing all words from your stop word list from being analyzed. Stop words are commonly applied in search systems, text classification applications, topic modeling, topic extraction and others.
In my experience, stop word removal, while effective in search and topic extraction systems, has proven to be non-critical in classification systems. However, it does help reduce the number of features in consideration, which helps keep your models decently sized.
Here is an example of stop word removal in action. All stop words are replaced with a dummy character, W:
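A small sketch of this masking is shown below. The stop word set is a short hypothetical list picked for the example, not a published stop list, and `mask_stop_words` is an illustrative helper name.

```python
# Short, hypothetical stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "this", "to", "with"}


def mask_stop_words(text, placeholder="W"):
    """Replace each stop word with a placeholder so the removals stay visible."""
    tokens = text.lower().split()
    return " ".join(placeholder if tok in STOP_WORDS else tok for tok in tokens)


sentence = "this is a text full of content"
print(mask_stop_words(sentence))  # -> W W W text full W content
```

In a real pipeline you would drop the stop words rather than mask them; the placeholder is only there to show which positions were affected.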
Stop word lists can come from pre-established sets or you can create a custom one for your domain. Some libraries (e.g. sklearn) allow you to remove words that appear in more than X% of your documents, which can also give you a stop word removal effect.
A highly overlooked preprocessing step is text normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is the mapping of near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.
Text normalization is important for noisy texts such as social media comments, text messages and comments to blog posts where abbreviations, misspellings and use of out-of-vocabulary words (oov) are prevalent. This paper showed that by using a text normalization strategy for Tweets, they were able to improve sentiment classification accuracy by ~4%.
Here’s an example of words before and after normalization:
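The simplest way to get this effect is a dictionary mapping, sketched below. The entries in `CANONICAL` are made-up examples, and a real mapping would be built for your own domain.

```python
# Hypothetical variant -> canonical form mapping, for illustration only.
CANONICAL = {
    "gooood": "good",
    "gud": "good",
    "2moro": "tomorrow",
    "2mrrw": "tomorrow",
    "tomrw": "tomorrow",
    "b4": "before",
    "stop-words": "stopwords",
}


def normalize(text):
    """Replace each known variant with its canonical form, token by token."""
    return " ".join(CANONICAL.get(tok, tok) for tok in text.lower().split())


print(normalize("gud food 2moro"))      # -> good food tomorrow
print(normalize("see you b4 tomrw"))    # -> see you before tomorrow
```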
Notice how the variations map to the same canonical form.
In my experience, text normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. I’ve also found it useful for topic extraction where near synonyms and spelling differences are common (e.g. topic modelling, topic modeling, topic-modeling, topic-modelling).
Unfortunately, unlike stemming and lemmatization, there isn’t a standard way to normalize texts. It typically depends on the task. For example, the way you would normalize clinical texts would arguably be different from how you normalize SMS text messages.
Some common approaches to text normalization include dictionary mappings (easiest), statistical machine translation (SMT) and spelling-correction based approaches. This interesting article compares the use of a dictionary based approach and a SMT approach for normalizing text messages. Interestingly, I’m also seeing more and more papers related to text normalization in the research world.
Noise removal is about removing characters, digits and pieces of text that can interfere with your text analysis. Noise removal is one of the most essential text preprocessing steps. It is also highly domain dependent. For example, in Tweets, noise could be all special characters except hashtags, as they signify concepts that can characterize a Tweet. The problem with noise is that it can produce results that are inconsistent in your downstream tasks. Let’s take the example below:
Notice that all the raw words above have some surrounding noise in them. If you stem these words, you can see that the stemmed result does not look very pretty. None of them have a correct stem. However, with some cleaning as applied in this notebook, the results now look much better:
Noise removal is one of the first things you should be looking into when it comes to text mining and NLP. There are various ways to remove noise. These include punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal and more. It all depends on which domain you are working in and what counts as noise for your task. The code snippet in my notebook shows how to do some basic noise removal.
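As a rough idea of what basic cleaning can look like, here is a regex-based sketch. It is not the notebook's code. The sample tokens, the `remove_noise` helper and the choice to keep a leading hashtag are all assumptions made for this example.

```python
import re


def remove_noise(token):
    """Strip HTML tags, then leading and trailing non-letter noise."""
    token = re.sub(r"<[^>]+>", "", token)  # drop complete tags such as <a> or </a>
    # Remove leading characters that are neither letters nor '#', and
    # trailing characters that are neither letters nor digits.
    token = re.sub(r"^[^A-Za-z#]+|[^A-Za-z0-9]+$", "", token)
    return token.lower()


# Hypothetical noisy tokens for illustration.
raw_tokens = ["..trouble..", "trouble<", "trouble!", "<a>trouble</a>", "1.trouble", "#NLP"]
for raw in raw_tokens:
    print(f"{raw:16} -> {remove_noise(raw)}")
```

With the noise stripped first, a stemmer sees the same clean token in every case instead of five different strings.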
Text Enrichment / Augmentation
Text enrichment involves augmenting your original text data with information that you did not previously have. Text enrichment provides more semantics to your original text, thereby improving its predictive power and the depth of analysis you can perform on your data.
In an information retrieval example, expanding a user’s query to improve the matching of keywords is a form of augmentation. A query like text mining could become text document mining analysis. While this doesn’t make sense to a human, it can help fetch documents that are more relevant.
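One very simple way to picture query expansion is a lookup of related terms per query word. The `EXPANSIONS` table below is invented for this sketch. Real systems typically derive related terms from query logs, thesauri or embeddings.

```python
# Hypothetical related-term table, for illustration only.
EXPANSIONS = {
    "text": ["document"],
    "mining": ["analysis"],
}


def expand_query(query):
    """Append related terms after each query word that has an entry."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(EXPANSIONS.get(term, []))
    return " ".join(expanded)


print(expand_query("text mining"))  # -> text document mining analysis
```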
You can get really creative with how you enrich your text. You can use part-of-speech tagging to get more granular information about the words in your text. For example, in a document classification problem, the appearance of the word book as a noun could result in a different classification than book as a verb as one is used in the context of reading and the other is used in the context of reserving something. This article talks about how Chinese text classification is improved with a combination of nouns and verbs as input features.
With the availability of large amounts of text, however, people have started using embeddings to enrich the meaning of words, phrases and sentences for classification, search, summarization and text generation in general. This is especially true in deep learning based NLP approaches, where a word-level embedding layer is quite common. You can either start with pre-established embeddings or create your own and use them in downstream tasks.
You do not need every technique described above, but you do have to do some of it for sure if you want good, consistent results. To give you an idea of what the bare minimum should be, I’ve broken it down to Must Do, Should Do and Task Dependent. Everything that falls under task dependent can be quantitatively or qualitatively tested before deciding you actually need it. Remember, less is more and you want to keep your approach as elegant as possible. The more overhead you add, the more layers you will have to peel back when you run into issues.
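Must Do: noise removal and lowercasing (lowercasing can be task dependent in some cases)
Should Do: simple normalization (e.g. standardize near-identical words)
Task Dependent: other layers such as stop word removal, stemming, lemmatization and text enrichment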
So, for any task, the minimum you should do is try to lowercase your text and remove noise. What entails noise depends on your domain (see section on Noise Removal). You can also do some basic normalization steps for more consistency and then systematically add other layers as you see fit.
General Rule of Thumb
Not all tasks need the same level of preprocessing. For some tasks, you can get away with the minimum. However, for others, the dataset is so noisy that, if you don’t preprocess enough, it’s going to be garbage-in-garbage-out.
Here’s a general rule of thumb. This will not always hold true, but works for most cases. If you have a lot of well written texts to work with in a fairly general domain, then preprocessing is not extremely critical; you can get away with the bare minimum (e.g. training a word embedding model using all of Wikipedia texts or Reuters news articles). However, if you are working in a very narrow domain (e.g. Tweets about health foods) and data is sparse and noisy, you could benefit from more preprocessing layers, although each layer you add (e.g. stop word removal, stemming, normalization) needs to be quantitatively or qualitatively verified as a meaningful layer.
Here’s a table that summarizes how much preprocessing you should be performing on your text data:
I hope the ideas here will steer you towards the right preprocessing steps for your projects. Remember, less is more. A friend of mine once mentioned to me how he made a large e-commerce search system more efficient and less buggy just by throwing out layers of unneeded preprocessing.
Having spent a big part of my career as a graduate student researcher and now a Data Scientist in the industry, I have come to realize that a vast majority of solutions proposed both in academic research papers and in the workplace are just not meant to ship: they just don’t scale! And when I say scale, I mean handling real-world use cases, the ability to handle large amounts of data and ease of deployment in a production environment. Some of these approaches either work on extremely narrow use cases or have a tough time generating results in a timely manner. In some cases, the model takes days to train even though the problem could be as simple as finding similar documents from a set of 50,000 documents.
More often than not, the problem lies in the approach that was used. Remember, there will always be more than one way to solve an NLP or Data Science problem, and optimizing your choices will increase your chance of success in deploying your models to production. Over the past decade, having shipped solutions that serve real users, I now follow a set of best practices that maximize my chance of success every time I start a new project. I swear by these principles and I hope they become handy to you as well.
1. KISS please!
KISS (Keep it simple, stupid). When it comes to choice of techniques for solving NLP problems, this seems like common sense, but I can’t say this enough: choose techniques and pipelines that are easy to understand and maintain instead of complex ones that only you understand, sometimes only partially. In a lot of NLP applications, you would typically notice one of two things: (1) Deep pre-processing layers or (2) Complex neural network architectures that are just hard to grasp, let alone train, maintain and improve on iteratively.
The first question to ask yourself is whether you need all the layers of pre-processing. Do you really need part-of-speech tagging, chunking, entity resolution, lemmatization, etc.? What if you strip out a few layers? How does this affect the performance of your models? With access to massive amounts of data, in a lot of applications you can actually let the evidence in the data guide your model. Think Word2Vec. The success of Word2Vec is in its simplicity. You use large amounts of data to draw meaning from the data itself. Layers? What layers?
When it comes to Deep Learning, use it wisely. Not all problems benefit from Deep Learning and for the problems that do, use the architectures that are easy to understand and improve on. For example, for a programming language classification task, I just used a two-layer Artificial Neural Network and realized big wins in terms of training speed and accuracy. In addition, adding a new programming language is pretty seamless as long as you have data to feed into the model. I could have complicated the model to gain some social currency by using a really complex RNN architecture straight from a research paper. But I ended up starting simple just to see how far this would get me, and now I’m at the point where I can say, what’s the need to add more complexity?
2. When in doubt, use a time-tested approach
With every NLP/text mining problem, your options are plenty. There will always be more than one way to accomplish the same task. For example, in finding similar documents, you could use a simple bag-of-words approach and compute document similarities using the resulting tf-idf vectors. Alternatively, you could do something fancier by generating embeddings of each document and computing similarities using the document embeddings. Which should you use? It actually depends on several things:
a. Which of these methods has seen a higher chance of success in practice? (Hint: We see tf-idf being used all the time for information retrieval and it’s super fast. How about the latter?)
b. Which of these do I understand better? Remember the more you understand something, the better your chance of tuning it and getting it to work the way you expect it to.
c. Do I have the necessary tools/data to implement either of these?
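To show how little machinery the bag-of-words route needs, here is a standard-library sketch of tf-idf vectors with cosine similarity. The three documents are invented, the helper names are hypothetical, and the weighting is the plain tf × log(N / df) variant. Library implementations usually add smoothing and normalization, so their scores will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical documents for illustration.
docs = [
    "deep learning for text classification",
    "text classification with deep learning",
    "recipes for baking bread",
]
vecs = tfidf_vectors(docs)
print(f"doc 0 vs doc 1: {cosine(vecs[0], vecs[1]):.3f}")
print(f"doc 0 vs doc 2: {cosine(vecs[0], vecs[2]):.3f}")
```

The first pair shares most of its vocabulary and scores clearly higher than the second, with no training step involved.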
Some of these questions can be easily answered with some literature search. But you could also reach out to experts such as University Professors or other Data Scientists who have worked on similar problems to give you a recommendation. Occasionally, I run my ideas by my peers who are in the same field to make sure I am thinking about problems and potential solutions correctly, before diving right in. As you get more and more projects under your belt, the intuition factor kicks in and you would just have a very strong sense about what’s going to work and what’s not.
3. Understand your end-point extremely well
My work on topics for GitHub initially started off as topics for the purpose of repository recommendations. Those topics would never have been exposed to the user and were only intended to be used internally to compute repo-to-repo similarity. During development, people got really excited and suggested that these should be exposed to users directly. My immediate response was “Heck, no!”. But people wondered, why not?
Very simple: that was not the intended use of those topics. The level of noise tolerance for something you would use only internally is much higher than for what you show to users as suggestions, externally. So in the case of topics, I actually spent three additional months improving the work so that it could actually be exposed to users. I can’t say this enough: you need to know what your end goal is so that you are actually working towards providing a solution that addresses the problem. Fuzziness in the end goal you are trying to achieve would result in either a complete redo or months of extra work tuning and tweaking your models to do the right thing.
4. Pay attention to your data quality
Garbage in, garbage out is true in every sense of the word when it comes to Machine Learning and NLP. If you are trying to make predictions of sentiment classes (positive vs. negative) and your positive examples contain a large number of negative comments and vice versa, your classifier is going to be confused. Imagine if I told you 1+2=3 and the next time I tell you 1+2=4 and the next time I tell you again 1+2=3. Ugh, wouldn’t you be so confused? It’s the same analogy.
Also, if you have 90% positive examples and 10% negative ones, how well do you think your classifier is going to perform on negative comments? It’s probably going to say every comment is a positive comment. Class imbalance and lack of diversity in your data can be a real problem. The more diverse your training data, the better your model will generalize. This was very evident in one of my projects on clinical text segmentation (table iv) where, when we consciously forced variety in training examples, the results clearly improved.
While over pre-processing your data may be unnecessary, under pre-processing it may also be detrimental. Let’s take Tweets for example. Tweets are highly noisy. You may have out-of-vocabulary words like looooooove and abbreviations like lgtm. To make sense of any of this, you would probably need to bring these back to their normal form first. Without that, you would fall right into the trap of garbage-in-garbage-out, especially if you are dealing with a fairly small dataset.
5. Don’t completely believe your quantitative results.
Numbers can sometimes lie. For example, in a text summarization project, the overlap between your system summary and the human-curated summary may be 100%. However, when you actually visually inspect the machine and human summaries, you might find something astonishing. Human says: this is a great example of a bad summary. Machine says: example great this is summary a bad a of. And your overlap score = 1.0. See my point? Quantitative evaluation alone is NOT ENOUGH. You need to visually inspect your results, and lots of them. Try to intuitively understand the problems that you are seeing. That’s one excellent way of getting more ideas on how to tweak your algorithm or ditch it altogether. In the summarization example, the problem was obvious: the word arrangement needs A LOT of work!
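The summary example is easy to reproduce with an order-insensitive word overlap score. The `unigram_overlap` function below is a simplified, hypothetical recall-style metric written for this sketch, not the scoring code of any particular evaluation toolkit.

```python
from collections import Counter


def unigram_overlap(reference, candidate):
    """Fraction of reference words found in the candidate, ignoring word order."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    if not ref:
        return 0.0
    matched = sum((ref & cand).values())  # multiset intersection
    return matched / sum(ref.values())


human = "this is a great example of a bad summary"
machine = "example great this is summary a bad a of"
print(unigram_overlap(human, machine))  # -> 1.0
```

Both strings contain exactly the same words the same number of times, so the score is perfect even though the machine output is unreadable.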
6. Think about cost and scalability.
Have you ever thought about what it would take to deploy your model in a production environment? What are your data dependencies? How long does your model take to run? How about time to predict or generate results? Also, what are the memory and computation requirements of your approach when you scale up to the real number of data points that it would be handling? All of these have a direct impact on whether you can afford, budget-wise, to use your proposed approach and whether you will be able to handle a production load. If your model is GPU bound, make sure that you are able to afford the cost of serving such a model.
The earlier you think about cost and scalability, the higher your chance of success in getting your models deployed. In my projects, I always instrument time to train, classify and process different loads to approximate how well the solutions that I am developing would hold up in a production environment.
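A lightweight way to start this kind of instrumentation is to time a step at a few input sizes. In the sketch below, `process` is a hypothetical stand-in for whatever step you actually want to measure, and the load sizes are arbitrary.

```python
import time


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def process(docs):
    """Hypothetical stand-in for a real pipeline step (here: tokenization)."""
    return [doc.lower().split() for doc in docs]


# Measure the same step under increasing loads to see how it scales.
for n_docs in (1_000, 10_000, 100_000):
    docs = ["Some example text to process"] * n_docs
    _, seconds = time_call(process, docs)
    print(f"{n_docs:>7} docs: {seconds:.4f}s")
```

Timings on a development machine only approximate production behavior, but they make it obvious early on when cost grows faster than the data.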
In summary, the prototypes that you develop don’t have to be throwaway prototypes. They can be the start of some really powerful production-level solution if you plan ahead. Think about your end point and how the output from your model will be consumed and used, and don’t over-complicate your solution. You will not go wrong if you KISS and pick a technique that fits the problem instead of forcing your problem to fit your chosen technique!
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.
While it is fairly easy to use a published set of stop words, in many cases such a list is insufficient for certain applications. For example, in clinical texts, terms like “mcg”, “dr.” and “patient” occur in almost every document that you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like “#”, “RT” and “@username” can potentially be regarded as stop words. The common language-specific stop word list generally DOES NOT cover such domain-specific terms.
The good news is that it is actually fairly easy to construct your own domain-specific stop word list. Assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to figure out your stop words:
1. Most frequent terms as stop words
Sum the term frequencies of each unique word w across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting so that you are sure to target the domain-specific stop words. Another option is to treat words occurring in more than X% of your documents as stop words. I have personally found eliminating words that appear in 85% of documents to be effective in several text mining tasks. The benefit of this approach is that it is really easy to implement. The downside, however, is that if you have particularly long documents, the raw term frequency from just a few documents can dominate and push a term to the top. One way to resolve this is to normalize the raw term frequency using a normalizer such as the document length (i.e. number of words in a given document).
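Both variants described above fit in a few lines. In this sketch the clinical-style documents are invented, `top_n_terms` and `high_df_terms` are hypothetical helper names, and the 0.85 default simply mirrors the 85% figure mentioned in the text.

```python
from collections import Counter


def top_n_terms(docs, n, normalize=True):
    """Top-n terms by summed term frequency, optionally length-normalized."""
    totals = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        if not tokens:
            continue
        # Dividing by document length keeps long documents from dominating.
        scale = len(tokens) if normalize else 1
        for term, count in Counter(tokens).items():
            totals[term] += count / scale
    return [term for term, _ in totals.most_common(n)]


def high_df_terms(docs, max_df=0.85):
    """Terms that occur in more than `max_df` (a fraction) of the documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term for term, count in df.items() if count / n_docs > max_df}


# Hypothetical clinical-style snippets for illustration.
docs = [
    "patient given 5 mcg dose",
    "patient reports mild pain",
    "patient discharged after review",
    "dr reviewed patient chart",
]
print(top_n_terms(docs, 1))   # -> ['patient']
print(high_df_terms(docs))    # -> {'patient'}
```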
2. Least frequent terms as stop words
Just as terms that are extremely frequent could be distracting terms rather than discriminating terms, terms that are extremely infrequent may also not be useful for text mining and retrieval. For example, the username “@username” that occurs only once in a collection of tweets may not be very useful. Other terms like “yoMateZ!”, which could be just made-up terms, again may not be useful for text mining applications. Note that certain terms like “yaaaaayy!!” can often be normalized to standard forms such as “yay”. However, if terms still have a term frequency count of one despite all the normalization, you could remove them. This could significantly reduce your overall feature space.
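Finding the rare terms is a one-pass count. The tweets below are invented, and `rare_terms` with its `min_count` parameter is a hypothetical helper for this sketch.

```python
from collections import Counter


def rare_terms(docs, min_count=2):
    """Terms whose total frequency across the collection is below `min_count`."""
    totals = Counter(tok for doc in docs for tok in doc.lower().split())
    return {term for term, count in totals.items() if count < min_count}


# Hypothetical tweets, assumed to be already normalized.
tweets = ["yay great game", "great game tonight", "yomatez great"]
print(sorted(rare_terms(tweets)))  # -> ['tonight', 'yay', 'yomatez']
```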
3. Low IDF terms as stop words
Inverse document frequency (IDF) basically refers to the inverse fraction of documents in your collection that contain a specific term ti. Let us say you have N documents, and term ti occurred in M of the N documents. The IDF of ti is thus computed as:

IDF(ti) = log(N / M)
So the more documents ti appears in, the lower the IDF score. This means terms that appear in each and every document will have an IDF score of 0. If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores to be your stop words. Again, you can also eliminate common English words (using a published stop list) prior to sorting so that you are sure that you target the domain specific low IDF words. This is not necessary really if your K is large enough such that it will prune both general stop words as well as domain specific stop words. You will find more information about IDFs here.
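The ranking step can be sketched as follows, assuming the unsmoothed form IDF = log(N / M), which is the form consistent with a score of 0 for terms that appear in every document. The documents and helper names are made up for illustration.

```python
import math
from collections import Counter


def idf_scores(docs):
    """IDF = log(N / M), where M is the number of documents containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(n_docs / m) for term, m in df.items()}


def lowest_idf_terms(docs, k):
    """The k terms with the lowest IDF; ties are broken alphabetically."""
    scores = idf_scores(docs)
    return sorted(scores, key=lambda term: (scores[term], term))[:k]


# Hypothetical clinical-style snippets for illustration.
docs = [
    "patient given 5 mcg dose",
    "patient reports mild pain",
    "patient discharged after review",
    "dr reviewed patient chart",
]
print(idf_scores(docs)["patient"])   # -> 0.0
print(lowest_idf_terms(docs, 1))     # -> ['patient']
```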
So, would stop words help my task?
So how would you know if removing domain-specific stop words would be helpful in your case? Easy: test it on a subset of your data. See if whatever measure of accuracy and performance you use improves, stays constant or degrades. If it degrades, needless to say, don’t do it, unless the degradation is negligible and you see gains in other forms such as a decrease in model size, the ability to process things in memory, etc.