How to incorporate phrases into Word2Vec – a text mining approach

Training a Word2Vec model with phrases is very similar to training a Word2Vec model with single words. The difference is that you need an extra layer of processing over your text data to discover phrases beforehand. In this tutorial, you will learn how to create embeddings with phrases without explicitly specifying the number of words that should make up a phrase (i.e. the n-gram size). This means that you could have phrases with 2 words, 3 words and, in some rare cases, even 4 or 5.

At a high level, the steps would include:

  • Step 1: Discovering common phrases in your corpora
  • Step 2: Tagging your corpora with phrases
  • Step 3: Training a Word2Vec model with the newly found phrases

Step 1: Discovering common phrases in your corpora

The first step towards generating embeddings for phrases is recognizing groups of words that make up a phrase. There are many ways to recognize phrases. One way is to use a linguistically heavy approach called “chunking” to detect phrases. NLTK, for example, has a chunking capability that you could use.

For this task, I will show you how you can use a text data mining approach with Spark, where you leverage the volume and evidence from your corpora for phrase detection. I like this approach because it’s lightweight, speedy and scales to the amount of data that you need to process.

So here’s how it works. At a high level, the entire text corpus is segmented using a set of delimiter tokens. These can be special characters, stop words and other terms that can indicate a phrase boundary. I specifically used some special characters and a very basic set of English stop words.

Stop words are excellent for splitting text into a set of phrases as they usually consist of connector and filler words used to connect ideas, details, or clauses together in order to make one clear, detailed sentence. You can get creative and use a more complete stop word list or you can even over-simplify this list to make it a minimal stop word list.

The code below shows you how you can use both special characters and stop words to break text into a set of candidate phrases. Check the phrase-at-scale repo for the full source code.
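As a rough illustration of this two-stage splitting, here is a minimal pure-Python sketch. It is not the code from the phrase-at-scale repo: the stop word list, the set of boundary characters, the `max_words` cap and the function name `candidate_phrases` are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop word list; swap in a fuller one as needed.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "at",
              "to", "for", "with", "is", "was", "were", "are", "i", "we", "it"}

# Special characters treated as coarse phrase boundaries (an assumption).
COARSE_BOUNDARY = re.compile(r'[,.;:!?()\[\]"]+')


def candidate_phrases(text, max_words=5):
    """Split text on special characters first, then on stop words."""
    phrases = []
    for chunk in COARSE_BOUNDARY.split(text.lower()):
        current = []
        for token in chunk.split():
            if token in STOP_WORDS:
                # A stop word closes the run of words collected so far.
                if 2 <= len(current) <= max_words:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        # Flush whatever is left at the end of the chunk.
        if 2 <= len(current) <= max_words:
            phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("We ate fried chicken and onion rings, the green curry was great."))
```

Note that a single sentence also produces noisy candidates such as "ate fried chicken". It is the frequency counts over a large corpus that let the recurring phrases rise to the top.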

In the code above, we first split the text into coarse-grained units using special characters such as the comma, period and semi-colon. This is followed by more fine-grained boundary detection using stop words. When you repeat this process for all documents or sentences in your corpus, you end up with a huge set of candidate phrases. You can then surface the top phrases using frequency counts and other measures such as Pointwise Mutual Information (PMI), which measures the strength of association between the words in a phrase. For the phrase embedding task, we naturally have to use lots and lots of data, so frequency counts alone suffice. In some other tasks, I have combined frequency counts with PMI to get a better measure of phrase quality.
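To make the PMI idea concrete, here is a small sketch for two-word phrases only, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `bigram_pmi` and the toy sentences are assumptions for illustration, and estimating probabilities from raw counts like this is just one simple choice.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(w1, w2): PMI} for adjacent word pairs in tokenized sentences."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        # High PMI means the pair co-occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus, already tokenized.
toy = [["green", "curry"], ["green", "curry"], ["onion", "rings"]]
for pair, score in sorted(bigram_pmi(toy).items()):
    print(pair, round(score, 3))
```

PMI on its own tends to reward rare pairs, which is one reason to combine it with a frequency threshold rather than use it alone.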

To ensure scalability, I really like using Spark, since you can leverage its built-in multi-threading capability on a single machine, or use multiple machines to get more CPU power if you really have massive amounts of data to process. The code below shows you the PySpark method that reads your text files, cleans them up, generates candidate phrases, counts the frequency of the phrases and filters them down to the set of phrases that satisfy a minimum frequency count. On a 450 MB dataset, run locally, this takes about a minute to discover the top phrases and 7 minutes to annotate the entire text corpus with phrases. You can follow the instructions in the phrase-at-scale repo to use this PySpark code to discover phrases in your own data.
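The read, generate, count and filter pipeline described here could look roughly like the sketch below. It is an approximation under stated assumptions, not the listing from the phrase-at-scale repo: `candidates`, `discover_phrases`, the stop word list and the default `min_count` are hypothetical, and `sc` is expected to be a SparkContext that the caller has already created.

```python
import re
from operator import add

# Hypothetical minimal stop word list and boundary characters.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "with",
              "is", "was", "i", "we", "it"}
BOUNDARY = re.compile(r'[,.;:!?()"]+')


def candidates(line):
    """Yield multi-word runs found between special characters and stop words."""
    for chunk in BOUNDARY.split(line.lower()):
        run = []
        # The trailing None flushes the last run of each chunk.
        for token in chunk.split() + [None]:
            if token is None or token in STOP_WORDS:
                if len(run) > 1:
                    yield " ".join(run)
                run = []
            else:
                run.append(token)


def discover_phrases(sc, path, min_count=50):
    """Return an RDD of (phrase, count) pairs seen at least min_count times."""
    return (sc.textFile(path)
              .flatMap(candidates)
              .map(lambda phrase: (phrase, 1))
              .reduceByKey(add)
              .filter(lambda pair: pair[1] >= min_count))


# Usage, assuming pyspark is installed and reviews.txt is a hypothetical input file:
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "phrase-discovery")
#   top_phrases = discover_phrases(sc, "reviews.txt", min_count=50).collect()
```

Keeping the candidate generator as a plain Python function means it can be tested without Spark and then handed to `flatMap` unchanged.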

Here is a tiny snapshot of phrases found using the code above on a restaurant review dataset.

Step 2: Tagging your corpora with phrases

There are two ways you can mark certain words as phrases in your corpus. One approach is to pre-annotate your entire corpus and generate a new “annotated corpus”. The other is to annotate your sentences or documents during the pre-processing phase, just before learning the embeddings. A separate annotation layer, as in the first approach, is much cleaner because it does not interfere with the training phase. Otherwise, it is harder to gauge whether your model is slow because of training or because of annotation.

In annotating your corpus, all you need to do is join the words that make up a phrase. For this task, I just use an underscore to join the individual words. So, “…ate fried chicken and onion rings…” would become “…ate fried_chicken and onion_rings…”
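A minimal sketch of this underscore tagging is shown below. The function name `annotate` is hypothetical, the text is assumed to be lowercased already, and the regex-based longest-match strategy is one simple choice that may not scale to very large phrase lists.

```python
import re


def annotate(text, phrases):
    """Join the words of each known phrase with underscores."""
    if not phrases:
        return text
    # Longest phrases first so 'chicken tikka masala' wins over 'tikka masala'.
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)


print(annotate("ate fried chicken and onion rings",
               ["fried chicken", "onion rings"]))
```

Running the annotation once and writing the result to a new file gives you the separate "annotated corpus" that the training step can read directly.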

Step 3: Training a Phrase2Vec model using Word2Vec

Once you have phrases explicitly tagged in your corpus, the training phase is much the same as training any Word2Vec model with Gensim or any other library. You can follow my Word2Vec Gensim Tutorial for a full example on how to train and use Word2Vec.
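For orientation, a training run on an annotated corpus might be sketched as follows. The helper names, the file name and every hyperparameter value are illustrative assumptions, not the settings used for the results in this post. The keyword `vector_size` is the Gensim 4.x name; Gensim 3.x called it `size`.

```python
def read_annotated(path):
    """Yield one token list per non-empty line of the annotated corpus."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.lower().split()
            if tokens:
                yield tokens


def train_phrase_model(path):
    """Train Word2Vec on text where phrases are already joined by underscores."""
    # Imported lazily so read_annotated works even without Gensim installed.
    from gensim.models import Word2Vec

    sentences = list(read_annotated(path))
    # Illustrative values only; tune them for your own data.
    return Word2Vec(sentences=sentences, vector_size=100, window=5,
                    min_count=5, workers=4)


# Usage, assuming Gensim 4.x and a hypothetical annotated_reviews.txt:
#   model = train_phrase_model("annotated_reviews.txt")
#   print(model.wv.most_similar("green_curry", topn=8))
```

Because the phrases are ordinary whitespace-separated tokens by this point, nothing in the training call needs to know that they were once several words.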

Example Usage of Phrase Embeddings

The examples below show you the power of phrase embeddings when used to find similar concepts. These are concepts from the restaurant domain, trained on 450 MB worth of restaurant reviews using Gensim.

Similar and related unigrams, bigrams and trigrams

Notice below that we are able to capture highly related concepts that are unigrams, bigrams and higher order n-grams.

Most similar to 'green_curry':
('panang_curry', 0.8900948762893677)
('yellow_curry', 0.884008526802063)
('panang', 0.8525004386901855)
('drunken_noodles', 0.850254237651825)
('basil_chicken', 0.8400430679321289)
('coconut_soup', 0.8296557664871216)
('massaman_curry', 0.827597975730896)
('pineapple_fried_rice', 0.8266736268997192)

Most similar to 'singapore_noodles':
('shrimp_fried_rice', 0.7932361960411072)
('drunken_noodles', 0.7914629578590393)
('house_fried_rice', 0.7901676297187805)
('mongolian_beef', 0.7796567678451538)
('crab_rangoons', 0.773795485496521)
('basil_chicken', 0.7726351022720337)
('crispy_beef', 0.7671589255332947)
('steamed_dumplings', 0.7614079117774963)

Most similar to 'chicken_tikka_masala':
('korma', 0.8702514171600342)
('butter_chicken', 0.8668922781944275)
('tikka_masala', 0.8444720506668091)
('garlic_naan', 0.8395442962646484)
('lamb_vindaloo', 0.8390569686889648)
('palak_paneer', 0.826908528804779)
('chicken_biryani', 0.8210495114326477)
('saag_paneer', 0.8197864294052124)

Most similar to 'breakfast_burrito':
('huevos_rancheros', 0.8463341593742371)
('huevos', 0.789624035358429)
('chilaquiles', 0.7711247801780701)
('breakfast_sandwich', 0.7659544944763184)
('rancheros', 0.7541004419326782)
('omelet', 0.7512155175209045)
('scramble', 0.7490915060043335)
('omlet', 0.747859001159668)

Most similar to 'little_salty':
('little_bland', 0.745500385761261)
('little_spicy', 0.7443351149559021)
('little_oily', 0.7373550534248352)
('little_overcooked', 0.7355216145515442)
('kinda_bland', 0.7207454442977905)
('slightly_overcooked', 0.712611973285675)
('little_greasy', 0.6943882703781128)
('cooked_nicely', 0.6860566139221191)

Most similar to 'celiac_disease':
('celiac', 0.8376057744026184)
('intolerance', 0.7442486882209778)
('gluten_allergy', 0.7399739027023315)
('celiacs', 0.7183824181556702)
('intolerant', 0.6730632781982422)
('gluten_free', 0.6726624965667725)
('food_allergies', 0.6587174534797668)
('gluten', 0.6406026482582092)

Similar concepts expressed differently

Here you will see that similar concepts that are expressed differently can also be captured.

Most similar to 'reasonably_priced':
('fairly_priced', 0.8588327169418335)
('affordable', 0.7922118306159973)
('inexpensive', 0.7702735066413879)
('decently_priced', 0.7376087307929993)
('reasonable_priced', 0.7328246831893921)
('priced_reasonably', 0.6946456432342529)
('priced_right', 0.6871092915534973)
('moderately_priced', 0.6844340562820435)

Most similar to 'highly_recommend':
('definitely_recommend', 0.9155156016349792)
('strongly_recommend', 0.86533123254776)
('absolutely_recommend', 0.8545517325401306)
('totally_recommend', 0.8534528017044067)
('recommend', 0.8257364630699158)
('certainly_recommend', 0.785507082939148)
('highly_reccomend', 0.7751532196998596)
('highly_recommended', 0.7553941607475281)


In summary, to generate embeddings of phrases, you need to add a phrase discovery layer before training a Word2Vec model. If you have lots of data, a text data mining approach has the benefit of being lightweight and scalable, without compromising on quality. In addition, you don’t have to specify a phrase size in advance or be limited by a specific vocabulary. A linguistically heavy approach gives you a lot more specificity in terms of parts of speech and the types of phrases (e.g. noun phrase vs. verb phrase) that you are dealing with. If you really need that information, then you can consider a chunking approach over a text mining approach.


Here are some resources that might come in handy:


How to read CSV & JSON files in Spark – word count example

One of the really nice things about Spark is the ability to read input files of different formats right out of the box. Though this is a nice feature to have, reading files in Spark is not always consistent and seems to keep changing between Spark releases. This article will show you how to read CSV and JSON files and compute word counts on selected fields. This example assumes that you are using Spark 2.0+ with Python 3.0 and above. Full working code can be found in this repository.

Data files

To illustrate by example, let’s make some assumptions about the data files. Let’s assume that we have data files containing a title field and a corresponding text field. The toy example format in JSON is as follows:

And the format in CSV is as follows:

Assume that we want to compute word counts based on the text field.
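If you want files to experiment with, the sketch below writes a pair of toy files with a title field and a text field. The record contents and the file names are invented for illustration; only the two field names come from this article. The JSON file is written as one object per line, which is the layout Spark's JSON reader expects by default.

```python
import csv
import json
import os
import tempfile

# Hypothetical records; only the field names come from the article.
RECORDS = [
    {"title": "First document", "text": "spark makes word counts easy"},
    {"title": "Second document", "text": "word counts with spark and python"},
]


def write_toy_files(directory):
    """Write the same records as JSON Lines and as CSV with a header row."""
    json_path = os.path.join(directory, "docs.json")
    csv_path = os.path.join(directory, "docs.csv")
    with open(json_path, "w", encoding="utf-8") as handle:
        for record in RECORDS:
            handle.write(json.dumps(record) + "\n")
    with open(csv_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "text"])
        writer.writeheader()
        writer.writerows(RECORDS)
    return json_path, csv_path


json_path, csv_path = write_toy_files(tempfile.mkdtemp())
for path in (json_path, csv_path):
    with open(path, encoding="utf-8") as handle:
        print(handle.read())
```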

Reading JSON File

Reading the JSON file is actually pretty straightforward. First, you create an SQLContext from the Spark context. This gives you the capability of querying the JSON file in regular SQL-type syntax.

In this next step, you use the sqlContext to read the JSON file and select only the text field. Remember that we have two fields, title and text, and in this case we are only going to process the text field. This step returns a Spark DataFrame where each entry is a Row object. To access the text field in each row, you would use row.text. Note that the select here is conceptually the same as in traditional SQL, where you would do: select text from .....
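The two steps described so far could be sketched like this. The function name `read_json_text` and the file name in the usage comment are assumptions. The helper takes the SQLContext as an argument so that it can be defined without Spark installed. A SparkSession exposes the same `read` attribute and should work in its place.

```python
def read_json_text(sql_context, path):
    """Read a JSON Lines file and keep only the `text` column."""
    return sql_context.read.json(path).select("text")


# Usage, assuming pyspark is installed and docs.json is a hypothetical input file:
#   from pyspark import SparkContext
#   from pyspark.sql import SQLContext
#   sc = SparkContext("local[*]", "read-json-example")
#   sql_context = SQLContext(sc)
#   df = read_json_text(sql_context, "docs.json")
#   for row in df.collect():
#       print(row.text)
```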

To view what you have just read, you can use

You should see something like this:

SQL Query to Read JSON file

Note that you can achieve the same result by issuing an actual SQL query on the dataset. For this, you first register the dataset as a view, then you issue the query. This returns the same DataFrame as above.
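A sketch of the view-plus-query route, under the same assumptions as before: the function name and the view name `docs` are hypothetical, and `sql_context` is an SQLContext (or SparkSession) created by the caller.

```python
def read_json_text_sql(sql_context, path, view_name="docs"):
    """Register the JSON data as a temporary view and query it with SQL."""
    df = sql_context.read.json(path)
    # createOrReplaceTempView is available on DataFrames from Spark 2.0 onwards.
    df.createOrReplaceTempView(view_name)
    return sql_context.sql("SELECT text FROM " + view_name)


# Usage, assuming pyspark is installed and docs.json is a hypothetical input file:
#   df = read_json_text_sql(sql_context, "docs.json")
#   df.show()
```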

Reading CSV File

Reading the CSV file is similar to reading JSON, with a small twist: you use a more generic read method and provide a format to it, as below. Note that this method of reading is applicable to different file types, including JSON, Parquet and CSV, and probably others as well.

Since the CSV data file in this example has a header row, the header can be used to infer the schema, hence header='true' as seen above. In this example, we are again selecting only the text field. This method of reading a file also returns a DataFrame identical to the one in the previous example on reading a JSON file.
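One way to express a format-driven CSV read is the generic `load` call on the reader, sketched below. This is my reading of the description rather than the original listing, and the function name `read_csv_text` is hypothetical. Extra keyword arguments to `load`, such as `header`, are passed through as reader options.

```python
def read_csv_text(sql_context, path):
    """Read a CSV file that has a header row and keep only the `text` column."""
    df = sql_context.read.load(path, format="csv", header="true")
    return df.select("text")


# Usage, assuming pyspark is installed and docs.csv is a hypothetical input file:
#   df = read_csv_text(sql_context, "docs.csv")
#   df.show()
```

Swapping `format="csv"` for `"json"` or `"parquet"` is how the same call covers the other file types mentioned above.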

Generating Word Counts

Now that we know that reading the CSV file or the JSON file returns identical DataFrames, we can use a single method to compute the word counts on the text field. The idea is to break the text of each row in the DataFrame into tokens and return a count of 1 for each token (line 4). This function returns a list of lists, where each inner list contains just the word and a count of 1 ([w, 1]). The tokenized words serve as the keys and the corresponding counts as the values. Then, when you reduce by key, you add up all counts on a per-word (key) basis to get the total count for each word (see line 8). Note that add here is a Python function from the operator module.
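The tokenize-then-reduce idea can be sketched as below. The line numbers mentioned in the text refer to the original listing, not to this sketch. The function names are hypothetical, splitting on whitespace is a simplifying assumption, and tuples are used here in the role of the [w, 1] pairs.

```python
from operator import add


def tokenize_row(row):
    """Return a (word, 1) pair for every token in the row's text field."""
    return [(word, 1) for word in row.text.lower().split()]


def word_counts(df):
    """Return an RDD of (word, total_count) pairs for a DataFrame with a text column."""
    return df.rdd.flatMap(tokenize_row).reduceByKey(add)


# Usage, assuming df is the DataFrame returned by either reader:
#   for word, count in word_counts(df).collect():
#       print(word, count)
```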

As you can see below, accessing the text field is pretty simple if you are dealing with data frames.

And voilà, now you know how to read files with PySpark and use them for some basic processing! For the full source code, please see the links below.

Source Code