Easily Access Pre-trained Word Embeddings with Gensim

What are pre-trained embeddings and why?

Pre-trained word embeddings are vector representation of words trained on a large dataset. With pre-trained embeddings, you will essentially be using the weights and vocabulary from the end result of the training process done by….someone else! (It could also be you)

One benefit of using pre-trained embeddings is that you can hit the ground running without the need for finding a large text corpora which you will have to preprocess and train with the appropriate settings.

Another benefit is the savings in training time. Training on a large corpora could demand high computation power and long training times which may not be something that you want to afford for quick experimentation.

If you want to avoid all of these logistics but still have access to good quality embeddings, you could use pre-trained word embeddings trained on a dataset that fits the domain you are working in.

For example, if you are working with news articles, it may be perfectly fine to use embeddings trained on a Twitter dataset as there is ongoing discussion about current issues as well as a constant stream of news related Tweets.

Accessing pre-trained embeddings is extremely easy with Gensim as it allows you to use pre-trained GloVe and Word2Vec embeddings with minimal effort. The code snippets below show you how.

Here’s the working notebook for this tutorial.

Accessing pre-trained Twitter GloVe embeddings

Here, we are trying to access GloVe embeddings trained on a Twitter dataset. This first step downloads the pre-trained embeddings and loads it for re-use. These vectors are based on 2B tweets, 27B tokens, 1.2M vocab, uncased. The original source of the embeddings can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name below refers to the dimensionality of the vectors.

Once you have loaded the pre-trained model, just use it as you would with any Gensim Word2Vec model. Here are a few examples:

This next example prints the word vectors for trump and obama.

Notice that it prints only 25 values for each word. This is because our vector dimensionality is 25.  For vectors of other dimensionality use the appropriate model names from here or reference the gensim-data GitHub repo:

  • glove-twitter-25 (104 MB)
  • glove-twitter-50 (199 MB)
  • glove-twitter-100 (387 MB)
  • glove-twitter-200 (758 MB)

Accessing pre-trained Wikipedia GloVe embeddings

The GloVe embeddings below was trained on an English Wikipedia dump and English Gigaword 5th Edition dataset. Its dimensionality is 100 and has 6B tokens (uncased). The original source of the embeddings can be found here: https://nlp.stanford.edu/projects/glove/

Once you have loaded the pre-trained model, just use it as usual. Here is a similarity example:

For vectors of other dimensionality you can use the appropriate model names from below or reference the gensim-data repository :

  • glove-wiki-gigaword-50 (65 MB)
  • glove-wiki-gigaword-100 (128 MB)
  • gglove-wiki-gigaword-200 (252 MB)
  • glove-wiki-gigaword-300 (376 MB)

Accessing pre-trained Word2Vec embeddings

So far, you have looked at a few examples using GloVe embeddings. In the same way, you can also load pre-trained Word2Vec embeddings. Here are some of your options for Word2Vec:

  • word2vec-google-news-300 (1662 MB) (dimensionality: 300)
  • word2vec-ruscorpora-300 (198 MB) (dimensionality: 300)

Be warned that the google news embeddings is sizable, so ensure that you have sufficient disk  space before using it.

What can you use pre-trained word embeddings for?

You can use pre-trained word embeddings for a variety of tasks including:

  • Finding word or phrase similarities
  • As feature weights for text classification
  • For creating an embedding layer for neural network based text classification
  • For machine translation
  • Query expansion for search enhancements
  • To create sentence embeddings through vector averaging

The possibilities are actually endless, but you may not always get better results than just a bag-of-words approach. For example, I’ve tried sentence embeddings for a search reranking task and the rankings actually deteriorated. The only way to know if it helps, is to try it and see if it improves your evaluation metrics!

Example of using GloVe embeddings to rank phrases by similarity

Here is an example of using the glove-twitter-25 GloVe embeddings to find phrases that are most similar to the query phrase.

Let’s say we have the following phrases and a corresponding query phrase with several misspellings (missing ‘r’ in barack and ‘a’ instead of ‘e’ in hussein).

The goal here is given the query phrase, rank all other phrases by semantic similarity (using the glove twitter embeddings) and compare that with surface level similarity using the jaccard similarity index. Jaccard has no notion of semantics so it sees a token as is.

The code above splits each candidate phrase as well as the query into a set of tokens (words).  The n_similarity(tokens_1,tokens_2)  takes the average of the word vectors for the query (tokens_2) and the phrase (tokens_1) and computes the cosine similarity using the resulting averaged vectors. The results are later sorted by descending order of cosine similarity scores.

This method of vector averaging assumes that the words within tokens_1 share a common concept which is amplified through word averaging. The same is the case for, tokens_2. As pointed out by Radim (creator of Gensim),  this crude method works surprisingly well.

Below, you will see the ranking of phrases using the word embeddings method vs. the surface similarity method with Jaccard.

Notice that even with misspellings, we are able to produce a decent ranking of most similar phrases using the GloVe vectors. This is because misspellings are common in tweets.  If a misspelled word is present in the vocabulary, then it will have a corresponding weight vector.

In comparison, Jaccard similarity does slightly worse (visually speaking) as all it knows are the tokens given to it and is ignorant to misspellings, nor does it have any notion of semantics.

I hope this article and accompanying notebook will give you a quick start in using pre-trained word embeddings. Can you think of any other use cases for how you would use these embeddings? Leave a comment below with your ideas!

See Also: A step-by-step Word2Vec tutorial with Gensim


How to incorporate phrases into Word2Vec – a text mining approach

Training a Word2Vec model with phrases is very similar to training a Word2Vec model with single words. The difference: you would need to add a layer of intelligence in processing your text data to pre-discover phrases. In this tutorial, you will learn how to create embeddings with phrases without explicitly specifying the number of words that should make-up a phrase (i.e. the n-gram size). This means that you could have phrases with 2 words, 3 words and in some rare cases even 4 or 5.

At a high level, the steps would include:

  • Step 1:  Discovering common phrases in your corpora
  • Step 2: Tagging your corpora with phrases
  • Step 3: Training a Word2Vec model with the newly found phrases

Step 1: Discovering common phrases in your corpora

The first step towards generating embeddings for phrases is recognizing groups of words that make up a phrase. There are many ways to recognize phrases. One way is to use a linguistic heavy approach called “chunking” to detect phrases. NLTK for example, has a chunk capability that you could use.

For this task, I will show you how you can use a text data mining approach with Spark, where you leverage the volume and evidence from your corpora for phrase detection. I like this approach because it’s lightweight, speedy and scales to the amount of data that you need to process.

So here’s how it works. At a high level, the entire corpora of text is segmented using a set of delimiter tokens. This can be special characters, stop words and other terms that can indicate phrase boundary. I specifically used some special characters and a very basic set of English stop words.

Stop words are excellent for splitting text into a set of phrases as they usually consist of connector and filler words used to connect ideas, details, or clauses together in order to make one clear, detailed sentence. You can get creative and use a more complete stop word list or you can even over-simplify this list to make it a minimal stop word list.

The code below shows you how you can use both special characters and stop words to break text into a set of candidate phrases. Check the phrase-at-scale repo for the full source code.

In the code above, we are first splitting text into coarse-grained units using some special characters like comma, period and semi-colon. This is then followed by more fine-grained boundary detection using stop words. When you repeat this process for all documents or sentences in your corpora, you will end up with a huge set of phrases. You can then surface the top phrases using frequency counts and other measures such as Pointwise Mutual Information which can measure strength of association between words in your phrase. For the phrase embedding task, we naturally have to use lots and lots of data, so frequency counts alone would suffice for this task. In some other tasks, I have combined frequency counts with Pointwise Mutual Information to get a better measure phrase quality.

To ensure scalability, I really like using Spark since you can leverage its built-in multi-threading capability on a single machine or use multiple machines to get more CPU power if you really have massive amounts of data to process. The code below shows you the PySpark method that reads your text files, cleans it up, generates candidate phrases, counts frequency of the phrases and filters it down to a set of phrases that satisfy a minimum frequency count. On a 450 MB dataset, run locally, this takes about a minute to discover top phrases and 7 minutes to annotate the entire text corpora with phrases. You can follow instructions in the phrase-at-scale repo to use this PySpark code to discover phrases for your data.

Here is a tiny snapshot of phrases found using the code above on a restaurant review dataset.

Step 2: Tagging your corpora with phrases

There are two ways you can mark certain words as phrases in your corpora. One approach is to pre-annotate your entire corpora and generate a new “annotated corpora”. The other way is to annotate your sentences or documents during the pre-processing phase prior to learning the embeddings. It’s much cleaner to have a separate layer for annotation which does not interfere with the training phase. Otherwise, it will be harder to gauge if your model is slow due to training or annotation.

In annotating your corpora, all you need to do is to somehow join the words that make-up a phrase. For this task, I just use an underscore to join the individual words. So, “…ate fried chicken and onion rings…” would become “…ate fried_chicken and onion_rings…”

Step 3: Training a Phrase2Vec model using Word2Vec

Once you have phrases explicitly tagged in your corpora the training phase is quite similar to any Word2Vec model with Gensim or any other library. You can follow my Word2Vec Gensim Tutorial for a full example on how to train and use Word2Vec.

Example Usage of Phrase Embeddings

The examples below show you the power of phrase embeddings when used to find similar concepts.  These are concepts from the restaurant domain, trained on 450 MB worth of restaurant reviews using Gensim.

Similar and related unigrams, bigrams and trigrams

Notice below that we are able to capture highly related concepts that are unigrams, bigrams and higher order n-grams.

Most similar to 'green_curry':
('panang_curry', 0.8900948762893677)
('yellow_curry', 0.884008526802063)
('panang', 0.8525004386901855)
('drunken_noodles', 0.850254237651825)
('basil_chicken', 0.8400430679321289)
('coconut_soup', 0.8296557664871216)
('massaman_curry', 0.827597975730896)
('pineapple_fried_rice', 0.8266736268997192)

Most similar to 'singapore_noodles':
('shrimp_fried_rice', 0.7932361960411072)
('drunken_noodles', 0.7914629578590393)
('house_fried_rice', 0.7901676297187805)
('mongolian_beef', 0.7796567678451538)
('crab_rangoons', 0.773795485496521)
('basil_chicken', 0.7726351022720337)
('crispy_beef', 0.7671589255332947)
('steamed_dumplings', 0.7614079117774963)

Most similar to 'chicken_tikka_masala':
('korma', 0.8702514171600342)
('butter_chicken', 0.8668922781944275)
('tikka_masala', 0.8444720506668091)
('garlic_naan', 0.8395442962646484)
('lamb_vindaloo', 0.8390569686889648)
('palak_paneer', 0.826908528804779)
('chicken_biryani', 0.8210495114326477)
('saag_paneer', 0.8197864294052124)

Most similar to 'breakfast_burrito':
('huevos_rancheros', 0.8463341593742371)
('huevos', 0.789624035358429)
('chilaquiles', 0.7711247801780701)
('breakfast_sandwich', 0.7659544944763184)
('rancheros', 0.7541004419326782)
('omelet', 0.7512155175209045)
('scramble', 0.7490915060043335)
('omlet', 0.747859001159668)

Most similar to 'little_salty':
('little_bland', 0.745500385761261)
('little_spicy', 0.7443351149559021)
('little_oily', 0.7373550534248352)
('little_overcooked', 0.7355216145515442)
('kinda_bland', 0.7207454442977905)
('slightly_overcooked', 0.712611973285675)
('little_greasy', 0.6943882703781128)
('cooked_nicely', 0.6860566139221191)

Most similar to 'celiac_disease':
('celiac', 0.8376057744026184)
('intolerance', 0.7442486882209778)
('gluten_allergy', 0.7399739027023315)
('celiacs', 0.7183824181556702)
('intolerant', 0.6730632781982422)
('gluten_free', 0.6726624965667725)
('food_allergies', 0.6587174534797668)
('gluten', 0.6406026482582092)

Similar concepts expressed differently

Here you will see that similar concepts that are expressed differently can also be captured.

Most similar to 'reasonably_priced':
('fairly_priced', 0.8588327169418335)
('affordable', 0.7922118306159973)
('inexpensive', 0.7702735066413879)
('decently_priced', 0.7376087307929993)
('reasonable_priced', 0.7328246831893921)
('priced_reasonably', 0.6946456432342529)
('priced_right', 0.6871092915534973)
('moderately_priced', 0.6844340562820435)

Most similar to 'highly_recommend':
('definitely_recommend', 0.9155156016349792)
('strongly_recommend', 0.86533123254776)
('absolutely_recommend', 0.8545517325401306)
('totally_recommend', 0.8534528017044067)
('recommend', 0.8257364630699158)
('certainly_recommend', 0.785507082939148)
('highly_reccomend', 0.7751532196998596)
('highly_recommended', 0.7553941607475281)


In summary, to generate embeddings of phrases, you would need to add a layer for phrase discovery before training a Word2Vec model. If you have lots of data, a text data mining approach has the benefit of being lightweight and scalable, without compromising on quality. In addition, you wouldn’t have to specify a phrase size in advance or be limited by a specific vocabulary. A linguistic heavy approach gives you a lot more specificity in terms of parts of speech and the types of phrases (e.g. noun phrase vs. verb phrase) that you are dealing with. If you really need that information, then you can consider a chunking approach over a text mining approach.


Here are some resources that might come handy to you:


Gensim Word2Vec Tutorial – Full Working Example

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell who you are”.

If you have two words that have very similar neighbors (meaning: the context in which it’s used is about the same), then these words are probably quite similar in meaning or are at least related. For example, the words shocked, appalled and astonished are usually used in a similar context.

“The meaning of a word can be inferred by the company it keeps”

Using this underlying assumption, you can use Word2Vec to:

  • Surface similar concepts
  • Find unrelated concepts
  • Compute similarity between two words and more!

Down to business

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec (in python) and actually get it to work! I‘ve long heard complaints about poor performance, but it really is a combination of two things: (1) your input data and (2) your parameter settings. Check out the Jupyter Notebook if you want direct access to the working example, or read on to get more context.

Side note: The training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

Imports and logging

First, we start with our imports and get logging established:


Next, is finding a really good dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data in the relevant domain. For example, if your goal is to build a sentiment lexicon, then using a dataset from the medical domain or even wikipedia may not be effective. So, choose your dataset wisely. As Matei Zaharia says,

It’s your data, stupid

That was said in the context of data quality, but it’s not just quality it’s also using the right data for the task.

For this tutorial, I am going to use data from the OpinRank dataset from some of my Ph.D work. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97 MB compressed and 229 MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

Now, let’s take a closer look at this data below by printing the first line.

You should see the following:

You can see that this is a pretty good full review with many words and that’s what we want. We have approximately 255,000 such reviews in this dataset.

To avoid confusion, the Gensim’s Word2Vec tutorial says that you need to pass a list of tokenized sentences as the input to Word2Vec. However, you can actually pass in a whole review as a sentence (i.e. a much larger size of text), if you have a lot of data and it should not make much of a difference. In the end, all we are using the dataset for is to get all neighboring words (the context) for a given target word.

Read files into a list

Now that we’ve had a sneak peak of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below, that I am directly reading the compressed file. I’m also doing a mild pre-processing of the reviews using gensim.utils.simple_preprocess (line). This does some basic pre-processing such as tokenization, lowercasing, etc. and returns back a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.

Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step. So, we are essentially passing on a list of lists. Where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

The step above, builds the vocabulary, and starts training the Word2Vec model. We will get to what these parameters actually mean later in this article. Behind the scenes, what’s happening here is that we are training a neural network with a single hidden layer where we train the model to predict the current word based on the context (using the default neural architecture). However, we are not going to use the neural network after training! Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. The resulting learned vector is also known as the embeddings. You can think of these embeddings as some features that describe the target word. For example, the word king may be described by the gender, age, the type of people the king associates with, etc.

Training on the Word2Vec OpinRank dataset takes several minutes so sip a cup of tea, and wait patiently.

Some results!

Let’s get to the fun stuff already! Since we trained on user reviews, it would be nice to see similarity on some adjectives. This first example shows a simple look up of words similar to the word ‘dirty’. All we need to do here is to call the most_similar function and provide the word ‘dirty’ as the positive example. This returns the top 10 similar words.

Gensim most similar to the word "dirty"

Ooh, that looks pretty good. Let’s look at more.

Similar to polite:

Similar to france:

Similar to shocked:

Overall, the results actually make sense. All of the related words tend to be used in similar contexts.

Now you could even use Word2Vec to compute similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.

Under the hood, the above three snippets compute the cosine similarity between the two specified words using word vectors (embeddings) of each. From the scores above, it makes sense that dirty is highly similar to smelly but dirty is dissimilar to clean. If you do a similarity between two identical words, the score will be 1.0 as the range of the cosine similarity can go from [-1 to 1] and sometimes bounded between [0,1] depending on how it’s being computed. You can read more about cosine similarity scoring here.

You will find more examples of how you could use Word2Vec in my Jupyter Notebook.

A closer look at the parameter settings

To train the model earlier, we had to set some parameters. Now, let’s try to understand what some of them mean. For reference, this is the command that we used to train the model.


The size of the dense vector to represent each token or word (i.e. the context or neighboring words). If you have limited data, then size should be a much smaller value since you would only have so many unique neighbors for a given word. If you have lots of data, it’s good to experiment with various sizes. A value of 100–150 has worked well for me for similarity lookups.


The maximum distance between the target word and its neighboring word. If your neighbor’s position is greater than the maximum window width to the left or the right, then, some neighbors would not be considered as being related to the target word. In theory, a smaller window should give you terms that are more related. Again, if your data is not sparse, then the window size should not matter too much, as long as it’s not overly narrow or overly broad. If you are not too sure about this, just use the default value.


Minimium frequency count of words. The model would ignore words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so its best to get rid of those. Unless your dataset is really tiny, this does not really affect the model in terms of your final results. The settings here probably has more of an effect on memory usage and storage requirements of the model files.


How many threads to use behind the scenes?


Number of iterations (epochs) over the corpus. 5 is a good starting point. I always use a minimum of 10 iterations.

When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million stackoverflow questions and answers, you could find related tags and recommend those for exploration. You can do this by treating each set of co-occuring tags as a “sentence” and train a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.

See Also: How to use pre-trained embeddings with Gensim?


Papers to read

These are some of the recommended readings: