Word2Vec is a widely used word representation technique that uses neural networks under the hood. The resulting word representation or embeddings can be used to infer semantic similarity between words and phrases, expand queries, surface related concepts and more. The sky is the limit when it comes to how you can use these embeddings for different NLP tasks.
In this article, we will look at how the different neural network architectures for training a Word2Vec model behave in practice. The idea here is to help you make an informed decision on which architecture to use given the problem you are trying to solve.
Word2Vec in Brief
With Word2Vec, we train a neural network with a single hidden layer to predict a target word based on its context (neighboring words). The assumption here is that the meaning of a word can be inferred by the company it keeps.
In the end, the goal of training the neural network is not to use the resulting network itself. Instead, we want to extract the weights of the hidden layer, with the belief that these weights encode the meaning of the words in the vocabulary.
Think of this process as extracting a table of weights with one row per word in the vocabulary, where each row encodes some of the meaning of that word (see the example in Figure 1).

CBOW, SkipGram & Subword Neural Architectures
In training a Word2Vec model, there are actually different ways to represent the neighboring words used to predict a target word. The original Word2Vec paper introduced two architectures: one known as CBOW (continuous bag-of-words) and the other called SkipGram.

CBOW and SkipGram
The CBOW model learns to predict a target word from all the words in its neighborhood. The sum of the context vectors is used to predict the target word. The neighboring words taken into consideration are determined by a pre-defined window size surrounding the target word.
The SkipGram model, on the other hand, learns to predict a word based on a neighboring word. To put it simply, given a word, it learns to predict another word in its context.
SkipGram with Subwords (Char n-grams)
More recently, building on the SkipGram idea, a more granular approach was introduced in which a bag of character n-grams (also known as subwords) is used to represent a word. As shown in Figure 3, each word is represented by the sum of its n-gram vectors.

The idea behind leveraging character n-grams is twofold. First, it is said to help with morphologically rich languages. For example, in languages like German, certain phrases are expressed as a single word: the phrase table tennis is written as Tischtennis.
If you learned the representations of tennis and Tischtennis separately, it would be harder to infer that they are in fact related. However, by learning the character n-gram representations of these words, tennis and Tischtennis now share overlapping n-grams, making them closer in vector space.
Another use of character n-gram representations is to infer the meaning of unseen words. For example, if you are looking up the similarity of filthy and your corpus does not contain this word, you can still infer its meaning from its subwords, such as filth.
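To make the subword idea concrete, here is a minimal sketch of how character n-grams could be extracted from a word. The `<` and `>` boundary markers and the 3-6 range follow the fastText convention; the helper function itself is purely illustrative and not part of any library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract character n-grams from a word, using < and > as word boundary
    markers (the fastText convention)."""
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

# tennis and Tischtennis share n-grams such as 'ten', 'enn' and 'nnis'
shared = char_ngrams("tennis") & char_ngrams("Tischtennis")
print(sorted(shared))
```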
Now that you get the intuition behind these different architectures, it’s time to get to the practical side of things. While these different architectures have been tested in different applications from a research perspective, it’s always good to have an understanding of how these behave in practice, using a domain specific dataset.
Training Dataset
For this comparison, we will use the OpinRank dataset which we previously used in the Gensim tutorial. It has about 255,000 user reviews of hotels and is ~97MB compressed. The dataset can be downloaded directly here.
Also note that I used Gensim to train the CBOW, SkipGram and SkipGram with Subword Information (SkipGramSI) models. This was done on my local machine with the following settings (a minimal training sketch follows the list):
- dimensionality=150
- window size=10
- min word count=2
- training epochs=10
- ngrams=3-6 (for SkipGramSI only)
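For reference, here is a minimal sketch of what that training setup might look like with Gensim. It assumes Gensim 4.x and a `documents` variable that holds the tokenized reviews as lists of strings; `workers=4` is my addition and not one of the settings above.

```python
from gensim.models import Word2Vec, FastText

# documents: an iterable of tokenized reviews, e.g. [["the", "room", "was", "clean"], ...]
common = dict(vector_size=150, window=10, min_count=2, epochs=10, workers=4)

cbow = Word2Vec(documents, sg=0, **common)                           # CBOW
skipgram = Word2Vec(documents, sg=1, **common)                       # SkipGram
skipgram_si = FastText(documents, sg=1, min_n=3, max_n=6, **common)  # SkipGram + subwords
```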
Training Time
First, let’s look at the differences in training time between the three architectures.

Notice that CBOW is the fastest to train and SkipGramSI the slowest. At least none of them takes hours on a decently sized dataset.
SkipGram takes longer than CBOW because, for every target word, it makes a separate prediction for each word in its context rather than a single prediction from the combined context. By including character-level n-grams, SkipGramSI adds yet another layer of complexity and thus takes even more time.
Task 1: Finding Similar Concepts
Let’s look at how CBOW, SkipGram and SkipGramSI differ when it comes to finding the most similar concepts. Figures 5a, 5b, 5c and 5d show the top 8 most similar concepts for various terms.
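A minimal sketch of how such lists can be produced with trained Gensim models (the model names follow the training sketch above):

```python
# Top 8 most similar words to a query term, for each of the three models
for name, model in [("CBOW", cbow), ("SkipGram", skipgram), ("SkipGramSI", skipgram_si)]:
    print(name, model.wv.most_similar("hotel", topn=8))
```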
Most Similar to ‘hotel’ and ‘room’

Most Similar to ‘bathroom’

Visually speaking, CBOW is the most consistent in bringing up conceptually relevant, sometimes interchangeable concepts.
With SkipGram, it’s hit or miss. In some cases it brings up neighboring terms, as seen in Figure 5a; in others it brings up conceptually related and sometimes interchangeable words, as in Figure 5c. Given this behavior, for tasks like query expansion and synonym curation, CBOW may be the better option.
SkipGramSI behaves a bit differently for this task. It tends to bring up near duplicates of the input word (see Figures 5a, 5b and 5c) as well as compound words that contain the input word (Figure 5d). This is not necessarily bad, especially if you want to surface potential misspellings of a word or compound words containing a specific stem (e.g. firefly and gunfire if the input is fire).
Most Similar to ‘cheap’

Most Similar to ‘fire’

Task 2: Finding Similarity Between Words
Now, let’s look at how the three models behave when it comes to word to word similarity.
Figure 6 shows two words, labeled a_word and b_word, along with a manual classification of how they are related in the concept_type column.
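A sketch of how these pairwise scores could be computed, assuming the word pairs live in a pandas DataFrame `df` with the column names from Figure 6 (the DataFrame itself is not shown here):

```python
# Cosine similarity between each (a_word, b_word) pair, for each model
# (for CBOW and SkipGram this assumes both words are in the vocabulary)
for name, model in [("CBOW", cbow), ("SkipGram", skipgram), ("SkipGramSI", skipgram_si)]:
    df[name] = [model.wv.similarity(a, b) for a, b in zip(df["a_word"], df["b_word"])]
```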

Neighboring concepts. Notice that SkipGram does a good job at detecting neighboring concepts, where the cosine similarity between the word vectors is above 0.6 (rows 0, 1, 2). In contrast, CBOW and SkipGramSI are less effective at this.
Synonymous concepts. In terms of capturing synonymous concepts, all three models seem to be doing a reasonable job, with the added advantage that SkipGramSI could produce a higher similarity score when there are overlapping n-grams.
Near duplicates. Compared to CBOW and SkipGram, SkipGramSI does a good job in detecting near duplicates. This is not surprising as SkipGramSI uses character level embeddings. This means that even though there may be an unseen word or a word with misspellings, if it shares overlapping n-grams with a seen word, SkipGramSI can “guess” how related the concepts are.
Unfortunately, unless misspelled words are present in the vocabulary of CBOW and SkipGram, similarity between near duplicates for these architectures can be quite unreliable.
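A small sketch of this difference, assuming the models from the training sketch (the word filthy is only for illustration):

```python
# The subword model can compose a vector for a word from its character n-grams,
# even if the exact word never appeared in the training corpus
print("filthy" in skipgram_si.wv.key_to_index)       # may be False if unseen
print(skipgram_si.wv.similarity("filthy", "filth"))  # still works via shared n-grams

# A plain Word2Vec model has no such fallback: skipgram.wv["filthy"] raises a
# KeyError if the word is out of vocabulary
```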
Task 3: Phrase and Sentence Similarity
Word embeddings can be used to compute similarity between phrases and sentences. One way to do this is to average the word vectors for individual words in a phrase / sentence. The intuition here is that we are inferring the general meaning of the phrase by averaging the word vectors.
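A minimal sketch of this averaging approach, assuming one of the trained models from the earlier sketches; the example phrases are made up:

```python
import numpy as np

def phrase_vector(model, phrase):
    """Average the vectors of the in-vocabulary tokens in a phrase."""
    vectors = [model.wv[w] for w in phrase.lower().split() if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else None

def phrase_similarity(model, phrase_a, phrase_b):
    """Cosine similarity between the averaged vectors of two phrases."""
    a, b = phrase_vector(model, phrase_a), phrase_vector(model, phrase_b)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(phrase_similarity(skipgram, "clean and comfortable room", "the room was spotless"))
```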
As this is slightly harder to analyze visually, I generated a small dataset in English with labels that indicate if two phrases should be considered similar as shown in Figure 7.

The last column is a binary value, with 1 indicating similar and 0 indicating dissimilar. Using these labels, we are going to compute precision, recall and f-score to evaluate this phrase similarity task.
Just to recap, precision tells us what percentage of the phrases predicted as similar are in fact similar. Recall, on the other hand, tells us what percentage of all the truly similar phrases were captured. Ideally, we want a balance between the two; that’s where the f-score comes in.
If two phrases have a cosine similarity > 0.6, they are considered similar; otherwise, not (stricter thresholds lead to similar conclusions). Figure 8 shows how the three models perform on this phrase similarity task.
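A sketch of this evaluation, reusing the phrase_similarity helper from the previous sketch and assuming the labeled pairs are in a pandas DataFrame `phrases` with hypothetical phrase_a, phrase_b and similar columns:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

threshold = 0.6
scores = [phrase_similarity(skipgram, a, b)
          for a, b in zip(phrases["phrase_a"], phrases["phrase_b"])]
predicted = [1 if s > threshold else 0 for s in scores]

print("precision:", precision_score(phrases["similar"], predicted))
print("recall:   ", recall_score(phrases["similar"], predicted))
print("f-score:  ", f1_score(phrases["similar"], predicted))
```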


The similar column is the gold standard.
Based on Figures 8 and 9, the following observations can be made:
- SkipGram has the highest recall on the similarity task with word averaging, which means SkipGram is able to capture many semantically similar phrases. It also provides a good balance between precision and recall.
- SkipGramSI does not do so well on the phrase similarity task for English. It mostly finds the phrases to be dissimilar. This could be because it tends to mostly capture and encode words that share n-grams.
- Word embeddings are sentiment agnostic: they capture conceptual similarity but not necessarily sentiment similarity. This is my observation in general and can also be seen in rows 1 and 2 of Figure 9, which are conceptually similar but not similar in sentiment.
Final Thoughts
While word embeddings are useful in various NLP tasks (they can be trained fairly quickly, capture related concepts, detect similar phrases and more), they do have their limitations.
For example, while Word2Vec-based embeddings do a good job of capturing conceptual similarity between words and phrases, they don’t necessarily capture fine-grained semantics such as sentiment orientation. This would require additional tweaking, as explored in the following paper.
Also, you cannot directly replace a word with the similar word closest in vector space. The two words could share a syntagmatic relationship (they tend to co-occur, like cheap and hotel) or a paradigmatic relationship (they are interchangeable, like cheap and affordable), and only the latter supports substitution. Because we are not leveraging directional information when forming these embeddings, it’s hard to determine which of the two relationships we are dealing with without adding another layer of processing.
Another thing to keep in mind is that the quality of the embeddings is only as good as the data they are fed. You’re going to have a lot of trouble if you train with sparse or low-quality data, where the neighbors of words, the vocabulary and the contextual diversity are limited.