One of the questions that often comes up is what’s the difference between fastText and Word2Vec? Aren’t they both the same?
Yes and no. They are conceptually the same, but there is a minor difference—fastText operates at a character level but Word2Vec operates at a word level. Why this difference?
Before we dive into fastText, let’s quickly recap what Word2Vec is. With Word2Vec, we train a neural network with a single hidden layer to predict a target word based on its context (neighboring words). The assumption is that the meaning of a word can be inferred by the company it keeps. Under the hood, you can train it with one of two neural architectures: CBOW, which predicts a word from its context, and SkipGram, which predicts the context from a word.
And as you know, after the training phase using either architecture, you can use the learned vectors in creative ways: for example, in recommendations, synonym extraction, and more. The SkipGram architecture from Word2Vec was taken one level deeper, to operate at a character n-gram level, essentially using a bag of character n-grams. This is fastText.
What is a character n-gram?
A character n-gram is a sequence of adjacent characters within a window of a given size. It’s very similar to a word n-gram, only the window slides over characters rather than words. And a bag of character n-grams in the fastText case means a word is represented by the sum of its character n-grams. If n=2 and your word is "this", the resulting n-grams (with < and > marking the start and end of the word) would be:
<t th hi is s> this
The last item is a special sequence: the whole word itself, which gets its own vector in addition to its n-grams. Here’s a visual example of how the neighboring word "this" is modeled in learning the representation for the word "visual" (remember: the meaning of a word is inferred by the company it keeps).
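The n-gram extraction described above can be sketched in a few lines of Python. This is only an illustration, not fastText's actual implementation: the function names `char_ngrams` and `bag_of_ngrams` are made up for this example, and the whole-word sequence is written with boundary markers (`<this>`), as the fastText paper does.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word`, using '<' and '>' as boundary markers."""
    padded = f"<{word}>"
    # Slide a window of size n over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_of_ngrams(word, n):
    """The n-grams plus the whole word as a special sequence (hypothetical helper)."""
    return char_ngrams(word, n) + [f"<{word}>"]


print(char_ngrams("this", 2))    # ['<t', 'th', 'hi', 'is', 's>']
print(bag_of_ngrams("this", 2))  # same list, followed by '<this>'
```

The boundary markers are what let the model tell a prefix or suffix apart from the same characters in the middle of a word.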
The intuition behind fastText is that by using a bag of character n-grams, you can learn representations for morphologically rich languages.
For example, in languages such as German, certain phrases are expressed as a single word. The phrase table tennis, for example, is written as Tischtennis. In plain vanilla Word2Vec you’ll learn the representations of "tennis" and "tischtennis" separately. This makes it harder to infer that "tennis" and "tischtennis" are in fact related.
However, by learning the character n-gram representations of these words, "tennis" and "tischtennis" will now share overlapping n-grams, making them closer in vector space. This, in turn, makes it easier to surface related concepts.
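To make the overlap concrete, here is a small sketch that lists the character n-grams "tennis" and "tischtennis" have in common. It assumes n ranges from 3 to 6 (an assumption for this example, in line with the n<=6 finding mentioned later in this post), and `char_ngrams` is an illustrative helper, not a fastText function.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams for n in [min_n, max_n], with '<' and '>' boundaries."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


tennis = char_ngrams("tennis")
tischtennis = char_ngrams("tischtennis")
shared = tennis & tischtennis

print(sorted(shared))
print(f"{len(shared)} of the {len(tennis)} n-grams of 'tennis' also occur in 'tischtennis'")
```

Only the n-grams that start with the word-boundary marker (such as `<te`) are missing from the overlap, because "tennis" sits at the end of "tischtennis" rather than at the start.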
Another use of character n-gram representations is to infer the meaning of unseen words. For example, if you are looking for the similarity of "courageous" and your corpus does not contain this word, you can still infer its meaning from its subwords, such as "courage".
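The mechanism for unseen words can be sketched as follows. Everything here is a toy stand-in: the n-gram vectors are random rather than trained, the "training vocabulary" is three hypothetical words, and unknown n-grams are simply skipped (fastText itself hashes n-grams into a fixed number of buckets). The point is only to show how a vector for a word that was never seen can be assembled from the n-grams it shares with seen words.

```python
import math
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams for n in [min_n, max_n], with '<' and '>' boundaries."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


DIM = 300
rng = random.Random(0)

# Stand-in for a trained n-gram table: random vectors for the n-grams of the
# words "seen" during training. A real model would learn these vectors.
training_vocab = ["courage", "table", "tennis"]  # hypothetical vocabulary
ngram_vectors = {}
for seen_word in training_vocab:
    for gram in char_ngrams(seen_word):
        if gram not in ngram_vectors:
            ngram_vectors[gram] = [rng.gauss(0, 1) for _ in range(DIM)]


def word_vector(word):
    """Sum the vectors of the word's known n-grams; unknown n-grams are skipped."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            vec = [a + b for a, b in zip(vec, ngram_vectors[gram])]
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


unseen = word_vector("courageous")  # never appeared in training_vocab
print("courageous vs courage:", cosine(unseen, word_vector("courage")))
print("courageous vs table:  ", cosine(unseen, word_vector("table")))
```

Because "courageous" shares most of the n-grams of "courage" and none of the n-grams of "table", its assembled vector ends up much closer to "courage", even though the word itself was never part of the vocabulary.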
Some Interesting Tidbits
- From the original fastText paper, the authors found that the use of character n-grams was more useful in morphologically rich languages such as Arabic, German and Russian than for English (evaluated using rank correlation with human judgment). I can attest to this as I did try subword information for English similarity and the results were not as good as using CBOW.
- The authors found that using n-grams with n<=6 worked best. But the optimal n-gram size really depends on the task and language and should be tuned appropriately.
- For analogy tasks, subword information significantly improved syntactic analogy tasks but did not help with semantic (meaning) analogy tasks.
Summing up fastText vs. Word2Vec
In summary, conceptually Word2Vec and fastText have the same goal: to learn vector representations of words. But unlike Word2Vec, which under the hood uses words to predict words, fastText operates at a more granular level with character n-grams, where each word is represented by the sum of its character n-gram vectors.
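One way to see the difference compactly is through the score each model assigns to a word w and a context word c. The notation below follows the fastText paper and is not used elsewhere in this post, so treat the symbols as assumptions of this sketch: u_w is a word vector, v_c a context vector, G_w the bag of character n-grams of w, and z_g the vector of n-gram g.

```latex
% Word2Vec (SkipGram): one vector u_w per word
s(w, c) = \mathbf{u}_w^{\top} \mathbf{v}_c

% fastText: sum over the bag of character n-grams G_w of word w,
% with one vector z_g per n-gram g
s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c
```

Everything else in training stays the same; only the representation of the word on the left-hand side changes.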
Is fastText better than Word2Vec? In my opinion, no. It does better on some tasks and maybe in non-English languages. But for tasks in English, I’ve found Word2Vec to be just as good or better.