Gensim Word2Vec Tutorial – Full Working Example

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred from the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are”. If two words have very similar neighbors (meaning: the contexts in which they are used are about the same), then these words are probably quite similar in meaning or are at least related. For example, the words…
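The "company it keeps" assumption can be illustrated without any library. The sketch below is not Word2Vec itself (which learns dense vectors with a neural model); it only counts neighboring words in a small window and compares the counts with cosine similarity. The toy sentences, the function names, and the window size of 2 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """For every word, count the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus: "cat" and "dog" share most of their neighbors.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cars",
]
vectors = context_vectors(sentences)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
print(sim_cat_dog, sim_cat_milk)
```

In this toy corpus, "cat" and "dog" score higher than "cat" and "milk" because they occur next to the same words, which is the intuition the paragraph describes.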

Reading CSV & JSON files in Spark – Word Count Example

One of the really nice things about Spark is its ability to read input files of different formats right out of the box. Though this is a nice-to-have feature, reading files in Spark is not always consistent and seems to keep changing across Spark releases. This article will show you how to read CSV and JSON files to compute word counts on selected fields. This example assumes that you are using Spark 2.0+ with Python…

What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). This article provides an intuitive explanation of how ROUGE works.
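As a rough illustration of the comparison step, here is a minimal sketch of the unigram-overlap variant commonly called ROUGE-1, scored against a single reference. The whitespace tokenization, the function name, and the example sentences are assumptions for illustration; real ROUGE toolkits add stemming, multiple references, and other variants.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word,
    # so a repeated word is credited only as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


# Hypothetical system summary and human reference.
recall, precision = rouge_1(
    "the cat was found under the bed",
    "the cat was under the bed",
)
print(recall, precision)
```

Recall asks how much of the reference the system summary covers, while precision asks how much of the system summary is actually in the reference; here every reference word is covered, but the extra word "found" lowers precision.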

What is text similarity?

When talking about text similarity, different people have slightly different notions of what text similarity means. In essence, the goal is to compute how ‘close’ two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity and the latter is referred to as lexical similarity. Although the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity…
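One simple way to measure surface closeness is Jaccard similarity over word sets: the number of shared words divided by the number of distinct words in either text. The excerpt does not say which measure the article uses, so treat this as an assumed example; the function name and the sample sentences are hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Lexical similarity: shared distinct words / all distinct words."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


# The two sentences share almost all words but mean different things,
# which shows why lexical similarity is only a proxy for semantic similarity.
score = jaccard_similarity(
    "the cat ate the mouse",
    "the mouse ate the cat food",
)
print(score)
```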

Abstractive Summarization Papers

While much work has been done in the area of extractive summarization, abstractive summarization has received limited study, as it is much harder to achieve (going by the definition of true abstraction). This page contains a very small collection of summarization methods that are non-extractive…

User Review Datasets

If you are looking for user review datasets for opinion analysis or sentiment analysis tasks, there are quite a few out there. The datasets below contain reviews from Rotten Tomatoes, Amazon, TripAdvisor, Yelp, Edmunds.com and so on. Here are some of the many datasets available. Each entry lists the dataset, its domain, a description, and who it is courtesy of.

Dataset: Movie Reviews Data Set
Domain: Movies
Description: This is a collection of movie reviews used for various opinion analysis tasks; you will find reviews split into positive and negative classes as…
