Home

Reading CSV & JSON files in Spark – Word Count Example
One of the really nice things about Spark is its ability to read input files of different formats right out of the box. This article shows how to read CSV and JSON files and compute word counts on selected fields in PySpark.
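As a rough illustration of the idea (word counts on one selected field, read from both CSV and JSON), here is a minimal sketch in plain Python. It deliberately uses only the standard library rather than Spark, so it is a stand-in for the concept and not the article's PySpark code. The field name `title`, the sample rows, and the helper `count_words` are all made up for this example.

```python
import csv
import io
import json
from collections import Counter


def count_words(texts):
    """Count lowercase, whitespace-separated words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


# Hypothetical CSV input with a header row; only the "title" field is counted.
csv_data = io.StringIO("id,title\n1,Spark reads CSV\n2,Spark reads JSON\n")
csv_counts = count_words(row["title"] for row in csv.DictReader(csv_data))

# Hypothetical JSON Lines input (one JSON object per line), same field.
json_data = io.StringIO(
    '{"id": 1, "title": "Spark reads CSV"}\n'
    '{"id": 2, "title": "Spark reads JSON"}\n'
)
json_counts = count_words(
    json.loads(line)["title"] for line in json_data if line.strip()
)

print(csv_counts.most_common())
print(json_counts.most_common())
```

In Spark the same steps (load the file, select a field, split it into words, aggregate counts) would run as distributed operations instead of a local loop.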
Finally, transitioned from Drupal to WordPress
I have been trying to make the transition from Drupal to WordPress for a very long time and only managed to get things moving today. What a relief! The transition wasn't as easy as I thought...
What is ROUGE and how does it work for evaluation of summaries?
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). This article provides an intuitive explanation of how ROUGE works.
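To make the comparison concrete, here is a minimal sketch of ROUGE-N recall against a single reference: the fraction of the reference's n-grams that also appear in the system summary. It is a simplification and not the official toolkit. It assumes lowercase whitespace tokenization, no stemming or stopword removal, and one reference rather than a set. The function name `rouge_n_recall` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate)
    ref = ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


system_summary = "the cat was found under the bed"
reference_summary = "the cat was under the bed"

print(rouge_n_recall(system_summary, reference_summary, n=1))  # unigram recall
print(rouge_n_recall(system_summary, reference_summary, n=2))  # bigram recall
```

Here every reference unigram appears in the system summary, so unigram recall is 1.0, while the bigram "was under" is missing, so bigram recall is 4 out of 5.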
Abstractive Summarization Papers
While much work has been done on extractive summarization, abstractive summarization has seen limited study because it is much harder to achieve (going by the definition of true abstraction). This page contains a very small collection of summarization methods that are non-extractive...
User Review Datasets
If you are looking for review datasets for opinion analysis tasks, there are quite a few out there. Please see an updated list in my text mining blog. Here are some that I am familiar with...
Useful tips on using MEAD Summarization Toolkit
These are some handy notes for MEAD. What is MEAD? MEAD is a publicly available framework for summarization. It is not really an 'algorithm'. By default (I guess when it was first implemented) it was based on a centroid-based approach...
Basics of Setting up ROUGE Toolkit for Evaluation of Summarization Tasks
This article talks about how to work with ROUGE for evaluation of summarization tasks. The original ROUGE toolkit uses a Perl implementation which is really hard to understand, so I decided to piece together some information that may be helpful to others...
A Practical Guide to Using Terrier 2.2
A guide to searching fields, incremental indexing, parameter control and more… There are different text retrieval toolkits out there that you can use to build search engines or simply test your new search algorithm. I have used Lucene, Lemur and Terrier. Lucene is a snap when it comes to building an application layer over the search functionality. It is fast