Dataset

Dataset for Text Mining and NLP tasks

Stack Overflow Dataset

This dataset contains 20,000 Stack Overflow questions in JSON format. Each question record has 19 attributes.

Reading the dataset from Python is really simple.
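As a minimal sketch of such a reader: the exact file layout is not stated here, so the function below assumes the file is either a single JSON array of question objects or JSON Lines (one object per line). The function name load_questions is hypothetical.

```python
import json
from pathlib import Path


def load_questions(path):
    """Load question records from a JSON array file or a JSON Lines file.

    Assumption: the dataset uses one of these two layouts.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        # The whole file is a single JSON array of question objects.
        return json.loads(text)
    # Otherwise assume one JSON object per non-empty line (JSON Lines).
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Calling load_questions with the path of the downloaded file should return a list of dictionaries, one per question, whose keys are the dataset's attributes.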

Links

  • Stack Overflow Dataset in Json
  • Keyword extraction using this dataset
  • Micropinion Generation Dataset

    Dataset Description

    This dataset is based on user reviews from CNET. The reviews cover products from various categories such as TVs, cell phones and GPS devices. There are two versions of the dataset: “raw” and “pre-processed”. The “raw” folder has the original reviews from CNET without any pre-processing (each review is delimited by “$$;”). The “pre-processed” folder contains sentences from the full review section of the reviews. All the pros and cons from the original reviews are omitted in this version, and this was the version used for summarization (see Section 5.1 of the paper). In addition, in the pre-processed version a simple sentence splitter was used to split the review texts into sentences.
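Since each raw review is delimited by “$$;”, a raw file can be split into individual reviews with a few lines of Python. This is only a sketch: the function name and the sample text are made up for illustration, and it assumes the delimiter never appears inside a review.

```python
def split_raw_reviews(raw_text, delimiter="$$;"):
    """Split the contents of a raw file into individual review strings."""
    pieces = raw_text.split(delimiter)
    # Trim whitespace and drop empty pieces (e.g. after a trailing delimiter).
    return [piece.strip() for piece in pieces if piece.strip()]


# Made-up sample text, not taken from the dataset.
sample = "Great picture quality. $$; Battery dies fast.\nNot worth it. $$;"
for review in split_raw_reviews(sample):
    print(repr(review))
```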

    Downloads

    Citing Dataset

    If you use this dataset for your own research please cite the following to mark the dataset:

    Ganesan, K. A., C. X. Zhai, and Evelyne Viegas, “Micropinion Generation: An Unsupervised Approach To Generating Ultra-Concise Summaries Of Opinions”, Proceedings of the 21st International Conference on World Wide Web 2012 (WWW ’12).

     

    User Review Datasets

    If you are looking for user review datasets for opinion analysis or sentiment analysis tasks, there are quite a few out there. The datasets below contain reviews from Rotten Tomatoes, Amazon, TripAdvisor, Yelp, Edmunds.com and so on. Here are some of the many datasets available:

    Dataset: Movie Reviews Data Set
    Domain: Movies
    Description: This is a collection of movie reviews used for various opinion analysis tasks. The reviews are split into positive and negative classes, and also into subjective and objective sentences. This dataset was initially used to predict polarity ratings (+ve/-ve).
    Courtesy of: Pang & Lee

    Dataset: Multi-Domain Sentiment Dataset
    Domain: Products (books, DVDs, ...)
    Description: Product reviews from Amazon.com covering various product types (such as books, DVDs and musical instruments). The data has been split into positive and negative reviews. There are more than 100,000 reviews in this dataset. The reviews come with corresponding star ratings. This dataset was initially used to predict polarity ratings (+ve/-ve).
    Courtesy of: Blitzer et al.

    Dataset: LARA Review Dataset
    Domain: Hotels & Products
    Description: Reviews from Amazon.com and TripAdvisor. It contains attributes such as author name, content, date and the ratings. This dataset was initially used to decompose user reviews into preference ratings on aspects.
    Courtesy of: Wang et al.

    Dataset: Opinosis Review Dataset
    Domain: Hotels, Cars, Electronics
    Description: Topic-related sentences extracted from user reviews. There are 51 topics with approximately 100 sentences each (on average). The reviews were obtained from multiple sources: Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics). This dataset was used for text summarization of opinions.
    Courtesy of: Ganesan et al.

    Dataset: OpinRank Tripadvisor and Edmunds.com Dataset
    Domain: Hotels & Cars
    Description: Reviews of cars and hotels collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews). For cars, the extracted fields include dates, author names, favorites and the full textual review. For hotels, the fields include date, review title and the full review. The dataset also includes gold standard judgments for ranking. This dataset was initially used for opinion-based entity ranking.
    Courtesy of: Ganesan & Zhai

    Dataset: Restaurant Review Dataset
    Domain: Restaurants
    Description: Contains a total of 52,077 reviews. The fields contain rating information, review counts, percent and cuisine type.
    Courtesy of: Elhadad

    Dataset: SNAP Review Dataset
    Domain: Products
    Description: Contains 34,686,770 Amazon user reviews from 6,643,669 users. This dataset was initially used for recommendation systems.
    Courtesy of: McAuley

    Dataset: MovieLens Dataset
    Domain: Movies
    Description:
    • 100,000 ratings (1-5) from 943 users on 1,682 movies.
    • Each user has rated at least 20 movies.
    • Simple demographic info for the users (age, gender, occupation, zip).
    Please note that the review text is not available.
    Courtesy of: GroupLens Research Project at the University of Minnesota

    Dataset: Micropinion Generation Dataset (CNET)
    Domain: Electronics
    Description: 330 review texts. The reviews cover products from various categories such as TVs, cell phones and GPS devices. This dataset was used for text summarization of opinions.
    Courtesy of: Ganesan & Zhai

    Opinosis Dataset – Topic related review sentences

    Description:

    This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of iPod nano”. In total there are 51 such topics, with each topic having approximately 100 sentences (on average). The reviews were obtained from various sources: Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics). This dataset was used for an automatic text summarization project.

    The dataset file also comes with the gold standard summaries used for the summarization paper. I have also provided some scripts to help with the summarization and evaluation tasks using ROUGE. Detailed information about the dataset and the list of scripts is provided in the documentation.

    Please send me an email if you have any questions regarding this dataset.

    Downloads

    Citing Dataset

    If you use this dataset for your own research please cite the following paper to mark the dataset:

    OpinRank Data – Reviews From TripAdvisor & Edmunds

     

    Dataset Overview

    This dataset contains full reviews of cars and hotels collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews).

    Car Reviews

    Dataset Description

    • Full reviews of cars for model-years 2007, 2008, and 2009
    • There are about 140-250 cars for each model year
    • Extracted fields include dates, author names, favorites and the full textual review
    • Total number of reviews: ~42,230
      • Year 2007: 18,903 reviews
      • Year 2008: 15,438 reviews
      • Year 2009: 7,947 reviews

    Format
    There are three folders (2007, 2008, 2009) representing the three model years. Each file within these three folders contains all reviews for a particular car. The filename represents the name of the car. Within each car file, you will see a set of reviews in the following format:

    <DOC>
    <DATE>06/15/2009</DATE>
    <AUTHOR>The author</AUTHOR>
    <TEXT>The review goes here..</TEXT>
    <FAVORITE>What are my favorites about this car</FAVORITE>

    </DOC>

    Note that each review is enclosed within a <DOC> element as shown above, and all the extracted items are within this element.
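The car review format can be read with a small parser. The sketch below uses regular expressions rather than an XML parser, on the assumption that a file holding several <DOC> blocks has no single root element and that review text may contain characters that are not XML-escaped. The function name parse_car_reviews is hypothetical, and the sample is the format example from this page.

```python
import re

# One match per <DOC>...</DOC> block; DOTALL lets a review span several lines.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# Captures the tag name and its content for the four documented fields.
FIELD_RE = re.compile(r"<(DATE|AUTHOR|TEXT|FAVORITE)>(.*?)</\1>", re.DOTALL)


def parse_car_reviews(content):
    """Return one dict per <DOC> block, keyed by lowercase field name."""
    reviews = []
    for doc in DOC_RE.findall(content):
        fields = {name.lower(): value.strip() for name, value in FIELD_RE.findall(doc)}
        reviews.append(fields)
    return reviews


sample = """<DOC>
<DATE>06/15/2009</DATE>
<AUTHOR>The author</AUTHOR>
<TEXT>The review goes here..</TEXT>
<FAVORITE>What are my favorites about this car</FAVORITE>

</DOC>"""

for review in parse_car_reviews(sample):
    print(review["date"], review["author"])
```

A field that is missing from a review is simply absent from its dictionary, so review.get("favorite") is the safer lookup when not every review fills in every field.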

    Hotel Reviews

    Dataset Description

    • Full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago)
    • There are about 80-700 hotels in each city
    • Extracted fields include date, review title and the full review
    • Total number of reviews: ~259,000

    Format
    There are 10 folders representing the 10 cities mentioned earlier. Each file within these 10 folders contains all reviews related to a particular hotel. The filename represents the name of the hotel. Within each file, you will see a set of reviews in the following format:

    Date1<tab>Review title1<tab>Full review 1
    Date2<tab>Review title2<tab>Full review 2
    …………….
    …………….

    Each line in the file represents a separate review entry. Tabs are used to separate the different fields.
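Because every hotel review is one tab-separated line, a file can be read line by line. The following is a minimal sketch; the function name parse_hotel_reviews and the sample lines are made up for illustration. It splits on the first two tabs only, so that a stray tab inside the review text (an assumption, not something this page documents) stays part of the review.

```python
def parse_hotel_reviews(lines):
    """Turn tab-separated lines into dicts with date, title and text."""
    reviews = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first two tabs only; later tabs stay in the review text.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have all three fields
        date, title, text = (part.strip() for part in parts)
        reviews.append({"date": date, "title": title, "text": text})
    return reviews


# Made-up sample lines, not taken from the dataset.
sample_lines = [
    "Date1\tReview title1\tFull review 1\n",
    "Date2\tReview title2\tFull review 2\n",
]
for review in parse_hotel_reviews(sample_lines):
    print(review["date"], "|", review["title"])
```

An open file object can be passed directly as the lines argument, since iterating over a text file yields one line at a time.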

    Downloads

    Citing Dataset

    If you use this dataset for your own research please cite the following to mark the dataset:

    Ganesan, K. A., and C. X. Zhai, “Opinion-Based Entity Ranking”, Information Retrieval.