Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions

Abstract

We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.

Download Links

Opinosis External Discussions/Usage

The Big Idea

The Opinosis summarization framework focuses on generating very short abstractive summaries from large amounts of text. These summaries can resemble micropinions or “micro-reviews” that you see on sites like Twitter and Foursquare. The idea of the algorithm is to use a word graph data structure, referred to as the Opinosis-Graph, to represent the text to be summarized. The resulting graph is then repeatedly explored to find meaningful paths, which in turn become candidate summary phrases. The Opinosis summarizer is considered a “shallow” abstractive summarizer: it uses the original text itself to generate summaries (this makes it shallow), but because of the way paths are explored it can generate phrases that were not seen in the original text (this makes it abstractive rather than purely extractive). The summarization framework was evaluated on an opinion (user review) dataset. The approach itself is very general, however, and can be applied to any corpus containing a high amount of redundancy, for example Twitter comments or user comments on blog and news articles.
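
To make the word-graph idea concrete, here is a minimal Python sketch. It is not the Opinosis implementation: the function names, the plain-token input (no POS tags), and the simplified redundancy count are all assumptions made for illustration. Each unique word becomes a node that remembers where it occurred, edges connect adjacent words, and a path is scored by how many sentences contain its words in order within a bounded gap.

```python
from collections import defaultdict


def build_word_graph(sentences):
    """Build a simple word graph from sentences given as lists of tokens.

    Returns (positions, edges):
      positions[word] -> list of (sentence_id, position) occurrences
      edges[word]     -> set of words that directly follow it somewhere
    """
    positions = defaultdict(list)
    edges = defaultdict(set)
    for sid, tokens in enumerate(sentences):
        for pos, word in enumerate(tokens):
            positions[word].append((sid, pos))
            if pos + 1 < len(tokens):
                edges[word].add(tokens[pos + 1])
    return positions, edges


def path_redundancy(path, positions, max_gap=3):
    """Count sentences containing the words of `path` in order, with at most
    `max_gap` positions between consecutive words (a simplified measure)."""
    # reachable: sentence_id -> positions where the path so far can end
    reachable = defaultdict(set)
    for sid, pos in positions.get(path[0], []):
        reachable[sid].add(pos)
    for word in path[1:]:
        nxt = defaultdict(set)
        for sid, pos in positions.get(word, []):
            if any(0 < pos - prev <= max_gap for prev in reachable.get(sid, ())):
                nxt[sid].add(pos)
        reachable = nxt
    return len(reachable)


sentences = [
    "the battery life is very good".split(),
    "battery life is good".split(),
    "the screen is very good".split(),
]
positions, edges = build_word_graph(sentences)

# Supported by two sentences, one of them with "very" in the gap.
print(path_redundancy(["battery", "life", "is", "good"], positions))  # 2
# "screen is good" follows graph edges but is not a verbatim sentence.
print("good" in edges["is"], path_redundancy(["screen", "is", "good"], positions))  # True 1
```

The last line shows why such a summarizer can be abstractive: the edge from "is" to "good" comes from one sentence, while "screen" comes from another, so the path yields a phrase that never appeared word for word in the input.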

Here is another example of an Opinosis summary for a car (Acura 2007), generated using the OpinRank Edmunds data set:

Additional Thoughts

While most research projects in data mining and NLP focus on technical complexity, the focus of Opinosis was practicality: it uses a very shallow representation of text and relies mostly on redundancy to generate summaries. This is not too much to ask given that we live in an era of big data and have ample user reviews on the Web to work with. Even though the Opinosis paper uses part-of-speech tags in its graph representation, you don’t have to use them at all; the algorithm will still work fine as long as you have a sufficient volume of reviews and you make a few tweaks to how sentence breaks are found.

Related summarization works

Other works using a similar graph data structure

  • Discovering Related Clinical Concepts – This paper focuses on using a concept graph similar to the Opinosis-Graph to mine clinical concepts that are highly related. For example, the drug advair is highly related to concepts like inhaler, puff, diskus, singulair, tiotropium, albuterol, combivent, spiriva. Such concepts are easily discovered using the Concept-Graph in this paper.
  • Multi-sentence compression: Finding shortest paths in word graphs
    Katja Filippova’s work was used to summarize news (Google News) in both English and Spanish, while Opinosis was evaluated on user reviews from various sources (English only). She studies the informativeness and grammaticality of sentences; in a similar way, we evaluate these aspects by studying how close the Opinosis summaries are to human-composed summaries in terms of information overlap and readability (using a human assessor).
  • Peilin Yang and Hui Fang – Contextual Suggestion – Another related work uses the Opinosis algorithm to extract terms from reviews for the purpose of contextual suggestion. This was done as part of the Contextual Suggestion TREC Task. It turns out that Yang and Fang had the highest rank and MRR scores in this track. Their paper can be found here: An Opinion-aware Approach to Contextual Suggestion. The details of the TREC run can be found here: Overview of the TREC 2013 Contextual Suggestion Track.

Opinosis Presentation Slides

Citation

Opinosis Text Summarization Web API

The Opinosis REST API is available to all academic researchers. You can access it with a command line tool like cURL, or from any programming language that has an HTTP request and response library.

The advantage of the REST API over the Java jar file is that you can integrate the API into your own code base, which makes it much easier to run evaluations and to build on the API output. Please follow these steps to start using the API:

Steps for Consuming API

  • Create a Mashape account (Mashape manages API keys and access) and subscribe to the basic plan of this API. The usage is free up to a certain limit.
  • Use the examples below to start using the Opinosis Web API.
  • Use this page to learn how to set the Opinosis parameters.

Example JSON Input & Output:

The Opinosis Web API accepts a JSON input and returns a JSON output.

Here is the sample request: JSON request.
This is the sample response for this request: JSON response 

Example JSON Request using cURL

Example JSON Request using Python

Example JSON Request using PHP

 

Opinosis Summarization Demo Software (Command Line Jar)

The Opinosis Summarizer Software is a demo version of a summarizer that generates concise abstractive summaries of highly redundant text. It was primarily used to summarize opinions, and thus it can be regarded as an opinion summarization software. However, since the underlying approach is general and assumes no domain knowledge, with a few minor tweaks it can be used on any highly redundant text (e.g. Twitter comments, comments on blog or news articles). Note that this requires code changes. The demo version mainly works with user reviews.

The Opinosis Summarizer is a simple jar file. All it requires is that you have a work directory defined. This directory will hold all the input files, output files and any other resources. The following instructions will guide you through generating summaries using the Opinosis Summarizer. Please also note that the jar file has to be run from the command line and cannot be integrated into your existing code base. The Web API version will allow you to do that.

Platform: platform independent
Required Software: JRE 1.6 and above
License: Demo

Links

Opinosis Summarizer Usage


Download Library & Set-Up Directory Structure

Once you have unpacked the zip file, you will see the following items in the directory:

opinosis_lib/     - Contains helper jar files
opinosis.jar      - The library that performs the summarization task
documentation.pdf - Set-up instructions
opinosis_sample/  - Sample directory structure of the work directory

Now you need to define a new work directory similar to opinosis_sample. You must have the following directory structure.

<your_work_folder>/
    input/  - All the text to be summarized. One file per document.
    output/ - Summarization results (Opinosis summaries).
    etc/    - Other resources, such as opinosis.properties, are stored here.

Now copy the opinosis.properties file from opinosis_sample/etc/ into <your_work_folder>/etc/. This file contains all the application-specific settings. See below for how to change these settings.
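
If you prefer to script this set-up, the steps above can be sketched in Python as follows. The helper name `create_work_dir` is hypothetical; the sketch only creates the three folders described above and, optionally, copies a properties file into etc/.

```python
import shutil
from pathlib import Path


def create_work_dir(work_dir, properties_src=None):
    """Create input/, output/ and etc/ under work_dir.

    If properties_src is given (for example the sample file under
    opinosis_sample/etc/), copy it to etc/opinosis.properties.
    """
    work_dir = Path(work_dir)
    for name in ("input", "output", "etc"):
        (work_dir / name).mkdir(parents=True, exist_ok=True)
    if properties_src is not None:
        shutil.copy(properties_src, work_dir / "etc" / "opinosis.properties")
    return work_dir


# Example call (paths are placeholders for your own set-up):
# create_work_dir("my_work_folder", "opinosis_sample/etc/opinosis.properties")
```

Because `exist_ok=True` is used, running the helper again on an existing work folder leaves its current contents in place.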


Set-up Input Files

Currently, Opinosis only accepts POS annotated sentences as the input. We assume that each input file contains a set of related sentences (one line per sentence) with POS annotations in the following format:

"that/DT has/VBZ never/RB happened/VBN before/RB ./."
"It/NN never/RB happened/VBN before/RB ./."
"xx/NN yy/VB ......."

To generate POS annotations in the above format, you could use the following POS Tagger. Each input file represents one summarization task, so it should contain a set of clustered sentences: for example, one file for all sentences related to the “battery life of an ipod nano”, and another for all sentences related to the “ease of use of the ipod nano”. Please make sure that you have sufficient redundancy in each input file (i.e. > 60 related sentences).
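
As a rough illustration of the word/TAG format, the Python sketch below converts (word, tag) pairs, as most POS taggers can produce them, into one line per sentence and parses such a line back. The function names are hypothetical, and the sketch assumes the quotation marks in the example above are only there for display and are not part of the file format.

```python
def to_opinosis_line(tagged_tokens):
    """Join (word, tag) pairs into a single 'word/TAG word/TAG ...' line."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged_tokens)


def parse_opinosis_line(line):
    """Split a 'word/TAG ...' line back into (word, tag) pairs.

    rpartition splits on the last slash, so tokens such as './.' or
    words that themselves contain a slash are handled correctly.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = [("that", "DT"), ("has", "VBZ"), ("never", "RB"),
          ("happened", "VBN"), ("before", "RB"), (".", ".")]
line = to_opinosis_line(tagged)
print(line)  # that/DT has/VBZ never/RB happened/VBN before/RB ./.
```

Writing one such line per sentence into a file under input/ gives one summarization task per file, as described above.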


Running the Opinosis Summarizer

Once you have completed the first two steps above, type the following to start generating summaries:

java -jar opinosis.jar -b <path_to_work_folder>
-b: base directory where input and output directories are found (work directory).

All the Opinosis generated summaries will be found in <path_to_work_folder>/output/. If you want to run the examples from opinosis_sample/, execute the following:

java -jar opinosis.jar -b opinosis_sample/
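
Although the jar cannot be embedded as a library, it can still be launched from a script. The following Python sketch wraps the command shown above with `subprocess`; the function names are hypothetical, and it assumes `java` is on your PATH and that opinosis.jar is in the current directory unless another path is given.

```python
import subprocess
from pathlib import Path


def build_command(work_dir, jar_path="opinosis.jar"):
    """Return the command line shown above as an argument list."""
    return ["java", "-jar", str(jar_path), "-b", str(work_dir)]


def run_opinosis(work_dir, jar_path="opinosis.jar"):
    """Run the summarizer and list whatever it wrote under output/.

    Raises subprocess.CalledProcessError if the Java process exits
    with a non-zero status.
    """
    subprocess.run(build_command(work_dir, jar_path), check=True)
    return sorted((Path(work_dir) / "output").iterdir())


print(build_command("opinosis_sample/"))
# ['java', '-jar', 'opinosis.jar', '-b', 'opinosis_sample/']
```

Passing the arguments as a list avoids shell quoting problems when the work folder path contains spaces.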


Opinosis Summarizer Parameter Settings

To change the various properties for summary generation, just look into the opinosis.properties file found in the <your_work_folder>/etc/ directory. This file contains a list of configurable parameters. Here is an explanation of these parameters:

Opinosis Parameter Settings
redundancy : Controls the minimum redundancy requirement.  This enforces that a selected path contains at least the minimum specified redundancy. This has to be an absolute value. Setting this value to more than 2 is not recommended unless you have very high redundancies. This corresponds to sigma_r in the paper.
gap : Controls the maximum gap allowed between two adjacent nodes. If you set this to a very large value, your summaries may have grammatical issues. The recommended setting is between 2 and 5; the minimum acceptable setting is 2, and the default is 3. This corresponds to sigma_gap in the paper and has to be an absolute value.
max_summary : The number of candidates to select as the summary. This corresponds to the summary size, sigma_ss in the paper. This has to be an absolute value.
scoring_function : Which scoring function to use:
1 - redundancy only
2 - redundancy & path length
3 - redundancy & log(path length) (default and recommended)
collapse : Whether to collapse structures. Recall may be low when structures are not collapsed. Possible values are true or false.
run_id : A logical name for the current run. Any string describing the run will do.
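
As an illustration, the Python sketch below renders these parameters as text for opinosis.properties. It assumes the file uses the usual Java key=value properties syntax and that the keys are spelled exactly as listed above; check both against the sample file in opinosis_sample/etc/. The function name and the default values for max_summary and run_id are illustrative only.

```python
def render_properties(redundancy=2, gap=3, max_summary=2,
                      scoring_function=3, collapse=True, run_id="demo_run"):
    """Render Opinosis settings as key=value lines (assumed file syntax)."""
    if gap < 2:
        raise ValueError("gap must be at least 2")
    if scoring_function not in (1, 2, 3):
        raise ValueError("scoring_function must be 1, 2 or 3")
    settings = {
        "redundancy": redundancy,
        "gap": gap,
        "max_summary": max_summary,
        "scoring_function": scoring_function,
        "collapse": "true" if collapse else "false",
        "run_id": run_id,
    }
    return "".join(f"{key}={value}\n" for key, value in settings.items())


print(render_properties(gap=4, run_id="hotel_reviews"))
```

The two checks mirror the constraints stated above (gap of at least 2, scoring function 1 to 3), so an invalid combination fails before a run is started.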

 

Opinosis Dataset – Topic related review sentences

Description:

This dataset contains sentences extracted from user reviews on a given topic. Example topics are “performance of Toyota Camry” and “sound quality of ipod nano”. In total there are 51 such topics, each with approximately 100 sentences on average. The reviews were obtained from various sources: Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics). This dataset was used for the following automatic text summarization project.

The dataset file also comes with the gold standard summaries used for the summarization paper listed above. I have also provided some scripts to help with summarization and evaluation tasks using ROUGE. Detailed information about the dataset and the list of scripts is provided in the documentation.

Please send me an email if you have any questions regarding this dataset.

Downloads

Citing Dataset

If you use this dataset for your own research please cite the following paper to mark the dataset: