Opinosis Summarization Demo Software (Command Line Jar)

The Opinosis Summarizer Software is a demo version of a summarizer that generates concise abstractive summaries of highly redundant text. It was primarily used to summarize opinions, so it can be regarded as opinion summarization software. However, because the underlying approach is general and assumes no domain knowledge, it can be applied to any highly redundant text (e.g. Twitter comments, or comments on blog posts and news articles) with a few minor tweaks. Note that this requires code changes; the demo version mainly works with user reviews.

The Opinosis Summarizer is a single jar file. All it requires is a work directory, which holds all the input files, output files and any other resources. The following instructions guide you through generating summaries with the Opinosis Summarizer. Please note that the jar file has to be run from the command line and cannot be integrated into your existing code base; the Web API version allows that.

Platform: platform independent
Required Software: JRE 1.6 and above
License: Demo

Opinosis Summarizer Usage


Download Library & Set-Up Directory Structure

Once you have unpacked the zip file, you will see the following items in the directory:

opinosis_lib/ - Helper jar files
opinosis.jar - The library that performs the summarization task
documentation.pdf - Set-up instructions
opinosis_sample/ - Sample directory structure of the work directory

Now define a new work directory modeled on opinosis_sample. It must have the following directory structure:

<your_work_folder>/
input/  - All the text to be summarized. One file per document.
output/ - Summarization Results (opinosis summaries)
etc/    - Other resources like opinosis.properties will be stored here.

Now copy the opinosis.properties file from opinosis_sample/etc/ into <your_work_folder>/etc/. This file contains all the application-specific settings. See "Opinosis Summarizer Parameter Settings" for how to change them.
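
The layout above can also be created with a short script. The following is a minimal Python sketch and is not part of the Opinosis distribution: the function name make_work_folder and the folder name in the usage note are hypothetical, and the sketch assumes the sample settings file sits at opinosis_sample/etc/opinosis.properties as described above.

```python
import shutil
from pathlib import Path


def make_work_folder(work_dir, sample_dir=None):
    """Create input/, output/ and etc/ under work_dir (hypothetical helper).

    If sample_dir is given and contains etc/opinosis.properties,
    that file is copied into the new etc/ directory.
    """
    work = Path(work_dir)
    for name in ("input", "output", "etc"):
        (work / name).mkdir(parents=True, exist_ok=True)

    if sample_dir is not None:
        src = Path(sample_dir) / "etc" / "opinosis.properties"
        if src.is_file():
            shutil.copy(src, work / "etc" / "opinosis.properties")
    return work
```

For example, make_work_folder("my_work_folder", "opinosis_sample") would create the three sub-directories and copy the sample settings file if it is present.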


Set-up Input Files

Currently, Opinosis only accepts POS-annotated sentences as input. Each input file is assumed to contain a set of related sentences (one sentence per line) with POS annotations in the following format:

"that/DT has/VBZ never/RB happened/VBN before/RB ./."
"It/NN never/RB happened/VBN before/RB ./."
"xx/NN yy/VB ......."

To generate POS annotations in the above format, you can use a POS tagger. Each input file represents one summarization task, so it should contain a set of clustered sentences: for example, one file for all sentences related to the “battery life of an iPod nano”, and another for all sentences related to the “ease of use of the iPod nano”. Please make sure that each input file has sufficient redundancy (i.e. more than 60 related sentences).
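
As an illustration of the input format, here is a small Python sketch that joins already-tagged tokens into a word/TAG line and checks that a line follows that pattern. The helper names to_opinosis_line and is_valid_line are hypothetical, and the sketch assumes that the quotation marks shown around the example sentences above are not part of the file content. It does not perform POS tagging itself.

```python
def to_opinosis_line(tagged_tokens):
    """Join (word, tag) pairs into one 'word/TAG word/TAG ...' line."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged_tokens)


def is_valid_line(line):
    """Return True if every whitespace-separated token looks like word/TAG."""
    tokens = line.split()
    if not tokens:
        return False
    for token in tokens:
        # Split on the last slash so that the token "./." is read as
        # word "." with tag ".".
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            return False
    return True


def count_valid_lines(lines):
    """Count the lines that pass is_valid_line, e.g. to check redundancy."""
    return sum(1 for line in lines if is_valid_line(line))
```

A file whose count_valid_lines result is 60 or lower would fall short of the redundancy guideline given above.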


Running the Opinosis Summarizer

Once you have completed the two steps above, generate summaries by running the following command:

java -jar opinosis.jar -b <path_to_work_folder>
-b: base directory where input and output directories are found (work directory).

All the summaries generated by Opinosis will be written to <path_to_work_folder>/output/. To run the examples from opinosis_sample/, execute the following:

java -jar opinosis.jar -b opinosis_sample/
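
If you want to launch the command above from a script, the following Python sketch shows one way to do it. It still starts the jar as a separate command-line process and does not call into the library. The helper names build_command and run_opinosis are hypothetical, and the sketch assumes that java is on the PATH and that opinosis.jar is in the current directory unless another path is given.

```python
import subprocess
from pathlib import Path


def build_command(work_dir, jar_path="opinosis.jar"):
    """Return the argument list for: java -jar <jar_path> -b <work_dir>."""
    return ["java", "-jar", str(jar_path), "-b", str(work_dir)]


def run_opinosis(work_dir, jar_path="opinosis.jar"):
    """Run the jar, then list whatever is in <work_dir>/output/ afterwards."""
    # check=True raises CalledProcessError if java exits with an error code.
    subprocess.run(build_command(work_dir, jar_path), check=True)
    return sorted(Path(work_dir, "output").iterdir())
```

Calling run_opinosis("opinosis_sample/") corresponds to the sample command shown above.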


Opinosis Summarizer Parameter Settings

To change the properties used for summary generation, edit the opinosis.properties file in the <your_work_folder>/etc/ directory. This file contains a list of configurable parameters, explained below:

Opinosis Parameter Settings
redundancy : Controls the minimum redundancy requirement.  This enforces that a selected path contains at least the minimum specified redundancy. This has to be an absolute value. Setting this value to more than 2 is not recommended unless you have very high redundancies. This corresponds to sigma_r in the paper.
gap : Controls the maximum gap allowed between 2 adjacent nodes. If you set this to a very large value, your summaries may have grammatical issues. The recommended setting is between 2 and 5; the minimum acceptable setting is 2, and the default is 3. This corresponds to sigma_gap in the paper and has to be an absolute value.
max_summary : The number of candidates to select as the summary. This corresponds to the summary size, sigma_ss in the paper. This has to be an absolute value.
scoring_function : Which scoring function to use:
1 - redundancy only
2 - redundancy & path length
3 - redundancy & log(path length) (default and recommended)
collapse : Whether to collapse structures. Recall may be low when structures are not collapsed. Possible values are true or false.
run_id : A logical name for the current run. Any string describing the run will do.
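
To make the constraints above concrete, here is a Python sketch that checks a set of settings against them and renders the settings as text. Several points are assumptions rather than documented behavior: the helper names check_settings and render_properties are hypothetical, "absolute value" is interpreted as an integer, and the key=value syntax is assumed because it is the common Java properties style. Check the sample opinosis.properties file for the actual syntax. Apart from the documented defaults for gap (3) and scoring_function (3), the values in the usage note are illustrative only.

```python
def _is_int(value):
    """True for integers, excluding booleans (bool is a subclass of int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def check_settings(settings):
    """Return a list of problems, based only on the rules described above."""
    problems = []
    for key in ("redundancy", "gap", "max_summary"):
        if not _is_int(settings.get(key)):
            problems.append(f"{key} must be an integer")
    gap = settings.get("gap")
    if _is_int(gap) and gap < 2:
        problems.append("gap must be at least 2")
    scoring = settings.get("scoring_function")
    if not _is_int(scoring) or scoring not in (1, 2, 3):
        problems.append("scoring_function must be 1, 2 or 3")
    if not isinstance(settings.get("collapse"), bool):
        problems.append("collapse must be true or false")
    return problems


def render_properties(settings):
    """Render settings as key=value lines (assumed syntax, lowercase booleans)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

For example, check_settings applied to {"redundancy": 2, "gap": 3, "max_summary": 2, "scoring_function": 3, "collapse": False, "run_id": "demo_run"} finds no problems, and render_properties turns the same dictionary into one key=value line per parameter.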