# Text Summarization

## Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions

Ganesan, Kavita, ChengXiang Zhai, and Evelyne Viegas. “Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions.” Proceedings of the 21st international conference on World Wide Web. ACM, 2012.

## Abstract

This paper presents a new unsupervised approach to generating ultra-concise summaries of opinions. We formulate the problem of generating such a micropinion summary as an optimization problem, where we seek a set of concise and non-redundant phrases that are readable and represent key opinions in text. We measure representativeness based on a modified mutual information function and model readability with an n-gram language model. We propose some heuristic algorithms to efficiently solve this optimization problem. Evaluation results show that our unsupervised approach outperforms other state of the art summarization methods and the generated summaries are informative and readable.

## Micropinion Generation Presentation Slides


## Opinosis Abstract

We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.

## The Big Idea

The Opinosis summarization framework focuses on generating very short abstractive summaries from large amounts of text. These summaries can resemble micropinions or “micro-reviews” of the kind you see on sites like Twitter and Foursquare. The idea of the algorithm is to use a word-graph data structure, referred to as the Opinosis-Graph, to represent the text to be summarized. The resulting graph is then repeatedly explored to find meaningful paths, which in turn become candidate summary phrases. The Opinosis summarizer is considered a “shallow” abstractive summarizer: it uses the original text itself to generate summaries (this makes it shallow), but it can generate phrases that were not previously seen in the original text because of the way paths are explored (and this makes it abstractive rather than purely extractive). The summarization framework was evaluated on an opinion (user review) dataset. The approach itself is actually very general in that it can be applied to any corpus containing a high amount of redundancy, for example, Twitter comments or user comments on blog/news articles.
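
The graph idea above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions (no part-of-speech tags, no path validity checks, and a single greedy traversal instead of the paper's full path scoring); the function names here are made up for this sketch and are not from the Opinosis code:

```python
from collections import defaultdict

def build_graph(sentences):
    """Build a word-adjacency graph: edge (w1 -> w2) counts how often
    w2 immediately follows w1 across all sentences."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            graph[w1][w2] += 1
    return graph

def greedy_path(graph, start, max_len=6, min_count=2):
    """Greedily follow the most redundant edges from `start`,
    yielding one candidate summary phrase."""
    path, current = [start], start
    for _ in range(max_len - 1):
        successors = graph.get(current, {})
        if not successors:
            break
        nxt, count = max(successors.items(), key=lambda kv: kv[1])
        if count < min_count:  # stop when redundancy runs out
            break
        path.append(nxt)
        current = nxt
    return " ".join(path)

reviews = [
    "the battery life is very good",
    "the battery life is great",
    "battery life is very good indeed",
]
print(greedy_path(build_graph(reviews), "battery"))
# -> battery life is very good
```

Note how redundancy does the work: the phrase emerges because many reviews traverse the same edges, which is exactly why the approach needs a corpus with high redundancy.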

Another example is an Opinosis summary for a car (Acura 2007) generated using the OpinRank Edmunds data set.

While most research projects in data mining and NLP focus on technical complexity, the focus of Opinosis was its practicality, in that it uses a very shallow representation of text, relying mostly on redundancy to help generate summaries. This is not too much to ask given that we live in an era of big data, with ample user reviews on the Web to work with. Even though the Opinosis paper uses part-of-speech tags in its graph representation, you don’t have to use them at all; the algorithm will still work fine as long as you have a sufficient volume of reviews and make a few tweaks in finding sentence breaks.

## Other works using a similar graph data structure

• Discovering Related Clinical Concepts – This paper focuses on using a concept graph similar to the Opinosis-Graph to mine clinical concepts that are highly related. For example, the drug `advair` is highly related to concepts like `inhaler`, `puff`, `diskus`, `singulair`, `tiotropium`, `albuterol`, `combivent`, `spiriva`. Such concepts are easily discovered using the Concept-Graph in this paper.
• Multi-sentence compression: Finding shortest paths in word graphs – Katja Filippova’s work was used to summarize news (Google News) in both English and Spanish, while Opinosis was evaluated on user reviews from various sources (English only). She studies the informativeness and grammaticality of sentences; in a similar way, we evaluate these aspects by studying how close the Opinosis summaries are to the human-composed summaries in terms of information overlap and readability (using a human assessor).
• Peilin Yang and Hui Fang – Contextual Suggestion – Another related work uses the Opinosis algorithm to extract terms from reviews for the purpose of contextual suggestion. This was done as part of the Contextual Suggestion TREC task, in which Yang and Fang achieved the highest rank and MRR scores. Their paper can be found here: An Opinion-aware Approach to Contextual Suggestion. The details of the TREC run can be found here: Overview of the TREC 2013 Contextual Suggestion Track.

## What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). Let us say we have the following system and reference summaries:

System Summary (what the machine produced): `the cat was found under the bed`

Reference Summary (gold standard, usually by humans): `the cat was under the bed`

If we consider just the individual words, the number of overlapping words between the system summary and reference summary is 6. This, however, does not tell you much as a metric. To get a good quantitative value, we can compute precision and recall using the overlap.

### Precision and Recall in the Context of ROUGE

Simplistically put, recall in the context of ROUGE means: how much of the reference summary is the system summary recovering or capturing? If we are considering just the individual words, it can be computed as:
$\dfrac{number\_of\_overlapping\_words}{total\_words\_in\_reference\_summary}$

In this example, the Recall would thus be:

$Recall=\dfrac{6}{6}=1.0$

This means that all the words in the reference summary have been captured by the system summary, which is indeed the case for this example. Voila! This looks really good for a text summarization system. However, it does not tell you the other side of the story. A machine-generated summary (system summary) can be extremely long, capturing all the words in the reference summary, yet many of the words in the system summary may be useless, making the summary unnecessarily verbose. This is where precision comes into play. In terms of precision, what you are essentially measuring is: how much of the system summary was in fact relevant or needed? Precision is measured as:

$\dfrac{number\_of\_overlapping\_words}{total\_words\_in\_system\_summary}$

In this example, the Precision would thus be:

$Precision=\dfrac{6}{7}=0.86$

This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. Suppose we had the following system summary instead of the example above:

System Summary 2: `the tiny little cat was found under the big funny bed`

The Precision now becomes:

$Precision=\dfrac{6}{11}=0.55$

Now this doesn’t look so good, does it? That is because we have quite a few unnecessary words in the summary. The precision aspect becomes really crucial when you are trying to generate summaries that are concise in nature. Therefore, it is always best to compute both precision and recall and then report the F-measure. If your summaries are in some way forced to be concise through some constraints, then you could consider using just recall, since precision is of less concern in that scenario.
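
The word-level computation above can be sketched in a few lines of Python. This is an illustrative sketch, not the official ROUGE toolkit; it uses clipped (multiset) word counts so that repeated words like “the” are handled consistently:

```python
from collections import Counter

def rouge_1(system, reference):
    """ROUGE-1 precision, recall and F1 from clipped word overlap."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((sys_counts & ref_counts).values())  # clipped counts
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_1("the cat was found under the bed",
                  "the cat was under the bed")
print(round(p, 2), round(r, 2))  # -> 0.86 1.0
```

The numbers match the worked example: 6 overlapping words out of 7 system words (precision) and out of 6 reference words (recall).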

### So What is ROUGE-N, ROUGE-S & ROUGE-L ?

ROUGE-N, ROUGE-S and ROUGE-L can be thought of as the granularity of the texts being compared between the system and reference summaries. For example, ROUGE-1 refers to the overlap of unigrams between the system summary and reference summary, while ROUGE-2 refers to the overlap of bigrams. Taking the example from above, let us compute the ROUGE-2 precision and recall scores.

System Summary: `the cat was found under the bed`
Reference Summary: `the cat was under the bed`
System Summary Bigrams: `the cat, cat was, was found, found under, under the, the bed`
Reference Summary Bigrams: `the cat, cat was, was under, under the, the bed`

Based on the bigrams above, the ROUGE-2 recall is as follows:

$ROUGE2_{Recall}=\dfrac{4}{5}=0.8$

Essentially, the system summary has recovered 4 bigrams out of 5 bigrams from the reference summary which is pretty good! Now the ROUGE-2 precision is as follows:

$ROUGE2_{Precision}=\dfrac{4}{6}=0.67$

The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary. This is not too bad either. Note that as the summaries (both system and reference) get longer, there will be fewer overlapping bigrams, especially in the case of abstractive summarization, where you are not directly re-using sentences for summarization.
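
The bigram arithmetic above generalizes to any n-gram size. A minimal sketch (again illustrative, not the official ROUGE script):

```python
from collections import Counter

def ngrams(text, n):
    """Counter of word n-grams in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(system, reference, n):
    """ROUGE-N precision and recall from clipped n-gram overlap."""
    sys_ngrams, ref_ngrams = ngrams(system, n), ngrams(reference, n)
    overlap = sum((sys_ngrams & ref_ngrams).values())
    return (overlap / sum(sys_ngrams.values()),
            overlap / sum(ref_ngrams.values()))

p, r = rouge_n("the cat was found under the bed",
               "the cat was under the bed", n=2)
print(round(p, 2), round(r, 2))  # -> 0.67 0.8
```

With n=2 this reproduces the worked example: 4 overlapping bigrams out of 6 system bigrams and 5 reference bigrams.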

The reason one would use ROUGE-2 over, or in conjunction with, ROUGE-1 (or other finer-granularity ROUGE measures) is to also assess the fluency of the summaries or translation. The intuition is that if your summary more closely follows the word orderings of the reference summary, then it is actually more fluent.

### Short Explanation of a few Different ROUGE measures

• ROUGE-N – measures unigram, bigram, trigram and higher-order n-gram overlap.
• ROUGE-L – measures the longest matching sequence of words using the longest common subsequence (LCS). An advantage of using LCS is that it does not require consecutive matches, only in-sequence matches that reflect sentence-level word order. Since it automatically includes the longest in-sequence common n-grams, you don’t need a predefined n-gram length.
• ROUGE-S – measures the overlap of any pair of words in a sentence that appear in the same order, allowing for arbitrary gaps. This can also be called skip-gram co-occurrence. For example, skip-bigram with a maximum gap of two measures the overlap of word pairs that can have at most two words in between. For the phrase “cat in the hat” the skip-bigrams would be “cat in, cat the, cat hat, in the, in hat, the hat”.
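
A skip-bigram generator matching the “cat in the hat” example can be sketched as follows (an illustrative helper; the `max_gap` parameter counts the number of words allowed between a pair):

```python
from itertools import combinations

def skip_bigrams(text, max_gap=2):
    """All ordered word pairs with at most `max_gap` words between them."""
    words = text.lower().split()
    return {(words[i], words[j])
            for i, j in combinations(range(len(words)), 2)
            if j - i - 1 <= max_gap}

print(sorted(skip_bigrams("cat in the hat")))
# all six pairs: cat-in, cat-the, cat-hat, in-the, in-hat, the-hat
```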

For more in-depth information about these evaluation metrics you can refer to Lin’s paper. Which measure to use depends on the specific task you are trying to evaluate. If you are working on extractive summarization with fairly verbose system and reference summaries, it may make sense to use ROUGE-1 and ROUGE-L. For very concise summaries, ROUGE-1 alone may suffice, especially if you are also applying stemming and stop-word removal.
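
ROUGE-L itself can be sketched with the standard longest-common-subsequence dynamic program (again an illustrative sketch, not the official ROUGE script):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two word lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(system, reference):
    """ROUGE-L precision and recall based on LCS length."""
    sys_words, ref_words = system.lower().split(), reference.lower().split()
    lcs = lcs_len(sys_words, ref_words)
    return lcs / len(sys_words), lcs / len(ref_words)

p, r = rouge_l("the cat was found under the bed",
               "the cat was under the bed")
print(round(p, 2), r)  # -> 0.86 1.0
```

For the running example the LCS is “the cat was under the bed” (6 words), so no n-gram length had to be chosen in advance.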

## Opinosis Text Summarization Web API

The Opinosis REST API is available to all academic researchers. You can use a command line tool like cURL to access the API or you can also easily access the API from any programming language using HTTP request and response libraries.

The nice thing about using the REST API version versus the Java jar file is that you can integrate the API into your code base, making evaluation and building on the API output much easier. Please follow these steps to start using the API:

## Steps for Consuming API

• Create a Mashape account (Mashape manages API keys and access) and subscribe to the basic plan of this API. The usage is free up to a certain limit.
• You can use the examples below to start using the Opinosis Web API.
• You can use this page to learn how to set the Opinosis parameters.

## Example JSON Input & Output:

The Opinosis Web API accepts a JSON input and returns a JSON output.

Here is the sample request: JSON request.
This is the sample response for this request: JSON response