ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced). Let us say we have the following system and reference summaries:
System Summary (what the machine produced):
the cat was found under the bed
Reference Summary (gold standard – usually by humans):
the cat was under the bed
Precision and Recall in the Context of ROUGE
In this example, the Recall would thus be:

Recall = (number of overlapping words) / (total words in reference summary) = 6/6 = 1.0

This means that all the words in the reference summary were captured by the system summary.
In this example, the Precision would thus be:

Precision = (number of overlapping words) / (total words in system summary) = 6/7 ≈ 0.86
This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above:
System Summary 2:
the tiny little cat was found under the big funny bed
The Precision now becomes:

Precision = (number of overlapping words) / (total words in system summary) = 6/11 ≈ 0.55

This is not very good, because the unneeded words in the system summary pull the precision down.
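The word-overlap counting above can be sketched in a few lines of Python. This is a minimal illustration, not an official ROUGE implementation; the function name `rouge_1` is my own. Overlap counts are clipped per word, so a word repeated in the system summary only matches as many times as it appears in the reference:

```python
from collections import Counter

def rouge_1(system, reference):
    """ROUGE-1 precision and recall from clipped unigram overlap."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count per word.
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall

reference = "the cat was under the bed"

p, r = rouge_1("the cat was found under the bed", reference)
print(round(p, 2), round(r, 2))  # 0.86 1.0

p2, _ = rouge_1("the tiny little cat was found under the big funny bed", reference)
print(round(p2, 2))  # 0.55
```

This reproduces the 6/7 precision and 6/6 recall of the first system summary, and the 6/11 precision of the second.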
So What is ROUGE-N, ROUGE-S & ROUGE-L?

To see how ROUGE-2 (bigram overlap) works, consider the same summaries as before:

System Summary:

the cat was found under the bed

Reference Summary:

the cat was under the bed
System Summary Bigrams:
the cat,
cat was,
was found,
found under,
under the,
the bed
Reference Summary Bigrams:
the cat,
cat was,
was under,
under the,
the bed
Based on the bigrams above, the ROUGE-2 recall is as follows:

ROUGE-2 recall = (number of overlapping bigrams) / (total bigrams in reference summary) = 4/5 = 0.8
Essentially, the system summary has recovered 4 out of the 5 bigrams in the reference summary, which is pretty good! Now the ROUGE-2 precision is as follows:

ROUGE-2 precision = (number of overlapping bigrams) / (total bigrams in system summary) = 4/6 ≈ 0.67
The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary. This is not too bad either. Note that as the summaries (both system and reference) get longer, there will be fewer overlapping bigrams, especially in the case of abstractive summarization, where you are not directly re-using sentences from the source.
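The bigram computation follows the same pattern as the unigram one. Below is a minimal sketch (the helper names `bigrams` and `rouge_2` are my own, not from any ROUGE package) that reproduces the 4/6 precision and 4/5 recall above:

```python
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs in a text."""
    words = text.split()
    return Counter(zip(words, words[1:]))

def rouge_2(system, reference):
    """ROUGE-2 precision and recall from clipped bigram overlap."""
    sys_bi = bigrams(system)
    ref_bi = bigrams(reference)
    overlap = sum((sys_bi & ref_bi).values())
    return overlap / sum(sys_bi.values()), overlap / sum(ref_bi.values())

p, r = rouge_2("the cat was found under the bed",
               "the cat was under the bed")
print(round(p, 2), round(r, 2))  # 0.67 0.8
```

The same functions generalize to any ROUGE-N by sliding a window of length n instead of 2.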
The reason one would use ROUGE-1 over or in conjunction with ROUGE-2 (or other finer-granularity ROUGE measures) is also to show the fluency of the summaries or translation. The intuition is that if you more closely follow the word orderings of the reference summary, then your summary is actually more fluent.
Short Explanation of a few Different ROUGE measures
- ROUGE-N – measures unigram, bigram, trigram and higher-order n-gram overlap.
- ROUGE-L – measures the longest matching sequence of words using the longest common subsequence (LCS). An advantage of using LCS is that it does not require consecutive matches, but in-sequence matches that reflect sentence-level word order. Since it automatically includes the longest in-sequence common n-grams, you don't need a predefined n-gram length.
- ROUGE-S – measures the overlap of any pair of words in a sentence in order, allowing for arbitrary gaps. This is also called skip-gram co-occurrence. For example, skip-bigram measures the overlap of word pairs that can have a maximum of two gaps in between words. As an example, for the phrase "cat in the hat" the skip-bigrams would be "cat in, cat the, cat hat, in the, in hat, the hat".
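The building blocks behind ROUGE-L and ROUGE-S can be illustrated with a short sketch. This is only an illustration under my own helper names (`lcs_length`, `skip_bigrams`), not the reference implementation:

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists,
    via standard dynamic programming (used by ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def skip_bigrams(text):
    """All in-order word pairs, allowing arbitrary gaps (used by ROUGE-S)."""
    words = text.split()
    return [f"{a} {b}" for a, b in combinations(words, 2)]

print(lcs_length("the cat was found under the bed".split(),
                 "the cat was under the bed".split()))  # 6
print(skip_bigrams("cat in the hat"))
# ['cat in', 'cat the', 'cat hat', 'in the', 'in hat', 'the hat']
```

For the running example, the LCS is "the cat was under the bed" (length 6), so ROUGE-L would score 6/6 recall and 6/7 precision; the skip-bigrams match the "cat in the hat" list given above.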
ROUGE Evaluation Packages
- Perl implementation of ROUGE – this is the original implementation of ROUGE
- Java-based ROUGE – an implementation in Java which supports evaluation of Unicode texts.
- JavaScript implementation of ROUGE