What is ROUGE and how does it work for evaluating summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries or translations (typically human-produced). Let us say we have the following system and reference summaries:

System Summary (what the machine produced):

Reference Summary (gold standard – usually by humans) :

If we consider just the individual words, the number of overlapping words between the system summary and the reference summary is 6. This, however, does not tell you much as a metric. To get a good quantitative value, we can compute precision and recall using the overlap.


Precision and Recall in the Context of ROUGE

Put simply, recall in the context of ROUGE measures how much of the reference summary the system summary is recovering or capturing. If we are just considering the individual words, it can be computed as:

Recall = number_of_overlapping_words / total_words_in_reference_summary

In this example, the Recall would thus be:

Recall = 6 / 6 = 1.0

This means that all the words in the reference summary have been captured by the system summary, which indeed is the case for this example. Voilà! This looks really good for a text summarization system. However, it does not tell you the other side of the story. A machine generated summary (system summary) can be extremely long, capturing all the words in the reference summary. But many of the words in the system summary may be useless, making the summary unnecessarily verbose. This is where precision comes into play. In terms of precision, what you are essentially measuring is how much of the system summary was in fact relevant or needed. Precision is measured as:

Precision = number_of_overlapping_words / total_words_in_system_summary

In this example, the Precision would thus be:

Precision = 6 / 7 ≈ 0.86

This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above:

System Summary 2:

The Precision now becomes:

Now, this doesn’t look so good, does it? That is because we have quite a few unnecessary words in the summary. The precision aspect becomes really crucial when you are trying to generate summaries that are concise in nature. Therefore, it is always best to compute both the Precision and Recall and then report the F-Measure. If your summaries are in some way forced to be concise through some constraints, then you could consider using just the Recall since precision is of less concern in this scenario.
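The word-level precision, recall and F-measure described above can be sketched in a few lines of Python. This is a minimal illustration and not the official ROUGE implementation: the two sentences are hypothetical stand-ins that I picked only because they reproduce the counts used in this section (6 overlapping words, a 7-word system summary, a 6-word reference), tokenization is a plain whitespace split, and the function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(system, reference, n=1):
    """Clipped n-gram overlap precision, recall and F1 (illustrative only)."""
    sys_counts = Counter(ngrams(system.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated word is only credited as often as it occurs in both.
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / max(sum(sys_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sentences, chosen only to match the counts in the text.
system_summary = "the cat was found under the bed"
reference_summary = "the cat was under the bed"

p, r, f = rouge_n(system_summary, reference_summary, n=1)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Passing n=2 to the same function gives the bigram variant of the scores.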


ROUGE-N, ROUGE-S and ROUGE-L can be thought of as the granularity of texts being compared between the system summaries and reference summaries. For example, ROUGE-1 refers to overlap of unigrams between the system summary and reference summary. ROUGE-2 refers to the overlap of bigrams between the system and reference summaries. Let’s take the example from above. Let us say we want to compute the ROUGE-2 precision and recall scores.
System Summary :
Reference Summary :
System Summary Bigrams:
Reference Summary Bigrams:

Based on the bigrams above, the ROUGE-2 recall is as follows:

ROUGE-2 Recall = number_of_overlapping_bigrams / total_bigrams_in_reference_summary = 4 / 5 = 0.8

Essentially, the system summary has recovered 4 bigrams out of 5 bigrams from the reference summary, which is pretty good! Now the ROUGE-2 precision is as follows:

ROUGE-2 Precision = number_of_overlapping_bigrams / total_bigrams_in_system_summary = 4 / 6 ≈ 0.67

The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary. This is not too bad either. Note that as the summaries (both system and reference summaries) get longer and longer, there will be fewer overlapping bigrams, especially in the case of abstractive summarization where you are not directly re-using sentences for summarization.
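To make the bigram counts concrete, here is a small Python sketch that lists the bigrams and their overlap explicitly. The two sentences are hypothetical; I chose them because they yield the same counts as in the text (4 shared bigrams, 5 reference bigrams, 6 system bigrams). Since no bigram repeats in either sentence, a plain set intersection is enough here; a real implementation would clip repeated n-grams by count.

```python
def bigrams(text):
    """Split on whitespace and pair each word with its successor."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# Hypothetical sentences, chosen only to match the counts in the text.
system_bigrams = bigrams("the cat was found under the bed")
reference_bigrams = bigrams("the cat was under the bed")

# No bigram is repeated in either list, so set intersection is sufficient.
overlap = set(system_bigrams) & set(reference_bigrams)

rouge2_recall = len(overlap) / len(reference_bigrams)
rouge2_precision = len(overlap) / len(system_bigrams)

print(sorted(overlap))
print(f"ROUGE-2 recall={rouge2_recall:.2f} precision={rouge2_precision:.2f}")
```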

The reason one would use ROUGE-2 (or other finer-granularity ROUGE measures) over, or in conjunction with, ROUGE-1 is to also capture the fluency of the summaries or translations. The intuition is that if you more closely follow the word orderings of the reference summary, then your summary is actually more fluent.

Short Explanation of a few Different ROUGE measures

  • ROUGE-N – measures unigram, bigram, trigram and higher order n-gram overlap.
  • ROUGE-L – measures the longest matching sequence of words using the longest common subsequence (LCS). An advantage of using LCS is that it does not require consecutive matches, only in-sequence matches that reflect sentence-level word order. Since it automatically includes the longest in-sequence common n-grams, you don’t need a predefined n-gram length.
  • ROUGE-S – measures the overlap of any pair of words in sentence order, allowing for arbitrary gaps. This is also called skip-gram co-occurrence. For example, skip-bigram measures the overlap of word pairs that can have a maximum of two gaps in between words. As an example, for the phrase “cat in the hat” the skip-bigrams would be “cat in, cat the, cat hat, in the, in hat, the hat”.

For more in-depth information about these evaluation metrics you can refer to Lin’s paper. Which measure to use depends on the specific task that you are trying to evaluate. If you are working on extractive summarization with fairly verbose system and reference summaries, then it may make sense to use ROUGE-1 and ROUGE-L. For very concise summaries, ROUGE-1 alone may suffice, especially if you are also applying stemming and stop word removal.
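The two building blocks behind ROUGE-L and ROUGE-S can be sketched as follows. This is only an illustration of the ideas (LCS length and skip-bigram extraction), not the scoring code of any ROUGE package; the function names and the max_gap parameter are my own, and the sentence pair used for the LCS call is hypothetical.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def skip_bigrams(tokens, max_gap=2):
    """In-order word pairs with at most max_gap words between them."""
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i - 1 <= max_gap]


# The phrase from the ROUGE-S bullet above.
pairs = skip_bigrams("cat in the hat".split())
print([" ".join(pair) for pair in pairs])

# Hypothetical sentence pair: the matches are in order but not consecutive.
system = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
lcs = lcs_length(system, reference)
print(f"LCS={lcs} recall={lcs / len(reference):.2f} precision={lcs / len(system):.2f}")
```

Dividing the LCS length by the reference length gives an LCS-based recall, and dividing by the system length gives the corresponding precision.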

ROUGE 2.0 – Overview

ROUGE 2.0 is a Java Package for Evaluation of Summarization Tasks building on the Perl Implementation of ROUGE.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It consists of a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced) or translations.

ROUGE 2.0 is a lightweight open-source tool that allows for easy evaluation of summaries or translations by limiting the amount of formatting needed for reference summaries as well as system summaries. In addition, it allows for evaluation of Unicode texts, which is known to be an issue with other implementations of ROUGE. One can also add new evaluation metrics to the existing code base or improve on existing ones.

More info can be found on GitHub.

A Step by Step Guide to Working with ROUGE for Summary Evaluation

I have been trying to use the ROUGE Perl toolkit to evaluate one of my research projects but have been finding it really hard to get proper documentation on its usage. So I decided to piece together some information that may be helpful to others. Actually, I learnt the basics of using ROUGE from the MEAD documentation! If you have successfully installed ROUGE and need to set up the evaluation mechanism, read on. If you need information on how to install ROUGE, go through the README file in the ROUGE package. Basically, the trick is in the successful installation of the Perl modules. If you need to understand at a high level how ROUGE works as a metric, you can read this article.

First off, to evaluate a summarization system you have two types of summaries. One is the system generated summaries, which are referred to as ‘peer summaries’, and then you have the reference summaries or gold standard summaries, known as ‘model summaries’. Reference summaries are usually written by humans, and it has been shown that using multiple reference summaries yields more reliable ROUGE scores than using just one reference summary. Note that ROUGE can handle any number of peer summaries (if generated by multiple systems) and any number of model summaries. All you really have to do is specify all of this in an XML file.

New! Edit 2018 – Java Based ROUGE: For those who cannot get the perl version to work on Windows or your Linux Machine, you can use the ROUGE 2.0 package with unicode support and full documentation. The settings are super simplified and there is no special formatting needed for the reference and system summaries.

Getting Started

To get started, create a directory structure as follows anywhere on your system:

  • <your-project-name>/
    • models/: contains all reference summaries that will be used for evaluation. Each file can be identified by its author and by the set of documents for which the summary was generated. Say a summary was generated for document set 3 by human 2. Then the file name can be something like human2_doc3.html.
    • systems/: contains all system generated summaries. Each file can be identified by the id of the system and the set of documents for which the summary was generated. Say a summary was generated for document set 3 by system 1. Then the file name can be something like system1_doc3.html.
    • settings.xml: this is the core file that specifies which peer summaries should use which model summaries for evaluation. A detailed explanation follows below.

How to format settings.xml ?

Here I will only explain the basic syntax for formatting the core settings file. I am assuming that this file will be generated using some script so the formatting is really important. To learn how to format the system and model files look at the examples in  <ROUGE_HOME>/sample-test/SL2003 or check out the samples below.

  1. The file should typically start with: <ROUGE_EVAL version="1.55">
  2. Then, enclose each summarization task between these tags: <EVAL ID="TASK_#">TASK_DETAILS</EVAL>
  3. Within this enclosure you need to specify where to find the model and peer summaries. So make sure to include these tags:
    <MODEL-ROOT> parent_dir_to_model_files </MODEL-ROOT>
    <PEER-ROOT> parent_dir_to_peer_files </PEER-ROOT>
  4. Followed by: <INPUT-FORMAT TYPE="SEE"> </INPUT-FORMAT>
  5. For each summarization task, we need to specify the system summaries and the reference summaries to evaluate against. Here is an example:
    <PEERS> (list of system generated summaries for the same task)
    <P ID="1">1.html</P> (system 1’s summary, found in 1.html)
    <P ID="2">2.html</P> (system 2’s summary, found in 2.html)
    </PEERS>
    <MODELS> (list of reference summaries for the same task)
    <M ID="0">0.html</M> (reference summary 1 for this task, found in 0.html)
    <M ID="1">1.html</M> (reference summary 2 for this task, found in 1.html)
    </MODELS>

For the next summarization task, repeat from point 2. Finally, finish by closing the XML tag with </ROUGE_EVAL>
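Since the settings file is meant to be generated by a script, here is a rough Python sketch of one. It rests on several assumptions of mine: the files follow the naming convention suggested earlier (human2_doc3.html, system1_doc3.html), the part of the name before "_doc" is used as the P/M ID, and the M entries are wrapped in a MODELS element to mirror PEERS. Check the result against the samples that ship with ROUGE before relying on it.

```python
import re
from collections import defaultdict

# Assumed file naming, taken from the examples in this guide:
#   <author-id>_doc<set-number>.html, e.g. human2_doc3.html, system1_doc3.html
NAME_RE = re.compile(r"^(?P<who>[^_]+)_doc(?P<doc>\d+)\.html$")


def group_by_doc(filenames):
    """Map document-set number -> sorted list of (author_id, file_name)."""
    groups = defaultdict(list)
    for name in filenames:
        match = NAME_RE.match(name)
        if match:  # files that do not follow the convention are skipped
            groups[int(match.group("doc"))].append((match.group("who"), name))
    return {doc: sorted(items) for doc, items in groups.items()}


def build_settings(model_files, peer_files, model_root="models", peer_root="systems"):
    """Build the settings XML as a string, one EVAL block per document set."""
    models, peers = group_by_doc(model_files), group_by_doc(peer_files)
    lines = ['<ROUGE_EVAL version="1.55">']
    # Only document sets that have both peer and model summaries are listed.
    for doc in sorted(set(models) & set(peers)):
        lines.append(f'<EVAL ID="{doc}">')
        lines.append(f"<MODEL-ROOT>{model_root}</MODEL-ROOT>")
        lines.append(f"<PEER-ROOT>{peer_root}</PEER-ROOT>")
        lines.append('<INPUT-FORMAT TYPE="SEE"></INPUT-FORMAT>')
        lines.append("<PEERS>")
        lines += [f'<P ID="{who}">{name}</P>' for who, name in peers[doc]]
        lines.append("</PEERS>")
        lines.append("<MODELS>")
        lines += [f'<M ID="{who}">{name}</M>' for who, name in models[doc]]
        lines.append("</MODELS>")
        lines.append("</EVAL>")
    lines.append("</ROUGE_EVAL>")
    return "\n".join(lines) + "\n"


# Example with literal file names; in practice you would list the
# models/ and systems/ directories instead.
settings = build_settings(
    model_files=["human1_doc3.html", "human2_doc3.html"],
    peer_files=["system1_doc3.html"],
)
print(settings)
```

If your setup expects numeric IDs, replace the name prefix with a counter, but keep the same ID for the same system across all tasks so that its scores can be aggregated.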

How to format my model / peer summaries ?

The format that I use is usually HTML. I am not sure if other formats are supported. The same format is used for both your reference/model/gold standard summaries and your peer/system summaries. You may have to generate this using a script. Each summary will have its own file and each sentence from each summary will have to be on its own line. You may thus have to segment your summaries (if not already segmented). Here is an example of a model summary in a format that ROUGE understands. It has 3 sentences, as indicated by the id.

<head><title>filename_here</title> </head>
<body bgcolor="white">
<a name="1">[1]</a> <a href="#1" id=1>This unit is generally quite accurate.  </a>
<a name="2">[2]</a> <a href="#2" id=2>Set-up and usage are considered to be very easy. </a>
<a name="3">[3]</a> <a href="#3" id=3>The maps can be updated, and tend to be reliable.</a> </body>
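A script along these lines could produce that layout from sentences you have already segmented. It is a sketch that simply mirrors the sample above: the function name is mine, the sentences are written as-is (so I am assuming they contain no markup characters), and I have not checked how ROUGE treats anything beyond the tags shown in the sample.

```python
def to_see_html(title, sentences):
    """Render pre-segmented sentences, one per line, in the sample layout."""
    lines = [f"<head><title>{title}</title> </head>", '<body bgcolor="white">']
    for i, sentence in enumerate(sentences, 1):
        # Each sentence gets a running id, exactly as in the sample above.
        lines.append(
            f'<a name="{i}">[{i}]</a> <a href="#{i}" id={i}>{sentence.strip()}</a>'
        )
    lines.append("</body>")
    return "\n".join(lines) + "\n"


summary_html = to_see_html(
    "human2_doc3.html",
    [
        "This unit is generally quite accurate.",
        "Set-up and usage are considered to be very easy.",
        "The maps can be updated, and tend to be reliable.",
    ],
)
print(summary_html)
```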

Where to obtain gold-standard/model  summaries ?

Well this really depends on your application. If you have a handful of documents that you need to summarize, then, just get your peers to write summaries for you. About 3-5 would be a good number in my opinion. Just give the summary writers very general instructions and make sure you influence them in no way. If you have a large number of documents to summarize you could consider using an online workforce like Amazon’s Mechanical Turk.

How to run my evaluation tasks ?

Once you have prepared the system summaries, model summaries and settings file as described above, it's actually pretty straightforward. Here is an example:

./ -e data -f A -a -x -s -m -2 4 -u <your-project-name>/settings.xml

This example evaluates using ROUGE-SU4.

  • -e specifies the location of the data directory that comes with ROUGE. This is mandatory, because it contains the stop-word files.
  • -f A selects the model average as the scoring formula when there are multiple reference summaries
  • -a says to evaluate all systems listed in the settings file
  • -s removes stop words before scoring
  • -m specifies the usage of stemming
  • -2 4 -u says use ROUGE-SU with a maximum skip distance of 4 and also count unigrams (the U in SU)
  • -x is to say that you do not want ROUGE-L to be computed (this is computed by default)

To get a list of adjustable parameters, just run ./ without any parameters.

How do I analyze my ROUGE scores?

ROUGE produces output in a format that cannot be easily analyzed. You essentially have to write a script to parse the results into a format that suits you. I have written a Perl script to parse the results into CSV format. It allows you to visualize and analyze your results in OpenOffice or Excel. All you need to do is pipe all your ROUGE results to a text file and provide that as input to the Perl script. Download the tool here.

Jackknifing with ROUGE

Jackknifing is typically used when human summaries need to be comparable with system generated ones. This assumes you have multiple human (reference/model) summaries. ROUGE used to internally implement jackknifing, but this was removed as of version 1.5.5. I do not know the rationale for this, but if you need to implement it, it's pretty simple.

Say you have K reference summaries. You compute ROUGE scores over K sets of K-1 reference summaries, which means you leave out one reference summary each time. If you are attempting to compute human performance, then the reference summary that you leave out will temporarily be your ‘system’ or ‘peer’ summary. Once you have the K ROUGE scores, you just need to average them to get the final ROUGE score. The Rouge2CSV Perl tool will help you combine and average these scores if you pipe all your ROUGE results to one file.
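The leave-one-out procedure can be sketched like this. The scoring function is deliberately a placeholder: score_fn(peer, references) is a hypothetical callable that you would back with real ROUGE runs, and toy_recall below is only a stand-in so that the example executes.

```python
from statistics import mean


def jackknife(references, score_fn, system=None):
    """Average score_fn over the K leave-one-out subsets of the references.

    If system is None, human performance is estimated: the held-out
    reference temporarily plays the role of the peer summary.
    """
    if len(references) < 2:
        raise ValueError("jackknifing needs at least two reference summaries")
    scores = []
    for k, held_out in enumerate(references):
        remaining = references[:k] + references[k + 1:]  # K-1 references
        peer = held_out if system is None else system
        scores.append(score_fn(peer, remaining))
    return mean(scores)


def toy_recall(peer, references):
    """Toy stand-in for ROUGE: best set-based unigram recall over references."""
    peer_words = set(peer.lower().split())
    best = 0.0
    for ref in references:
        ref_words = set(ref.lower().split())
        best = max(best, len(peer_words & ref_words) / len(ref_words))
    return best


refs = ["a b c", "a b d", "a b c"]
human_score = jackknife(refs, toy_recall)
system_score = jackknife(refs, toy_recall, system="a b c")
print(human_score, system_score)
```

Because both numbers are averaged over the same K subsets of K-1 references, the human and system scores end up on a comparable footing.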

I am Having WordNet Exceptions

The WordNet exceptions seem to be a problem that a lot of people run into. You essentially need to build the WordNet exception database and link to it. This was the solution given by Chin-Yew Lin:

cd data/WordNet-2.0-Exceptions/
./ . exc WordNet-2.0.exc.db

cd ../
ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db

Where do I find latest version of ROUGE?

rouge2csv – Script to Interpret ROUGE Scores

This is a Perl script that helps in interpreting ROUGE scores generated by the original Perl implementation of ROUGE. If you need instructions on how to set up ROUGE for evaluation of your summarization tasks, go here.

Assuming you have piped all your ROUGE results to a file, this tool will collect all ROUGE scores into separate CSV files depending on the n-grams used. For example, all ROUGE-1 scores will be collected into a ROUGE-1.csv file, and similarly all ROUGE-2 scores will be in a ROUGE-2.csv file. The precision, recall and f-scores will be comma separated. This will allow you to easily visualize your results in Excel or OpenOffice. If you have ROUGE scores of identical runs (which usually happens when you use jackknifing), the scores will be averaged.

Here is a sample input and corresponding output file: [ Input | Output ]. You will notice multiple results with the same run id+ROUGE-N combination in the input file. This is due to the Jackknifing procedure that I used. In the output however, you will see only one instance as the scores have been averaged. If you do not use Jackknifing, you will most likely have one ROUGE score for one particular run, so you need not worry about this.
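For readers who prefer Python, the same idea (group by metric, average repeated run id and metric combinations, emit CSV) might look roughly like the sketch below. It is not the Perl tool itself, and the line format in LINE_RE is my assumption about what the ROUGE output looks like, so verify it against your own results file first.

```python
import csv
import io
import re
from collections import defaultdict
from statistics import mean

# Assumed shape of a ROUGE result line (verify against your own output):
#   <run-id> ROUGE-1 Average_R: 0.51822 (95%-conf.int. 0.42105 - 0.61538)
LINE_RE = re.compile(r"^(\S+)\s+(ROUGE-\S+)\s+Average_([RPF]):\s+([0-9.]+)")


def collect_scores(lines):
    """Return {metric: {run_id: {'R': avg, 'P': avg, 'F': avg}}}."""
    raw = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # separator and other non-score lines are ignored
            run_id, metric, kind, value = match.groups()
            raw[metric][run_id][kind].append(float(value))
    # Repeated (run id, metric) entries, e.g. from jackknifing, are averaged.
    return {
        metric: {
            run_id: {kind: mean(values) for kind, values in kinds.items()}
            for run_id, kinds in runs.items()
        }
        for metric, runs in raw.items()
    }


def to_csv(runs):
    """Render one metric's scores as CSV text: run id, recall, precision, F."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["run_id", "recall", "precision", "f_score"])
    for run_id in sorted(runs):
        scores = runs[run_id]
        writer.writerow([run_id, scores.get("R"), scores.get("P"), scores.get("F")])
    return out.getvalue()


# Made-up numbers in the assumed format; run 1 appears twice for ROUGE-1.
sample = """\
1 ROUGE-1 Average_R: 0.50000 (95%-conf.int. 0.40000 - 0.60000)
1 ROUGE-1 Average_P: 0.25000 (95%-conf.int. 0.20000 - 0.30000)
1 ROUGE-1 Average_F: 0.30000 (95%-conf.int. 0.25000 - 0.35000)
1 ROUGE-1 Average_R: 0.70000 (95%-conf.int. 0.60000 - 0.80000)
1 ROUGE-2 Average_R: 0.10000 (95%-conf.int. 0.05000 - 0.15000)
"""

scores = collect_scores(sample.splitlines())
for metric in sorted(scores):
    print(f"--- {metric}.csv ---")
    print(to_csv(scores[metric]))
```

Writing each to_csv result to a file named after its metric would reproduce the ROUGE-1.csv and ROUGE-2.csv split described above.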