These are some handy notes for MEAD.
What is MEAD?
MEAD is a publicly available framework for summarization. It is not really an ‘algorithm’ in itself. By default (I guess since it was first implemented) it uses a centroid-based approach, which is still a very nice approach to summarization. Today, many other algorithms have been implemented within the MEAD framework (e.g., LexRank). You can even implement your very own algorithm and experiment with it inside the MEAD framework. Among the many things you can change are the rerankers and the algorithm for scoring sentences.
Scoring of Sentences in MEAD:
By default, sentences are scored using 3 features: sentence length, centroid, and position in the text. Each sentence receives a score that is a linear combination of the listed features, except for the “Length” feature. You can change the feature weights in the MEAD config file or on the command line. Note that “Length”, if it is given, is a cutoff feature: any sentence shorter than “Length” is automatically given a score of 0, regardless of its other features. The default scoring script is bin/default-classifier.pl. Here is an example of changing the parameter values on the command line:
perl bin/mead.pl -classifier "perl bin/default-classifier.pl Length 3 Centroid 4 Position 0" -absolute 3 GA3
The default weights for Centroid and Position are both 1. The default Length cutoff is 9. If you want to add new features to the default-classifier, see section 8.12 of the MEAD documentation.
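To make the scoring rule concrete, here is a rough Python sketch of how I understand it. This is not MEAD's actual code (the real classifier is the Perl script bin/default-classifier.pl). The function name score_sentence and the feature values are made up for illustration, and I am assuming that sentence length is counted in words.

```python
# Rough sketch of the scoring rule described above; NOT MEAD's implementation.
# score_sentence and the sample feature values are hypothetical.

def score_sentence(features, weights=None, length_cutoff=9):
    """Return the weighted sum of features, or 0 if the sentence is too short.

    features      -- dict with "Length" (assumed: word count), "Centroid", "Position"
    weights       -- dict of feature name -> weight (defaults: Centroid 1, Position 1)
    length_cutoff -- "Length" acts as a cutoff only, never as a weighted term
    """
    if weights is None:
        weights = {"Centroid": 1.0, "Position": 1.0}
    # Cutoff: sentences shorter than the threshold score 0 regardless of other features.
    if features["Length"] < length_cutoff:
        return 0.0
    # Linear combination of the remaining features.
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Two made-up sentences with made-up feature values.
sentences = [
    {"Length": 12, "Centroid": 0.8, "Position": 1.0},
    {"Length": 5, "Centroid": 0.9, "Position": 0.5},
]

# Default settings: Centroid 1, Position 1, Length cutoff 9.
default_scores = [score_sentence(s) for s in sentences]

# Settings from the command-line example: Length 3, Centroid 4, Position 0.
custom_scores = [
    score_sentence(s, {"Centroid": 4.0, "Position": 0.0}, length_cutoff=3)
    for s in sentences
]

print(default_scores)
print(custom_scores)
```

With the default cutoff of 9 the second sentence (5 words) scores 0, while with a cutoff of 3 it survives and is ranked purely by its centroid value, since the Position weight is 0.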
Converting HTML/Text files to docsent (text2cluster.pl)
- First, make sure that MEAD_ADDONS_UTIL.pm is in your Perl library path. To do this:
export PERL5LIB=$PERL5LIB:/<mypath>/mead/bin/addons/formatting/
- Next, set the variable $DTD_DIR on line 18 of MEAD_ADDONS_UTIL.pm to the absolute path of the dtd directory. I am not sure exactly what this does, but I just set it before running text2cluster.pl:
$DTD_DIR ="<mypath>/mead/dtd";
- To generate the docsent file, invoke text2cluster.pl on the file. This will generate a file with a .docsent extension:
perl bin/addons/formatting/text2cluster.pl <file_to_convert>