The Berkeley Document Summarizer: Learning-Based Single-Document Summarization

berkeley-doc-summarizer

  The Berkeley Document Summarizer is a learning-based single-document summarization system. It compresses source document text based on constraints from constituency parses and RST discourse parses. Moreover, it can improve summary clarity by reexpressing pronouns whose antecedents would otherwise be deleted or unclear.
  Preamble

  The Berkeley Document Summarizer is described in:
  "Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints" Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. ACL 2016.
  See http://www.eecs.berkeley.edu/~gdurrett/ for papers and BibTeX.
  Questions? Bugs? Email me at [email protected]
  License

  Copyright (c) 2013-2016 Greg Durrett. All Rights Reserved.
  This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
  This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
  You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
  Setup

  Models and Data

  Models are not included in GitHub due to their large size. Download the latest models from http://nlp.cs.berkeley.edu/projects/summarizer.shtml. These are necessary both for training the system (you need the EDU segmenter, discourse parser, and coreference model) and for running it (you need the EDU segmenter, discourse parser, and summarization model, which contains the coreference model). All of these are expected in the models/ subdirectory.
  We also require number and gender data (by default, the system expects this data at data/gender.data).
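  Putting these together, a checkout with the models and data in place would look roughly like this (a sketch; the summarizer model files correspond to the pre-trained variants listed under "Running the system" below):
  models/edusegmenter.ser.gz
  models/discoursedep.ser.gz
  models/summarizer-full.ser.gz   (or another pre-trained summarizer model)
  data/gender.data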
  GLPK

  For solving ILPs, our system relies on GLPK. The easiest way to install GLPK is with Homebrew. Additionally, when running the system, you must have glpk-java-1.1.0.jar on the build path (already done if you're using sbt) and make sure the Java Native Interface (JNI) libraries are accessible. These allow the GLPK Java bindings to interact with the native GLPK code.
  You can try this out with edu.berkeley.nlp.summ.GLPKTest; if this class runs without error, you're good! If you do get an error, you may need to augment the Java library path with the location of the libglpk_java libraries as follows:
  -Djava.library.path="<current library path>:<location of libglpk_java libraries>"
On OS X, this may be located in /usr/local/lib/jni. run-summarizer.sh attempts to set this automatically, but may not work for your system.
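  For example, a quick sanity check on OS X might look like the following (a sketch: the Homebrew step installs GLPK itself, the libglpk_java bindings may need a separate install, and the jar path and JNI directory are assumptions about your machine):
  # install GLPK; the libglpk_java JNI bindings may need to be installed separately
  brew install glpk
  # run the test class with the JNI directory on the Java library path
  java -cp <jarpath> -Djava.library.path="/usr/local/lib/jni" edu.berkeley.nlp.summ.GLPKTest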
  Building from source

  The easiest way to build is with SBT: https://github.com/harrah/xsbt/wiki/Getting-Started-Setup
  then run
  sbt assembly
which will compile everything and build a runnable jar.
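The assembled jar is written under target/ (the exact file name depends on the project and Scala versions); this is the <jarpath> referenced in the commands below.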
  You can also import the project into Eclipse and use the Scala IDE plug-in for Eclipse: http://scala-ide.org
  Running the system

  The two most useful main classes are edu.berkeley.nlp.summ.Main and edu.berkeley.nlp.summ.Summarizer. The former is a more involved harness for training and evaluating the system on the New York Times corpus (see below for how to acquire this corpus), and the latter simply takes a trained model and runs it. Both files contain descriptions of their functionality and command-line arguments.
  An example run on new data is included in run-summarizer.sh. The main prerequisite for running the summarizer on new data is having that data preprocessed in the CoNLL format with constituency parses, NER, and coreference. For a system that does this, see the Berkeley Entity Resolution System. The test/ directory already contains a few such files.
  The summarizer then does additional processing with EDU segmentation and discourse parsing. These use the models that are by default located in models/edusegmenter.ser.gz and models/discoursedep.ser.gz. You can control these with command-line switches.
  The system is distributed with several pre-trained variants:
  • summarizer-extractive.ser.gz: a sentence-extractive summarizer
  • summarizer-extractive-compressive.ser.gz: an extractive-compressive summarizer
  • summarizer-full.ser.gz: an extractive-compressive summarizer with the ability to rewrite pronouns, plus additional coreference features and constraints
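  As a concrete starting point, an invocation along the following lines should work once the models are downloaded (a sketch only: the actual argument names and defaults are defined in run-summarizer.sh and in the header comments of edu.berkeley.nlp.summ.Summarizer, so treat the flags below as assumptions):
  java -Xmx8g -cp <jarpath> -Djava.library.path=<library path>:/usr/local/lib/jni \
    edu.berkeley.nlp.summ.Summarizer \
    -modelPath models/summarizer-full.ser.gz -inputDir test/ -outputDir <output directory>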
  Training

  New York Times Dataset

  The primary corpus we use for training and evaluation is the New York Times Annotated Corpus (Sandhaus, 2007), LDC2008T19. We distribute our preprocessing as standoff annotations that replace words with (line, char start, char end) triples, except for some cases where words are included manually (e.g. when tokenization makes our data non-recoverable from the original file). A few scattered tokens are included explicitly, as are roughly 1% of files for which our system could not find a suitable alignment.
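  For example, a word spanning characters 17 through 25 on line 40 of the original XML file would be stored as the triple (40, 17, 25) rather than as the word itself (the numbers here are purely illustrative).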
  To prepare the dataset, first you need to extract all the XML files from 2003-2007 and flatten them into a single directory. Not all files have summaries, so not all of these will be used. Next, run
  mkdir train_corefner
  java -Xmx3g -cp <jarpath> edu.berkeley.nlp.summ.preprocess.StandoffAnnotationHandler \
    -inputDir train_corefner_standoff/ -rawXMLDir <path_to_flattened_NYT_XMLs> -outputDir train_corefner/
This will take the train standoff annotation files and reconstitute the real files using the XML data, writing to the output directory. Use eval instead of train to reconstitute the test set.
  To reconstitute abstracts, run:
  java -Xmx3g -cp <jarpath> edu.berkeley.nlp.summ.preprocess.StandoffAnnotationHandler \
    -inputDir train_abstracts_standoff/ -rawXMLDir <path_to_flattened_NYT_XMLs> -outputDir train_abstracts/ \
    -tagName "abstract"
and similarly substitute eval for train to reconstitute the test-set abstracts.
  ROUGE Scorer

  We bundle the system with a version of the ROUGE scorer that will be called during execution. rouge-gillick.sh hardcodes command-line arguments used in this work and in Hirao et al. (2013)'s work. The system expects this in the rouge/ROUGE/ directory under the execution directory, along with the appropriate data files (which we've also bundled with this release).
  See edu.berkeley.nlp.summ.RougeComputer.evaluateRougeNonTok for a method you can use to evaluate ROUGE in a manner consistent with our evaluation.
  Training the system

  To train the full system, run:
  java -Xmx80g -cp <jarpath> -Djava.library.path=<library path>:/usr/local/lib/jni edu.berkeley.nlp.summ.Main \
    -trainDocsPath <path_to_train_conll_docs> -trainAbstractsPath <path_to_train_summaries> \
    -evalDocsPath <path_to_eval_conll_docs> -evalAbstractsPath <path_to_eval_summaries> -abstractsAreConll \
    -modelPath "models/trained-model.ser.gz" -corefModelPath "models/coref-onto.ser.gz" \
    -printSummaries -printSummariesForTurk
where <jarpath>, <library path>, and the data paths are instantiated accordingly. The system requires a lot of memory because it caches all 25,000 training documents with their annotations.
  To train the sentence extractive version of the system, add:
  -doPronounReplacement false -useFragilePronouns false -noRst
To train the extractive-compressive version, add:
  -doPronounReplacement false -useFragilePronouns false
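  As a rough guide (inferred from the flag names and the model descriptions above), -doPronounReplacement false and -useFragilePronouns false turn off pronoun rewriting and the associated anaphoricity handling, while -noRst additionally disables RST-based compression, leaving a purely sentence-extractive model.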
The results you get using these commands should be:
  • extractive: ROUGE-1 recall: 38.6 / ROUGE-2 recall: 23.3
  • extractive-compressive: ROUGE-1 recall: 42.2 / ROUGE-2 recall: 26.1
  • full: ROUGE-1 recall: 41.9 / ROUGE-2 recall: 25.7
  (Results are slightly different from those in the paper due to minor changes for this release.)