DeepBank_OneZero

StephanOepen edited this page Oct 28, 2013 · 18 revisions

Background

This page documents Version 1.0 of DeepBank, released in October 2013. In this release, there are annotations for Sections 00–21 of the venerable Wall Street Journal (WSJ) text from the Penn TreeBank (PTB). The selection of sentences included in DeepBank is aligned with the PTB, but otherwise it is fully independent of the original PTB annotations, i.e. none of the linguistic information in DeepBank is derivative of the PTB.

Treebank Counts

Sections 00–21 comprise a total of 43,541 sentences, of which 37,085 (or 85.2%) have manually validated HPSG analyses. In a small number of cases (167 sentences), there is more than one gold-standard HPSG analysis; for another 27 sentences (not overlapping with the ambiguously annotated cases), the annotator has indicated a minor deficiency in the HPSG analysis. To reflect this latter distinction, we occasionally talk about gold- vs. silver-standard annotations.

For the almost 15% of sentences for which the HPSG system either did not provide any candidate analyses (within certain bounds on time and memory), or where all available analyses were rejected during annotation, we seek to fill the resulting ‘coverage gap’ in the treebank through automated parsing with the robust, approximative parser of Zhang & Krieger (2011).
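
The coverage figures quoted above can be reproduced from the two sentence counts. The snippet below is only an illustrative sketch: the variable and function names are hypothetical and are not part of any DeepBank tooling, and the only inputs are the totals stated on this page.

```python
# Counts as stated on this page for WSJ Sections 00-21.
TOTAL_SENTENCES = 43_541   # all sentences in the selection
VALIDATED = 37_085         # sentences with manually validated HPSG analyses


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Sentences without a validated analysis (the 'coverage gap').
gap = TOTAL_SENTENCES - VALIDATED

coverage_pct = percentage(VALIDATED, TOTAL_SENTENCES)
gap_pct = percentage(gap, TOTAL_SENTENCES)

print(f"validated: {VALIDATED} sentences ({coverage_pct:.1f}%)")
print(f"coverage gap: {gap} sentences ({gap_pct:.1f}%)")
```

The gap is simply the complement of the validated set, which is why the text describes it as almost 15% of the sentences.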
