This page describes the DeepBank project. For more details about the beta-release of the DeepBank (v0.9), please read here.
The DeepBank project aims to annotate the one million words of 1989 Wall Street Journal text (the same set of sentences annotated in the original Penn Treebank project) with the English Resource Grammar (ERG), assisted by a robust approximating PCFG to achieve complete coverage. DeepBank contains rich linguistic annotation of both the syntactic and the semantic structures of the sentences and is available in a variety of representation formats (see the description of formats below).
The project is hosted at the Department of Computational Linguistics of Saarland University and the Language Technology Lab of the German Research Center for Artificial Intelligence in Saarbrücken, Germany, in close collaboration with CSLI Stanford. Other institutes, including (but not limited to) Humboldt University of Berlin and the University of Oslo, have also contributed to the development and release of the resource. In the long term, the DeepBank will be further supported by the DELPH-IN community with updates and maintenance.
The project builds on resources that have grown out of long-term grammar and software engineering under the collaborative umbrella of DELPH-IN. Following earlier practice in the development of the Redwoods Treebanks, manual annotation is done in the discriminant-based treebanking environment provided by [incr tsdb()], on top of the candidate deep analyses proposed by the English Resource Grammar.
For the first public release of the DeepBank, most of the data has gone through at least two rounds of human annotation by independent annotators. The linguistic analyses in DeepBank were also made independently of the previous treebank annotations of the same data (i.e. PTB), which distinguishes DeepBank from other PTB-derived treebanks (e.g. the Enju HPSG treebank, CCGBank, and the CoNLL syntactic dependency bank, to name a few).
For completeness of the annotation, the public release of the DeepBank also includes a set of analyses licensed by an approximating PCFG for sentences not correctly analysed by the current version of the ERG. Semantic structures are also composed robustly for these sentences.
Stages of Development
The development of DeepBank started in the fall of 2008 as an internally funded project at the Department of Computational Linguistics, Saarland University and the LT-Lab of DFKI, under the supervision of Valia Kordoni and Yi Zhang. Thanks to partial financial support from the Erasmus Mundus European Masters Program in Language and Communication Technologies (LCT), part-time student annotators were employed and trained for the first round of annotation. Dan Flickinger, the main ERG developer, has provided grammar updates throughout the project. He also carried out a complete and thorough second round of annotation, which makes up the first public release of the DeepBank. Both the ERG and the DeepBank have evolved significantly over the course of the project, but the dynamic nature of the annotation has kept them in unison.
By the summer of 2012, the development of DeepBank had reached a mature stage, with a significant amount of the data having gone through two rounds of careful annotation. Since then, the resource has been open for internal review (alpha release) among several sites, including the University of Oslo, the University of Washington, Melbourne University, the University of Barcelona, the Bulgarian Academy of Sciences, and the University of Lisbon. Their many suggestions and feedback helped us prepare for the public release of the DeepBank.
At the end of November 2012, part of the DeepBank (WSJ sections 00-15) was opened for public preview through a beta release announced at TLT in Lisbon. The beta version (v0.9) is now available for download. It includes annotation only for WSJ sections 00-15, in the original [incr tsdb()] format. Further sections and other formats will follow in the final release (v1.0), which is expected in January 2013.
The public release (v1.0) of the DeepBank will include annotation in multiple formats. The combination of the raw [incr tsdb()] profiles with a corresponding version of the ERG can reconstruct all detailed analyses. The HPSG derivations and the MRSes are recorded in these profiles and can be extracted directly.
For convenience, DeepBank is also available in other representation formats (though not all details are preserved in the converted representations), including a (modified) Penn-style constituent tree representation with labeled brackets and a CoNLL-style syntactic and semantic dependency representation in tab-separated format. The conversion software will be available to the public and maintained collaboratively between Oslo and Saarbrücken.
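As a rough illustration of how the two converted formats might be consumed, the following minimal Python sketch reads a labeled-bracket tree and a tab-separated, CoNLL-style token file. It is not the project's conversion software: the function names (`parse_brackets`, `leaves`, `read_conll`), the example sentence, and the column layout in the usage lines are assumptions made for illustration only, and the actual DeepBank files may use different labels and columns.

```python
def parse_brackets(text):
    """Parse a labeled-bracket string into nested (label, children) tuples.

    Leaves are plain strings. Assumes whitespace-separated labels and
    words, e.g. "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the terminal words of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def read_conll(lines):
    """Group tab-separated token rows into sentences.

    Blank lines separate sentences; each token is returned as a list of
    its column values. No particular column layout is assumed.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# Usage with made-up data (hypothetical labels and columns):
tree = parse_brackets("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))")
words = leaves(tree)
sents = read_conll(["1\tThe\tDT\t2\tdet\n", "2\tcat\tNN\t0\troot\n", "\n",
                    "1\tHi\tUH\t0\troot\n"])
```

Keeping the readers independent of any fixed label set or column count means the sketch does not need to guess at details of the converted representations that are not described here.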
For further information about the treebank, please feel free to contact Yi Zhang.
We are grateful to the Erasmus Mundus European Masters Program in Language and Communication Technologies (LCT, EM Grant Number: 2007-0060) for the financial support of the project.
We are equally grateful to the following student annotators for their diligent and patient work. All remaining errors in the treebank are of course ours.
- Ming Wen
- Maria Sukhareva
- Lea Frermann
- Iliana Simova
The involvement of Yi Zhang in the project is also partially sponsored by the German Cluster of Excellence on "Multimodal Computing and Interaction" (MMCI) funded by the DFG, and the Deependance project funded by BMBF (01IW11003).
- Dan Flickinger, Valia Kordoni, and Yi Zhang. DeepBank: A Dynamically Annotated Treebank of the Wall Street Journal. In Proceedings of TLT-11, Lisbon, Portugal, 2012.
- Angelina Ivanova, Stephan Oepen, Lilja Øvrelid, and Dan Flickinger. Who did what to whom? A contrastive study of syntacto-semantic dependencies. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 2–11, Jeju, Republic of Korea, 2012.
- Yi Zhang and Hans-Ulrich Krieger. Large-scale corpus-driven PCFG approximation of an HPSG. In Proceedings of the 12th International Conference on Parsing Technologies, pages 198–208, Dublin, Ireland, 2011.
- Yi Zhang and Valia Kordoni. Discriminant Ranking for Efficient Treebanking. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, 2010.
- Valia Kordoni and Yi Zhang. Disambiguating Compound Nouns for a Dynamic HPSG Treebank of Wall Street Journal Texts. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Malta, 2010.
- Valia Kordoni and Yi Zhang. Annotating Wall Street Journal Texts Using a Hand-Crafted Deep Linguistic Grammar. In Proceedings of the Third Linguistic Annotation Workshop, Singapore, 2009.