Overview
The LinGO Redwoods treebank is a collection of hand-annotated corpora analysed with the LinGO ERG. For each utterance from a corpus, the treebank records (in principle) all analyses hypothesized by the grammar, together with an annotator decision as to which reading is preferred in context.
The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally-available broad-coverage grammar of English, viz. the LinGO English Resource Grammar. In contrast to existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank (and, consequently, less dissemination effort in establishing this representation). Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG; tools are provided to extract many different types of linguistic information at varying granularity.
Other relevant aspects of the Redwoods treebank include the integration of alternate, though dispreferred, analyses for each utterance and the dynamic nature of the annotations: as the underlying grammar evolves and improves its analyses, there is a provision for a (nearly) fully automated update of the treebank against a version of the original corpus analysed with the revised grammar. As a methodological result, parts of the Redwoods data are now regularly maintained as part of the grammar regression cycle with each new release of the ERG.
Current Development Status
As of October 2011, we are in the process of releasing the 45,000-sentence Seventh Growth, a substantially enlarged new revision of the Redwoods treebank, consisting of data sets from several distinct domains, including not only the Verbmobil and ecommerce corpora from earlier releases, but now also data from the LOGON Norwegian-English MT corpus, the WeScience 100-article portion of the English Wikipedia and a portion of the semantically tagged subset of the Brown corpus (SemCor). The version of the grammar used in parsing this data is "ERG (1110)".
The following table summarizes the Seventh Growth in terms of the total number of utterances, average string length, and average ambiguity rates for three sub-divisions, viz. rejected items (t-active = 0), fully disambiguated items (t-active = 1), and a small number of items for which annotators considered more than one analysis active (t-active > 1), typically where the ambiguity resides in tokenization alternatives. The profile name abbreviations are as follows: CB = Cathedral and Bazaar essay, CSLI = syntactic test suite, EC* = ecommerce corpus, FRACAS = semantic test suite, HIKE, JH*, PS*, RONDANE, and TG* = LOGON corpus, MRS = semantic test suite, RTC* = Tanaka corpus, SC* = SemCor corpus, TREC = TREC 9 corpus, VM* = Verbmobil corpus, WS* = WeScience corpus.
Profile  Items Parsed |   t-active = 0     |   t-active = 1     |   t-active > 1
                      | items length ambig | items length ambig | items length ambig
----------------------+--------------------+--------------------+-------------------
CB         769    677 |    96  28.58   440 |   581  17.82   312 |     0   0.00     0
CSLI      1348    917 |     0   0.00     0 |   917   6.44     8 |     0   0.00     0
ECOC      1254   1216 |    34  11.47   254 |  1181   7.57    88 |     1  11.00   222
ECOS      1678   1596 |    88  10.87   173 |  1505   8.43   102 |     3   5.00    29
ECPA      1654   1580 |    92  11.35   170 |  1486   8.22    79 |     2   7.50    20
ECPR      1207   1168 |    66  12.14   177 |  1102   9.32   110 |     0   0.00     0
FRACAS     640    636 |     5  12.60   196 |   631   7.60    50 |     0   0.00     0
HIKE       330    329 |     2  13.50   327 |   327  12.85   192 |     0   0.00     0
JH0        261    247 |     8  31.00   473 |   239  18.85   358 |     0   0.00     0
JH1       1353   1319 |    29  21.86   341 |  1287  13.17   217 |     3  20.67   500
JH2       1307   1240 |    84  20.64   409 |  1153  13.68   236 |     3  12.67   500
JH3       1443   1401 |    70  23.70   434 |  1329  12.86   211 |     2  21.50   278
JH4       1603   1540 |    58  21.57   374 |  1479  12.83   216 |     3  31.33   500
JH5        464    437 |    21  21.57   281 |   416  12.09   203 |     0   0.00     0
JHK        250    245 |    11  19.27   407 |   234  12.49   190 |     0   0.00     0
JHU        294    286 |     8  25.75   438 |   278  12.87   200 |     0   0.00     0
MRS        107    107 |     0   0.00     0 |   107   4.47     3 |     0   0.00     0
PSK         45     42 |     2   4.00    22 |    40   9.37    91 |     0   0.00     0
PS         965    932 |    26  20.08   382 |   903  13.64   229 |     3  20.67   392
PSU         45     42 |     5   6.20     6 |    37  12.59   214 |     0   0.00     0
RONDANE   1402   1271 |    97  22.10   405 |  1170  14.36   250 |     4  20.00   403
RTC000    1500   1442 |    32  16.34   155 |  1410  11.46    42 |     0   0.00     0
RTC001    1500   1440 |    47  15.45   114 |  1392  11.50    46 |     1  18.00   500
SC01      1000    918 |    73  27.21   380 |   845  15.34   236 |     0   0.00     0
SC02      1103   1006 |    95  26.46   395 |   906  15.07   238 |     5  30.20   500
SC03      1000    922 |    77  26.84   387 |   843  14.75   234 |     2  11.00   251
TG1       1013    970 |    62  20.47   384 |   907  13.51   235 |     1  24.00   500
TG2       1001    958 |    57  20.95   343 |   900  14.14   248 |     1   8.00     8
TGK         90     84 |     4  16.50   381 |    78  14.68   249 |     2  18.00   333
TGU         90     88 |     4  27.75   365 |    83  14.08   296 |     1  13.00   417
TREC       693    685 |     5  11.60   132 |   680   6.91    34 |     0   0.00     0
VM6       4037   3883 |   201  11.41   248 |  3668   7.53   150 |    13  11.46   212
VM13      3408   3256 |   174  13.52   279 |  3075   8.08   153 |     4  12.00   447
VM31      3914   3763 |   164  10.68   205 |  3595   5.90    91 |     2   9.00   257
VM32      1034   1013 |    21  11.48   172 |   992   7.46   127 |     0   0.00     0
WS01       805    707 |    90  28.46   425 |   615  15.76   251 |     2  21.50   262
WS02       946    880 |    66  27.55   433 |   810  15.15   264 |     4  27.25   500
WS03       920    821 |    78  24.62   403 |   740  14.76   255 |     3  29.33   500
WS04       988    884 |   103  27.79   444 |   775  14.93   247 |     6  22.17   421
WS05       911    774 |   106  27.02   431 |   660  15.48   265 |     8  11.87   236
WS06       890    791 |    73  23.56   389 |   713  15.28   272 |     5  19.80   405
WS07       807    723 |    63  25.98   442 |   649  14.68   250 |    11  19.36   361
WS08       904    791 |    69  29.30   439 |   709  17.26   266 |    13  20.69   342
WS09       940    861 |    48  24.79   407 |   812  14.58   255 |     1  23.00   500
WS10       914    815 |    93  25.98   387 |   710  15.34   248 |    12  27.67   460
WS11       746    660 |    60  28.87   449 |   598  14.83   266 |     2  11.00   269
WS12       786    682 |    53  25.13   402 |   627  16.37   295 |     2  22.50   268
WS13      1001    888 |    74  26.22   425 |   800  14.78   255 |    14  25.64   464
----------------------+--------------------+--------------------+-------------------
Totals   51360  47933 |  2794              | 44994              |   139
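The per-profile counts lend themselves to simple derived measures. As a minimal sketch (not part of the Redwoods tools), the following Python computes parsing coverage (Parsed / Items) and the share of parsed items that were fully disambiguated (t-active = 1 items / Parsed) for a few rows copied from the table. The function and variable names are illustrative, and the sketch assumes that "Parsed" counts items that received at least one analysis.

```python
# Rows copied from the table above:
# profile -> (items, parsed, fully disambiguated items with t-active = 1).
ROWS = {
    "CSLI": (1348, 917, 917),
    "VM6": (4037, 3883, 3668),
    "WS01": (805, 707, 615),
    "Totals": (51360, 47933, 44994),
}


def coverage(items, parsed):
    """Fraction of items that were parsed (assumed: at least one analysis)."""
    return parsed / items


def disambiguated(parsed, active_one):
    """Fraction of parsed items with exactly one active analysis."""
    return active_one / parsed


for name, (items, parsed, active_one) in ROWS.items():
    print(f"{name:8s} coverage {coverage(items, parsed):6.1%}  "
          f"disambiguated {disambiguated(parsed, active_one):6.1%}")
```

Computed this way, the contrast between domains (for example the Verbmobil dialogues versus the WeScience articles) can be read off directly for any subset of profiles.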
Earlier relevant Redwoods revisions include the Second Growth, Third Growth, and Fifth Growth.
Data Format
Like the previous Redwoods Fifth Growth revision, the Seventh Growth is distributed in [incr tsdb()] profile form exclusively (see below for instructions on how to expand the data into a textual export format), but we have limited the number of dispreferred analyses per item to a maximum of the 500 best analyses according to our MaxEnt model trained on an interim version of this treebank. In principle, Redwoods users could use the LKB or PET parsers to obtain the complete set of analyses and then use the [incr tsdb()] update facility to automatically produce a version of the treebank against the unrestricted profile. However, we expect that the reduced distribution provides a sufficiently large portion of the dispreferred analyses for high-quality stochastic modelling and that the substantial reduction in overall size will actually benefit experimentation.
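To illustrate the kind of truncation described above, here is a minimal sketch that keeps the preferred reading(s) of an item plus the highest-scoring dispreferred ones. It is not the actual [incr tsdb()] procedure: the `Analysis` record, its fields, the toy scores, and the choice to count only dispreferred readings against the limit are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    result_id: int    # hypothetical identifier of one reading
    score: float      # hypothetical model score, higher is better
    preferred: bool   # True if the annotator marked this reading as active


def truncate(analyses, limit=500):
    """Keep all preferred readings plus the `limit` best dispreferred ones."""
    preferred = [a for a in analyses if a.preferred]
    dispreferred = sorted(
        (a for a in analyses if not a.preferred),
        key=lambda a: a.score,
        reverse=True,
    )
    return preferred + dispreferred[:limit]


# Toy item: 1000 readings, of which reading 7 is the preferred one.
item = [Analysis(i, score=1.0 / (i + 1), preferred=(i == 7)) for i in range(1000)]
kept = truncate(item)
print(len(kept))
```

Keeping the preferred reading regardless of its score matters for training data: a model that ranks the gold analysis outside its own top 500 should still see that analysis.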
(Note: the installation instructions below still need to be updated for the LOGON 'redwoods' script.)
See the LkbInstallation instructions for details, but the following should be sufficient to obtain a full installation of the LKB, ERG, [incr tsdb()], and Redwoods Seventh Growth data for the Linux (x86) environment. The choice of DELPHINHOME, the root directory for the DELPH-IN source tree, can of course be varied; the example below assumes a sub-directory 'delphin' in the user home directory:
export DELPHINHOME=${HOME}/delphin
wget http://lingo.delph-in.net/etc/install
bash install --redwoods
Expanding and Exporting
Assuming a functional installation of the LKB, ERG, and [incr tsdb()], the process of exporting all or parts of the Redwoods Seventh Growth data into a collection of plain text files can be fully automated using a shell script provided in the [incr tsdb()] data directory. By default, the script will include the following representations in the export:
- derivation tree: primary, labeled in terms of grammar-internal identifiers;
- phrase structure tree: derived, labeled using a set of abbreviatory symbols;
- attribute value matrix: derived, the full HPSG sign, including all daughters;
- MRS: derived, in two flavours ('raw' and 'indexed'), meaning representation;
- dependencies: derived, elementary dependency relations (reduced form of MRS).
(Note: the text below is flagged for revision.)
Setting the parameter *redwoods-export-values* in the script (see below) to a sub-set of the above may result in significant savings in export time and disk space requirements. The default set of (close to all) export representations requires several CPU days and around 20 gigabytes of disk space (as a set of gzip(1)-compressed files) for the full Redwoods Seventh Growth. Following is an example session to export just the VM6 section:
cd $DELPHINHOME/lkb/src/tsdb/home
./export redwoods/jun-04/vm6/04-06-11
A full export can be fairly memory-intensive for highly ambiguous items, so it is advisable to run the above on a suitable machine (with, say, two gigabytes of RAM or more). Consult the export script for further configuration options, and ItsdbTreebanking/ItsdbExporting for the various possible formats.
For example, to export triples from only the first parse of a non-treebanked profile:
./export --binary --condition "result-id=0" --format triples PROFILE
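Since the exported files are gzip(1)-compressed plain text, they can be read without unpacking them first. The sketch below assumes an export directory that directly contains `.gz` files with UTF-8 text; the directory layout, the encoding, and the helper name `read_export` are assumptions for illustration and should be adjusted to what the export script actually produces.

```python
import gzip
from pathlib import Path


def read_export(directory):
    """Yield (file name, text) for every gzip-compressed file in `directory`.

    Assumes a flat directory of .gz files containing UTF-8 text.
    """
    for path in sorted(Path(directory).glob("*.gz")):
        with gzip.open(path, mode="rt", encoding="utf-8") as stream:
            yield path.name, stream.read()
```

A caller would iterate over the generator, for instance `for name, text in read_export(some_directory): ...`, and hand each text to whatever parser suits the chosen export format.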
Bibliography
Following is an incomplete selection of publications on the creation and use of the Redwoods treebank.
Oepen, Stephan, Kristina Toutanova, Stuart Shieber, Christopher Manning, Dan Flickinger, and Thorsten Brants (2002). The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan (pages 1253-1257).
Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2002). LinGO Redwoods. A Rich and Dynamic Treebank for HPSG. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina, Christopher D. Manning, and Stephan Oepen (2002). Parse Ranking for a Rich HPSG Grammar. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina and Christopher D. Manning (2002). Feature Selection for a Rich HPSG Grammar Using Decision Trees. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002), Taipei, Taiwan.
Velldal, Erik, Stephan Oepen, and Dan Flickinger (2004). Paraphrasing Treebanks for Stochastic Realization Ranking. In Proceedings of The Third Workshop on Treebanks and Linguistic Theories (TLT 2004), Tuebingen, Germany.
An overview of many of the methodological aspects of the Redwoods initiative is available in an invited presentation given at the 2003 Treebanks and Linguistic Theories workshop.
Acknowledgements
The Redwoods treebank has been under active development at the CSLI LinGO Laboratory since early 2001. The annotation environment was built from the combination of the LKB tree comparison window (originally developed by Rob Malouf) and the [incr tsdb()] profiling tools; Stephan Oepen did the bulk of the Redwoods software development. Dan Flickinger, as the main developer of the ERG, has been an invaluable source of inspiration on the treebank design and has also been the main treebanker since Redwoods Second Growth. Chris Manning, Kristina Toutanova, and Stuart Shieber, as early adopters and consultants on the overall design of the resource and representations, have greatly influenced the evolution of the treebank and pioneered its use for stochastic parse selection. Ezra Callahan was the first annotator, constructing what has been released as the First Growth during a ten-week summer internship. John Beavers did the annotations of the new ecommerce sections. Francis Bond and his colleagues at the NTT Research Laboratory have been vigorous supporters, adapted the Redwoods approach for Japanese (dubbing their treebank Hinoki), and thus helped considerably in scaling up the technology. Marty Mayberry, Jason Baldridge, Alex Lascarides, and Miles Osborne, as active users of the ERG and Redwoods data, have provided crucial feedback on the representations and software and positively contributed to recent developments. Tim Baldwin, Emily M. Bender, Kathryn Campbell-Kibler, Ann Copestake, Andreas Eisele, Rob Malouf, Rebecca Neil, Ivan Sag, Erik Velldal, and Tom Wasow have all helped through advice and productive critique in various stages of the project.
The development of the Redwoods treebank was financed opportunistically from numerous sources, including multiple donations to CSLI from YY Technologies (Mountain View, CA), a CSLI Seeding Grant, the Stanford Symbolic Systems Program (through multiple sponsored summer internships), the Commission of the European Community (through the Deep-Thought project), Scottish Enterprise (through the ROSIE project), Nippon Telegraph and Telephone Corporation (NTT) (through a sponsored research contract to the LinGO Laboratory), and the Norwegian LOGON Initiative (through financial support to Dan Flickinger and Stephan Oepen).