This page contains a summary of the discussion of the Birds of a Feather meeting held at the 2011 ACL on June 19, 2011. We are grateful to the participants for their valuable input.

Main points

Possible uses

Community uptake will be motivated by the sheer size of the dataset, but will depend on its accessibility. Provide scripts that create views of the data that look as familiar as possible, as a bridge to using the more complete information. User to keep in mind: a graduate student exploring possible resources to use in a new project.

Corpus fundamentals

Output formats

If we don't create the mappings from our representations to the current de facto standards, then others will, redundantly and most likely less accurately, since they would lack a complete understanding of the inputs.

Create a library of scripts to map to at least:

in at least two ways each:

  1. Superficial form
    • dependencies: format of triples, labels of dependency types
    • phrase structure: tokenization, label set
  2. Closest possible match
    • dependencies: put back semantically empty words, match direction of dependency, break up MWE relations (_get_along -> get along)
    • phrase structure: put in traces, move punctuation attachment
  3. "core dependencies": no semantically empty words, but in a format that can be used for evaluation of something trained on closest possible match view.
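
As a rough illustration of two of the dependency mappings listed above (breaking up MWE relations and producing a "core dependencies" view), a minimal Python sketch might look like the following. The `Dep` triple layout, the `EMPTY_WORDS` list, and both function names are assumptions made for this example; they are not part of any existing DELPH-IN tooling.

```python
from typing import List, NamedTuple


class Dep(NamedTuple):
    """One dependency triple (hypothetical layout: head, label, dependent)."""
    head: str
    label: str
    dependent: str


# Assumption: a toy list of semantically empty words, for illustration only.
EMPTY_WORDS = {"to", "of", "that"}


def split_mwe(token: str) -> List[str]:
    """Break an MWE-style token such as '_get_along' into its words."""
    return [part for part in token.split("_") if part]


def core_dependencies(deps: List[Dep]) -> List[Dep]:
    """Drop triples whose head or dependent is a semantically empty word.

    This sketch only filters; a real mapping would also reattach any
    dependent left stranded by the removal.
    """
    return [
        d for d in deps
        if d.head not in EMPTY_WORDS and d.dependent not in EMPTY_WORDS
    ]


deps = [
    Dep("wants", "subj", "Kim"),
    Dep("wants", "comp", "to"),
    Dep("to", "comp", "sleep"),
]
core = core_dependencies(deps)
```

Here `split_mwe("_get_along")` gives the two words `get` and `along`, and `core` keeps only the triple that involves no semantically empty word.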

Do this first, possibly on WikiWoods/WSJ data, to learn what information might be needed from human annotators before we begin the hand-annotation process.

If we provide multiple trees for some or all sentences, consider a forest representation. Unpacking tools can be expensive to write, so if unpacking is too difficult, the resource likely won't get uptake.
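
To give a sense of what unpacking involves, here is a minimal sketch of a packed forest and an unpacker, assuming a deliberately simple encoding: a dictionary from node ids to lists of alternative analyses, each a label plus a list of child node ids. The encoding, the node ids, and the `unpack` function are hypothetical, chosen only for illustration.

```python
from itertools import product


def unpack(forest, node):
    """Yield every tree rooted at `node` as a nested (label, children) tuple."""
    for label, children in forest[node]:
        # Enumerate every combination of the children's own unpackings.
        child_trees = [list(unpack(forest, child)) for child in children]
        for combo in product(*child_trees):
            yield (label, combo)


# Toy forest: one subject analysis, two competing analyses of the verb phrase.
forest = {
    "root": [("S", ["subj", "vp"])],
    "subj": [("NP", [])],
    "vp": [("VP_attach_high", []), ("VP_attach_low", [])],
}
trees = list(unpack(forest, "root"))
```

In this toy case `trees` holds two trees, one per verb phrase analysis. Real forests can unpack to exponentially many trees, which is part of why shipping a ready-made unpacking tool matters for uptake.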

Resource documentation

The interpretation of the annotations (especially in our "native" output formats) needs to be accessible without digesting the entire ERG.

Include information on mapping from one level of representation to the other (cf. Hindi/Urdu treebank project).

Annotation quality assurance

Quality of annotation depends on the usability of the tool and on the extent to which it makes it easy for annotators to maintain consistency. In addition, the project will need to set up ongoing QA procedures to make sure annotators don't drift away from the guidelines.

Consistency of both annotator decisions and grammar writer decisions is important. (Cf. examples of group of noun vs. name of noun and today vs. day after tomorrow.)

Specific general suggestions:

Specific tool suggestions:

Things to consider adding to annotation

Things already there in the annotation that will be a win

How to handle inputs without (correct) spanning parses

BirdsofaFeather2011Summary (last edited 2011-10-08 21:12:15 by localhost)