
ReproducibilityStandards

PeterAdolphs edited this page Jul 5, 2010 · 6 revisions

To make experiments that use DELPH-IN data and tools comparable and reproducible, this page documents the standard training and test data sets for each grammar, along with standard evaluation metrics and terminology. We encourage everyone to use the standards listed here, or to describe any deviations in terms of these standards.

Data

Evaluation Metrics

Coverage

  • observed coverage: percentage of items that received at least one parse
  • verified coverage: percentage of items for which a gold standard analysis was found during treebanking
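
As a rough illustration of the two coverage definitions above, the following sketch computes both percentages over a small made-up data set. The `Item` class, its field names, and the sample values are assumptions made for this example only; they do not correspond to any DELPH-IN tool or profile format.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One test item; all field names are hypothetical."""
    item_id: int
    n_parses: int   # number of analyses the parser returned
    has_gold: bool  # True if treebanking found a gold standard analysis


def coverage(items):
    """Return (observed, verified) coverage as percentages of all items."""
    total = len(items)
    if total == 0:
        raise ValueError("empty data set")
    observed = sum(1 for it in items if it.n_parses >= 1)
    verified = sum(1 for it in items if it.has_gold)
    return 100.0 * observed / total, 100.0 * verified / total


# Made-up sample: three of four items parse, two have a gold analysis.
items = [Item(1, 3, True), Item(2, 0, False), Item(3, 12, False), Item(4, 1, True)]
observed, verified = coverage(items)
print(f"observed: {observed:.1f}%  verified: {verified:.1f}%")
```

Both figures here use the full data set as the denominator, which matches the wording "percentage of items" above.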

Accuracy

It is important to specify whether these metrics are calculated over:

  • all items in a data set
  • all items that have a gold standard analysis
  • all items that received a parse
  • the intersection of the last two, i.e. items that have a gold standard analysis and also received a parse
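
The four options above can be sketched as plain set operations, which makes it explicit which items end up in the denominator of a metric. The tuple layout `(item_id, has_gold, has_parse)` and the subset names are assumptions for illustration, not part of any DELPH-IN standard.

```python
def evaluation_subsets(items):
    """Map a (hypothetical) name for each option to the set of item ids it covers.

    items: iterable of (item_id, has_gold, has_parse) tuples.
    """
    items = list(items)
    all_ids = {i for i, _, _ in items}
    gold = {i for i, g, _ in items if g}
    parsed = {i for i, _, p in items if p}
    return {
        "all": all_ids,
        "gold": gold,
        "parsed": parsed,
        "gold_and_parsed": gold & parsed,
    }


# Made-up sample data.
data = [(1, True, True), (2, False, False), (3, False, True), (4, True, False)]
subsets = evaluation_subsets(data)
for name, ids in subsets.items():
    print(name, len(ids))
```

Reporting the name and size of the chosen subset next to each accuracy figure lets readers compare results that were computed over different item sets.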

metrics
