Discussion on Reproducibility from the 6th DELPH-IN Summit
Moderator: StephanOepen; Scribe: FrancisBond
Start with mail from Rebecca:
Now that we are slowly seeing more DELPH-IN work getting published, it should be easier to tick off the requirement for "comparison to previous work", but that is not as easy as it could be. One thing that would make it easier is to have an up-to-date publications list, perhaps even by topic, so anyone who gets, for example, a parse selection paper accepted adds it to the wiki under parse selection.
Another, more necessary thing would be using comparable data. Many of our papers seem to use slightly different (often unspecified) test and training sets, even when we are using the same basic data. A few suggestions for that problem:
* clearly document test and training splits on released profiles
* where possible, stick to previously used data sets
* when creating new corpora, consider creating a held-out test set (perhaps by taking out every x-th item) and designating it as testing from the start
* annotate the publication list above with technical details that weren't perhaps relevant to the paper, but would help someone else replicate the results (data set, SVN version, cheap options, etc.)
I'd like to see a discussion of whether any of the above suggestions could work, and also whether there are other practical suggestions to help get more DELPH-IN work published (e.g. particular results that need to be published so they can be referenced?)
YiZhang: we were having issues with people not getting the same results, and with results being hard to verify --- different data sets may give differences of more than 20% exact match for parsing accuracy. We should set up a standard train/dev/test split for various grammars.
Oe: some people also use 10-fold
Yi: keeping the same split means that there will be no differences in e.g. unknown words.
Ann: But we don't want to fall into the trap of making one standard data set and only doing that (the WSJ problem). Also add additional metrics --- 10-fold, additional data sets, ... We don't want to use the same test set for 20 years.
Antske: the fixed set is important for a baseline to check that all is working
Oe: as we are setting a new standard, we want to do it right.
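The two splitting schemes raised so far (a fixed held-out test set taken as every x-th item, and 10-fold cross-validation) can both be made deterministic, so that anyone starting from the same item identifiers gets the same split. The following is only a minimal sketch under stated assumptions: items are identified by integer ids, and the function names `split_items` and `assign_folds` are hypothetical, not part of any existing DELPH-IN tool.

```python
from typing import Dict, List, Sequence


def split_items(item_ids: Sequence[int], every: int = 10) -> Dict[str, List[int]]:
    """Hold out every `every`-th item (by position in sorted id order) as test data."""
    if every < 2:
        raise ValueError("every must be at least 2")
    split: Dict[str, List[int]] = {"train": [], "test": []}
    # Sorting first makes the split independent of the order the ids arrive in.
    for position, item_id in enumerate(sorted(item_ids), start=1):
        key = "test" if position % every == 0 else "train"
        split[key].append(item_id)
    return split


def assign_folds(item_ids: Sequence[int], k: int = 10) -> Dict[int, List[int]]:
    """Assign items to k cross-validation folds in round-robin order."""
    if k < 2:
        raise ValueError("k must be at least 2")
    folds: Dict[int, List[int]] = {fold: [] for fold in range(k)}
    for position, item_id in enumerate(sorted(item_ids)):
        folds[position % k].append(item_id)
    return folds


if __name__ == "__main__":
    split = split_items(range(1, 26), every=5)
    print(split["test"])  # [5, 10, 15, 20, 25]
```

Because neither function uses random sampling, the split is fully described by the item list and the value of `every` or `k`, which is the kind of detail a Readme would need to record.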
Ann: CL is trying to guarantee that papers are reproducible --- so we need to store the treebank, grammar, parser etc.
Antske: This is hard to do, as there are so many parts.
Oe: it is all open source. So it should be reproducible. We should be able to point to the exact branch and the revision.
Yi: sometimes you don't need to reproduce everything, but we should be able to compare on the same data
Francis: we may need to be more aggressive about putting stuff in the svn and allowing experimental branches.
Oe: Perhaps we need more documentation on various data sets and their standard test/dev/train splits.
Francis: perhaps we should put it in a Readme in the skeleton file, which goes into the svn and then we can link to it from the wiki.
Oe, Emily: that makes sense
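One way to keep the details mentioned above (branch, revision, parser options, data split) together with a Readme is a small machine-readable record. The sketch below is illustrative only: the class `ExperimentRecord`, its field names, and every value in the example are placeholders invented for this illustration, not an agreed DELPH-IN format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ExperimentRecord:
    """Hypothetical record of what is needed to rerun an experiment."""
    grammar: str               # which grammar was used
    branch: str                # repository branch
    revision: str              # exact revision of that branch
    parser_options: List[str]  # options passed to the parser
    profile: str               # name of the profile or treebank
    split: Dict[str, str] = field(default_factory=dict)  # how the data was divided

    def to_json(self) -> str:
        # Sorted keys give a stable file that diffs cleanly under version control.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# All values below are placeholders, not real revisions or options.
record = ExperimentRecord(
    grammar="example-grammar",
    branch="trunk",
    revision="r0000",
    parser_options=["--example-option"],
    profile="example-profile",
    split={"test": "every 10th item", "train": "all remaining items"},
)
print(record.to_json())
```

A file like this could sit next to the Readme in the skeleton directory and be linked from the wiki in the same way.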
Oe: We could also do a bit more about standard metrics.
coverage --- raw parser, survival (observed and verified). [Not raw and cooked]
precision --- exact match, n-best, ???, granular matches (from Rebecca), labelled match, dmrs style metrics
efficiency --- over which subsets, ....
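To make two of the metrics listed above concrete, here is a rough sketch of coverage and of exact match / n-best accuracy. It rests on simplifying assumptions that are not taken from the tsdb manual: each analysis is represented as a plain string, the parser output is a ranked list per item, and both scores are computed over all items rather than only over parsed items. The function names are hypothetical.

```python
from typing import Dict, List


def coverage(results: Dict[int, List[str]]) -> float:
    """Fraction of items for which the parser produced at least one analysis."""
    if not results:
        return 0.0
    parsed = sum(1 for analyses in results.values() if analyses)
    return parsed / len(results)


def n_best_accuracy(results: Dict[int, List[str]],
                    gold: Dict[int, str],
                    n: int = 1) -> float:
    """Fraction of gold items whose gold analysis is among the top n ranked analyses.

    With n=1 this is exact match on the top-ranked analysis.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for item_id, target in gold.items()
        if target in results.get(item_id, [])[:n]
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Toy data: item 3 has no parse, item 4 ranks the gold analysis second.
    results = {1: ["a", "b"], 2: ["c"], 3: [], 4: ["x", "d"]}
    gold = {1: "a", 2: "c", 3: "e", 4: "d"}
    print(coverage(results))                   # 0.75
    print(n_best_accuracy(results, gold))      # 0.5
    print(n_best_accuracy(results, gold, 2))   # 0.75
```

The choice of denominator (all items versus parsed items only) changes the reported numbers, so a metric glossary would need to state it explicitly.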
Emily: put these on our outward facing pages
Oe: I will try to get some of these from the tsdb manual.
Ann: Will we enforce this when we review?
Francis: encourage at least
Emily: we should hold ourselves to very high standards
Hans: We need this. Even when some famous people give you their stuff, we can't reproduce the results. We need a trust centre.
ToDo
- add Readmes about data splits to skeletons/treebanks
- link to these from the wiki
- make a wiki page describing "standard" test sets (so we can have a shared baseline); link to version in SVN: ReproducibilityStandards
- add a metric glossary to the wiki (and make it suitable for non-DELPH-IN members)
- establish a trust center