Document Parsing

This is a companion website to the publication:

Rebecca Dridan and Stephan Oepen. (2013). Document parsing: Towards realistic syntactic analysis. In Proceedings of the 13th International Conference on Parsing Technologies (IWPT), Nara, Japan.

Details of software versions and options used are spelt out below.

External pre-processors


Rule-based tokeniser, see ReppTop.


Rule-based sentence segmenter, downloaded from a webpage that has since disappeared. Because the code is GPL'd, we make available the original sources (corresponding to the most recent release, tokenizer 1.0) through SVN:

  svn co -r 16422


Charniak & Johnson reranking parser

Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (pp. 173–180). Ann Arbor, MI, USA.

Downloaded on 26th March, 2012.

Document parsing (default):

cat wsj23.txt | ./tokenizer -L en-u8 -P -S -E '' |\
 perl -ne 'next if /^$/; chomp; $sent++; print "<s $sent> $_ </s>\n";' |\

Berkeley parser

Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Meeting of the Association for Computational Linguistics (pp. 433–440). Sydney, Australia.

Java 1.6 version of berkeleyParser.jar and the accompanying grammar, downloaded on 13th May, 2013.

Document parsing (default):

cat wsj23.txt | ./tokenizer -L en-u8 -P -S -E '' |\
 perl -pe 's/\(/-LRB-/g; s/\)/-RRB-/g;' |\
 java -jar berkeleyParser.jar -gr -tokenize -accurate

Stanford CoreNLP parser

Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (pp. 423–430). Sapporo, Japan.

Version 3.2.0, downloaded on 12th August, 2013.

Document parsing (default):

java -Xmx3g -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat oneline edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz wsj23.txt
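
With `-outputFormat oneline`, each tree is printed on a single line, so the number of sentences the parser found in a document can be read off by counting tree lines. The sketch below illustrates this on two hand-written trees standing in for real parser output; the sample trees and the variable names `sample` and `tree_count` are assumptions for illustration only.

```shell
# Hypothetical sample standing in for real parser output (two one-line trees).
sample=$(printf '%s\n' \
  '(ROOT (S (NP (DT The) (NN market)) (VP (VBD fell)) (. .)))' \
  '(ROOT (S (NP (NNS Prices)) (VP (VBD rose)) (. .)))')

# Count lines that start with an opening bracket, i.e. one per tree.
tree_count=$(printf '%s\n' "$sample" | grep -c '^(')
echo "$tree_count"
```

On real output, the same `grep -c '^('` can be applied to the file the parser output was redirected to.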

WeSearch/DocumentParsing (last edited 2013-11-27 18:48:43 by StephanOepen)
