Parallel Corpora for DELPH-IN

Collections/Samples of available parallel corpora

Europarl Corpus

OPUS: Technical Documentation (plus Europarl and European Constitution)

- URL: http://logos.uio.no/opus/

The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part. Information about status, availability, formats and analyses can be found at http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

This is not redistributable. Translations in other languages exist (including Japanese), and we may be able to get permission for them.

The JRC-Acquis Multilingual Parallel Corpus

Cathedral and the Bazaar

We decided to use this as a corpus; the full description is now up at MatrixMrsCatb.

Universal Declaration of Human Rights

The preamble (a single sentence spanning several paragraphs) is impossible to parse, but apart from that the text isn't too difficult, and it provides some nice universal quantifiers and modals. It is a little short (65 sentences), but there are many other declarations. There are 369 different translations (4 more than last year), most of excellent quality; the multilinguality is the main selling point. It is freely available. There is a little synergy, as it is the de facto standard for testing Unicode fonts, so it should print nicely.

Scroogled

- URL: http://craphound.com/?p=1902

A short story with many free translations. It is a bit short: about 500 sentences.

Some criteria for choosing a corpus

  1. difficulty --- we need to have some hope of parsing it
  2. size --- to build statistical models it has to be a certain size
  3. quality --- the language should be natural (often a problem for translations)
  4. availability --- we need to be able to share the data
  5. multilinguality --- it would be nice to have existing translations
  6. relevance --- the genre should be one you are interested in
  7. synergy --- it is nice to reuse/complement existing markup
  8. diversity --- it can be interesting to experiment with a mixture of corpora, of different text types
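
As a rough illustration of how two of these criteria (size and availability) could be applied to the candidates on this page, here is a minimal Python sketch. The `Candidate` class, the `shortlist` function and the threshold of 100 sentences are all hypothetical and only serve the example. The sentence counts come from the descriptions above; treating Scroogled as shareable is an assumption based on its "free translations", and the size of the Sofie Treebank is left unknown because it is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A candidate corpus, recording only what this page states about it."""
    name: str
    sentences: Optional[int]  # size (criterion 2); None when the page gives no figure
    shareable: bool           # availability (criterion 4)


def shortlist(candidates, min_sentences):
    """Return names of shareable candidates known to have at least min_sentences."""
    return [
        c.name
        for c in candidates
        if c.shareable and c.sentences is not None and c.sentences >= min_sentences
    ]


candidates = [
    Candidate("Universal Declaration of Human Rights", 65, True),
    Candidate("Scroogled", 500, True),            # assumed shareable (free translations)
    Candidate("The Sofie Treebank", None, False),  # stated as not redistributable
]

# The threshold is illustrative only; the page does not fix a minimum size.
print(shortlist(candidates, min_sentences=100))  # prints ['Scroogled']
```

The remaining criteria (quality, relevance, synergy, diversity) are judgment calls and are deliberately left out of the sketch.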

FeforParCorp (last edited 2011-10-08 21:12:09 by localhost)
