Discussion: Parallel Corpora

Moderators: Francis Bond, Antonio Branco; Scribe: Micha Jellinghaus



The Cathedral and the Bazaar

Francis reports on The Cathedral and the Bazaar. This corpus is available in many languages, though some translations are missing a few paragraphs because they were translated from earlier versions. The texts should be converted to tsdb profiles. There are important formatting guidelines for this on the mentioned wiki page.

The English profile is already distributed with the ERG. The Japanese one is not yet available but will be soon. Francis encourages others to make profiles for other languages.

Dan adds that The Cathedral and the Bazaar is one of the hardest texts to parse due to its creative use of language. Initial coverage was 70%, but during treebanking it became apparent that the correct result was often not among the analyses produced. Others should not be discouraged by this, though.

Rebecca asks what domain The Cathedral and the Bazaar fits into. Dan and Francis think it's Essay.

Emily: Does its difficulty (i.e. wide range of phenomena) make it a good corpus? Dan: In that respect, yes. Otherwise no. Francis adds that it is also good for showing off what your grammar can do, and that it is one of the very few truly multilingual corpora there are.

Other Parallel Corpora

Francisco expresses his concern that The Cathedral and the Bazaar is too difficult for emerging grammars like the Portuguese one, and that it would be good to also have an easier corpus. He suggests taking easy Wikipedia articles and translating them into other languages.

Hans replies that one disadvantage of Wikipedia articles is that semantic chains of a certain type are rare in this genre, e.g. there are very few pronouns or NP-to-NP coreferences.

Scott suggests using the Universal Declaration of Human Rights or other international documents, e.g. airport regulations, shipping regulations, soccer rules etc.

Software documentation is also mentioned, though Francis objects that these texts are often not good examples of a language because the genre is just too weird.

Berthold says that for his work on Hausa, he is in contact with Deutsche Welle, who should also be willing to share their parallel data.

Emily suggests children's literature. Children's books are an interesting source of short sentences, though copyright might be a problem. According to Francis, there are also open-source children's books, though they are not always translated.

Francis repeats that there is a desire for more parallel corpora that are freely redistributable. For other criteria for what makes a good corpus, see FeforParCorp.

Suitable Wikipedia Articles

Stephan asks who would be prepared to translate some selected Wikipedia articles into their own language.

Francis: NICT has translated some Wikipedia articles about Kyoto tourism that could be used. There is also an extremely high-quality article on Kendo.

Scott: The articles to work on would have to be brand-new articles.

Hans: There has been a very similar discussion in the EuroMatrix Plus Project. The consensus there was that good articles are those on the geography, history, and culture of a certain language community.

Francis proposes that we could translate completely independently from existing Wikipedia articles in other languages.

Stephan suggests starting articles on deep linguistic processing and translating those.

Francis: We could do that on the DELPH-IN Wiki instead.

A general discussion of Wikipedia etiquette ensues.


Francis summarizes the discussion:

BarcelonaCorpora (last edited 2011-10-08 21:12:15 by localhost)
