Discussion: Parallel Corpora
Moderator: Francis Bond, Antonio Branco; Scribe: Micha Jellinghaus
- Actually start work on the Cathedral and the Bazaar
- Make sure we all agree on the format
- Other parallel texts
- Acquisition from parallel texts
The Cathedral and the Bazaar
Francis reports on The Cathedral and the Bazaar. This corpus is available in many languages, though some translations are missing a few paragraphs because they were translated from earlier versions. The texts should be converted to tsdb profiles. There are important formatting guidelines for this on the mentioned wiki page.
Currently, the English profile is already distributed with the ERG. The Japanese one is not yet available but will be soon. Francis encourages others to make profiles for other languages.
Dan adds that The Cathedral and the Bazaar is one of the hardest texts to parse due to its creative use of language. The initial coverage was 70 %, though during treebanking it became apparent that the correct result was often not included. Others should not be discouraged by this, though.
Rebecca asks which domain The Cathedral and the Bazaar fits in. Dan and Francis think it's Essay.
Emily: Does its difficulty (i.e. wide range of phenomena) make it a good corpus? Dan: In that respect, yes. Otherwise no. Francis adds that it's also good for showing off what your grammar can do and that it's one of the very few truly multilingual corpora there are.
Other Parallel Corpora
Francisco expresses his concern that The Cathedral and the Bazaar is too difficult for emerging grammars like the Portuguese one and that it would be good to also have an easier corpus. He suggests taking easy Wikipedia articles and translating them into other languages.
Hans replies that one disadvantage of Wikipedia articles is that there are no semantic chains of a certain type in this genre, e.g. very few pronouns, or NP-to-NP coreferences.
Scott suggests using the Universal Declaration of Human Rights or other international documents, e.g. airport regulations, shipping regulations, soccer rules etc.
Software documentation is also mentioned, though Francis objects that these texts are often not good examples of a language because the genre is just too weird.
Berthold says that for his work on Hausa, he is in contact with Deutsche Welle who should also be willing to share their parallel data.
Emily suggests children's literature. It is an interesting source of short sentences, though copyright might be a problem. According to Francis, there are also open-source children's books, though they are not always translated.
Francis repeats that there is a desire for more parallel corpora that are freely redistributable. For other criteria for what makes a good corpus, see FeforParCorp.
Suitable Wikipedia Articles
Stephan asks who would be prepared to translate some selected Wikipedia articles into their own language.
Francis: NICT has translated some Wikipedia articles about Kyoto tourism that could be used. There is also an extremely high-quality article on Kendo.
Scott: The articles to work on would have to be brand-new articles.
Hans: There has been a very similar discussion on the EuroMatrix Plus Project. The consensus there was that good articles are those on the geography, history, and culture of a certain language community.
Francis proposes that we could translate completely independently from existing Wikipedia articles in other languages.
Stephan suggests starting articles on deep linguistic processing and translating those.
Francis: We could do that on the DELPH-IN Wiki instead.
A general discussion of Wikipedia etiquette ensues.
Francis summarizes the discussion:
- Go ahead with The Cathedral and the Bazaar,
- tell us about other good corpora,
- put the corresponding information on the wiki.