Cathedral and the Bazaar (cb)

This is an early essay on Open Source by Eric Raymond. It is about 800 sentences, which is small, but there are more essays if we want more data. There are several good translations (not all linked from the main page). Wikipedia also has a number of links to different translations (see on the left of the article) (AF). It is freely available, but I (FCB) checked with the author anyway as a matter of courtesy and he was enthusiastic about us using it. There will be some cleanup work involved in getting the translations aligned (there are several versions of the essay).

It was proposed (by FrancisBond) and accepted (by everyone) at the Kyoto Summit (2008) that we use this as a multilingual shared test suite to enable us to compare parses across different grammars. This page describes the steps we are taking to prepare the translations of the essays as a corpus. As the data becomes available, we will also link it to this page.

The Cathedral and the Bazaar in different languages

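| Language | Grammar | Web | Version | Profile | Item |
|---|---|---|---|---|---|
| Catalan (ca) | | ca | ? | | |
| Chinese (zh) traditional | | zh (big5) | 1.42 | | |
| Chinese (zh) simplified | | zh | ? | | |
| English (en) | ERG | en | 1.57 | catb.en.txt | catb.en.item |
| French (fr) | Grenouille | fr | 1.4 | | |
| German (de) | GG | de | 1.45 | | |
| Greek, Modern (el) | MGRG | el | ? | | |
| Japanese (ja) | Jacy | ja | 1.40 | | |
| Korean (ko) | KRG | ko | 1.32 | | |
| Norwegian (no) | Norsource | NTNU | | | |
| Portuguese (pt) | LXgram | pt | 1.42 | | |
| Spanish (es) | SRG | es | 1.28 | | |
| Swedish (sv) | | sv | 1.51 | | |
| Thai (th) | | th | ? | | |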

At NiCT we also have a 201-sentence aligned subset of en, ko, zh, de, pt, it and fr, which we use for MT testing. Sugita Sho used it to compare various MT systems in 「機械翻訳の精度分析」 ("An Analysis of Machine Translation Precision").

Timeline

This is the timeline agreed on at the Kyoto Summit, moved back a month to fit with the current reality.

Formatting Guidelines

Treebanking this text leads to several interesting issues with text cleansing: italics, embedded quotations, list numbers and so forth. In this section we will discuss what we have done in non-straightforward cases.

Note that we are not treating this as a corpus for testing the robustness of our systems to raw text, but rather as a set of sentences for comparing the semantic representations across languages. Therefore, we will try to make the input text as easy to parse as possible. In our corpus all markup is removed and obvious infelicities (typos, misspellings, bad translations) should be corrected. If and when we want to look at robustness issues, we will choose a new text (possibly the next essay in this series).

For the profile, we will use the itsdb text file format, which can be automatically converted into [incr tsdb()] bitext profiles.

Profile Name

We will call the input file cb-xx.txt, where xx is the ISO 639 language code. The resulting profile should contain cb-xx/item. If we distribute just the item file then please call it cb-xx.item.
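The naming convention can be sketched as a small helper. This is only an illustration: the function name and dictionary keys are made up here, and it assumes the two-letter codes used in the table above.

```python
import re


def profile_names(lang: str) -> dict:
    """Return the expected file names for a two-letter language code.

    Hypothetical helper: only the cb-xx pattern itself comes from this page.
    """
    if not re.fullmatch(r"[a-z]{2}", lang):
        raise ValueError(f"expected a two-letter ISO 639 code, got {lang!r}")
    stem = f"cb-{lang}"
    return {
        "input": f"{stem}.txt",            # plain text input file
        "profile_item": f"{stem}/item",    # item file inside the profile
        "standalone_item": f"{stem}.item", # item file distributed on its own
    }


if __name__ == "__main__":
    print(profile_names("ja"))
```

For Japanese, for example, the helper gives cb-ja.txt, cb-ja/item and cb-ja.item.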

Markup

We have removed all markup (hyperlinks, italics, paragraph boundaries, ...). These can be added in when we have more of a handle on how to deal with them.

Examples:

Structure

Mark headers as headers (with a preceding + in the text profile, as XP in the item file):

Keep list item numbers in the first sentence in the list item.

Quotations

[18200] ``Somebody finds the problem,'' he says, ``and somebody else understands it.
[18300] And I'll go on record as saying that finding it is the bigger challenge.''

Typos

We should correct obvious typos in the profile, and also send them upstream to the maintainer of the essay/translation.

Anything that is not clearly in error should be left as is.

Sentence Numbering

[1010] +The Cathedral and the Bazaar
...
[8690] Finally, Linus Torvalds's comments were helpful and his early endorsement very encouraging.

;; [1010] +The Cathedral and the Bazaar
[10] +伽藍 と バザール ;; en/1010

The commented out English sentence is just there to aid the translator and to help non-native speakers who are interested in your language follow it.

The en/1010 is there so that we can produce texts aligned between different languages. We should be able to align each language with English (e.g. in this case ja/10 with en/1010), and then use English as a pivot to align other languages. Note that this will appear in the item file as a comment.
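As a rough sketch of how these conventions could be read by a script, the following parses one line of the text file. It assumes only what is shown on this page: lines starting with ;; are comments, an item line is [id] followed by the sentence, a leading + marks a header, and an optional trailing ;; xx/id points at the pivot sentence. The names Item and parse_line are invented for this example and are not part of [incr tsdb()].

```python
import re
from typing import NamedTuple, Optional, Tuple


class Item(NamedTuple):
    item_id: int
    text: str
    is_header: bool
    pivot: Optional[Tuple[str, int]]  # e.g. ("en", 1010), or None


ITEM_RE = re.compile(r"^\[(\d+)\]\s*(.*)$")
PIVOT_RE = re.compile(r"\s*;;\s*([a-z]{2})/(\d+)\s*$")


def parse_line(line: str) -> Optional[Item]:
    """Parse one line; return None for blank lines, comments and other text."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None  # whole-line comment, e.g. the commented-out English
    m = ITEM_RE.match(line)
    if m is None:
        return None
    item_id, rest = int(m.group(1)), m.group(2)
    pivot = None
    p = PIVOT_RE.search(rest)
    if p:
        # Trailing alignment comment such as ";; en/1010"
        pivot = (p.group(1), int(p.group(2)))
        rest = rest[: p.start()]
    is_header = rest.startswith("+")
    text = rest[1:] if is_header else rest
    return Item(item_id, text.strip(), is_header, pivot)


if __name__ == "__main__":
    print(parse_line("[10] +伽藍 と バザール ;; en/1010"))
```

The sketch only handles a single pivot reference per line, since that is the only form shown here.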

;; [1290] Chance handed me a perfect way to test my theory, in the form of an open-source project that I could consciously try to run in the bazaar style.
[290] そしてその頃まったくの偶然から、自分の理論を試してみる完璧な機会がやってきた。
[300] 意識的にバザール方式で運営できるようなフリーソフトプロジェクトという形で。

;;[1480] So I went out on the Internet and found one.
;;[1490] Actually, I found three or four.
[500] そこでネットで探してみると、3つか4つ見つかった。

Having many misaligned sentences makes cross-language comparison just that much harder, ...
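The idea of using English as a pivot can be sketched as follows. This is an illustration under assumptions, not an existing tool: pivot_align is a made-up name, and each language is assumed to be given as a mapping from its own item ids to the set of English ids it translates. The Japanese ids below are the ones from the examples on this page; the ids for the second language are invented purely for the demonstration.

```python
from collections import defaultdict


def pivot_align(a_to_en, b_to_en):
    """Align items of language A with items of language B via English ids.

    Each argument maps an item id to the set of English ids it covers.
    Returns {a_id: sorted list of b_ids sharing at least one English id}.
    """
    # Invert the B mapping: English id -> B items that translate it
    en_to_b = defaultdict(set)
    for b_id, en_ids in b_to_en.items():
        for en_id in en_ids:
            en_to_b[en_id].add(b_id)

    aligned = {}
    for a_id, en_ids in a_to_en.items():
        b_ids = set()
        for en_id in en_ids:
            b_ids |= en_to_b.get(en_id, set())
        aligned[a_id] = sorted(b_ids)
    return aligned


if __name__ == "__main__":
    # ja ids taken from the examples above
    ja_to_en = {290: {1290}, 300: {1290}, 500: {1480, 1490}}
    # Hypothetical ids for a second language, for illustration only
    xx_to_en = {310: {1290}, 520: {1480}, 530: {1490}}
    print(pivot_align(ja_to_en, xx_to_en))
```

One-to-many and many-to-one cases come out as lists with more than one id, or as several items sharing the same target, which is where the extra comparison effort arises.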

MatrixMrsCatb (last edited 2011-10-08 21:12:18 by localhost)
