Japanese-English Statistical Machine Translation

Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.

Results (no MERT):

|| Model || Factors || Pair || Test 1 BLEU || Test 2 BLEU || Test 3 BLEU || Average BLEU || Time taken || Comments ||
|| Mecab; No Punctuation || surface-->surface || JE || 19.76 || 19.36 || 20.44 || 19.85 || || JST data ||
|| Mecab; Punctuation || surface-->surface || JE || 21.11 || 21.39 || 21.84 || 21.45 || || ||
|| Mecab Tokenization & Chasen POS || surface-->surface; pos-->pos || JE || 19.14 || 19.56 || 20.14 || 19.58 || || ||
|| Juman No Punctuation || surface-->surface; pos-->pos || JE || 18.98 || 17.55 || 17.66 || 18.71 || || ||
|| Juman & Punctuation, Lemmas too || surface-->surface; pos-->pos; lemma,pos-->lemma; lemma,pos-->surface || JE || 20.72 || 21.44 || 21.72 || 21.29 || || ||
|| Mecab Punctuation, POS, Lemmas, Morph || t2: surface-->surface; t0: lemma-->lemma; g1: lemma-->pos; t1: morph-->pos; g2: pos,lemma-->surface || JE || 19.68 || 19.87 || 19.59 || 19.71 || || ||
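The Average BLEU column appears to be the arithmetic mean of the three test-set scores. The following is a minimal Python sketch that recomputes it from the figures above, assuming that definition. The row labels and the helper names `mean_bleu` and `mismatches` are illustrative and do not come from the project scripts.

```python
# Recompute "Average BLEU" as the mean of the three test scores.
# Assumption: the average is a plain arithmetic mean rounded to two decimals.

# model label -> ((test 1, test 2, test 3), recorded average), copied from the table
ROWS = {
    "Mecab; No Punctuation": ((19.76, 19.36, 20.44), 19.85),
    "Mecab; Punctuation": ((21.11, 21.39, 21.84), 21.45),
    "Mecab Tokenization & Chasen POS": ((19.14, 19.56, 20.14), 19.58),
    "Juman No Punctuation": ((18.98, 17.55, 17.66), 18.71),
    "Juman & Punctuation, Lemmas too": ((20.72, 21.44, 21.72), 21.29),
    "Mecab Punctuation, POS, Lemmas, Morph": ((19.68, 19.87, 19.59), 19.71),
}


def mean_bleu(scores):
    """Arithmetic mean of per-test BLEU scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


def mismatches(rows, tolerance=0.005):
    """Return {model: (recorded, recomputed)} for rows where the two disagree."""
    found = {}
    for model, (scores, recorded) in rows.items():
        recomputed = mean_bleu(scores)
        if abs(recomputed - recorded) > tolerance:
            found[model] = (recorded, recomputed)
    return found


if __name__ == "__main__":
    for model, (recorded, recomputed) in mismatches(ROWS).items():
        print(f"{model}: recorded {recorded}, recomputed {recomputed}")
```

On the figures above, four rows agree with the recomputed mean. The Mecab Tokenization & Chasen POS row (recorded 19.58, recomputed 19.61) and the Juman No Punctuation row (recorded 18.71, recomputed 18.06) do not, so one of the numbers in each of those rows may have been mistyped. It is not clear from this page which one.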

Next:

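|| Model || Pair || BLEU (Test 1, Test 2, Test 3, Average) || Time taken || Comments ||
|| 1 || EJ || || || JST data ||
|| 2 || EJ || || || ||
|| 3 (Mecab) || EJ || 24.67 || || ||
|| 4 (Juman) || EJ || || || ||
|| 3 (reversed) || EJ || || || ||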

Eric's systems:

|| Model || Factors || Corpus || Pair || Scores (pre-MERT, BLEU, 2nd Run) || Comments || Time ||
|| punctuation; lowercase || none || IWSLT06 || JE || --, -- || tokenization: Mecab; Moses baseline script || ||
|| punctuation; lowercase || none || Tanaka || JE || 14.39, 17.69 || tokenization: Mecab; Moses baseline script || ||
|| punctuation; lowercase || surface->surface+pos || Tanaka || JE || 11.39, 19.06, 17.75 || EN factors: tree tagger || < 24 hrs ||
|| punctuation; lowercase || t: surface->surface; g: surface->pos || Tanaka || JE || 11.39, 17.89, -- || EN factors: tree tagger || 11 hrs ||
|| punctuation; lowercase || t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface || Tanaka || JE || 18.67, -- || JA factors: Mecab, morph == pos; EN factors: tree tagger || ||
|| punctuation; lowercase || t: lemma->lemma; g: lemma->pos; t: morph->pos; g: lemma+pos->surface || Tanaka || JE || 9.66, -- || JA factors: Mecab, morph == morph form, type; EN factors: tree tagger, morpha || ||
|| punctuation; lowercase || t: lemma->lemma; g: lemma->pos; t: pos+morph->pos; g: lemma+pos->surface || Tanaka || JE || 6.91, -- || JA factors: Mecab, morph == morph form, type; EN factors: tree tagger, morpha || ||
|| punctuation; lowercase || none || Tanaka || EJ || 26.87, -- || tokenization: Moses baseline script; Mecab || ||
|| punctuation; lowercase || surface->surface+pos || Tanaka || EJ || 26.10, -- || JA factors: Mecab || ||
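The Factors column uses a compact notation for the decoding steps. Going by the usual Moses factored-model terminology (an assumption, since this page does not spell it out), `t:` marks a translation step and `g:` a generation step, with input factors to the left of `->`, output factors to the right, and `+` joining several factors. The following Python sketch reads that notation; `Step` and `parse_steps` are hypothetical helpers, and unprefixed entries such as `surface->surface+pos` are deliberately not handled.

```python
import re
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    kind: str                  # "translation" or "generation"
    inputs: Tuple[str, ...]    # factors consumed by the step
    outputs: Tuple[str, ...]   # factors produced by the step


# Matches e.g. "t: lemma->lemma" or "g: lemma+pos->surface".
# Steps may be separated by ";" or just by whitespace.
_STEP = re.compile(r"\b([tg])\s*:\s*([\w+]+)\s*->\s*([\w+]+)")
_KIND = {"t": "translation", "g": "generation"}


def parse_steps(spec: str) -> List[Step]:
    """Parse a factor specification into an ordered list of steps."""
    return [
        Step(_KIND[kind], tuple(src.split("+")), tuple(dst.split("+")))
        for kind, src, dst in _STEP.findall(spec)
    ]


if __name__ == "__main__":
    spec = "t: lemma->lemma; g: lemma->pos; t: morph->pos g: lemma+pos->surface"
    for step in parse_steps(spec):
        print(step.kind, "+".join(step.inputs), "->", "+".join(step.outputs))
```

Listing the steps this way makes it easier to compare two configurations, for example to see that the last two JE systems differ only in the input of the second translation step (`morph` versus `pos+morph`).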

Models Under Construction

Model 1: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja

Model 2: (Lowercase English; Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & Punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja

Model 3: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from Chasen & no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja

Model 4: (Lowercase English, No Punctuation; Juman Tokenized Japanese, POS from Juman & no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja

Model 5: (Lowercase English, No Punctuation; lemmatized in both languages & no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja

Model 6: (best of 1-5 + NE) run NE (named-entity) tagging on both languages and add the NE tag as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger)
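Moses reads factored corpora as whitespace-separated tokens whose factors are joined with `|`, which is the layout of the example above. The following Python sketch builds such a line from parallel annotation layers. The function name `to_factored` and the input lists are illustrative; a real pipeline would take them from the tokenizer, POS tagger, and NE tagger.

```python
def to_factored(*layers, sep="|"):
    """Join parallel factor layers (surface, POS, NE, ...) into one factored line.

    Each layer is a list with one value per token; all layers must be the same length.
    """
    if len({len(layer) for layer in layers}) != 1:
        raise ValueError("all factor layers must have the same number of tokens")
    for layer in layers:
        for value in layer:
            # A separator or whitespace inside a factor would corrupt the line.
            if not value or sep in value or any(ch.isspace() for ch in value):
                raise ValueError(f"invalid factor value: {value!r}")
    return " ".join(sep.join(factors) for factors in zip(*layers))


if __name__ == "__main__":
    surface = ["Francis", "Bond", "was", "here"]
    pos = ["n", "n", "v", "n"]
    ne = ["name-B", "name-M", "O", "O"]
    print(to_factored(surface, pos, ne))
```

The explicit checks matter in practice: a literal `|` in the corpus has to be escaped or replaced before factors are attached, otherwise the factor count per token becomes inconsistent.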

Model 7: (best of 1-5 + NE variant) do NE on both languages and filter out NEs that don't align in preprocessing, e.g. compare the two taggers' results and maybe keep only the intersection. This could give better results, as the cues must be different in the two languages.
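One possible reading of the filtering step in Model 7 is sketched below in Python, under strong assumptions that are not specified above: each side of a sentence pair carries per-token NE tags of the form `type-position` (as in `name-B`) or `O`, both taggers use the same type labels, and "aligning" is approximated by an NE type occurring on both sides of the pair. `ne_types` and `filter_ne_pair` are hypothetical helpers.

```python
def ne_types(tags):
    """Set of NE types in a tag sequence, e.g. {'name'} for ['name-B', 'name-M', 'O']."""
    return {tag.rsplit("-", 1)[0] for tag in tags if tag != "O"}


def filter_ne_pair(en_tags, ja_tags):
    """Keep only NE tags whose type occurs on both sides; reset the rest to 'O'."""
    shared = ne_types(en_tags) & ne_types(ja_tags)

    def keep(tags):
        return [
            tag if tag != "O" and tag.rsplit("-", 1)[0] in shared else "O"
            for tag in tags
        ]

    return keep(en_tags), keep(ja_tags)


if __name__ == "__main__":
    en = ["name-B", "name-M", "O", "place-B"]
    ja = ["name-B", "O", "O"]
    print(filter_ne_pair(en, ja))
```

Matching on type alone is crude. A stricter variant could also require the same number of entities of each type, or use word alignments, at the cost of discarding more tags.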

Key to file name components:
lc = lowercase
np = no punctuation
p = punctuation
tokc = Chasen Tokenized
tokj = Juman Tokenized
tokm = Mecab Tokenized
dicj = Root Form from Juman
dicm = Root Form from MORPH English Morphological Tagger
pose = POS from Adwait Ratnaparkhi's Maximum Entropy Tagger
posc = POS from Chasen
posj = POS from Juman
posm = POS from MeCab
en = English
ja = Japanese

Sample File Names: sentences.lc.p.pose.en sentences.p.tokm.posm.ja
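Since each file name is just a dot-separated list of the codes above, it can be expanded mechanically. The following is a small Python sketch; the `LEGEND` wording paraphrases the key, and `describe` is a hypothetical helper, not part of the project tooling.

```python
# Codes from the key above, mapped to short descriptions.
LEGEND = {
    "lc": "lowercase",
    "np": "no punctuation",
    "p": "punctuation",
    "tokc": "Chasen tokenized",
    "tokj": "Juman tokenized",
    "tokm": "Mecab tokenized",
    "dicj": "root form from Juman",
    "dicm": "root form from MORPH",
    "pose": "POS from the maximum entropy tagger",
    "posc": "POS from Chasen",
    "posj": "POS from Juman",
    "posm": "POS from MeCab",
    "en": "English",
    "ja": "Japanese",
}


def describe(filename):
    """Expand the processing codes in a corpus file name into descriptions."""
    _stem, *codes = filename.split(".")  # the first component is the base name
    unknown = [code for code in codes if code not in LEGEND]
    if unknown:
        raise ValueError(f"unknown code(s) in {filename!r}: {unknown}")
    return [LEGEND[code] for code in codes]


if __name__ == "__main__":
    for name in ("sentences.lc.p.pose.en", "sentences.p.tokm.posm.ja"):
        print(name, "=>", ", ".join(describe(name)))
```

Running the model file names through a check like this is a quick way to confirm that each name says what the model description says, for instance that the tokenizer and POS codes match the tools named in the description.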

Other Ideas

Data Sources

Other Experiments

NE_Tagging_For_Improving_SMT

MtJaenSmt (last edited 2011-10-08 21:12:11 by localhost)
