Tool for automatic extraction of transfer rules from parallel corpora

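This page presents a tool for extracting transfer rules from parallel corpora. The general idea is to parse both sides of a parallel corpus into MRSs, align the resulting predicates with a statistical phrase aligner, and match the aligned predicates against rule templates.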

The tool is currently used to extract rules for the Jaen MT system. It is located in the directory $LOGONROOT/uio/tm/jaen/extr-rules/.

Preliminaries

The tool relies on statistical phrase aligners (MOSES and Anymalign). The procedure described below uses only Anymalign, which is very easy to install and run.

Install Anymalign in the extr-rules directory:

cd $LOGONROOT/uio/tm/jaen/extr-rules/
wget http://perso.limsi.fr/Individu/alardill/anymalign/latest/anymalign2.5.zip
unzip anymalign2.5.zip

If you need to part-of-speech tag Japanese text, you may need to install MeCab:

sudo apt-get install python-yaml
sudo apt-get install mecab-ipadic-utf8 python-mecab

Use the tool

The tool comes with two tiny parallel corpora to test the system (located in the directory corpora).

The script extr-rule.bash executes all the commands needed to extract Japanese-English semantic transfer rules from a parallel corpus. To extract rules from the smallest test corpus, run the following command:

bash extr-rule.bash corpora/mini.ja corpora/mini.en mini jaen

The different commands invoked by extr-rule.bash are listed below. During development of new templates, only the last command needs to be repeated.

  1. Part-of-speech tag the Japanese corpus:
     python jaen/ja2yy.py corpora/mini.ja > corpora/mini.ja.pos
  2. Divide the corpus into '$n' profiles:
     n=$(python divide-corpus.py 1500 mini corpora/mini.ja.pos corpora/mini.en jaen/)
  3. Batch parse the '$n' Japanese profiles with Jacy:
     i="0"
     while [ $i -lt $n ]
     do
         i=$((i+1))
         mkdir -p jaen/mini-profiles/mini${i}/source/
         cheap -comment-passthrough -mrs -nsolutions=1 -results=1 -packing=15 -timeout=10 -yy -default-les -tsdbdump jaen/mini-profiles/mini${i}/source -inputfile=jaen/mini-profiles/mini${i}/bitext/original ~/logon/dfki/jacy/japanese &> jaen/mini-profiles/mini${i}/source/log
     done
  4. Batch parse the '$n' English profiles with the ERG:
     i="0"
     while [ $i -lt $n ]
     do
         i=$((i+1))
         mkdir -p jaen/mini-profiles/mini${i}/target/
         cheap -repp -tagger -default-les=all -cm -packing -mrs -nsolutions=1 -results=1 -packing=15 -timeout=10 -inputfile=jaen/mini-profiles/mini${i}/bitext/object -tsdbdump jaen/mini-profiles/mini${i}/target ~/logon/lingo/erg/english.grm &> jaen/mini-profiles/mini${i}/target/log
     done
  5. Create a parallel corpus of MRSs from the '$n' parsed profiles. In the process, the script:
     • indicates the valency of verbs with suffixes on the relations
     • marks nominalized verb relations with an 'nmz_' prefix
     • marks proper name predicates with an 'nmd_' prefix
     python profiles2mrsparcorp.py jaen/ mini $n
  6. Use the Anymalign phrase aligner to produce a phrase table from the parallel corpus of MRSs. The program runs until it is stopped with Ctrl-C:
     python anymalign2.5/anymalign.py jaen/mini_mrs_source.txt jaen/mini_mrs_target.txt > jaen/mini-anymalign.mrs
  7. Choose the most probable phrase alignments. This script accepts the output of either Anymalign or MOSES as input:
     python phrtab-thin.py jaen/mini
  8. Read the existing transfer rule files:
     python hand-rules.py $LOGONROOT jaen/ > jaen/hand-rules
  9. Represent the lexicons of the parsing grammar and the generating grammar as tables:
     python lex.py ${LOGONROOT}/lingo/erg/lexicon.tdl > jaen/target-lex.tab
     python lex.py ${LOGONROOT}/dfki/jacy/lexicon.tdl > jaen/source-lex.tab
  10. Read the processed phrase table and match it against the templates. If your source language is not Japanese, you need to change src_prefix at the top of the script to an empty string. The script calls a function in 'jaen/templates.py' that holds the language-specific templates; you may need to modify the templates in this file:
     python thin2mtr.py $LOGONROOT mini jaen/
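
As an illustration of step 2, the following is a minimal sketch of how a sentence-aligned corpus could be cut into profiles of bounded size. It is not the code of divide-corpus.py, whose internals are not shown on this page. The function name divide_corpus is hypothetical, and reading the argument 1500 as the maximum number of sentence pairs per profile is an assumption.

```python
def divide_corpus(source_lines, target_lines, size):
    """Split two aligned lists of sentences into chunks of at most `size` pairs.

    Hypothetical helper: the real divide-corpus.py may work differently.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(source_lines, target_lines))
    # Each chunk corresponds to one profile; the last one may be shorter.
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]


if __name__ == "__main__":
    # Placeholder sentences, not taken from the test corpora.
    ja = ["ja-1", "ja-2", "ja-3", "ja-4", "ja-5"]
    en = ["en-1", "en-2", "en-3", "en-4", "en-5"]
    profiles = divide_corpus(ja, en, 2)
    print(len(profiles))  # plays the role of $n in the commands above
```

Keeping each source sentence paired with its translation inside a chunk means that a profile can be parsed on both sides independently and still be matched up afterwards.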

The resulting transfer rule files 'mini-single.mtr' and 'mini-mwe.mtr' are written to the 'jaen' directory.
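
To make step 7 more concrete, here is a rough sketch of what "choosing the most probable phrase alignments" could look like. It is not the logic of phrtab-thin.py. It assumes a tab-separated phrase table in which the first two fields are the source and target phrases and the fifth field holds space-separated translation probabilities; that layout, the function name best_alignments, and the predicate strings in the example are all assumptions made for illustration.

```python
def best_alignments(lines):
    """Keep, for each source phrase, the target phrase with the highest score.

    Assumed line layout (tab-separated):
    source, target, alignment, lexical weights, probabilities, frequency.
    """
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip lines that do not match the assumed layout
        probs = fields[4].split()
        if not probs:
            continue
        source, target = fields[0], fields[1]
        score = float(probs[0])  # assumption: first probability is the score
        if source not in best or score > best[source][1]:
            best[source] = (target, score)
    return best


if __name__ == "__main__":
    # Made-up phrase table lines for illustration only.
    table = [
        "src_pred_a\ttgt_pred_x\t-\t0.5 0.5\t0.8 0.7\t10",
        "src_pred_a\ttgt_pred_y\t-\t0.1 0.1\t0.2 0.1\t2",
    ]
    for source, (target, score) in best_alignments(table).items():
        print(source, target, score)
```

A real thinning step would probably also apply a frequency or probability threshold, so that rare, noisy alignments do not end up as transfer rules.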

Adapting the tool to another language pair

You will need a parsing grammar, a generating grammar, and a sentence-aligned parallel corpus.
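
Before running the pipeline on a new corpus, it can help to check that the two sides really are sentence-aligned. The sketch below is one possible sanity check and is not part of the tool; the function name check_sentence_aligned is hypothetical, and it assumes one sentence per line on each side.

```python
def check_sentence_aligned(source_lines, target_lines):
    """Return indices of pairs where exactly one side is empty.

    Raises ValueError if the two sides differ in length, since a
    sentence-aligned corpus needs one target line per source line.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError(
            "line counts differ: %d source vs %d target"
            % (len(source_lines), len(target_lines))
        )
    suspicious = []
    for index, (src, tgt) in enumerate(zip(source_lines, target_lines)):
        # A sentence with an empty counterpart usually signals misalignment.
        if bool(src.strip()) != bool(tgt.strip()):
            suspicious.append(index)
    return suspicious


if __name__ == "__main__":
    # Placeholder sentences for illustration only.
    source = ["sentence one", "sentence two", ""]
    target = ["translation one", "", ""]
    print(check_sentence_aligned(source, target))
```

The check only catches gross problems such as differing line counts or empty counterparts; it cannot tell whether a line is actually a translation of its partner.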

MtRuleExtraction (last edited 2012-08-29 08:40:59 by PetterHaugereid)
