Tool for automatic extraction of transfer rules from parallel corpora

This page presents a tool for extracting transfer rules from parallel corpora. The general idea is to

The tool is currently used to extract rules for the Jaen MT system. It is located in the directory $LOGONROOT/uio/tm/jaen/extr-rules/.


You need to install statistical phrase aligners (MOSES and Anymalign). In the procedure described below, we will only use Anymalign, which is very easy to install and execute.

Install Anymalign in the extr-rules directory:

cd $LOGONROOT/uio/tm/jaen/extr-rules/

If you need to part-of-speech tag Japanese text, you may need to install MeCab:

sudo apt-get install python-yaml
sudo apt-get install mecab-ipadic-utf8 python-mecab
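
As an illustration of what part-of-speech tagged Japanese text looks like, the sketch below reads MeCab's default output (one token per line as surface form, a tab, and comma-separated features, with each sentence closed by an EOS line) and keeps the surface form and the first feature, which is the coarse part of speech in IPADIC. The function name parse_mecab_output is hypothetical and is not part of the tool; the format actually produced by the tool's own tagging script may differ.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of the tool: reads MeCab's default output format.

def parse_mecab_output(text):
    """Return a list of sentences, each a list of (surface, pos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":              # MeCab closes each sentence with an EOS line
            sentences.append(current)
            current = []
            continue
        if "\t" not in line:           # skip blank or malformed lines
            continue
        surface, features = line.split("\t", 1)
        pos = features.split(",")[0]   # first feature: coarse part of speech
        current.append((surface, pos))
    if current:                        # tolerate a missing final EOS
        sentences.append(current)
    return sentences


sample = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "EOS\n"
)
tagged = parse_mecab_output(sample)
```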

Use the tool

The tool comes with two tiny parallel corpora to test the system (located in the directory corpora).

The script extr-rule.bash executes all the commands needed to extract Japanese-English semantic transfer rules from a parallel corpus. To extract rules from the smallest test corpus, give the following command:

bash extr-rule.bash corpora/mini.ja corpora/mini.en mini jaen

The different commands invoked by extr-rule.bash are listed below. During development of new templates, only the last command needs to be repeated.

  1. Part-of-speech tag the Japanese corpus:
     python jaen/ corpora/mini.ja > corpora/mini.ja.pos
  2. Divide the corpus into '$n' profiles:
     n=$(python 1500 mini corpora/mini.ja.pos corpora/mini.en jaen/)
  3. Batch parsing '$n' Japanese profiles with Jacy:
     i=0  # assumption: profiles are numbered from 0
     while [ $i -lt $n ]; do
         mkdir -p jaen/mini-profiles/mini${i}/source/
         cheap -comment-passthrough -mrs -nsolutions=1 -results=1 -packing=15 -timeout=10 -yy -default-les -tsdbdump=jaen/mini-profiles/mini${i}/source -inputfile=jaen/mini-profiles/mini${i}/bitext/original ~/logon/dfki/jacy/japanese &> jaen/mini-profiles/mini${i}/source/log
         i=$((i + 1))
     done
  4. Batch parsing '$n' English profiles with the ERG:
     i=0  # assumption: profiles are numbered from 0
     while [ $i -lt $n ]; do
         mkdir -p jaen/mini-profiles/mini${i}/target/
         cheap -repp -tagger -default-les=all -cm -packing -mrs -nsolutions=1 -results=1 -packing=15 -timeout=10 -inputfile=jaen/mini-profiles/mini${i}/bitext/object -tsdbdump jaen/mini-profiles/mini${i}/target ~/logon/lingo/erg/english.grm &> jaen/mini-profiles/mini${i}/target/log
         i=$((i + 1))
     done
  5. Creating a parallel corpus of MRSs from 'n' parsed profiles:
    • indicating valency of verbs with suffixes to the relations
    • marking nominalized verb relations with an 'nmz_' prefix
    • marking proper name predicates with an 'nmd_' prefix
     python jaen/ mini n
  6. Using the Anymalign phrase aligner to produce a phrase table from the parallel corpus of MRSs. The program runs until it is stopped with Ctrl-C.
     python anymalign2.5/ jaen/mini_mrs_source.txt jaen/mini_mrs_target.txt > jaen/mini-anymalign.mrs
  7. Choosing the most probable phrase alignments. This script can take both the output of Anymalign and the output of MOSES as input.
     python jaen/mini
  8. Reading the existing transfer rule files
     python $LOGONROOT jaen/ >  jaen/hand-rules
  9. Representing the lexicons of the parsing grammar and generating grammar as tables
     python ${LOGONROOT}/lingo/erg/lexicon.tdl > jaen/
     python ${LOGONROOT}/dfki/jacy/lexicon.tdl > jaen/
  10. Reading the processed phrase table and matching it with templates. If your source language is not Japanese, you need to change the src_prefix at the top of the file to an empty string. The script calls a function in 'jaen/' with language-specific templates. You may need to modify the templates in this file.
     python $LOGONROOT mini jaen/
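
Step 2 above divides the corpus into '$n' profiles before parsing. A minimal sketch of such a split follows, assuming that the 1500 argument is the maximum number of sentence pairs per profile; the page does not say so explicitly. The function name split_into_profiles is hypothetical and the tool's own script may work differently.

```python
# Hypothetical sketch, not the tool's script: split a parallel corpus into
# chunks ("profiles") of at most `size` sentence pairs.

def split_into_profiles(source_lines, target_lines, size=1500):
    """Return a list of profiles, each a list of (source, target) pairs."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(source_lines, target_lines))
    return [pairs[start:start + size] for start in range(0, len(pairs), size)]


# Dummy corpus of 3200 sentence pairs, for illustration only.
source = ["source sentence %d" % i for i in range(3200)]
target = ["target sentence %d" % i for i in range(3200)]
profiles = split_into_profiles(source, target)
n = len(profiles)  # number of profiles, playing the role of $n in the steps above
```

Parsing in batches of bounded size keeps each parser run short and lets a failed batch be repeated without reparsing the whole corpus.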

The resulting transfer rule files '' and '' are written to the 'jaen' directory.
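
The selection of the most probable phrase alignments (step 7) can be illustrated with a small sketch. It assumes the phrase table has already been read into (source, target, probability) triples and keeps, for each source phrase, the most probable target above a threshold. The selection criterion, the threshold of 0.5, the function name best_alignments, and the predicate names in the example are all assumptions made for illustration; the criterion used by the tool's own script is not described on this page.

```python
# Hypothetical sketch, not the tool's script: keep the most probable target
# for each source phrase, ignoring alignments below a probability threshold.

def best_alignments(rows, min_prob=0.5):
    """rows: iterable of (source, target, probability) triples."""
    best = {}
    for source, target, prob in rows:
        if prob < min_prob:
            continue                          # too unreliable to keep
        if source not in best or prob > best[source][1]:
            best[source] = (target, prob)     # new best candidate for this source
    return best


# Invented example rows; the predicate names are illustrative only.
rows = [
    ("_inu_n_rel", "_dog_n_1_rel", 0.9),
    ("_inu_n_rel", "_hound_n_1_rel", 0.6),
    ("_neko_n_rel", "_cat_n_1_rel", 0.4),
]
selected = best_alignments(rows)
```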

Adapting the tool to another language pair

You will need a parsing grammar, a generating grammar, and a sentence aligned parallel corpus.
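
Before running the tool on a new corpus, it can be useful to verify that the two sides really are sentence aligned, that is, that line k of the source file corresponds to line k of the target file. The sketch below performs two basic checks (equal line counts, no empty lines). The function name check_sentence_aligned is hypothetical and is not part of the tool.

```python
# Hypothetical sanity check, not part of the tool: report basic problems in a
# corpus that is supposed to be sentence aligned line by line.

def check_sentence_aligned(source_lines, target_lines):
    """Return a list of problem descriptions; an empty list means no problem found."""
    problems = []
    if len(source_lines) != len(target_lines):
        problems.append(
            "line counts differ: %d vs %d" % (len(source_lines), len(target_lines))
        )
    # zip stops at the shorter side, so only lines present on both sides are compared
    for number, (src, tgt) in enumerate(zip(source_lines, target_lines), start=1):
        if not src.strip() or not tgt.strip():
            problems.append("empty sentence on line %d" % number)
    return problems


ok = check_sentence_aligned(["a", "b"], ["x", "y"])
bad = check_sentence_aligned(["a", ""], ["x", "y", "z"])
```

These checks only catch gross misalignment; they cannot tell whether the paired sentences are actually translations of each other.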

MtRuleExtraction (last edited 2012-08-29 08:40:59 by PetterHaugereid)
