Overview

Ubertagging is the name we have given to the process of supertagging over ambiguous tokenisation. This process filters the lexical lattice prior to full parsing, according to a statistical model (a trigram semi-HMM; see Dridan (2013) for details). As of the 1214 version of the ERG, mechanisms are in place to use the ubertagging functionality when parsing with PET and the ERG.

Using the ubertagging-enabled binary

Grammar setup

PET will look for ubertagging-specific files in a ut/ subdirectory of the grammar. There are two types of files you will find here:

In addition, various options need to be set. These are handled using the standard PET settings mechanism, with .set files under the pet/ subdir. See below for the actual settings.
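As a rough sketch, a grammar set up for ubertagging might look like the layout below; the model and configuration file names are taken from the example settings later on this page, and the exact file inventory may differ between grammar versions:

erg/
  pet/ut.set                    (ubertagging settings)
  ut/tri-nanc-wsjr5-noaffix.*   (trained ubertagging model)
  ut/generics.cfg
  ut/prefix.cfg
  ut/suffix.cfg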

Runtime configuration

To use ubertagging with PET, give the -ut[=file] option to the parser. The file should be a settings file in the pet/ subdir of the grammar. The required options are:

Other possible options:

The options regarding tag type, caseclass separator and whether or not to map generics are all set at model training time, and are therefore determined by the choice of model.

An example file, ut.set, is shown below. This would be invoked by giving cheap the option -ut=ut, as in the example invocation shown after the settings.

ut-model := tri-nanc-wsjr5-noaffix.
ut-threshold := "0.01".
;; uncomment to turn on full Viterbi filtering
;;ut-viterbi := true.

generics_map := "generics.cfg".
prefixes := "prefix.cfg".
suffixes := "suffix.cfg".

;;for model creation, set from model when tagging
ut-caseclass_separator := ▲.
ut-tagtype := NOAFFIX.
ut-mapgen := true.
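
With those settings in place, a minimal parser invocation might look like the line below. Only the -ut=ut option itself comes from this page; the grammar image name (english.grm) and the input redirection are illustrative and will depend on your local setup:

cheap -ut=ut english.grm < sentences.txt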

Training a model

Code for training the ubertagging models (and also for running Viterbi on lexical profiles) is available at http://svn.delph-in.net/ut/trunk. Training a model requires training data in the form <word>\t<tag>, and can use the same configuration file as the parser, but requires some extra options:

In order to train a model, check out the code and then run:

autoreconf -i
./configure
make

cat redwoods-train.tt |./traintrigram -c ./etc/ut.set redwoods-train

Training data can also be read from files given on the command line, instead of via stdin. An example of the training data used to create the models included with the grammar is shown below, with each token on a separate line, and items separated by blank lines.

well,▲non_capitalized+lower     av_-_s-cp-mc-pr_le:w_comma_plr
i▲capitalized+non_mixed n_-_pr-i_le
am▲non_capitalized+lower        v_prd_am_le
free▲non_capitalized+lower      aj_-_i_le
on▲non_capitalized+lower        p_np_i-tmp_le
monday▲capitalized+lower        n_-_c-dow_le:n_sg_ilr
except for▲non_capitalized+lower        p_np_i_le
ten▲non_capitalized+lower       n_-_pn-hour_le
to▲non_capitalized+lower        n_np_x-to-y-sg_le
twelve▲non_capitalized+lower    n_-_pn-hour_le
in▲non_capitalized+lower        p_np_i-tmp_le
the▲non_capitalized+lower       d_-_the_le
morning.▲non_capitalized+lower  n_-_c-dpt-df-sg_le:w_period_plr

and▲non_capitalized+lower       c_xp_and_le
i▲capitalized+non_mixed n_-_pr-i_le
am▲non_capitalized+lower        v_prd_am_le
free,▲non_capitalized+lower     aj_-_i_le:w_comma_plr
on▲non_capitalized+lower        p_np_i-tmp_le
tuesday▲capitalized+lower       n_-_c-dow_le:n_sg_ilr
in▲non_capitalized+lower        p_np_i-tmp_le
the▲non_capitalized+lower       d_-_the_le
afternoon.▲non_capitalized+lower        n_-_c-dpt-df-sg_le:w_period_plr
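
As a quick, illustrative sanity check on data in this format (not part of the ubertagging distribution), the following shell commands count items and tokens and list the most frequent lexical types, assuming the data is in redwoods-train.tt:

# blank-line separated items
awk 'BEGIN{RS=""} END{print NR, "items"}' redwoods-train.tt
# tokens (one <word>TAB<tag> pair per line)
grep -c $'\t' redwoods-train.tt
# most frequent lexical types
cut -f2 redwoods-train.tt | sort | uniq -c | sort -rn | head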

Code for extracting this training data is also included in the ut SVN repository:

./leafextract -h
Usage: ./leafextract [options] grammar-file profile
Options:
  -h [ --help ]             This usage information.
  -c [ --config ] arg       Configuration file that sets caseclass separator
  -g [ --goldonly ]         Only extract tags from 'gold' trees
  -s [ --single ] arg (=-1) Select a specific item, default (-1): all
  -r [ --result ] arg (=-1) Select a specific result number, default (-1): all.

To extract training data from an annotated profile (single or virtual), run

./leafextract -c etc/ut.set -g $LOGONROOT/lingo/erg/english.tdl $LOGONROOT/lingo/erg/tsdb/gold/redwoods > redwoods-train.tt

For unannotated data (for semi-self-training), leave off the -g option. If your profile has more than one result per item and you only want to extract from the top result, use -r 0. For consistency with the rest of the ubertagging code, leafextract takes the same configuration file as the other programs, but only reads the ut-caseclass_separator option. If this is not given, the code defaults to using ▲.
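
For example, to extract training data from the top-ranked result of each item in an unannotated profile (the profile path and output file name here are purely illustrative):

./leafextract -c etc/ut.set -r 0 $LOGONROOT/lingo/erg/english.tdl $LOGONROOT/lingo/erg/tsdb/home/some-profile > selftrain.tt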
