Reading List

RD: Superficially relevant, but not really comparable. As far as I understand, they use a CFG approximation of the parse forest as a filter when training the supertagger, in order to 'speed up' perceptron training. It doesn't actually reduce training time: they get more accurate tags than the baseline perceptron after the same number of iterations, but each iteration takes 25-70 times longer. They didn't show how accurate the baseline perceptron, without the forest filtering, was when run for the same amount of time. As I understand it, in the limit both methods should reach the same answer. It's basically the same CFG filter they use at the input of their parser. We'd get similar information (exact rather than approximate) by extracting tag sequences from the top 500 parses if we wanted to use it in training, but that will depend on how we train the model, and I didn't think we were planning to look at pointwise tag classifiers?

Relevant stuff from ASR

ASR analogy

Aim: (argmax) P(Labels|Obs) = P(Obs|Labels)P(Labels)/P(Obs)
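As a minimal sketch of the aim above: P(Obs) is the same for every candidate label, so the argmax only needs P(Obs|Labels)P(Labels). The function name `best_label` and the toy probabilities are made up for illustration and are not taken from any model.

```python
import math

def best_label(obs, labels, log_likelihood, log_prior):
    """Return the label maximising log P(obs|label) + log P(label).

    P(obs) is constant across labels, so it drops out of the argmax.
    """
    return max(labels, key=lambda lab: log_likelihood(obs, lab) + log_prior(lab))

# Hypothetical toy numbers for a single token "saw" (illustration only).
PRIOR = {"n": 0.6, "v": 0.4}
LIKELIHOOD = {("saw", "n"): 0.01, ("saw", "v"): 0.05}

choice = best_label(
    "saw",
    PRIOR.keys(),
    lambda o, lab: math.log(LIKELIHOOD[(o, lab)]),
    lambda lab: math.log(PRIOR[lab]),
)
```

With these numbers the products are 0.006 for "n" and 0.02 for "v", so the label with the lower prior still wins on likelihood.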

                      ASR                                               Supertags
Obs                   acoustic slice                                    tokens (words)
Labels                words                                             supertags
Hidden States         subphone x in word y                              ??
Data Characteristics  obs combinations/label: low, #labels: v. high     obs combinations/label: v. high, #labels: moderate

non-Viterbi ASR

A* decoding

I've never looked at this before, but from a quick read it seems like it could provide a structure for the varying-length observations + length penalty idea Erik was talking about. An agenda of partial path hypotheses is scored by

Score(partial path) = P(partial path) + h*(partial path)

where P(partial path) is P(Obs_so_far|Tags_so_far)P(Tags_so_far) and h*(partial path) is an estimate of the score of the rest of the sequence, which could just be proportional to the length of the sequence remaining. (For the sum to make sense, both terms are presumably log scores.)
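The agenda scoring above could be sketched roughly as follows, assuming log scores, a token lattice whose edges may span more than one position, and a heuristic that is simply a constant per remaining position. The lattice layout, `astar_decode`, `per_position_estimate` and the toy numbers are all assumptions made for illustration, not something taken from an existing decoder.

```python
import heapq
import math

def astar_decode(edges, n_positions, log_trans, per_position_estimate=0.0):
    """Best-first search over a token lattice.

    edges: dict mapping start position -> list of (end, tag, log_emission)
    log_trans: function (prev_tag, tag) -> log P(tag | prev_tag)
    per_position_estimate: assumed log score per remaining position (h*)
    """
    # Heap entries: (-(g + h), position, g, tags_so_far); heapq is a min-heap.
    agenda = [(-per_position_estimate * n_positions, 0, 0.0, ())]
    while agenda:
        _, pos, g, tags = heapq.heappop(agenda)
        if pos == n_positions:
            # First complete path popped; best only if h* never underestimates.
            return list(tags), g
        prev = tags[-1] if tags else "<s>"
        for end, tag, log_emit in edges.get(pos, []):
            g_new = g + log_trans(prev, tag) + log_emit
            h = per_position_estimate * (n_positions - end)
            heapq.heappush(agenda, (-(g_new + h), end, g_new, tags + (tag,)))
    return None, float("-inf")

# Toy lattice: positions 0..2 covered either by two tokens (A, B)
# or by one multi-word token (MWE). All probabilities are invented.
edges = {
    0: [(1, "A", math.log(0.5)), (2, "MWE", math.log(0.2))],
    1: [(2, "B", math.log(0.5))],
}
uniform_trans = lambda prev, tag: math.log(0.5)

tags, score = astar_decode(edges, 2, uniform_trans,
                           per_position_estimate=math.log(0.25))
```

In this sketch the heuristic only changes the order in which partial paths are expanded; the returned score is the path's own log probability. With a zero heuristic the two-token path is expanded first, whereas the per-position estimate lets the single-token path be popped straight away.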

MMIE training

Maximum Mutual Information Estimation, which is the same as conditional maximum likelihood: the HMM is trained discriminatively. Does this negate the problem of using varying tokenisations of the observation sequence, since it's no longer trying to estimate the joint probability? (Assuming we'd just sum over what was in the lexical chart.)
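To make the "sum over what was in the lexical chart" idea concrete, here is a rough sketch of the conditional objective for one training item: the joint log score of the gold path minus the log of the summed scores of every path through the lattice, whatever its tokenisation. The lattice format, `log_partition` and the toy numbers are assumptions for illustration; a real trainer would also need gradients of this quantity.

```python
import math
from collections import defaultdict

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(edges, n_positions, log_trans):
    """Log of the summed joint score of every tagged path through the lattice.

    edges: dict mapping start position -> list of (end, tag, log_emission),
    with end > start. Forward pass over (position, last_tag) states.
    """
    alpha = defaultdict(list)  # (position, last_tag) -> incoming log scores
    alpha[(0, "<s>")].append(0.0)
    for pos in range(n_positions):
        states = [(p, t) for (p, t) in alpha if p == pos]
        for _, prev in states:
            a = logsumexp(alpha[(pos, prev)])
            for end, tag, log_emit in edges.get(pos, []):
                alpha[(end, tag)].append(a + log_trans(prev, tag) + log_emit)
    finals = [logsumexp(v) for (p, _), v in alpha.items() if p == n_positions]
    return logsumexp(finals)

# Toy lattice with two tokenisations of the same span (invented numbers).
edges = {
    0: [(1, "A", math.log(0.5)), (2, "MWE", math.log(0.2))],
    1: [(2, "B", math.log(0.5))],
}
uniform_trans = lambda prev, tag: math.log(0.5)

# Joint log score of the gold path, here assumed to be the single MWE token.
gold_joint = uniform_trans("<s>", "MWE") + math.log(0.2)
log_conditional = gold_joint - log_partition(edges, 2, uniform_trans)
```

Because the denominator ranges over all paths in the lattice, paths with different numbers of tokens compete in the same normalisation, which is the property the question above is asking about.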

More complicated

LatticeSupertagging (last edited 2011-10-08 21:12:09 by localhost)
