
LatticeSupertagging


Reading List

RD: Superficially relevant, but not really comparable. As far as I understand, they use a CFG approximation of the parse forest as a filter when training the supertagger to 'speed up' the perceptron training process. It doesn't actually speed it up: they get more accurate tags after the same number of iterations as the baseline perceptron, but each iteration takes 25-70 times longer... They didn't show how accurate the baseline perceptron without the forest filtering was when run for the same amount of time; as I understand it, in the limit, both methods should reach the same answer. It's basically the same CFG filter they use at the input of their parser. We'd get similar information (except exact, not approximate) by extracting tag sequences from the top 500 parses if we wanted to use it in training, but that depends on how we train the model, and I didn't think we were planning to look at pointwise tag classifiers?

Relevant stuff from ASR

ASR analogy

Aim: argmax over Labels of P(Labels|Obs) = P(Obs|Labels)P(Labels)/P(Obs)

| | ASR | Supertags |
| --- | --- | --- |
| Obs | acoustic slice | tokens (words) |
| Labels | words | supertags |
| Hidden States | subphone x in word y | ?? |
| Data Characteristics | obs combinations/label: low; #labels: v. high | obs combinations/label: v. high; #labels: moderate |
  • acoustic slices are smaller than the subphones, and the states self-loop to 'use up' all the input observations
  • The Viterbi algorithm for ASR actually computes the most likely sequence of states, not words. The approximation falls down when there are many pronunciations per word (~ many realisations per tag). The solution is either multi-pass decoding (using n-best list Viterbi or a word lattice plus re-ranking), or A* decoding. (A minimal Viterbi sketch follows this list.)
  • Lexical chart is already an n-best set/word lattice?
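
As a reference point for the Viterbi bullet above, here is a minimal first-order HMM Viterbi sketch. The model tables (`log_init`, `log_trans`, `log_emit`) and all names are illustrative assumptions, not anything from our setup; a real supertagger would restrict each position to the tags licensed by the lexical chart rather than looping over the full tag set.

```python
def viterbi(tokens, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under a first-order HMM.

    Hypothetical tables (not from this page):
      log_init[tag], log_trans[prev][tag], log_emit[tag][token] are log probabilities.
    """
    # best[i][t] = best log score of any tag sequence ending in tag t at position i
    best = [{t: log_init[t] + log_emit[t].get(tokens[0], float("-inf")) for t in tags}]
    back = [{}]
    for i, tok in enumerate(tokens[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log_trans[p][t]) for p in tags),
                key=lambda x: x[1],
            )
            best[i][t] = score + log_emit[t].get(tok, float("-inf"))
            back[i][t] = prev
    # follow backpointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Note this returns the single best state sequence, which is exactly the approximation the bullet above is worried about when one label has many realisations.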

non-Viterbi ASR

A* decoding

I've never looked at this before, but from a quick read it seems like it could be a structure for the varying-length observations + length penalty idea Erik was talking about. Partial path hypotheses sit on an agenda, scored by

Score(partial path) = P(partial path) + h*(partial path)

where P(partial path) is P(Obs_so_far|Tags_so_far)P(Tags_so_far) and h*(partial path) is an estimate of the score of the rest of the sequence, which could just be proportional to the length of the sequence remaining.
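
A rough sketch of what that agenda-based search might look like, assuming the same hypothetical HMM tables as in the Viterbi sketch above, with a crude h* that is just a constant per remaining token (everything here is an assumption for illustration, not a worked-out proposal):

```python
import heapq

def a_star_decode(tokens, tags, log_init, log_trans, log_emit, h_per_token=0.0):
    """Agenda-based A* search over partial tag paths.

    Score(partial path) = log P(obs_so_far|tags_so_far) + log P(tags_so_far) + h*(rest),
    with h*(rest) = h_per_token * (tokens remaining) as a placeholder heuristic.
    Model tables are the same hypothetical dictionaries as in the Viterbi sketch.
    """
    agenda = []  # max-heap via negated scores: (-score, inside_logprob, path)
    for t in tags:
        inside = log_init[t] + log_emit[t].get(tokens[0], float("-inf"))
        heapq.heappush(agenda, (-(inside + h_per_token * (len(tokens) - 1)), inside, (t,)))
    while agenda:
        neg_score, inside, path = heapq.heappop(agenda)
        if len(path) == len(tokens):
            # first complete path popped is optimal provided h* never underestimates
            # the remaining log score (h=0 is safe, since log probs are <= 0)
            return list(path)
        i = len(path)  # index of the next token to tag
        for t in tags:
            new_inside = inside + log_trans[path[-1]][t] + log_emit[t].get(tokens[i], float("-inf"))
            h = h_per_token * (len(tokens) - 1 - i)
            heapq.heappush(agenda, (-(new_inside + h), new_inside, path + (t,)))
    return None
```

The length-penalty idea would presumably live in how h* (and/or the inside score) is defined per hypothesis, since competing partial paths can cover different amounts of the input.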

MMIE training

Maximum Mutual Information Estimation, which is the same as conditional maximum likelihood: discriminatively train the HMM. Does this negate the problem of using varying tokenisations of the observation sequence, since it's no longer trying to estimate the joint probability? (Assuming we'd just sum over what was in the lexical chart.)
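
To make the objective concrete, a toy sketch of the per-sentence conditional log-likelihood, where the normaliser sums only over tag sequences licensed by the lexical chart (passed here as `candidate_tags_per_token`). The brute-force enumeration and all helper names are purely illustrative; a real implementation would use the forward algorithm over the chart instead.

```python
import math
from itertools import product

def sequence_logprob(tokens, tag_seq, log_init, log_trans, log_emit):
    """log P(obs, tags) for one candidate tag sequence, same hypothetical HMM tables as above."""
    lp = log_init[tag_seq[0]] + log_emit[tag_seq[0]].get(tokens[0], float("-inf"))
    for i in range(1, len(tokens)):
        lp += log_trans[tag_seq[i - 1]][tag_seq[i]] + log_emit[tag_seq[i]].get(tokens[i], float("-inf"))
    return lp

def conditional_loglik(tokens, gold_tags, candidate_tags_per_token,
                       log_init, log_trans, log_emit):
    """log P(gold_tags | tokens), normalising only over chart-licensed tag sequences.

    Assumes the gold sequence is among the candidates; maximising this over the
    training data is the conditional ML / MMIE objective discussed above.
    """
    gold = sequence_logprob(tokens, gold_tags, log_init, log_trans, log_emit)
    scores = [
        sequence_logprob(tokens, seq, log_init, log_trans, log_emit)
        for seq in product(*candidate_tags_per_token)
    ]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp normaliser
    return gold - log_z
```

The point of the question above is that because the objective only ever compares tag sequences against each other for the *same* observation tokenisation, the awkwardness of modelling P(Obs) under varying tokenisations may simply drop out.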

More complicated
