Overview

SMAF is the proposed XML input format for the DELPH-IN deep processors. A SMAF document describes a segment of data (generally a sentence) packaged for input to a deep processor or parser such as the LKB or PET. The format amalgamates ideas from, and is intended to subsume, MAF, the Pet Input Chart used in the HoG system (HeartofgoldTop), the SPPP mode implemented in the LKB (LkbSppp), and SAF; it also incorporates RMRS XML.

SMAF follows the principles of standoff annotation. This means that:

Each SMAF document describes a segment of the primary data for input to a deep parser (such a segment typically corresponds to the notion of a sentence). The following properties are global to each standoff document:

Properties of each edge:

SMAF/LKB

See SmafLkb.

SMAF/PET

See SmafPet.

SMAF configuration

On receiving a SMAF document as input, a deep parser maps the SMAF object into its internal data structures. The format has been designed so that this mapping is reasonably straightforward for a specific combination of deep parser implementation and grammar, while remaining general enough to abstract over the specifics of individual software components and grammars. Although many SMAF properties map fairly directly into the internal data structures of individual processors, some configuration is required to make the mapping work smoothly.

The lattice structure of the edges (source, target) and inter-edge dependencies (deps) can be mapped straightforwardly into internal data structures of a chart parser. The cfrom/cto properties of edges may be copied as is.
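As a rough illustration of the mapping described above, the following Python sketch indexes edges by source vertex and resolves deps identifiers to edge objects. The Edge class and the build_chart function are hypothetical names introduced here for illustration; they are not part of the LKB, PET, or any SMAF implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Edge:
    # Hypothetical in-memory view of one SMAF edge (not an LKB/PET structure).
    id: str
    type: str
    source: str
    target: str
    cfrom: Optional[int] = None  # copied as is from the SMAF edge, if present
    cto: Optional[int] = None
    deps: List[str] = field(default_factory=list)


def build_chart(edges):
    """Index edges by source vertex and resolve each edge's deps to Edge objects."""
    by_id = {e.id: e for e in edges}
    outgoing = defaultdict(list)
    for e in edges:
        outgoing[e.source].append(e)
    resolved = {e.id: [by_id[d] for d in e.deps] for e in edges}
    return outgoing, resolved


edges = [
    Edge("t1", "token", "v0", "v1", 0, 3),
    Edge("t2", "token", "v1", "v2", 4, 7),
    Edge("p1", "pos", "v0", "v1", deps=["t1"]),
]
outgoing, resolved = build_chart(edges)
```

Here outgoing["v0"] holds both the token edge t1 and the pos edge p1, since they span the same pair of vertices, and resolved["p1"] points back at t1.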

However, configuration is necessary to correctly map content (slots, fs's, rmrs's) into internal data structures. The edge type may be used to configure and constrain this mapping (e.g. the content expected for a token edge will differ from that for a pos edge, which in turn will differ from that for a named-entity edge, etc.).
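One way to picture edge-type-driven content mapping is a dispatch table keyed on edge type. The sketch below is an assumption made for illustration only: the handler names, the dictionary shapes, and the tokenStr and tag keys are hypothetical and do not describe how the LKB or PET actually store edge content.

```python
def map_token(content):
    # Assumption: a token edge carries its surface string as content.
    return {"tokenStr": content}


def map_pos(content):
    # Assumption: a pos edge carries a slot named 'tag'.
    return {"tag": content["tag"]}


# Hypothetical dispatch table: one handler per known edge type.
HANDLERS = {"token": map_token, "pos": map_pos}


def map_content(edge_type, content):
    """Map edge content to an internal record, rejecting unconfigured edge types."""
    try:
        handler = HANDLERS[edge_type]
    except KeyError:
        raise ValueError(f"no mapping configured for edge type {edge_type!r}")
    return handler(content)


token_record = map_content("token", "dog")
pos_record = map_content("pos", {"tag": "NN"})
```

Rejecting unknown edge types, rather than silently ignoring them, is a design choice of this sketch and not a documented SMAF requirement.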

Sample SMAF configuration settings:

;; PROCESSOR settings

token.[] -> edgeType='tok' tokenStr=content
wordForm.[] -> edgeType='morph' stem=content.stem partialTree=content.partial-tree
pos.[] -> edgeType='morph'
oscar.[] -> edgeType='tok+morph' tokenStr=content.surface

;; GRAMMAR settings

;; "slot" definitions

define gMap.type ()

;; syn(sem) type

oscar.[type='compound'] -> gMap.type='n_proper_nale'
oscar.[type='substance'] -> gMap.type='n_proper_nale'
oscar.[type='element'] -> gMap.type='n_proper_nale'
oscar.[type='namender'] -> gMap.type='n_proper_nale'
oscar.[type='adjective'] -> gMap.type='adj_intrans_nale'

;; semantics (REL + CARG)

oscar.[] -> rmrs=rmrs

SAMPLE SMAF XML

See also: SmafSample

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE smaf SYSTEM 'smaf.dtd'>
<smaf document='/some/path/x.txt'>
 <text>The dog barks.</text>
 <lattice init='v0' final='v3' cfrom='0' cto='14'>
  <edge type='token' id='t1' cfrom='0' cto='3' source='v0' target='v1'>The</edge>
  <edge type='token' id='t2' cfrom='4' cto='7' source='v1' target='v2'>dog</edge>
  <edge type='token' id='t3' cfrom='8' cto='14' source='v2' target='v3'>barks.</edge>
  <edge type='pos' id='p1' source='v0' target='v1' deps='t1'><slot name='tag'>DET</slot></edge>
  <edge type='pos' id='p2' source='v1' target='v2' deps='t2'><slot name='tag'>NN</slot></edge>
  <edge type='pos' id='p3' source='v2' target='v3' deps='t3'><slot name='tag'>VBZ</slot></edge>
 </lattice>
</smaf>
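To make the sample concrete, the following Python sketch reads an abbreviated copy of it with the standard library's xml.etree.ElementTree and checks each token edge against the text. It assumes that cfrom and cto are zero-based character offsets into the text element with an exclusive end, which is consistent with the sample but not stated on this page, and it treats deps as a single edge id. The XML declaration and DOCTYPE are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample above (XML declaration and DOCTYPE omitted).
SAMPLE = """<smaf document='/some/path/x.txt'>
 <text>The dog barks.</text>
 <lattice init='v0' final='v3' cfrom='0' cto='14'>
  <edge type='token' id='t1' cfrom='0' cto='3' source='v0' target='v1'>The</edge>
  <edge type='token' id='t2' cfrom='4' cto='7' source='v1' target='v2'>dog</edge>
  <edge type='token' id='t3' cfrom='8' cto='14' source='v2' target='v3'>barks.</edge>
  <edge type='pos' id='p1' source='v0' target='v1' deps='t1'><slot name='tag'>DET</slot></edge>
 </lattice>
</smaf>"""

root = ET.fromstring(SAMPLE)
text = root.findtext("text")

spans = {}  # token edge id -> substring of <text> selected by cfrom/cto
tags = {}   # token edge id -> tag from the pos edge that depends on it
for edge in root.iter("edge"):
    if edge.get("type") == "token":
        cfrom, cto = int(edge.get("cfrom")), int(edge.get("cto"))
        spans[edge.get("id")] = text[cfrom:cto]
        # Under the offset assumption, the span should equal the edge content.
        assert text[cfrom:cto] == edge.text
    elif edge.get("type") == "pos":
        # Assumption: deps holds a single edge id in this sample.
        tags[edge.get("deps")] = edge.findtext("slot[@name='tag']")
```

After running, spans maps each token id to the substring its offsets select, and tags links the pos annotation back to the token it depends on.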

SMAF DTD

See SmafDtd.

SmafTop (last edited 2011-10-08 21:12:09 by localhost)
