TdlRfc - Deep Linguistic Processing with HPSG (DELPH-IN)

Type Description Language and other aspects of DELPH-IN Joint Reference Formalism

Contents

TDL File Syntax
TDL File Interpretation and Conventions
Notes for implementation
1. DocStrings
2. Comments
Open Questions
Discussions

TDL File Syntax

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

# File Contents

TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile  := ( LexRuleDef | LetterSet | WildCard | Spacing )* EOF

# Types and Lexical Rules

TypeDef      := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName     := Identifier Spacing

LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId    := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: For some processors, like the LKB, there may be "break characters"
#       defined which determine what is allowed within an identifier.

Identifier   := /[^\s.:<=&,#[\]$()>!^\/]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#       (1) """DocStrings""" may precede any top-level Term or the final DOT
#       (2) TypeDef and LexRuleDef require at least one TypeName
#       (3) TypeAddendum may use a DocString in place of a Conjunction

TypeDefBody  := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#       (1) TypedConj has an obligatory TypeName term
#       (2) FeatureConj excludes type terms (including strings, etc.)
#       (3) DocConj is a regular conjunction with optional DocStrings
#
#       Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
#       for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj      := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction  := Term ( AND Term )*
Term         := TypeTerm | FeatureTerm | Coreference
TypeTerm     := TypeName
              | DQString
              | Regex
FeatureTerm  := Avm
              | DiffList
              | ConsList

DQString     := /"([^"\\]|\\.)*"/ Spacing
Regex        := "^" /([^$\\]|\\.)*/ "$"

Avm          := AVMOPEN AttrVals? AVMCLOSE
AttrVals     := AttrVal ( COMMA AttrVal )*
AttrVal      := AttrPath SPACE Conjunction
AttrPath     := Attribute ( DOT Attribute )*
Attribute    := Identifier Spacing

DiffList     := DLOPEN Conjunctions? DLCLOSE
ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference  := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef  := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar  := /\?[^ ]/
Characters   := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix        := AffixClass AffixPattern+ Spacing
AffixClass   := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch   := NullChar | CharList
AffixSub     := CharList
NullChar     := "*"
CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar    := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines

Spacing      := SPACE? Comment*
SPACE        := /\s+/
Comment      := ( LineComment | BlockComment ) SPACE?
LineComment  := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

DEFOP        := ":=" Spacing
ADDOP        := ":+" Spacing
DOT          := "." Spacing
AND          := "&" Spacing
COMMA        := "," Spacing
AVMOPEN      := "[" Spacing
AVMCLOSE     := "]" Spacing
DLOPEN       := "<!" Spacing
DLCLOSE      := "!>" Spacing
CLOPEN       := "<" Spacing
CLCLOSE      := ">" Spacing
ELLIPSIS     := "..." Spacing
EOF          := ""  # end-of-file
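
As an illustration of how the string-like terminals behave, the following is a minimal Python sketch that transcribes the DocString and DQString patterns from the grammar above. The function name `match_docstring` is hypothetical, and the use of `re.DOTALL` is an assumption made so that an escaped newline is also accepted inside a docstring.

```python
import re

# Transcribed from the DocString and DQString productions (without Spacing).
# re.DOTALL is an assumption: it lets "\\." also cover a backslash-newline.
DOCSTRING = re.compile(r'"""([^"\\]|\\.|"(?!")|""(?!"))*"""', re.DOTALL)
DQSTRING = re.compile(r'"([^"\\]|\\.)*"')


def match_docstring(text, pos=0):
    """Return the docstring starting at pos, or None if there is none."""
    m = DOCSTRING.match(text, pos)
    return m.group(0) if m else None


doc = '"""Type for underspecified or "neutral" information structure.""".'
print(match_docstring(doc))
# DQString alone would read the first two quotes as an empty string:
print(DQSTRING.match('"""doc"""').group(0))
```

Because DQString matches the first two quote characters of a docstring as an empty string, a hand-written lexer would presumably need to try DocString before DQString.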

TDL File Interpretation and Conventions

Layout of a type definition

Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:

identifier := (definition body) .

The definition body is just a conjunction of terms, maybe with documentation strings, and there is much valid variation in how those terms are arranged. Nevertheless, there are conventional locations for these terms depending on what kind of term they are. For instance, the supertypes are generally listed first, followed by an AVM:

head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].

If a documentation string is specified, the conventional place is before the AVM:

n_-_ad-pl_le := norm_np_adv_lexent &
"""
<description>N, can modify, locative (place)
<ex>B lives overseas.
<nex>
<todo>
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].

Or if there is no AVM, before the final dot:

info-str := icons
  """Type for underspecified or "neutral" information structure.""".

Types versus instances

Specifying the text encoding

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:

; -*- coding: utf-8 -*-

In some TDL files, attributes specific to the Emacs text editor may be included:

;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
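
A processor could read such a declaration with a small pattern match on the first line. The sketch below is one possible approach in Python, not a description of how any existing DELPH-IN processor does it; the function name `declared_encoding`, the regular expression, and the UTF-8 fallback are all assumptions.

```python
import re

# Assumed pattern, modelled on the "coding: name" convention shown above;
# matching is case-insensitive so that the Emacs-style "Coding:" also works.
CODING = re.compile(r'coding[:=]\s*([-\w.]+)', re.IGNORECASE)


def declared_encoding(first_line, default="utf-8"):
    """Return the encoding named in a first-line comment, else the default."""
    if not first_line.lstrip().startswith(";"):
        return default  # not a comment line, so nothing is declared
    m = CODING.search(first_line)
    return m.group(1).lower() if m else default


print(declared_encoding("; -*- coding: utf-8 -*-"))
print(declared_encoding(";;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-"))
```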

Feature interpretation of lists

The < ... > and <! ... !> shorthand for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs. The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature FIRST while the rest of the list (the tail) is defined recursively under the feature REST (e.g., REST.REST.FIRST is the third element). The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.

entity                   example        LKB config          ACE config
cons-list type           `*cons*`       (not configurable)  `cons-type`
open list type           `*list*`       `list-type`         `list-type`
closed list type         `*null*`       `empty-list-type`   `null-type`
diff-list type           `*diff-list*`  `diff-list-type`    `diff-list-type`
list head feature        `FIRST`        `list-head`         (not configurable)
list tail feature        `REST`         `list-tail`         (not configurable)
diff-list list feature   `LIST`         `diff-list-list`    (not configurable)
diff-list last feature   `LAST`         `diff-list-last`    (not configurable)

For the examples below, I use the values defined in the above table, which are taken from the ERG.

Cons Lists

Regular cons lists may be open (extendable) or closed (fixed-length). The type of an open list as interpreted by, e.g., < ... >, is *list* (rather, the defined open list type), but in hand-written TDL a subtype of *list* is often used, such as *cons*.

; an empty list is terminated (always empty)
[ ATTR < > ]             =>  [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]           =>  [ ATTR *list* & [ FIRST a,
                                               REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]        =>  [ ATTR *list* & [ FIRST a,
                                               REST [ FIRST b,
                                                      REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]         =>  [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]      =>  [ ATTR *list* & [ FIRST a,
                                               REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ]  =>  [ ATTR *list* & [ FIRST a,
                                               REST #coref ] ]
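
The expansions above follow a simple right-to-left construction, which can be sketched in Python as follows. This is only an illustration under assumptions: the function `expand_cons_list`, the `OPEN` marker, and the choice to render the result as a single-line TDL string are hypothetical, and the type names default to the ERG values used in the examples.

```python
OPEN = object()  # hypothetical marker standing for a trailing "..."


def expand_cons_list(items, tail=None, list_type="*list*", null_type="*null*"):
    """Render < ... > shorthand as nested FIRST/REST attribute-value pairs.

    tail is None for a closed list, OPEN for "< a, ... >", or any other
    term (e.g. "#coref") for the dotted form "< a . #coref >".
    """
    if tail is None:
        end = null_type      # closed list: the last REST is terminated
    elif tail is OPEN:
        end = list_type      # open list: the last REST stays extendable
    else:
        end = tail           # dotted form: the last REST is the given term
    if not items:
        return end
    body = end
    for item in reversed(items):  # build from the innermost REST outwards
        body = f"[ FIRST {item}, REST {body} ]"
    return f"{list_type} & {body}"


print(expand_cons_list(["a", "b"]))
print(expand_cons_list(["a"], tail=OPEN))
```

The `list_type` and `null_type` parameters stand in for the grammar-configurable type names from the table above.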

Diff Lists

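Diff lists are regular lists under a LIST attribute, with a LAST attribute that is coreferenced with the open tail of that list (the REST following the last item, or LIST itself when the diff list is empty). Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).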

[ ATTR <! !> ]           =>  [ ATTR *diff-list* & [ LIST #coref,
                                                    LAST #coref ] ]

[ ATTR <! a !> ]         =>  [ ATTR *diff-list* & [ LIST *list* & [ FIRST a,
                                                                    REST #coref ],
                                                    LAST #coref ] ]
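
The same construction extends to diff lists: the innermost REST and the LAST attribute share one coreference. The Python sketch below is a hypothetical illustration (`expand_diff_list` is not part of any processor); it assumes the caller supplies a coreference tag that is not otherwise used in the definition, and the two-item case is extrapolated from the cons-list examples rather than taken from the examples above.

```python
def expand_diff_list(items, tag="#coref",
                     list_type="*list*", diff_list_type="*diff-list*"):
    """Render <! ... !> shorthand as LIST/LAST attribute-value pairs.

    tag is the coreference shared by the open tail of LIST and by LAST;
    it is assumed to be unique within the enclosing definition.
    """
    body = tag
    for item in reversed(items):  # the innermost REST is the shared tag
        body = f"[ FIRST {item}, REST {body} ]"
    if items:
        body = f"{list_type} & {body}"
    return f"{diff_list_type} & [ LIST {body}, LAST {tag} ]"


print(expand_diff_list([]))
print(expand_diff_list(["a"]))
```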

Type documentation

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:

n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].

Case sensitivity

Case Sensitive

Things inside quotes. (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing, i.e., as predicate symbols, but not as CARGs.)

Case Insensitive

Everything in TDL not inside of quotes.
Lexicon look-up.
- Proper names?
- Acronyms?
These can be approached with token mapping (preserve the information, and then downcase anyway).

Unknown

Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)

Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel harmony languages (linguistic representations, not orthography).

Notes for implementation

DocStrings

Multiple docstrings may be present on a single definition, but only the first one encountered is considered its primary docstring; implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should either be concatenated, separated by a newline, to the previous docstring(s) or be appended to a list of docstrings associated with the type.
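
One way to follow this guidance is to keep every docstring in order and treat the first as primary. The Python sketch below takes the list-based option; `TypeEntry`, `add_definition`, and `add_addendum` are hypothetical names, and retaining the non-primary docstrings is a choice the text leaves to implementers.

```python
from dataclasses import dataclass, field


@dataclass
class TypeEntry:
    name: str
    docstrings: list = field(default_factory=list)

    @property
    def primary_docstring(self):
        # Only the first docstring encountered counts as the primary one.
        return self.docstrings[0] if self.docstrings else None


def add_definition(registry, name, docstrings):
    """Record a definition (:=) with its docstrings in source order."""
    registry[name] = TypeEntry(name, list(docstrings))
    return registry[name]


def add_addendum(registry, name, docstrings):
    """Append the docstrings of a type addendum (:+) to an existing type."""
    registry[name].docstrings.extend(docstrings)


registry = {}
add_definition(registry, "info-str", ["Primary text.", "Secondary text."])
add_addendum(registry, "info-str", ["Added by an addendum."])
print(registry["info-str"].primary_docstring)
print("\n".join(registry["info-str"].docstrings))  # the concatenation option
```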

Comments

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.
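
A rough way to see where comments are recognized is to scan for strings and comments together, so that a `;` or `#|` inside a quoted string is left alone. The Python sketch below is a simplification under stated assumptions: `strip_comments` is a hypothetical helper, the patterns are transcribed from the DocString, DQString, LineComment, and BlockComment productions, and regular expressions, letter sets, and affix patterns are not treated specially.

```python
import re

# Docstrings are tried before plain strings; comments come last.
TOKEN = re.compile(
    r'(?P<doc>"""(?:[^"\\]|\\.|"(?!")|""(?!"))*""")'
    r'|(?P<str>"(?:[^"\\]|\\.)*")'
    r'|(?P<line>;[^\n]*)'
    r'|(?P<block>#\|(?:[^|\\]|\\.|\|(?!#))*\|#)',
    re.DOTALL,
)


def strip_comments(text):
    """Replace each comment with one space; strings are kept verbatim."""
    def keep_or_drop(m):
        return m.group(0) if m.lastgroup in ("doc", "str") else " "
    return TOKEN.sub(keep_or_drop, text)


print(strip_comments("[ SYNSEM #| comment |# . #| comment |# LOCAL x ]"))
print(strip_comments('a := b & "x ; y". ; trailing'))
```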

Open Questions

1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see this thread on the 'developers' mailing list)

2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?
