Differences between revisions 18 and 20 (spanning 2 versions)
Revision 18 as of 2018-10-16 17:44:46
Size: 16552
Comment: Updated Identifier, DQString, and Regex production rules
Revision 20 as of 2018-10-18 19:37:58
Size: 16675
Comment: Change to character whitelist for the Identifier pattern
Line 68, removed in revision 20:

# Note: For some processors, like the LKB, there may be "break characters"
# defined which determine what is allowed within an identifier.

Identifier := /[^\s.:;<=&,#[\]$()>!^\/|]+/

Line 68, added in revision 20:

# Note: Some punctuation characters are allowed in TDL identifiers as they
# do not conflict with other parts of the TDL syntax.
# Note: The _ character is technically redundant as it's included in \w but
# it is left in the pattern to be explicit about what's allowed.

Identifier := /[\w_+*?-]+/

Type Description Language and other aspects of DELPH-IN Joint Reference Formalism

TDL File Syntax

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

# File Contents
#
# Note: The LKB does not parse environments (:begin ... :end), nor does it
#       support :include statements, so the following is only applicable for
#       PET, ACE, and perhaps agree.

TdlFile      := ( Environment | Statement | Spacing )* EOF
Environment  := BEGIN TYPE DOT TypeEnv END TYPE DOT
              | BEGIN INSTANCE Status? DOT InstanceEnv END INSTANCE DOT
TypeEnv      := ( Environment | Statement
                | TypeDef | TypeAddendum | Spacing )*
InstanceEnv  := ( Environment | Statement
                | InstanceDef | LetterSet | WildCard | Spacing )*
Status       := STATUS ( "generic-lex-entry"
                       | "lex-entry"
                       | "lexical-filtering-rule"
                       | "lex-rule"
                       | "post-generation-mapping-rule"
                       | "rule"
                       | "token-mapping-rule" ) Spacing

# Note: The LKB has several Lisp functions which open files in specified
#       environments, so the following are parsing targets for those
#       functions.

TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile  := ( InstanceDef | LetterSet | WildCard | Spacing )* EOF

# Note: Krieger & Schäfer 1994 define a large number of statements, but
#       DELPH-IN grammars appear to only use :include.
# Note: :include's string argument is a path relative to the current file's
#       directory. If the filename extension is not given, the default ".tdl"
#       extension is used. The file is opened in the same environment as the
#       :include statement (e.g., :include in a type environment opens the
#       file and parses it as TypeEnv)

Statement    := Include
Include      := INCLUDE Filename DOT
Filename     := DQString

# Types and Instances
#
# Note: Instances may be syntactically identical to type definitions, but they
#       do not affect the type hierarchy. They may also be lexical rule
#       definitions that include an affixing pattern to a definition.

TypeDef      := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName     := Identifier Spacing

InstanceDef  := TypeDef | LexRuleDef
LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId    := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: Some punctuation characters are allowed in TDL identifiers as they
#       do not conflict with other parts of the TDL syntax.
# Note: The _ character is technically redundant as it's included in \w but
#       it is left in the pattern to be explicit about what's allowed.

Identifier   := /[\w_+*?-]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#       (1) """DocStrings""" may precede any top-level Term or the final DOT
#       (2) TypeDef and LexRuleDef require at least one TypeName
#       (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody  := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#       (1) TypedConj has an obligatory TypeName term
#       (2) FeatureConj excludes type terms (including strings, etc.)
#       (3) DocConj is a regular conjunction with optional DocStrings
#
#       Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
#       for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj      := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction  := Term ( AND Term )*
Term         := TypeTerm | FeatureTerm | Coreference
TypeTerm     := TypeName
              | DQString
              | Regex
FeatureTerm  := Avm
              | DiffList
              | ConsList

DQString     := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
Regex        := "^" /([^$\\]|\\.)*/ "$" Spacing

Avm          := AVMOPEN AttrVals? AVMCLOSE
AttrVals     := AttrVal ( COMMA AttrVal )*
AttrVal      := AttrPath SPACE Conjunction
AttrPath     := Attribute ( DOT Attribute )*
Attribute    := Identifier Spacing

DiffList     := DLOPEN Conjunctions? DLCLOSE
ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference  := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef  := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar  := /\?[^ ]/
Characters   := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix        := AffixClass AffixPattern+ Spacing
AffixClass   := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch   := NullChar | CharList
AffixSub     := CharList
NullChar     := "*"
CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar    := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines. Also, while block
#       comments in Lisp may be nested (`#| outer #| inner |# outer |#`),
#       support for nested comments in TDL is mixed (ACE supports it, the
#       LKB does not), so this definition does not nest.

Spacing      := SPACE? Comment*
SPACE        := /\s+/
Comment      := ( LineComment | BlockComment ) SPACE?
LineComment  := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

BEGIN        := ":begin" Spacing
TYPE         := ":type" Spacing
INSTANCE     := ":instance" Spacing
STATUS       := ":status" Spacing
INCLUDE      := ":include" Spacing
END          := ":end" Spacing
DEFOP        := ":=" Spacing
ADDOP        := ":+" Spacing
DOT          := "." Spacing
AND          := "&" Spacing
COMMA        := "," Spacing
AVMOPEN      := "[" Spacing
AVMCLOSE     := "]" Spacing
DLOPEN       := "<!" Spacing
DLCLOSE      := "!>" Spacing
CLOPEN       := "<" Spacing
CLCLOSE      := ">" Spacing
ELLIPSIS     := "..." Spacing
EOF          := ""  # end-of-file
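
The regular-expression terminals in the grammar can be tried out directly. The following is a minimal Python sketch, not part of the specification: the patterns for Identifier, DQString, and DocString are copied from the productions above (without the trailing Spacing), while the constant and function names are this sketch's own. Note that Python's `\w` is Unicode-aware for `str` patterns; whether a given TDL processor treats `\w` the same way is an assumption.

```python
import re

# Patterns copied from the grammar; the names are hypothetical.
IDENTIFIER = re.compile(r'[\w_+*?-]+')
DQSTRING = re.compile(r'""(?!")|"([^"\\]|\\.)+"')
DOCSTRING = re.compile(r'"""([^"\\]|\\.|"(?!")|""(?!"))*"""')


def is_identifier(text):
    """Return True if the whole of text matches the Identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


def is_dqstring(text):
    """Return True if the whole of text is a double-quoted string."""
    return DQSTRING.fullmatch(text) is not None


def leading_docstring(text):
    """Return the docstring at the start of text, or None if there is none."""
    found = DOCSTRING.match(text)
    return found.group(0) if found else None
```

For example, `is_identifier("n_-_ad-pl_le")` and `is_identifier("*list*")` hold, while `is_identifier("head.only")` does not, because the dot is reserved for attribute paths and statement ends.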

TDL File Interpretation and Conventions

Layout of a type definition

Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:

identifier := (definition body) .

The definition body is a conjunction of terms, optionally with documentation strings, and the syntax permits considerable variation in how those terms are arranged. Nevertheless, there are conventional positions for terms depending on their kind. For instance, the supertypes are generally listed first, followed by an AVM:

head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].

If a documentation string is specified, the conventional place is before the AVM:

n_-_ad-pl_le := norm_np_adv_lexent &
"""
<description>N, can modify, locative (place)
<ex>B lives overseas.
<nex>
<todo>
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].

Or if there is no AVM, before the final dot:

info-str := icons
  """Type for underspecified or "neutral" information structure.""".

Types versus instances

Specifying the text encoding

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:

; -*- coding: utf-8 -*-

In some TDL files, attributes specific to the Emacs text editor may be included:

;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
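
A reader of TDL files might pick up the declared encoding roughly as follows. This is a sketch under assumptions: the pattern is borrowed from the Emacs-style `coding: NAME` convention and matched case-insensitively so that both examples above are recognized; the function name and the fallback of `utf-8` are hypothetical and not prescribed by this page.

```python
import re

# Assumption: the declaration looks like "coding: NAME" or "coding=NAME".
CODING = re.compile(r'coding[:=]\s*([-\w.]+)', re.IGNORECASE)


def detect_tdl_encoding(first_line, default='utf-8'):
    """Return the encoding named in a first-line TDL comment, else default."""
    stripped = first_line.lstrip()
    # Only a comment line (starting with ";") may carry the declaration.
    if not stripped.startswith(';'):
        return default
    found = CODING.search(stripped)
    return found.group(1).lower() if found else default
```

A caller would read the first line of the file as bytes, decode it as ASCII-compatible text, pass it to `detect_tdl_encoding`, and then reopen the file with the returned encoding.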

Feature interpretation of lists

The < ... > and <! ... !> shorthands for lists ("cons lists") and diff-lists, respectively, correspond to ordinary attribute-value structures. The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature FIRST while the rest of the list (the tail) is defined recursively under the feature REST (e.g., REST.REST.FIRST is the third element). The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.

entity                  example      LKB config          ACE config
----------------------  -----------  ------------------  ------------------
cons-list type          *cons*       (not configurable)  cons-type
open list type          *list*       *list-type*         list-type
closed list type        *null*       *empty-list-type*   null-type
diff-list type          *diff-list*  *diff-list-type*    diff-list-type
list head feature       FIRST        *list-head*         (not configurable)
list tail feature       REST         *list-tail*         (not configurable)
diff-list list feature  LIST         *diff-list-list*    (not configurable)
diff-list last feature  LAST         *diff-list-last*    (not configurable)

The examples below use the values from the "example" column of the table above, which are taken from the ERG.

Cons Lists

Regular cons lists may be open (extendable) or closed (fixed-length). The type that < ... > assigns to an open list is *list* (or rather, whichever open list type the grammar defines), but hand-written TDL often uses a subtype of *list*, such as *cons*.

; an empty list is terminated (always empty)
[ ATTR < > ]             =>  [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]           =>  [ ATTR *list* & [ FIRST a,
                                               REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]        =>  [ ATTR *list* & [ FIRST a,
                                               REST [ FIRST b,
                                                      REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]         =>  [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]      =>  [ ATTR *list* & [ FIRST a,
                                               REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ]  =>  [ ATTR *list* & [ FIRST a,
                                               REST #coref ] ]
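
The expansions above follow a regular pattern that can be sketched in Python. The function names and the string-based representation are hypothetical and chosen only for illustration; the default type names come from the "example" column of the table (ERG values), and a real processor would use whatever the grammar configures and would build feature structures rather than strings.

```python
def element_path(n):
    """Return the feature path to the n-th list element (1-based)."""
    return 'REST.' * (n - 1) + 'FIRST'


def expand_cons_list(items, open_ended=False, tail=None,
                     list_type='*list*', null_type='*null*'):
    """Expand < ... > shorthand into FIRST/REST notation.

    items:      term strings in list order
    open_ended: True when the list ends in "..." (not terminated)
    tail:       the term after the "." delimiter, e.g. "#coref"
    """
    if tail is not None:
        end = tail
    elif open_ended:
        end = list_type
    else:
        end = null_type
    result = end
    # Build from the innermost REST outwards.
    for index, item in reversed(list(enumerate(items))):
        avm = f'[ FIRST {item}, REST {result} ]'
        # Only the outermost cell carries the explicit list type,
        # mirroring the expansions shown above.
        result = f'{list_type} & {avm}' if index == 0 else avm
    return result
```

With these definitions, `element_path(3)` gives `REST.REST.FIRST`, and `expand_cons_list(['a'], tail='#coref')` reproduces the last expansion in the listing.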

Diff Lists

Diff lists are regular lists under a LIST attribute, with a LAST attribute that is coreferenced with the open tail of that list (the REST following the last item, or LIST itself when the list is empty). Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).

[ ATTR <! !> ]           =>  [ ATTR *diff-list* & [ LIST #coref,
                                                    LAST #coref ] ]

[ ATTR <! a !> ]         =>  [ ATTR *diff-list* & [ LIST *list* & [ FIRST a,
                                                                    REST #coref ],
                                                    LAST #coref ] ]
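
The same idea can be sketched for diff lists. As before, this is only an illustration: the function name, the string output, and the default coreference label are hypothetical, and the default type names are the ERG values from the table. The expansion for two or more items is an extrapolation by analogy with cons lists, since only the empty and single-item cases are shown above.

```python
def expand_diff_list(items, coref='#coref',
                     list_type='*list*', diff_list_type='*diff-list*'):
    """Expand <! ... !> shorthand into LIST/LAST notation.

    The innermost REST (or LIST itself for an empty diff list) shares
    the coreference with LAST, which leaves the list open for appending.
    """
    inner = coref
    # Build the FIRST/REST chain from the last item outwards.
    for index, item in reversed(list(enumerate(items))):
        avm = f'[ FIRST {item}, REST {inner} ]'
        inner = f'{list_type} & {avm}' if index == 0 else avm
    return f'{diff_list_type} & [ LIST {inner}, LAST {coref} ]'
```

Calling `expand_diff_list([])` and `expand_diff_list(['a'])` yields the two expansions above, modulo line breaks.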

Type documentation

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:

n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
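
Docstrings like the one above mix free text with tagged lines. A tool consuming them might separate the two as in the sketch below. The function name and the returned structure are hypothetical, and the reading of the tags (for instance `<ex>` as a grammatical example and `<nex>` as an ungrammatical one) is inferred from the examples on this page rather than defined here.

```python
import re

# A tagged line has the form "<tag>value"; the value may be empty.
TAG_LINE = re.compile(r'<(\w[\w-]*)>(.*)')


def parse_doc_tags(text):
    """Split docstring text into a summary and a dict of tag -> values."""
    summary, tags = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        found = TAG_LINE.fullmatch(line)
        if found:
            # A tag may occur several times, so values are kept in a list.
            tags.setdefault(found.group(1), []).append(found.group(2).strip())
        else:
            summary.append(line)
    return ' '.join(summary), tags
```

The input is the docstring content without its surrounding triple quotes; untagged lines are joined into the summary in their original order.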

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].

Case sensitivity

Case Sensitive

  • Things inside quotes (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing, i.e. as predicate symbols, but not as CARGs)

Case Insensitive

  • Everything in TDL not inside of quotes.
  • Lexicon look-up.
    • Proper names?
    • Acronyms?
  • These cases could be approached with token mapping (preserve the case information, then downcase anyway)

Unknown

  • Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)

Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel-harmony languages (in linguistic representations, not orthography).

Notes for implementation

DocStrings

Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should either be concatenated, separated by a newline, to the type's previous docstring(s), or be appended to a list of docstrings associated with the type.
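
One way to follow this guidance is sketched below. The class and method names are hypothetical. The sketch takes the "list of docstrings" option, discards non-primary docstrings (which the text permits), and, as an assumption of its own, applies the same first-docstring rule to addenda.

```python
class DocstringStore:
    """Collect docstrings per type from definitions and addenda."""

    def __init__(self):
        self._docs = {}  # type name -> list of docstrings, in order seen

    def add_definition(self, type_name, docstrings):
        """Record the primary (first) docstring of a type definition."""
        if docstrings:
            self._docs.setdefault(type_name, []).append(docstrings[0])

    def add_addendum(self, type_name, docstrings):
        """Append the first docstring of a type addendum, if any."""
        # Assumption: an addendum contributes only its first docstring.
        if docstrings:
            self._docs.setdefault(type_name, []).append(docstrings[0])

    def docstring(self, type_name):
        """Return all stored docstrings for a type, joined by newlines."""
        return '\n'.join(self._docs.get(type_name, []))
```

Keeping a list internally and joining on demand covers both options mentioned above: callers that want a single string use `docstring()`, while the list remains available to implementations that prefer it.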

Comments

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.
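
To illustrate how comments can sit inside a dotted attribute path, the sketch below transcribes the Spacing, Comment, and Identifier productions into Python regular expressions and walks an AttrPath by hand. The names `IDENT`, `SPACING`, and `parse_attr_path` are hypothetical, and the function covers only the AttrPath production, not a full TDL parser.

```python
import re

IDENT = re.compile(r'[\w_+*?-]+')
SPACING = re.compile(
    r'(?:\s+)?'                          # SPACE?
    r'(?:(?:;.*$'                        # LineComment
    r'|#\|(?:[^|\\]|\\.|\|(?!#))*\|#)'   # BlockComment (non-nesting)
    r'(?:\s+)?)*',                       # SPACE? after each Comment
    re.MULTILINE)


def parse_attr_path(text, pos=0):
    """Parse Attribute ( DOT Attribute )* and return (names, end position)."""
    names = []
    while True:
        found = IDENT.match(text, pos)
        if not found:
            raise ValueError(f'expected an attribute name at offset {pos}')
        names.append(found.group())
        # SPACING can match the empty string, so this never fails.
        pos = SPACING.match(text, found.end()).end()
        if not text.startswith('.', pos):
            return names, pos
        pos = SPACING.match(text, pos + 1).end()
```

Applied to `SYNSEM #| comment |# . #| comment |# LOCAL cat`, the function returns the names `SYNSEM` and `LOCAL` and stops at the start of `cat`.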

Open Questions

1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see this thread on the 'developers' mailing list)

2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?

Discussions

TdlRfc (last edited 2020-06-05 06:38:36 by FrancisBond)
