Differences between revisions 10 and 22 (spanning 12 versions)
Revision 10 as of 2018-07-12 17:58:27
Size: 9398
Comment: Add options for docstring positioning
Revision 22 as of 2020-06-05 06:38:36
Size: 18828
Editor: FrancisBond
Comment: added a request to use status for roots and labels.
Line 2: Line 2:
Line 4: Line 5:
== Case Sensitivity ==

=== Case Sensitive ===
<<TableOfContents(3)>>

== TDL File Syntax ==

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

{{{#!highlight ruby
# File Contents
#
# Note: The LKB does not parse environments (:begin ... :end), nor does it
# support :include statements, so the following is only applicable for
# PET, ACE, and perhaps agree.

TdlFile := ( Environment | Statement | Spacing )* EOF
Environment := BEGIN TYPE DOT TypeEnv END TYPE DOT
              | BEGIN INSTANCE Status? DOT InstanceEnv END INSTANCE DOT
TypeEnv := ( Environment | Statement
                | TypeDef | TypeAddendum | Spacing )*
InstanceEnv := ( Environment | Statement
                | InstanceDef | LetterSet | WildCard | Spacing )*
Status := STATUS ( "generic-lex-entry"
                       | "lex-entry"
                       | "lexical-filtering-rule"
                       | "lex-rule"
                       | "post-generation-mapping-rule"
                       | "rule"
                       | "token-mapping-rule" ) Spacing

# Note: The LKB has several Lisp functions which open files in specified
# environments, so the following are parsing targets for those
# functions.

TdlTypeFile := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile := ( InstanceDef | LetterSet | WildCard | Spacing )* EOF

# Note: Krieger & Schaeffer 1994 define a large number of statements, but
# DELPH-IN grammars appear to only use :include.
# Note: :include's string argument is a path relative to the current file's
# directory. If the filename extension is not given, the default ".tdl"
# extension is used. The file is opened in the same environment as the
# :include statements (e.g., :include in a type environment opens the
# file and parses it as TypeEnv)

Statement := Include
Include := INCLUDE Filename DOT
Filename := DQString

# Types and Instances
#
# Note: Instances may be syntactically identical to type definitions, but they
# do not affect the type hierarchy. They may also be lexical rule
# definitions that include an affixing pattern to a definition.

TypeDef := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName := Identifier Spacing

InstanceDef := TypeDef | LexRuleDef
LexRuleDef := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: The characters disallowed in Identifiers are chosen to avoid ambiguity
# with other parts of the TDL syntax.

Identifier := /[^\s!"#$%&'(),.\/:;<=>[\]^|]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
# variations require special productions:
#
# (1) """DocStrings""" may precede any top-level Term or the final DOT
# (2) TypeDef and LexRuleDef require at least one TypeName
# (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
# conjunctions are added:
#
# (1) TypedConj has an obligatory TypeName term
# (2) FeatureConj excludes type terms (including strings, etc.)
# (3) DocConj is a regular conjunction with optional DocStrings
#
# Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
# for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction := Term ( AND Term )*
Term := TypeTerm | FeatureTerm | Coreference
TypeTerm := TypeName
              | DQString
              | Regex
FeatureTerm := Avm
              | DiffList
              | ConsList

DQString := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
Regex := "^" /([^$\\]|\\.)*/ "$" Spacing

Avm := AVMOPEN AttrVals? AVMCLOSE
AttrVals := AttrVal ( COMMA AttrVal )*
AttrVal := AttrPath SPACE Conjunction
AttrPath := Attribute ( DOT Attribute )*
Attribute := Identifier Spacing

DiffList := DLOPEN Conjunctions? DLCLOSE
ConsList := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
# terminals are used directly with an explicit SPACE instead of in
# a production with Spacing.

LetterSet := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar := /\?[^ ]/
Characters := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
# in the AffixSub copies the matched character, in order, so there
# should be the same number of LetterSetVars in both, but this is not
# captured in the syntax.

Affix := AffixClass AffixPattern+ Spacing
AffixClass := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch := NullChar | CharList
AffixSub := CharList
NullChar := "*"
CharList := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines. Also, while block
# comments in Lisp may be nested (`#| outer #| inner |# outer |#`),
# support for nested comments in TDL is mixed (ACE supports it, the
# LKB does not), so this definition does not nest.

Spacing := SPACE? Comment*
SPACE := /\s+/
Comment := ( LineComment | BlockComment ) SPACE?
LineComment := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

BEGIN := ":begin" Spacing
TYPE := ":type" Spacing
INSTANCE := ":instance" Spacing
STATUS := ":status" Spacing
INCLUDE := ":include" Spacing
END := ":end" Spacing
DEFOP := ":=" Spacing
ADDOP := ":+" Spacing
DOT := "." Spacing
AND := "&" Spacing
COMMA := "," Spacing
AVMOPEN := "[" Spacing
AVMCLOSE := "]" Spacing
DLOPEN := "<!" Spacing
DLCLOSE := "!>" Spacing
CLOPEN := "<" Spacing
CLCLOSE := ">" Spacing
ELLIPSIS := "..." Spacing
EOF := "" # end-of-file

}}}
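As a rough illustration, the regular-expression terminals in the description above can be exercised with Python's `re` module. This is a sketch, not a reference implementation. The constant names are made up for the example, the trailing `Spacing` is omitted, and the `Identifier` class escapes `[` and leaves `/` unescaped, which should be equivalent in Python.

```python
import re

# Patterns transcribed from the productions above (without trailing Spacing).
# The constant names are illustrative and not part of any TDL tool.
IDENTIFIER = re.compile(r'''[^\s!"#$%&'(),./:;<=>\[\]^|]+''')
DQSTRING = re.compile(r'""(?!")|"(?:[^"\\]|\\.)+"')
# DOTALL lets an escaped newline (backslash + newline) match inside the body.
DOCSTRING = re.compile(r'"""(?:[^"\\]|\\.|"(?!")|""(?!"))*"""', re.DOTALL)
# MULTILINE makes $ match at the end of each line, not only at end of input.
LINE_COMMENT = re.compile(r';.*$', re.MULTILINE)
BLOCK_COMMENT = re.compile(r'#\|(?:[^|\\]|\\.|\|(?!#))*\|#', re.DOTALL)

if __name__ == "__main__":
    print(bool(IDENTIFIER.fullmatch("n_-_c_le")))
    print(DOCSTRING.match('"""Type for "neutral" info.""" rest').group(0))
    print(BLOCK_COMMENT.match("#| a | b |# x").group(0))
```

The negative lookahead in `DQSTRING` keeps the empty string `""` from matching the start of a triple-quoted docstring.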

== Deprecated TDL Features ==

The following are deprecated features of DELPH-IN TDL. They are no longer considered part of the format, but implementers of TDL parsers may want to include them for backward compatibility. If so, they are encouraged to print warnings upon encountering the deprecated forms so grammar developers know to change them.

=== Subtyping Operator (:<) ===

The `:<` operator was originally used only for declaring a type's position in the type hierarchy (i.e., features could not be specified, unlike with `:=`), but eventually this constraint was relaxed and it became equivalent to `:=`. As of Autumn 2018, the form has been removed and is no longer considered part of DELPH-IN TDL, but the change to TDL syntax to support the operator is minimal:

{{{#!highlight ruby
DEFOP := ( ":=" | ":<" ) Spacing
}}}
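A parser that keeps `:<` for backward compatibility might handle it along the following lines. This is only a sketch: `normalize_defop` is a hypothetical helper, and the use of Python's `warnings` module is one possible way to follow the advice about printing warnings.

```python
import warnings

def normalize_defop(op: str) -> str:
    """Return the operator to use internally, warning on the deprecated ':<'.

    Hypothetical helper: ':<' is treated as equivalent to ':=', as described
    above, and a DeprecationWarning tells grammar developers to change it.
    """
    if op == ":<":
        warnings.warn(
            "':<' is deprecated in DELPH-IN TDL; use ':=' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return ":="
    if op in (":=", ":+"):
        return op
    raise ValueError(f"not a TDL definition operator: {op!r}")
```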

=== Single-quoted Symbols ('symbol) ===

Double-quoted strings and identifiers are both type names, but there used to be Lisp-like single-quoted symbols as well. These still exist in some grammars, such as those using an old version of [[MatrixTop|matrix.tdl]], which has the following:

{{{
implicit-coord-rel := coordination-relation &
  [ PRED 'implicit_coord_rel ].
null-coord-rel := coordination-relation &
  [ PRED 'null_coord_rel ].
}}}

There is no difference between using quoted symbols and regular strings or identifiers (although identifiers would need to be defined as types somewhere), so recent versions of `matrix.tdl` have this instead:

{{{
implicit-coord-rel := coordination-relation &
  [ PRED "implicit_coord_rel" ].
null-coord-rel := coordination-relation &
  [ PRED "null_coord_rel" ].
}}}

The change to the syntax to support quoted symbols is as follows:

{{{#!highlight ruby
TypeTerm := TypeName
              | DQString
              | Regex
              | QSymbol
QSymbol := "'" Identifier Spacing
}}}
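Since quoted symbols and strings are interchangeable, old grammar files could in principle be migrated mechanically. The following is a naive sketch under that assumption. `qsymbols_to_strings` is a hypothetical helper, and it does not skip comments or docstrings, so an apostrophe inside those (e.g., in "don't") would also be rewritten.

```python
import re

# A single quote followed by an Identifier (pattern from the syntax description).
_QSYMBOL = re.compile(r"""'([^\s!"#$%&'(),./:;<=>\[\]^|]+)""")

def qsymbols_to_strings(tdl: str) -> str:
    """Rewrite 'symbol as "symbol" everywhere in the given TDL text.

    Naive: comments and docstrings are not treated specially.
    """
    return _QSYMBOL.sub(r'"\1"', tdl)

if __name__ == "__main__":
    print(qsymbols_to_strings("[ PRED 'implicit_coord_rel ]."))
```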

== TDL File Interpretation and Conventions ==

=== Layout of a type definition ===

Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:

{{{
identifier := (definition body) .
}}}

The definition body is just a conjunction of terms, possibly with documentation strings, and there is much valid variation in how those terms are arranged.
Nevertheless, there are conventional locations for these terms depending on what kind of term they are.
For instance, the supertypes are generally listed first, followed by an AVM:

{{{
head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].
}}}

If a documentation string is specified, the conventional place is before the AVM:

{{{
n_-_ad-pl_le := norm_np_adv_lexent &
"""
<description>N, can modify, locative (place)
<ex>B lives overseas.
<nex>
<todo>
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].
}}}

Or if there is no AVM, before the final dot:

{{{
info-str := icons
  """Type for underspecified or "neutral" information structure.""".
}}}



=== Types versus instances ===
=== Specifying the text encoding ===

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages.
For instance, the following sets the encoding to UTF-8:

{{{#!highlight scheme
; -*- coding: utf-8 -*-
}}}

In some TDL files, attributes specific to the [[https://www.gnu.org/software/emacs/|Emacs]] text editor may be included:

{{{#!highlight scheme
;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
}}}
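A processor could pick up the declared encoding with a small regular expression, loosely modeled on how Python reads its own coding comments. This is a sketch under assumptions: `declared_encoding` is a hypothetical helper, and falling back to UTF-8 when nothing is declared is a choice made for the example, not something the format prescribes.

```python
import re

# Matches "coding: utf-8" or "Coding: utf-8" inside the first-line comment.
_CODING = re.compile(r'coding[:=]\s*([-\w.]+)', re.IGNORECASE)

def declared_encoding(first_line: str, default: str = "utf-8") -> str:
    """Return the encoding named in a leading TDL comment, else `default`.

    The default is an assumption of this sketch.
    """
    stripped = first_line.lstrip()
    if not stripped.startswith(";"):
        return default  # the first line is not a line comment
    match = _CODING.search(stripped)
    return match.group(1) if match else default

if __name__ == "__main__":
    print(declared_encoding("; -*- coding: utf-8 -*-"))
```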


=== Feature interpretation of lists ===

The `< ... >` and `<! ... !>` shorthands for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs.
The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature `FIRST` while the rest of the list (the tail) is defined recursively under the feature `REST` (e.g., `REST.REST.FIRST` is the third element).
The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.

|| '''entity''' || '''example''' || '''LKB config''' || '''ACE config''' ||
|| cons-list type || `*cons*` || (not configurable) || `cons-type` ||
|| open list type || `*list*` || `*list-type*` || `list-type` ||
|| closed list type || `*null*` || `*empty-list-type*` || `null-type` ||
|| diff-list type || `*diff-list*` || `*diff-list-type*` || `diff-list-type` ||
|| list head feature || `FIRST` || `*list-head*` || (not configurable) ||
|| list tail feature || `REST` || `*list-tail*` || (not configurable) ||
|| diff-list list feature || `LIST` || `*diff-list-list*` || (not configurable) ||
|| diff-list last feature || `LAST` || `*diff-list-last*` || (not configurable) ||

The examples below use the values defined in the above table, which are taken from the ERG.

==== Cons Lists ====

Regular cons lists may be open (extendable) or closed (fixed-length).
The type of an open list as interpreted by, e.g., `< ... >`, is `*list*` (rather, the defined open list type), but in hand-written TDL a subtype of `*list*` is often used, such as `*cons*`.

{{{#!highlight scheme
; an empty list is terminated (always empty)
[ ATTR < > ] => [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ] => [ ATTR *list* & [ FIRST a,
                                               REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ] => [ ATTR *list* & [ FIRST a,
                                               REST [ FIRST b,
                                                      REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ] => [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ] => [ ATTR *list* & [ FIRST a,
                                               REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ] => [ ATTR *list* & [ FIRST a,
                                               REST #coref ] ]
}}}
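The expansions above can be reproduced by a short function that builds the `FIRST`/`REST` encoding as TDL text. This is an illustrative sketch, not how any particular processor does it. `expand_cons_list` and its parameters are hypothetical names, and the list types default to the ERG values from the table.

```python
def expand_cons_list(items, end=None, list_type="*list*", null_type="*null*"):
    """Expand the `< ... >` shorthand into FIRST/REST attribute-value text.

    items: the list items, as TDL strings.
    end:   None for a closed list, "..." for an open list, or any other
           string (e.g. "#coref") for the value after the `.` delimiter.
    """
    if end is None:
        tail = null_type      # closed list: terminated
    elif end == "...":
        tail = list_type      # open list: not terminated
    else:
        tail = end            # dotted tail, e.g. a coreference
    if not items:
        return tail
    body = tail
    for item in reversed(items):  # build from the innermost REST outward
        body = f"[ FIRST {item}, REST {body} ]"
    return f"{list_type} & {body}"

if __name__ == "__main__":
    print(expand_cons_list(["a", "b"]))
```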

==== Diff Lists ====

Diff lists are regular lists under a `LIST` attribute, with `LAST` pointing to the end of the list (the `REST` of the final item, or `LIST` itself when the list is empty).
Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see [[GeFaqDiffList]]).

{{{#!highlight scheme
[ ATTR <! !> ] => [ ATTR *diff-list* & [ LIST #coref,
                                                    LAST #coref ] ]

[ ATTR <! a !> ] => [ ATTR *diff-list* & [ LIST *list* & [ FIRST a,
                                                                    REST #coref ],
                                                    LAST #coref ] ]
}}}
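The same expansion can be sketched for diff lists. `expand_diff_list` is a hypothetical helper that emits TDL text. The multi-item case is extrapolated from the two examples above by analogy with cons lists, and a real implementation would need a coreference name that is unique within the definition.

```python
def expand_diff_list(items, coref="#coref",
                     list_type="*list*", diff_list_type="*diff-list*"):
    """Expand the `<! ... !>` shorthand into LIST/LAST attribute-value text.

    The final REST (or LIST itself, when there are no items) is coreferenced
    with LAST. The default `coref` name is only a placeholder.
    """
    body = coref
    for item in reversed(items):  # build from the innermost REST outward
        body = f"[ FIRST {item}, REST {body} ]"
    if items:
        body = f"{list_type} & {body}"
    return f"{diff_list_type} & [ LIST {body}, LAST {coref} ]"

if __name__ == "__main__":
    print(expand_diff_list(["a"]))
```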

=== Type documentation ===

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (`.`) character:

{{{
n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
}}}

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

{{{#!highlight cl
; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].
}}}

=== Case sensitivity ===

==== Case Sensitive ====
Line 10: Line 387:
=== Case Insensitive === ==== Case Insensitive ====
Line 19: Line 396:
=== Unknown === ==== Unknown ====
Line 25: Line 402:
== Doc Strings ==

TDL types allow a doc string:
{{{
n_-_c_le := n_intr_lex_entry &
"Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.".
}}}

== TDL File Syntax ==

{{{#!highlight ruby
# File Contents

TdlTypeFile := ( TypeDef | Spacing )* EOF
TdlRuleFile := ( LexRuleDef | MorphSet | Spacing )* EOF

# Types and Lexical Rules

TypeDef := Type ( AvmDef | AvmAddendum ) Dot
AvmDef := DefOp DefBody
AvmAddendum := AddOp ( DefBody
                      | DocString? Conjunction
                      | DocString )
LexRuleDef := LexRuleId DefOp Affix? DefBody Dot
DefBody := Supertypes ( And DocString? Conjunction | DocString? )
Supertypes := Type ( And Type )*
Type := Identifier Spacing
LexRuleId := Identifier Spacing
DocString := DQString
Conjunction := Term ( And Term )*
Term := ( Type
                | FeatureTerm
                | DiffList
                | ConsList
                | Coreference
                | DQString
                | QSymbol
                | Regex
                )
FeatureTerm := LBrack AttrVals? RBrack
AttrVals := AttrVal ( Comma AttrVal )*
AttrVal := Attribute ( Dot Attribute )* Conjunction
Attribute := Identifier Spacing
DiffList := DLOpen Conjunctions? DLClose
ConsList := CLOpen ( Conjunctions ConsEnd? )? CLClose
ConsEnd := Comma Ellipsis | Dot Conjunction
Conjunctions := Conjunction ( Comma Conjunction )*
Coreference := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes

MorphSet := "%" "(" ( LetterSetDef | WildCardDef ) ")"
LetterSetDef := "letter-set" Space? "(" LetterSetVar Space LetterSet ")"
WildCardDef := "wild-card" Space? "(" WildCardVar Space LetterSet ")"
LetterSetVar := /![^ ]/
WildCardVar := /\?[^ ]/
LetterSet := /([^)\\]|\\.)+/
Affix := AffixClass AffixPattern+ Spacing
AffixClass := "%prefix" | "%suffix"
AffixPattern := Space? "(" ( NullChar | CharList ) Space CharList ")"
CharList := ( LetterSetVar | WildCardVar | AffixChar )+
NullChar := "*"
AffixChar := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments

Spacing := Space? Comment*
Space := /\s+/
Comment := ( LineComment | BlockComment ) Space?
LineComment := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|[^#])*/ "|#"

# Literals

DefOp := ":=" Spacing
AddOp := ":+" Spacing
Identifier := /[^\s.:<=&,#[]$()>!^\/]+/
Dot := "." Spacing
And := "&" Spacing
Comma := "," Spacing
LBrack := "[" Spacing
RBrack := "]" Spacing
DLOpen := "<!" Spacing
DLClose := "!>" Spacing
CLOpen := "<" Spacing
CLClose := ">" Spacing
Ellipsis := "..." Spacing
DQString := /"([^"\\]|\\.)*"/ Spacing
QSymbol := "'" Identifier Spacing
Regex := "^" /([^$\\]|\\.)*/ "$"
}}}


== Docstring Revision ==

Currently docstrings are regular strings that appear before a Term in a !TypeDef, presumably after the list of supertypes:

{{{
type := supertype1 & supertype2 &
  "Docstring"
  [ ... ].
}}}

But this syntax is not supported in all processors (namely PET), and the others allow variations. At the 2018 summit in Paris (see DiderotSchedule), there was a decision to distinguish docstrings from other strings by using triple-quotes (three double-quotes in a row, similar to Python), which additionally allows quotes to appear inside the docstring.


{{{
type := supertype1 & supertype2 &
  """Docstring"""
  [ ... ].
}}}

This changed the !DocString production like so:

{{{#!highlight ruby
DocString := /"""([^"\\]|\\.|"[^"]|""[^"])*"""/ Spacing
}}}

Note that an unescaped quote cannot appear directly before the closing triple-quotes. More precisely, it can, but the string would then be terminated one character early, leaving an extra quote character in the stream.

There are remaining questions about their placement.

=== Option 1: Placed before any Term with multiple docstrings per type allowed ===

Where multiple docstrings occur, the type's final docstring is the concatenation of them.

Example:

{{{
type := """here""" supertype1 & """here""" supertype2 &
  """here, too"""
  [ ... ] """maybe here?""".
}}}

This can be implemented by changing the following productions:

{{{#!highlight ruby
TypeDef := Type ( AvmDef | AvmAddendum ) DocString? Dot # maybe
LexRuleDef := LexRuleId DefOp Affix? DefBody DocString? Dot # maybe
AvmAddendum := AddOp ( DefBody | Conjunction | DocString )
DefBody := Supertypes ( And Conjunction )?
Supertypes := DocString? Type ( And DocString? Type )*
Term := DocString? ( Type
                           | FeatureTerm
                           | DiffList
                           | ConsList
                           | Coreference
                           | DQString
                           | QSymbol
                           | Regex
                           )
}}}

=== Option 2: Placed before any Term with only one docstring per type allowed ===

Example:

{{{
type := supertype1 & """just one, somewhere""" supertype2 &
  [ ... ].
}}}

This is more complicated to describe as production rules (need to duplicate several productions; some for use before docstring is encountered, then others for use after), but the implementation may be simple (just set a flag after reading a docstring).

=== Option 3: Once, after the list of supertypes and before any feature list ===

Example:

{{{
type := supertype1 & supertype2 &
  """just one, here"""
  [ ... ].

type2 := supertype1 &
  """what about this?"""
  [ ... ] & supertype2.
}}}

This is not hard to implement. If it only needs to appear after *a* list of supertypes (both examples above), it's the same as in the full production list above (but other supertypes could appear after a feature list, for instance). If one wants to ensure that all supertypes appear before any docstring or feature list (only the first example above), then we need to duplicate the Conjunction and Term productions to disallow Types at the top level. If that's something desired, it would look like this:

{{{#!highlight ruby
AvmAddendum := AddOp ( DefBody | DocString? NoTypeConj | DocString )
DefBody := Supertypes DocString? ( And NoTypeConj )?
NoTypeConj := NoTypeTerm ( And NoTypeTerm )*
NoTypeTerm := ( FeatureTerm
                | DiffList
                | ConsList
                | Coreference
                | DQString
                | QSymbol
                | Regex
                )
}}}

=== Option 4: Once, immediately after the typedef or addendum operators ===

Example:

{{{
type := """just one, here"""
  supertype1 & supertype2 &
  [ ... ].

type :=
  """
  example
  with
  multiple
  lines
  """
  supertype1 & supertype2 &
  [ ... ].
}}}

This is the simplest to implement, and the !DefBody and !Supertypes productions would be unnecessary (unless we still want supertypes to appear first):

{{{#!highlight ruby
AvmDef := DefOp DocString? Conjunction
AvmAddendum := AddOp ( DocString? Conjunction | DocString )
LexRuleDef := LexRuleId DefOp DocString? Affix? Conjunction Dot
}}}

Some previously objected to it on aesthetic grounds, although that is subjective.

== Questions ==

1. The `^` character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them?

== Notes for implementation ==

=== DocStrings ===

Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type-addenda should be concatenated with a newline to the previous docstring(s), or appended to a list of docstrings, associated with the type.
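One way to read this policy is sketched below. Both helper names are hypothetical, and the sketch picks the "concatenate with a newline" option for addenda rather than keeping a list.

```python
def primary_docstring(docstrings):
    """Return the first docstring found on a definition, or None if none."""
    return docstrings[0] if docstrings else None

def merge_addendum(existing, addendum):
    """Join an addendum's docstring onto a type's existing docstring.

    Either argument may be None or empty. A newline separates the two
    when both are present.
    """
    if not addendum:
        return existing
    if not existing:
        return addendum
    return existing + "\n" + addendum

if __name__ == "__main__":
    doc = primary_docstring(["Intransitive count noun", "secondary note"])
    print(merge_addendum(doc, "Added in an addendum."))
```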

=== Comments ===

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., `[ SYNSEM #| comment |# . #| comment |# LOCAL ... ]`), although grammar developers may want to use this flexibility sparingly.
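To make the consequence concrete, here is a small sketch of reading a dotted attribute path while skipping `Spacing` (whitespace and comments) after each attribute and each dot. The function names are hypothetical, and the patterns are transcribed from the syntax description.

```python
import re

# Spacing := SPACE? Comment*, with Comment := (LineComment | BlockComment) SPACE?
_SPACING = re.compile(
    r'\s*(?:(?:;[^\n]*|#\|(?:[^|\\]|\\.|\|(?!#))*\|#)\s*)*', re.DOTALL)
_ATTRIBUTE = re.compile(r'''[^\s!"#$%&'(),./:;<=>\[\]^|]+''')

def skip_spacing(text, pos=0):
    """Return the position after any whitespace and comments at `pos`."""
    return _SPACING.match(text, pos).end()

def read_attr_path(text, pos=0):
    """Read Attribute ( DOT Attribute )* and return (attributes, new_pos)."""
    attrs = []
    while True:
        match = _ATTRIBUTE.match(text, pos)
        if match is None:
            raise ValueError(f"expected an attribute at position {pos}")
        attrs.append(match.group(0))
        pos = skip_spacing(text, match.end())
        if text.startswith(".", pos):
            pos = skip_spacing(text, pos + 1)  # comments may follow the dot
        else:
            return attrs, pos

if __name__ == "__main__":
    print(read_attr_path("SYNSEM #| comment |# . #| comment |# LOCAL.CAT value"))
```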

== Open Questions ==

1. The `^` character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see [[http://lists.delph-in.net/archives/developers/2009/thread.html#1082|this thread]] on the 'developers' mailing list)
Line 257: Line 419:
3. When supertypes are required (e.g., on a !TypeDef), must they appear before other Terms in the Conjunction? (see [[#Docstring_Revision]] above)

4. Should the (deprecated or repurposed) subtype operator (`:<`) be included in the syntax description?

5. Is variation allowed with regard to the position of docstrings? (see [[#Docstring_Revision]] above)

6. Are spaces allowed inside a feature path? Comments?
   {{{
   type := supertype &
     [ ATTR1
       . ; comment here?
       ATTR2 value ].
   }}}
   For that matter, are comments allowed anywhere that whitespace is (except maybe letter-sets and lex-rule affix patterns)?
3. Can we use 'status' to identify roots and labels (parsenodes)? Something like:
{{{
;;
;; parse-tree labels (instances)
;;

:begin :instance :status label.
:include "parse-nodes".
:end :instance.

;;
;; start symbols of the grammar (instances)
;;

:begin :instance :status root.
:include "roots".
:include "educ/roots-educ".
:end :instance.
}}}
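For reference, the `:include` arguments in an example like this one are resolved relative to the including file's directory, with `.tdl` as the default extension when none is given (see the notes in the syntax description). A minimal sketch of that resolution follows. `resolve_include` and the `/grammar/english.tdl` path are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath

def resolve_include(current_file, filename):
    """Resolve an :include argument against the including file's directory.

    If `filename` has no extension, ".tdl" is appended. POSIX-style paths
    are assumed to keep the example deterministic.
    """
    path = PurePosixPath(current_file).parent / filename
    if path.suffix == "":
        path = path.with_suffix(".tdl")
    return path

if __name__ == "__main__":
    print(resolve_include("/grammar/english.tdl", "educ/roots-educ"))
```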
Line 278: Line 446:
 * [[http://lists.delph-in.net/archives/developers/2006/000419.html|Mailing list discussion about docstrings (Feb 2006)]]
 * [[http://lists.delph-in.net/archives/developers/2006/000550.html|Mailing list discussion about type addenda (Jul 2006)]]
 * [[http://lists.delph-in.net/archives/developers/2007/000762.html|Mailing list discussion about docstrings (Mar 2007)]]
 * [[http://lists.delph-in.net/archives/developers/2007/000868.html|Mailing list discussion about docstrings (Sep 2007)]]
 * [[http://lists.delph-in.net/archives/developers/2008/001037.html|Mailing list discussion about the :+ and :< operators (Nov 2008)]]
 * [[http://lists.delph-in.net/archives/developers/2009/001082.html|Mailing list discussion about regular expressions in TDL (Jan 2009)]]
 * [[http://lists.delph-in.net/archives/developers/2018/002754.html|Mailing list discussion about TDL syntax (Jul 2018)]]
 * [[http://lists.delph-in.net/archives/developers/2018/002792.html|Mailing list discussion about docstrings (Aug 2018)]]

Type Description Language and other aspects of DELPH-IN Joint Reference Formalism

TDL File Syntax

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

   1 # File Contents
   2 #
   3 # Note: The LKB does not parse environments (:begin ... :end), nor does it
   4 #       support :include statements, so the following is only applicable for
   5 #       PET, ACE, and perhaps agree.
   6 
   7 TdlFile      := ( Environment | Statement | Spacing )* EOF
   8 Environment  := BEGIN TYPE DOT TypeEnv END TYPE DOT
   9               | BEGIN INSTANCE Status? DOT InstanceEnv END INSTANCE DOT
  10 TypeEnv      := ( Environment | Statement
  11                 | TypeDef | TypeAddendum | Spacing )*
  12 InstanceEnv  := ( Environment | Statement
  13                 | InstanceDef | LetterSet | WildCard | Spacing )*
  14 Status       := STATUS ( "generic-lex-entry"
  15                        | "lex-entry"
  16                        | "lexical-filtering-rule"
  17                        | "lex-rule"
  18                        | "post-generation-mapping-rule"
  19                        | "rule"
  20                        | "token-mapping-rule" ) Spacing
  21 
  22 # Note: The LKB has several Lisp functions which open files in specified
  23 #       environments, so the following are parsing targets for those
  24 #       functions.
  25 
  26 TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
  27 TdlRuleFile  := ( InstanceDef | LetterSet | WildCard | Spacing )* EOF
  28 
  29 # Note: Krieger & Schaeffer 1994 define a large number of statements, but
  30 #       DELPH-IN grammars appear to only use :include.
  31 # Note: :include's string argument is a path relative to the current file's
  32 #       directory. If the filename extension is not given, the default ".tdl"
  33 #       extension is used. The file is opened in the same environment as the
  34 #       :include statements (e.g., :include in a type environment opens the
  35 #       file and parses it as TypeEnv)
  36 
  37 Statement    := Include
  38 Include      := INCLUDE Filename DOT
  39 Filename     := DQString
  40 
  41 # Types and Instances
  42 #
  43 # Note: Instances may be syntactically identical to type definitions, but they
  44 #       do not affect the type hierarchy. They may also be lexical rule
  45 #       definitions that include an affixing pattern to a definition.
  46 
  47 TypeDef      := TypeName DEFOP TypeDefBody DOT
  48 TypeAddendum := TypeName ADDOP AddendumBody DOT
  49 TypeName     := Identifier Spacing
  50 
  51 InstanceDef  := TypeDef | LexRuleDef
  52 LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
  53 LexRuleId    := Identifier Spacing
  54 
  55 # Identifiers are used in several patterns
  56 #
  57 # Note: The characters disallowed in Identifiers are chosen to avoid ambiguity
  58 #       with other parts of the TDL syntax.
  59 
  60 Identifier   := /[^\s!"#$%&'(),.\/:;<=>[\]^|]+/
  61 
  62 # Definition Bodies (top-level conjunctions of terms)
  63 #
  64 # Note: Definition bodies are most simply Conjunctions, but several
  65 #       variations require special productions:
  66 #
  67 #       (1) """DocStrings""" may precede any top-level Term or the final DOT
  68 #       (2) TypeDef and LexRuleDef require at least one TypeName
  69 #       (3) TypeAddendum may use a DocString in place of a Conjunction
  70 #           
  71 
  72 TypeDefBody  := TypedConj DocString?
  73 AddendumBody := DocConj DocString? | DocString
  74 
  75 # Note: To accommodate TypeDefBody and AddendumBody, three special
  76 #       conjunctions are added:
  77 #
  78 #       (1) TypedConj has an obligatory TypeName term
  79 #       (2) FeatureConj excludes type terms (including strings, etc.)
  80 #       (3) DocConj is a regular conjunction with optional DocStrings
  81 #
  82 #       Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
  83 #       for LALR parsing); if ambiguity is allowed, DocConj may be used.
  84 
  85 TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
  86 FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
  87 DocConj      := DocString? Term ( AND DocString? Term )*
  88 
  89 # Note: The DocString pattern may span multiple lines
  90 
  91 DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing
  92 
  93 # Terms and Conjunctions
  94 
  95 Conjunction  := Term ( AND Term )*
  96 Term         := TypeTerm | FeatureTerm | Coreference
  97 TypeTerm     := TypeName
  98               | DQString
  99               | Regex
 100 FeatureTerm  := Avm
 101               | DiffList
 102               | ConsList
 103 
 104 DQString     := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
 105 Regex        := "^" /([^$\\]|\\.)*/ "$" Spacing
 106 
 107 Avm          := AVMOPEN AttrVals? AVMCLOSE
 108 AttrVals     := AttrVal ( COMMA AttrVal )*
 109 AttrVal      := AttrPath SPACE Conjunction
 110 AttrPath     := Attribute ( DOT Attribute )*
 111 Attribute    := Identifier Spacing
 112 
 113 DiffList     := DLOPEN Conjunctions? DLCLOSE
 114 ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
 115 ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
 116 Conjunctions := Conjunction ( COMMA Conjunction )*
 117 
 118 Coreference  := "#" Identifier Spacing
 119 
 120 # Letter-sets, Wild-cards, and Affixes
 121 #
 122 # Note: spacing is sensitive within these patterns, so many non-content
 123 #       terminals are used directly with an explicit SPACE instead of in
 124 #       a production with Spacing.
 125 
 126 LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
 127 WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
 128 LetterSetDef := "(" LetterSetVar SPACE Characters ")"
 129 WildCardDef  := "(" WildCardVar SPACE Characters ")"
 130 LetterSetVar := /![^ ]/
 131 WildCardVar  := /\?[^ ]/
 132 Characters   := /([^)\\]|\\.)+/
 133 
 134 # Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
 135 #       in the AffixSub copies the matched character, in order, so there
 136 #       should be the same number of LetterSetVars in both, but this is not
 137 #       captured in the syntax.
 138 
 139 Affix        := AffixClass AffixPattern+ Spacing
 140 AffixClass   := "%prefix" | "%suffix"
 141 AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
 142 AffixMatch   := NullChar | CharList
 143 AffixSub     := CharList
 144 NullChar     := "*"
 145 CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
 146 AffixChar    := /([^!?\s*\\]|\\[^ ])+/
 147 
 148 # Whitespace and Comments
 149 #
 150 # Note: SPACE and BlockComment may span multiple lines. Also, while block
 151 #       comments in Lisp may be nested (`#| outer #| inner |# outer |#`),
 152 #       support for nested comments in TDL is mixed (ACE supports it, the
 153 #       LKB does not), so this definition does not nest.
 154 
 155 Spacing      := SPACE? Comment*
 156 SPACE        := /\s+/
 157 Comment      := ( LineComment | BlockComment ) SPACE?
 158 LineComment  := /;.*$/
 159 BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"
 160 
 161 # Non-content Terminals
 162 
 163 BEGIN        := ":begin" Spacing
 164 TYPE         := ":type" Spacing
 165 INSTANCE     := ":instance" Spacing
 166 STATUS       := ":status" Spacing
 167 INCLUDE      := ":include" Spacing
 168 END          := ":end" Spacing
 169 DEFOP        := ":=" Spacing
 170 ADDOP        := ":+" Spacing
 171 DOT          := "." Spacing
 172 AND          := "&" Spacing
 173 COMMA        := "," Spacing
 174 AVMOPEN      := "[" Spacing
 175 AVMCLOSE     := "]" Spacing
 176 DLOPEN       := "<!" Spacing
 177 DLCLOSE      := "!>" Spacing
 178 CLOPEN       := "<" Spacing
 179 CLCLOSE      := ">" Spacing
 180 ELLIPSIS     := "..." Spacing
 181 EOF          := ""  # end-of-file
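
As a rough illustration of the comment productions above, the following Python sketch transcribes LineComment and BlockComment into `re` patterns. The function names are hypothetical, and the sketch deliberately ignores the fact that `;` and `#|` may also occur inside strings or regular expressions, which a real tokenizer would have to handle.

```python
import re

# Transcriptions of the LineComment and BlockComment productions.
# Assumption: the input contains no strings or regexes holding ';' or '#|'.
LINE_COMMENT = re.compile(r";.*$", re.MULTILINE)
BLOCK_COMMENT = re.compile(r"#\|(?:[^|\\]|\\.|\|(?!#))*\|#", re.DOTALL)


def find_line_comments(text):
    """Return every ';' comment in text, each running to its end of line."""
    return [m.group(0) for m in LINE_COMMENT.finditer(text)]


def find_block_comments(text):
    """Return every '#| ... |#' comment in text; nesting is not tracked."""
    return [m.group(0) for m in BLOCK_COMMENT.finditer(text)]


if __name__ == "__main__":
    # The first '|#' closes the comment, so ' outer |#' is left over.
    print(find_block_comments("#| outer #| inner |# outer |#"))
```

Because the pattern stops at the first `|#`, a nested comment is cut short, which matches the non-nesting definition given in the grammar.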

Deprecated TDL Features

The following are deprecated features of DELPH-IN TDL. They are no longer considered part of the format, but implementers of TDL parsers may want to include them for backward compatibility. If so, they are encouraged to print warnings upon encountering the deprecated forms so grammar developers know to change them.

Subtyping Operator (:<)

The :< operator was originally used only for declaring a type's position in the type hierarchy (i.e., features could not be specified, unlike with :=), but eventually this constraint was relaxed and it became equivalent to :=. As of Autumn 2018, the form has been removed and is no longer considered part of DELPH-IN TDL, but the change to TDL syntax to support the operator is minimal:

DEFOP        := ( ":=" | ":<" ) Spacing

Single-quoted Symbols ('symbol)

Double-quoted strings and identifiers are both type names, but there used to be Lisp-like single-quoted symbols as well. These still exist in some grammars, such as those using an old version of matrix.tdl, which has the following:

implicit-coord-rel := coordination-relation &
  [ PRED 'implicit_coord_rel ].
null-coord-rel := coordination-relation &
  [ PRED 'null_coord_rel ].

There is no difference between using quoted symbols and regular strings or identifiers (although identifiers would need to be defined as types somewhere), so recent versions of matrix.tdl have this instead:

implicit-coord-rel := coordination-relation &
  [ PRED "implicit_coord_rel" ].
null-coord-rel := coordination-relation &
  [ PRED "null_coord_rel" ].

The change to the syntax to support quoted symbols is as follows:

TypeTerm     := TypeName
              | DQString
              | Regex
              | QSymbol
QSymbol      := "'" Identifier Spacing
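
One way a parser might follow the advice about warnings is to normalize deprecated forms at the token level. The Python sketch below assumes the input has already been split into tokens. The function name `normalize_token` and the use of `DeprecationWarning` are illustrative choices, not part of any existing TDL implementation.

```python
import warnings


def normalize_token(token):
    """Map a deprecated token to its current form, warning when it changes.

    Hypothetical helper: assumes the token was already lexed, so a leading
    single quote can only mean a Lisp-style quoted symbol.
    """
    if token == ":<":
        warnings.warn("':<' is deprecated; use ':=' instead",
                      DeprecationWarning, stacklevel=2)
        return ":="
    if token.startswith("'") and len(token) > 1:
        warnings.warn("quoted symbols are deprecated; use a double-quoted "
                      "string instead", DeprecationWarning, stacklevel=2)
        return '"' + token[1:] + '"'
    return token


if __name__ == "__main__":
    warnings.simplefilter("always")
    print(normalize_token("'implicit_coord_rel"))
```

Rewriting the token instead of rejecting it keeps old grammars loading while still telling the grammar developer what to change.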

TDL File Interpretation and Conventions

Layout of a type definition

Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:

identifier := (definition body) .

The definition body is just a conjunction of terms, maybe with documentation strings, and there is much valid variation in how those terms are arranged. Nevertheless, there are conventional locations for these terms depending on what kind of term they are. For instance, the supertypes are generally listed first, followed by an AVM:

head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].

If a documentation string is specified, the conventional place is before the AVM:

n_-_ad-pl_le := norm_np_adv_lexent &
"""
<description>N, can modify, locative (place)
<ex>B lives overseas.
<nex>
<todo>
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].

Or if there is no AVM, before the final dot:

info-str := icons
  """Type for underspecified or "neutral" information structure.""".

Types versus instances

Specifying the text encoding

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:

; -*- coding: utf-8 -*-

In some TDL files, attributes specific to the Emacs text editor may be included:

;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
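
A processor could read such a declaration roughly as in the Python sketch below. The function name `declared_encoding` and the exact pattern are assumptions made for illustration; this page does not define how strictly the comment must be formatted, so the sketch only requires a leading `;` and a `coding:` key matched case-insensitively.

```python
import re

# Assumption: the encoding name consists of word characters, '-', and '.'.
CODING = re.compile(r"coding:\s*([-\w.]+)", re.IGNORECASE)


def declared_encoding(first_line, default=None):
    """Return the encoding named in a first-line comment, else default."""
    if not first_line.lstrip().startswith(";"):
        return default  # not a comment, so nothing is declared
    match = CODING.search(first_line)
    return match.group(1).lower() if match else default


if __name__ == "__main__":
    print(declared_encoding("; -*- coding: utf-8 -*-"))
```

Returning `default` when no declaration is found leaves the choice of fallback encoding to the caller.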

Feature interpretation of lists

The < ... > and <! ... !> shorthand for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs. The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature FIRST while the rest of the list (the tail) is defined recursively under the feature REST (e.g., REST.REST.FIRST is the third element). The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.

|| entity || example || LKB config || ACE config ||
|| cons-list type || *cons* || (not configurable) || cons-type ||
|| open list type || *list* || *list-type* || list-type ||
|| closed list type || *null* || *empty-list-type* || null-type ||
|| diff-list type || *diff-list* || *diff-list-type* || diff-list-type ||
|| list head feature || FIRST || *list-head* || (not configurable) ||
|| list tail feature || REST || *list-tail* || (not configurable) ||
|| diff-list list feature || LIST || *diff-list-list* || (not configurable) ||
|| diff-list last feature || LAST || *diff-list-last* || (not configurable) ||

For the examples below, I use the values defined in the above table, which are taken from the ERG.

Cons Lists

Regular cons lists may be open (extendable) or closed (fixed-length). The < ... > shorthand assigns an open list the type *list* (or, more precisely, whatever open list type the grammar has configured), but hand-written TDL often uses a subtype of *list* instead, such as *cons*.

; an empty list is terminated (always empty)
[ ATTR < > ]             =>  [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]           =>  [ ATTR *list* & [ FIRST a,
                                               REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]        =>  [ ATTR *list* & [ FIRST a,
                                               REST [ FIRST b,
                                                      REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]         =>  [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]      =>  [ ATTR *list* & [ FIRST a,
                                               REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ]  =>  [ ATTR *list* & [ FIRST a,
                                               REST #coref ] ]
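
The expansions above can be sketched in Python as follows. This is an illustrative model only: feature structures are represented as plain dicts with a hypothetical `"type"` key, the function name `expand_cons_list` is made up for the example, and the type names default to the ERG values used on this page.

```python
def expand_cons_list(items, open_ended=False, dotted_tail=None,
                     list_type="*list*", null_type="*null*"):
    """Expand the < ... > shorthand into nested FIRST/REST structures.

    items:       the list members, in order
    open_ended:  True if the list ends with ... (unterminated)
    dotted_tail: value given after the . delimiter, if any
    """
    if open_ended and dotted_tail is not None:
        raise ValueError("a list cannot end with both ... and a dotted tail")
    if dotted_tail is not None and not items:
        raise ValueError("a dotted tail needs at least one preceding item")

    # Choose what the final REST (or the whole value, if empty) will be.
    if dotted_tail is not None:
        node = dotted_tail
    elif open_ended:
        node = list_type
    else:
        node = null_type

    # Build from the last item backwards so each cell wraps the rest.
    for item in reversed(items):
        node = {"FIRST": item, "REST": node}

    # Only the outermost cell carries the open list type, as in the examples.
    if items:
        node = {"type": list_type, **node}
    return node


if __name__ == "__main__":
    print(expand_cons_list(["a", "b"]))
```

Building the structure from the tail forward keeps the three ending cases (terminated, open, dotted) in one place instead of special-casing the last item.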

Diff Lists

Diff lists are regular lists under a LIST attribute, with LAST coreferenced to the end of that list, i.e., the REST value following the last item rather than the last item itself. Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).

[ ATTR <! !> ]           =>  [ ATTR *diff-list* & [ LIST #coref,
                                                    LAST #coref ] ]

[ ATTR <! a !> ]         =>  [ ATTR *diff-list* & [ LIST *list* & [ FIRST a,
                                                                    REST #coref ],
                                                    LAST #coref ] ]
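
The same expansion can be sketched in Python. As before, this is only a model under stated assumptions: dicts with a hypothetical `"type"` key stand in for feature structures, `expand_diff_list` is an invented name, and the coreference is represented by a shared string such as `"#coref"` where a real implementation would use a fresh, unique tag.

```python
def expand_diff_list(items, coref="#coref",
                     diff_list_type="*diff-list*", list_type="*list*"):
    """Expand the <! ... !> shorthand into LIST/LAST structures.

    The final REST of the inner list and the LAST feature share the
    same coreference, which is what makes appending possible.
    """
    node = coref
    for item in reversed(items):
        node = {"FIRST": item, "REST": node}
    # A non-empty inner list is typed as an open list, as in the example.
    if items:
        node = {"type": list_type, **node}
    return {"type": diff_list_type, "LIST": node, "LAST": coref}


if __name__ == "__main__":
    print(expand_diff_list(["a"]))
```

For an empty diff list, LIST and LAST are simply the same coreference, matching the first expansion shown above.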

Type documentation

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:

n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].

Case sensitivity

Case Sensitive

  • Things inside quotes (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing when they are predicate symbols, but not when they are CARGs)

Case Insensitive

  • Everything in TDL not inside of quotes.
  • Lexicon look-up.
    • Proper names?
    • Acronyms?
  • Proper names and acronyms could be approached with token mapping (preserve the case information, then downcase anyway).

Unknown

  • Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)

Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel harmony languages (linguistic representations, not orthography).

Notes for implementation

DocStrings

Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should either be concatenated with a newline to the previous docstring(s) or appended to a list of docstrings associated with the type.
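
One possible reading of this policy is sketched below in Python. The function name `combine_docstrings` is hypothetical, and the sketch makes two of the permitted choices: it discards secondary docstrings on the definition, and it concatenates addendum docstrings with newlines instead of keeping a list.

```python
def combine_docstrings(definition_docs, addendum_docs=()):
    """Return the docstring text for a type, or None if there is none.

    definition_docs: docstrings found on the type definition, in order
    addendum_docs:   docstrings found on later type addenda, in order
    """
    parts = []
    if definition_docs:
        # Only the first docstring on the definition is primary; this
        # sketch chooses to discard any others.
        parts.append(definition_docs[0])
    parts.extend(addendum_docs)
    return "\n".join(parts) if parts else None


if __name__ == "__main__":
    print(combine_docstrings(["Intransitive count noun."],
                             ["Added in an addendum."]))
```

An implementation that prefers to keep every docstring could return `parts` as a list instead of joining it.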

Comments

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.

Open Questions

1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see this thread on the 'developers' mailing list)

2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?

3. Can we use 'status' to identify roots and labels (parsenodes)? Something like

;;
;; parse-tree labels (instances)
;;

:begin :instance :status label.
:include "parse-nodes".
:end :instance.

;;
;; start symbols of the grammar (instances)
;;

:begin :instance :status root.
:include "roots".
:include "educ/roots-educ".
:end :instance.

Discussions

TdlRfc (last edited 2020-06-05 06:38:36 by FrancisBond)
