Type Description Language and other aspects of DELPH-IN Joint Reference Formalism

== TDL File Syntax ==

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

{{{#!highlight ruby
# File Contents
#
# Note: The LKB does not parse environments (:begin ... :end), nor does it
#       support :include statements, so the following is only applicable for
#       PET, ACE, and perhaps agree.

TdlFile      := ( Environment | Statement | Spacing )* EOF
Environment  := BEGIN TYPE DOT TypeEnv END TYPE DOT
              | BEGIN INSTANCE Status? DOT InstanceEnv END INSTANCE DOT
TypeEnv      := ( Environment | Statement | TypeDef | TypeAddendum | Spacing )*
InstanceEnv  := ( Environment | Statement | InstanceDef | LetterSet | WildCard | Spacing )*
Status       := STATUS ( "generic-lex-entry"
                       | "lex-entry"
                       | "lexical-filtering-rule"
                       | "lex-rule"
                       | "post-generation-mapping-rule"
                       | "rule"
                       | "token-mapping-rule" ) Spacing

# Note: The LKB has several Lisp functions which open files in specified
#       environments, so the following are parsing targets for those
#       functions.

TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile  := ( InstanceDef | LetterSet | WildCard | Spacing )* EOF

# Note: Krieger & Schaeffer 1994 define a large number of statements, but
#       DELPH-IN grammars appear to only use :include.
# Note: :include's string argument is a path relative to the current file's
#       directory. If the filename extension is not given, the default ".tdl"
#       extension is used. The file is opened in the same environment as the
#       :include statements (e.g., :include in a type environment opens the
#       file and parses it as TypeEnv)

Statement    := Include
Include      := INCLUDE Filename DOT
Filename     := DQString

# Types and Instances
#
# Note: Instances may be syntactically identical to type definitions, but they
#       do not affect the type hierarchy.
#       They may also be lexical rule
#       definitions that include an affixing pattern to a definition.

TypeDef      := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName     := Identifier Spacing

InstanceDef  := TypeDef | LexRuleDef
LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId    := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: The characters disallowed in Identifiers are chosen to avoid ambiguity
#       with other parts of the TDL syntax.

Identifier   := /[^\s!"#$%&'(),.\/:;<=>[\]^|]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#       (1) """DocStrings""" may precede any top-level Term or the final DOT
#       (2) TypeDef and LexRuleDef require at least one TypeName
#       (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody  := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#       (1) TypedConj has an obligatory TypeName term
#       (2) FeatureConj excludes type terms (including strings, etc.)
#       (3) DocConj is a regular conjunction with optional DocStrings
#
#       Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
#       for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj      := DocString? Term ( AND DocString?
                Term )*

# Note: The DocString pattern may span multiple lines

DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction  := Term ( AND Term )*
Term         := TypeTerm | FeatureTerm | Coreference
TypeTerm     := TypeName | DQString | Regex
FeatureTerm  := Avm | DiffList | ConsList
DQString     := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
Regex        := "^" /([^$\\]|\\.)*/ "$" Spacing
Avm          := AVMOPEN AttrVals? AVMCLOSE
AttrVals     := AttrVal ( COMMA AttrVal )*
AttrVal      := AttrPath SPACE Conjunction
AttrPath     := Attribute ( DOT Attribute )*
Attribute    := Identifier Spacing
DiffList     := DLOPEN Conjunctions? DLCLOSE
ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*
Coreference  := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef  := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar  := /\?[^ ]/
Characters   := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix        := AffixClass AffixPattern+ Spacing
AffixClass   := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch   := NullChar | CharList
AffixSub     := CharList
NullChar     := "*"
CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar    := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines.
#       Also, while block
#       comments in Lisp may be nested (`#| outer #| inner |# outer |#`),
#       support for nested comments in TDL is mixed (ACE supports it, the
#       LKB does not), so this definition does not nest.

Spacing      := SPACE? Comment*
SPACE        := /\s+/
Comment      := ( LineComment | BlockComment ) SPACE?
LineComment  := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

BEGIN        := ":begin" Spacing
TYPE         := ":type" Spacing
INSTANCE     := ":instance" Spacing
STATUS       := ":status" Spacing
INCLUDE      := ":include" Spacing
END          := ":end" Spacing
DEFOP        := ":=" Spacing
ADDOP        := ":+" Spacing
DOT          := "." Spacing
AND          := "&" Spacing
COMMA        := "," Spacing
AVMOPEN      := "[" Spacing
AVMCLOSE     := "]" Spacing
DLOPEN       := "<!" Spacing
DLCLOSE      := "!>" Spacing
CLOPEN       := "<" Spacing
CLCLOSE      := ">" Spacing
ELLIPSIS     := "..." Spacing
EOF          := ""   # end-of-file
}}}

== Deprecated TDL Features ==

The following are deprecated features of DELPH-IN TDL. They are no longer considered part of the format, but implementers of TDL parsers may want to include them for backward compatibility. If so, they are encouraged to print warnings upon encountering the deprecated forms so grammar developers know to change them.

=== Subtyping Operator (:<) ===

The `:<` operator was originally used only for declaring a type's position in the type hierarchy (i.e., features could not be specified, unlike with `:=`), but eventually this constraint was relaxed and it became equivalent to `:=`. As of Autumn 2018, the form has been removed and is no longer considered part of DELPH-IN TDL, but the change to TDL syntax to support the operator is minimal:

{{{#!highlight ruby
DEFOP        := ( ":=" | ":<" ) Spacing
}}}

=== Single-quoted Symbols ('symbol) ===

Double-quoted strings and identifiers are both type names, but there used to be Lisp-like single-quoted symbols as well.
These still exist in some grammars, such as those using an old version of [[MatrixTop|matrix.tdl]], which has the following:

{{{
implicit-coord-rel := coordination-relation &
  [ PRED 'implicit_coord_rel ].

null-coord-rel := coordination-relation &
  [ PRED 'null_coord_rel ].
}}}

There is no difference between using quoted symbols and regular strings or identifiers (although identifiers would need to be defined as types somewhere), so recent versions of `matrix.tdl` have this instead:

{{{
implicit-coord-rel := coordination-relation &
  [ PRED "implicit_coord_rel" ].

null-coord-rel := coordination-relation &
  [ PRED "null_coord_rel" ].
}}}

The change to the syntax to support quoted symbols is as follows:

{{{#!highlight ruby
TypeTerm     := TypeName | DQString | Regex | QSymbol
QSymbol      := "'" Identifier Spacing
}}}

== TDL File Interpretation and Conventions ==

=== Layout of a type definition ===

Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:

{{{
identifier := (definition body) .
}}}

The definition body is just a conjunction of terms, maybe with documentation strings, and there is much valid variation in how those terms are arranged. Nevertheless, there are conventional locations for these terms depending on what kind of term they are. For instance, the supertypes are generally listed first, followed by an AVM:

{{{
head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].
}}}

If a documentation string is specified, the conventional place is before the AVM:

{{{
n_-_ad-pl_le := norm_np_adv_lexent &
"""
N, can modify, locative (place)
B lives overseas.
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].
}}}

Or if there is no AVM, before the final dot:

{{{
info-str := icons
"""Type for underspecified or "neutral" information structure.""".
}}}

=== Types versus instances ===

=== Specifying the text encoding ===

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:

{{{#!highlight scheme
; -*- coding: utf-8 -*-
}}}

In some TDL files, attributes specific to the [[https://www.gnu.org/software/emacs/|Emacs]] text editor may be included:

{{{#!highlight scheme
;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
}}}

=== Feature interpretation of lists ===

The `< ... >` and `<! ... !>` shorthand for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs. The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature `FIRST` while the rest of the list (the tail) is defined recursively under the feature `REST` (e.g., `REST.REST.FIRST` is the third element). The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.

|| '''entity''' || '''example''' || '''LKB config''' || '''ACE config''' ||
|| cons-list type || `*cons*` || (not configurable) || `cons-type` ||
|| open list type || `*list*` || `*list-type*` || `list-type` ||
|| closed list type || `*null*` || `*empty-list-type*` || `null-type` ||
|| diff-list type || `*diff-list*` || `*diff-list-type*` || `diff-list-type` ||
|| list head feature || `FIRST` || `*list-head*` || (not configurable) ||
|| list tail feature || `REST` || `*list-tail*` || (not configurable) ||
|| diff-list list feature || `LIST` || `*diff-list-list*` || (not configurable) ||
|| diff-list last feature || `LAST` || `*diff-list-last*` || (not configurable) ||

For the examples below, I use the values defined in the above table, which are taken from the ERG.

==== Cons Lists ====

Regular cons lists may be open (extendable) or closed (fixed-length). The type of an open list as interpreted by, e.g., `< ...
>`, is `*list*` (rather, the defined open list type), but in hand-written TDL a subtype of `*list*` is often used, such as `*cons*`.

{{{#!highlight scheme
; an empty list is terminated (always empty)
[ ATTR < > ] => [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ] => [ ATTR *list* & [ FIRST a, REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ] => [ ATTR *list* & [ FIRST a, REST [ FIRST b, REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ] => [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ] => [ ATTR *list* & [ FIRST a, REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ] => [ ATTR *list* & [ FIRST a, REST #coref ] ]
}}}

==== Diff Lists ====

Diff lists are regular lists under a `LIST` attribute, and `LAST` points to the end of that list, i.e., it is coreferenced with the final `REST` rather than with the last item. Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see [[GeFaqDiffList]]).

{{{#!highlight scheme
[ ATTR <! !> ] => [ ATTR *diff-list* & [ LIST #coref, LAST #coref ] ]
[ ATTR <! a !> ] => [ ATTR *diff-list* & [ LIST *list* & [ FIRST a, REST #coref ], LAST #coref ] ]
}}}

=== Type documentation ===

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (`.`) character:

{{{
n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
The dog barked.
Much dog bark.""".
}}}

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

{{{#!highlight cl
;
; 承名詞目的格助詞ト
; case-p-lex-np-woを参照。このtypeは助詞「と」。
; 部長 と 会う
; ゆっくり と 進む
;
;
case-p-lex-np-to := case-p-lex-np &
  [SYNSEM.LOCAL.CAT.HEAD.CASE to].
}}}

=== Case sensitivity ===

==== Case Sensitive ====

 * Things inside quotes (NB: strings passed from TFS world into MRS can be treated as case insensitive in MRS processing (i.e., as predicate symbols, but not `CARG`s))

==== Case Insensitive ====

 * Everything in TDL not inside of quotes.
 * Lexicon look-up.
 * Proper names?
 * Acronyms? ... approach these with token-mapping (preserve the info, and then downcase anyway)

==== Unknown ====

 * Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)

Notes: Arguments for case insensitive include shouting (all caps); arguments for case sensitive include the use of upper case vowels in vowel harmony languages (linguistic representations, not orthography)

== Notes for implementation ==

=== DocStrings ===

Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type-addenda should be concatenated with a newline to the previous docstring(s), or appended to a list of docstrings, associated with the type.

=== Comments ===

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., `[ SYNSEM #| comment |# . #| comment |# LOCAL ... ]`), although grammar developers may want to use this flexibility sparingly.

== Open Questions ==

 1. The `^` character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see [[http://lists.delph-in.net/archives/developers/2009/thread.html#1082|this thread]] on the 'developers' mailing list)
 2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?
 3.
    Can we use 'status' to identify roots and labels (parsenodes)? Something like

{{{
;;
;; parse-tree labels (instances)
;;
:begin :instance :status label.
:include "parse-nodes".
:end :instance.

;;
;; start symbols of the grammar (instances)
;;
:begin :instance :status root.
:include "roots".
:include "educ/roots-educ".
:end :instance.
}}}

== Discussions ==

 * ParisDefeasibleConstraints
 * StanfordDefaults
 * [[http://www.delph-in.net/2017/append.pdf|(Diff)List Appends in TDL]]
 * [[http://lists.delph-in.net/archives/developers/2006/000419.html|Mailing list discussion about docstrings (Feb 2006)]]
 * [[http://lists.delph-in.net/archives/developers/2006/000550.html|Mailing list discussion about type addenda (Jul 2006)]]
 * [[http://lists.delph-in.net/archives/developers/2007/000762.html|Mailing list discussion about docstrings (Mar 2007)]]
 * [[http://lists.delph-in.net/archives/developers/2007/000868.html|Mailing list discussion about docstrings (Sep 2007)]]
 * [[http://lists.delph-in.net/archives/developers/2008/001037.html|Mailing list discussion about the :+ and :< operators (Nov 2008)]]
 * [[http://lists.delph-in.net/archives/developers/2009/001082.html|Mailing list discussion about regular expressions in TDL (Jan 2009)]]
 * [[http://lists.delph-in.net/archives/developers/2018/002754.html|Mailing list discussion about TDL syntax (Jul 2018)]]
 * [[http://lists.delph-in.net/archives/developers/2018/002792.html|Mailing list discussion about docstrings (Aug 2018)]]
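
== Illustrative Sketch: Expanding List Shorthand ==

The encoding described under "Feature interpretation of lists" can be made concrete with a small sketch. The Python below is not taken from any DELPH-IN processor: the function names `expand_cons` and `expand_diff` are hypothetical, and the type and feature names are the ERG values from the table in that section, which a grammar may configure differently. It assumes the list items are already available as strings and only builds the expanded TDL text; it does no parsing.

{{{#!highlight python
# Assumed names (ERG defaults from the table above; configurable per grammar).
LIST = "*list*"
NULL = "*null*"
DIFF_LIST = "*diff-list*"


def expand_cons(items, open_ended=False, tail=None):
    """Expand the cons-list shorthand < ... > into FIRST/REST text.

    items      -- list of term strings, e.g. ["a", "b"]
    open_ended -- True for a trailing "..." (final REST stays the open list type)
    tail       -- term given after the "." delimiter, e.g. "#coref"
    """
    if tail is not None:
        rest = tail          # < a . #coref >
    elif open_ended:
        rest = LIST          # < a, ... >
    else:
        rest = NULL          # < a >
    # Wrap from the last item outward so that each item lands on
    # (REST.)*FIRST in order.
    for item in reversed(items):
        rest = f"[ FIRST {item}, REST {rest} ]"
    return f"{LIST} & {rest}" if items else rest


def expand_diff(items, coref="#coref"):
    """Expand the diff-list shorthand into LIST/LAST text.

    LAST is coreferenced with the final REST of the list under LIST.
    The coreference name is arbitrary; "#coref" is used for illustration.
    """
    inner = coref
    for item in reversed(items):
        inner = f"[ FIRST {item}, REST {inner} ]"
    lst = f"{LIST} & {inner}" if items else inner
    return f"{DIFF_LIST} & [ LIST {lst}, LAST {coref} ]"
}}}

For example, `expand_cons(["a"], open_ended=True)` yields `*list* & [ FIRST a, REST *list* ]`, and `expand_diff(["a"])` yields `*diff-list* & [ LIST *list* & [ FIRST a, REST #coref ], LAST #coref ]`, matching the expansions listed under "Cons Lists" and "Diff Lists". A real implementation would build feature structures rather than strings and would generate a fresh coreference for each diff-list.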