4802
Comment: Allow spacing between Attribute and Dot; Only allow Null "*" in affixes on LHS
|
11366
Updated the syntax description, restructured the wiki
|
Line 2: | Line 2: |
Line 4: | Line 5: |
== Case Sensitivity == === Case Sensitive === |
<<TableOfContents(3)>> == TDL File Syntax == Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description. {{{#!highlight ruby # File Contents TdlTypeFile := ( TypeDef | TypeAddendum | Spacing )* EOF TdlRuleFile := ( LexRuleDef | LetterSet | WildCard | Spacing )* EOF # Types and Lexical Rules TypeDef := TypeName DEFOP TypeDefBody DOT TypeAddendum := TypeName ADDOP AddendumBody DOT TypeName := Identifier Spacing LexRuleDef := LexRuleId DEFOP Affix? TypeDefBody DOT LexRuleId := Identifier Spacing # Identifiers are used in several patterns # # Note: For some processors, like the LKB, there may be "break characters" # defined which determine what is allowed within an identifier. Identifier := /[^\s.:<=&,#[]$()>!^\/]+/ # Definition Bodies (top-level conjunctions of terms) # # Note: Definition bodies are most simply Conjunctions, but several # variations require special productions: # # (1) """DocStrings""" may precede any top-level Term or the final DOT # (2) TypeDef and LexRuleDef require at least one TypeName # (3) TypeAddendum may use a DocString in place of a Conjunction # TypeDefBody := TypedConj DocString? AddendumBody := DocConj DocString? | DocString # Note: To accommodate TypeDefBody and AddendumBody, three special # conjunctions are added: # # (1) TypedConj has an obligatory TypeName term # (2) FeatureConj excludes type terms (including strings, etc.) # (3) DocConj is a regular conjunction with optional DocStrings # # Note that FeatureConj is only necessary to reduce ambiguity (e.g., # for LALR parsing); if ambiguity is allowed, DocConj may be used. TypedConj := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )? FeatureConj := DocString? FeatureTerm ( AND DocString? FeatureTerm )* DocConj := DocString? Term ( AND DocString? 
Term ) # Note: The DocString pattern may span multiple lines DocString := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing # Terms and Conjunctions Conjunction := Term ( AND Term )* Term := TypeTerm | FeatureTerm | Coreference TypeTerm := TypeName | DQString | QSymbol | Regex FeatureTerm := Avm | DiffList | ConsList DQString := /"([^"\\]|\\.)*"/ Spacing QSymbol := "'" Identifier Spacing Regex := "^" /([^$\\]|\\.)*/ "$" Avm := AVMOPEN AttrVals? AVMCLOSE AttrVals := AttrVal ( COMMA AttrVal )* AttrVal := AttrPath SPACE Conjunction AttrPath := Attribute ( DOT Attribute )* Attribute := Identifier Spacing DiffList := DLOPEN Conjunctions? DLCLOSE ConsList := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE ConsEnd := COMMA ELLIPSIS | DOT Conjunction Conjunctions := Conjunction ( COMMA Conjunction )* Coreference := "#" Identifier Spacing # Letter-sets, Wild-cards, and Affixes # # Note: spacing is sensitive within these patterns, so many non-content # terminals are used directly with an explicit SPACE instead of in # a production with Spacing. LetterSet := "%(letter-set" SPACE? LetterSetDef SPACE? ")" WildCard := "%(wild-card" SPACE? WildCardDef SPACE? ")" LetterSetDef := "(" LetterSetVar SPACE Characters ")" WildCardDef := "(" WildCardVar SPACE Characters ")" LetterSetVar := /![^ ]/ WildCardVar := /\?[^ ]/ Characters := /([^)\\]|\\.)+/ # Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar # in the AffixSub copies the matched character, in order, so there # should be the same number of LetterSetVars in both, but this is not # captured in the syntax. Affix := AffixClass AffixPattern+ Spacing AffixClass := "%prefix" | "%suffix" AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")" AffixMatch := NullChar | CharList AffixSub := CharList NullChar := "*" CharList := ( LetterSetVar | WildCardVar | AffixChar )+ AffixChar := /([^!?\s*\\]|\\[^ ])+/ # Whitespace and Comments # # Note: SPACE and BlockComment may span multiple lines Spacing := SPACE? 
Comment* SPACE := /\s+/ Comment := ( LineComment | BlockComment ) SPACE? LineComment := /;.*$/ BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#" # Non-content Terminals DEFOP := ":=" Spacing ADDOP := ":+" Spacing DOT := "." Spacing AND := "&" Spacing COMMA := "," Spacing AVMOPEN := "[" Spacing AVMCLOSE := "]" Spacing DLOPEN := "<!" Spacing DLCLOSE := "!>" Spacing CLOPEN := "<" Spacing CLCLOSE := ">" Spacing ELLIPSIS := "..." Spacing EOF := "" # end-of-file }}} == TDL File Interpretation and Conventions == === Layout of a type definition === === Types versus instances === === Specifying the text encoding === The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8: {{{#!highlight scheme ; -*- coding: utf-8 -*- }}} In some TDL files, attributes specific to the [[https://www.gnu.org/software/emacs/|Emacs]] text editor may be included: {{{#!highlight scheme ;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*- }}} === Feature interpretation of lists === The `< ... >` and `<! ... !>` shorthand for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs. Regular cons lists may be terminated (fixed-length) or unterminated (expandable). {{{#!highlight scheme ; an empty list is terminated (always empty) [ ATTR < > ] => [ ATTR *null* ] ; single item goes on FIRST attribute and REST is terminated [ ATTR < a > ] => [ ATTR [ FIRST a, REST *null* ] ] ; items after the first go on (REST.)+FIRST [ ATTR < a, b > ] => [ ATTR [ FIRST a, REST [ FIRST b, REST *null* ] ] ] ; an empty list with ... is not terminated [ ATTR < ... > ] => [ ATTR *list* ] ; this also works with items on the list [ ATTR < a, ... > ] => [ ATTR [ FIRST a, REST *list* ] ] ; the . delimiter allows a non-*list*, non-*null* value for the last REST [ ATTR < a . 
#coref > ] => [ ATTR [ FIRST a, REST #coref ] ] }}} Diff lists are regular lists under a `LIST` attribute, and `LAST` points to the last item. Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see [[GeFaqDiffList]]). {{{#!highlight scheme [ ATTR <! !> ] => [ ATTR [ LIST #coref, LAST #coref ] ] [ ATTR <! a !> ] => [ ATTR [ LIST [ FIRST a, REST #coref & *null* ], LAST #coref ] ] }}} === Type documentation === TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (`.`) character: {{{ n_-_c_le := n_intr_lex_entry """Intransitive count noun (icn) <ex>The dog barked. <nex>Much dog bark.""". }}} Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type): {{{#!highlight cl ; <type val="case-p-lex-np-to"> ; <name-ja>承名詞目的格助詞ト ; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。 ; <ex>部長 と 会う ; <nex>ゆっくり と 進む ; <todo> ; </type> case-p-lex-np-to := case-p-lex-np & [SYNSEM.LOCAL.CAT.HEAD.CASE to]. }}} === Case sensitivity === ==== Case Sensitive ==== |
Line 10: | Line 238: |
=== Case Insensitive === | ==== Case Insensitive ==== |
Line 19: | Line 247: |
=== Unknown === | ==== Unknown ==== |
Line 25: | Line 253: |
== Doc Strings == TDL types allow a doc string: {{{ n_-_c_le := n_intr_lex_entry & "Intransitive count noun (icn) <ex>The dog barked. <nex>Much dog bark.". }}} == TDL File Syntax == {{{#!highlight ruby # File Contents TdlTypeFile := ( TypeDef | Spacing )* EOF TdlRuleFile := ( LexRuleDef | MorphSet | Spacing )* EOF # Types and Lexical Rules TypeDef := Type ( AvmDef | AvmAddendum ) AvmDef := DefOp DefBody AvmAddendum := AddOp ( DefBody | DocString? Conjunction | DocString ) LexRuleDef := Type DefOp Affix? DefBody DefBody := Supertypes ( And DocString? Conjunction | DocString? ) Supertypes := Type ( And Type )* Type := Identifier Spacing DocString := DQString Conjunction := Term ( And Term )* Term := ( Type | FeatureTerm | DiffList | ConsList | Coreference | DQString | QSymbol | Regex ) FeatureTerm := LBrack AttrVals? RBrack AttrVals := AttrVal ( Comma AttrVal )* AttrVal := Attribute ( Dot Attribute )* Conjunction Attribute := Identifier Spacing DiffList := DLOpen Conjunctions? DLClose ConsList := CLOpen ( Conjunctions ConsEnd? )? CLClose ConsEnd := Comma Ellipsis | Dot Conjunction Conjunctions := Conjunction ( Comma Conjunction )* Coreference := "#" Identifier Spacing # Letter-sets, Wild-cards, and Affixes MorphSet := "%" "(" ( LetterSetDef | WildCardDef ) ")" LetterSetDef := "letter-set" Space? "(" LetterSetVar Space LetterSet ")" WildCardDef := "wild-card" Space? "(" WildCardVar Space LetterSet ")" LetterSetVar := /![^ ]/ WildCardVar := /\?[^ ]/ LetterSet := /([^)\\]|\\.)+/ Affix := AffixClass AffixPattern+ Spacing AffixClass := "%prefix" | "%suffix" AffixPattern := Space? "(" ( NullChar | CharList ) Space CharList ")" CharList := ( LetterSetVar | WildCardVar | AffixChar )+ NullChar := "*" AffixChar := /([^!?\s*\\]|\\[^ ])+/ # Whitespace and Comments Spacing := Space? Comment* Space := /\s+/ Comment := ( LineComment | BlockComment ) Space? 
LineComment := /;.*$/ BlockComment := "#|" /([^|\\]|\\.|\|[^#])*/ "|#" # Literals DefOp := ":=" Spacing AddOp := ":+" Spacing Identifier := /[^\s.:<=&,#[]$()>!^\/]+/ Dot := "." Spacing And := "&" Spacing Comma := "," Spacing LBrack := "[" Spacing RBrack := "]" Spacing DLOpen := "<!" Spacing DLClose := "!>" Spacing CLOpen := "<" Spacing CLClose := ">" Spacing Ellipsis := "..." Spacing DQString := /"([^"\\]|\\.)*"/ Spacing QSymbol := "'" Identifier Spacing Regex := "^" /([^$\\]|\\.)*/ "$" }}} == Questions == 1. The `^` character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? |
== Notes for implementation == === DocStrings === Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other doc strings as they see fit. Docstrings on type-addenda should be concatenated with a newline to the previous docstring(s), or appended to a list of docstrings, associated with the type. === Comments === The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., `[ SYNSEM #| comment |# . #| comment |# LOCAL ... ]`), although grammar developers may want to use this flexibility sparingly. == Open Questions == 1. The `^` character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see [[http://lists.delph-in.net/archives/developers/2009/thread.html#1082|this thread]] on the 'developers' mailing list) |
Line 124: | Line 269: |
3. When supertypes are required (e.g., on a TypeDef), must they appear before other Terms in the Conjunction? 4. Should the (deprecated or repurposed) subtype operator (`:<`) be included in the syntax description? 5. Is variation allowed with regards to the position of docstrings? 6. Are spaces allowed inside a feature path? Comments? {{{ type := supertype & [ ATTR1 . ; comment here? ATTR2 value ]; }}} For that matter, are comments allowed anywhere that whitespace is (except maybe letter-sets and lex-rule affix patterns)? |
|
Line 146: | Line 276: |
* [[http://lists.delph-in.net/archives/developers/2006/000419.html|Mailing list discussion about docstrings (Feb 2006)]] * [[http://lists.delph-in.net/archives/developers/2006/000550.html|Mailing list discussion about type addenda (Jul 2006)]] * [[http://lists.delph-in.net/archives/developers/2007/000762.html|Mailing list discussion about docstrings (Mar 2007)]] * [[http://lists.delph-in.net/archives/developers/2007/000868.html|Mailing list discussion about docstrings (Sep 2007)]] * [[http://lists.delph-in.net/archives/developers/2008/001037.html|Mailing list discussion about the :+ and :< operators (Nov 2008)]] * [[http://lists.delph-in.net/archives/developers/2009/001082.html|Mailing list discussion about regular expressions in TDL (Jan 2009)]] * [[http://lists.delph-in.net/archives/developers/2018/002754.html|Mailing list discussion about TDL syntax (Jul 2018)]] * [[http://lists.delph-in.net/archives/developers/2018/002792.html|Mailing list discussion about docstrings (Aug 2018)]] |
Type Description Language and other aspects of DELPH-IN Joint Reference Formalism
TDL File Syntax
Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.
{{{#!highlight ruby
# File Contents

TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile  := ( LexRuleDef | LetterSet | WildCard | Spacing )* EOF

# Types and Lexical Rules

TypeDef      := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName     := Identifier Spacing

LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId    := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: For some processors, like the LKB, there may be "break characters"
#       defined which determine what is allowed within an identifier.

Identifier   := /[^\s.:<=&,#\[\]$()>!^\/]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#   (1) """DocStrings""" may precede any top-level Term or the final DOT
#   (2) TypeDef and LexRuleDef require at least one TypeName
#   (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody  := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#   (1) TypedConj has an obligatory TypeName term
#   (2) FeatureConj excludes type terms (including strings, etc.)
#   (3) DocConj is a regular conjunction with optional DocStrings
#
# Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
# for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj      := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction  := Term ( AND Term )*
Term         := TypeTerm | FeatureTerm | Coreference
TypeTerm     := TypeName
              | DQString
              | QSymbol
              | Regex
FeatureTerm  := Avm
              | DiffList
              | ConsList

DQString     := /"([^"\\]|\\.)*"/ Spacing
QSymbol      := "'" Identifier Spacing
Regex        := "^" /([^$\\]|\\.)*/ "$"

Avm          := AVMOPEN AttrVals? AVMCLOSE
AttrVals     := AttrVal ( COMMA AttrVal )*
AttrVal      := AttrPath SPACE Conjunction
AttrPath     := Attribute ( DOT Attribute )*
Attribute    := Identifier Spacing

DiffList     := DLOPEN Conjunctions? DLCLOSE
ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference  := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef  := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar  := /\?[^ ]/
Characters   := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix        := AffixClass AffixPattern+ Spacing
AffixClass   := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch   := NullChar | CharList
AffixSub     := CharList
NullChar     := "*"
CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar    := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines

Spacing      := SPACE? Comment*
SPACE        := /\s+/
Comment      := ( LineComment | BlockComment ) SPACE?
LineComment  := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

DEFOP        := ":=" Spacing
ADDOP        := ":+" Spacing
DOT          := "." Spacing
AND          := "&" Spacing
COMMA        := "," Spacing
AVMOPEN      := "[" Spacing
AVMCLOSE     := "]" Spacing
DLOPEN       := "<!" Spacing
DLCLOSE      := "!>" Spacing
CLOPEN       := "<" Spacing
CLCLOSE      := ">" Spacing
ELLIPSIS     := "..." Spacing
EOF          := ""  # end-of-file
}}}
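As a rough illustration of how the content terminals above might be tried in a processor, the following Python sketch transcribes a few of the patterns into `re` objects. The names `TERMINALS` and `match_terminal` are hypothetical and not part of any DELPH-IN tool, and the square brackets in the Identifier character class are escaped here on the assumption that they are meant literally.

```python
import re

# Patterns transcribed from the syntax description (illustrative names).
DOCSTRING = re.compile(r'"""(?:[^"\\]|\\.|"(?!")|""(?!"))*"""', re.DOTALL)
DQSTRING = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)
BLOCK_COMMENT = re.compile(r'#\|(?:[^|\\]|\\.|\|(?!#))*\|#', re.DOTALL)
LINE_COMMENT = re.compile(r';.*$', re.MULTILINE)
# Assumption: "[" and "]" are literal members of the excluded set.
IDENTIFIER = re.compile(r'[^\s.:<=&,#\[\]$()>!^/]+')

# DocString is tried before DQString; otherwise the opening "" of a
# docstring would be consumed as an empty double-quoted string.
TERMINALS = [
    ("DocString", DOCSTRING),
    ("DQString", DQSTRING),
    ("BlockComment", BLOCK_COMMENT),
    ("LineComment", LINE_COMMENT),
    ("Identifier", IDENTIFIER),
]


def match_terminal(text, pos=0):
    """Return (name, matched_text) for the first terminal matching at pos."""
    for name, pattern in TERMINALS:
        m = pattern.match(text, pos)
        if m:
            return name, m.group()
    return None
```

The ordering of `TERMINALS` is a design choice of this sketch, not something the syntax description prescribes.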
TDL File Interpretation and Conventions
Layout of a type definition
Types versus instances
Specifying the text encoding
The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:
{{{#!highlight scheme
; -*- coding: utf-8 -*-
}}}
In some TDL files, attributes specific to the Emacs text editor may be included:
{{{#!highlight scheme
;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
}}}
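A tool reading TDL files could look for such a comment before decoding the rest of the file. The sketch below is one possible approach in Python; the function name `detect_tdl_encoding` and the UTF-8 fallback are assumptions made for illustration, not behavior documented for any processor.

```python
import re

# Matches "coding: NAME" in either spelling shown above (case-insensitive).
CODING_RE = re.compile(r"coding:\s*([-\w.]+)", re.IGNORECASE)


def detect_tdl_encoding(first_line, default="utf-8"):
    """Return the encoding named in a first-line comment, else the default.

    The default of "utf-8" is an assumption of this sketch.
    """
    stripped = first_line.lstrip()
    if not stripped.startswith(";"):
        return default  # the first line is not a comment
    m = CODING_RE.search(stripped)
    return m.group(1).lower() if m else default
```

In practice the first line would be read in binary mode and decoded as ASCII (or Latin-1) before being passed to such a function.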
Feature interpretation of lists
The < ... > and <! ... !> notations are shorthand for lists ("cons lists") and diff-lists, respectively, and correspond to normal attribute-value pairs. Regular cons lists may be terminated (fixed-length) or unterminated (expandable).
{{{#!highlight scheme
; an empty list is terminated (always empty)
[ ATTR < > ]             =>  [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]           =>  [ ATTR [ FIRST a,
                                      REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]        =>  [ ATTR [ FIRST a,
                                      REST [ FIRST b,
                                             REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]         =>  [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]      =>  [ ATTR [ FIRST a,
                                      REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ]  =>  [ ATTR [ FIRST a,
                                      REST #coref ] ]
}}}
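The cons-list expansion shown above can be sketched as a small Python function that builds nested FIRST/REST structures. The function name `expand_cons_list`, its parameters, and the use of plain dictionaries and strings for feature structures are assumptions made for illustration only.

```python
def expand_cons_list(items, open_ended=False, tail=None):
    """Expand the < ... > shorthand into nested FIRST/REST dictionaries.

    items      -- the list members, in order
    open_ended -- True when the list ends with "..." (unterminated)
    tail       -- the value after the "." delimiter, e.g. "#coref"
    """
    if tail is not None:
        if not items:
            raise ValueError("the '.' delimiter requires at least one item")
        end = tail
    elif open_ended:
        end = "*list*"   # unterminated: the rest may be any list
    else:
        end = "*null*"   # terminated: the rest is the empty list
    result = end
    # Build from the innermost REST outward.
    for item in reversed(items):
        result = {"FIRST": item, "REST": result}
    return result
```

For example, `expand_cons_list(["a", "b"])` yields a structure with `a` on FIRST and `b` on REST.FIRST, mirroring the third case above.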
Diff lists are regular lists under a LIST attribute, with a LAST attribute that is coreferenced with the end of the list (the REST value following the final item). Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).
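The diff-list shorthand can be sketched in the same style. This Python example assumes that LAST shares a coreference tag with the REST value after the final item (or with LIST itself when the list is empty); the function name `expand_diff_list` and the `tag` parameter are hypothetical.

```python
def expand_diff_list(items, tag="#coref"):
    """Expand the <! ... !> shorthand into a LIST/LAST structure.

    Assumption: the open end of LIST and the value of LAST are the same
    coreference, represented here by a shared tag string.
    """
    rest = tag
    # Build the FIRST/REST chain from the innermost REST outward.
    for item in reversed(items):
        rest = {"FIRST": item, "REST": rest}
    return {"LIST": rest, "LAST": tag}
```

Because the end of the list is a coreference rather than a fixed terminator, another list can later be unified onto it, which is what makes appending possible.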
Type documentation
TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:
{{{
n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
}}}
Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):
{{{#!highlight cl
; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
  [SYNSEM.LOCAL.CAT.HEAD.CASE to].
}}}
Case sensitivity
Case Sensitive
- Things inside quotes. (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing, i.e., as predicate symbols, but not CARGs.)
Case Insensitive
- Everything in TDL not inside of quotes.
- Lexicon look-up.
- Proper names?
- Acronyms?
- ... these could be approached with token mapping (preserve the information, and then downcase anyway)
Unknown
- Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)
Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel-harmony languages (linguistic representations, not orthography).
Notes for implementation
DocStrings
Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring; implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should either be concatenated, separated by a newline, to the previous docstring(s), or be appended to a list of docstrings associated with the type.
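One way to follow this guidance is sketched below in Python. It keeps only the primary (first) docstring of each definition or addendum, which is one of the permitted choices rather than a requirement, and stores them in a per-type list that can be joined with newlines. The names `record_docstrings`, `combined_docstring`, and the `registry` dictionary are hypothetical.

```python
def record_docstrings(registry, type_name, docstrings):
    """Record the primary docstring of one definition or addendum.

    registry   -- dict mapping type names to lists of docstrings
    docstrings -- all docstrings found on the definition, in order;
                  this sketch discards everything after the first
    """
    if docstrings:
        registry.setdefault(type_name, []).append(docstrings[0])


def combined_docstring(registry, type_name):
    """Join the recorded docstrings for a type with newlines."""
    return "\n".join(registry.get(type_name, []))
```

A short usage example:

```python
registry = {}
record_docstrings(registry, "n_-_c_le", ["Intransitive count noun (icn)", "ignored"])
record_docstrings(registry, "n_-_c_le", ["Added by a type addendum"])
print(combined_docstring(registry, "n_-_c_le"))
```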
Comments
The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.
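To illustrate why strings need special treatment when handling comments, the following Python sketch blanks out line and block comments while leaving docstrings and double-quoted strings untouched. It is a simplification under stated assumptions: regular expressions (`^...$`), letter sets, and affix patterns are not recognized, so comment characters inside them would be mishandled. The names `TOKEN_RE` and `strip_comments` are hypothetical.

```python
import re

# Strings are matched first so that ";" or "#|" inside them is preserved.
TOKEN_RE = re.compile(
    r'(?P<doc>"""(?:[^"\\]|\\.|"(?!")|""(?!"))*""")'   # DocString
    r'|(?P<str>"(?:[^"\\]|\\.)*")'                      # DQString
    r'|(?P<block>#\|(?:[^|\\]|\\.|\|(?!#))*\|#)'        # BlockComment
    r'|(?P<line>;[^\n]*)',                              # LineComment
    re.DOTALL,
)


def strip_comments(text):
    """Replace each comment with a single space; keep strings as they are."""
    def keep_or_blank(m):
        if m.lastgroup in ("doc", "str"):
            return m.group()
        return " "
    return TOKEN_RE.sub(keep_or_blank, text)
```

Replacing a comment with a space rather than deleting it keeps the tokens on either side separated, which matters for cases like the dotted attribute path above.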
Open Questions
1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see [[http://lists.delph-in.net/archives/developers/2009/thread.html#1082|this thread]] on the 'developers' mailing list)
2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?