Type Description Language and other aspects of DELPH-IN Joint Reference Formalism
TDL File Syntax
Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.
# File Contents
#
# Note: The LKB does not parse environments (:begin ... :end), nor does it
#       support :include statements, so the following is only applicable for
#       PET, ACE, and perhaps agree.

TdlFile := ( Environment | Statement | Spacing )* EOF
Environment := BEGIN TYPE DOT TypeEnv END TYPE DOT
  | BEGIN INSTANCE Status? DOT InstanceEnv END INSTANCE DOT
TypeEnv := ( Environment | Statement
  | TypeDef | TypeAddendum | Spacing )*
InstanceEnv := ( Environment | Statement
  | InstanceDef | LetterSet | WildCard | Spacing )*
Status := STATUS ( "generic-lex-entry"
  | "lex-entry"
  | "lexical-filtering-rule"
  | "lex-rule"
  | "post-generation-mapping-rule"
  | "rule"
  | "token-mapping-rule" ) Spacing

# Note: The LKB has several Lisp functions which open files in specified
#       environments, so the following are parsing targets for those
#       functions.

TdlTypeFile := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile := ( InstanceDef | LetterSet | WildCard | Spacing )* EOF

# Note: Krieger & Schaeffer 1994 define a large number of statements, but
#       DELPH-IN grammars appear to only use :include.
# Note: :include's string argument is a path relative to the current file's
#       directory. If the filename extension is not given, the default ".tdl"
#       extension is used. The file is opened in the same environment as the
#       :include statements (e.g., :include in a type environment opens the
#       file and parses it as TypeEnv)

Statement := Include
Include := INCLUDE Filename DOT
Filename := DQString

# Types and Instances
#
# Note: Instances may be syntactically identical to type definitions, but they
#       do not affect the type hierarchy. They may also be lexical rule
#       definitions that include an affixing pattern to a definition.

TypeDef := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName := Identifier Spacing

InstanceDef := TypeDef | LexRuleDef
LexRuleDef := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: For some processors, like the LKB, there may be "break characters"
#       defined which determine what is allowed within an identifier.

Identifier := /[^\s.:;<=&,#[\]$()>!^\/|]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#   (1) """DocStrings""" may precede any top-level Term or the final DOT
#   (2) TypeDef and LexRuleDef require at least one TypeName
#   (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#   (1) TypedConj has an obligatory TypeName term
#   (2) FeatureConj excludes type terms (including strings, etc.)
#   (3) DocConj is a regular conjunction with optional DocStrings
#
# Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
# for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction := Term ( AND Term )*
Term := TypeTerm | FeatureTerm | Coreference
TypeTerm := TypeName
  | DQString
  | Regex
FeatureTerm := Avm
  | DiffList
  | ConsList

DQString := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
Regex := "^" /([^$\\]|\\.)*/ "$" Spacing

Avm := AVMOPEN AttrVals? AVMCLOSE
AttrVals := AttrVal ( COMMA AttrVal )*
AttrVal := AttrPath SPACE Conjunction
AttrPath := Attribute ( DOT Attribute )*
Attribute := Identifier Spacing

DiffList := DLOPEN Conjunctions? DLCLOSE
ConsList := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar := /\?[^ ]/
Characters := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix := AffixClass AffixPattern+ Spacing
AffixClass := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch := NullChar | CharList
AffixSub := CharList
NullChar := "*"
CharList := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines. Also, while block
#       comments in Lisp may be nested (`#| outer #| inner |# outer |#`),
#       support for nested comments in TDL is mixed (ACE supports it, the
#       LKB does not), so this definition does not nest.

Spacing := SPACE? Comment*
SPACE := /\s+/
Comment := ( LineComment | BlockComment ) SPACE?
LineComment := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

BEGIN := ":begin" Spacing
TYPE := ":type" Spacing
INSTANCE := ":instance" Spacing
STATUS := ":status" Spacing
INCLUDE := ":include" Spacing
END := ":end" Spacing
DEFOP := ":=" Spacing
ADDOP := ":+" Spacing
DOT := "." Spacing
AND := "&" Spacing
COMMA := "," Spacing
AVMOPEN := "[" Spacing
AVMCLOSE := "]" Spacing
DLOPEN := "<!" Spacing
DLCLOSE := "!>" Spacing
CLOPEN := "<" Spacing
CLCLOSE := ">" Spacing
ELLIPSIS := "..." Spacing
EOF := "" # end-of-file
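As a rough check of the content-bearing patterns above, the Identifier, DQString, and DocString productions can be transcribed into Python's re module. This is only a sketch under stated assumptions: the trailing Spacing is omitted, the names IDENTIFIER, DQSTRING, DOCSTRING, and match_prefix are invented for illustration, and individual processors may tokenize differently (e.g., via the LKB's break characters).

```python
import re

# Transcriptions of the grammar's patterns (trailing Spacing omitted).
# "[" and "]" are escaped inside the character class for Python's benefit.
IDENTIFIER = re.compile(r'[^\s.:;<=&,#\[\]$()>!^/|]+')
DQSTRING = re.compile(r'""(?!")|"(?:[^"\\]|\\.)+"')
# DOTALL lets a backslash escape a newline, since DocStrings may span lines.
DOCSTRING = re.compile(r'"""(?:[^"\\]|\\.|"(?!")|""(?!"))*"""', re.DOTALL)


def match_prefix(pattern, text):
    """Return the prefix of `text` matched by `pattern`, or None (hypothetical helper)."""
    m = pattern.match(text)
    return m.group(0) if m else None
```

For example, match_prefix(IDENTIFIER, "SYNSEM.LOCAL") stops at the dot, and match_prefix(DQSTRING, ...) rejects text that begins with a triple quote, which is what keeps DQString from swallowing the start of a DocString.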
TDL File Interpretation and Conventions
Layout of a type definition
Some parts of a type definition are mandated by TDL syntax, such as the initial identifier, the main operator, and the final dot:
identifier := (definition body) .
The definition body is a conjunction of terms, optionally with documentation strings, and the syntax permits considerable variation in how those terms are arranged. Nevertheless, there are conventional positions for terms depending on their kind. For instance, the supertypes are generally listed first, followed by an AVM:
head_only := unary_phrase & headed_phrase &
  [ HD-DTR #head & [ SYNSEM.LOCAL.CONJ cnil ],
    ARGS < #head > ].
If a documentation string is specified, the conventional place is before the AVM:
n_-_ad-pl_le := norm_np_adv_lexent &
"""
<description>N, can modify, locative (place)
<ex>B lives overseas.
<nex>
<todo>
"""
  [ SYNSEM.LOCAL [ CAT.HEAD [ MINORS.MIN place_n_rel,
                              CASE obliq ],
                   CONT.HOOK.INDEX.SORT place ] ].
Or if there is no AVM, before the final dot:
info-str := icons
  """Type for underspecified or "neutral" information structure.""".
Types versus instances
Specifying the text encoding
The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:
; -*- coding: utf-8 -*-
In some TDL files, attributes specific to the Emacs text editor may be included:
;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
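A processor could read such a declaration roughly as follows. This is a minimal sketch, not a description of how any existing processor does it: the function name detect_tdl_encoding, the regular expression, and the fallback to UTF-8 when no declaration is found are all assumptions made for illustration.

```python
import re

# Looks for "coding: NAME" between -*- markers, ignoring case so that both
# "coding:" and the Emacs-style "Coding:" are recognized.
_CODING = re.compile(r'-\*-.*?\bcoding:\s*([-\w.]+).*?-\*-', re.IGNORECASE)


def detect_tdl_encoding(first_line, default="utf-8"):
    """Return the encoding named in a first-line ';' comment, else `default`.

    The default value is an assumption; processors may choose differently.
    """
    stripped = first_line.lstrip()
    if not stripped.startswith(";"):
        return default  # not a line comment, so no declaration
    m = _CODING.search(stripped)
    return m.group(1).lower() if m else default
```

A caller would typically read the first line as bytes, decode it as ASCII with errors ignored, pass it to this function, and then reopen the file with the returned encoding.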
Feature interpretation of lists
The < ... > and <! ... !> notations are shorthand for lists ("cons lists") and diff-lists, respectively, and correspond to normal attribute-value pairs. The implementation relies on an encoding scheme where the first list item (the list's head) is at the feature FIRST while the rest of the list (the tail) is defined recursively under the feature REST (e.g., REST.REST.FIRST is the third element). The types associated with open and closed lists, and sometimes even the feature names, are configurable by the grammar.
entity | example | LKB config | ACE config |
cons-list type | *cons* | (not configurable) | cons-type |
open list type | *list* | *list-type* | list-type |
closed list type | *null* | *empty-list-type* | null-type |
diff-list type | *diff-list* | *diff-list-type* | diff-list-type |
list head feature | FIRST | *list-head* | (not configurable) |
list tail feature | REST | *list-tail* | (not configurable) |
diff-list list feature | LIST | *diff-list-list* | (not configurable) |
diff-list last feature | LAST | *diff-list-last* | (not configurable) |
For the examples below, I use the values defined in the above table, which are taken from the ERG.
Cons Lists
Regular cons lists may be open (extendable) or closed (fixed-length). The type of an open list written with the < ... > shorthand is *list* (or rather, whichever open list type the grammar defines), but in hand-written TDL a subtype of *list* is often used, such as *cons*.
; an empty list is terminated (always empty)
[ ATTR < > ]            => [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]          => [ ATTR *list* & [ FIRST a,
                                             REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]       => [ ATTR *list* & [ FIRST a,
                                             REST [ FIRST b,
                                                    REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]        => [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]     => [ ATTR *list* & [ FIRST a,
                                             REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ] => [ ATTR *list* & [ FIRST a,
                                             REST #coref ] ]
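The expansions above can be mimicked with a small function that builds nested dictionaries. This is an illustrative sketch only: the function name, the dictionary representation (a "type" key plus feature keys), and the parameter names are assumptions, and real processors build typed feature structures rather than dictionaries.

```python
def expand_cons_list(items, open_ended=False, tail=None,
                     list_type="*list*", null_type="*null*"):
    """Expand the < ... > shorthand into nested FIRST/REST dictionaries.

    items      -- the list's terms, in order
    open_ended -- True for a trailing "..." (list is not terminated)
    tail       -- value after the "." delimiter, e.g. "#coref"
    """
    if tail is not None and not items:
        # The grammar's ConsEnd only appears after at least one item.
        raise ValueError("a '.' tail requires at least one list item")
    if tail is not None:
        end = tail
    elif open_ended:
        end = list_type
    else:
        end = null_type
    if not items:
        return end
    node = end
    for item in reversed(items):  # build from the innermost REST outward
        node = {"FIRST": item, "REST": node}
    # Only the outermost node carries the list type, as in the examples.
    return {"type": list_type, **node}
```

Calling expand_cons_list(["a", "b"]) gives the same shape as the < a, b > example, and the list_type and null_type parameters stand in for the grammar-configurable type names from the table.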
Diff Lists
Diff lists are regular lists under a LIST attribute, with a LAST attribute that is coindexed with the end of that list (the REST value following the final item). Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).
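A sketch of the corresponding expansion for <! ... !> follows, assuming the common encoding in which LAST shares a coreference with the REST value after the final item (so an empty diff list has LIST and LAST coindexed). The function name, the dictionary representation, and the coreference name "#last" are hypothetical and chosen only for illustration.

```python
def expand_diff_list(items, coref="#last", diff_list_type="*diff-list*"):
    """Expand the <! ... !> shorthand into LIST/LAST dictionaries.

    The string `coref` stands in for a coreference tag shared by LAST
    and the REST value after the final item.
    """
    lst = coref
    for item in reversed(items):  # build from the innermost REST outward
        lst = {"FIRST": item, "REST": lst}
    return {"type": diff_list_type, "LIST": lst, "LAST": coref}
```

Because the end of LIST is left as a coreference rather than a closed list type, another diff list can later be unified onto it, which is what makes appending possible.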
Type documentation
TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:
n_-_c_le := n_intr_lex_entry
"""
Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.
""".
Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):
Case sensitivity
Case Sensitive
- Things inside quotes (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing, i.e., as predicate symbols, but not CARGs)
Case Insensitive
- Everything in TDL not inside of quotes.
- Lexicon look-up.
- Proper names?
- Acronyms?
- .. approach these with token-mapping (preserve the info, and then downcase anyway)
Unknown
- Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)
Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel harmony languages (linguistic representations, not orthography).
Notes for implementation
DocStrings
Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should be concatenated with a newline to the previous docstring(s), or appended to a list of docstrings, associated with the type.
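One possible reading of this policy is sketched here. The function name collect_docstrings and its return shape are assumptions; the sketch takes the "discard" option for non-primary docstrings on the definition and the "concatenate with a newline" option for addenda.

```python
def collect_docstrings(definition_docs, addendum_docs=()):
    """Return (primary, combined) docstrings for one type.

    definition_docs -- docstrings found on the type definition, in order
    addendum_docs   -- docstrings found on later type addenda, in order
    """
    # Only the first docstring on the definition is treated as primary;
    # any others on the definition are discarded in this sketch.
    primary = definition_docs[0] if definition_docs else None
    parts = [] if primary is None else [primary]
    parts.extend(addendum_docs)  # addenda are appended in encounter order
    return primary, "\n".join(parts)
```

An implementation that prefers the list option would return parts itself instead of joining it.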
Comments
The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.
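To see how the Spacing production admits comments inside an attribute path, the following sketch transcribes Spacing, LineComment, and BlockComment into one Python pattern and skips over them from a given position. The names SPACING and skip_spacing are hypothetical, and, like the grammar above, the pattern does not handle nested block comments.

```python
import re

# Spacing := SPACE? Comment*, where each Comment may be followed by SPACE.
# LineComment runs to the end of the line; BlockComment is "#|" ... "|#".
SPACING = re.compile(
    r'\s*(?:(?:;[^\n]*|#\|(?:[^|\\]|\\.|\|(?!#))*\|#)\s*)*',
    re.DOTALL,
)


def skip_spacing(text, pos=0):
    """Return the index of the first character after any spacing at `pos`."""
    # The pattern can match the empty string, so match() always succeeds.
    return SPACING.match(text, pos).end()
```

Applied to the example path above, skipping from just after SYNSEM lands on the dot, and skipping again from just after the dot lands on LOCAL.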
Open Questions
1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see this thread on the 'developers' mailing list)
2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?