Differences between revisions 12 and 13
Revision 12 as of 2018-08-08 03:06:23
Size: 6947
Comment: Updated BNF for selected docstring pattern; removed obsolete discussion
Revision 13 as of 2018-09-11 05:46:22
Size: 11366
Comment: Updated the syntax description, restructured the wiki
Line 2: Line 2:
Line 4: Line 5:
== Case Sensitivity ==

=== Case Sensitive ===
<<TableOfContents(3)>>

== TDL File Syntax ==

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

{{{#!highlight ruby
# File Contents

TdlTypeFile := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile := ( LexRuleDef | LetterSet | WildCard | Spacing )* EOF

# Types and Lexical Rules

TypeDef := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName := Identifier Spacing

LexRuleDef := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: For some processors, like the LKB, there may be "break characters"
# defined which determine what is allowed within an identifier.

Identifier := /[^\s.:<=&,#\[\]$()>!^\/]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
# variations require special productions:
#
# (1) """DocStrings""" may precede any top-level Term or the final DOT
# (2) TypeDef and LexRuleDef require at least one TypeName
# (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
# conjunctions are added:
#
# (1) TypedConj has an obligatory TypeName term
# (2) FeatureConj excludes type terms (including strings, etc.)
# (3) DocConj is a regular conjunction with optional DocStrings
#
# Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
# for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction := Term ( AND Term )*
Term := TypeTerm | FeatureTerm | Coreference
TypeTerm := TypeName
              | DQString
              | QSymbol
              | Regex
FeatureTerm := Avm
              | DiffList
              | ConsList

DQString := /"([^"\\]|\\.)*"/ Spacing
QSymbol := "'" Identifier Spacing
Regex := "^" /([^$\\]|\\.)*/ "$"

Avm := AVMOPEN AttrVals? AVMCLOSE
AttrVals := AttrVal ( COMMA AttrVal )*
AttrVal := AttrPath SPACE Conjunction
AttrPath := Attribute ( DOT Attribute )*
Attribute := Identifier Spacing

DiffList := DLOPEN Conjunctions? DLCLOSE
ConsList := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
# terminals are used directly with an explicit SPACE instead of in
# a production with Spacing.

LetterSet := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar := /\?[^ ]/
Characters := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
# in the AffixSub copies the matched character, in order, so there
# should be the same number of LetterSetVars in both, but this is not
# captured in the syntax.

Affix := AffixClass AffixPattern+ Spacing
AffixClass := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch := NullChar | CharList
AffixSub := CharList
NullChar := "*"
CharList := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines

Spacing := SPACE? Comment*
SPACE := /\s+/
Comment := ( LineComment | BlockComment ) SPACE?
LineComment := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

DEFOP := ":=" Spacing
ADDOP := ":+" Spacing
DOT := "." Spacing
AND := "&" Spacing
COMMA := "," Spacing
AVMOPEN := "[" Spacing
AVMCLOSE := "]" Spacing
DLOPEN := "<!" Spacing
DLCLOSE := "!>" Spacing
CLOPEN := "<" Spacing
CLCLOSE := ">" Spacing
ELLIPSIS := "..." Spacing
EOF := "" # end-of-file

}}}

== TDL File Interpretation and Conventions ==

=== Layout of a type definition ===


=== Types versus instances ===
=== Specifying the text encoding ===

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages.
For instance, the following sets the encoding to UTF-8:

{{{#!highlight scheme
; -*- coding: utf-8 -*-
}}}

In some TDL files, attributes specific to the [[https://www.gnu.org/software/emacs/|Emacs]] text editor may be included:

{{{#!highlight scheme
;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
}}}


=== Feature interpretation of lists ===

The `< ... >` and `<! ... !>` shorthands for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs.
Regular cons lists may be terminated (fixed-length) or unterminated (expandable).

{{{#!highlight scheme
; an empty list is terminated (always empty)
[ ATTR < > ] => [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ] => [ ATTR [ FIRST a,
                                      REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ] => [ ATTR [ FIRST a,
                                      REST [ FIRST b,
                                             REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ] => [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ] => [ ATTR [ FIRST a,
                                      REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ] => [ ATTR [ FIRST a,
                                      REST #coref ] ]
}}}

Diff lists are regular lists under a `LIST` attribute, and `LAST` points to the end of the list: the `REST` after the last item, or `LIST` itself when the list is empty.
Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see [[GeFaqDiffList]]).

{{{#!highlight scheme
[ ATTR <! !> ] => [ ATTR [ LIST #coref,
                                      LAST #coref ] ]
[ ATTR <! a !> ] => [ ATTR [ LIST [ FIRST a,
                                             REST #coref & *null* ],
                                      LAST #coref ] ]
}}}

=== Type documentation ===

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (`.`) character:

{{{
n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
}}}

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

{{{#!highlight cl
; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].
}}}

=== Case sensitivity ===

==== Case Sensitive ====
Line 10: Line 238:
=== Case Insensitive === ==== Case Insensitive ====
Line 19: Line 247:
=== Unknown === ==== Unknown ====
Line 25: Line 253:
== Doc Strings ==

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (`.`) character:

{{{
n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
}}}

== TDL File Syntax ==

{{{#!highlight ruby
# File Contents

TdlTypeFile := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile := ( LexRuleDef | MorphSet | Spacing )* EOF

# Types and Lexical Rules

TypeDef := Type DefOp TypedDefBody Dot
TypeAddendum := Type AddOp ( DefBody | DocString ) Dot
LexRuleDef := LexRuleId DefOp Affix? TypedDefBody Dot
LexRuleId := Identifier Spacing

# Definition Bodies (top-level conjunctions of terms)
#
# The body of a type definition, type addendum, or lexical rule is
# essentially a conjunction of Terms, but there are two special features
# of top-level conjunctions (i.e., those outside of an AVM):
#
# (1) """DocStrings""" may precede any Term or the final Dot (.)
#
# (2) TypeDef and LexRuleDef require at least one Type (supertype)
# somewhere in the conjunction (conventionally the first Term)

TypedDefBody := ( TopLevelConj And )? DocString? Type ( And TopLevelConj )? DocString?
DefBody := TopLevelConj DocString?
TopLevelConj := DocString? Term ( And DocString? Term )*
DocString := TQString

# Terms and Conjunctions

Conjunction := Term ( And Term )*
Type := Identifier Spacing
Term := ( Type
                | FeatureTerm
                | DiffList
                | ConsList
                | Coreference
                | DQString
                | QSymbol
                | Regex
                )
FeatureTerm := LBrack AttrVals? RBrack
AttrVals := AttrVal ( Comma AttrVal )*
AttrVal := Attribute ( Dot Attribute )* Conjunction
Attribute := Identifier Spacing
DiffList := DLOpen Conjunctions? DLClose
ConsList := CLOpen ( Conjunctions ConsEnd? )? CLClose
ConsEnd := Comma Ellipsis | Dot Conjunction
Conjunctions := Conjunction ( Comma Conjunction )*
Coreference := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes

MorphSet := "%" "(" ( LetterSetDef | WildCardDef ) ")"
LetterSetDef := "letter-set" Space? "(" LetterSetVar Space LetterSet ")"
WildCardDef := "wild-card" Space? "(" WildCardVar Space LetterSet ")"
LetterSetVar := /![^ ]/
WildCardVar := /\?[^ ]/
LetterSet := /([^)\\]|\\.)+/
Affix := AffixClass AffixPattern+ Spacing
AffixClass := "%prefix" | "%suffix"
AffixPattern := Space? "(" ( NullChar | CharList ) Space CharList ")"
CharList := ( LetterSetVar | WildCardVar | AffixChar )+
NullChar := "*"
AffixChar := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments

Spacing := Space? Comment*
Space := /\s+/
Comment := ( LineComment | BlockComment ) Space?
LineComment := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Literals

DefOp := ":=" Spacing
AddOp := ":+" Spacing
Identifier := /[^\s.:<=&,#\[\]$()>!^\/]+/
Dot := "." Spacing
And := "&" Spacing
Comma := "," Spacing
LBrack := "[" Spacing
RBrack := "]" Spacing
DLOpen := "<!" Spacing
DLClose := "!>" Spacing
CLOpen := "<" Spacing
CLClose := ">" Spacing
Ellipsis := "..." Spacing
DQString := /"([^"\\]|\\.)*"/ Spacing
TQString := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing
QSymbol := "'" Identifier Spacing
Regex := "^" /([^$\\]|\\.)*/ "$"
}}}

=== Notes for implementation ===

==== DocStrings ====

== Notes for implementation ==

=== DocStrings ===
Line 140: Line 260:
==== Comments ==== === Comments ===
Line 144: Line 264:
== Questions == == Open Questions ==

Type Description Language and other aspects of DELPH-IN Joint Reference Formalism

TDL File Syntax

Productions are separated into thematic sections. ALL-CAPS rule names are for non-content terminals, which appear at the bottom of the description.

# File Contents

TdlTypeFile  := ( TypeDef | TypeAddendum | Spacing )* EOF
TdlRuleFile  := ( LexRuleDef | LetterSet | WildCard | Spacing )* EOF

# Types and Lexical Rules

TypeDef      := TypeName DEFOP TypeDefBody DOT
TypeAddendum := TypeName ADDOP AddendumBody DOT
TypeName     := Identifier Spacing

LexRuleDef   := LexRuleId DEFOP Affix? TypeDefBody DOT
LexRuleId    := Identifier Spacing

# Identifiers are used in several patterns
#
# Note: For some processors, like the LKB, there may be "break characters"
#       defined which determine what is allowed within an identifier.

Identifier   := /[^\s.:<=&,#\[\]$()>!^\/]+/

# Definition Bodies (top-level conjunctions of terms)
#
# Note: Definition bodies are most simply Conjunctions, but several
#       variations require special productions:
#
#       (1) """DocStrings""" may precede any top-level Term or the final DOT
#       (2) TypeDef and LexRuleDef require at least one TypeName
#       (3) TypeAddendum may use a DocString in place of a Conjunction
#

TypeDefBody  := TypedConj DocString?
AddendumBody := DocConj DocString? | DocString

# Note: To accommodate TypeDefBody and AddendumBody, three special
#       conjunctions are added:
#
#       (1) TypedConj has an obligatory TypeName term
#       (2) FeatureConj excludes type terms (including strings, etc.)
#       (3) DocConj is a regular conjunction with optional DocStrings
#
#       Note that FeatureConj is only necessary to reduce ambiguity (e.g.,
#       for LALR parsing); if ambiguity is allowed, DocConj may be used.

TypedConj    := ( FeatureConj AND )? DocString? TypeName ( AND DocConj )?
FeatureConj  := DocString? FeatureTerm ( AND DocString? FeatureTerm )*
DocConj      := DocString? Term ( AND DocString? Term )*

# Note: The DocString pattern may span multiple lines

DocString    := /"""([^"\\]|\\.|"(?!")|""(?!"))*"""/ Spacing

# Terms and Conjunctions

Conjunction  := Term ( AND Term )*
Term         := TypeTerm | FeatureTerm | Coreference
TypeTerm     := TypeName
              | DQString
              | QSymbol
              | Regex
FeatureTerm  := Avm
              | DiffList
              | ConsList

DQString     := /"([^"\\]|\\.)*"/ Spacing
QSymbol      := "'" Identifier Spacing
Regex        := "^" /([^$\\]|\\.)*/ "$"

Avm          := AVMOPEN AttrVals? AVMCLOSE
AttrVals     := AttrVal ( COMMA AttrVal )*
AttrVal      := AttrPath SPACE Conjunction
AttrPath     := Attribute ( DOT Attribute )*
Attribute    := Identifier Spacing

DiffList     := DLOPEN Conjunctions? DLCLOSE
ConsList     := CLOPEN ( Conjunctions ConsEnd? )? CLCLOSE
ConsEnd      := COMMA ELLIPSIS | DOT Conjunction
Conjunctions := Conjunction ( COMMA Conjunction )*

Coreference  := "#" Identifier Spacing

# Letter-sets, Wild-cards, and Affixes
#
# Note: spacing is sensitive within these patterns, so many non-content
#       terminals are used directly with an explicit SPACE instead of in
#       a production with Spacing.

LetterSet    := "%(letter-set" SPACE? LetterSetDef SPACE? ")"
WildCard     := "%(wild-card" SPACE? WildCardDef SPACE? ")"
LetterSetDef := "(" LetterSetVar SPACE Characters ")"
WildCardDef  := "(" WildCardVar SPACE Characters ")"
LetterSetVar := /![^ ]/
WildCardVar  := /\?[^ ]/
Characters   := /([^)\\]|\\.)+/

# Note: When a LetterSetVar is used in an AffixMatch, the same LetterSetVar
#       in the AffixSub copies the matched character, in order, so there
#       should be the same number of LetterSetVars in both, but this is not
#       captured in the syntax.

Affix        := AffixClass AffixPattern+ Spacing
AffixClass   := "%prefix" | "%suffix"
AffixPattern := SPACE? "(" AffixMatch SPACE AffixSub ")"
AffixMatch   := NullChar | CharList
AffixSub     := CharList
NullChar     := "*"
CharList     := ( LetterSetVar | WildCardVar | AffixChar )+
AffixChar    := /([^!?\s*\\]|\\[^ ])+/

# Whitespace and Comments
#
# Note: SPACE and BlockComment may span multiple lines

Spacing      := SPACE? Comment*
SPACE        := /\s+/
Comment      := ( LineComment | BlockComment ) SPACE?
LineComment  := /;.*$/
BlockComment := "#|" /([^|\\]|\\.|\|(?!#))*/ "|#"

# Non-content Terminals

DEFOP        := ":=" Spacing
ADDOP        := ":+" Spacing
DOT          := "." Spacing
AND          := "&" Spacing
COMMA        := "," Spacing
AVMOPEN      := "[" Spacing
AVMCLOSE     := "]" Spacing
DLOPEN       := "<!" Spacing
DLCLOSE      := "!>" Spacing
CLOPEN       := "<" Spacing
CLCLOSE      := ">" Spacing
ELLIPSIS     := "..." Spacing
EOF          := ""  # end-of-file
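
The note on LetterSetVars says that the AffixMatch and AffixSub of a pattern should use the same letter-set variables, which the syntax itself does not enforce. A processor could check this after parsing. The following is only a sketch: the function names are hypothetical, and comparing per-variable counts is one possible reading of "the same number of LetterSetVars in both".

```python
import re
from collections import Counter

# Matches an escaped character (ignored) or a LetterSetVar such as "!s".
# The escape handling mirrors the AffixChar pattern; it is an assumption
# that escaped "!" should never count as a variable.
_TOKEN_RE = re.compile(r"\\.|(![^ ])")


def letter_set_vars(char_list):
    """Return the LetterSetVars in an AffixMatch or AffixSub, in order."""
    return [var for var in _TOKEN_RE.findall(char_list) if var]


def affix_vars_agree(affix_match, affix_sub):
    """True if both sides use each LetterSetVar the same number of times."""
    return Counter(letter_set_vars(affix_match)) == Counter(letter_set_vars(affix_sub))
```

For a pattern such as `(!ty !tied)`, `affix_vars_agree("!ty", "!tied")` holds, while `affix_vars_agree("!s", "es")` does not.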

TDL File Interpretation and Conventions

Layout of a type definition

Types versus instances

Specifying the text encoding

The text encoding of TDL files can be specified using a special comment on the first line of the file, as is done with many scripting languages. For instance, the following sets the encoding to UTF-8:

; -*- coding: utf-8 -*-

In some TDL files, attributes specific to the Emacs text editor may be included:

;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
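
A processor might read this declaration before decoding the rest of the file. The sketch below is an illustration only: `detect_tdl_encoding`, the regular expression, and the UTF-8 fallback are all assumptions, not behavior defined on this page.

```python
import re

# Hypothetical pattern for an Emacs-style "coding:" variable; the exact
# form accepted by real TDL processors may differ.
CODING_RE = re.compile(r"-\*-.*?\bcoding:\s*([-\w.]+)", re.IGNORECASE)


def detect_tdl_encoding(first_line, default="utf-8"):
    """Return the encoding named in a first-line comment, or the default."""
    if not first_line.lstrip().startswith(";"):
        return default  # the declaration must be inside a line comment
    match = CODING_RE.search(first_line)
    return match.group(1).lower() if match else default
```

Both example comments above would yield `utf-8` with this helper, since the match ignores the case of `Coding:`.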

Feature interpretation of lists

The < ... > and <! ... !> shorthands for lists ("cons lists") and diff-lists, respectively, correspond to normal attribute-value pairs. Regular cons lists may be terminated (fixed-length) or unterminated (expandable).

; an empty list is terminated (always empty)
[ ATTR < > ]             =>  [ ATTR *null* ]
; single item goes on FIRST attribute and REST is terminated
[ ATTR < a > ]           =>  [ ATTR [ FIRST a,
                                      REST *null* ] ]
; items after the first go on (REST.)+FIRST
[ ATTR < a, b > ]        =>  [ ATTR [ FIRST a,
                                      REST [ FIRST b,
                                             REST *null* ] ] ]
; an empty list with ... is not terminated
[ ATTR < ... > ]         =>  [ ATTR *list* ]
; this also works with items on the list
[ ATTR < a, ... > ]      =>  [ ATTR [ FIRST a,
                                      REST *list* ] ]
; the . delimiter allows a non-*list*, non-*null* value for the last REST
[ ATTR < a . #coref > ]  =>  [ ATTR [ FIRST a,
                                      REST #coref ] ]
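
The expansion of cons lists shown above can be sketched as a small function. This is a minimal illustration, assuming a hypothetical `expand_cons_list` that represents feature structures as nested Python dicts and leaves items and end values as plain strings.

```python
def expand_cons_list(items, end="*null*"):
    """Expand the < ... > shorthand into nested FIRST/REST pairs.

    `end` is the value of the last REST: "*null*" for a terminated list,
    "*list*" for a list ending in "...", or another value (such as a
    coreference) for the dotted form < a . #coref >.
    """
    value = end
    # Build from the innermost REST outwards.
    for item in reversed(items):
        value = {"FIRST": item, "REST": value}
    return value
```

With no items the result is just the end value, which matches `< >` becoming `*null*` and `< ... >` becoming `*list*`.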

Diff lists are regular lists under a LIST attribute, and LAST points to the end of the list: the REST after the last item, or LIST itself when the list is empty. Diff lists don't support the unterminated-list functionality of cons lists, but they allow for appending lists of arbitrary size (see GeFaqDiffList).

[ ATTR <! !> ]           =>  [ ATTR [ LIST #coref,
                                      LAST #coref ] ]
[ ATTR <! a !> ]         =>  [ ATTR [ LIST [ FIRST a,
                                             REST #coref & *null* ],
                                      LAST #coref ] ]
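
The two diff-list expansions above can be reproduced with a short function. This is a sketch only: `expand_diff_list` is a hypothetical name, values are kept as strings, and the handling of lists with more than one item is an extrapolation from the single-item example.

```python
def expand_diff_list(items, coref="#coref"):
    """Expand the <! ... !> shorthand into a LIST/LAST pair.

    Follows the examples above: an empty diff list identifies LIST with
    LAST, and a non-empty one ends its final REST in "#coref & *null*".
    """
    tail = f"{coref} & *null*" if items else coref
    # Build from the innermost REST outwards, as for cons lists.
    for item in reversed(items):
        tail = {"FIRST": item, "REST": tail}
    return {"LIST": tail, "LAST": coref}
```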

Type documentation

TDL definitions allow documentation strings ("docstrings") before any term in the top-level conjunction or before the terminating dot (.) character:

n_-_c_le := n_intr_lex_entry
"""Intransitive count noun (icn)
<ex>The dog barked.
<nex>Much dog bark.""".
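
The DocString pattern from the syntax description can be tried directly on a definition like the one above. The following is a sketch in Python, with a hypothetical `find_docstrings` helper; it scans raw text and so, unlike a real parser, would also pick up triple-quoted text inside comments.

```python
import re

# The DocString pattern from the grammar, with a capturing group around
# the content. DOTALL lets an escaped newline match "\\." as well.
DOCSTRING_RE = re.compile(r'"""((?:[^"\\]|\\.|"(?!")|""(?!"))*)"""', re.DOTALL)


def find_docstrings(tdl_text):
    """Return the contents of all triple-quoted strings in the text."""
    return [match.group(1) for match in DOCSTRING_RE.finditer(tdl_text)]
```

Applied to the `n_-_c_le` definition, this returns a single three-line string beginning with `Intransitive count noun (icn)`.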

Before docstrings became well-supported, LTDB supported documentation in comments (normally preceding the documented type):

; <type val="case-p-lex-np-to">
; <name-ja>承名詞目的格助詞ト
; <description>case-p-lex-np-woを参照。このtypeは助詞「と」。
; <ex>部長 と 会う
; <nex>ゆっくり と 進む
; <todo>
; </type>
case-p-lex-np-to := case-p-lex-np &
 [SYNSEM.LOCAL.CAT.HEAD.CASE to].

Case sensitivity

Case Sensitive

  • Things inside quotes (NB: strings passed from the TFS world into MRS can be treated as case insensitive in MRS processing, i.e. as predicate symbols, but not CARGs)

Case Insensitive

  • Everything in TDL not inside of quotes.
  • Lexicon look-up.
    • Proper names?
    • Acronyms?
  • .. approach these with token-mapping (preserve the info, and then downcase anyway)

Unknown

  • Orthographic subrules (agree: case sensitive, ACE: [intended] case insensitive)

Notes: Arguments for case insensitivity include shouting (all caps); arguments for case sensitivity include the use of upper-case vowels in vowel harmony languages (linguistic representations, not orthography).

Notes for implementation

DocStrings

Multiple docstrings may be present on a single definition, but only the first one encountered on a definition is considered its primary docstring, and implementers are free to store or discard the other docstrings as they see fit. Docstrings on type addenda should be concatenated with a newline to the previous docstring(s), or appended to a list of docstrings, associated with the type.
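
One way to follow this policy is to keep a list of docstrings per type. The class below is a hypothetical sketch, not an existing API: it takes the "list of docstrings" option, keeps every docstring rather than discarding any, and assumes a type's definition is recorded before its addenda.

```python
from collections import defaultdict


class DocstringStore:
    """Hypothetical container for the docstrings associated with types."""

    def __init__(self):
        self._docs = defaultdict(list)

    def add(self, type_name, docstrings):
        """Record the docstrings of one definition or addendum, in order."""
        self._docs[type_name].extend(docstrings)

    def primary(self, type_name):
        """Return the first docstring recorded for the type, if any."""
        docs = self._docs.get(type_name)
        return docs[0] if docs else None

    def combined(self, type_name):
        """Return all docstrings for the type joined with newlines."""
        return "\n".join(self._docs.get(type_name, []))
```

Here `primary` gives the primary docstring and `combined` gives the newline-concatenated form, so a caller can choose either presentation.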

Comments

The syntax description above allows for comments anywhere that separating whitespace is allowed (not including those within strings, regular expressions, letter sets, etc.). This includes within a dotted attribute path (e.g., [ SYNSEM #| comment |# . #| comment |# LOCAL ... ]), although grammar developers may want to use this flexibility sparingly.
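
A tokenizer can handle this by skipping a run of whitespace and comments wherever Spacing is allowed. The sketch below combines the SPACE, LineComment, and BlockComment patterns into one Python regular expression; `skip_spacing` is a hypothetical helper, and it must only be called outside strings, regular expressions, and letter sets.

```python
import re

# Any run of whitespace, line comments, and block comments. In VERBOSE
# mode a bare "#" starts a regex comment, so it is escaped in the pattern.
SPACING_RE = re.compile(
    r"""(?: \s+                                # SPACE
          | ;[^\n]*                            # LineComment
          | \#\|(?:[^|\\]|\\.|\|(?!\#))*\|\#   # BlockComment
        )*""",
    re.VERBOSE | re.DOTALL,
)


def skip_spacing(text, pos=0):
    """Return the index of the first non-spacing character at or after pos."""
    return SPACING_RE.match(text, pos).end()
```

For the attribute path in the example, calling `skip_spacing` after `SYNSEM` lands on the dot, and calling it again after the dot lands on `LOCAL`.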

Open Questions

1. The ^ character is used to signal "expanded-syntax" in the LKB, but is this only used for regular expressions? Are there other expanded syntaxes? Do non-LKB processors support them? (see this thread on the 'developers' mailing list)

2. Are instances distinguishable from types? Are they (or other entities) restricted to having exactly one supertype?

Discussions

TdlRfc (last edited 2020-06-05 06:38:36 by FrancisBond)
