Work in Progress!

Abstract

In this document, we provide an outline for an overhaul of the choices system for the Grammar Matrix. The primary changes are moving to a standard serialization format, the introduction of a choices schema, and the decoupling of choices and questionnaire display (matrixdef). We propose using TOML as an open source, non-proprietary, generally available data storage format. We propose providing a schema for choices files and an associated API. These changes will make development easier and help enable connecting the Grammar Matrix to other projects, both as a component and a target.

Introduction

The Grammar Matrix is composed of several components, including but not limited to the following:

  1. matrix.tdl

  2. the customization system
  3. the questionnaire
  4. the regression tests

A choices file is a serialized data file containing both 1) the answers to questions in the questionnaire (as well as annotated data such as lexical entries) and 2) information mapping choices to the matrixdef file that participated in generating it. (2) includes, but is probably not limited to:

Currently, choices are written in a proprietary storage format (DSL).

The code implementing choices is defined in three locations:

Motivations

The goals for changing this setup include the following:

  1. Ease of human reading: the current choices file format is generally human readable, and this is an important consideration of any new changes. However, indexing and non-obvious nesting can make choices difficult for newcomers to understand.

  2. Ease of human editing: along with reading, being able to identify and edit a choice without using the customization system is a critical requirement, for use with matrix.py, debugging, and development.

  3. Ease of creation: it is currently difficult to write a choices file from scratch. One must parse the current matrixdef file to understand available sections, choices, and options. Ideally, one would be able to write choices with very few dependencies.

  4. Computer serialization and deserialization: choices are currently stored in a proprietary format, so reading and writing them requires implementing a significant amount of code.

  5. Make development easier: currently, changing anything to do with choices probably requires changes in several locations and is brittle.

  6. Choices as a component: a modularized choices format could be used as a data format in other projects, targeting the Grammar Matrix or not.

  7. Matrix as a target: modularizing the choices format could help others develop systems that use the Grammar Matrix (i.e. matrix.py) without the Customization System (e.g. AGGREGATION).

To achieve these goals, we propose the following changes:

  1. Changing the choices serialization format
  2. Introducing a choices schema

Background

The existing choices data serialization consists of the following data types:

  1. Dictionaries (maps)
  2. Lists (arrays/vectors)
  3. Strings / References
  4. Integers

Below is an example snippet of the existing choices format as of July 2020 from the mini English choices file:

section=morphology
  noun-pc1_name=num
  noun-pc1_obligatory=on
  noun-pc1_order=suffix
  noun-pc1_inputs=noun1
    noun-pc1_lrt1_name=singular
      noun-pc1_lrt1_feat1_name=number
      noun-pc1_lrt1_feat1_value=sg
      noun-pc1_lrt1_lri1_inflecting=no
    noun-pc1_lrt2_name=plural
      noun-pc1_lrt2_feat1_name=number
      noun-pc1_lrt2_feat1_value=pl
      noun-pc1_lrt2_lri1_inflecting=yes
      noun-pc1_lrt2_lri1_orth=s
  verb-pc1_name=pernum
  verb-pc1_obligatory=on
  verb-pc1_order=suffix
  verb-pc1_inputs=verb
    verb-pc1_lrt1_name=3sg
      verb-pc1_lrt1_feat1_name=person
      verb-pc1_lrt1_feat1_value=3rd
      verb-pc1_lrt1_feat1_head=subj
      verb-pc1_lrt1_feat2_name=number
      verb-pc1_lrt1_feat2_value=sg
      verb-pc1_lrt1_feat2_head=subj
      verb-pc1_lrt1_lri1_inflecting=yes
      verb-pc1_lrt1_lri1_orth=s
    verb-pc1_lrt2_name=pl
      verb-pc1_lrt2_feat1_name=number
      verb-pc1_lrt2_feat1_value=pl
      verb-pc1_lrt2_feat1_head=subj
      verb-pc1_lrt2_lri1_inflecting=no
    verb-pc1_lrt3_name=non-3rd
      verb-pc1_lrt3_feat1_name=person
      verb-pc1_lrt3_feat1_value=1st, 2nd
      verb-pc1_lrt3_feat1_head=subj
      verb-pc1_lrt3_lri1_inflecting=no

These choices are aligned with the matrixdef section named morphology, and the following matrixdef snippet as of July 2020 (trimmed for brevity, only including items relevant to choices snippet above):

Section morphology "Morphology"

BeginIter noun-pc{i} "a Position Class" 1 0

  Text name "Noun Position Class {i} Name" "<b>Noun Position Class {i}</b>:<br/>Position Class Name: " " " 20

  Check obligatory "Noun Position Class {i} Obligatory" "<br/>Obligatorily occurs:" ""

Select order "Noun Position Class {i} Order" "<br/>Appears as a prefix or suffix:" ""
. prefix "Prefix" "prefix"
. suffix "Suffix" "suffix"

  MultiSelect inputs "Noun Position Class {i} Input" "<br/>Possible inputs:" ""
  fillcache c=nouns
  fillregex p=(verb|noun)-pc[0-9]+(_lrt[0-9]+)?_name
  . noun "Any noun" "any noun"

  BeginIter lrt{j} "a Lexical Rule Type" 1 0

    Text name "Noun Position Class {i} Lexical Rule Type {j} Name" "<b>Lexical Rule Type {j}</b>:<br/>Name: " "" 20
    BeginIter feat{k} "a Feature"

      Select name "Noun Position Class {i} Lexical Rule Type {j} Feature {k} Name" "Name: " " "
      fillnames c=both

      MultiSelect value "Noun Position Class {i} Lexical Rule Type {j} Feature {k} Value" "Value: " ""
      fillvalues p=noun-pc{i}_lrt{j}_feat{k}_name

    EndIter feat

    BeginIter lri{k} "a Lexical Rule Instance" 0 1

      Radio inflecting "Noun Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Has Orthography" "Instance {k}" ""
      . no "No" "" "No affix"
      . yes "Yes" "" "Affix spelled"

      Text orth "Noun Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Spelling" "" "" 20

    EndIter lri

  EndIter lrt

EndIter noun-pc

BeginIter verb-pc{i} "a Position Class" 1

  Text name "Verb Position Class {i} Name" "<b>Verb Position Class {i}</b>:<br/>Position Class Name: " " " 20

  Check obligatory "Verb Position Class {i} Obligatory" "<br/>Obligatorily occurs:" ""

  Select order "Verb Position Class {i} Order" "<br/>Appears as a prefix or suffix:" ""
  . prefix "Prefix" "prefix"
  . suffix "Suffix" "suffix"

  BeginIter lrt{j} "a Lexical Rule Type" 1 1

    Text name "Verb Position Class {i} Lexical Rule Type {j} Name" "<b>Lexical Rule Type {j}</b>:<br/>Name: " "" 20

    BeginIter feat{k} "a Feature"

      Select name "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Name" "Name: " " "
      fillnames c=both

      MultiSelect value "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Value" "Value: " ""
      fillvalues p=verb-pc{i}_lrt{j}_feat{k}_name

      Select head "Verb Position Class {i} Lexical Rule Type {j} Feature {k} Head" "Specified on: " ""
      . verb "The verb" "the verb"
      . subj "The subject" "the subject NP"
      . obj "The object" "the object NP"

    EndIter feat

    BeginIter lri{k} "a Lexical Rule Instance" 0 1

      Radio inflecting "Verb Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Has Orthography" "Instance {k}" ""
      . no "No" "" "No affix"
      . yes "Yes" "" "Affix spelled"

      Text orth "Verb Position Class {i} Lexical Rule Type {j} Lexical Rule Instance {k} Spelling" "" "" 20

    EndIter lri

  EndIter lrt

EndIter verb-pc

Proposals

Serialization Format

Using a standard serialization format allows the Grammar Matrix code, other code, and humans to more easily read and write choices files. By relying on an existing third party implementation, this would also reduce the amount of code in the Grammar Matrix repo, making development easier.

Serialization Language

Several candidates were considered for the serialization format: JSON, YAML, StrictYAML, and TOML.

We are recommending TOML in large part due to the lack of problems compared to YAML and its readability over JSON.

Serialization Specifics

On top of the language, there are decisions about how choices are organized into the choices file:

  1. Lists of Choices
  2. Nested Choices

Lists of Choices

As of writing, choices are stored as lists using indexing, e.g.:

section=lexicon
  noun1_det=imp
    noun1_stem1_orth=n1
    noun1_stem1_pred=_n1_n_rel
  noun2_det=imp
    noun2_stem1_orth=n2
    noun2_stem1_pred=_n2_n_rel

noun1 is the first noun object as defined by this matrixdef's BeginIter noun, which also contains the nested definition BeginIter stem leading to the object noun1_stem1, which has two choices: orth and pred.

In TOML, these could be represented in a couple different ways.

The first way would be without explicit indexing, using the existing mixed map and list nesting:

[lexicon]
  [[lexicon.noun]]
  det = "imp"

    [[lexicon.noun.stem]]
    orth = "n1"
    pred = "_n1_n_rel"

  [[lexicon.noun]]
  det = "imp"

    [[lexicon.noun.stem]]
    orth = "n2"
    pred = "_n2_n_rel"

Another proposal is to keep the indexing and store everything in maps/dictionaries:

[lexicon]
  [lexicon.noun1]
  det = "imp"

    [lexicon.noun1.stem]
    orth = "n1"
    pred = "_n1_n_rel"

  [lexicon.noun2]
  det = "imp"

    [lexicon.noun2.stem]
    orth = "n1"
    pred = "_n2_n_rel"

Some considered pros and cons of each approach:

As of July 2020, lists of references are stored as comma delimited strings. In TOML, they would be stored as lists.

# July 2020
noun-pc1_inputs=noun1,noun2

# TOML
[[morphology.noun-pc]]
inputs = ["noun1", "noun2"]

Nested choices

As of writing, choices are nested using chained keys with _ delimiters, e.g. noun1_stem1_orth is the value for orth of the first stem of the first noun.

In TOML, these nested choices would be explicitly nested as nested dictionaries/maps, which uses a similar notation:

# July 2020
section=lexicon
  noun1_det=imp
    noun1_stem1_orth=n1
    noun1_stem1_pred=_n1_n_rel

# TOML
[lexicon.noun1]
det = "imp"

  [lexicon.noun1.stem]
  orth = "n1"
  pred = "_n1_n_rel"

Serialized Example

The following is an example TOML serialization of the example choices from the background section above, following the Maps proposal above.

[morphology.noun-pc1]
inputs = ["noun1"]
name = "num"
obligatory = "on"
order = "suffix"

  [morphology.noun-pc1.lrt1]
  name = "singular"
    [morphology.noun-pc1.lrt1.feat1]
    name = "number"
    value = ["sg"]

    [morphology.noun-pc1.lrt1.lri1]
    inflecting = "no"

  [morphology.noun-pc1.lrt2]
  name = "plural"
    [morphology.noun-pc1.lrt2.feat1]
    name = "number"
    value = ["pl"]

    [morphology.noun-pc1.lrt2.lri1]
    inflecting = "yes"
    orth = "s"

[morphology.verb-pc1]
inputs = ["verb"]
name = "pernum"
obligatory = "on"
order = "suffix"

  [morphology.verb-pc1.lrt1]
  name = "3sg"
    [morphology.verb-pc1.lrt1.feat1]
    head = "subj"
    name = "person"
    value = ["3rd"]

    [morphology.verb-pc1.lrt1.feat2]
    head = "subj"
    name = "number"
    value = ["sg"]

    [morphology.verb-pc1.lrt1.lri1]
    inflecting = "yes"
    orth = "s"

  [morphology.verb-pc1.lrt2]
  name = "pl"
    [morphology.verb-pc1.lrt2.feat1]
    head = "subj"
    name = "number"
    value = ["pl"]

    [morphology.verb-pc1.lrt2.lri1]
    inflecting = "no"

  [morphology.verb-pc1.lrt3]
  name = "non-3rd"
    [morphology.verb-pc1.lrt3.feat1]
    head = "subj"
    name = "person"
    value = ["1st", "2nd"]

    [morphology.verb-pc1.lrt3.lri1]
    inflecting = "no"

Choices Schema

Currently, a schema for choices is defined in matrixdef. Separating this schema away from the questionnaire information of matrixdef allows the Grammar Matrix code, other code, and humans to more easily read and write choices files. It also better enables other systems to target the Grammar Matrix.

This proposal has two parts:

  1. Decoupling choices and questionnaire content
  2. An API for validating schemas, choices files, and matrixdef files.

Currently, reading and writing choices files has these dependencies:

matrix.cgi (used choices, serialization code)
├── deffile.py (deserialization code)
│   └── matrixdef (section definitions, choice definitions)
└── choices.py (data structures)

A choices schema would store all of the required information about a given choices version in one place such that the choices code dependencies would be much simpler:

choices.py (data structures, schema (section definitions, choice definitions))
├── tomlkit 3rd party dependency (serialization code, deserialization code)
└── schema 3rd party dependency (schema parsing, validation)

Decoupling Choices and Questionnaire Content

Introducing a schema definition means that the matrixdef file is no longer the source of truth for the choice definitions, but this is okay as much of the content of the matrixdef file is actually view definition that is interpreted into HTML and JavaScript for producing the questionnaire. New choices could be added to the schema and then automatically imported into the questionnaire. Doing this allows other programs to read the choices schema without a dependency on matrixdef or its custom syntax, making the goal of targeting the Matrix much easier. This also eases later modularization of the choices component of the Grammar Matrix project, leaving the Definition File logic in place.

As an example of the current issue with using choices in other projects, the AGGREGATION code has a divergent copies of ChoicesDict, deffile.py, and matrixdef in order to read and write choices files. Moving to a decoupled choices format will allow for a single shared dependency.

Schema Definition

The schema definition is similar to the existing matrixdef file, but would not not store view information, but only name and type information. To write and validate the schema, we propose using the schema project. The schema is written in Python, such as the following example schema for the morphology section of matrixdef above (which is not production ready):

from schema import Schema, And, Or, Regex

withIndex = lambda expected: Regex(rf"^{expected}\d+$")
Value = And(str, len)
yes_no = Or('yes', 'no')
on_off = Or('on', 'off')
orders = Or('prefix', 'suffix')

lri = {
  'inflecting': yes_no,
  Optional('orth'): Value
}

schema = Schema({'morphology':
  {withIndex('noun-pc'):
    {'name': Value,
     'obligatory': on_off,
     'order': orders,
     'inputs': [Value], # References to nominal lexical types or lexical rules
      withIndex('lrt'): {'name': Value,
                         withIndex('feat'): {'name': Value,
                                             'value': [Value]},
                         withIndex('lri'): lri}},
   withIndex('verb-pc'): 
    {'name': Value,
     'obligatory': on_off,
     'order': orders,
     'inputs': [Value],
     withIndex('lrt'): {'name': Value,
                        withIndex('feat'): {'name': Value,
                                            'value': [Value],
                                            'head': Or("verb", "subj", "obj"),},
                        withIndex('lri'): lri}}
  }})

Some notes:

The structure generally map to types/functionality in matrixdef:

Top level keys => Section
withIndex keys => (Begin|End)Iter
Value => Text|TextArea|fillcache
[Value] => MultiSelect
bool => Check
Or("a", "b") => Select|Radio

An Choices Schema API

Given a choices schema, one would want to do several things with it. This is a preliminary list:

Meta

Date: July 2020 Contributers: T.J. Trimble, Chris Curtis, Michael Goodman, Olga Zamaraeva

MatrixChoicesRfc (last edited 2020-07-09 15:13:50 by TJTrimble)

(The DELPH-IN infrastructure is hosted at the University of Oslo)