Page Status

This page presents user-supplied information, and hence may be inaccurate in some details or may not reflect the use patterns anticipated by the [incr tsdb()] developers. This page was initiated by FrancisBond; please feel free to make additions or corrections as you see fit. However, before revising this page, you should be reasonably confident that the information you give is correct.

Further, observe that [incr tsdb()] database internals can change over time. For a given database (i.e. profile), the database schema is defined by the file relations, which is part of the profile directory. To cope with differences in database versions over time, [incr tsdb()] will always support reading from older formats, while writing is typically limited to current versions.

Reference

This page includes some low level information about [incr tsdb()] (ItsdbTop). You may also be interested in ItsdbCustomization.

Formatting Conventions

For all tables in [incr tsdb()], the field (i.e. column) separator is '@', and a newline is the record (i.e. row) separator. Where these characters must appear within a field, they must be escaped:

Escape characters

Character    Escaped as
@            \s
newline      \n
\            \\
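
The escaping rules in the table above can be sketched in Python as follows. This is a minimal illustration, not code taken from [incr tsdb()]; the function names (escape_field, unescape_field, join_record, split_record) are hypothetical. Unescaping is done in a single left-to-right pass so that an escaped backslash followed by a literal 's' is not mistaken for an escaped '@'.

```python
import re

# Escape sequences from the table above.
_ESCAPES = {"\\": "\\\\", "@": "\\s", "\n": "\\n"}
_UNESCAPES = {"\\\\": "\\", "\\s": "@", "\\n": "\n"}


def escape_field(value):
    """Escape one field value so it can be stored between '@' separators."""
    # One pass over the characters, so backslashes are never escaped twice.
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value):
    """Reverse escape_field; other backslash sequences are left untouched."""
    return re.sub(r"\\[\\sn]", lambda m: _UNESCAPES[m.group(0)], value)


def join_record(fields):
    """Build one record (row) from a list of unescaped field values."""
    return "@".join(escape_field(f) for f in fields)


def split_record(line):
    """Split one record (row) into its unescaped field values."""
    return [unescape_field(f) for f in line.rstrip("\n").split("@")]
```

Because escaped fields never contain a raw '@' or a raw newline, splitting on '@' before unescaping is safe.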

Database Files

Some fields of the item file are:

Field  Name          Explanation            Example Value
5:     i-difficulty  Difficulty             1
6:     i-category    Category               S,XP
7:     i-input       String                 parse me
8:     i-wf          Well Formedness        0,1,2
9:     i-length      String length (words)  integer
10:    i-comment     Comment
11:    i-author      Author                 uname
12:    i-date        Date created           5-8-2003

An actual entry might look like this:

1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98
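
As an illustration, the entry above can be split into named fields with a few lines of Python. This is only a sketch: the names of fields 5 to 12 come from the list above, while the names used here for fields 2 to 4 (i-origin, i-register, i-format) are assumptions; the relations file in the profile is the authoritative source. The function name parse_item is hypothetical, and escape sequences are not decoded here.

```python
# Field names in item-file order. Fields 2-4 are assumed names; check the
# relations file of your profile for the authoritative schema.
ITEM_FIELDS = [
    "i-id", "i-origin", "i-register", "i-format",
    "i-difficulty", "i-category", "i-input", "i-wf",
    "i-length", "i-comment", "i-author", "i-date",
]


def parse_item(line):
    """Map one item-file line to a dict of field name -> raw string value."""
    values = line.rstrip("\n").split("@")
    if len(values) != len(ITEM_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(ITEM_FIELDS), len(values))
        )
    return dict(zip(ITEM_FIELDS, values))


item = parse_item("1@csli@formal@none@1@S@Abrams works .@1@2@@@jul-98")
print(item["i-input"])  # Abrams works .
```

All values are kept as strings; empty fields such as i-comment and i-author in this entry come back as empty strings.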

Note that [incr tsdb()] does not always check that the i-ids are unique, but they should always be kept unique. It is also a good idea to keep the items sorted.
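
Since uniqueness is not always checked for you, a small script can check an item file before it is used. The sketch below assumes that the i-id is the first field of each line, that it is an integer, and that "sorted" means ascending i-id order; the function name check_item_ids is hypothetical.

```python
from collections import Counter


def check_item_ids(lines):
    """Return (duplicate_ids, is_sorted) for an iterable of item-file lines."""
    # The i-id is assumed to be the first '@'-separated field of each line.
    ids = [int(line.split("@", 1)[0]) for line in lines if line.strip()]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    is_sorted = all(a <= b for a, b in zip(ids, ids[1:]))
    return duplicates, is_sorted
```

For example, passing the lines of an item file (for instance from open(path, encoding="utf-8")) returns the list of repeated i-ids and a flag saying whether the file is in ascending order.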

In the Hinoki project, the i-comment is used to give the source of the utterance (definition sentence, example, other corpus), the ID in the source corpus, and, for definition and example sentences, some information about the headword being defined or exemplified.

Output File Format

It is possible to store information about desired outputs, for example translations. They are stored in a skeleton's output file.

A minimal example, giving two Japanese translations of the sentence shown in the item file example above, is:

1@@@@-1@-1@@エーブラムズ が 働く 。@@@-1@@
1@@@@-1@-1@@エーブラムズ が 仕事 する 。@@@-1@@

Field  Name       Explanation                         Example Value
1:     i-id       Item for this output specification  integer
8:     o-surface  Expected surface string             string

All the fields are described in the relations file found in each skeleton.

It is possible to have multiple correct outputs (e.g., multiple reference translations).
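
As a sketch of how multiple reference outputs per item might be collected, the Python below groups the o-surface values (field 8) by i-id (field 1), using the two example lines above. The function name read_references is hypothetical, and escape sequences are not decoded.

```python
from collections import defaultdict


def read_references(lines):
    """Group expected surface strings (field 8) by item id (field 1)."""
    refs = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("@")
        # Fields are numbered from 1 in the table, from 0 in Python.
        refs[int(fields[0])].append(fields[7])
    return dict(refs)


lines = [
    "1@@@@-1@-1@@エーブラムズ が 働く 。@@@-1@@",
    "1@@@@-1@-1@@エーブラムズ が 仕事 する 。@@@-1@@",
]
references = read_references(lines)
```

Here references maps item 1 to a list holding both translations, in file order.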

Text Input Formats

This section describes obsolete versions of the format; a revision is coming as soon as possible.

Plain Text Input Format

This file format allows you to record more information about the text in an easy-to-manipulate format:

[1010] +The Cathedral and the Bazaar
[1030] Linux is subversive.

The file format is:

        [1-2 |] Preikestolen

[incr tsdb()] can use such a file to make a profile, which contains an item file, as described below.

Bi-text Import Format

The Cathedral and the Bazaar
伽藍 と バザール
伽藍 と 勧工場

Linux is subversive.
リナックス は 、 既存 の 概念 を 打ち 砕く もの で ある 。

This can be used to create a profile (containing an item and output file) as described below.
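
Judging only from the example above, the layout appears to be blank-line-separated blocks, each with the source sentence on the first line and one translation on each following line. The Python sketch below reads a file under that assumption; it is not derived from the [incr tsdb()] importer, and the function name parse_bitext is hypothetical.

```python
def parse_bitext(text):
    """Return a list of (source, [translations]) pairs.

    Assumes blank-line-separated blocks: the first line of a block is the
    source sentence, the remaining lines are its translations.
    """
    entries = []
    block = []
    # The extra empty string flushes the last block at end of input.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            entries.append((block[0], block[1:]))
            block = []
    return entries


sample = """The Cathedral and the Bazaar
伽藍 と バザール
伽藍 と 勧工場

Linux is subversive.
リナックス は 、 既存 の 概念 を 打ち 砕く もの で ある 。
"""
entries = parse_bitext(sample)
```

With the sample text, entries holds two pairs: the first has two translations and the second has one.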

Well Formedness (i-wf)

Value  Meaning
0      Illformed (Ungrammatical)
1      Wellformed (Grammatical)
2      Ignored


The grammaticality judgements can be used to measure lack of coverage (well-formed items that receive no analysis) and overgeneration (ill-formed items that do receive an analysis).
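
As a rough sketch of how the two measures could be computed, the Python below takes (i-wf, readings) pairs, where readings is a hypothetical count of analyses found for the item; how that count is obtained from a profile is not covered here. Items with i-wf 2 are skipped.

```python
def coverage_and_overgeneration(items):
    """Compute (coverage, overgeneration) from (i_wf, readings) pairs.

    coverage: share of well-formed items (i-wf 1) with at least one reading.
    overgeneration: share of ill-formed items (i-wf 0) with at least one
    reading. A measure is None when there are no items of that kind.
    """
    wf_total = wf_parsed = ill_total = ill_parsed = 0
    for i_wf, readings in items:
        if i_wf == 1:
            wf_total += 1
            wf_parsed += readings > 0
        elif i_wf == 0:
            ill_total += 1
            ill_parsed += readings > 0
        # Any other value (e.g. 2, "Ignored") is left out of both measures.
    coverage = wf_parsed / wf_total if wf_total else None
    overgeneration = ill_parsed / ill_total if ill_total else None
    return coverage, overgeneration


items = [(1, 2), (1, 0), (1, 1), (1, 3), (0, 0), (0, 1), (2, 5)]
coverage, overgeneration = coverage_and_overgeneration(items)
```

Lack of coverage is then 1 minus the coverage value.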

How to make a new Skeleton

  ((:path . "newtest") (:content . "example test suite"))

How to make a Profile from a text input file

   file > import > Test items
   in-path/newtest
   out-dir/newtest

This makes a profile (out-dir/newtest) with an item file (with default values for the fields, and numbering starting from 0 or 1) and a relations file.

   file > import > Bi-text Items
   in-path/newtest
   out-dir/newtest

If there are translations, then it also makes an output file. This is useful for automatically scoring machine translation.

ItsdbReference (last edited 2013-03-20 23:56:48 by MichaelGoodman)
