[osis-editors] OSIS reference grains

Wed Mar 23 06:42:11 MST 2005

I am resubmitting the larger part of a message which I sent to you on
13th February 2004, and to which I received no reply. These were
comments on the OSIS 2.0.1 draft. The OSIS 2.1 draft has now been
brought to my attention. I see that in this draft there has been some
clarification of the concept of grains, but that my concerns about
unusability for words repeated within a verse have not been addressed.

I am copying this to Kirk Lowery, because I see that he has been
involved in these aspects of OSIS, and I would expect him to have a
similar need for word level pointers into the text.

I am in fact no longer working directly on this for the KTBH project,
but retain an advisory role there. The current version of the
Unicode document on text boundaries is at
http://www.unicode.org/reports/tr29/, updated from the URL given below.

=========

I am working on the KTBH project (see http://www.ktbh-team.org/), in
which we are basically developing a database of lexical information on
Hebrew words including references to their occurrences in the Hebrew
Bible. These stored references include word level and morpheme level
pointers into the Hebrew text. We would like to convert this data into
XML in a way which is compatible with OSIS, e.g. to use OSIS reference
formats. For this reason I have been looking at the OSIS reference
format including its concept of grains, which looks hopeful for
supporting references at the word level, if not at the morpheme level.

The idea of grains is promising, but still does not do quite what is
needed. The s type grain is useless because it will always match the
first occurrence of a string in a verse. (The format could however be
extended to allow matching of the nth occurrence of the string in a
verse, which would make it potentially useful.) The cp type grain is
more promising, but suffers from being ill-defined (in a Unicode
context), because whether an accented letter (e.g. e acute) counts as one
or two characters depends on the normalisation form (and the situation
becomes much more complicated with some complex scripts); such counting
should instead be in terms of the well-defined Unicode concept of a
grapheme cluster (see http://www.unicode.org/reports/tr29/tr29-5.html).
Also it is not clearly specified whether markup, additional white space,
new lines etc are to be counted here.
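
As an illustration of the two points above, here is a minimal Python
sketch. It is not part of OSIS or of any OSIS tool: the helper name
find_nth and the sample verse text are hypothetical, chosen only to show
what an "nth occurrence" string match could look like and how the code
point count of the same word changes with the normalisation form.

```python
import unicodedata


def find_nth(text, needle, n):
    """Return the start offset of the n-th (1-based) occurrence of
    needle in text, or -1 if there are fewer than n occurrences."""
    pos = -1
    for _ in range(n):
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


# Hypothetical verse text with a repeated word.
verse = "the light, and the light was good"
first = find_nth(verse, "light", 1)    # what a plain string match finds
second = find_nth(verse, "light", 2)   # only reachable with an occurrence index

# The same word has a different number of code points in NFC and NFD,
# so a code point offset is only meaningful if the form is fixed.
word = "caf\u00e9"  # e acute as a single precomposed code point
nfc_len = len(unicodedata.normalize("NFC", word))
nfd_len = len(unicodedata.normalize("NFD", word))
print(first, second, nfc_len, nfd_len)
```

Counting grapheme clusters instead of code points would need a
segmentation library, since the Python standard library does not
implement the TR29 rules.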

I would like to suggest an alternative type of grain, an additional
grain operator, which counts words, defined as in
http://www.unicode.org/reports/tr29/tr29-5.html, basically as sequences
of alphanumeric characters and/or certain other characters. (But some
tailoring of these rules may be necessary.) This is a widely useful
concept which is much easier to handle than either character counting or
string matching. Admittedly it is not useful in certain Asian scripts
which do not use spaces between words, but that is not a good reason to
disallow it for the majority of languages which do use spaces.
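
A rough sketch of word counting in Python follows. It is only an
approximation under stated assumptions: a "word" is taken here to be a
maximal run of letters, combining marks and digits, which is much
simpler than the TR29 word boundary rules and would need tailoring for
real texts. The function names words and nth_word, and the sample
verse, are hypothetical.

```python
import unicodedata


def words(text):
    """Split text into runs of letters (L*), marks (M*) and numbers (N*).
    This approximates, but does not implement, TR29 word boundaries."""
    result, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


def nth_word(text, n):
    """Return the n-th (1-based) word of text, or None if out of range."""
    found = words(text)
    return found[n - 1] if 1 <= n <= len(found) else None


verse = "In the beginning God created the heaven and the earth."
print(len(words(verse)))   # number of words in the verse
print(nth_word(verse, 6))  # one occurrence of a repeated word
print(nth_word(verse, 9))  # a later occurrence of the same word
```

Because combining marks are kept with the preceding letters, the count
does not depend on whether the text is in a composed or decomposed
normalisation form.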

There is a danger with all of these uses of grains that, although they
should be used with a specified prefix indicating a specific Bible text,
there may be small differences between different editions of the same
text (revisions, American vs. Anglicised, different encoding decisions
for complex scripts) which will throw out the matching process. The s
grain operator is most sensitive to this in that a match may fail
completely with a variant text. The cp grain operator is less sensitive;
in most cases it will land only a few characters away from the correct
place. My proposed word grain operator would be much less sensitive to
this, being thrown out only by revisions which change the number of
words in a verse.
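
The difference in sensitivity can be sketched in Python as follows.
The two strings are hypothetical stand-ins for an American and an
Anglicised edition of the same verse, and splitting on white space is
assumed as a crude substitute for a proper word definition.

```python
# Hypothetical variant texts differing only in one spelling.
us = "Honor thy father and thy mother"
uk = "Honour thy father and thy mother"


def nth_word_by_split(text, n):
    """Return the n-th (1-based) white-space-delimited word of text."""
    return text.split()[n - 1]


# Character offset of the second "thy", computed on the American text.
offset = us.index("thy", 10)

# Reusing that offset on the Anglicised text lands one character early.
cp_us = us[offset:offset + 3]
cp_uk = uk[offset:offset + 3]

# A word index computed on one text still selects the same word in the other.
word_us = nth_word_by_split(us, 5)
word_uk = nth_word_by_split(uk, 5)
print(repr(cp_us), repr(cp_uk), word_us, word_uk)
```

A revision that adds or removes a word before the target would of
course shift the word index as well.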

-- 
Peter Kirk
peter at qaya.org (personal)
peterkirk at qaya.org (work)
http://www.qaya.org/
