[osis-core] Pointer syntax

Thu, 9 May 2002 07:43:59 -0400

I got too involved trying to write this up..... I was just trying to 
write regexes for Patrick. This is pretty much what I think we said 
in Rome, except the first part (on works), which got messy and I've 
added some speculations....

Pointer syntax from Rome meeting
   (as well as I can remember)

We have 3 main parts to a pointer:

1) The 'work', which identifies a whole document such as a Bible 
version, a work of literature such as a play or history, a reference 
work such as a commentary or lexicon, etc.

2) The 'refname', which identifies a named portion of the work using 
some (more-or-less standardized) reference scheme. The references 
used are expected to appear in the work, identified as such on 
elements such as book/chapter/verse, or on generalized divs. For 
example, "Mat.1.1"

3) The 'grain', which identifies a more precise location than can be 
done via standardized references. Since grains are used to point to 
places that have no formal reference name, they use simplistic 
algorithms (such as counting characters) that software can apply 
without knowing much about the data.

A work includes data known as "inRefs" to identify named locations 
within itself.

A work identifies itself with its full work name in the header.

A work identifies available refnames within it via an attribute on 
units such as divs, chapters, verses, etc. If a particular location 
covers multiple refnames (such as often occurs in less-literal 
translations; say, a paragraph that covers John 3:16-18 but cannot be 
clearly divided into 3 parts), then that location must encode all the 
applicable refnames, not a range. Ranges may only be specified on 
references.
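
A rough Python sketch of what this rule buys software that has to look things up: if every location lists all of its refnames, an index from refname to location is a plain dictionary. The (id, refnames) pairs and the ids "p1" and "p2" are made up for illustration; nothing here is a decided encoding.

```python
def build_index(locations):
    """Map each refname to the ids of the locations that carry it.

    `locations` is a list of (location_id, [refname, ...]) pairs; this
    shape is an assumption for the sketch, not part of any schema.
    """
    index = {}
    for loc_id, refnames in locations:
        for name in refnames:
            index.setdefault(name, []).append(loc_id)
    return index


# A less-literal translation: one paragraph covers John 3:16-18, so it
# lists all three refnames rather than a range.
locations = [
    ("p1", ["John.3.14", "John.3.15"]),
    ("p2", ["John.3.16", "John.3.17", "John.3.18"]),
]
index = build_index(locations)
```

A reference to John.3.17 then finds the whole paragraph "p2" without the lookup code having to understand ranges.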

A work does not identify any grains explicitly; they are counted 
mechanically if needed when a reference is interpreted.

All works can contain 'outRefs' to anywhere: elsewhere in themselves, 
other works, or (though it would be kind of pointless) back to the 
same location in themselves.

An outRef logically consists of a work, and specifications for a pair 
of locations in that work, each consisting of an inRef value that can 
be found in that work, and a grain. The referenced location runs from 
the start of the location specified by the first pair, to the end of 
the location specified by the second pair.

Commonly the work would be defaulted, and commonly the grains would 
be left out, meaning the reference is to the entire named location, 
not to a point or span within it. Very commonly the entire second 
pair will be omitted, meaning it is the same as the first pair.

The refLocs
-----------

We need to identify a reference system; beyond that, because the way 
works are divided differs, we can't say much more than that a refLoc 
is a bunch of dot-separated tokens. For the Bible we further specify 
the scheme as

    Book.chapter.verse

Books would be NMTOKENs except that we want them to be able to start 
with digits, so they are [0-9a-zA-Z]+. I think we might as well make 
all the tokens within refLocs fit that. Note I haven't included any 
punctuation; we want '.' as field separator, and maybe '-' as range 
separator?

So in grammar:

    refLoc    ::=  (refSys ':')? token ('.' token)*
    refSys    ::=  token
    token     ::=  tokenchar+
    tokenchar ::=  XML 'namechar' minus '.' and '-'

Or we could limit ourselves to Latin-1 for tokenchars for the moment; 
I'd rather not.

The regex (I'm using \w to mean word characters; I don't remember 
exactly which characters it covers in schema regexes. Groups use 
plain parentheses):

    (\w+:)?\w+(\.\w+)*
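
As a quick way to try the refLoc grammar, here is a minimal Python sketch. It assumes Python's \w is close enough to 'tokenchar' (it excludes '.' and '-', but does allow '_' and non-Latin letters); the name parse_refloc is made up for the example.

```python
import re

# (refSys ':')? token ('.' token)*  with \w standing in for tokenchar
REFLOC_RE = re.compile(r"(?:(\w+):)?(\w+(?:\.\w+)*)")


def parse_refloc(text):
    """Return (refsys_or_None, [token, ...]); raise ValueError if malformed."""
    m = REFLOC_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a refLoc: {text!r}")
    refsys, body = m.groups()
    return refsys, body.split(".")
```

For example, parse_refloc("KJV:Matt.1.3") gives ("KJV", ["Matt", "1", "3"]), and a book token that starts with a digit, such as "1Cor.13.4", is accepted.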

The Grains
----------

The grains proposed have been:

char:n which counts Unicode code points in normalized form (there's 
some appropriate citation to the Unicode spec for this -- basically 
it ensures that precomposed and postcomposed characters come out the 
same).

token:n which counts tokens separated by runs of XML-defined 
whitespace (non-terminal S). This is not so useful, especially in 
(mainly Eastern) languages that don't use whitespace as a separator.

string:s which finds the first match of the specified string s. This 
has the advantage of nicer re-attachment after editing. Giving *just* 
a string is not completely functional, though. For example there 
would be no way to point to the second 'then' in:

    then they went to the house, yes, then

In most cases you can just give a longer string to the right to 
disambiguate -- but in this case there is none to give. The usual 
solution is (like XPointer) to provide arguments for instance number, 
or for an offset and length to take from the start of the match.

Perhaps we could simplify by saying you always use offset, but can 
append as much string as you like thereafter, which serves solely as 
a redundant double-check, and a way to try to re-attach in case of 
breakage. I think that's a pretty nice compromise.

Thus:

    char:284(logos)

With only one grain type for now, we could pitch the grain-type 
prefix ('char:').
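
Here is a minimal Python sketch of the offset-plus-check idea, including the re-attachment fallback. Several things are my assumptions, not anything we decided: offsets count from 0, the normalization form is NFC, and re-attachment picks the occurrence nearest the stale offset. The function name is made up.

```python
import unicodedata


def resolve_char_grain(text, offset, check=None):
    """Return the code-point offset the grain points at, or None.

    `check` is the optional trailing string, used only as a redundant
    double-check and as a way to re-attach after the text was edited.
    """
    text = unicodedata.normalize("NFC", text)  # assumed normalization form
    if check is None:
        return offset if 0 <= offset <= len(text) else None
    if offset >= 0 and text.startswith(check, offset):
        return offset  # the double-check passed
    # Re-attach: take the occurrence of `check` nearest the stale offset.
    hits = [i for i in range(len(text)) if text.startswith(check, i)]
    return min(hits, key=lambda i: abs(i - offset)) if hits else None


sample = "then they went to the house, yes, then"
```

With the sample sentence, offset 34 plus the check string 'then' points at the second 'then', which a bare string match could not do; a slightly stale offset such as 30 still re-attaches to it.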

String matching should be case-sensitive. We can either punt on 
allowing close paren in the string, or define escaping, or define a 
wildcard character (say '.') so if they're stuck they can at least 
get a longer match without the ')'. I think I like this -- just state 
that '.' is a wildcard.
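
A small Python sketch of that matching rule, assuming '.' stands for exactly one character and everything else must match exactly, case included. The function name and the sample text are invented for the example.

```python
def content_matches(text, offset, pattern):
    """Case-sensitive match of `pattern` at `offset`; '.' matches any one char."""
    candidate = text[offset:offset + len(pattern)]
    if len(candidate) != len(pattern):
        return False  # pattern runs past the end of the text
    return all(p == "." or p == c for p, c in zip(pattern, candidate))


sample = "say (amen) now"
```

Here content_matches(sample, 5, "amen. now") is true: the '.' covers the ')' that could not be written inside the parenthesized string.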

For length of match, we could have another parameter or assume the 
length of the string. Except the string is optional, and anyway you 
might want to point to some part of the string, or the point before 
it, or...

So let's say:

    char:284+5(logos)

Or in grammar:

    grain  ::=  'char:' offset length? content?
    offset  ::=  integer
    length  ::=  '+' integer
    content ::= '(' [^)]+ ')'
    integer ::= [0-9]+

Regex:

    char:\d+(\+\d+)?(\([^)]+\))?
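
To check the grain grammar against the examples, a minimal Python sketch follows. It takes the grammar literally (offset required, length and content optional, content is anything but a close paren); parse_grain is a made-up name.

```python
import re

# 'char:' offset length? content?
GRAIN_RE = re.compile(r"char:(\d+)(?:\+(\d+))?(?:\(([^)]+)\))?")


def parse_grain(text):
    """Return (offset, length_or_None, content_or_None) for a grain string."""
    m = GRAIN_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a grain: {text!r}")
    offset, length, content = m.groups()
    return int(offset), (int(length) if length is not None else None), content
```

So parse_grain("char:284+5(logos)") gives (284, 5, "logos"), and parse_grain("char:284") gives (284, None, None).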

Components of 'work'
--------------------

Going to Rome we had:

    work   ::= Author? Title Version?

Author is mainly optional because of the Bible, for which no author 
is typically cited. Version can specify a named translation such as 
NIV or TEV, and is optional so that an outRef may refer to the work 
in the abstract, allowing use of any available version.

At Rome it was (I think correctly) pointed out that language, 
dialect, and edition should all be distinguished. For example, one 
may well want to refer to any English version, or to any Greek 
version; likewise one might want to refer specifically to the NIV 
edition of 2001 (major translations may be re-released every few 
years).

One subtle point is that it may be worth distinguishing language from 
dialect. For example, one might want the American English NIV, but 
failing that it would be better to fall back to the British NIV than 
to the American English RSV.

Given all this, the entire tuple for work would be:

Author Title Language Version Dialect Edition

For example:

    Bible English NIV British 1999

or

    Herodotus Histories English Loeb American 1960

The language and dialect seem better kept together for intuitiveness, 
even though a (perhaps?) likely sort algorithm will deal with them in 
another order.

The easy way to do this is just to call it a bunch of tokens; but we 
need to know which fields we have since most are optional. This gets 
messy without markup, so packing this all into an attribute or onto 
the end of a URL is a pain. For packing, we could distinguish each 
field by separate delimiters, use the same delimiter and require it 
even when the field is missing, or use a syntax with names. Basically:

    Herodotus.Joe:Histories|EN-US^NIV@1999

    Herodotus.Joe|Histories|EN-US|NIV|1999

    AU=Herodotus.Joe TI=Histories LA=EN-US VER=NIV ED=1999

    AU:Herodotus.Joe TI:Histories LA:EN-US VER:NIV ED:1999

Most of this will be inherited or entirely omitted most of the time, 
so the ugliness or verbosity isn't quite as bad as it seems. 
Delimiter-only systems like the first 2 seem too confusing and 
hard to remember (I couldn't even come up with reasons to choose 
particular delimiters). The third is clear (I'm not attached to 
the specific pseudo-attribute names), but unless we quote the 
values it looks too much like attribute syntax without actually 
*being* attributes. So I favor the namespace-like approach of the 
last one.
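
For what it's worth, the namespace-like form is cheap to parse. A minimal Python sketch, assuming fields are separated by whitespace, values contain no spaces, and the prefixes are exactly the five placeholder names used above (which are not settled):

```python
WORK_FIELDS = ("AU", "TI", "LA", "VER", "ED")  # placeholder prefixes only


def parse_work(text):
    """Split 'AU:x TI:y ...' into a dict; every field is optional."""
    fields = {}
    for item in text.split():
        name, sep, value = item.partition(":")
        if not sep or not value or name not in WORK_FIELDS:
            raise ValueError(f"bad work field: {item!r}")
        if name in fields:
            raise ValueError(f"repeated work field: {name}")
        fields[name] = value
    return fields
```

Omitted fields simply don't appear in the result, so parse_work("TI:Bible VER:NIV") gives {"TI": "Bible", "VER": "NIV"}.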

Or, we could give up and make these all separate attributes. As I 
write this, it strikes me that that might not be such a bad idea; for 
inRefs they'd only go in one place in the header; so the whole set 
would only go on outRefs.

On outRefs (mainly just the reference element) we can just have the 
whole set too; not soooo terrible, since many can be omitted; and 
that much less parser for anyone to write. Then we define a simple 
mapping to munge them into a URI fragment identifier, and into a 
speakable form -- that can be a separate deliverable that comes 
slightly later.

Maybe we've been trying to solve 2 problems at once: specifying the 
whole structure of a canonical reference, and where to put it in the 
syntax. What if we introduced a workDcl in the header, where you must 
give all this stuff in full form, and give a local key to it? The 
further we go with OSIS, the more I think the TEI notion like this is 
critical, and we need to just take it a lot farther than TEI did.

Ahhhh --- this mapping, from short standard keys to works, is what 
our name-declaration files do. We'd been thinking of them as only 
doing author/title, but they could do more. Oh yeah, and that reminds 
me I left out a field: volume and issue number for journals (year 
isn't enough).

I think split 'em up and make 'em declare 'em in the header. So all 
you validate for work in the references themselves is NMTOKEN.
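
A minimal Python sketch of that arrangement: the header declares each work in full under a short local key, and a reference carries only the key. The class name WorkDcl, the field names, and the keys "NIV" and "Hdt" are illustrative assumptions; the sample values are the two work examples from earlier in this note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkDcl:
    """One full work declaration, as it might appear once in the header."""
    title: str
    author: Optional[str] = None
    language: Optional[str] = None
    version: Optional[str] = None
    dialect: Optional[str] = None
    edition: Optional[str] = None


# Hypothetical header: local key -> full declaration.
WORKS = {
    "NIV": WorkDcl(title="Bible", language="English", version="NIV",
                   dialect="British", edition="1999"),
    "Hdt": WorkDcl(author="Herodotus", title="Histories", language="English",
                   version="Loeb", dialect="American", edition="1960"),
}


def resolve_work(key):
    """Look up the declaration behind a work key used in a reference."""
    try:
        return WORKS[key]
    except KeyError:
        raise ValueError(f"undeclared work key: {key!r}") from None
```

The reference syntax then never needs to parse author, title, and so on; it only needs the key to be a valid token and to have been declared.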

Putting them all together
-------------------------

At the top level we need a separator for the 3 parts, and a way to 
combine to make a range.

Trying some random delimiters, we get:

    outRef    ::=   ( work '/' )?  inWorkLoc
    inWorkLoc ::=   loc ( '--' loc)?
    loc       ::=   refLoc ('@' grain)?

So a full-blown one would be (punting the internals of work for the moment):

Bible.EN.NIV.US.1999/KJV:Matt.1.1@char:5(word)--KJV:Matt.1.3@char:24(faith)

This really seems to be screaming to be in markup.....

A more typical one in reality would be:

    NIV/Matt.1.1

or even just

    Matt.1.1
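
To see whether the pieces fit together, here is a minimal Python sketch of the whole outRef grammar with the random delimiters above ('/', '--', '@'). It assumes \w for tokenchar, treats the work as an opaque run of dot-separated tokens, and expects '@char:' on every grain, as the loc and grain productions require. When the second loc is omitted it is taken to be the same as the first. All names in the code are made up.

```python
import re

TOKEN = r"\w+"
REFLOC = rf"(?:{TOKEN}:)?{TOKEN}(?:\.{TOKEN})*"
GRAIN = r"char:\d+(?:\+\d+)?(?:\([^)]+\))?"


def _loc(n):
    # loc ::= refLoc ('@' grain)?   with named groups ref<n> / grain<n>
    return rf"(?P<ref{n}>{REFLOC})(?:@(?P<grain{n}>{GRAIN}))?"


# outRef ::= (work '/')? loc ('--' loc)?
OUTREF_RE = re.compile(rf"(?:(?P<work>[\w.]+)/)?{_loc(1)}(?:--{_loc(2)})?")


def parse_outref(text):
    """Return {'work', 'start', 'end'}; start/end are (refLoc, grain) pairs."""
    m = OUTREF_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not an outRef: {text!r}")
    start = (m["ref1"], m["grain1"])
    end = (m["ref2"], m["grain2"]) if m["ref2"] is not None else start
    return {"work": m["work"], "start": start, "end": end}
```

parse_outref("NIV/Matt.1.1") gives work "NIV" with start and end both ("Matt.1.1", None); the bare "Matt.1.1" gives the same with work None, leaving the work to be defaulted by the caller.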

-- 

Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sderose@mac.com, sjd@stg.brown.edu