[osis-core] Pointer syntax
Steven DeRose
osis-core@bibletechnologieswg.org
Thu, 9 May 2002 07:43:59 -0400
I got too involved trying to write this up..... I was just trying to
write regexes for Patrick. This is pretty much what I think we said
in Rome, except the first part (on works), which got messy and I've
added some speculations....
Pointer syntax from Rome meeting
(as well as I can remember)
We have 3 main parts to a pointer:
1) The 'work', which identifies a whole document such as a Bible
version, a work of literature such as a play or history, a reference
work such as a commentary or lexicon, etc.
2) The 'refname', which identifies a named portion of the work using
some (more-or-less standardized) reference scheme. The references
used are expected to appear in the work, identified as such on
elements such as book/chapter/verse, or on generalized divs. For
example, "Mat.1.1"
3) The 'grain', which identifies a more precise location than can be
done via standardized references. Since grains are used to point to
places that have no formal reference name, they use simplistic
algorithms (such as counting characters) that software can apply
without knowing much about the data.
A work includes data known as "inRefs" to identify named locations
within itself.
A work identifies itself with its full work name in the header.
A work identifies available refnames within it via an attribute on
units such as divs, chapters, verses, etc. If a particular location
covers multiple refnames (such as often occurs in less-literal
translations; say, a paragraph that covers John 3:16-18 but cannot be
clearly divided into 3 parts), then that location must encode all the
applicable refnames, not a range. Ranges may only be specified on
references.
A work does not identify any grains explicitly; they are counted
mechanically if needed when a reference is interpreted.
All works can contain 'outRefs' to anywhere: elsewhere in themselves,
other works, or (though it would be kind of pointless) back to the
same location in themselves.
An outRef logically consists of a work, and specifications for a pair
of locations in that work, each consisting of an inRef value that can
be found in that work, and a grain. The referenced location runs from
the start of the location specified by the first pair, to the end of
the location specified by the second pair.
Commonly the work would be defaulted, and commonly the grains would
be left out, meaning the reference is to the entire named location,
not to a point or span within it. Very commonly the entire second
pair will be omitted, meaning it is the same as the first pair.
The refLocs
-----------
We need to identify a reference system; beyond that, because the way
works are divided differs, we can't say much more than that a refLoc
is a bunch of dot-separated tokens. For the Bible we further specify
the scheme as
Book.chapter.verse
Books would be NMTOKENs except that we want them to be able to start
with digits, so they are [0-9a-zA-Z]+. I think we might as well make
all the tokens within refLocs fit that. Note I haven't included any
punctuation; we want '.' as field separator, and maybe '-' as range
separator?
So in grammar:
refLoc ::= (refSys ':')? token ('.' token)*
refSys ::= token
token ::= tokenchar+
tokenchar ::= XML 'namechar' minus '.' and '-'
Or we could limit ourselves to Latin-1 for tokenchars for the moment;
I'd rather not.
The regex (I'm using \w to mean word-characters, I don't remember
which char it actually is in schema regexes, or how the escaping of
groups goes...):
(\w+:)?\w+(\.\w+)*
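A minimal sketch of the refLoc grammar in Python, for anyone who wants to try it out. It assumes the ASCII-only token set [0-9a-zA-Z]+ mentioned above rather than the full XML namechar set, and the function name 'parse_refloc' is made up for illustration.

```python
import re

# ASCII-only approximation of the grammar: tokens are [0-9A-Za-z]+,
# '.' separates tokens, and an optional "refSys:" prefix names the scheme.
TOKEN = r"[0-9A-Za-z]+"
REFLOC_RE = re.compile(r"(?:(%s):)?(%s(?:\.%s)*)" % (TOKEN, TOKEN, TOKEN))

def parse_refloc(text):
    """Return (ref_sys, tokens); raise ValueError for a malformed refLoc."""
    m = REFLOC_RE.fullmatch(text)
    if m is None:
        raise ValueError("not a refLoc: %r" % text)
    ref_sys, body = m.groups()
    return ref_sys, body.split(".")

print(parse_refloc("Mat.1.1"))
print(parse_refloc("KJV:Matt.1.3"))
```

The reference system comes back as None when the prefix is omitted, so the caller can apply whatever default is in force.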
The Grains
----------
The grains proposed have been:
char:n which counts Unicode code points in normalized form (there's
some appropriate citation to the Unicode spec for this -- basically
it ensures that precomposed and postcomposed characters come out the
same).
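To illustrate why normalization matters for char:n, here is a small Python sketch. The note above does not say which normalization form is meant; NFC is assumed here, as is a 0-based offset, and 'normalized_chars' is a hypothetical helper name.

```python
import unicodedata

def normalized_chars(text, form="NFC"):
    """Return the code points of `text` after Unicode normalization."""
    return list(unicodedata.normalize(form, text))

precomposed = "caf\u00e9 logos"   # e-acute stored as one code point
decomposed = "cafe\u0301 logos"   # 'e' followed by a combining acute accent

# The raw strings differ in length, so naive counting would disagree...
assert len(precomposed) != len(decomposed)
# ...but after normalization both spellings give the same sequence,
# so the same offset lands on the same character in either one.
assert normalized_chars(precomposed) == normalized_chars(decomposed)
print(normalized_chars(precomposed)[5], normalized_chars(decomposed)[5])
```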
token:n which counts tokens separated by runs of XML-defined
whitespace (non-terminal S). This is not so useful, especially in
(mainly Eastern) languages that don't use whitespace for separators.
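A sketch of token counting under the XML definition of whitespace, where S is any run of space, tab, carriage return, or line feed. The helper names and the 0-based index are assumptions. Python's own str.split() is deliberately not used, since it also splits on other Unicode whitespace such as the no-break space.

```python
import re

# XML production S: one or more of #x20, #x9, #xD, #xA.
XML_S = re.compile(r"[\x20\x09\x0D\x0A]+")

def xml_tokens(text):
    """Split text on runs of XML whitespace, dropping empty edge tokens."""
    return [t for t in XML_S.split(text) if t]

def token_at(text, n):
    """Return the n-th token (0-based, an assumption)."""
    return xml_tokens(text)[n]

print(xml_tokens("then they  went\n to the house"))
print(token_at("then they  went\n to the house", 2))
```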
string:s which finds the first match of the specified string s. This
has the advantage of nicer re-attachment after editing. Giving *just*
a string is not completely functional, though. For example there
would be no way to point to the second 'then' in:
then they went to the house, yes, then
In most cases you can just give a longer string to the right to
disambiguate -- but in this case there is none to give. The usual
solution is (like XPointer) to provide arguments for instance number,
or for an offset and length to take from the start of the match.
Perhaps we could simplify by saying you always use offset, but can
append as much string as you like thereafter, which serves solely as
a redundant double-check, and a way to try to re-attach in case of
breakage. I think that's a pretty nice compromise.
Thus:
char:284(logos)
With only one grain type for now, we could pitch the grain-type
prefix ('char:').
String matching should be case-sensitive. We can either punt on
allowing close paren in the string, or define escaping, or define a
wildcard character (say '.') so if they're stuck they can at least
get a longer match without the ')'. I think I like this -- just state
that '.' is a wildcard.
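Here is one way the offset-plus-string check could work, as a Python sketch. Matching is case-sensitive and '.' in the string matches any single character. The re-attachment rule (pick the match nearest the stale offset) is purely an assumption, since nothing above fixes a strategy, and both function names are invented.

```python
def hint_matches(text, offset, hint):
    """True if `hint` matches `text` at `offset`; '.' matches any one char."""
    segment = text[offset:offset + len(hint)]
    if len(segment) != len(hint):
        return False
    return all(h == "." or h == c for h, c in zip(hint, segment))

def reattach(text, offset, hint):
    """Return `offset` if the hint still matches there; otherwise the
    matching position nearest to it, or None if the hint matches nowhere."""
    if hint_matches(text, offset, hint):
        return offset
    candidates = [i for i in range(len(text) - len(hint) + 1)
                  if hint_matches(text, i, hint)]
    if not candidates:
        return None
    return min(candidates, key=lambda i: abs(i - offset))

text = "then they went to the house, yes, then"
second = text.rindex("then")           # offset of the second 'then'
edited = "And " + text                 # an edit shifts everything right
print(reattach(edited, second, "then") == second + 4)
```

The offset alone is enough to pick out the second 'then'; the string only comes into play when the offset no longer lines up.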
For length of match, we could have another parameter or assume the
length of the string. Except the string is optional, and anyway you
might want to point to some part of the string, or the point before
it, or...
So let's say:
char:284+5(logos)
Or in grammar:
grain ::= 'char:' offset length? content?
offset ::= integer
length ::= '+' integer
content ::= '(' [^)]+ ')'
integer ::= [0-9]+
Regex:
char:\d+(\+\d+)?(\([^)]+\))?
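The grain grammar translates fairly directly into Python. This is only a sketch: 'parse_grain' is a made-up name, and [0-9] is used instead of \d so that only ASCII digits are accepted, as in the grammar.

```python
import re

# grain ::= 'char:' offset length? content?
GRAIN_RE = re.compile(r"char:([0-9]+)(?:\+([0-9]+))?(?:\(([^)]+)\))?")

def parse_grain(text):
    """Return (offset, length, content); length and content may be None."""
    m = GRAIN_RE.fullmatch(text)
    if m is None:
        raise ValueError("not a grain: %r" % text)
    offset, length, content = m.groups()
    return int(offset), (int(length) if length is not None else None), content

print(parse_grain("char:284+5(logos)"))
print(parse_grain("char:284(logos)"))
print(parse_grain("char:284"))
```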
Components of 'work'
--------------------
Going to Rome we had:
work ::= Author? Title Version?
Author is mainly optional because of the Bible, for which no author
is typically cited. Version can specify a named translation such as
NIV or TEV, and is optional so that an outRef may refer to the work
in the abstract, allowing use of any available version.
At Rome it was (I think correctly) pointed out that language,
dialect, and edition should all be distinguished. For example, one
may well want to refer to any English version, or to any Greek
version; likewise one might want to refer specifically to the NIV
edition of 2001 (major translations may be re-released every few
years).
One subtle point is that it may be worth distinguishing language from
dialect. For example, one might want the American English NIV, but
failing that it would be better to fall back to the British NIV than
to the American English RSV.
Given all this, the entire tuple for work would be:
Author Title Language Version Dialect Edition
For example:
Bible English NIV British 1999
or
Herodotus Histories English Loeb American 1960
The language and dialect seem better kept together for intuitiveness,
even though a (perhaps?) likely sort algorithm will deal with them in
another order.
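The fallback preference could be sketched like this in Python. The field priority follows the tuple order given above, with dialect ranked below version; treating an unspecified field as "matches anything" is an assumption, and so are the names 'Work', 'match_score' and 'best_match'. The 1952 edition value for the RSV is just sample data.

```python
from collections import namedtuple

Work = namedtuple("Work", "author title language version dialect edition")

# Fields compared most-significant first; dialect ranks below version.
PRIORITY = ("author", "title", "language", "version", "dialect", "edition")

def match_score(wanted, candidate):
    """One boolean per field; None in `wanted` means any value will do."""
    return tuple(
        getattr(wanted, f) is None or getattr(wanted, f) == getattr(candidate, f)
        for f in PRIORITY
    )

def best_match(wanted, available):
    """Pick the candidate whose score tuple sorts highest."""
    return max(available, key=lambda c: match_score(wanted, c))

wanted = Work(None, "Bible", "English", "NIV", "American", None)
british_niv = Work(None, "Bible", "English", "NIV", "British", "1999")
american_rsv = Work(None, "Bible", "English", "RSV", "American", "1952")

print(best_match(wanted, [american_rsv, british_niv]).dialect)
```

Because score tuples compare field by field, a version match outweighs a dialect match, which gives the British NIV over the American RSV. The sketch only ranks candidates; it does not reject ones that miss on title or language.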
The easy way to do this is just to call it a bunch of tokens; but we
need to know which fields we have since most are optional. This gets
messy without markup, so packing this all into an attribute or onto
the end of a URL is a pain. For packing, we could distinguish each
field by separate delimiters, use the same delimiter and require it
even when the field is missing, or use a syntax with names. Basically:
Herodotus.Joe:Histories|EN-US^NIV@1999
Herodotus.Joe|Histories|EN-US|NIV|1999
AU=Herodotus.Joe TI=Histories LA=EN-US VER=NIV ED=1999
AU:Herodotus.Joe TI:Histories LA:EN-US VER:NIV ED:1999
Most of this will be inherited or entirely omitted most of the time,
so the ugliness or verbosity isn't quite as bad as it seems.
Delimiter-only systems like the first 2 seem too confusing and
hard to remember (I couldn't even come up with reasons to choose
particular delimiters). The third is clear (I'm not particular about
the particular pseudo-attribute names), but unless we quote the
values it looks too much like attribute syntax without actually
*being* attributes. So I favor the namespace-like approach of the
last one.
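A sketch of how little parser the name-prefixed form needs, in Python. The prefix table uses the placeholder names from the examples above (AU, TI, LA, VER, ED), which are not settled, and 'parse_work' is a made-up function name. Values are assumed to contain no whitespace.

```python
# Placeholder prefixes taken from the examples; the real names are undecided.
FIELDS = {
    "AU": "author",
    "TI": "title",
    "LA": "language",
    "VER": "version",
    "ED": "edition",
}

def parse_work(text):
    """Parse 'AU:x TI:y ...' into a dict; any field may be left out."""
    work = {}
    for part in text.split():
        prefix, sep, value = part.partition(":")
        if not sep or prefix not in FIELDS or not value:
            raise ValueError("bad work field: %r" % part)
        name = FIELDS[prefix]
        if name in work:
            raise ValueError("duplicate field: %r" % prefix)
        work[name] = value
    return work

print(parse_work("AU:Herodotus.Joe TI:Histories LA:EN-US VER:NIV ED:1999"))
print(parse_work("TI:Bible VER:NIV"))
```

Since every field carries its own name, omitted fields need no placeholder and the order does not matter.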
Or, we could give up and make these all separate attributes. As I
write this, it strikes me that that might not be such a bad idea; for
inRefs they'd only go in one place in the header; so the whole set
would only go on outRefs.
On outRefs (mainly just the reference element) we can just have the
whole set too; not soooo terrible, since many can be omitted; and
that much less parser for anyone to write. Then we define a simple
mapping to munge them into a URI fragment identifier, and into a
speakable form -- that can be a separate deliverable that comes
slightly later.
Maybe we've been trying to solve 2 problems at once: specifying the
whole structure of a canonical reference, and where to put it in the
syntax. What if we introduced a workDcl in the header, where you must
give all this stuff in full form, and give a local key to it? The
further we go with OSIS, the more I think a TEI-style declaration like
this is critical, and we need to just take it a lot farther than TEI did.
Ahhhh --- this mapping, from short standard keys to works, is what
our name-declaration files do. We'd been thinking of them as only
doing author/title, but they could do more. Oh yeah, and that reminds
me I left out a field: volume and issue number for journals (year
isn't enough).
I think split 'em up and make 'em declare 'em in the header. So all
you validate for work in the references themselves is NMTOKEN.
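The header-declaration idea amounts to a lookup table from short keys to full work descriptions. A Python sketch, with everything in it hypothetical: the keys, the field names, the table itself, and an ASCII-only stand-in for NMTOKEN (the real production allows many more characters).

```python
import re

# ASCII-only stand-in for XML NMTOKEN; an approximation, not the real thing.
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")

# What a header's work declarations might boil down to (sample data only).
WORK_DECLARATIONS = {
    "NIV": {"title": "Bible", "language": "English", "version": "NIV",
            "dialect": "British", "edition": "1999"},
    "Hdt": {"author": "Herodotus", "title": "Histories", "language": "English",
            "version": "Loeb", "dialect": "American", "edition": "1960"},
}

def resolve_work(key, declarations):
    """Validate the key's syntax, then look up its full declaration."""
    if NMTOKEN_RE.fullmatch(key) is None:
        raise ValueError("work key is not an NMTOKEN: %r" % key)
    if key not in declarations:
        raise KeyError("work %r is not declared in the header" % key)
    return declarations[key]

print(resolve_work("NIV", WORK_DECLARATIONS)["edition"])
```

References then carry only the short key, and all the optional-field messiness stays in one place.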
Putting them all together
-------------------------
At the top level we need a separator for the 3 parts, and a way to
combine to make a range.
Trying some random delimiters, we get:
outRef ::= ( work '/' )? inWorkLoc
inWorkLoc ::= loc ( '--' loc)?
loc ::= refLoc ('@' grain)?
So a full-blown one would be (punting the internals of work for the moment):
Bible.EN.NIV.US.1999/KJV:Matt.1.1@char:5(word)--KJV:Matt.1.3@char:24(faith)
This really seems to be screaming to be in markup.....
A more typical one in reality would be:
NIV/Matt.1.1
or even just
Matt.1.1
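Putting the three grammars together, a whole outRef can be picked apart with one regex. This Python sketch makes several assumptions: ASCII-only tokens, an ASCII approximation of NMTOKEN for the work, every grain written out in the '@char:' form the grammar requires, and no '/', '@' or '--' ambiguity beyond what the grammar itself allows. The names 'parse_outref' and 'split_loc' are invented. An omitted second loc defaults to the first, and an omitted work comes back as None so it can be inherited.

```python
import re

TOKEN = r"[0-9A-Za-z]+"
REFLOC = r"(?:%s:)?%s(?:\.%s)*" % (TOKEN, TOKEN, TOKEN)
GRAIN = r"char:[0-9]+(?:\+[0-9]+)?(?:\([^)]+\))?"
LOC = r"%s(?:@%s)?" % (REFLOC, GRAIN)
WORK = r"[0-9A-Za-z._:-]+"          # ASCII stand-in for NMTOKEN

# outRef ::= ( work '/' )? loc ( '--' loc )?
OUTREF_RE = re.compile(r"(?:(%s)/)?(%s)(?:--(%s))?" % (WORK, LOC, LOC))

def split_loc(loc):
    """Split 'refLoc@grain' into (refLoc, grain or None)."""
    ref_loc, _, grain = loc.partition("@")
    return ref_loc, (grain or None)

def parse_outref(text):
    """Return a dict with 'work', 'start' and 'end' entries."""
    m = OUTREF_RE.fullmatch(text)
    if m is None:
        raise ValueError("not an outRef: %r" % text)
    work, start, end = m.groups()
    if end is None:
        end = start                  # second pair omitted: same as the first
    return {"work": work, "start": split_loc(start), "end": split_loc(end)}

print(parse_outref(
    "Bible.EN.NIV.US.1999/KJV:Matt.1.1@char:5(word)--KJV:Matt.1.3@char:24(faith)"))
print(parse_outref("NIV/Matt.1.1"))
print(parse_outref("Matt.1.1"))
```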
--
Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sderose@mac.com, sjd@stg.brown.edu