[osis-core] Free-range grain-fed sacred cows
Steve DeRose
osis-core@bibletechnologieswg.org
Tue, 30 Jul 2002 13:13:57 -0400
At 11:23 AM -0700 07/22/02, Troy A. Griffitts wrote:
>>>self-identify some of the more contemporary texts that versify by
>>>paragraphs. e.g. "This paragraph is Mark 1:1-9"
>>>
>>
>>
>>In this case the "1-9" becomes a verse name in the current reference
>>system in same way that "4" is a verse name in "Gen.1.4". Other options
>>were discussed in recent posts.
>
>This seems useless, unless self referencing in the same document.
>If I have a Bible such as this installed in some Bible software
>application, and I have a commentary that has a <reference> to
>Mark.1.7, I would hope my Bible would jump to the "This paragraph is
>Mark 1:1-9" paragraph. I don't think we decided how to do this.
>
>In Dallas, we decided to force these types of Bibles to have
>multiple milestone starts, so we could still, easily do a
>string-match reference resolution system.
>
>e.g.
><verseStart ref="Mark.1.1" />
><verseStart ref="Mark.1.2" />
>...
>
>
>Now that we're using containers, I'm not sure how we've decided to
>allow this. I still think it's not a trivial jump we're making if
>we decide to allow ranges. I'm not necessarily against it, but am
>concerned about the complexities introduced. The multiple milestart
>start solution was brainless and made for easy implementation. I
>could write XPath to resolve to any versification reference that
>this Bible claims to implement. In the range solution, this is no
>longer true.
I'm also still pretty nervous about using ranges there.
It does force extra work on software, and the algorithm doesn't seem
entirely obvious to me. Why not just put the individual markers all
in? It does put a burden on those editions that only mark by
paragraphs (or whatever); but that burden can be automated by a
utility that expands the markup for them before they release the text
as being in OSIS; those with smart software won't even have to notice
(they type in whatever they want and it expands underneath). Such a
utility is much simpler than the range-intersection algorithm, *and*
it only has to be implemented once, rather than implemented within
every separate piece of OSIS-supporting software (editing,
typesetting, retrieval, browsing...).
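A minimal sketch of such an expansion utility, in Python. It assumes the compound form is Book.Chapter.Start-End within a single chapter, with plain integer verse numbers; the function names and the regular expression are illustrative and not part of OSIS. The verseStart element is the milestone form quoted above.

```python
import re

# Assumed compound form: Book.Chapter.Start-End, integer verses only.
_COMPOUND = re.compile(
    r"^(?P<book>[A-Za-z0-9]+)\.(?P<chapter>\d+)\.(?P<start>\d+)-(?P<end>\d+)$"
)


def expand_compound_id(compound):
    """Return the individual verse identifiers covered by a compound one.

    Anything that does not match the assumed compound form is returned
    unchanged as a one-element list.
    """
    m = _COMPOUND.match(compound)
    if m is None:
        return [compound]
    start, end = int(m["start"]), int(m["end"])
    if start > end:
        raise ValueError(f"verses out of order in {compound!r}")
    prefix = f'{m["book"]}.{m["chapter"]}'
    return [f"{prefix}.{verse}" for verse in range(start, end + 1)]


def verse_start_markers(compound):
    """Emit one milestone per verse, as in the multiple-milestone approach."""
    return [f'<verseStart ref="{ref}" />' for ref in expand_compound_id(compound)]


for line in verse_start_markers("Mark.1.1-3"):
    print(line)
```

Run once before release, this leaves every verse addressable by a plain string match, so downstream software needs no range logic.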
Also, it seems to me structurally incorrect -- something like
Mark.1.1-3 is not an identifier as I understand it -- it is a
structured expression that *uses* other identifiers. I think if you
asked most laymen what that string means, they would be hard pressed
to say anything about it without referring to those other
identifiers. Thus, I claim this is an expression, conceptually.
Note that this looks like a range, but isn't. The syntax and
semantics are not the same as the range Mark.1.1-Mark.1.3. They're
related, but a range reference involves selecting on 3 keys and
concatenating the results (or something faintly like that); the
compound-verse identifier is a special kind of key semantics, where
retrieval has to be smart enough to know that a variety of query keys
will match this (meta-) key value in the data. Quite different
implementation issues.
Also, it isn't really *just* another identifier token -- the numbers
have constraints like a range would (like having to be in order).
Also, weird numbering systems would make the implicit loop not work
-- for example, what if one version has marked Matt.1.2a (now *that*
seems to me like a real identifier -- just a token to match), and you
click on it to find parallels? The loop that expands Matt.1.1-3 is
not going to generate the '2a' reference; and heaven forbid that
anyone should number their verses backwards (seems unlikely, but in
this business I wouldn't bet much money against it happening
somewhere).
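The two failure cases can be seen in a few lines of Python. This is only an illustration: naive_expand is a hypothetical stand-in for the implicit loop, assuming it simply counts integers from the first verse number to the last.

```python
def naive_expand(book, chapter, start, end):
    """Hypothetical implicit loop: count integer verse numbers start..end."""
    return [f"{book}.{chapter}.{verse}" for verse in range(start, end + 1)]


generated = naive_expand("Matt", 1, 1, 3)
print(generated)                 # ['Matt.1.1', 'Matt.1.2', 'Matt.1.3']
print("Matt.1.2a" in generated)  # False: the lettered verse is never produced

# Backwards numbering: an integer loop from 3 down to 1 yields nothing at all.
print(naive_expand("Matt", 1, 3, 1))  # []
```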
I recently realized that this gets messier when we cross it with the
idea of using grains to mark the parts of discontiguous verses.
My first problem with grains for this is that it seems conceptually
incorrect -- we defined grains as being for mechanically identifying
locations within the smallest units -- that is, as the escape for
dealing with finer-grained addressing than the system allows. But:
1) users will seldom request the part of a verse that we had to break
off into a separate part because it was right after an embedded quote
(etc. etc) -- and if they do want something like that, they can't
readily predict what the grain identifier for it would be.
2) Using grains to identify these parts conflates 2 separate notions
(as I think Harry pointed out earlier): tying together the parts,
vs. identifying the whole.
3) We raise new error conditions: What if the grain identifiers on
the parts do not in fact evaluate to those parts? For example, it
says @char(44) but in fact the first character within is character
45, or 50, or 200? Is it a validity condition that these be right?
4) The fact that there *can* be a contradiction, suggests that the
data is non-normalized in a slightly dangerous way. In practice, this
leads to situations where it is extra work (human or automated) to
keep such things in sync. Thus:
a) the identifiers will creep off as editing occurs, and authors will not
be happy about having to fix them
b) what happens when a new edition comes out with slight changes -- all
these identifiers become invalid? If these were really identifiers I think
they shouldn't die so easily.
c) how does this support re-ordering? If parts of the verse occur out of
order, their grain values will too; in which case the semantics of grains
are ambiguous:
i) In a grain used in a self-id, the grain is definitive: regardless of
where this piece occurs, the self-id's grain constitutes an assertion
that you are at this grain position. This massively complicates
the implementation of grain-finding (it ain't just counting anymore)
ii) In a grain used in a reference, the grain is a query: you must search
for it.
This then raises the nasty case that for any re-ordering, there will
be grain-references that could lead to two places. For example,
consider:
<z id='John.1.1@char(01)'>In the beginning </z>
<z id='John.1.1@char(22)'>the Word</z>
<z id='John.1.1@char(18)'>was </z>
Not a great example, but I think it will do.
Now, where does a reference to John.1.1@char(19) lead? To 'a' or to
'h'? Better yet, does a reference to John.1.1@char(22) lead to 't'
or to 'W'? Is @char(3) out of range, or does it point to 'w'?
5) Authors will also have trouble generating these beasties in the
first place; thus we impose a barrier of software support between us
and acceptance/use.
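The ambiguity in the John.1.1 example under point 4 can be reproduced with a small Python sketch. It assumes @char(n) is a 1-based character offset, and it models the two readings as two hypothetical resolver functions: one trusts the grain declared on each part's self-id, the other ignores the declarations and counts characters in document order.

```python
# The three parts in document order, with their declared @char grains.
parts = [(1, "In the beginning "), (22, "the Word"), (18, "was ")]


def resolve_by_self_id(parts, n):
    """Reading (i): each part's declared grain is definitive."""
    for start, text in parts:
        if start <= n < start + len(text):
            return text[n - start]
    return None


def resolve_by_counting(parts, n):
    """Reading (ii): ignore declared grains, count in document order."""
    flat = "".join(text for _, text in parts)
    return flat[n - 1] if 1 <= n <= len(flat) else None


for n in (19, 22):
    print(n, resolve_by_self_id(parts, n), resolve_by_counting(parts, n))
# 19 a h
# 22 t W
```

The same reference string lands on different characters depending on which reading the software picked, which is exactly the contradiction described above.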
I think we'd be all right simply putting the whole verse's ID on each
part, and letting them be distinguished via the next/prev stuff. Yes, it
does mean that you get 3 'hits' (or whatever) for a verse retrieval.
Oh, and if this is stored in an RDB, when you get the 3 partial-verse
records back, you can sort correctly in the face of reordering only
if you have next/prev, but not if you just have grains.
Concatenating by chaining through the next/prevs seems to me easier
to implement.
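A rough sketch of that chaining in Python, assuming each partial-verse record comes back as a dict with hypothetical fields key, verse, text, prev and next (the row keys r1, r2, r3 are invented for the example and are not OSIS identifiers):

```python
def assemble_verse(records):
    """Concatenate partial-verse records by following next links from the head."""
    by_key = {r["key"]: r for r in records}
    heads = [r for r in records if r["prev"] is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one part with no prev link")
    texts, seen, current = [], set(), heads[0]
    while current is not None:
        if current["key"] in seen:
            raise ValueError("cycle in next/prev chain")
        seen.add(current["key"])
        texts.append(current["text"])
        nxt = current["next"]
        current = by_key[nxt] if nxt is not None else None
    return "".join(texts)


# Three hits for John.1.1, returned in document order (reordered on the page).
rows = [
    {"key": "r1", "verse": "John.1.1", "text": "In the beginning ",
     "prev": None, "next": "r3"},
    {"key": "r2", "verse": "John.1.1", "text": "the Word",
     "prev": "r3", "next": None},
    {"key": "r3", "verse": "John.1.1", "text": "was ",
     "prev": "r1", "next": "r2"},
]
print(assemble_verse(rows))  # In the beginning was the Word
```

The reading order comes entirely from the links, so it survives reordering on the page and does not depend on any character counts staying accurate.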
Note also that users are already used to one kind of partial-verse
identifier that doesn't have most of these problems (though they have
not been very formal to date): appending a letter to designate
successive parts of a verse.
So my proposal for this would be to disallow grains on self-ids, and
to either suggest or (preferably?) require appending a, b,... to the
identifiers if you want to make the parts accessible (those should be
declared in a reference scheme, but I'm not picky about that part,
seems like a nit at this point). This gives us nice predictable
names for the parts of a discontiguous verse (say, for use in the
next/prev values), and makes a trivial algorithm to strip them off.
Indeed, we could use the default fallback algorithm, which is to
strip off trailing tokens of a reference; to do that we just put '.'
before the 'a'.
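A minimal sketch of that fallback in Python, assuming identifiers are '.'-separated tokens and that the part suffix is written as a final token (for example John.1.1.a). The function name and the set of known identifiers are illustrative only.

```python
def resolve_with_fallback(ref, known_ids):
    """Try the reference as given, then strip trailing tokens until one matches."""
    tokens = ref.split(".")
    while tokens:
        candidate = ".".join(tokens)
        if candidate in known_ids:
            return candidate
        tokens.pop()  # drop the last token and retry with the shorter reference
    return None


# An edition that marks only whole verses still resolves a part reference.
print(resolve_with_fallback("John.1.1.a", {"John.1.1", "John.1.2"}))  # John.1.1

# An edition that declares the parts gets the exact match.
print(resolve_with_fallback("John.1.1.a", {"John.1.1.a", "John.1.1.b"}))  # John.1.1.a
```

With the suffix as its own token, no special-case code is needed to strip it: the part name is removed by the same loop that would fall back from a verse to its chapter.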
S
--
Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sjd@stg.brown.edu