[osis-core] Hi
Todd Tillinghast
osis-core@bibletechnologieswg.org
Wed, 10 Dec 2003 13:19:54 -0700
Steve,
What are you trying to index?
1) Things in <w> like morph and Strong's numbers
2) All words
3) References (things in osisIDs)
4) Element types and range of containers (both XML containers and
milestone containers)
5) Other things
Are you indexing with the intent of supporting a search process?
More below.
Todd
> -----Original Message-----
> From: osis-core-admin@bibletechnologieswg.org [mailto:osis-core-
> admin@bibletechnologieswg.org] On Behalf Of Steven J. DeRose
> Sent: Tuesday, December 09, 2003 9:18 AM
> To: osis-core@bibletechnologieswg.org
> Subject: [osis-core] Hi
>
> I'm coding away on an XML indexer -- Got Java generating dialog
> boxes, almost parsing XML files, and almost spitting out an output
> stream of index content, though not in any special format yet. Once
> that's done, I just need to tweak it to output the formal index
> format, implement the high-speed vocab table, and support merge,
> delete, and garbage collector, and a query lg (prob. XPath first off,
> starting from Xerces impl).
>
> It's not too bad a pain except for the number of libraries and
> interfaces to learn. I bought a 3x4 wallchart of the Java class
> libraries, but even it doesn't have enough room to show any methods
> or arguments. Fortunately the Metrowerks IDE has nice auto-completion
> that does show you what's available as you start to type. Not bad.
>
> Two significant problems I'm encountering:
>
> 1) SAX doesn't report empty elements -- just starts and ends. So for
> trojan milestones, do I depend on startid and endid absolutely, or do
> I keep track for everything opened, whether any content has shown up,
> and if so complain if I see startid/endid? Latter seems better, and
> not tooo painful.
If I understand the question correctly, I would complain if content
exists within an element that has an eID or sID attribute, AND I would
process purely off the sID and eID attribute values to identify the
start and end of their containers.
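
A minimal sketch of that policy with a SAX handler, in case it helps. The class name MilestoneCheck and the event/complaint strings are made up for illustration; only the sID and eID attribute names come from OSIS. It records start/end events purely from the attribute values and complains when an element carrying either attribute turns out to have content (any character data, including whitespace, or a child element).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any OSIS tool.
class MilestoneCheck extends DefaultHandler {
    /** One open element: does it carry sID/eID, and has content shown up inside it? */
    private static final class Open {
        final String name;
        final boolean milestone;
        boolean hasContent;
        Open(String name, boolean milestone) { this.name = name; this.milestone = milestone; }
    }

    private final Deque<Open> stack = new ArrayDeque<>();
    public final List<String> events = new ArrayList<>();
    public final List<String> complaints = new ArrayList<>();

    private void markContent() {
        Open top = stack.peek();
        if (top != null) top.hasContent = true;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        markContent(); // a child element counts as content of its parent
        String sID = atts.getValue("sID");
        String eID = atts.getValue("eID");
        // Container boundaries come purely from the attribute values.
        if (sID != null) events.add("start " + qName + " " + sID);
        if (eID != null) events.add("end " + qName + " " + eID);
        stack.push(new Open(qName, sID != null || eID != null));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length > 0) markContent();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Open closed = stack.pop();
        if (closed.milestone && closed.hasContent) {
            complaints.add("milestone <" + closed.name + "> has content");
        }
    }

    /** Parses an XML string with the default (non-namespace-aware) JAXP SAX parser. */
    static MilestoneCheck run(String xml) {
        MilestoneCheck handler = new MilestoneCheck();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
        return handler;
    }
}
```

Since SAX reports `<verse sID="..."/>` as a start immediately followed by an end, the per-element hasContent flag is all the state needed to tell a true empty milestone from one that wraps content.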
>
> 2) Entities. Obviously the indexer has to store something for every
> token it finds -- what do I store as location for stuff that was in
> external entities? The iNode and offset in the entity (croaks for
> internal entities, as well as making me wonder what to return when a
> search happens (like, where to tell the app to start reading -- esp.
> if the entity was referenced more than once!)).
>
> Or, maybe better, I'm thinking I should just store the iNode of the
> root document, and the XPath child-sequence to the node (perhaps as
> extended to allow char offsets, in the XPointer draft that I couldn't
> get finalized by W3C (grumble grumble)). That's a lot more verbose
> than, say, 8 bytes for a hex offset; but it's butt-simple to
> implement, and the pointers themselves allow you to do all sorts of
> comparisons without even looking at the source -- like contains(),
> precedes(), equals(), isLeaf(), depth(), parent(), etc.
>
> In that case, though, there should be a table somewhere to map from
> child seq to file and offset, for fast processing. Not sure about
> that.
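
To illustrate the point about comparing pointers without touching the source, here is a rough sketch of a child-sequence value type. The class name ChildSeq and the "/1/4/2" string form (1-based child positions from the root) are assumptions for the example, not the indexer's actual format, and character offsets are left out. Note that isLeaf() is not included, since the pointer alone does not say whether a node has children.

```java
import java.util.Arrays;

// Hypothetical value type for a child sequence such as "/1/4/2".
final class ChildSeq {
    private final int[] steps; // 1-based child positions, root first

    private ChildSeq(int[] steps) { this.steps = steps; }

    static ChildSeq parse(String s) {
        if (!s.startsWith("/") || s.length() < 2) {
            throw new IllegalArgumentException("expected /n/n/...: " + s);
        }
        String[] parts = s.substring(1).split("/");
        int[] steps = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            steps[i] = Integer.parseInt(parts[i]);
            if (steps[i] < 1) throw new IllegalArgumentException("steps are 1-based: " + s);
        }
        return new ChildSeq(steps);
    }

    int depth() { return steps.length; }

    ChildSeq parent() {
        if (steps.length == 0) throw new IllegalStateException("no parent above the root");
        return new ChildSeq(Arrays.copyOf(steps, steps.length - 1));
    }

    /** True if this node is a proper ancestor of other (this is a proper prefix). */
    boolean contains(ChildSeq other) {
        if (steps.length >= other.steps.length) return false;
        for (int i = 0; i < steps.length; i++) {
            if (steps[i] != other.steps[i]) return false;
        }
        return true;
    }

    /** True if this node starts before other in document order (ancestors first). */
    boolean precedes(ChildSeq other) {
        int n = Math.min(steps.length, other.steps.length);
        for (int i = 0; i < n; i++) {
            if (steps[i] != other.steps[i]) return steps[i] < other.steps[i];
        }
        return steps.length < other.steps.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ChildSeq && Arrays.equals(steps, ((ChildSeq) o).steps);
    }

    @Override
    public int hashCode() { return Arrays.hashCode(steps); }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int step : steps) sb.append('/').append(step);
        return sb.toString();
    }
}
```

All of the comparisons are simple prefix or lexicographic checks on the integer steps, which is why they need no access to the document itself.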
>
> We never really solved the entity problem in DynaText -- though I
> don't think customers ever noticed. Surprising especially with
> legal/mil/gov customers who have standardized boilerplate entities
> (like mandated warnings and notices) that get invoked squidzillions
> of times).
>
> BTW, this thing is going to store starting and ending offsets for
> both ends of all elements, so it can index Trojan milestones and LMNL
> or JITS constructs just fine. (Troy, we have to get someone named
> Helen in on the project, to complete the allusion...).
For Bibles it makes sense to store things using osisRef syntax. We have
already resolved the mechanism for character offsets using the grain.
For other OSIS documents that make regular use of osisIDs, the osisRef
syntax will also work well.
With the exception of the grain part, an osisRef can be converted into an
XPath expression.
With the combination of the value of <identifier type="OSIS"> and the
osisRef you should have a value that is precise and reliable.
Eg: Bible.en.CEV.1995:Gen.1.3@cp[34]-Gen.1.3@cp[38]
OR Bible.en.CEV.1995:Gen.1.3!note.a@cp[34] for a character offset within
a note. (I always give notes an osisID so that they can be precisely
referenced.)
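
A rough sketch of how an index entry in this form could be pulled apart and, minus the grain, turned into an XPath expression. The class name OsisPointer is made up, and the sketch assumes a single endpoint (a range such as the first example would need to be split at the "-" first) and that the target element carries exactly this value in its osisID attribute.

```java
// Hypothetical helper; handles one endpoint of the form [work:]osisID[@grain].
final class OsisPointer {
    final String work;   // e.g. "Bible.en.CEV.1995", or null if no work prefix
    final String osisID; // e.g. "Gen.1.3!note.a"
    final String grain;  // e.g. "cp[34]", or null if no grain

    private OsisPointer(String work, String osisID, String grain) {
        this.work = work;
        this.osisID = osisID;
        this.grain = grain;
    }

    static OsisPointer parse(String ref) {
        String work = null;
        String rest = ref;
        int colon = rest.indexOf(':');
        if (colon >= 0) {            // work prefix ends at the first ':'
            work = rest.substring(0, colon);
            rest = rest.substring(colon + 1);
        }
        String grain = null;
        int at = rest.indexOf('@');
        if (at >= 0) {               // grain follows the first '@'
            grain = rest.substring(at + 1);
            rest = rest.substring(0, at);
        }
        if (rest.isEmpty() || rest.indexOf('\'') >= 0) {
            throw new IllegalArgumentException("unusable osisID in: " + ref);
        }
        return new OsisPointer(work, rest, grain);
    }

    /** XPath for the identified element; the grain has no XPath equivalent and is dropped. */
    String toXPath() {
        return "//*[@osisID='" + osisID + "']";
    }
}
```

For the note example above, parse() yields the work Bible.en.CEV.1995, the osisID Gen.1.3!note.a, and the grain cp[34]; the character offset would then be applied by the application after the XPath has located the element.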
There are some cases for which this would not work as nicely (like
titles). In these cases it might make sense to use the OSIS identifier
for the work with XPath like
Bible.en.CEV.1995://div[@osisID='Gen']/title[4]
Todd
<snip>