[osis-core] Hi
Steven J. DeRose
osis-core@bibletechnologieswg.org
Tue, 9 Dec 2003 11:17:30 -0500
I'm coding away on an XML indexer -- Got Java generating dialog
boxes, almost parsing XML files, and almost spitting out an output
stream of index content, though not in any special format yet. Once
that's done, I just need to tweak it to output the formal index
format, implement the high-speed vocab table, and support merge,
delete, and garbage collection, and a query language (prob. XPath
first off, starting from the Xerces impl).
It's not too bad a pain except for the number of libraries and
interfaces to learn. I bought a 3x4 wallchart of the Java class
libraries, but even it doesn't have enough room to show any methods
or arguments. Fortunately the Metrowerks IDE has nice auto-completion
that does show you what's available as you start to type. Not bad.
Two significant problems I'm encountering:
1) SAX doesn't report empty elements -- just starts and ends. So for
trojan milestones, do I depend on startid and endid absolutely, or do
I keep track for everything opened, whether any content has shown up,
and if so complain if I see startid/endid? The latter seems better,
and not too painful.
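
A minimal sketch of the second option in problem 1, using the standard SAX DefaultHandler: keep one frame per open element, flag it as soon as any text or child element shows up, and complain at the end tag if an element carrying startid/endid turned out not to be empty. The class name MilestoneChecker and the check() helper are made up for illustration, and the attribute names startid and endid are taken from the message as written, not from any schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: tracks, for every open element, whether content has appeared.
class MilestoneChecker extends DefaultHandler {
    private static final class Frame {
        final boolean milestone;   // element carried startid or endid
        boolean hasContent;        // text or a child element was reported
        Frame(boolean milestone) { this.milestone = milestone; }
    }

    private final Deque<Frame> open = new ArrayDeque<>();
    final List<String> complaints = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Frame parent = open.peek();
        if (parent != null) parent.hasContent = true;   // a child counts as content
        // Attribute names are assumptions taken from the message text.
        boolean milestone = atts.getValue("startid") != null || atts.getValue("endid") != null;
        open.push(new Frame(milestone));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        Frame top = open.peek();
        if (top != null && length > 0) top.hasContent = true;   // whitespace counts too
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        Frame f = open.pop();
        if (f.milestone && f.hasContent) {
            complaints.add("<" + qName + "> has startid/endid but is not empty");
        }
    }

    // Parses an XML string and returns the complaints collected along the way.
    static List<String> check(String xml) {
        MilestoneChecker handler = new MilestoneChecker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.complaints;
    }
}
```

Here whitespace-only text is treated as content, since SAX reports it through characters() when no DTD is in play; a real indexer might want to be more forgiving.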
2) Entities. Obviously the indexer has to store something for every
token it finds -- what do I store as location for stuff that was in
external entities? The iNode and offset in the entity? That croaks for
internal entities, and it makes me wonder what to return when a
search happens (like, where to tell the app to start reading, esp.
if the entity was referenced more than once!).
Or, maybe better, I'm thinking I should just store the iNode of the
root document, and the XPath child-sequence to the node (perhaps as
extended to allow char offsets, as in the XPointer draft that I couldn't
get finalized by W3C (grumble grumble)). That's a lot more verbose
than, say, 8 bytes for a hex offset; but it's butt-simple to
implement, and the pointers themselves allow you to do all sorts of
comparisons without even looking at the source -- like contains(),
precedes(), equals(), isLeaf(), depth(), parent(), etc.
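
To illustrate how far the pointers alone get you, here is a rough sketch of a child-sequence value class. The name ChildSeq, the "/1/4/2" string form, and the exact semantics chosen for precedes() (document order, excluding ancestors, as on the XPath preceding axis) are assumptions for illustration, not anything fixed by the message. isLeaf() is left out because answering it needs to know whether any longer sequence exists in the index.

```java
import java.util.Arrays;

// Hypothetical value class: a path of 1-based child positions from the root.
final class ChildSeq implements Comparable<ChildSeq> {
    private final int[] steps;

    ChildSeq(int... steps) { this.steps = steps.clone(); }

    // Parses the assumed string form, e.g. "/1/4/2".
    static ChildSeq parse(String s) {
        if (!s.startsWith("/") || s.length() < 2) {
            throw new IllegalArgumentException("not a child sequence: " + s);
        }
        String[] parts = s.substring(1).split("/");
        int[] steps = new int[parts.length];
        for (int i = 0; i < parts.length; i++) steps[i] = Integer.parseInt(parts[i]);
        return new ChildSeq(steps);
    }

    int depth() { return steps.length; }

    // Returns null for a one-step sequence, which has no parent in this form.
    ChildSeq parent() {
        return steps.length <= 1 ? null : new ChildSeq(Arrays.copyOf(steps, steps.length - 1));
    }

    // True if this node is a proper ancestor of other (this is a strict prefix).
    boolean contains(ChildSeq other) {
        if (steps.length >= other.steps.length) return false;
        for (int i = 0; i < steps.length; i++) {
            if (steps[i] != other.steps[i]) return false;
        }
        return true;
    }

    // True if this node comes before other in document order and is not its ancestor.
    boolean precedes(ChildSeq other) {
        return compareTo(other) < 0 && !contains(other);
    }

    // Document order of start positions: an ancestor sorts before its descendants.
    @Override
    public int compareTo(ChildSeq o) {
        int n = Math.min(steps.length, o.steps.length);
        for (int i = 0; i < n; i++) {
            if (steps[i] != o.steps[i]) return Integer.compare(steps[i], o.steps[i]);
        }
        return Integer.compare(steps.length, o.steps.length);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ChildSeq && Arrays.equals(steps, ((ChildSeq) o).steps);
    }

    @Override
    public int hashCode() { return Arrays.hashCode(steps); }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int step : steps) sb.append('/').append(step);
        return sb.toString();
    }
}
```

None of these methods touches the source document; they only compare integer arrays, which is the appeal of the approach despite the extra bytes per pointer.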
In that case, though, there should be a table somewhere to map from
child seq to file and offset, for fast processing. Not sure about
that.
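
One possible shape for such a table, purely as a sketch: a map from the child-sequence string in the root document to a (file, offset) pair. The names LocationTable and Location, the use of a plain string as the file identifier instead of an iNode, and the example values in the usage note are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table: child sequence in the root document -> physical location.
final class LocationTable {
    static final class Location {
        final String file;   // stand-in for an iNode or other file identifier
        final long offset;   // offset within that file
        Location(String file, long offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    private final Map<String, Location> byChildSeq = new HashMap<>();

    // Records where the node addressed by childSeq physically lives.
    void put(String childSeq, String file, long offset) {
        byChildSeq.put(childSeq, new Location(file, offset));
    }

    // Returns the recorded location, or null if the child sequence is unknown.
    Location lookup(String childSeq) {
        return byChildSeq.get(childSeq);
    }
}
```

With this layout, an entity referenced more than once simply appears under several keys: for instance "/1/2/1" and "/1/7/1" could both map to offset 0 of the same (made-up) file "warning.ent", so a search hit still has one unambiguous pointer in the root document while the app knows which file to read.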
We never really solved the entity problem in DynaText -- though I
don't think customers ever noticed. Surprising, especially with
legal/mil/gov customers who have standardized boilerplate entities
(like mandated warnings and notices) that get invoked squidzillions
of times.
BTW, this thing is going to store starting and ending offsets for
both ends of all elements, so it can index Trojan milestones and LMNL
or JITS constructs just fine. (Troy, we have to get someone named
Helen in on the project, to complete the allusion...).
Thoughts, anyone?
--
Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@acm.org or steve@derose.net