[osis-core] Hi

Tue, 9 Dec 2003 11:17:30 -0500

I'm coding away on an XML indexer. I've got Java generating dialog 
boxes, almost parsing XML files, and almost spitting out an output 
stream of index content, though not in any special format yet. Once 
that's done, I just need to tweak it to output the formal index 
format, implement the high-speed vocab table, and support merge, 
delete, and garbage collection, plus a query language (probably XPath 
to start, building on the Xerces implementation).

It's not too bad a pain except for the number of libraries and 
interfaces to learn. I bought a 3x4 wallchart of the Java class 
libraries, but even it doesn't have enough room to show any methods 
or arguments. Fortunately the Metrowerks IDE has nice auto-completion 
that does show you what's available as you start to type. Not bad.

Two significant problems I'm encountering:

1) SAX doesn't flag empty elements as such -- it just reports a start 
and an end. So for Trojan milestones, do I depend on startid and endid 
absolutely, or do I keep track, for everything opened, of whether any 
content has shown up, and complain if content appears on an element 
carrying startid/endid? The latter seems better, and not too painful.
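
A rough sketch of that second option, in case it helps the discussion. 
It assumes the milestone attributes are literally named startid and 
endid (as above), that any text at all counts as content (whitespace 
included), and the class and method names (MilestoneCheck, check) are 
made up for illustration. It is not the indexer's actual code.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks, for every open element, whether content appeared.
final class MilestoneCheck extends DefaultHandler {
    private static final class Frame {
        final String name;
        final boolean milestone;  // element carried startid or endid (assumed names)
        boolean hasContent;       // any text or child element seen so far
        Frame(String name, boolean milestone) {
            this.name = name;
            this.milestone = milestone;
        }
    }

    private final Deque<Frame> open = new ArrayDeque<>();
    final List<String> complaints = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // A new child means the enclosing element is not empty.
        if (!open.isEmpty()) open.peek().hasContent = true;
        boolean ms = atts.getValue("startid") != null || atts.getValue("endid") != null;
        open.push(new Frame(qName, ms));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length > 0 && !open.isEmpty()) open.peek().hasContent = true;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        Frame f = open.pop();
        // Complain only when a milestone element turned out to have content.
        if (f.milestone && f.hasContent) complaints.add(f.name);
    }

    // Parses the XML text and returns the names of offending milestone elements.
    static List<String> check(String xml) {
        try {
            MilestoneCheck h = new MilestoneCheck();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), h);
            return h.complaints;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The cost is one small frame per open element, so memory grows with 
nesting depth rather than document size.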

2) Entities. Obviously the indexer has to store something for every 
token it finds -- what do I store as the location for stuff that was 
in external entities? The iNode and offset in the entity? That croaks 
for internal entities, and it makes me wonder what to return when a 
search hits (like, where to tell the app to start reading), especially 
if the entity was referenced more than once!

Or, maybe better, I'm thinking I should just store the iNode of the 
root document, and the XPath child-sequence to the node (perhaps 
extended to allow char offsets, as in the XPointer draft that I 
couldn't get finalized by the W3C (grumble grumble)). That's a lot 
more verbose than, say, 8 bytes for a hex offset; but it's butt-simple 
to implement, and the pointers themselves allow you to do all sorts of 
comparisons without even looking at the source -- like contains(), 
precedes(), equals(), isLeaf(), depth(), parent(), etc.
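
To make the pointer arithmetic concrete, here is a minimal sketch of a 
few of those operations on a plain child sequence written like /1/4/2. 
The class name ChildSeq and the exact semantics are my assumptions for 
illustration (contains() means strict ancestor, precedes() means 
document order by start tag); char offsets are left out.

```java
import java.util.Arrays;

// Hypothetical child-sequence pointer: 1-based child positions from the root.
final class ChildSeq {
    private final int[] steps;

    ChildSeq(int... steps) {
        this.steps = steps.clone();
    }

    // Parses the "/1/4/2" form; "/" alone is the empty sequence.
    static ChildSeq parse(String s) {
        String body = s.startsWith("/") ? s.substring(1) : s;
        if (body.isEmpty()) return new ChildSeq();
        return new ChildSeq(Arrays.stream(body.split("/"))
                .mapToInt(Integer::parseInt).toArray());
    }

    int depth() {
        return steps.length;
    }

    // Drops the last step; the empty sequence has no parent.
    ChildSeq parent() {
        if (steps.length == 0) throw new IllegalStateException("no parent");
        return new ChildSeq(Arrays.copyOf(steps, steps.length - 1));
    }

    // True when this sequence is a strict prefix of the other (an ancestor).
    boolean contains(ChildSeq other) {
        if (steps.length >= other.steps.length) return false;
        for (int i = 0; i < steps.length; i++) {
            if (steps[i] != other.steps[i]) return false;
        }
        return true;
    }

    // Document order: first differing step decides; an ancestor precedes its descendants.
    boolean precedes(ChildSeq other) {
        int n = Math.min(steps.length, other.steps.length);
        for (int i = 0; i < n; i++) {
            if (steps[i] != other.steps[i]) return steps[i] < other.steps[i];
        }
        return steps.length < other.steps.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ChildSeq && Arrays.equals(steps, ((ChildSeq) o).steps);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(steps);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int step : steps) sb.append('/').append(step);
        return sb.length() == 0 ? "/" : sb.toString();
    }
}
```

None of these touch the source document; everything is decided from 
the integer steps alone.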

In that case, though, there should be a table somewhere to map from 
child seq to file and offset, for fast processing. Not sure about 
that.
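
For what it's worth, the simplest form of such a table might look like 
the sketch below. Everything here is an assumption for illustration: 
the names (LocationTable, Location, nearest), keying on the child 
sequence as a string, and the fallback of walking up to the nearest 
ancestor that has an entry when the exact node was not recorded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical side table: child sequence (e.g. "/1/4/2") -> physical file and offset.
final class LocationTable {
    static final class Location {
        final String file;   // file or entity the node physically lives in
        final long offset;   // offset of the node within that file
        Location(String file, long offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    private final Map<String, Location> table = new HashMap<>();

    void put(String childSeq, String file, long offset) {
        table.put(childSeq, new Location(file, offset));
    }

    // Returns the entry for the node itself, else for its nearest recorded
    // ancestor, else null.
    Location nearest(String childSeq) {
        String key = childSeq;
        while (!key.isEmpty()) {
            Location loc = table.get(key);
            if (loc != null) return loc;
            int cut = key.lastIndexOf('/');
            if (cut < 0) break;
            key = key.substring(0, cut);  // drop the last step and retry
        }
        return null;
    }
}
```

Note that two different child sequences can map to the same file and 
offset, which is exactly the case of an entity referenced more than 
once: the pointer stays unambiguous even though the storage is shared.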

We never really solved the entity problem in DynaText -- though I 
don't think customers ever noticed. Surprising, especially with 
legal/mil/gov customers who have standardized boilerplate entities 
(like mandated warnings and notices) that get invoked squidzillions 
of times.

BTW, this thing is going to store starting and ending offsets for 
both ends of all elements, so it can index Trojan milestones and LMNL 
or JITS constructs just fine. (Troy, we have to get someone named 
Helen in on the project, to complete the allusion...).

Thoughts, anyone?

-- 

Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@acm.org  or  steve@derose.net