[sword-devel] XML DOM
DM Smith
dmsmith555 at yahoo.com
Fri Mar 9 18:40:04 MST 2007
Greg,
Using an XML parser is actually quite viable to do the parsing for
osis2mod. The fundamental behavior of the program is to identify and
gather all the "chunks" that need to go into the index and then call
the Sword API routines to store a chunk against a key.
The Sword API keeps track of all the offsets and the size of the data
as it goes. It does not have any memory of what it has done, but only
knows the current size of the output file (via tell, IIRC) and the
size of what it is writing. This info is written to the index file in
the slot reserved for that verse.
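
To make the bookkeeping above concrete, here is a minimal sketch in Python, not the actual Sword code. The slot layout (a 4-byte offset followed by a 2-byte size, little-endian) and the names `store_chunk` and `read_chunk` are assumptions made for illustration only. The point is that the writer needs nothing from the parser except the chunk itself; the offset comes from the current end of the data file.

```python
import io
import struct

# Hypothetical slot layout (an assumption): 4-byte offset + 2-byte size.
SLOT = struct.Struct("<IH")


def store_chunk(data_file, index_file, slot_number, chunk):
    """Append chunk to the data file and record (offset, size) in its slot."""
    # The offset is simply the current size of the data file.
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(chunk)

    # Each key owns a fixed-width slot, so slots can be written in any order.
    index_file.seek(slot_number * SLOT.size)
    index_file.write(SLOT.pack(offset, len(chunk)))
    return offset, len(chunk)


def read_chunk(data_file, index_file, slot_number):
    """Look up a slot in the index and return the bytes it points at."""
    index_file.seek(slot_number * SLOT.size)
    offset, size = SLOT.unpack(index_file.read(SLOT.size))
    data_file.seek(offset)
    return data_file.read(size)
```

Used with two in-memory files, storing slot 1 and then slot 0 places the second chunk right after the first in the data file, and each slot still resolves to its own text.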
The key is called a verse, but it might be an intro to a testament,
book or chapter. The other main trick in osis2mod is the
identification of headings and their placement into the verse that
follows. osis2mod also does some normalization of the input.
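
The heading trick can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not what osis2mod does line for line: the input is taken to be an already-parsed stream of `(kind, key, text)` tuples, and `attach_headings` is a hypothetical name.

```python
def attach_headings(events):
    """Fold each pending heading into the verse that follows it.

    events is an iterable of (kind, key, text) tuples, where kind is
    either "heading" (key is ignored) or "verse".
    """
    pending = []   # headings seen since the last verse
    chunks = {}    # key -> text to store against that key
    for kind, key, text in events:
        if kind == "heading":
            pending.append(text)
        else:
            # Prepend any held headings to this verse, then reset.
            chunks[key] = "".join(pending) + text
            pending = []
    return chunks
```

A heading that appears before Gen.1.1 ends up at the front of the Gen.1.1 chunk, and Gen.1.2 is stored unchanged.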
All of this can be readily done with Xerces as the parser, using
either SAX, DTM or DOM and even by using XSLT.
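
As a rough illustration of the SAX route, here is a sketch that uses Python's standard `xml.sax` module in place of Xerces. It assumes, for simplicity, that verses are container elements carrying an `osisID` attribute; real OSIS input may use milestoned verses and would need more handling. `VerseCollector` and `gather_chunks` are hypothetical names. Note that the handler never asks the parser for a byte position: it only gathers (key, text) pairs, which is all the storing step needs.

```python
import xml.sax


class VerseCollector(xml.sax.ContentHandler):
    """Collect (osisID, text) pairs for every <verse> container element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._key = None    # osisID of the verse being read, if any
        self._text = []

    def startElement(self, name, attrs):
        if name == "verse":
            self._key = attrs.get("osisID")
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._key is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "verse" and self._key is not None:
            self.chunks.append((self._key, "".join(self._text)))
            self._key = None


def gather_chunks(xml_bytes):
    """Parse well-formed XML bytes and return the collected chunks in order."""
    handler = VerseCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks
```

Input that is not well-formed makes `xml.sax.parseString` raise an exception rather than produce partial output, which is the second drawback listed below.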
There are drawbacks:
It requires a new skill set to maintain osis2mod. Several developers
currently maintain it. Though I have been told it's mine since I
touched it last ;)
It requires well-formed input. The current parser does not require
this, but it does warn when the input is not well-formed.
The current program works well. The new program would need extensive
certification. Or both would need to exist until we are satisfied
with the replacement.
To me the biggest motivation for a rewrite would be to handle other
kinds of modules besides Bibles.
DM
On Mar 9, 2007, at 4:04 PM, Greg Hellings wrote:
> When I asked about this question in the past, specifically related to
> the utilities as you are, is when I finally received my insight into
> how the Sword library holds its files. Due to the fact that most XML
> parsers obfuscate the actual number of bytes that have been read, and
> since the Sword library generates an index file for the module that
> relies on the number of bytes into the data file a certain occurrence
> is located, using a DOM or SAX parser, I was told, is not viable.