[sword-devel] XML DOM

DM Smith dmsmith555 at yahoo.com
Fri Mar 9 18:40:04 MST 2007


Using an XML parser is actually quite viable to do the parsing for  
osis2mod. The fundamental behavior of the program is to identify and  
gather all the "chunks" that need to go into the index and then call  
the Sword API routines to store a chunk against a key.

The Sword API keeps track of all the offsets and the size of the data  
as it goes. It does not have any memory of what it has done, but only  
knows the current size of the output file (via tell, IIRC) and the  
size of what it is writing. This info is written to the index file in  
the slot reserved for that verse.
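
A minimal sketch of that bookkeeping, in Python for brevity. This is not the actual Sword code: the 6-byte record layout and the names store_chunk and read_chunk are assumptions made up for the illustration.

```python
import io
import struct

# Hypothetical index record: 4-byte offset + 2-byte size, little-endian.
# The real Sword index layout may differ; this only illustrates the idea.
RECORD = struct.Struct("<IH")


def store_chunk(data_file, index_file, slot, chunk):
    """Append chunk to data_file and record (offset, size) in the index slot."""
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()  # current size of the output file
    data_file.write(chunk)
    # Each key owns a fixed-size slot, so the slot number gives its position.
    index_file.seek(slot * RECORD.size)
    index_file.write(RECORD.pack(offset, len(chunk)))
    return offset, len(chunk)


def read_chunk(data_file, index_file, slot):
    """Look up (offset, size) in the index slot and read the chunk back."""
    index_file.seek(slot * RECORD.size)
    offset, size = RECORD.unpack(index_file.read(RECORD.size))
    data_file.seek(offset)
    return data_file.read(size)
```

Note that the offset comes from the output data file, not from the parser's position in the input, so it does not matter whether the parser exposes how many input bytes it has read.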

The key is called a verse, but it might be an intro to a testament,  
book or chapter. The other main trick in osis2mod is the  
identification of headings and their placement into the verse that  
follows. osis2mod also does some normalization of the input.
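
The heading placement can be sketched roughly as follows. The (kind, key, text) event tuples and the name attach_headings are hypothetical, and osis2mod's real logic handles more cases than this.

```python
def attach_headings(events):
    """Yield (key, text) pairs, folding pending headings into the next verse.

    events is an iterable of (kind, key, text) tuples where kind is
    "heading" or "verse"; key is ignored for headings.
    """
    pending = []
    for kind, key, text in events:
        if kind == "heading":
            # Hold the heading until the verse that follows it arrives.
            pending.append(text)
        else:
            yield key, "".join(pending) + text
            pending = []
```

In this sketch a heading with no verse after it is simply dropped; a real tool would have to decide where such a heading belongs.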

All of this can be readily done with Xerces as the parser, using  
either SAX, DTM or DOM and even by using XSLT.
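
As a rough illustration of the SAX approach, here is a sketch using Python's xml.sax rather than Xerces. It assumes verses appear as simple container elements with an osisID attribute, which is a simplification of real OSIS input; the names VerseCollector and collect_verses are invented for the example.

```python
import xml.sax


class VerseCollector(xml.sax.ContentHandler):
    """Gather the text of each <verse> element, keyed by its osisID."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # (osisID, text) pairs in document order
        self._key = None   # osisID of the verse being read, if any
        self._text = []

    def startElement(self, name, attrs):
        if name == "verse":
            self._key = attrs.get("osisID")
            self._text = []

    def characters(self, content):
        # Only keep character data that falls inside a verse.
        if self._key is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "verse" and self._key is not None:
            self.chunks.append((self._key, "".join(self._text)))
            self._key = None


def collect_verses(xml_bytes):
    """Parse xml_bytes and return the collected (osisID, text) chunks."""
    handler = VerseCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks
```

Each collected pair could then be handed to whatever routine stores a chunk against a key. The sketch keeps only character data and discards any markup nested inside a verse, and the parser raises an exception on input that is not well-formed.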

There are drawbacks:

It requires a new skill set to maintain osis2mod. Several developers  
currently maintain it, though I have been told it's mine since I  
touched it last ;)

It requires well-formed input. The current parser does not require  
it, but does warn when the input is not well-formed.

The current program works well. The new program would need extensive  
certification, or both would need to exist until we are satisfied  
with the replacement.

To me the biggest motivation for a rewrite would be to handle other  
kinds of modules besides Bibles.

On Mar 9, 2007, at 4:04 PM, Greg Hellings wrote:

> When I asked this question in the past, specifically about the
> utilities as you are, I finally gained some insight into how the
> Sword library stores its files.  Because most XML parsers obscure the
> actual number of bytes that have been read, and because the Sword
> library generates an index file for the module that relies on the
> byte offset into the data file at which each entry is located, I was
> told that using a DOM or SAX parser is not viable.
