[sword-devel] Improvements to osis2mod to handle XML comments and <header> correctly
DM Smith
dmsmith at crosswire.org
Mon Apr 5 11:18:35 MST 2010
On 04/05/2010 01:44 PM, Weston Ruter wrote:
> DM:
>
> But what we really need is not a parser but a tokenizer. I'm
> thinking about writing one (my degree work was in compiler
> writing). Basically, we repeat the same tokenization code in
> several places. It should be trivial to write a complete, accurate
> one.
>
>
> I've also been wanting to work on a tokenizer. At Open Scriptures, the
> text of a work is currently represented by two models
> <http://github.com/openscriptures/api/blob/master/models.py> (database
> tables): Token
> <http://github.com/openscriptures/api/blob/master/models.py#L242> and
> Structure
> <http://github.com/openscriptures/api/blob/master/models.py#L315>.
> Tokens are the smallest divisible units of text, such as words,
> punctuation, and whitespace; and structures are the spans of tokens
> that form logical units, such as verses, paragraphs, quotes, etc. The
> structures are standoff-markup for the tokens. With the underlying
> data stored in this way, it can then be serialized in whichever
> hierarchy desired (book-section-paragraph, book-chapter-verse,
> all-milestoned, etc) or whichever data format is needed (OSIS, SWORD
> Module, XHTML, etc.)
This is a lower level than what an XML tokenizer needs: it would be a
tokenizer for the text between tags. A single tokenizer that does both
would be more efficient when both are wanted, but slower when only XML
tokens are needed.
I think a model could be constructed that does both and lets the caller
ask for the depth of tokenization that is needed.
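A minimal sketch of such a two-depth model (illustrative only; the names and the regex are my assumptions, not SWORD or Open Scriptures code). The XML level always runs; the text level is applied only when asked for:

```python
import re

# XML level: a token is either a tag or a run of character data.
XML_TOKEN = re.compile(r'<[^>]+>|[^<]+')

def tokenize(source, depth='xml'):
    """Yield (kind, value) pairs from an OSIS-like fragment.

    depth='xml'  -> tags and raw text runs only (fast path)
    depth='text' -> text runs are further split into words,
                    whitespace, and punctuation
    """
    for match in XML_TOKEN.finditer(source):
        token = match.group()
        if token.startswith('<'):
            yield ('tag', token)
        elif depth == 'xml':
            yield ('text', token)
        else:
            # Crude word-break; a real one must be Unicode-aware.
            for part in re.findall(r'\w+|\s+|[^\w\s]', token):
                if part.isspace():
                    yield ('space', part)
                elif re.match(r'\w', part):
                    yield ('word', part)
                else:
                    yield ('punct', part)
```

With `depth='xml'` the caller pays nothing for word-breaking; with `depth='text'` the same stream carries both tag tokens and text tokens.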
There is a big complication with the parsing of text: it is language
dependent. For example, Thai has words but no word breaks between them.
Basically, the task will require a Unicode- and somewhat language-aware
word-break algorithm. The best implementation I've seen is in ICU.
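A pure-Python illustration of the problem (this function is a naive stand-in for ICU's break iterator, not a real solution):

```python
import unicodedata

def naive_word_break(text):
    """Break on anything that is not a letter, mark, or digit.

    Adequate for space-delimited scripts; useless for Thai, which
    writes consecutive words with no separator at all.
    """
    words, current = [], []
    for ch in text:
        # L* = letters, M* = combining marks, N* = digits
        if unicodedata.category(ch)[0] in ('L', 'M', 'N'):
            current.append(ch)
        elif current:
            words.append(''.join(current))
            current = []
    if current:
        words.append(''.join(current))
    return words
```

English splits fine, but the Thai phrase 'นกบิน' ("a bird flies", two words) comes back as a single run; segmenting it correctly needs a dictionary-based breaker like ICU's.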
Lucene has a wonderful example in their Jira issues database of how to
do tokenization. (1488, if I remember.)
>
> So what I'm currently ruminating on is the process of importing the
> raw data into the Token and Structure models. I wrote an importer
> <http://github.com/openscriptures/api/blob/master/importers/Tischendorf-2.5.py>
> for the Tischendorf GNT data which does everything, both tokenizing and
> parsing, but obviously there is going to be a lot of code in common
> with other importers that are written. So I too am thinking about how
> these importers can be reduced to the bare minimum to handle the
> unique aspects of the raw data (i.e. normalize it), and then stream
> the tokens back to a central importer that parses the input and stores
> it into the Token and Structure models. This central importer facility
> could be a web service.
>
> I'd love to collaborate with you on this. We could come up with a
> common tokenizer that can be used by both SWORD and Open Scriptures.
> The importer web service could take tokens as input and as output
> generate a SWORD module and also populate the Open Scriptures models
> at the same time.
>
> Thoughts?
Sounds good to me, too.
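One way the token-streaming importer above might look in code (a sketch; `normalize_source`, `Importer`, and the tuple layouts are hypothetical, not the actual Open Scriptures Token/Structure models):

```python
def normalize_source(lines):
    """Source-specific front end: turn raw lines into a token stream."""
    for line in lines:
        for word in line.split():
            yield word

class Importer:
    """Shared back end: tokens stored once, structures as standoff spans."""
    def __init__(self):
        self.tokens = []       # Token "table": (id, text)
        self.structures = []   # Structure "table": (kind, start_id, end_id)

    def load(self, token_stream, kind='verse'):
        start = len(self.tokens)
        for text in token_stream:
            self.tokens.append((len(self.tokens), text))
        self.structures.append((kind, start, len(self.tokens) - 1))
```

The point of the split is that only `normalize_source` changes per corpus; the Importer (and any serializers on top of it) stay shared between projects.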
In Him,
DM
>
> Weston
>
>
>
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <dhowens at pmbx.net
> <mailto:dhowens at pmbx.net>> wrote:
>
> Yes, I agree, and if there were a feedback mechanism for the
> module creator to let them know how to start fixing an OSIS file
> or conf file, it would save Chris (or whoever else approves
> modules) time on the basic stuff.
>
> Daniel
>
>
> On 4/5/2010 11:09 AM, DM Smith wrote:
>
> This is a great idea. Rather than emailing source to modules
> at crosswire dot org, one could upload it via a web service.
> We could have stages of validation (xmllint) and construction
> (osis2mod). Such a service could evaluate the quality of the
> submission.
>
> In Him,
> DM
>
> On 04/05/2010 12:01 PM, Weston Ruter wrote:
>
> Why not turn osis2mod into a web service? Then it wouldn't
> matter how it is implemented since it would be abstracted
> away by the web service interface. It could use the best
> XML libraries available today and be written in the
> programming language of choice, both of which would make
> maintenance and the addition of new features much easier.
>
> Weston
>
>
>
>
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <dmsmith at crosswire.org
> <mailto:dmsmith at crosswire.org>> wrote:
>
> On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:
>
> On 5 April 2010 13:55, Manfred
> Bergmann<manfred.bergmann at me.com
> <mailto:manfred.bergmann at me.com>> wrote:
>
> Hi DM.
>
> Am 05.04.2010 um 13:21 schrieb DM Smith:
>
>
> Regarding using a "real" parser, it is a good idea.
> But we don't want SWORD to be dependent on an external
> parser.
>
> What's the reason for that?
> I could understand if it would mean for the user to
> install certain libraries manually but when the sources
> can be integrated into the project and has the appropriate
> licence then why not?
>
>
> Manfred
>
>
> IMHO there is no harm in bringing in libxml or a much more
> lightweight
> parser like GMarkup. The build system just needs to be adjusted to
> link e.g. libxml for the osis2mod binary and not the shared sword
> library.
> It can even be called a new tool, osisxml2mod for example, and
> made an optional build so that you can still have a full sword dev
> environment without libxml.
>
> Tools for creating modules do not have to be linked with sword or
> even live in the sword tarball / svn, although that does help
> consistent distribution of tools.
>
> I don't remember all of Troy's reasoning when I argued for a true
> parser.
>
> From what I recall:
> o To maintain freedom to re-license SWORD (e.g. for some other
> Bible society) we need to be able to keep 3-rd party library
> dependencies well managed. The license needs to be compatible with
> the GPL but cannot be GPL.
>
> o The parser that we have is minimal and simple, sacrificing
> accuracy and completeness for speed. Regarding accuracy, e.g. the
> parser allows for spaces around = in attribute declarations.
> Regarding completeness, e.g. it does not handle namespaces, CDATA,
> DTDs/schemas, .... Significantly, it does not require a
> well-formed document, allowing for fragments. Rather than raising
> an error, it continues where an XML parser is required to stop.
>
> o This parser has better error reporting in that it is based upon
> knowledge of the input. E.g. it reports the verse having the problem.
>
> o By SWORD having the parser, we are not dependent on finding an
> implementation for every platform (e.g. Windows).
>
> There may be other reasons. I'm willing to live with it.
>
> But what we really need is not a parser but a tokenizer. I'm
> thinking about writing one (my degree work was in compiler
> writing). Basically, we repeat the same tokenization code in
> several places. It should be trivial to write a complete, accurate
> one.
>
> In His Service,
> DM
>
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> <mailto:sword-devel at crosswire.org>
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
>
>