[jsword-devel] JSword and Map/Image modules
DM Smith
dmsmith555 at yahoo.com
Wed Jan 28 18:24:18 MST 2009
Brian Fernandes wrote:
> DM Smith wrote:
>> Brian Fernandes wrote:
>>> To answer my own questions:
>>>
>>> 1) Map modules can be both hierarchical and simple lists. I should
>>> have realized that all the modules I talked about earlier were in
>>> fact Map modules; but two (FarsiBibleAtlas and ABSMaps) were stored
>>> in gen book format? (not sure of the right terminology here). The
>>> other two were in dictionary format.
>>>
>>> So basically we'll have to make a small change to BD code to see
>>> which of these formats is being used and use a list or a tree
>>> accordingly.
>>
>> This should be handled already. All dictionary modules are treated as
>> lists and all gen book modules are treated as trees.
> What's happening is slightly different. BD is looking at the stated
> module *category* and not the actual module type to decide between
> tree and list. Both FarsiBibleAtlas and Epiphany Maps are in the Map
> *category*. However, for FarsiBibleAtlas the actual Book object is an
> instance of SwordGenBook. For Epiphany Maps, the Book object is an
> instance of SwordDictionary.
>
> Here is relevant info from the corresponding .conf files.
>
> [EpiphanyMaps]
> DataPath=./modules/lexdict/rawld4/epiphany-maps/maps
> ModDrv=RawLD4
> Category=Maps
>
> [ABSMaps]
> DataPath=./modules/genbook/rawgenbook/absmaps/maps
> ModDrv=RawGenBook
> Category=Maps
>
>
> Since BD is taking a decision based on category, it chooses to use a
> tree for both, ergo FarsiBibleAtlas works just fine, but Epiphany
> fails to show any sort of list.
>
> The options I see are:
> a) Change logic (just for maps) to make a list/tree decision based on
> the type of book object.
I think this is the right choice.
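A minimal sketch of option (a), deciding on the type of the book object
rather than its category. The Book, SwordGenBook and SwordDictionary
types below are bare stand-ins so the example compiles on its own; they
are not the real JSword classes. KeyView and KeyViewChooser are
hypothetical names, not existing BD code.

```java
// Stand-in types, for illustration only (NOT the real JSword classes).
interface Book {
    String getInitials();
}

class SwordGenBook implements Book {
    private final String initials;
    SwordGenBook(String initials) { this.initials = initials; }
    public String getInitials() { return initials; }
}

class SwordDictionary implements Book {
    private final String initials;
    SwordDictionary(String initials) { this.initials = initials; }
    public String getInitials() { return initials; }
}

// Hypothetical: the two ways a front-end could present a module's keys.
enum KeyView { LIST, TREE }

final class KeyViewChooser {
    private KeyViewChooser() { }

    // Choose by the actual book type, ignoring the stated Category.
    static KeyView choose(Book book) {
        return (book instanceof SwordGenBook) ? KeyView.TREE : KeyView.LIST;
    }

    public static void main(String[] args) {
        System.out.println(choose(new SwordGenBook("ABSMaps")));
        System.out.println(choose(new SwordDictionary("EpiphanyMaps")));
    }
}
```

With this, a RawGenBook map module would get a tree and a RawLD4 map
module would get a list, even though both declare Category=Maps.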
>
> b) Decide that all maps should be either hierarchical or linear and
> make sure BD works for that decision. This decision can only be taken
> if there is some consensus on what a map module should be (which I
> haven't found yet) and then the "losing type" of modules would need to
> be rebuilt.
>
> I prefer the former option, of course :) The fix would be pretty
> simple; my development version of FireBible now supports both types.
>
>
>>> 2) NetMaps was not working because its content is something like this:
>>>
>>> <br><br>Journey of Paul (JP) #1, grid D2<br><img
>>> src="/images/jp1.jpg"/><br><br>Journey of Paul (JP) #2, grid
>>> D2<br><img src="/images/jp2.jpg"/><br><br>Journey of Paul (JP) #3,
>>> grid D2<br><img src="/images/jp3.jpg"/><br><br>Journey of Paul (JP)
>>> #4, grid D2<br><img src="/images/jp4.jpg"/>
>>>
>>>
>>> This is not valid XML as the <br> tags are not closed and the
>>> fallback code simply removes all tags in an effort to display
>>> something.
>>>
>>> If you replace <br> with <br/>, it works just fine. The THMLFilter
>>> class makes 3 attempts to parse the text.
>>>
>>> a) The first attempt is made after removing invalid '&' characters
>>> in the text.
>>>
>>> b) If the above fails, it does some further character clean up,
>>> removing disallowed characters from the XML.
>>>
>>> c) If this still fails, it simply removes all tags.
>>>
>>> Maybe we can add an additional step between b and c which would
>>> replace "<br>" with "<br/>"? Or perhaps do it as part of step b.
>>> Any other tags like this which we may want to clean up?
>>>
>>> DM, what do you think?
>> The behavior of JSword certainly could be improved. The typical
>> problem that JSword encounters is a verse that is not well-formed
>> XML. This can readily happen in modules. I have tracked the problem
>> to the following:
>> - Modules built from IMP format are not validated against the
>> spec. For ThML and OSIS, this should be both well-formed XML and
>> valid against the schema. Further for ThML it should only contain the
>> SWORD supported elements. For GBF, it should match the spec. When it
>> is in IMP format, there are no validation tools. Also, osis2mod
>> transforms its input into what SWORD can handle. Building from IMP
>> sidesteps that step, which can cause problems.
>> - OSIS modules in BSP structure can build verses that are not
>> well-formed. This causes problems for all front-ends. But for JSword
>> it is worse than all others. The version of osis2mod in SVN fixes
>> this by using milestoned versions of all
>> - The module building tools do not validate input. It is expected
>> that the module creator does that first. In fact the module creators
>> are relatively brain-dead and merely look for start and end of verses
>> and pass everything in between as
>
> Appreciate the insight & experience. So this really is a bad ThML module.
>
>>
>> Here is how I think the fallback mechanism should be changed: add
>> another step (before the tag stripping):
>> As each un-matched end element is encountered, an opening tag for
>> that element should be prefixed.
>> As each un-matched begin tag is encountered, a closing tag for that
>> element should be suffixed.
>> This is not trivial, essentially a quasi-xml streaming parser needs
>> to be written that uses a stack to know what is opened and what is
>> closed. And the insertions need to be written in the correct order.
>> I say quasi because the xml parser standard requires it to fail on
>> bad input.
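A rough sketch of the stack-based repair step described above. This is
not JSword code; TagBalancer and its regex-based tokenizer are
hypothetical, and it assumes tag-like text only (no comments, CDATA, or
'>' inside attribute values). Unmatched begin tags are closed at the
point where nesting would otherwise break, or at the end of the
fragment; an unmatched end tag gets an opening tag prefixed to the
whole fragment.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalancer {
    // Groups: 1 = "/" for an end tag, 2 = element name,
    //         3 = attributes, 4 = "/" for a self-closed tag.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:._-]*)([^<>]*?)(/?)>");

    private TagBalancer() { }

    static String balance(String fragment) {
        Deque<String> open = new ArrayDeque<>();    // currently open elements
        StringBuilder prefix = new StringBuilder(); // openers for stray end tags
        StringBuilder body = new StringBuilder();
        Matcher m = TAG.matcher(fragment);
        int last = 0;
        while (m.find()) {
            body.append(fragment, last, m.start()); // copy text between tags
            last = m.end();
            String name = m.group(2);
            boolean isEnd = !m.group(1).isEmpty();
            boolean selfClosed = !m.group(4).isEmpty();
            if (!isEnd) {
                body.append(m.group());
                if (!selfClosed) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close any unmatched begin tags nested inside this element.
                while (!open.peek().equals(name)) {
                    body.append("</").append(open.pop()).append('>');
                }
                open.pop();
                body.append(m.group());
            } else {
                // Unmatched end tag: close what is open, then wrap
                // everything so far in a new opening tag.
                closeAll(open, body);
                prefix.insert(0, "<" + name + ">");
                body.append(m.group());
            }
        }
        body.append(fragment, last, fragment.length());
        closeAll(open, body); // unmatched begin tags left at the end
        return prefix.append(body).toString();
    }

    private static void closeAll(Deque<String> open, StringBuilder out) {
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
    }

    public static void main(String[] args) {
        System.out.println(balance("<i>a<b>c</i>"));
    }
}
```

The result is only well-nested, not necessarily valid ThML, so the
tag-stripping step would still be needed as the last resort.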
> Agree, malformed XML is usually a fatal error which generally kills
> the parsing.
>
> I suggested replace "<br>" with "<br/>" because whenever you try to
> parse say HTML as XML, the <br> tag is the primary cause of failure.
> Most of the other tags are closed, even in simple HTML and once this
> "fix" is made, parsing succeeds most of the time. I thought we could
> achieve a similar quick fix by using this approach only because it's
> cheap. If it fails, we do move on to stripping all tags out anyway.
>
> My experience however, is limited to parsing HTML as XML, and I have
> *no experience* with the actual content of Bible modules. So if they
> are prone to more failures from malformed XML where other tags are
> involved, then just fixing <br> does not make sense.
>
> Given the option of either making the parser more "accommodating" or
> insisting on well formed input, I will choose the latter and agree
> with your bottom line below - let's get Karl to fix the module :)
Is <br> the only element in HTML defined to have no content? If we
have a complete list, I'd be happy for your suggested change to be added.
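For reference, a sketch of what that extra step might look like. The
element list is the set declared EMPTY in the HTML 4.01 DTDs (area,
base, basefont, br, col, frame, hr, img, input, isindex, link, meta,
param); whether all of these are relevant to ThML is an assumption.
VoidTagFixer is a hypothetical name, not an existing JSword class, and
the regex assumes no '>' inside attribute values.

```java
import java.util.regex.Pattern;

final class VoidTagFixer {
    // Matches an empty-content HTML element written either as <br>,
    // <br /> or <br/>, with optional attributes captured in group 2.
    private static final Pattern VOID_TAG = Pattern.compile(
        "<(area|base|basefont|br|col|frame|hr|img|input|isindex|link|meta|param)"
        + "\\b([^<>]*?)\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    private VoidTagFixer() { }

    // Rewrite every such tag in self-closed form, e.g. <br> becomes <br/>.
    // Tags that are already self-closed come out unchanged.
    static String closeVoidTags(String text) {
        return VOID_TAG.matcher(text).replaceAll("<$1$2/>");
    }

    public static void main(String[] args) {
        String sample = "<br><br>Journey of Paul (JP) #1, grid D2<br>"
            + "<img src=\"/images/jp1.jpg\"/>";
        System.out.println(closeVoidTags(sample));
    }
}
```

This would run before the tag-stripping fallback, so a failure here
costs nothing beyond one more parse attempt.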
>
>
>>
>> One of the reasons I advocate using an XML parser for our module
>> creator is that it would not allow input that is not well-formed.
>
> Agreed.
>
>>
>> The other thing would be to change the ThML filter to not use an XML
>> parser. This too is not trivial.
>
> Sticking with XML seems to be the way to go, especially since that is
> what ThML is supposed to contain. Does JSword already use another
> parsing mechanism for some other source formats?
Yes. For plain text, newlines are replaced with <lb/> in the
transformation to OSIS. This is a simple substitution. GBF looks a lot
like XML but only superficially. We have a custom parser for that.
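To illustrate the plain text case, the substitution amounts to
something like the following. This is a sketch, not the actual JSword
filter; PlainTextLines is a hypothetical name, and a real filter would
also need to escape characters such as '&' and '<' before the text is
treated as OSIS.

```java
final class PlainTextLines {
    private PlainTextLines() { }

    // Replace each line ending (CRLF, CR or LF) with an OSIS <lb/> element.
    static String toOsisLineBreaks(String plain) {
        return plain.replaceAll("\r\n|\r|\n", "<lb/>");
    }

    public static void main(String[] args) {
        System.out.println(toOsisLineBreaks("first line\nsecond line"));
    }
}
```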
>
>>
>> Bottom Line: The definition of ThML is that it is not a superset of
>> html but of xhtml. I don't think we should handle invalid ThML but
>> only valid ThML.
>>
>> Karl is very responsive to fixing problems in his modules. (Yeah!!!)
>> I think that Karl should fix his module to be valid ThML.
>>
> I'll make a post to sword-devel about this. Unless he's already
> listening here too ;)
>
> Brian.
Brian,
If you can work up a patch for any of this it would be appreciated.
In Him,
DM