[jsword-devel] JSword and Map/Image modules

Wed Jan 28 17:55:27 MST 2009

DM Smith wrote:
> Brian Fernandes wrote:
>> To answer my own questions:
>>
>> 1) Map modules can be both hierarchical and simple lists. I should
>> have realized that all the modules I talked about earlier were in fact
>> Map modules; but two (FarsiBibleAtlas and ABSMaps) were stored in gen
>> book format? (not sure of the right terminology here). The other two
>> were in dictionary format.
>>
>> So basically we'll have to make a small change to BD code to see which 
>> of these formats is being used and use a list or a tree accordingly.
> 
> This should be handled already. All dictionary modules are treated as 
> lists and all gen book modules are treated as trees.
What's happening is slightly different. BD is looking at the stated 
module *category* and not the actual module type to decide between tree 
and list. Both FarsiBibleAtlas and Epiphany Maps are in the Map 
*category*. However, for FarsiBibleAtlas the actual Book object is an 
instance of SwordGenBook. For Epiphany Maps, the Book object is an 
instance of SwordDictionary.

Here is the relevant info from two of the .conf files (ABSMaps is a gen
book module, like FarsiBibleAtlas).

[EpiphanyMaps]
DataPath=./modules/lexdict/rawld4/epiphany-maps/maps
ModDrv=RawLD4
Category=Maps

[ABSMaps]
DataPath=./modules/genbook/rawgenbook/absmaps/maps
ModDrv=RawGenBook
Category=Maps

Since BD decides based on category, it uses a tree for both.
FarsiBibleAtlas therefore works just fine, but Epiphany Maps fails to
show any sort of list.

The options I see are:
a) Change logic (just for maps) to make a list/tree decision based on 
the type of book object.

b) Decide that all maps should be either hierarchical or linear and make 
sure BD works for that decision. This decision can only be taken if 
there is some consensus on what a map module should be (which I haven't 
found yet) and then the "losing type" of modules would need to be rebuilt.

I prefer the former option, of course :) The fix would be pretty simple; 
my development version of FireBible now supports both types.
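
A minimal sketch of option (a), in Java since JSword is Java. This is not
BD's or FireBible's actual code: GenBookStandIn and DictionaryStandIn are
hypothetical stand-ins for the SwordGenBook and SwordDictionary instances
mentioned above, and KeyView and chooseView are names made up for
illustration. The point is only that the decision looks at the object's
type and ignores the category.

```java
// Hypothetical stand-ins for the real book classes; the category is stored
// only to show that it plays no part in the decision.
interface BookStandIn {
    String getCategory();
}

final class GenBookStandIn implements BookStandIn {
    private final String category;
    GenBookStandIn(String category) { this.category = category; }
    public String getCategory() { return category; }
}

final class DictionaryStandIn implements BookStandIn {
    private final String category;
    DictionaryStandIn(String category) { this.category = category; }
    public String getCategory() { return category; }
}

enum KeyView { LIST, TREE }

final class MapViewChooser {
    // Gen book objects get a tree; everything else falls back to a list.
    static KeyView chooseView(BookStandIn book) {
        return (book instanceof GenBookStandIn) ? KeyView.TREE : KeyView.LIST;
    }
}
```

With this, two modules that both declare Category=Maps end up with
different views, depending on how they are stored.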

>> 2) NetMaps was not working because its content is something like this:
>>
>> <br><br>Journey of Paul (JP) #1, grid D2<br><img 
>> src="/images/jp1.jpg"/><br><br>Journey of Paul (JP) #2, grid 
>> D2<br><img src="/images/jp2.jpg"/><br><br>Journey of Paul (JP) #3, 
>> grid D2<br><img src="/images/jp3.jpg"/><br><br>Journey of Paul (JP) 
>> #4, grid D2<br><img src="/images/jp4.jpg"/>
>>
>>
>> This is not well-formed XML as the <br> tags are not closed, and the
>> fallback code simply removes all tags in an effort to display something.
>>
>> If you replace <br> with <br/>, it works just fine. The THMLFilter 
>> class makes 3 attempts to parse the text.
>>
>> a) The first attempt is made after removing invalid '&' characters in 
>> the text.
>>
>> b) If the above fails, it does some further character clean up, 
>> removing disallowed characters from the XML.
>>
>> c) If this still fails, it simply removes all tags.
>>
>> Maybe we can add an additional step between b and c which would 
>> replace "<br>" with "<br/>"?  Or perhaps do it as part of step b. Any 
>> other tags like this which we may want to clean up?
>>
>> DM, what do you think? 
> The behavior of JSword certainly could be improved. The typical problem 
> that JSword encounters is a verse that is not well-formed XML. This can 
> readily happen in modules. I have tracked the problem to the following:
> - Modules built from IMP format are not validated against the spec.
> For ThML and OSIS, the input should be both well-formed XML and valid
> against the schema. Further, for ThML it should only contain the SWORD
> supported elements. For GBF, it should match the spec. When it is in IMP
> format, there are no validation tools. Also, osis2mod transforms its
> input into what SWORD can handle. Building from IMP sidesteps this,
> which can cause problems.
> - OSIS modules in BSP structure can build verses that are not 
> well-formed. This causes problems for all front-ends. But for JSword it 
> is worse than all others. The version of osis2mod in SVN fixes this by 
> using milestoned versions of all
> - The module building tools do not validate input. It is expected that 
> the module creator does that first. In fact the module creators are 
> relatively brain-dead and merely look for start and end of verses and 
> pass everything in between as

Appreciate the insight & experience. So this really is a bad ThML module.

> 
> I think the fallback mechanism should be changed by adding another
> step (before the tag stripping):
> As each un-matched end element is encountered, an opening tag for that 
> element should be prefixed.
> As each un-matched begin tag is encountered, a closing tag for that 
> element should be suffixed.
> This is not trivial; essentially, a quasi-XML streaming parser needs to
> be written that uses a stack to know what is opened and what is closed.
> And the insertions need to be written in the correct order.
> I say quasi because the XML standard requires a parser to fail on bad
> input.
Agree, malformed XML is usually a fatal error which generally kills the 
parsing.
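
To get a feel for the stack-based repair described above, here is a rough
Java sketch. It is an assumption-laden illustration, not JSword code:
TagBalancerSketch and balance are hypothetical names, tags are found with a
simple regular expression, and comments, CDATA sections, processing
instructions and '>' inside attribute values are not handled. An unmatched
end tag gets an opening tag inserted at the start of the enclosing
element's content (or at the very beginning if nothing is open), and
unmatched begin tags are closed when an outer element closes or at the end.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalancerSketch {
    // Matches <name ...>, </name> and <name .../> only.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:._-]*)([^<>]*)>");

    // An open element plus the output offset where its content starts.
    private static final class Open {
        final String name;
        final int contentStart;
        Open(String name, int contentStart) {
            this.name = name;
            this.contentStart = contentStart;
        }
    }

    static String balance(String text) {
        StringBuilder out = new StringBuilder();
        Deque<Open> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());   // copy text before the tag
            last = m.end();
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = m.group(3).endsWith("/");
            if (!closing) {
                out.append(m.group());
                if (!selfClosing) {
                    stack.push(new Open(name, out.length()));
                }
            } else if (isOpen(stack, name)) {
                // Close any unmatched begin tags nested inside this element.
                while (!stack.peek().name.equals(name)) {
                    out.append("</").append(stack.pop().name).append('>');
                }
                stack.pop();
                out.append(m.group());
            } else {
                // Unmatched end tag: prefix an opening tag for it.
                int at = stack.isEmpty() ? 0 : stack.peek().contentStart;
                out.insert(at, "<" + name + ">");
                out.append(m.group());
            }
        }
        out.append(text, last, text.length());
        // Suffix closing tags for whatever is still open, innermost first.
        while (!stack.isEmpty()) {
            out.append("</").append(stack.pop().name).append('>');
        }
        return out.toString();
    }

    private static boolean isOpen(Deque<Open> stack, String name) {
        for (Open o : stack) {
            if (o.name.equals(name)) {
                return true;
            }
        }
        return false;
    }
}
```

One consequence of this policy: a run of unclosed <br> tags comes out
nested inside each other, with the closing tags piled up at the end. That
is well-formed, but it is not the structure the author meant, which is part
of why this kind of repair is not trivial.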

I suggested replacing "<br>" with "<br/>" because whenever you try to
parse, say, HTML as XML, the <br> tag is the primary cause of failure.
Most of the other tags are closed, even in simple HTML, and once this
"fix" is made, parsing succeeds most of the time. I thought we could
achieve a similar quick fix by using this approach only because it's
cheap. If it fails, we do move on to stripping all tags out anyway.
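
Roughly, the quick fix would look like the Java sketch below. This is not
the THMLFilter code: BrFallbackSketch and its methods are hypothetical
names, the well-formedness check wraps the fragment in a made-up <div> root
and runs it through the JDK's DocumentBuilder, and the earlier character
clean-up steps are left out.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class BrFallbackSketch {
    private static final Pattern OPEN_BR =
        Pattern.compile("<br\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Turn every unclosed <br> into an empty-element tag.
    static String closeBrTags(String text) {
        return OPEN_BR.matcher(text).replaceAll("<br/>");
    }

    // True if the fragment parses once wrapped in a single root element.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(
                new StringReader("<div>" + fragment + "</div>")));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Try the text as-is, then with <br> fixed, then with all tags stripped.
    static String cleanUp(String text) {
        if (isWellFormed(text)) {
            return text;
        }
        String fixed = closeBrTags(text);
        if (isWellFormed(fixed)) {
            return fixed;
        }
        return ANY_TAG.matcher(text).replaceAll("");
    }
}
```

On content like the NetMaps entry, the second attempt succeeds and the
<img> elements survive; anything broken in some other way still falls
through to tag stripping.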

My experience, however, is limited to parsing HTML as XML, and I have *no 
experience* with the actual content of Bible modules. So if they are 
prone to more failures from malformed XML where other tags are involved, 
then just fixing <br> does not make sense.

Given the option of either making the parser more "accommodating" or 
insisting on well-formed input, I will choose the latter and agree with 
your bottom line below - let's get Karl to fix the module :)

> 
> One of the reasons I advocate using an XML parser for our module creator 
> is that it would not allow input that is not well-formed.

Agreed.

> 
> The other thing would be to change the ThML filter to not use an XML 
> parser. This too is not trivial.

Sticking with XML seems to be the way to go, especially since that is 
what ThML is supposed to contain. Does JSword already use another 
parsing mechanism for some other source formats?

> 
> Bottom Line: The definition of ThML is that it is not a superset of HTML 
> but of xhtml. I don't think we should handle invalid ThML but only valid 
> ThML.
> 
> Karl is very responsive to fixing problems in his modules. (Yeah!!!) I 
> think that Karl should fix his module to be valid ThML.
> 
I'll make a post to sword-devel about this. Unless he's already 
listening here too ;)

Brian.