[jsword-devel] JSword and Map/Image modules

DM Smith dmsmith555 at yahoo.com
Wed Jan 28 07:59:52 MST 2009


Brian Fernandes wrote:
> To answer my own questions:
>
> 1) Map modules can be both hierarchical and simple lists. I should 
> have realized that all the modules I talked about earlier were in fact 
> Map modules; but two (FrarsiBibleAtlas and AbsMaps) were stored in gen 
> book format? (not sure of the right terminology here). The other two 
> were in dictionary format.
>
> So basically we'll have to make a small change to BD code to see which 
> of these formats is being used and use a list or a tree accordingly.

This should be handled already. All dictionary modules are treated as 
lists and all gen book modules are treated as trees.

>
>
> 2) NetMaps was not working because its content is something like this:
>
> <br><br>Journey of Paul (JP) #1, grid D2<br><img 
> src="/images/jp1.jpg"/><br><br>Journey of Paul (JP) #2, grid 
> D2<br><img src="/images/jp2.jpg"/><br><br>Journey of Paul (JP) #3, 
> grid D2<br><img src="/images/jp3.jpg"/><br><br>Journey of Paul (JP) 
> #4, grid D2<br><img src="/images/jp4.jpg"/>
>
>
> This is not valid XML as the <br> tags are not closed and the fallback 
> code simply removes all tags in an effort to display something.
>
> If you replace <br> with <br/>, it works just fine. The THMLFilter 
> class makes 3 attempts to parse the text.
>
> a) The first attempt is made after removing invalid '&' characters in 
> the text.
>
> b) If the above fails, it does some further character clean up, 
> removing disallowed characters from the XML.
>
> c) If this still fails, it simply removes all tags.
>
> Maybe we can add an additional step between b and c which would 
> replace "<br>" with "<br/>"?  Or perhaps do it as part of step b. Any 
> other tags like this which we may want to clean up?
>
> DM, what do you think? 
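
A minimal sketch of that extra step, for illustration only. The class 
VoidTagFixer and its method selfClose are made-up names, not anything in 
JSword, and the element list (br, hr, img) is an assumption about which 
empty elements show up unclosed in modules. It rewrites only those tags 
to their self-closing form and leaves everything else alone.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JSword. */
final class VoidTagFixer {
    // Matches <br>, <hr> and <img ...>, whether or not they already end in "/>".
    // Group 1 is the element name, group 2 the attributes (possibly empty).
    // The element list is an assumption made for this sketch.
    private static final Pattern VOID_TAG =
        Pattern.compile("<(br|hr|img)\\b([^<>]*?)\\s*/?>", Pattern.CASE_INSENSITIVE);

    private VoidTagFixer() { }

    /** Rewrites every matched tag as a self-closing tag, e.g. <br> to <br/>. */
    static String selfClose(String text) {
        return VOID_TAG.matcher(text).replaceAll("<$1$2/>");
    }
}
```

Tags that are already self-closing come out unchanged, so running this 
on a well-formed entry should be harmless.
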
The behavior of JSword certainly could be improved. The typical problem 
that JSword encounters is a verse that is not well-formed XML. This can 
readily happen in modules. I have tracked the problem to the following:
- Modules are built from IMP format and are not validated against the 
spec. ThML and OSIS input should be both well-formed XML and valid 
against the schema. Further, ThML should contain only the 
SWORD-supported elements, and GBF should match its spec. Once the text 
is in IMP format, there are no validation tools. Also, osis2mod 
transforms its input into what SWORD can handle. Building from IMP 
sidesteps that transformation, which can cause problems.
- OSIS modules in BSP structure can build verses that are not 
well-formed. This causes problems for all front-ends. But for JSword it 
is worse than all others. The version of osis2mod in SVN fixes this by 
using milestoned versions of all
- The module building tools do not validate input. It is expected that 
the module creator does that first. In fact the module creators are 
relatively brain-dead and merely look for start and end of verses and 
pass everything in between as

Here is how I think the fallback mechanism should be changed, by adding 
another step (before the tag stripping):
As each un-matched end element is encountered, an opening tag for that 
element should be prefixed.
As each un-matched begin tag is encountered, a closing tag for that 
element should be suffixed.
This is not trivial. Essentially, a quasi-XML streaming parser needs to 
be written that uses a stack to know what is opened and what is closed, 
and the insertions need to be written in the correct order.
I say quasi because the XML standard requires a parser to fail on bad 
input.
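
To make the idea concrete, here is a rough sketch of such a stack-based 
pass. It is only an illustration under stated assumptions: TagBalancer 
and balance are hypothetical names, not JSword code; tags are found with 
a regular expression, so attribute values containing '<' or '>', 
comments and CDATA are not handled; and invented start tags are prefixed 
to the whole fragment.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JSword: balances tags in a fragment. */
final class TagBalancer {
    // Group 1 is "/" for an end tag, group 2 the element name,
    // group 3 is "/" for a self-closing tag.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:_.-]*)[^<>]*?(/?)>");

    private TagBalancer() { }

    static String balance(String text) {
        Deque<String> open = new ArrayDeque<>();     // open elements, innermost first
        StringBuilder prefix = new StringBuilder();  // start tags invented for stray end tags
        StringBuilder body = new StringBuilder();
        Matcher m = TAG.matcher(text);
        int last = 0;
        while (m.find()) {
            body.append(text, last, m.start());      // copy the text before this tag
            last = m.end();
            String name = m.group(2);
            boolean isEnd = !m.group(1).isEmpty();
            boolean isSelfClosing = !m.group(3).isEmpty();
            if (!isEnd) {
                if (!isSelfClosing) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close anything left open inside the element being ended.
                while (!open.peek().equals(name)) {
                    body.append("</").append(open.pop()).append('>');
                }
                open.pop();
            } else {
                // Unmatched end tag: close whatever is still open so nothing
                // overlaps, then invent a start tag at the very front.
                while (!open.isEmpty()) {
                    body.append("</").append(open.pop()).append('>');
                }
                prefix.insert(0, "<" + name + ">");
            }
            body.append(m.group());                  // the tag itself is kept as written
        }
        body.append(text, last, text.length());
        // Suffix end tags for everything that was never closed.
        while (!open.isEmpty()) {
            body.append("</").append(open.pop()).append('>');
        }
        return prefix.append(body).toString();
    }
}
```

With this, "text</b>" becomes "<b>text</b>" and "<i>text" becomes 
"<i>text</i>". Note that an unclosed <br> ends up wrapping whatever 
follows it, which is well-formed but not necessarily what the module 
author meant.
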

One of the reasons I advocate using an XML parser for our module creator 
is that it would not allow input that is not well-formed.
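
As an illustration of the kind of check a parser gives for free, here is 
a small sketch using the standard JAXP SAX parser. The class name 
WellFormedCheck and the wrapping <verse> root element are assumptions 
made for the example, not part of any SWORD or JSword tool.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check, not part of SWORD or JSword. */
final class WellFormedCheck {
    private WellFormedCheck() { }

    /** Returns true if the fragment parses as well-formed XML content. */
    static boolean isWellFormed(String fragment) {
        // Wrap the fragment in a made-up root element so that mixed
        // text and elements form a single document.
        String doc = "<verse>" + fragment + "</verse>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser reports any well-formedness error as a fatal error.
            return false;
        } catch (Exception e) {
            // ParserConfigurationException or IOException: an environment
            // problem rather than a problem with the input.
            throw new IllegalStateException(e);
        }
    }
}
```

A module creator built this way could reject "<br>text" up front, while 
"<br/>text" would pass.
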

The other thing would be to change the ThML filter to not use an XML 
parser. This too is not trivial.

Bottom line: ThML is defined as a superset of XHTML, not of HTML. I 
don't think we should handle invalid ThML, only valid ThML.

Karl is very responsive to fixing problems in his modules. (Yeah!!!) I 
think that Karl should fix his module to be valid ThML.

In Him,
    DM




