[sword-devel] osis2mod import issue

DM Smith dmsmith at crosswire.org
Thu Jun 4 09:53:43 MST 2009


Mattias Põldaru wrote:
> One fine day, Wed, 2009-06-03 at 19:25, DM Smith wrote:
>   
>> On Jun 3, 2009, at 1:36 PM, Mattias Põldaru wrote:
>>
>>     
>>> Hi everybody.
>>>
>>> It is nice to see you (DM, I suppose) got the osis2mod working in no
>>> time at all. There is one more issue with preverse stuff. Some
>>> whitespace gets counted as preverse on my file and I think this is
>>> wrong, although it isn't that complicated at all to remove whitespace
>>> from my source document. I paste an example here.
>>>
>>>
>>> Here is the input osis file. Please correct me, if I have something
>>> wrong here.
>>> <!-- start of example clip -->
>>> <div type="bookGroup">
>>>        <title>Vana Testament</title>
>>>        <div type="book" osisID="Gen" canonical="true">
>>>                <title type="main">1. Moosese</title>
>>>                        <div type="section" scope="Gen.1.1-Gen.2.3" >
>>>                        <title>Maailma ja inimese loomine</title>
>>>                        <chapter sID="Gen.1" osisID="Gen.1" />
>>>                            <title type="chapter">1. peatükk</title>
>>>                            <p>
>>>                                <verse sID="Gen.1.1" osisID="Gen.1.1" />
>>> Alguses lõi Jumal taevad ja maa.
>>>                                <verse eID="Gen.1.1" />
>>>                            </p>
>>>                            <p>
>>>                                <verse sID="Gen.1.2" osisID="Gen.1.2" />
>>> Ja maa oli tühi ja paljas ja pimedus oli sügavuse peal ja Jumala Vaim
>>> hõljus vete kohal.
>>>                                <verse eID="Gen.1.2" />
>>>                            </p>
>>> <!-- end of example clip -->
>>>
>>>
>>>
>>>
>>> And here is the corresponding module output. Please notice the
>>> preverse that contains only a single space.
>>> <!-- start of example clip -->
>>> <div sID="gen1" type="bookGroup"/> <title>Vana Testament</title> <div
>>> canonical="true" osisID="Gen" sID="gen2" type="book"/> <title
>>> type="main">1. Moosese</title> <div sID="gen3" scope="Gen.1.1-Gen.2.3"
>>> type="section"/> <title>Maailma ja inimese loomine</title>
>>> <chapter osisID="Gen.1" sID="Gen.1"/> <title type="chapter">1.
>>> peatükk</title> <div sID="gen4" type="paragraph"/>
>>> Alguses lõi Jumal taevad ja maa.  <div eID="gen4" type="paragraph"/>
>>> <div type="x-milestone" subType="x-preverse" sID="pv1"/><div sID="gen5"
>>> type="paragraph"/> <div type="x-milestone" subType="x-preverse"
>>> eID="pv1"/> Ja maa oli tühi ja paljas ja pimedus oli sügavuse peal ja
>>> Jumala Vaim hõljus vete kohal.  <div eID="gen5" type="paragraph"/>
>>> <!-- end of example clip -->
>>>       
>> The pre-verse contains "<p> " (the paragraph start and the space)
>>
>> Handling of whitespace is a bit problematic. What osis2mod does is  
>> replace sequences of whitespace (newlines, spaces and tabs) with a  
>> single space. If a verse contains leading or trailing space, it is  
>> trimmed. (I don't think it should do this trimming.)
>>
>> What osis2mod does not have is knowledge of the containment model of  
>> the OSIS schema. If it did, it could remove whitespace between  
>> element tags that don't allow for text.
>>
>> In this case, the OSIS schema allows for whitespace after the opening  
>> paragraph tag and before the verse tag. One could have:
>> <p>yada yada yada <verse>verse text</verse> yada yada yada</p>
>> In this case, it would be inappropriate to trim the whitespace off of  
>> the text that precedes the verse.
>>
>> If we can come up with a good heuristic I'd be glad to implement it.
>>
>>     
> For the case I have, it would be sufficient to check if the preverse has
> any printing characters and not to add an empty preverse.
>   

The preverse is not empty; it contains
<div type="paragraph" sID="gen5">
which is the transformation of <p> into a milestoned representation.

It also has a single space following that element.
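
As a rough illustration of why a "printing characters" test alone does
not settle this, here is a small Python sketch (not osis2mod code; the
helper names has_printing_text and has_markup are made up for this
example). It takes the pre-verse content from the output above and
shows that it has no printing text but still carries markup:

```python
import re

# The pre-verse content from the example output: the milestoned
# paragraph start followed by a single space.
preverse = '<div sID="gen5" type="paragraph"/> '

def has_printing_text(fragment: str) -> bool:
    """Return True if non-whitespace text remains once tags are removed."""
    text_only = re.sub(r"<[^>]*>", "", fragment)
    return text_only.strip() != ""

def has_markup(fragment: str) -> bool:
    """Return True if the fragment contains at least one tag."""
    return re.search(r"<[^>]*>", fragment) is not None

print(has_printing_text(preverse))  # False
print(has_markup(preverse))         # True
```

So dropping every pre-verse without printing characters would also drop
the paragraph start, which still has to go somewhere.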

Where should the paragraph be put? It either is appended to the prior 
verse or it is pre-verse.

The one solution I thought of is that any whitespace immediately 
following a block element start (<div>, <lg>, <p>, ...) can be deleted. 
Likewise for any whitespace immediately before the end element.

Would this work?
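
To make the idea concrete, here is a minimal Python sketch of that
heuristic. It is only an illustration on the source text, not the
osis2mod implementation: the function name trim_block_whitespace is
hypothetical, the list of block elements is just the three named above
(a real list would be longer), and regular expressions stand in for a
proper XML parser.

```python
import re

# Block-level element names taken from this message (<div>, <lg>, <p>).
# Assumption: a real implementation would use a fuller list.
BLOCK = r"(?:div|lg|p)"

# A block start tag that is not self-closing, followed by whitespace.
AFTER_START = re.compile(r"(<" + BLOCK + r"\b[^>]*(?<!/)>)\s+")
# Whitespace followed by a block end tag.
BEFORE_END = re.compile(r"\s+(</" + BLOCK + r"\s*>)")

def trim_block_whitespace(xml: str) -> str:
    """Collapse whitespace runs, then drop whitespace just inside block tags."""
    xml = re.sub(r"\s+", " ", xml)     # newlines, spaces, tabs -> one space
    xml = AFTER_START.sub(r"\1", xml)  # "<p> text"  -> "<p>text"
    xml = BEFORE_END.sub(r"\1", xml)   # "text </p>" -> "text</p>"
    return xml

src = '<p>\n    <verse sID="Gen.1.2" osisID="Gen.1.2" />\nverse text\n    <verse eID="Gen.1.2" />\n</p>'
print(trim_block_whitespace(src))
```

Text that sits between the paragraph start and the verse, as in the
"yada yada yada" example, is left alone because no whitespace directly
follows the <p> there, so the space before the verse tag survives.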

In Him,
    DM


