[sword-devel] osis2mod import issue
DM Smith
dmsmith at crosswire.org
Thu Jun 4 09:53:43 MST 2009
Mattias Põldaru wrote:
> On a fine day, Wed, 2009-06-03 at 19:25, DM Smith wrote:
>
>> On Jun 3, 2009, at 1:36 PM, Mattias Põldaru wrote:
>>
>>
>>> Hi everybody.
>>>
>>> It is nice to see you (DM, I suppose) got the osis2mod working in no
>>> time at all. There is one more issue with preverse stuff. Some
>>> whitespace gets counted as preverse on my file and I think this is
>>> wrong, although it isn't that complicated at all to remove whitespace
>>> from my source document. I paste an example here.
>>>
>>>
>>> Here is the input osis file. Please correct me, if I have something
>>> wrong here.
>>> <!-- start of example clip -->
>>> <div type="bookGroup">
>>> <title>Vana Testament</title>
>>> <div type="book" osisID="Gen" canonical="true">
>>> <title type="main">1. Moosese</title>
>>> <div type="section" scope="Gen.1.1-Gen.2.3" >
>>> <title>Maailma ja inimese loomine</title>
>>> <chapter sID="Gen.1" osisID="Gen.1" />
>>> <title type="chapter">1. peatükk</title>
>>> <p>
>>> <verse sID="Gen.1.1" osisID="Gen.1.1" />
>>> Alguses lõi Jumal taevad ja maa.
>>> <verse eID="Gen.1.1" />
>>> </p>
>>> <p>
>>> <verse sID="Gen.1.2" osisID="Gen.1.2" />
>>> Ja maa oli tühi ja paljas ja pimedus oli sügavuse peal ja Jumala Vaim
>>> hõljus vete kohal.
>>> <verse eID="Gen.1.2" />
>>> </p>
>>> <!-- end of example clip -->
>>>
>>>
>>>
>>>
>>> And here is the corresponding module output. Please notice the
>>> preverse that contains only a single space.
>>> <!-- start of example clip -->
>>> <div sID="gen1" type="bookGroup"/> <title>Vana Testament</title> <div
>>> canonical="true" osisID="Gen" sID="gen2" type="book"/> <title
>>> type="main">1. Moosese</title> <div sID="gen3" scope="Gen.1.1-Gen.2.3"
>>> type="section"/> <title>Maailma ja inimese loomine</title>
>>> <chapter osisID="Gen.1" sID="Gen.1"/> <title type="chapter">1.
>>> peatükk</title> <div sID="gen4" type="paragraph"/>
>>> Alguses lõi Jumal taevad ja maa. <div eID="gen4" type="paragraph"/>
>>> <div type="x-milestone" subType="x-preverse" sID="pv1"/><div
>>> sID="gen5"
>>> type="paragraph"/> <div type="x-milestone" subType="x-preverse"
>>> eID="pv1"/> Ja maa oli tühi ja paljas ja pimedus oli sügavuse peal ja
>>> Jumala Vaim hõljus vete kohal. <div eID="gen5" type="paragraph"/>
>>> <!-- end of example clip -->
>>>
>> The pre-verse contains "<p> " (the paragraph start and the space)
>>
>> Handling of whitespace is a bit problematic. What osis2mod does is
>> replace sequences of whitespace (newlines, spaces and tabs) with a
>> single space. If a verse contains leading or trailing space, it is
>> trimmed. (I don't think it should do this trimming.)
>>
>> What osis2mod does not have is knowledge of the containment model of
>> the OSIS schema. If it did, it could remove whitespace between
>> element tags that don't allow for text.
>>
>> In this case, the OSIS schema allows for whitespace after the opening
>> paragraph tag and before the verse tag. One could have:
>> <p>yada yada yada <verse>verse text</verse> yada yada yada</p>
>> In this case, it would be inappropriate to trim the whitespace off of
>> the text that precedes the verse.
>>
>> If we can come up with a good heuristic I'd be glad to implement it.
>>
>>
> For the case I have, it would be sufficient to check if the preverse has
> any printing characters and not to add an empty preverse.
>
The preverse is not empty. It contains
<div type="paragraph" sID="gen5">
which is the transformation of <p> into a milestoned representation.
It also has a single space following that element.

Where should the paragraph be put? It is either appended to the prior
verse or it is pre-verse.

The one solution I thought of is that any whitespace immediately
following a block element start (<div>, <lg>, <p>, ...) can be deleted.
Likewise for any whitespace immediately before the block element end.
Would this work?
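
To make the idea concrete, here is a rough sketch of the heuristic in Python.
It is not osis2mod code, and the function name, the regular expressions and
the list of block elements are only assumptions for illustration (only <div>,
<lg> and <p> are listed). It operates on the raw OSIS text and leaves
self-closing (milestoned) tags alone.

```python
import re

# Assumed list of block-level OSIS elements; extend as needed.
BLOCK_ELEMENTS = ("div", "lg", "p")

_names = "|".join(BLOCK_ELEMENTS)
# A block element start tag (not self-closing) followed by whitespace.
_AFTER_START = re.compile(r"(<(?:%s)\b[^<>]*(?<!/)>)\s+" % _names)
# Whitespace followed by a block element end tag.
_BEFORE_END = re.compile(r"\s+(</(?:%s)\s*>)" % _names)


def strip_block_edge_whitespace(osis):
    """Delete whitespace just inside block element start and end tags."""
    osis = _AFTER_START.sub(r"\1", osis)
    return _BEFORE_END.sub(r"\1", osis)


clip = (
    '<p>\n'
    '<verse sID="Gen.1.2" osisID="Gen.1.2" />\n'
    'Ja maa oli tühi ja paljas.\n'
    '<verse eID="Gen.1.2" />\n'
    '</p>'
)
cleaned = strip_block_edge_whitespace(clip)
```

With this, the newline between <p> and the verse start disappears, so the
pre-verse would hold only the paragraph start. Text such as
<p>yada yada <verse .../> is untouched, because the whitespace there does not
immediately follow the <p>.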
In Him,
DM