[sword-devel] Entities in modules
DM Smith
dmsmith at crosswire.org
Thu Nov 12 11:05:03 MST 2009
On 11/12/2009 12:08 PM, Sebastien Koechlin wrote:
> On Wed, Nov 11, 2009 at 03:50:12PM -0500, DM Smith wrote:
>
>> We have a few modules that have entities in them. These are of the fashion
>> &nbsp; (a character entity), &#85; (a numeric decimal entity) and &#xC5;
>> (a numeric hex entity).
>>
>> These cause various problems:
>>
> This is because osis2mod does not use an XML parser.
I'm not seeing the problem in OSIS modules, but in ThML modules. They
are perfectly valid in ThML modules, but are problematic. I will be
going over all the modules looking for these and will report problematic
CrossWire modules at www.crosswire.org/bugs. And I'll pass along any
problems I find in the Xiphos and Bible.org modules.
My understanding is that a true XML parser has strict requirements as to
how it is to handle errors: put out an error message and die.
If we used a true XML parser for osis2mod, it would die on the first
character entity that was not &amp;, &lt;, &gt; or &quot; unless it were
defined in the schema. OSIS does not define additional character entities.
We make the assumption that input to osis2mod has been validated against
the OSIS schema. If this is true then there are no character entities in
the input.
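
To illustrate the strictness described above, here is a minimal sketch using Python's standard xml.etree.ElementTree, which is only a stand-in for "a true XML parser" and has nothing to do with osis2mod itself. The helper name try_parse and the <w> wrapper element are made up for the example. Predefined and numeric entities resolve to plain characters, while an undeclared named entity such as &nbsp; stops the parse with an error.

```python
import xml.etree.ElementTree as ET


def try_parse(fragment: str) -> str:
    """Return the text content of a fragment, or a short error description.

    Hypothetical helper for illustration only.
    """
    try:
        # itertext() yields the text with all entities already resolved.
        return "".join(ET.fromstring(fragment).itertext())
    except ET.ParseError as err:
        # A conforming parser treats an undeclared entity as a fatal error.
        return f"error: {err}"


# Predefined and numeric entities are resolved by the parser.
print(try_parse("<w>&amp; &#85; &#xC5;</w>"))

# &nbsp; is not declared anywhere, so parsing fails.
print(try_parse("<w>a&nbsp;b</w>"))
```

The first call yields the characters themselves; the second reports a parse error rather than passing the entity through.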
> A character entity is
> just a useful way to write a character you cannot or do not want to
> put in your XML file. When parsed and resolved, they must not be
> distinguishable from other characters. The same applies to CDATA sections.
>
I agree with the statement above as far as it goes. But what is the XML
parser to do when it discovers a character entity that it cannot resolve?
> osis2mod should not keep entities when reading an OSIS file. I think it's a
> big mistake and we should not rely on external programs many people will
> have trouble running.
>
I'd agree that numeric entities should be converted. And I think that
osis2mod should complain if it finds entities that are not valid for an
OSIS document and prompt the user to validate the input document.
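
As a rough sketch of that behaviour (not the actual osis2mod code, which is C++), the following Python converts numeric entities to their characters and collects any named entity outside the four mentioned earlier so the caller can warn the user. The names convert_entities, ALLOWED_NAMED and ENTITY are assumptions made up for this example. Numeric references to markup-significant characters are deliberately left escaped, on the assumption that the output must remain well-formed.

```python
import re

# The four named entities this message treats as valid in OSIS input.
ALLOWED_NAMED = {"amp", "lt", "gt", "quot"}

# Matches decimal (&#85;), hex (&#xC5;) and named (&nbsp;) entities.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Convert numeric entities; report named entities not valid for OSIS.

    Returns (converted_text, sorted list of offending entity names).
    """
    invalid = set()

    def replace(match):
        body = match.group(1)
        if not body.startswith("#"):
            if body not in ALLOWED_NAMED:
                invalid.add(body)
            return match.group(0)  # named entities are left untouched
        code = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
        if code > 0x10FFFF or chr(code) in '&<>"':
            # Out of Unicode range, or would break the markup: keep as-is.
            return match.group(0)
        return chr(code)

    return ENTITY.sub(replace, text), sorted(invalid)


converted, problems = convert_entities("a&#85;b&#xC5;c&nbsp;d&amp;e&#38;f")
for name in problems:
    print(f"warning: &{name}; is not valid in OSIS; please validate the input")
```

Here the decimal and hex entities become U and Å, &amp; and &#38; stay escaped, and &nbsp; is reported rather than silently kept or dropped.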
Regarding module writers having trouble running tools, we've talked
about having a web service at CrossWire.org that would provide the
appropriate validation, conversion, creation, and so on, of an OSIS text.
We've just not had a volunteer step up to the task.
> We also had troubles with non-canonical Unicode sequences and I think
> osis2mod was corrected.
>
> Named entities such as nbsp came from HTML and should not be used in OSIS as
> they are not declared in osisCore.2.1.1.xsd; they result in an invalid document.
> BUT, as we do not use an XML parser, we can use the HTML DTD[1] to resolve them
> and be more friendly with OSIS writers.
>
The problem with using entities that are not allowed in OSIS is that one
cannot validate against the OSIS schema. And because OSIS is not HTML,
one cannot validate against the HTML DTD either.
For osis2mod to handle character entities other than the four
mentioned above means that it cannot expect valid OSIS.
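
For comparison, the lenient approach suggested in the quoted text could look roughly like the sketch below. It uses Python's html.entities.name2codepoint table, which covers the HTML 4.01 entity sets, as a stand-in for a table generated from the HTML DTD. The function name resolve_html_named is hypothetical, and this is not the code referred to elsewhere in this thread. Input that only works through such a lookup is, as noted above, not valid OSIS.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 named entities

NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Entities predefined by XML itself; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def resolve_html_named(text):
    """Replace HTML named entities with their characters.

    XML's predefined entities and unknown names are left as they are.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return NAMED.sub(replace, text)


# &nbsp; and &eacute; are resolved; &amp; and the unknown name are kept.
print(resolve_html_named("a&nbsp;b&eacute;&amp;&bogus;"))
```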
>
> [1] See these URLs; from them, a Perl program can produce a .cc or .h file.
> http://www.w3.org/TR/html4/HTMLlat1.ent
> http://www.w3.org/TR/html4/HTMLsymbol.ent
> http://www.w3.org/TR/html4/HTMLspecial.ent
>
The code I provided handles many more than just these character entities.
>
> (Sorry if my message look rude, I'm not native english speaker)
>
I didn't take your response as rude. I appreciate your input. I think
our goals are the same: to produce the highest quality modules
while minimizing the effort to do so.
All for God's glory.
In Him,
DM