[sword-devel] usfm2osis.pl

Chris Little chrislit at crosswire.org
Mon Jul 9 00:13:13 MST 2012


On 7/8/2012 10:43 PM, Greg Hellings wrote:
> Guys,
>
> Was just running usfm2osis.pl across some files that my Aunt and Uncle
> have given me to convert for the language they're working with through
> Wycliffe. It ran great, saw no problems with it. When I tried to run
> title_cleanup.pl across the output it revealed a minor issue... the
> language they have used appears to use the "French style" of quotation
> mark, but it is marked up in the SFM text as "<<" and ">>". A pair of
> ASCII angle characters. This causes title_cleanup.pl, which is
> expecting good XML, to puke on parsing the file. Of course, it would
> also cause osis2mod to puke when I get to that stage.
>
> Obviously this is an encoding issue in the source file, but I thought
> I should mention that this is also a bug/shortcoming of usfm2osis.pl.
> If it is supposed to be outputting well-formed XML then it should
> encode the plain text to escape such characters with their proper XML
> entity representations. Is there anyone who wants to look into that,
> or do I need to roll up my Perl sleeves and get dirty?

Handling of <</>>-style SFM quotation marks was formerly part of
usfm2osis.pl, but has been commented out. Angle brackets do not
necessarily encode French-style chevrons (guillemets) for quotation
marks: many SFM files also used them to encode curly quotes, as used in
English typography.
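
To make the ambiguity concrete, here is a minimal Python sketch (not the
commented-out Perl from usfm2osis.pl). The function name, the style
names, and both mapping tables are hypothetical, and the single-bracket
entries are an assumption about nested quotes. The same ASCII input
yields different text depending on which convention the translators
had in mind, and the converter has no way to know which one applies.

```python
# Hypothetical mapping tables; neither is taken from usfm2osis.pl.
QUOTE_STYLES = {
    # French-style chevrons (guillemets)
    "chevrons": {"<<": "\u00ab", ">>": "\u00bb", "<": "\u2039", ">": "\u203a"},
    # English-style curly quotes
    "curly": {"<<": "\u201c", ">>": "\u201d", "<": "\u2018", ">": "\u2019"},
}


def convert_sfm_quotes(text, style):
    """Replace ASCII angle-bracket quotes using the chosen convention."""
    mapping = QUOTE_STYLES[style]
    # Doubled brackets go first so "<<" is not consumed as two "<".
    for ascii_form in ("<<", ">>", "<", ">"):
        text = text.replace(ascii_form, mapping[ascii_form])
    return text
```

Calling convert_sfm_quotes() on the same verse with "chevrons" and with
"curly" gives two different, equally plausible results, which is why
the choice has to be made by someone who knows the source text.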

I don't think I've ever seen angle-brackets in a USFM file that were 
supposed to be present. The example you cite is SFM, which we obviously 
can't reliably support. The fact that we do not handle angle-brackets 
helps to identify encoding errors in the text. The alternative would be 
to convert them to XML escapes and pass the mis-encoded characters on to 
the OSIS document, where they would probably go unnoticed.
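
The trade-off can be sketched in Python with the standard library. This
is an illustration only: the <verse> wrapper and the sample text are
made up, and it does not reproduce what usfm2osis.pl or
title_cleanup.pl actually do.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Made-up verse text using ASCII angle brackets as quotation marks.
verse_text = "He said, <<Come.>>"

# Passed through untouched: a parser rejects it, so the problem is noticed.
raw_xml = "<verse>" + verse_text + "</verse>"

# Escaped: the XML is well-formed, but the reader still gets "<<" and ">>".
escaped_xml = "<verse>" + escape(verse_text) + "</verse>"
```

Here is_well_formed(raw_xml) is False while is_well_formed(escaped_xml)
is True, yet the parsed text of the escaped version is still the
mis-encoded "He said, <<Come.>>".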

So, all things considered, I think it's a good thing that the output of 
usfm2osis.pl caused later utilities to choke, thereby signaling that you 
need to correct the character encoding problem in the source.
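
For locating the characters to fix in the source, a small pre-flight
check along these lines might help. It is a hedged sketch, not part of
the SWORD tools: the function name is hypothetical, and it assumes the
SFM file has already been read into a list of lines.

```python
import re

# One or more consecutive ASCII angle brackets.
ANGLE_RUN = re.compile(r"[<>]+")


def find_angle_brackets(lines):
    """Return (line number, column, matched text) for each angle-bracket run.

    Line numbers and columns are 1-based.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in ANGLE_RUN.finditer(line):
            hits.append((lineno, match.start() + 1, match.group()))
    return hits
```

Running this over the source before conversion lists every place where
a decision about the intended quotation mark is still needed.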

--Chris


