[sword-devel] usfm2osis.pl
Chris Little
chrislit at crosswire.org
Mon Jul 9 00:13:13 MST 2012
On 7/8/2012 10:43 PM, Greg Hellings wrote:
> Guys,
>
> Was just running usfm2osis.pl across some files that my Aunt and Uncle
> have given me to convert for the language they're working with through
> Wycliffe. It ran great, saw no problems with it. When I tried to run
> title_cleanup.pl across the output it revealed a minor issue... the
> language they have used appears to use the "French style" of quotation
> mark, but it is marked up in the SFM text as "<<" and ">>". A pair of
> ASCII angle characters. This causes title_cleanup.pl, which is
> expecting good XML, to puke on parsing the file. Of course, it would
> also cause osis2mod to puke when I get to that stage.
>
> Obviously this is an encoding issue in the source file, but I thought
> I should mention that this is also a bug/shortcoming of usfm2osis.pl.
> If it is supposed to be outputting well-formed XML then it should
> encode the plain text to escape such characters with their proper XML
> entity representations. Is there anyone who wants to look into that,
> or do I need to roll up my Perl sleeves and get dirty?
Handling of <</>>-style SFM quotation marks was formerly part of
usfm2osis.pl, but it has been commented out. Angle brackets do not
necessarily encode French-style chevrons (guillemets) for quotation
marks: many SFM files also used them to encode curly quotes, as used
in English typography.
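To make the ambiguity concrete, here is a minimal Python sketch. It is
not the commented-out Perl from usfm2osis.pl; the function name, the
`style` parameter, and the mapping tables are all hypothetical. It only
shows that the same <</>> input yields different characters depending
on which convention the source project followed:

```python
# Hypothetical mapping tables; neither is taken from usfm2osis.pl.
QUOTE_STYLES = {
    "guillemets": {"<<": "\u00ab", ">>": "\u00bb"},  # French-style chevrons
    "curly": {"<<": "\u201c", ">>": "\u201d"},       # English curly quotes
}


def convert_sfm_quotes(text, style):
    """Replace ASCII << and >> with the quotation marks for the given style."""
    for ascii_form, unicode_form in QUOTE_STYLES[style].items():
        text = text.replace(ascii_form, unicode_form)
    return text


line = "<<Come and see,>> he said."
print(convert_sfm_quotes(line, "guillemets"))
print(convert_sfm_quotes(line, "curly"))
```

Nothing in the text itself says which style is correct, so a converter
would have to be told, or guess.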
I don't think I've ever seen angle brackets in a USFM file that were
supposed to be present. The example you cite is SFM, not USFM, and SFM
is something we obviously can't reliably support. The fact that we do
not handle angle brackets helps to identify encoding errors in the
text. The alternative would be to convert them to XML escapes and pass
the mis-encoded characters on to the OSIS document, where they would
probably go unnoticed.
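The two options can be sketched side by side in Python. This is an
illustration under assumptions, not what usfm2osis.pl does:
`find_stray_angle_brackets` is a hypothetical helper and the sample
text is made up. `escape` is the standard-library function from
`xml.sax.saxutils`.

```python
import re
from xml.sax.saxutils import escape


def find_stray_angle_brackets(text):
    """Return (line_number, line) pairs for lines containing < or >."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if re.search(r"[<>]", line)
    ]


# Made-up sample input, not taken from any real translation file.
source = "\\v 1 He said, <<Follow me.>>\n\\v 2 And they followed him."

# Option 1: report the suspect lines so the source encoding can be fixed.
for number, line in find_stray_angle_brackets(source):
    print(f"line {number}: {line}")

# Option 2: escape them. The result is safe to embed in XML, but the
# mis-encoded quotation marks are still there, just harder to notice.
print(escape(source))
```

With the second option the reader would eventually see literal << and
>> in the rendered text rather than proper quotation marks.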
So, all things considered, I think it's a good thing that the output of
usfm2osis.pl caused later utilities to choke, thereby signaling that you
need to correct the character encoding problem in the source.
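For anyone who wants to catch this before running later tools, a
well-formedness check is enough. The sketch below uses Python's
standard `xml.etree.ElementTree`; the `is_well_formed` helper and the
one-element fragments are hypothetical and much simpler than real OSIS
output.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Raw << in text content is not legal XML, so parsing fails.
print(is_well_formed("<p>He said, <<Follow me.>></p>"))

# The escaped form parses, even though the quotes are still mis-encoded.
print(is_well_formed("<p>He said, &lt;&lt;Follow me.&gt;&gt;</p>"))
```

It is the raw `<` that trips the parser; a bare `>` in text content is
normally tolerated.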
--Chris