[sword-devel] USFM2OSIS

Mike Hart just_mike_y at yahoo.com
Fri Dec 13 07:12:53 MST 2013

I typed that first message out very quickly, and it is somewhat flawed, in case anyone uses this info for a permanent fix. The USFM manual does indicate the presence of a space after the 3-letter code, so

^\\id (...) ([^\\]+)

is a better interpretation of the spec. 

The spec does not explicitly require 3 characters either; it only has a table in which every entry is 3 characters long. Restricting the code to a character class would be better than blind dots.
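
To make the two readings of the \id line concrete, here is a minimal Python sketch. It assumes book codes are exactly three uppercase letters or digits (true of codes such as 1TH, but an assumption, since the spec only gives a table). The names ID_TO_EOL, ID_TO_MARKER and parse_id are made up for illustration and are not taken from usfm2osis.py.

```python
import re

# Assumption: a book code is exactly 3 uppercase letters/digits (e.g. GEN, 1TH).
BOOK_CODE = r"([A-Z0-9]{3})(?=\s|$)"

# Newline-terminated reading: the comment runs to the end of the line.
ID_TO_EOL = re.compile(r"^\\id " + BOOK_CODE + r"(?: ([^\n]*))?$", re.MULTILINE)

# Marker-terminated reading: the comment stops at the next backslash,
# and the separator after the book code may be any whitespace.
ID_TO_MARKER = re.compile(r"^\\id " + BOOK_CODE + r"(?:\s+([^\\]*))?", re.MULTILINE)


def parse_id(text, stop_at_marker=True):
    """Return (book_code, comment) for the first \\id line, or None."""
    pattern = ID_TO_MARKER if stop_at_marker else ID_TO_EOL
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1), (match.group(2) or "").strip()


sample = (
    "\\id 1TH My test version \\mt2 The first letter of Paul to the\n"
    "\\mt1 Corinthians\n"
)
print(parse_id(sample))                        # marker-terminated reading
print(parse_id(sample, stop_at_marker=False))  # newline-terminated reading
```

With the marker-terminated pattern the \mt2 title is left for the next stage to handle; with the newline-terminated pattern it is swallowed into the \id comment.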

On Tuesday, December 10, 2013 4:33 PM, Kahunapule Michael Johnson <Kahunapule at eBible.org> wrote:
Unfortunately, the USFM reference is ambiguous on this point. Although the examples for \id and \rem both show the information following the marker all on one line, there is nowhere in the standard that I can find that says that either marker's data is terminated by a newline. This is actually a problem in the case of a \rem comment, limiting the places it can be used safely.

The master reference implementation of USFM disambiguates this by interpreting \rem data to terminate at the next paragraph-style marker. It actually doesn't even terminate at a verse marker. The reason it does that, upon examination of the standard usfm.sty, is that Paratext regards both \id and \rem to be paragraph styles. Thus \rem can only safely be used between paragraphs, where it is easy enough to know exactly what to disregard for presentation purposes. Everything following the 3-character book code on the \id line is a comment, too, just like what follows \rem_. The space after the book code might be a newline, and Paratext is happy with

Normally, Paratext formats USFM outputs such that \rem and \id data is all on one line, and all paragraph style markers start on a new line, but apparently there is a recent version of Paratext that doesn't always do that. (I haven't noticed that, but someone on the Paratext Supporters forum has.) It seems that there is nothing actually written explicitly in the USFM standard that would require that.
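
One way to cope with files where markers have popped onto the end of the previous line is to normalize them before conversion, so that every paragraph-style marker starts a new line. The Python sketch below assumes a small, hypothetical subset of paragraph-style markers (the real list would have to come from usfm.sty), and normalize_usfm is an illustrative name, not an existing tool. It adopts the marker-terminated reading described above, which is exactly the point under dispute in this thread.

```python
import re

# Hypothetical subset of paragraph-style markers, for illustration only;
# the authoritative list is whatever usfm.sty defines as paragraph styles.
PARAGRAPH_MARKERS = ("id", "rem", "mt", "c", "s", "p", "m", "q")

# Optional whitespace, then a paragraph marker with an optional level digit,
# followed by whitespace or end of text (so \sc or \pi are not matched).
_MARKER_RE = re.compile(
    r"\s*(\\(?:" + "|".join(PARAGRAPH_MARKERS) + r")\d*)(?=\s|$)"
)


def normalize_usfm(text):
    """Move every recognised paragraph-style marker to the start of a line."""
    text = _MARKER_RE.sub(lambda m: "\n" + m.group(1), text)
    return text.lstrip("\n")


sample = (
    "\\id 1TH My test version \\mt2 The first letter of Paul to the\n"
    "\\mt1 Corinthians\n"
    "\\c 1\n"
    "\\p\n"
    "\\v 1 Hi there, I'm Paul."
)
print(normalize_usfm(sample))
```

Verse markers (\v) are deliberately left where they are, since they are not paragraph styles.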

On 12/10/2013 12:04 PM, Chris Little wrote:
> Per the USFM Reference, in the specific cases of \id and a very small number of other tags like \rem, the tag does end with the newline and it is not appropriate to interpret any following text as a new tag. It's not that usfm2osis.py requires \mt1, \c, or \q1 to start on a new line, it's that everything after \id is part of that tag's unformatted & unstructured data.
> The specific language on the note for these tags is: "The text following this marker is not normally used in any formatted presentation." They all have the potential for containing a bunch of unstructured data and are approximately on the level of comments. USFM comments (\rem) have exactly the same note about unformatted data continuing to the end of the line, and the example markup for \rem in the USFM reference makes clear that \rem continues to the next newline and not the next USFM marker, since two lines of comment require two distinct \rem's.
> If usfm2osis.py has a problem interpreting data in other cases, I might be willing to fix it, but in this case, I am not. usfm2osis.py is behaving correctly. Paratext has a bug and is generating markup in violation of the USFM spec. If UBS wants to declare that the markup being generated from Paratext is correct, then the USFM Reference needs to be corrected. That is all to say that I will not modify usfm2osis.py to work around Paratext bugs that generate out of spec markup.
> --Chris
> On 12/10/2013 11:11 AM, Mike Hart wrote:
>> FYI -- item came up on another mailing list.
>> It appears that recently USFM tagging completely ignores the return
>> character in many places, and validates only on the start of another tag.
>> That is, USFM2OSIS apparently considers something like (regex)
>> \\id (...)(.+$)
>> to be the ID field; while Paratext USFM now considers something more like
>> \\id (...)([^\\]+)
>> to be the ID field.
>> ( \1 = machine readable Bible book ID for import, \2= Optional human
>> readable text explaining what the file is.)
>> Further discussion describes this 'ignore-return' trend as appearing
>> around other tags as well, with chapters starting without a return after
>> the end of the last verse....
>> Robert Hunt wrote:
>> To: Paratext Supporters
>> Tuesday, December 10, 2013 3:43 AM
>> Dear all,
>>      With increasing pressure to get Bibles and even partial Bibles onto
>> mobile devices these days, there is lots of interest in converting from
>> Paratext/USFM files to other formats. Crosswire Bible Society
>> <http://www.crosswire.org/index.jsp> have
>> the Sword Project which has its own binary format for Bible modules
>> which are readable by "front-ends
>> <http://www.crosswire.org/applications.jsp>"
>> on many operating systems, including Windows, Linux, Android, etc.
>> However, the current Crosswire usfm2osis.py converter chokes on the
>> following:
>> \id 1TH My test version \mt2 The first letter of Paul to the
>> \mt1 Corinthians
>> \c 1
>> \s Paul introduces himself
>> \p
>> \v 1 Hi there, I'm Paul.
>> In reading the USFM spec, I can't find confirmation that markers like
>> \mt2 MUST start on a new line. The closest that I can see is:
>> Most paragraph or poetic markers (like \p, \m, \q# etc.) can be followed
>> immediately by
>> a verse number (\v) on a new line.
>> All examples, however, do show these (what I call "newline markers") on
>> new lines.
>> However, I notice that the last few Paratext versions have a tendency to
>> pop some markers and their text up onto the end of the previous line.
>> I'm pretty sure that PT6 didn't do this. I don't think this is an
>> intentional feature, but seems to be either a bug or some kind of weird
>> side-effect. (It happens often enough that I don't think the user can be
>> blamed for it, especially the way \c markers pop onto the previous line,
>> but of course because Paratext usually displays by chapter, the user
>> can't even see that without changing view mode.)
>> So anyway, I have a few questions:
>>  1. Do you agree that these types of markers (\mt2, \c, \q1) should/must
>>     start on a new line?
>>  2. If so, would it be good to make that clear in the USFM standard (or
>>     did I miss something)?
>>  3. Is having these markers pop up to the end of the previous line a
>>     known bug in Paratext?
>>  4. Is there any way in Paratext to automatically fix this in the USFM
>>     files?
>>  5. Does the Pathway code handle files like this better than the
>>     Crosswire converter?
>> Thanks,
>> Robert.
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
