<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 01/02/2016 10:48 AM, Ryan Hiebert
wrote:<br>
</div>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">I'm completely new to USFM, so I'm sketching out my ideas on how the parser probably should look. This is unearthing some of the many things I don't understand about USFM, so I'll post my questions here. Feel free to forward me to a better forum if there is one.
These questions are all related:
1. Is text allowed to be on a line _without_ a marker starting the line?</pre>
</blockquote>
<br>
Yes. Newline and space are equivalent.<br>
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">2. Are blank lines semantically meaningful?</pre>
</blockquote>
<br>
No.<br>
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap=""> That is, if all the blank lines are removed, does the file mean _exactly_ the same thing?</pre>
</blockquote>
<br>
Yes. Two or more consecutive white spaces are the same as one white
space. A white space can be space, tab, or newline.<br>
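<br>
As a rough illustration of that rule, here is a minimal Python sketch that collapses every run of white space into one space, so two files that differ only in blank lines or line breaks compare equal. The function name normalize_whitespace is made up for this example, and treating carriage return as white space is my assumption for files with Windows line endings.<br>

```python
import re


def normalize_whitespace(usfm_text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines into one space."""
    # \r is included as an assumption, to cope with CRLF line endings.
    return re.sub(r"[ \t\r\n]+", " ", usfm_text).strip(" ")


# Same content, once with blank lines and once on a single line.
with_blank_lines = "\\p\n\\v 1 In the beginning\n\n\nGod created"
on_one_line = "\\p \\v 1 In the beginning God created"

print(normalize_whitespace(with_blank_lines) == normalize_whitespace(on_one_line))
```

A real parser would probably fold this into its tokenizer rather than run it as a separate pass.<br>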
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">3. Are the non-text markers (one that don't have the ending form( \usfm* ) required at the beginning of all meaningful lines?</pre>
</blockquote>
No.<br>
<br>
Note that there are FOUR classes of markers, not just two, the way I
parse them (which is regularly tested against Paratext output):<br>
1: Starts at the beginning of a line (normally), and indicates a
paragraph or metadata. Its effects extend until the next such
marker. Examples: \id, \c, \v, \p, \q1.<br>
2: Footnote/cross reference styles, non-nestable, terminated by the
next such style. For historical reasons, these can also be
terminated by an end marker like the next case, so when reading,
allow either syntax. Examples: \fr, \ft or \ft ...\ft*, \fqa.<br>
3: Normal character markers with both beginning and ending markers,
with the end marker the same as the beginning marker but ending with
"*". These may not be nested within each other or within the class
#2 style markers above. Examples: \nd ...\nd*, \wj ...\wj*.<br>
4: Nested character markers start with "\+" and terminate with the
same marker ended by "*". These are otherwise the same as #3, but
cannot occur unless they are inside of a style of case #2 or #3.
Examples: \+nd ...\+nd*, \+wj ...\+wj*<br>
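<br>
The four classes above can be sketched as a small Python classifier. This is only an illustration, not how Haiola or Paratext does it: the marker sets contain just the examples listed above (the real lists are much longer), and the names CLASS_1, CLASS_2, CLASS_3, and classify_marker are hypothetical.<br>

```python
from typing import Tuple

# Illustrative subsets only, taken from the examples in the list above.
CLASS_1 = {"id", "c", "v", "p", "q1"}  # paragraph or metadata markers
CLASS_2 = {"fr", "ft", "fqa"}          # footnote/cross reference styles
CLASS_3 = {"nd", "wj"}                 # character styles with end markers


def classify_marker(token: str) -> Tuple[int, bool]:
    """Return (class number, is_end_marker) for a token such as '\\+nd*'."""
    if not token.startswith("\\"):
        raise ValueError(f"not a marker: {token!r}")
    name = token[1:]
    is_end = name.endswith("*")
    if is_end:
        name = name[:-1]
    # Class 4 is a class 3 name carrying the "+" nesting prefix.
    if name.startswith("+") and name[1:] in CLASS_3:
        return 4, is_end
    if name in CLASS_3:
        return 3, is_end
    # Class 2 accepts an optional end marker, for historical reasons.
    if name in CLASS_2:
        return 2, is_end
    if name in CLASS_1 and not is_end:
        return 1, False
    raise ValueError(f"unknown marker: {token!r}")


for token in ["\\p", "\\ft", "\\ft*", "\\nd*", "\\+wj"]:
    print(token, classify_marker(token))
```

The sketch does not check context, so it will not notice a class #4 marker that appears outside a class #2 or #3 style.<br>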
<br>
The class numbers above aren't in the USFM specification, but the
concepts are both there and in the master reference implementation
of USFM, which is Paratext.<br>
<br>
<br>
Sometimes Paratext produces USFM files where markers of the first
kind can be in other positions than the beginning of a line. When
writing USFM, put them at the beginning of a line. When reading
USFM, be more tolerant.<br>
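<br>
One way to be tolerant when reading is to look for markers anywhere in the text instead of only at line starts. The Python sketch below does that with a regular expression. The pattern is my assumption about what a marker looks like (backslash, optional "+", letters, optional digits, optional "*"), and the sketch trims spaces at the edges of each text run, which a real parser would need to handle more carefully around character styles.<br>

```python
import re
from typing import List, Tuple

# Assumed marker shape: backslash, optional "+", letters, digits, optional "*".
MARKER = re.compile(r"(\\\+?[A-Za-z]+[0-9]*\*?)")


def tokenize(usfm_text: str) -> List[Tuple[str, str]]:
    """Split USFM into ("marker", ...) and ("text", ...) tokens."""
    tokens = []
    # re.split with a capturing group keeps the markers in the result.
    for piece in MARKER.split(usfm_text):
        if MARKER.fullmatch(piece):
            tokens.append(("marker", piece))
            continue
        # Newline and space are equivalent, so collapse runs of white space.
        text = re.sub(r"[ \t\r\n]+", " ", piece).strip(" ")
        if text:
            tokens.append(("text", text))
    return tokens


# Paragraph and verse markers in the middle of a line are still found.
sample = "\\p \\v 1 In the beginning \\p \\v 2 And the earth"
for token in tokenize(sample):
    print(token)
```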
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">
4. Is only one non-text marker allowed per line?</pre>
</blockquote>
No, but when writing USFM, put class #1 markers at the
beginning of a line.<br>
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">
5. Must a non-text marker be only at the beginning of a line?</pre>
</blockquote>
<br>
That is best practice. Always write them there if you are writing,
but allow them elsewhere if you are reading.<br>
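<br>
For the writing side, a minimal Python sketch might start a new line whenever it emits a class #1 marker and keep everything else on the current line. The token format (pairs of kind and value), the set LINE_START_MARKERS, and the function write_usfm are all hypothetical, and the marker set holds only a few examples.<br>

```python
from typing import Iterable, Tuple

# Illustrative subset of class #1 (paragraph or metadata) marker names.
LINE_START_MARKERS = {"id", "c", "v", "p", "q1"}


def write_usfm(tokens: Iterable[Tuple[str, str]]) -> str:
    """Serialize tokens, starting a new line at each class #1 marker."""
    lines = []
    for kind, value in tokens:
        is_marker = kind == "marker"
        starts_line = is_marker and value.lstrip("\\") in LINE_START_MARKERS
        if starts_line or not lines:
            lines.append(value)
        elif is_marker and value.endswith("*"):
            # End markers attach directly to the text they close.
            lines[-1] += value
        else:
            lines[-1] += " " + value
    return "\n".join(lines) + "\n"


tokens = [
    ("marker", "\\p"),
    ("marker", "\\v"),
    ("text", "1 In the beginning"),
    ("marker", "\\v"),
    ("text", "2 And the earth"),
]
print(write_usfm(tokens), end="")
```

The spacing after an end marker is simplified here. A real writer would need to preserve whether a space follows it.<br>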
<br>
<blockquote
cite="mid:17EA0FBA-551D-4474-8BBE-43917D836974@ryanhiebert.com"
type="cite">
<pre wrap="">Thanks for any help you can give with assisting me in sorting this out. I'm obviously completely new to USFM, so I don't know what I don't know.</pre>
</blockquote>
<br>
Take a look at some test cases from <a class="moz-txt-link-freetext" href="http://ebible.org/Scriptures/">http://ebible.org/Scriptures/</a>,
files ending in _usfm.zip. Also, if you want to read some C# code,
you can check out the Haiola source code for how I parse USFM.<br>
<br>
Also, one word of caution: There is no proper way to do a
one-to-one, lossless, round-trippable conversion between USFM and
OSIS.<br>
<br>
-- <br>
<div class="moz-signature">
<p><font color="#000000">Aloha,<br>
<i>Kahunapule Michael Johnson</i></font></p>
</div>
</body>
</html>