<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">This is in response to SWORD's and JSword's inability to handle a '-' in Bible book names. Here I'll outline the challenges and how JSword handles them.<div><br></div><div>The design challenges:</div><div>First, the challenges of identifying a single reference (e.g. Rev. 2:3):</div><div>1) Arbitrary user input. While references in SWORD modules (known as Books in JSword lingo) can be normalized, they have not been. Users enter all kinds of variations. Bible book names vary widely in their representation and in their abbreviations. Some examples:</div><div>1.1) Some books are known by several names: Song of Songs, Canticles, Songs, Song of Solomon</div><div>1.2) Some books are abbreviated in various fashions, e.g. Pt, Ptr (Peter), Jn, Jhn (John), Phlm (Philemon)</div><div>1.2) Some book names are more than one word: Revelation of John</div><div>1.3) Some books are part of a series: 1 John, 2 John, 3 John</div><div>1.4) Series may be represented in a variety of fashions: digits [0-9], Roman numerals (i, ii, iii)</div><div>1.5) Sometimes punctuation may be used for series and abbreviations, e.g. 1. Moses (aka Genesis), Rev. 1:1</div><div>1.6) Punctuation separates chapter from verse, but might vary, e.g. Rev 1:1 and Rev 1.1</div><div>1.7) SWORD defines chapter 0 for each book and verse 0 for each chapter as the introduction to the book and chapter, respectively. Most SWORD modules don't have introductory material.</div><div>1.8) Typos and misspellings</div><div>1.9) An upcoming design change is that the verse, when in the form of an osisID (e.g. Gen.1.1), can be prefixed with a workID. The workID can be dotted (e.g. Bible.KJV) and is followed by a ':'. For SWORD modules, the dotted structure represents a hierarchy from general to specific and may include the type of work, the versification of the work and, as the last part (or the only part), the initials of the module. 
(That is what is found in the module's conf between the [].)</div><div>1.10) Books with only one chapter do not use chapter numbers. So 3 Jn 5 is verse 5 of the only chapter of 3 John.</div><div><br></div><div><br></div><div>2) Scope</div><div>2.1) Book, Chapter and Verse (BCV) - This is the most obvious. For example: Rev 1:1</div><div>2.2) Book and Chapter (BC) - One can refer to a whole chapter (not sure if it should start at C:0 or C:1).</div><div>2.3) Book (B) alone - Or a book by itself (not sure if it should start at 0:0 or 1:1).</div><div><br></div><div>3) Ambiguity - While it is always possible to have unambiguous references (e.g. osisIDs and osisRefs), arbitrary user input can cause ambiguity.</div><div>3.1) Ambiguous abbreviations, e.g. Jud (Judges or Jude), Jo (Job, Jonah or John)</div><div>3.2) Context - When in a particular book, 1:1 refers to that book. When viewing a chapter, 3 refers to a particular verse in that chapter. There are a few cases where "v. 3" means verse 3 in the current chapter and "c.5" means chapter 5 in the same book.</div><div><br></div><div>4) Internationalization (i18n)</div><div>4.1) English input always needs to be handled, as English names are used in modules as cross references.</div><div>4.2) Need to handle book names in the locale of the user.</div><div>4.3) All the issues of 1) apply here as well, but with further issues:</div><div>4.3.1) A different number system might be used. This may be for book series numbers and for chapter and verse numbers.</div><div>4.3.2) Multiple number systems might be used.</div><div>4.4) Bible book names may include punctuation, such as '-', the dash, as in Indonesian. Ultimately, there can be no assumptions about what characters can be in a book name.</div><div><br></div><div>Second, the challenges of defining a passage, that is, a contiguous range of verses:</div><div>1) There needs to be an indicator that a range is needed. For example, '-', the dash. 
(Note the ambiguity with using a '-' within a book name. I'll use '-', the dash, below as the separator.)</div><div>2) Context - The reference before the dash is the context of what follows. The context might be B, BC or BCV.</div><div>2.1) B - as a range context, this is understood as the start of the book. Since SWORD defines chapter 0 and verse 0, it is not clear to me whether this refers to 0:0 or 1:1.</div><div>2.2) BC - this context is the start of the chapter (C:0 or C:1).</div><div>2.3) BCV - explicit full reference</div><div>3) End reference - After the dash, the reference identifies the last verse in the range. This can be BCV, CV, V, C, BC, or B.</div><div>3.1) B - includes the entire book to the last verse of the last chapter</div><div>3.2) BC - includes the entire chapter of the specified book, to its last verse.</div><div>3.3) BCV - includes the specified verse</div><div>3.4) CV - includes the specified chapter and verse in the same book as the context.</div><div>3.5) V and C - these are a bit ambiguous. It all depends on the context. If the context explicitly includes the verse (i.e. BCV), then it is V; otherwise it is C.</div><div>3.6) Since the book name can start with a number, it is necessary to look ahead, e.g. John 1:1 - 3 John 5.</div><div>3.7) ff - The "ff" suffix represents the last verse in the chapter. One should be able to say John 3:15ff to refer to John 3:15 - 36. Note that this does not use the dash.</div><div>3.8) Order - A range should go from an earlier verse to a later verse.</div><div>3.9) Alternate Versification (v11n) - Currently SWORD and JSword are limited to the KJV versification, but some versifications have a different order of books. Work is under way to handle alternate versifications, and JSword will follow SWORD in providing an implementation for this. 
When we get there, it will be important to know which versification is used.</div><div><br></div><div>Third, the challenges of defining a list of verses and passages:</div><div>1) Individual verses and ranges can be in a delimited list. White space may suffice for a delimiter, but when an explicit delimiter is present it has to be either expected or unambiguous. That is, it cannot be the range delimiter. The typical choice is the comma. Other punctuation might work as well. (Below, I'll use the comma.)</div><div>2) The list does not have to be ordered, but the second and subsequent list entries have the prior entry's endpoint as their context.</div><div>3) Interpreting the start:</div><div>3.1) B - The entire or start of a book, depending on whether it is part of a range.</div><div>3.2) BC - The entire or start of a chapter of the book, depending on whether it is part of a range.</div><div>3.3) BCV - The specific verse</div><div>3.4) CV - The chapter and verse of the context's book</div><div>3.5) V and C - Just like the range end, this is context dependent, but with a twist. If it can be understood as a V, it should be; otherwise, it is a C. For example, Jn 3:15-17, 19 should be understood as Jn 3:19. Jn 3:15-20, 19 should be understood as John 19 because 20 >= 19 and 19 is a valid chapter.</div><div><br></div><div>I think this is a reasonably complete discussion of properly parsing passages.</div><div><br></div><div>A while ago, I scraped all the cross references in all modules into a file and tested JSword and SWORD against it. Some entries are bad and JSword and SWORD did their own thing (garbage in, garbage out). Both came up with the same results. If anyone is interested, the file is here: <a href="http://www.crosswire.org/~dmsmith">http://www.crosswire.org/~dmsmith</a></div><div><br></div><div>Now as to how JSword handles passage list parsing. You will see it has some shortcomings. 
Hopefully, I'll be able to comment on them.</div><div>The entry point for converting a string to a passage is o.c.j.passage.AbstractPassage.addVerses(String refs).</div><div><br></div><div>This splits refs on comma, semi-colon, tab and line breaks (AbstractPassage.REF_ALLOWED_DELIMS) and treats each of the resulting pieces as a potential VerseRange. For simplicity, JSword defines a VerseRange as one or more adjacent verses that can be parsed by o.c.j.passage.<span class="Apple-style-span" style="font-family: Monaco; font-size: 11px; ">VerseRangeFactory.fromString(String ref).</span></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">(Aside: The purpose of VerseRangeFactory is to allow for plugging in a different implementation of VerseRange parsing, but we did not use the plugin model here and instead hard-coded the implementation. While making it pluggable would be trivial, I think an entirely different mechanism is needed to overcome the problems in the current implementation.)</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">Once the refs string has been split into potential VerseRange strings, the first call to fromString does not have a context (called basis in the code). (SWORD uses Gen 1:1; JSword figures it should be an error if there is no Bible book name in the first reference.) The first VerseRange is the context for the second, the second is the context for the third, and so forth. As VerseRanges are obtained, they are added to the AbstractPassage. 
The actual derived class figures out how to store them.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">The potential VerseRange is split on VerseRange.RANGE_ALLOWED_DELIMS, which is hard-coded to '-'. When there is no '-', the start and the end of the VerseRange are the same.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">The parsing engine is o.c.j.passage.AccuracyType. The basic responsibility of AccuracyType is to determine what a string reference is, given its context (or basis). AccuracyType.tokenize(String ref) parses the reference into parts on digit boundaries. There is an undocumented assumption in the code that book names do not end in numbers. This pertains to books like 3 John, and it may be that this does not work for other languages which might call it the equivalent of John 3. This routine will return an array of 1 to 3 strings. The other assumption here is that Character.isLetter(char) and Character.isDigit(char) are sufficient to determine the parts. Implicitly, if a character is neither isLetter nor isDigit, it is treated as if it were a space. This is probably a problem with internationalized book names. 
Note: This routine does not try to actually identify the book name.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">Once tokenized, the parts are used by AccuracyType to determine how to interpret them against the basis. The interpretation is expressed as an AccuracyType. This AccuracyType is able to generate a start or end verse from the parts and its context. This is done twice: once for the first part of the range, using the passed-in VerseRange as the basis for the start, and once for the second part of the range, using the first verse of the range as the basis.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">Some of the flaws:</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">1) Uses hard-coded delimiters for the verse list and for the ranges. This might be OK, but splitting on them arbitrarily is not OK.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">2) The code looks up the book several times. This is an expensive operation; it should be done only once, if at all possible.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">3) The code handles osisRefs as a fallback case. These use spaces to delimit verse references. On failure, spaces are replaced with commas and the input is reparsed. 
As we go more and more to osisRefs and osisIDs in SWORD modules, this should be the norm, not the exception.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">4) I think it would be better to have a streaming tokenization that normalizes book names as they are found.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">Some of the other bugs:</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">1) Does not handle verse 0.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">2) Does not handle 5ff properly. This is taken as 5, ff and not 5-ff.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">Consider a reference that starts with a book name. As we gather text into what might be a book name, we could determine which book name it could be. We have a limited catalog of names and abbreviations. Given this universe, there are only so many start characters. If we see one of these, then we only need to consider the words that start with it. The second letter narrows it further. At some point, some words in our universe are complete. These are candidates. If there is more input that can match, we continue. When we are done, we are left with the candidates, which may need to be disambiguated. At this point we have a valid, matched book name, or an error/recovery condition. 
If the names were built into a trie and one walked down it as given above, I think that would work.</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><br></span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;">In Christ,</span></font></div><div><font class="Apple-style-span" face="Monaco" size="3"><span class="Apple-style-span" style="font-size: 11px;"><span class="Apple-tab-span" style="white-space:pre">        </span>DM<br></span></font></div></body></html>