[jsword-devel] Passage List Parsing

Fri Aug 29 13:39:24 MST 2008

This is in response to SWORD and JSword's inability to handle a '-' in
Bible book names. Here I'll outline the challenges and how JSword  
handles them.

The design challenges:
First the challenges of identifying a single reference (e.g. Rev. 2:3)
1) Arbitrary user input. While references in SWORD modules (known as  
Books in JSword lingo) can be normalized, they have not been. Users  
enter in all kinds of variations. Bible book names vary widely in  
their representation and in their abbreviations. Some examples:
1.1) Some books are known by several names: Song of Songs, Canticles,  
Songs, Song of Solomon
1.2) Some books are abbreviated in various fashions. e.g. Pt, Ptr  
(Peter), Jn, Jhn (John), Phlm (Philemon)
1.3) Some book names are more than one word: Revelation of John
1.4) Some books are part of a series: 1 John, 2 John, 3 John
1.5) Series may be represented in a variety of fashions: digits [0-9],
Roman numerals (i, ii, iii)
1.6) Sometimes punctuation may be used for series and abbreviations,
e.g. 1. Moses (aka Genesis), Rev. 1:1
1.7) Punctuation separates chapter from verse, but might vary. E.g. Rev
1:1 and Rev 1.1
1.8) SWORD defines chapter 0 for each book and verse 0 for each
chapter as introduction to the book and chapter, respectively. Most
SWORD modules don't have introductory material.
1.9) Typos and misspellings
1.10) An upcoming design change is that the verse, when in the form of
an osisID (e.g. Gen.1.1), can be prefixed with a workID. The workID can
be dotted (e.g. Bible.KJV) and is followed by a ':'. For SWORD
modules, the dotted structure represents a hierarchy from general to
specific and may include the type of work, the versification of the
work and, as the last part (or the only part), the initials of the module.
(That is what is found in the module's conf between the [].)
1.11) Books with only one chapter do not use chapter numbers. So 3 Jn
5 is verse 5 of the only chapter of 3 John.
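
To make the series variations above concrete, here is a minimal sketch of normalizing a leading series marker before any book lookup. SeriesPrefix and normalize are hypothetical names for illustration; nothing like this is claimed to exist in JSword, and it assumes a series never goes beyond three.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not JSword code): rewrites a leading series marker,
// given as a digit or Roman numeral with an optional '.', to a plain digit.
final class SeriesPrefix {
    // Group 1 is the series marker, group 2 the rest of the book name.
    private static final Pattern LEADING = Pattern.compile(
            "^\\s*(iii|ii|i|[1-3])\\.?\\s+(\\p{L}.*)$", Pattern.CASE_INSENSITIVE);

    static String normalize(String name) {
        Matcher m = LEADING.matcher(name);
        if (!m.matches()) {
            return name.trim(); // no series marker, e.g. "Isaiah"
        }
        String marker = m.group(1).toLowerCase(Locale.ROOT);
        String number;
        switch (marker) {
            case "i":   number = "1"; break;
            case "ii":  number = "2"; break;
            case "iii": number = "3"; break;
            default:    number = marker; // already a digit
        }
        return number + " " + m.group(2).trim();
    }
}
```

With this, "II Peter" and "1. Moses" come out as "2 Peter" and "1 Moses", while "Isaiah" is left alone because the marker must be followed by white space.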

2) Scope
2.1) Book, Chapter and Verse (BCV) - This is the most obvious. For  
example: Rev 1:1
2.2) Book and Chapter (BC) - One can refer to a whole chapter (not  
sure if it should start at C:0 or C:1).
2.3) Book (B) alone - Or a book by itself. (Not sure if it should  
start at 0:0 or 1:1)

3) Ambiguity - While it is always possible to have unambiguous  
references (e.g. osisIDs and osisRefs), arbitrary user input can cause  
ambiguity.
3.1) Ambiguous abbreviations, e.g. Jud (Judges or Jude), Jo (Job,  
Jonah or John)
3.2) Context - When in a particular book, 1:1 refers to that book.  
When viewing a chapter, 3 refers to a particular verse in that  
chapter. There are a few cases where "v. 3" means verse 3 in the  
current chapter and "c. 5" means chapter 5 in the same book.

4) Internationalization (i18n)
4.1) Always need to handle English book names, as these are used in
modules as cross references.
4.2) Need to handle book names in the locale of the user.
4.3) All the issues of 1) apply here as well, but with further issues:
4.3.1) A different number system might be used. This may be for book  
series numbers and for chapter and verse numbers.
4.3.2) Multiple number systems might be used.
4.4) Bible book names may include punctuation, such as, '-', the dash,  
as in Indonesian. Ultimately, there can be no assumptions about what  
characters can be in a book name.
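
As an illustration of 4.3.1, here is a small sketch of reading a chapter or verse number written in another number system. LocalizedNumbers is a hypothetical helper, not part of JSword; it relies only on Character.digit, which knows the decimal digits of every Unicode script, and it ignores overflow.

```java
// Hypothetical helper (not JSword code): parses a run of decimal digits
// from any Unicode number system (e.g. Arabic-Indic, Devanagari).
final class LocalizedNumbers {
    static int parse(String digits) {
        if (digits.isEmpty()) {
            throw new NumberFormatException("empty number");
        }
        int value = 0;
        for (int i = 0; i < digits.length(); ) {
            int cp = digits.codePointAt(i);
            int d = Character.digit(cp, 10); // -1 when cp is not a decimal digit
            if (d < 0) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + d; // overflow is not checked in this sketch
            i += Character.charCount(cp);
        }
        return value;
    }
}
```

Mixed input such as an ASCII digit followed by a Devanagari digit also parses, which covers 4.3.2 to the extent that the digits are plain decimal digits.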

Second the challenges of defining a passage, that is a contiguous  
range of verses:
1) There needs to be an indicator that a range is needed. For example,
'-', the dash. (Note the ambiguity with using a '-' within a book
name. I'll use '-', the dash, below as the separator.)
2) Context - The reference before the dash is the context of what
follows. The context might be B, BC or BCV
2.1) B - as a range context is understood as the start of the book.  
Since SWORD defines chapter 0 and verse 0, it is not clear to me  
whether this refers to 0:0 or 1:1.
2.2) BC - this context is the start of the chapter. (C:0 or C:1)
2.3) BCV - explicit full reference
3) End reference - After the dash the reference identifies the last  
verse in the range. This can be BCV, CV, V, C, BC, and B.
3.1) B - includes the entire book to the last verse of the last chapter
3.2) BC - includes the entire chapter of the specified book, to its
last verse.
3.3) BCV - includes the specified verse
3.4) CV - includes the specified chapter and verse in the same book as  
the context.
3.5) V and C - these are a bit ambiguous. It all depends on the  
context. If the context explicitly includes the verse (i.e. BCV), then  
it is V, otherwise it is C.
3.6) Since the book name can start with a number, it is necessary to  
look ahead: e.g. John 1:1 - 3 John 5.
3.7) ff - The suffix ff means "through the last verse of the chapter".
One should be able to say John 3:15ff to refer to John 3:15 - 36. Note
that this does not use the dash.
3.8) Order - A range should go from an earlier verse to a later verse.
3.9) Alternate Versification (v11n) - Currently SWORD and JSword are  
limited to the KJV versification, but some versifications have a  
different order of books. Work is under way to handle alternate
versification and JSword will follow SWORD in providing an  
implementation for this. When we get there, it will be important to  
know which versification is used.
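
Rule 3.5 for the end of a range can be sketched as follows. RangeEnd, UNSET and the lastVerseOf callback are all made up for the illustration; the callback stands in for whatever versification data is available, and treating a bare chapter as running to its last verse is my assumption, by analogy with 3.2.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical helper (not JSword code) for a bare number after the dash.
final class RangeEnd {
    static final int UNSET = -1; // the start of the range named no verse

    // Returns {chapter, verse} of the last verse in the range.
    // lastVerseOf maps a chapter number to its last verse number.
    static int[] resolveBareNumber(int startChapter, int startVerse,
                                   int number, IntUnaryOperator lastVerseOf) {
        if (startVerse != UNSET) {
            // Context is BCV: the number is a verse in the same chapter.
            return new int[] { startChapter, number };
        }
        // Context is B or BC: the number is a chapter, taken to its last verse.
        return new int[] { number, lastVerseOf.applyAsInt(number) };
    }
}
```

So John 3:15 - 17 ends at 3:17, while John 3 - 5 ends at the last verse of chapter 5.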

Third the challenges of defining a list of verses and passages.
1) Individual verses and ranges can be in a delimited list. White  
space may suffice for a delimiter, but when one is present it has to  
be either expected or unambiguous. That is, it cannot be the range  
delimiter. The typical choice is the comma. Other punctuation might  
work as well. (Below, I'll use the comma)
2) The list does not have to be ordered, but the second and subsequent
list entries have the prior entry's endpoint as their context.
3) Interpreting the start:
3.1) B - The entire or start of a book, depending on whether it is  
part of a range.
3.2) BC - The entire or start of a chapter of the book, depending on  
whether it is part of a range.
3.3) BCV - The specific verse
3.4) CV - The chapter and verse of the context's book
3.5) V and C - Just like the range end, this is context dependent, but  
with a twist. If it can be understood as a V it should. Otherwise as a  
C. For example, Jn 3:15-17, 19 should be understood as Jn 3:19. Jn  
3:15-20, 19 should be understood as John 19 because 20 >= 19 and 19 is  
a valid chapter.
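
The "verse if possible, otherwise chapter" rule of 3.5 can be written down as a small decision function. ListEntry and its parameters are hypothetical; the verse and chapter counts would come from the versification, and what to do when neither reading works is left as INVALID because the rule above does not say.

```java
// Hypothetical helper (not JSword code) for a bare number in a list entry.
final class ListEntry {
    enum Kind { VERSE, CHAPTER, INVALID }

    // prevVerse: the verse at which the previous list entry ended.
    // lastVerseInChapter: last verse of the chapter the previous entry ended in.
    // chapterCount: number of chapters in the context's book.
    static Kind classify(int number, int prevVerse,
                         int lastVerseInChapter, int chapterCount) {
        // Prefer a verse: it must come after the context and exist in the chapter.
        if (number > prevVerse && number <= lastVerseInChapter) {
            return Kind.VERSE;
        }
        // Otherwise fall back to a chapter of the same book, if there is one.
        if (number >= 1 && number <= chapterCount) {
            return Kind.CHAPTER;
        }
        return Kind.INVALID;
    }
}
```

With John 3 having 36 verses and John having 21 chapters, classify(19, 17, 36, 21) is VERSE (Jn 3:15-17, 19) and classify(19, 20, 36, 21) is CHAPTER (Jn 3:15-20, 19).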

I think this is a reasonably complete discussion of properly parsing  
passages.

A while ago, I scraped all the cross references in all modules into a  
file and tested JSword and SWORD against it. Some entries are bad and  
JSword and SWORD did their own thing (garbage in, garbage out). Both  
came up with the same results. If anyone is interested, the file is  
here: http://www.crosswire.org/~dmsmith

Now as to how JSword handles passage list parsing. You will see it has  
some shortcomings. Hopefully, I'll be able to comment on them.
The entry point for converting a string to a passage is
o.c.j.passage.AbstractPassage.addVerses(String refs).

This splits refs on comma, semi-colon, tab and line breaks
(AbstractPassage.REF_ALLOWED_DELIMS) and treats each piece as a potential
VerseRange. For simplicity, JSword defines a VerseRange as one or more
adjacent verses that can be parsed by  
o.c.j.passage.VerseRangeFactory.fromString(String ref).

(Aside: The purpose of VerseRangeFactory is to allow for plugging in a  
different implementation of VerseRange parsing, but we did not use the  
plugin model here and instead hard-coded the implementation. While it  
would be trivial to do, I think an entirely different mechanism is  
needed to overcome the problems in the current implementation.)

Having split the refs string into potential VerseRange strings, the
first call to fromString does not have a context (called basis in the
code). SWORD uses Gen 1:1; JSword figures it should be an error if
there is no Bible book name in the first reference. The first
VerseRange is the context for the second. The second is the context
for the third and so forth. As it gets VerseRanges they are added to  
the AbstractPassage. The actual derived class figures out how to store  
them.

The potential VerseRange is split on VerseRange.RANGE_ALLOWED_DELIMS,  
which is hard coded to '-'. When there is no '-' then the start and  
the end of the VerseRange are the same.
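
The two-level split described above can be sketched like this. It is not the JSword code: NaiveSplit is a made-up class, and LIST_DELIMS and RANGE_DELIM are stand-ins whose values I have assumed from the description rather than copied from REF_ALLOWED_DELIMS and RANGE_ALLOWED_DELIMS.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: split a list on delimiters, then each entry on '-'.
final class NaiveSplit {
    // Assumed stand-ins for the JSword constants named in the text.
    private static final String LIST_DELIMS = "[,;\\t\\r\\n]+";
    private static final char RANGE_DELIM = '-';

    // Returns one {start, end} pair of strings per list entry.
    static List<String[]> split(String refs) {
        List<String[]> ranges = new ArrayList<>();
        for (String entry : refs.split(LIST_DELIMS)) {
            String ref = entry.trim();
            if (ref.isEmpty()) {
                continue; // leading or doubled delimiter
            }
            int dash = ref.indexOf(RANGE_DELIM);
            if (dash < 0) {
                ranges.add(new String[] { ref, ref }); // start and end are the same
            } else {
                ranges.add(new String[] {
                    ref.substring(0, dash).trim(),
                    ref.substring(dash + 1).trim() });
            }
        }
        return ranges;
    }
}
```

"Jn 3:15-17, 19" splits into {"Jn 3:15", "17"} and {"19", "19"} as intended, but "Hakim-hakim 1:1" (Judges in Indonesian) splits into {"Hakim", "hakim 1:1"}, which is exactly the problem with a '-' in a book name.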

The parsing engine is o.c.j.passage.AccuracyType. The basic  
responsibility of AccuracyType is to determine what a string reference  
is given its context (or basis). AccuracyType.tokenize(String ref)
parses the reference into parts on digit boundaries. There is an  
undocumented assumption in the code that book names do not end in  
numbers. This pertains to books like 3 John, and it may be that this  
does not work for other languages which might call it the equivalent  
of John 3. This routine will return an array of 1 to 3 strings. The
other assumption here is that Character.isLetter(char) and
Character.isDigit(char) are sufficient to determine the parts.
Implicitly, if a character is neither isLetter nor isDigit, it is
treated as if it were a space and ignored. This probably is a problem with
internationalized book names. Note: This routine does not try to  
actually identify the book name.
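
For readers who have not seen the code, here is a simplified sketch of tokenizing on digit boundaries with the same two assumptions. It is my own reconstruction for illustration, not AccuracyType.tokenize: DigitBoundaryTokenizer is a made-up name, and unlike the real routine it does not cap the result at 3 strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: splits a reference into a book part and number parts.
final class DigitBoundaryTokenizer {
    static List<String> tokenize(String ref) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean runIsDigits = false;
        for (int i = 0; i <= ref.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last run.
            char ch = i < ref.length() ? ref.charAt(i) : ' ';
            boolean digit = Character.isDigit(ch);
            boolean letter = Character.isLetter(ch);
            // Flush on any other character or on a letter/digit boundary.
            if (run.length() > 0 && (!(digit || letter) || digit != runIsDigits)) {
                add(tokens, run.toString(), runIsDigits);
                run.setLength(0);
            }
            if (digit || letter) {
                run.append(ch);
                runIsDigits = digit;
            }
        }
        return tokens;
    }

    private static void add(List<String> tokens, String run, boolean digits) {
        // While only the book token exists, letters extend it. This covers
        // "Song of Songs" and a leading series number as in "3 John".
        if (!digits && tokens.size() == 1) {
            tokens.set(0, tokens.get(0) + " " + run);
        } else {
            tokens.add(run);
        }
    }
}
```

"Rev 1:1" gives [Rev, 1, 1] and "3 John 5" gives [3 John, 5]. "Hakim-hakim 1:1" gives [Hakim hakim, 1, 1]: the dash is silently turned into a space, so the book name no longer matches what the user typed.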

Once tokenized, the parts are used by AccuracyType to determine how
to interpret them against the basis. The interpretation is itself
an AccuracyType, which is able to generate a start or end
verse from the parts and its context. This is done twice, once for the  
first part of the range using the passed in VerseRange as the basis  
for the start and once for the second part of the range using the  
first verse of the range as the basis.

Some of the flaws:
1) Uses hard-coded delimiters for the verse list and for the ranges.  
This might be OK, but splitting on them arbitrarily is not OK.
2) The code looks up the book several times. This is an expensive
operation; it should only be done once, if at all possible.
3) The code handles osisRefs as a fall back case. These use spaces to  
delimit verse references. On failure, spaces are replaced with commas  
and reparsed. As we go more and more to osisRefs and osisIDs in SWORD  
modules, this should be the norm, not the exception.
4) I think it would be better to have a streaming tokenization that  
normalizes book names as they are found.

Some of the other bugs:
1) Does not handle verse 0.
2) Does not handle 5ff properly. This is taken as 5, ff and not 5-ff.
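
One possible way to deal with ff is to rewrite it into an ordinary range before parsing. This is only a sketch of the idea, not a proposed patch: FollowingVerses is a made-up name, and the caller is assumed to already know the last verse of the chapter in question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: turns a trailing "ff" into an explicit range end.
final class FollowingVerses {
    // Text ending in a digit, then optional space, "ff" and an optional '.'.
    private static final Pattern FF = Pattern.compile("^(.*\\d)\\s*ff\\.?$");

    // lastVerse: last verse of the chapter that ref points into.
    static String expand(String ref, int lastVerse) {
        Matcher m = FF.matcher(ref.trim());
        return m.matches() ? m.group(1) + "-" + lastVerse : ref;
    }
}
```

expand("John 3:15ff", 36) gives "John 3:15-36", and a reference without ff is returned unchanged.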

Consider a reference that starts with a book name. As we gather text  
into what might be a book name we could determine which book name it  
could be. We have a limited catalog of names and abbreviations. Given  
this universe, there are only so many start characters. If we see one  
of these, then we only need to consider those words. The second letter  
narrows it further. At some point, some words in our universe are
complete. These are candidates. If there is more input that can match, we
continue. When we are done, we are left with the candidates, which may
need to be disambiguated. At this point we have a valid, matched book
name, or an error/recovery condition. If the names were built into a
Trie and one walked down it as given above, I think that would work.
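
A minimal sketch of that Trie walk, to show the shape of the idea. BookNameTrie, add and longestMatch are names I made up; a real version would load the full catalog of names and abbreviations per locale and would need a policy for disambiguating the candidates.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: a character trie over known names and abbreviations.
final class BookNameTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        // Books for which the path to this node is a complete name.
        final Set<String> books = new TreeSet<>();
    }

    private final Node root = new Node();

    void add(String nameOrAbbreviation, String book) {
        Node node = root;
        for (char ch : nameOrAbbreviation.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.next.computeIfAbsent(ch, k -> new Node());
        }
        node.books.add(book);
    }

    // Walks the input from its start and returns the candidate books for the
    // longest complete name found, or an empty set if nothing matched.
    Set<String> longestMatch(String input) {
        Set<String> candidates = new TreeSet<>();
        Node node = root;
        for (char ch : input.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.next.get(ch);
            if (node == null) {
                break; // no known name continues with this character
            }
            if (!node.books.isEmpty()) {
                candidates = node.books; // a longer complete name wins
            }
        }
        return new TreeSet<>(candidates);
    }
}
```

After adding "jud" for both Judges and Jude, plus "judges", "jude" and "hakim-hakim", the input "Jud 1:1" yields both candidates, "Jude 5" yields only Jude, and "Hakim-hakim 1:1" yields Judges. The dash is just another character on the path, so no assumption about what may appear in a book name is needed.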

In Christ,
	DM
