[jsword-devel] Passage List Parsing
dmsmith555 at yahoo.com
Fri Aug 29 13:39:24 MST 2008
This is in response to the inability of SWORD and JSword to handle a
'-' in Bible book names. Here I'll outline the challenges and how JSword
currently handles them.
The design challenges:
First, the challenges of identifying a single reference (e.g. Rev. 2:3):
1) Arbitrary user input. While references in SWORD modules (known as
Books in JSword lingo) can be normalized, they have not been. Users
enter in all kinds of variations. Bible book names vary widely in
their representation and in their abbreviations. Some examples:
1.1) Some books are known by several names: Song of Songs, Canticles,
Songs, Song of Solomon
1.2) Some books are abbreviated in various fashions. e.g. Pt, Ptr
(Peter), Jn, Jhn (John), Phlm (Philemon)
1.3) Some book names are more than one word: Revelation of John
1.4) Some books are part of a series: 1 John, 2 John, 3 John
1.5) Series may be represented in a variety of fashions: digits [0-9],
Roman numerals (i, ii, iii)
1.6) Sometimes punctuation may be used for series and abbreviations,
e.g. 1. Moses (aka Genesis), Rev. 1:1
1.7) Punctuation separates chapter from verse, but might vary. E.g. Rev
1:1 and Rev 1.1
1.8) SWORD defines chapter 0 for each book and verse 0 for each
chapter as the introduction to the book and chapter, respectively. Most
SWORD modules don't have introductory material.
1.9) Typos and misspellings
1.10) An upcoming design change is that the verse, when in the form of
an osisID (e.g. Gen.1.1), can be prefixed with a workID. The workID can
be dotted (e.g. Bible.KJV) and is followed by a ':'. For SWORD
modules, the dotted structure represents a hierarchy from general to
specific and may include the type of work, the versification of the
work and, as the last part (or the only part), the initials of the module.
(That is what is found in the module's conf between the [ and ].)
1.11) Books with only one chapter do not use chapter numbers. So 3 Jn
5 is verse 5 of the only chapter of 3 John.
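To make the series variations above concrete, here is a minimal sketch in Java of normalizing a leading series marker (digits or Roman numerals, with an optional period) to plain digits. SeriesPrefix and normalize are hypothetical names of mine, not JSword code, and the sketch only knows i, ii and iii.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSword: rewrites a leading series
// marker (i, ii, iii or digits, with an optional period) as plain digits.
final class SeriesPrefix {
    // The marker must be followed by white space and then a letter,
    // so that "Isaiah" is not read as series "I" of a book "saiah".
    private static final Pattern SERIES = Pattern.compile(
        "^\\s*(iii|ii|i|[0-9]+)\\.?\\s+(\\p{L}.*)$", Pattern.CASE_INSENSITIVE);

    static String normalize(String name) {
        Matcher m = SERIES.matcher(name);
        if (!m.matches()) {
            return name.trim(); // no series marker found
        }
        String marker = m.group(1).toLowerCase(Locale.ROOT);
        String number;
        switch (marker) {
            case "i":   number = "1"; break;
            case "ii":  number = "2"; break;
            case "iii": number = "3"; break;
            default:    number = marker; // already digits
        }
        return number + " " + m.group(2).trim();
    }
}
```

With this, "ii John", "II John" and "2. John" all come out as "2 John", while "Isaiah" is left alone. It does nothing about the book name itself.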
2) Forms of a reference:
2.1) Book, Chapter and Verse (BCV) - This is the most obvious. For
example: Rev 1:1
2.2) Book and Chapter (BC) - One can refer to a whole chapter (not
sure if it should start at C:0 or C:1).
2.3) Book (B) alone - Or a book by itself. (Not sure if it should
start at 0:0 or 1:1)
3) Ambiguity - While it is always possible to have unambiguous
references (e.g. osisIDs and osisRefs), arbitrary user input can cause
ambiguity:
3.1) Ambiguous abbreviations, e.g. Jud (Judges or Jude), Jo (Job,
Jonah or John)
3.2) Context - When in a particular book, 1:1 refers to that book.
When viewing a chapter, 3 refers to a particular verse in that
chapter. There are a few cases where "v. 3" means verse 3 in the
current chapter and "c. 5" means chapter 5 in the same book.
4) Internationalization (i18n)
4.1) Always need to handle English input as these are used in modules
as cross references.
4.2) Need to handle book names in the locale of the user.
4.3) All the issues of 1) apply here as well, but with further issues:
4.3.1) A different number system might be used. This may be for book
series numbers and for chapter and verse numbers.
4.3.2) Multiple number systems might be used.
4.4) Bible book names may include punctuation, such as, '-', the dash,
as in Indonesian. Ultimately, there can be no assumptions about what
characters can be in a book name.
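On point 4.3.1, here is a small sketch of reading a chapter or verse number without assuming ASCII digits. It relies on java.lang.Character.digit, which knows the decimal digits of many scripts. LocalizedNumbers is a hypothetical name, and the sketch covers only positional decimal digits, not Roman numerals or other non-positional systems.

```java
// Hypothetical helper, not part of JSword: parses a run of decimal
// digits from any script that Character.digit understands.
final class LocalizedNumbers {
    static int parse(String s) {
        if (s.isEmpty()) {
            throw new NumberFormatException("empty number");
        }
        int value = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int d = Character.digit(cp, 10); // -1 if not a decimal digit
            if (d < 0) {
                throw new NumberFormatException("not a decimal digit at index " + i);
            }
            value = value * 10 + d; // overflow is ignored in this sketch
            i += Character.charCount(cp);
        }
        return value;
    }
}
```

Because each digit is converted on its own, a number that mixes number systems (point 4.3.2) also parses.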
Second, the challenges of defining a passage, that is, a contiguous
range of verses:
1) There needs to be an indicator that a range is needed, for example
'-', the dash. (Note the ambiguity with using a '-' within a book
name. Below, I'll use the dash as the separator.)
2) Context - The reference before the dash is the context of what
follows. The context might be B, BC or BCV
2.1) B - as a range context is understood as the start of the book.
Since SWORD defines chapter 0 and verse 0, it is not clear to me
whether this refers to 0:0 or 1:1.
2.2) BC - this context is the start of the chapter. (C:0 or C:1)
2.3) BCV - explicit full reference
3) End reference - After the dash the reference identifies the last
verse in the range. This can be BCV, CV, V, C, BC, and B.
3.1) B - includes the entire book to the last verse of the last chapter
3.2) BC - includes the entire chapter of the specified book, to its
last verse
3.3) BCV - includes the specified verse
3.4) CV - includes the specified chapter and verse in the same book as
the context
3.5) V and C - these are a bit ambiguous. It all depends on the
context. If the context explicitly includes the verse (i.e. BCV), then
it is V, otherwise it is C.
3.6) Since the book name can start with a number, it is necessary to
look ahead: e.g. John 1:1 - 3 John 5.
3.7) ff - The double ff represents the last verse in a chapter. One
should be able to say John 3:15ff to refer to John 3:15 - 36. Note
that this does not use the dash.
3.8) Order - A range should go from an earlier verse to a later verse.
3.9) Alternate Versification (v11n) - Currently SWORD and JSword are
limited to the KJV versification, but some versifications have a
different order of books. Work is under way to handle alternate
versification and JSword will follow SWORD in providing an
implementation for this. When we get there, it will be important to
know which versification is used.
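The rule in 3.5 for a lone number after the dash can be sketched as follows. This is only an illustration under my own assumptions: RangeEndSketch, resolveLoneNumber and the NONE marker are hypothetical, and single-chapter books, chapter 0 and verse 0 are not considered.

```java
// Hypothetical sketch, not JSword code: decide what a lone number after
// the range dash means, based on how specific the start reference was.
final class RangeEndSketch {
    static final int NONE = -1; // marks an absent chapter or verse

    /**
     * The start of the range is given as book, chapter (or NONE) and
     * verse (or NONE); n is the lone number found after the dash.
     * Returns the end reference as text.
     */
    static String resolveLoneNumber(String book, int chapter, int verse, int n) {
        if (chapter != NONE && verse != NONE) {
            // BCV context: the number is a verse in the same chapter.
            return book + " " + chapter + ":" + n;
        }
        // B or BC context: the number is a chapter of the same book.
        return book + " " + n;
    }
}
```

So "John 3:15-17" ends at John 3:17, while "John 3-5" ends with chapter 5 of John.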
Third, the challenges of defining a list of verses and passages:
1) Individual verses and ranges can be in a delimited list. White
space may suffice for a delimiter, but when one is present it has to
be either expected or unambiguous. That is, it cannot be the range
delimiter. The typical choice is the comma. Other punctuation might
work as well. (Below, I'll use the comma)
2) The list does not have to be ordered, but the second and subsequent
list entries have the prior entry's endpoint as its context.
3) Interpreting the start:
3.1) B - The entire or start of a book, depending on whether it is
part of a range.
3.2) BC - The entire or start of a chapter of the book, depending on
whether it is part of a range.
3.3) BCV - The specific verse
3.4) CV - The chapter and verse of the context's book
3.5) V and C - Just like the range end, this is context dependent, but
with a twist. If it can be understood as a V it should. Otherwise as a
C. For example, Jn 3:15-17, 19 should be understood as Jn 3:19. Jn
3:15-20, 19 should be understood as John 19 because 20 >= 19 and 19 is
a valid chapter.
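A sketch of the "verse if possible, otherwise chapter" rule in 3.5, under the assumption (mine, read off the two examples) that a number counts as a verse only when it lies after the previous endpoint and within the chapter. ListItemSketch and its parameters are hypothetical names.

```java
// Hypothetical sketch, not JSword code: classify a lone number that
// follows a list delimiter, given the previous entry's endpoint.
final class ListItemSketch {
    enum Meaning { VERSE, CHAPTER, INVALID }

    static Meaning interpret(int prevEndVerse, int n,
                             int versesInChapter, int chaptersInBook) {
        // Prefer a verse: it must come after the previous endpoint
        // and exist in the current chapter.
        if (n > prevEndVerse && n <= versesInChapter) {
            return Meaning.VERSE;
        }
        // Otherwise fall back to a chapter of the same book.
        if (n >= 1 && n <= chaptersInBook) {
            return Meaning.CHAPTER;
        }
        return Meaning.INVALID;
    }
}
```

For John 3 (36 verses, 21 chapters in the book), interpret(17, 19, 36, 21) gives VERSE and interpret(20, 19, 36, 21) gives CHAPTER, matching the two examples.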
I think this is a reasonably complete discussion of properly parsing
passage lists.
A while ago, I scraped all the cross references in all modules into a
file and tested JSword and SWORD against it. Some entries are bad and
JSword and SWORD did their own thing (garbage in, garbage out). Both
came up with the same results. If anyone is interested, the file is
Now as to how JSword handles passage list parsing. You will see it has
some shortcomings. Hopefully, I'll be able to comment on them.
The entry point for converting a string to a passage is
This splits refs on comma, semi-colon, tab and line breaks
(AbstractPassage.REF_ALLOWED_DELIMS) and treats each of those as a potential
VerseRange. For simplicity, JSword defines a VerseRange as one or more
adjacent verses that can be parsed by
(Aside: The purpose of VerseRangeFactory is to allow for plugging in a
different implementation of VerseRange parsing, but we did not use the
plugin model here and instead hard-coded the implementation. While it
would be trivial to do, I think an entirely different mechanism is
needed to overcome the problems in the current implementation.)
Having split the refs string into potential VerseRange strings, the
first call to fromString does not have a context (called basis in the
code). SWORD uses Gen 1:1; JSword figures it should be an error if
there is no Bible book name in the first reference. The first
VerseRange is the context for the second, the second is the context
for the third, and so forth. As VerseRanges are parsed they are added to
the AbstractPassage. The actual derived class figures out how to store
them.
The potential VerseRange is split on VerseRange.RANGE_ALLOWED_DELIMS,
which is hard coded to '-'. When there is no '-' then the start and
the end of the VerseRange are the same.
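To show the two-level split in miniature, here is a sketch that is not the JSword code. NaiveSplitSketch, LIST_DELIMS and RANGE_DELIM are hypothetical stand-ins for the splitting described above, with the delimiter sets written out as I understand them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not JSword code: split a list of references on
// the list delimiters, then split each entry on the range delimiter.
final class NaiveSplitSketch {
    static final String LIST_DELIMS = "[,;\\t\\r\\n]+"; // comma, semi-colon, tab, line breaks
    static final String RANGE_DELIM = "-";

    static List<String[]> split(String refs) {
        List<String[]> ranges = new ArrayList<>();
        for (String item : refs.split(LIST_DELIMS)) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as "a,,b"
            }
            String[] ends = trimmed.split(RANGE_DELIM);
            for (int i = 0; i < ends.length; i++) {
                ends[i] = ends[i].trim();
            }
            ranges.add(ends); // one element: single verse; two: start and end
        }
        return ranges;
    }
}
```

"Jn 3:15-17, 19" splits into {"Jn 3:15", "17"} and {"19"}, as intended. With a book name that contains the dash, such as the Indonesian Hakim-hakim (Judges), "Hakim-hakim 2:1-3" splits into three pieces, "Hakim", "hakim 2:1" and "3", which is exactly the problem that started this discussion.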
The parsing engine is o.c.j.passage.AccuracyType. The basic
responsibility of AccuracyType is to determine what a string reference
is, given its context (or basis). AccuracyType.tokenize(String ref)
parses the reference into parts on digit boundaries. There is an
undocumented assumption in the code that book names do not end in
numbers. This pertains to books like 3 John, and it may be that this
does not work for other languages which might call it the equivalent
of John 3. This routine will return an array of 1 to 3 strings. The
other assumption here is that Character.isLetter(char) and
Character.isDigit(char) are sufficient to determine the parts.
Implicitly, any character that is neither isLetter nor isDigit is
treated as if it were a space. This is probably a problem with
internationalized book names. Note: This routine does not try to
actually identify the book name.
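The following is a rough sketch of tokenizing on letter and digit boundaries as described. It is my own approximation, not the actual AccuracyType.tokenize code, and TokenizeSketch is a hypothetical name. It shares the same assumptions: a book name may start with a number but not end in one, and anything that is neither a letter nor a digit acts as a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the JSword implementation.
final class TokenizeSketch {
    static List<String> tokenize(String ref) {
        // Step 1: cut the input into runs that are all letters or all digits.
        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int kind = 0; // 0 = separator, 1 = letters, 2 = digits
        for (int i = 0; i < ref.length(); i++) {
            char c = ref.charAt(i);
            int k = Character.isLetter(c) ? 1 : Character.isDigit(c) ? 2 : 0;
            if (k != kind && run.length() > 0) {
                runs.add(run.toString());
                run.setLength(0);
            }
            if (k != 0) {
                run.append(c);
            }
            kind = k;
        }
        if (run.length() > 0) {
            runs.add(run.toString());
        }

        // Step 2: a leading number followed by letters is a series number;
        // it and the following words together form the book name.
        List<String> parts = new ArrayList<>();
        StringBuilder book = new StringBuilder();
        int i = 0;
        if (runs.size() > 1 && Character.isDigit(runs.get(0).charAt(0))
                && Character.isLetter(runs.get(1).charAt(0))) {
            book.append(runs.get(i++));
        }
        while (i < runs.size() && Character.isLetter(runs.get(i).charAt(0))) {
            if (book.length() > 0) {
                book.append(' ');
            }
            book.append(runs.get(i++));
        }
        if (book.length() > 0) {
            parts.add(book.toString());
        }

        // Step 3: whatever is left (chapter, verse) follows as separate parts.
        while (i < runs.size()) {
            parts.add(runs.get(i++));
        }
        return parts;
    }
}
```

"Rev. 1:1" yields [Rev, 1, 1] and "3 John 5" yields [3 John, 5]. A name like Hakim-hakim comes back as "Hakim hakim" because the dash is treated as a space, which shows how this approach loses information for internationalized names.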
Once tokenized, the parts are used by AccuracyType to determine how
to interpret them against the basis. This is defined in terms of
an AccuracyType. This AccuracyType is able to generate a start or end
verse from the parts and its context. This is done twice, once for the
first part of the range using the passed in VerseRange as the basis
for the start and once for the second part of the range using the
first verse of the range as the basis.
Some of the flaws:
1) Uses hard-coded delimiters for the verse list and for the ranges.
This might be OK, but splitting on them arbitrarily is not OK.
2) The code looks up the book several times. This is an expensive
operation; it should be done only once, if at all possible.
3) The code handles osisRefs as a fall back case. These use spaces to
delimit verse references. On failure, spaces are replaced with commas
and reparsed. As we go more and more to osisRefs and osisIDs in SWORD
modules, this should be the norm, not the exception.
4) I think it would be better to have a streaming tokenization that
normalizes book names as they are found.
Some of the other bugs:
1) Does not handle verse 0.
2) Does not handle 5ff properly. This is taken as 5, ff and not 5-ff.
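For the second bug, one possible approach (a sketch only, not a proposed patch) is to rewrite a trailing "ff" into an explicit range before the usual parsing. FfSketch and expand are hypothetical names, and the caller is assumed to already know the last verse of the chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not JSword code: turn "15ff" into "15-<last verse>".
final class FfSketch {
    // A reference ending in a digit, optional white space, "ff" and an
    // optional period.
    private static final Pattern FF = Pattern.compile("^(.*\\d)\\s*ff\\.?$");

    static String expand(String ref, int lastVerseOfChapter) {
        Matcher m = FF.matcher(ref.trim());
        if (!m.matches()) {
            return ref; // no trailing ff, leave the reference untouched
        }
        return m.group(1) + "-" + lastVerseOfChapter;
    }
}
```

expand("John 3:15ff", 36) gives "John 3:15-36". Requiring a digit before the "ff" keeps a name that merely ends in "ff" from being rewritten.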
Consider a reference that starts with a book name. As we gather text
into what might be a book name we could determine which book name it
could be. We have a limited catalog of names and abbreviations. Given
this universe, there are only so many start characters. If we see one
of these, then we only need to consider those words. The second letter
narrows it further. At some point, some words in our universe are
complete. These are candidates. If there is more input that can match, we
continue. When we are done, we are left with the candidates, which may
need to be disambiguated. At this point we have a valid, matched book
name, or an error/recovery condition. If the names were built into a
Trie and one walked down it as described above, I think that would work.
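Here is a minimal sketch of that Trie idea in Java, to make the walk concrete. BookTrie, add and match are hypothetical names; nothing like this exists in JSword as far as this description goes. Matching is case-insensitive and returns the candidates for the longest name found at the start of the input.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a trie over the catalog of names and abbreviations.
final class BookTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        // Books for which a name or abbreviation ends at this node.
        final Set<String> books = new TreeSet<>();
    }

    private final Node root = new Node();

    /** Registers a name or abbreviation for a canonical book. */
    void add(String name, String book) {
        Node n = root;
        for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.books.add(book);
    }

    /**
     * Walks the input from its first character and returns the candidate
     * books at the longest registered name matched, or an empty set.
     */
    Set<String> match(String input) {
        Node n = root;
        Set<String> best = Collections.emptySet();
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            n = n.next.get(c);
            if (n == null) {
                break; // no registered name continues with this character
            }
            if (!n.books.isEmpty()) {
                best = n.books; // a complete name ends here; keep going
            }
        }
        return best;
    }
}
```

If "Jud" is registered for both Judges and Jude, match("Jud 1:1") returns both candidates for later disambiguation, while match("Jude 5") returns only Jude. Since no character is special, a name such as Hakim-hakim can be registered with its dash. The sketch does not check that the match ends on a word boundary, and it leaves typo recovery open.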