[jsword-devel] Passage List Parsing

Sat Aug 30 05:19:07 MST 2008

On Aug 30, 2008, at 1:45 AM, Tonny Kohar wrote:

> Hi,
>
> Whoa, it really is such a long list of problems :)
>
>> Some of the flaws:
>> 1) Uses hard-coded delimiters for the verse list and for the ranges. This
>> might be OK, but splitting on them arbitrarily is not OK.
>
> To avoid arbitrary delimiters, the delimiters could be defined in the
> config/properties file along with the BibleNames for a specific
> locale. The drawback is that a book name could not contain a
> punctuation char that is the same as the range delim char.

Putting the delimiter in a properties file is easy to do.
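
As a minimal sketch of that idea, the delimiters could be read with
java.util.Properties. The key names passage.range.delimiters and
passage.list.delimiters are made up for illustration; they are not
existing JSword keys, and the text is loaded from a String only to keep
the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class DelimiterConfigSketch {
    // Hypothetical key names, chosen for this example only.
    static final String RANGE_KEY = "passage.range.delimiters";
    static final String LIST_KEY = "passage.list.delimiters";

    /** Loads properties from text, as one might from a locale-specific file. */
    static Properties load(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props;
    }

    /** Range delimiters, falling back to '-' when the key is absent. */
    static String rangeDelimiters(Properties props) {
        return props.getProperty(RANGE_KEY, "-");
    }

    /** List delimiters, falling back to ',' and ';' when the key is absent. */
    static String listDelimiters(Properties props) {
        return props.getProperty(LIST_KEY, ",;");
    }

    public static void main(String[] args) throws IOException {
        Properties props = load(RANGE_KEY + "=-\n" + LIST_KEY + "=,;\n");
        System.out.println("range: " + rangeDelimiters(props));
        System.out.println("list:  " + listDelimiters(props));
    }
}
```

The fallback values keep '-' available even when a locale file says
nothing about delimiters.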

The challenge is knowing which language is being used for Bible book  
names. By the time we get to fromString, we don't know whether the  
reference came from external user input or internally from a module.

Almost every module having cross-references uses - as a range  
delimiter. Many use , as the list delimiter. These all use English  
Bible book names.

We'd still need to use '-' to process these.

>
>
>> 2) The code looks up the book several times. This is an expensive operation,
>> it should only be done once, if at all possible.
>> 3) The code handles osisRefs as a fall back case. These use spaces to
>> delimit verse references. On failure, spaces are replaced with commas and
>> reparsed. As we go more and more to osisRefs and osisIDs in SWORD modules,
>> this should be the norm, not the exception.
>
> Did you mean that the osisRefs/osisID will be handled first and, if that
> fails for any reason, it will use the current algorithm?

The current algorithm splits on comma, semi-colon, tab and new lines
to determine a list of verse ranges to parse. osisID and osisRef
use spaces as their delimiter, as in
osisID="Gen.1.1 Gen.1.2 Gen.1.3 Gen.1.5"
and
osisRef="Gen.1.1-Gen.1.3 Gen.1.5"

Splitting these on the list delimiter does nothing. The process fails  
at some later point with a NoSuchVerseException. We catch this and  
then replace space with comma and try again.

I mean that the algorithm should be re-written so that osisID and  
osisRef work the first time, as well as all the other references.  
There should be no fall back.
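
To illustrate why these values are easy to handle directly, here is a
rough sketch that splits an osisID or osisRef value on whitespace and
then on '-'. It is not JSword code. It assumes OSIS book abbreviations
contain neither spaces nor hyphens, and it ignores work prefixes and
other OSIS features.

```java
import java.util.ArrayList;
import java.util.List;

final class OsisRefSketch {
    /**
     * Splits an osisID/osisRef attribute value into {start, end} pairs.
     * A single verse yields a pair whose start and end are the same.
     */
    static List<String[]> parse(String value) {
        List<String[]> ranges = new ArrayList<>();
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return ranges;
        }
        // Whitespace is the list delimiter in osisID and osisRef.
        for (String item : trimmed.split("\\s+")) {
            int dash = item.indexOf('-');
            if (dash < 0) {
                ranges.add(new String[] { item, item });
            } else {
                // '-' separates the start and end of a range.
                ranges.add(new String[] {
                    item.substring(0, dash), item.substring(dash + 1) });
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (String[] range : parse("Gen.1.1-Gen.1.3 Gen.1.5")) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

The hard part is not this split but telling, in fromString, whether the
input is an OSIS value or free-form user input such as "1 John 3:16",
where a space is part of the book name.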

>
>
>> 4) I think it would be better to have a streaming tokenization that
>> normalizes book names as they are found.
>
>> Some of the other bugs:
>> 1) Does not handle verse 0.
>
> Is this what you mean by introductory verses as point 1.7?

Yes.

> Is there
> any SWORD module that has these introductory verses, that I could use
> as an example?

I'd have to look; I don't know of any offhand. Does anyone else know
of one?

>
>
>> 2) Does not handle 5ff properly. This is taken as 5, ff and not 5-ff.
>
>> Consider a reference that starts with a book name. As we gather text into
>> what might be a book name we could determine which book name it could be. We
>> have a limited catalog of names and abbreviations. Given this universe,
>> there are only so many start characters. If we see one of these, then we
>> only need to consider those words. The second letter narrows it further. At
>> some point, some words in our universe are done. These are candidates. If
>> there is more input that can match, we continue. When we are done, we are
>> left with the candidates, which may need to be disambiguated. At this point
>> we have a valid, matched book name, or an error/recovery condition. If the
>> names were built into a Trie and one walked down it as given above, I think
>> that would work.
>>
>
> How does the tree solve the problem of
It is a Trie, not a Tree. A tree has data at the nodes. A Trie encodes
data on the path to the leaf.
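
A rough sketch of such a Trie, to make the walk concrete. The class and
method names are made up for this example and are not part of JSword.
Names are lower-cased as a stand-in for real normalization, and each
candidate is rebuilt from the path rather than stored at a node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class BookNameTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean endsName; // true when the path from the root spells a full name
    }

    private final Node root = new Node();

    /** Adds a book name or abbreviation, one character per level. */
    void add(String name) {
        Node node = root;
        for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.endsName = true;
    }

    /** Walks down the Trie along the input and returns every name below it. */
    List<String> candidates(String input) {
        String prefix = input.toLowerCase(Locale.ROOT);
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no name starts with this input
            }
        }
        List<String> found = new ArrayList<>();
        collect(node, new StringBuilder(prefix), found);
        return found;
    }

    private static void collect(Node node, StringBuilder path, List<String> found) {
        if (node.endsName) {
            found.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            path.append(entry.getKey().charValue());
            collect(entry.getValue(), path, found);
            path.setLength(path.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        BookNameTrieSketch trie = new BookNameTrieSketch();
        for (String name : new String[] { "Genesis", "Job", "Jonah", "John", "Jude" }) {
            trie.add(name);
        }
        System.out.println(trie.candidates("Jo"));  // [job, john, jonah]
        System.out.println(trie.candidates("Jon")); // [jonah]
    }
}
```

An empty result is the error/recovery condition; more than one result
means the input still needs to be disambiguated.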

>
> - point 1.5
Book names with punctuation:
I didn't tell the whole story: JSword programmatically normalizes
each Bible book name and treats the normalized form as an alternative
name. These would be put in the Trie, too.
>
> - point 3.1
Ambiguous user input for book names.
The outcome of the Trie lookup for Jo would be Job, Jonah and John.  
JSword prioritizes these differently than SWORD: NT first and then by  
the order that they occur in the testament.
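
That ordering can be sketched with a comparator. The Candidate class
below is invented for this example and is not a JSword type. The
positions are the books' places within their testament in the usual
Protestant ordering, used here only as sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CandidateOrderSketch {
    /** Hypothetical holder for one matched book name. */
    static final class Candidate {
        final String name;
        final boolean newTestament;
        final int position; // 1-based position within its testament

        Candidate(String name, boolean newTestament, int position) {
            this.name = name;
            this.newTestament = newTestament;
            this.position = position;
        }
    }

    /** Orders candidates NT first, then by position within the testament. */
    static List<String> prioritize(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                // false sorts before true, so NT books (key false) come first
                .comparing((Candidate c) -> !c.newTestament)
                .thenComparingInt(c -> c.position));
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Candidate> jo = Arrays.asList(
                new Candidate("Job", false, 18),
                new Candidate("Jonah", false, 32),
                new Candidate("John", true, 4));
        System.out.println(prioritize(jo)); // [John, Job, Jonah]
    }
}
```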

>
> - point 3.6
Disambiguation of C, V:
This is handled by AccuracyType.

>
> - "The parsing engine is o.c.j.passage.AccuracyType. The basic
> responsibility of AccuracyType is to determine what a string reference
> is given its context (or basis). AccuracyType.tokenize(String ref)
> parses the reference into parts on digit boundaries. There is an
> undocumented assumption in the code that book names do not end in
> numbers. This pertains to books like 3 John, and it may be that this
> does not work for other languages which might call it the equivalent
> of John 3."
>
> Or does the tree not care about those and only try to get the book
> name, with the disambiguation handled in a different process?

Yes, it only tries to get the book name. Disambiguation is a separate
process, with each case handled differently.

The cheapest solution would be to replace the '-' split with a custom
tokenizer. If it found punctuation while in a Bible book name, it would
check whether the text so far is a prefix of a Bible book name. (Part of
what the lookup does is check whether any book name starts with the
user input.) Splitting would take place only when the '-' failed that
check.

If it is appropriate to have additional range delimiters, we'd check  
these too, in the same way.
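
Roughly, and only as a sketch: the splitter below keeps a '-' when the
text gathered so far, including the '-', is still a prefix of some
known book name, and splits otherwise. The hyphenated name
"Song-of-Songs" is a made-up alias for the example, and the linear scan
over names stands in for the real lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

final class RangeSplitSketch {
    /** Splits on '-' unless the '-' could still be part of a book name. */
    static List<String> splitRange(String ref, Collection<String> bookNames) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < ref.length(); i++) {
            if (ref.charAt(i) != '-') {
                continue;
            }
            // Text of the current part, up to and including this '-'.
            String soFar = ref.substring(start, i + 1).trim();
            if (isBookNamePrefix(soFar, bookNames)) {
                continue; // the '-' belongs to a book name, do not split
            }
            parts.add(ref.substring(start, i).trim());
            start = i + 1;
        }
        parts.add(ref.substring(start).trim());
        return parts;
    }

    /** True if any known book name starts with the given text. */
    static boolean isBookNamePrefix(String text, Collection<String> bookNames) {
        String wanted = text.toLowerCase(Locale.ROOT);
        for (String name : bookNames) {
            if (name.toLowerCase(Locale.ROOT).startsWith(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Gen", "Genesis", "Song-of-Songs");
        System.out.println(splitRange("Gen 1:1-Gen 1:3", names));
        System.out.println(splitRange("Song-of-Songs 1:1-3", names));
    }
}
```

With the sample names, the first reference splits into "Gen 1:1" and
"Gen 1:3", and the second into "Song-of-Songs 1:1" and "3".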

>
>
> Side note: do Character.isLetter(char) and Character.isDigit(char)
> work for locales other than English?

Both are independent of locale. In Unicode, each character has a set of
characteristics that define it. In Sun's Java these are accessible via
the Character class, which uses the package-private CharacterData
class. See javadoc for CharacterData.getType for more details.
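
A quick way to see this for yourself, using only the public Character
API. The sample characters are my own picks: a Cyrillic capital letter
(U+042F), an Arabic-Indic digit (U+0663) and a CJK ideograph (U+7D04).

```java
final class CharacterClassSketch {
    /** Classifies a char using only its Unicode properties; no Locale is involved. */
    static String classify(char c) {
        if (Character.isDigit(c)) {
            return "digit";
        }
        if (Character.isLetter(c)) {
            return "letter";
        }
        return "other";
    }

    public static void main(String[] args) {
        char[] samples = { 'J', '3', '\u042F', '\u0663', '\u7D04', '-' };
        for (char c : samples) {
            // getType returns the Unicode general category as an int constant.
            System.out.println(String.format("U+%04X", (int) c)
                    + " " + classify(c)
                    + " type=" + Character.getType(c));
        }
    }
}
```

Neither method takes a Locale argument, so the answer for a given
character is the same whatever the default locale is.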