[osis-users] SWORD reference parser (was: Non-Anglophone Bible references)
Troy A. Griffitts
scribe at crosswire.org
Sat Jun 19 15:12:28 MST 2010
Dear Weston,
Ben is correct, this code is pretty ugly. It has been augmented,
extended, and patched over the past 20+ years to handle all sorts of
referencing we've encountered. There is a large chunk of duplicated
code which needs to be factored out, and it's just not pretty to look
at. On the bright side, it works quite well.
It DOES NOT handle the osisWork prefix or granularity, though it will
parse osisRefs that omit them. We could really use support for both.
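
To make the missing piece concrete, here is a minimal sketch of peeling the work prefix and grain off an osisRef before the bare reference reaches the existing parser. It is not SWORD code: OsisRefParts and splitOsisRef are hypothetical names, and it assumes a single (non-range) osisRef of the form [work:]ref[@grain], which is my reading of the OSIS schema. The "Bible.KJV" work name and "cp[4]" grain in the usage note are illustrative values only.

```cpp
#include <string>

// Hypothetical helper type, not part of the SWORD API.
struct OsisRefParts {
    std::string work;   // text before the first ':' (empty if absent)
    std::string ref;    // the bare reference, e.g. "John.3.16"
    std::string grain;  // text after '@' (empty if absent)
};

// Splits one non-range osisRef of the ASSUMED form [work:]ref[@grain].
// Only meant for OSIS-form input: a human-style "John 3:16" would be
// mis-split at its colon.
OsisRefParts splitOsisRef(const std::string &osisRef) {
    OsisRefParts parts;
    std::string rest = osisRef;

    // Everything up to the first ':' is treated as the work prefix.
    const std::string::size_type colon = rest.find(':');
    if (colon != std::string::npos) {
        parts.work = rest.substr(0, colon);
        rest = rest.substr(colon + 1);
    }

    // Everything after the first '@' is treated as the grain.
    const std::string::size_type at = rest.find('@');
    if (at != std::string::npos) {
        parts.grain = rest.substr(at + 1);
        rest = rest.substr(0, at);
    }

    parts.ref = rest;
    return parts;
}
```

With this, splitOsisRef("Bible.KJV:John.3.16@cp[4]") yields work "Bible.KJV", ref "John.3.16", and grain "cp[4]"; the ref part is what would be handed on to the existing parser.
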
Please use what you want/can. We try to keep things modular so you can
use independent parts of our code if you need, but this section of code
relies on a few other components in our api:
    LocaleMgr - handles all of our localization
    VersificationMgr - handles different versification systems
and the primary component is VerseKey.
Primary objects in our API include:
    SWMgr - represents your library of books and the overall engine
    SWModule - represents a book
    SWKey - represents a stateful position in (and sometimes a limited
            domain of) an SWModule
VerseKey is a specific type of SWKey and knows about Versification
Systems and Bible references.
So you can do:

    VerseKey vk = "Jn.3.16";

And it will know how to position a module to the correct place.
So you can do:

    SWMgr library;
    SWModule *kjv = library.getModule("KJV");
    kjv->setKey(vk);

or simply:

    kjv->setKey("Jn.3.16");

then:

    cout << kjv->RenderText();

Anyway, all this to say, a larger picture of how our engine works might
make it easier to lift code, and if you have any inclination to brave
what is our verse reference parser, we'd love to have it support the
components osisWork and granularity.
One last thing.
However you proceed, I cannot stress enough how valuable a good set of
unit tests is, especially for this part of the code. The parser works
character by character and has so many paths that there aren't many
(any) people who remember all the cases we're trying to handle. It is a
very complicated problem (as you've seen: Roman numerals, f. and ff.,
book names which include numbers and Roman numerals, multiple
punctuation marks that mean different things in different states,
ranges, implied ranges (e.g. rom 7), decisions about ambiguous numbers
(Rom.1.1;2), etc.).
We started to build an exhaustive set of unit tests for this code a
couple years back and it does not yet cover everything we handle, but
when we find a problem and fix it, we add that problem case to the unit
test so we're getting better. And what it DOES do for me, is when I
'fix' something, I can run the code through the tests to be sure I
haven't changed the parsing logic for a case I forgot about.
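
The "add every fixed case to a table and rerun it" habit described above can be sketched as follows. This is a toy, not the SWORD test suite: parseRefs, Case, regressionCases, and countFailures are hypothetical names, and the parser handles only dotted Book.C.V segments joined by ';' or ','. The rule that a bare number after ';' is a chapter while one after ',' is a verse is an assumption made for illustration, not a statement of how SWORD resolves Rom.1.1;2.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy stand-in for the real parser. ASSUMED policy (not necessarily
// SWORD's): a bare number after ';' is a chapter, after ',' a verse.
std::vector<std::string> parseRefs(const std::string &text) {
    std::vector<std::string> out;
    std::string book, chapter;      // context carried between segments
    char sep = ';';                 // separator preceding the current segment
    std::string::size_type pos = 0;
    while (pos <= text.size()) {
        std::string::size_type end = text.find_first_of(";,", pos);
        if (end == std::string::npos) end = text.size();
        const std::string seg = text.substr(pos, end - pos);
        if (!seg.empty()) {
            // Split the segment on '.'.
            std::vector<std::string> parts;
            std::string::size_type p = 0;
            for (;;) {
                const std::string::size_type dot = seg.find('.', p);
                if (dot == std::string::npos) { parts.push_back(seg.substr(p)); break; }
                parts.push_back(seg.substr(p, dot - p));
                p = dot + 1;
            }
            std::string verse;
            if (std::isalpha(static_cast<unsigned char>(parts[0][0]))) {
                // Full reference: book, then optional chapter and verse.
                book = parts[0];
                chapter = parts.size() > 1 ? parts[1] : "";
                verse = parts.size() > 2 ? parts[2] : "";
            } else if (parts.size() >= 2) {
                // "C.V": the book is carried over from the previous segment.
                chapter = parts[0];
                verse = parts[1];
            } else if (sep == ',') {
                verse = parts[0];   // bare number after ',': a verse
            } else {
                chapter = parts[0]; // bare number after ';': a chapter
            }
            out.push_back(book + (chapter.empty() ? "" : "." + chapter)
                               + (verse.empty() ? "" : "." + verse));
        }
        if (end < text.size()) sep = text[end];
        pos = end + 1;
    }
    return out;
}

// One row per input that has ever been fixed or clarified.
struct Case {
    const char *input;
    std::vector<std::string> expected;
};

const std::vector<Case> &regressionCases() {
    static const std::vector<Case> cases = {
        {"Jn.3.16", {"Jn.3.16"}},
        {"Rom.7", {"Rom.7"}},
        {"Rom.1.1;2", {"Rom.1.1", "Rom.2"}},
        {"Rom.1.1,2", {"Rom.1.1", "Rom.1.2"}},
        {"Rom.1.1;2.3", {"Rom.1.1", "Rom.2.3"}},
    };
    return cases;
}

// Returns the number of table rows the parser no longer satisfies.
int countFailures() {
    int failures = 0;
    for (const Case &c : regressionCases())
        if (parseRefs(c.input) != c.expected) ++failures;
    return failures;
}
```

When a bug is fixed, its input and expected output become a new row in the table, so a later 'fix' that changes the behaviour of a forgotten case shows up as a non-zero return from countFailures().
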
Anyway, I hope this is helpful.
Troy
PS. I need to add localization of punctuation and numerals to the
parser soon. I will make every effort to get around to finally
factoring out the duplicated code segment.
On 06/18/2010 12:15 PM, Weston Ruter wrote:
> Troy, This is great! Where's the source code for the reference parser?
>
> As part of the Open Scriptures osis.py module for representing OSIS
> identifier objects (osisID, osisRef, osisWork, etc), the next step is
> to have a pluggable/extensible system for converting human-formatted
> references into their OSIS equivalents, and also to go in the reverse:
> converting any OSIS object into a localized human-friendly
> representation. Collaboration between SWORD and Open Scriptures would
> obviously be a win. That being said, hopefully I haven't duplicated
> too much of what SWORD has already for handling OSIS identifiers.
>
> I've got OsisWork, OsisPassage, and OsisID classes assembled so far:
> http://github.com/openscriptures/api
> See tests for how the objects can be used:
> http://github.com/openscriptures/api/blob/a73bdd7d267b70a9e1303a3205c4241f52d3a83e/osis.py#L763
>
> Weston
>
>
> On Fri, Jun 18, 2010 at 11:58 AM, Troy A. Griffitts
> <scribe at crosswire.org> wrote:
>
> Regarding what we accept currently, you can try experimenting with:
>
> http://crosswire.org/study/examples/parsevs.jsp
>
>
> We do have the ability to provide alternate versification schemes which
> include other books (e.g., apoc.), or completely different book names
> like a versification of Josephus or DSS, but this tool defaults to the
> Protestant KJV v11n.
>
> Troy
>
>
>
> Forwarded conversation
> Subject: *[sword-devel] Non-Anglophone Bible references*
> ------------------------
>
> From: *David Haslam* <d.haslam at ukonline.co.uk>
> Date: Thu, Jun 17, 2010 at 1:58 PM
> To: sword-devel at crosswire.org
>
>
>
> Tim Bulkeley has written a short item on this topic here.
>
> Non-Anglophone Bible references
> <http://bigbible.org/sansblogue/bible/non-anglophone-bible-references/>
>
> The topic arises out of his frustration at having to perform a massive
> search and replace task to submit an article to a certain European
> theological journal.
>
> As many CrossWire developers are Anglophone, this may prompt some further
> thoughts that could benefit all our projects.
>
> David
>
>
> --
> View this message in context:
> http://sword-dev.350566.n4.nabble.com/Non-Anglophone-Bible-references-tp2259480p2259480.html
> Sent from the SWORD Dev mailing list archive at Nabble.com.
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
>
> ----------
> From: *David Instone-Brewer* <Technical at tyndale.cam.ac.uk>
> Date: Fri, Jun 18, 2010 at 2:42 AM
> To: sword-devel at crosswire.org
>
>
> Tim has over-simplified the situation.
> Other systems include different ways of abbreviating the books.
> In the following, "Am" is an abbreviation which does not include the
> end of the word,
> while Jas (ie James) does include the end of the word, so it shouldn't
> have a dot after it,
> which results in different systems:
>
> Am.7 and Jas 1
> Am.7 and Jas.1
> Am. 7 and Jas 1
> Am. 7 and Jas. 1
>
> Also, in numbering, the Dead Sea Scrolls have re-popularised the use
> of dots instead of colons, ie
>
> Am 7.1-3, 4-5
>
> And we haven't dealt with variations in listing other chapters
>
> Am 7.1-3; 8.1-2
> Am 7:1-3. 8:1-2
> etc
>
> And then we have the problem of references which span a chapter:
>
> Am 7.1--8.2 [or use an 'en' dash]
> Am 7.1-8.2
> Am 7.1 - 8.2
> etc
>
> There are so many 'standards' that it is best simply to pick the one
> which works best for you and stick to it.
>
> I'd suggest the following is the best compromise between humans and
> people.
>
> Amo 7.1-2; 8.1-2--9.2: Thus says the Lord....
> Jos Ant 1.2.15: On this day...
> 1QS 3.1
> 4Q496 2.6.1
> 4Qp.Is.a 1.1
> b.San 15.a-b [this means folio 15, sides a and b]
>
> This uses:
> - no dots but a space after the abbreviation of the title of the work
> - preceding dot instead of superscript (the "a" at the end of "4Qp.Is"
> is normally superscript)
> - normal numbers where possible (ie no Roman numerals but occasionally
> you need lower case letters)
> - no italics ("Ant" is normally in italic, as a non-Biblical book title)
> - 3-letter Bible book abbreviation (preferably the same as that used
> by BibleWorks and others)
> - dots dividing between verses, chapters, books and any other levels
> of division.
> - single hyphen for spans of verses
> - double hyphen for spans of chapters
> - semi-colon for separate references
> - colon used to separate a reference from the content
>
>
> David IB
> ----------
> From: *David Haslam* <d.haslam at ukonline.co.uk>
> Date: Fri, Jun 18, 2010 at 6:37 AM
> To: sword-devel at crosswire.org
>
>
>
> I was somewhat amused by the sentence that reads, "I'd suggest the
> following is the best compromise between humans and people."
>
> Notwithstanding, should spans of verse be punctuated by a hyphen or by the
> ndash character?
>
> cf. I came across a tip for MS-Word yesterday which claimed that the ndash
> is the proper standard for numerical ranges.
>
> Methinks such a change would be abhorrent to a lot of Bible software!
>
> David
> --
> View this message in context:
> http://sword-dev.350566.n4.nabble.com/Non-Anglophone-Bible-references-tp2259480p2260214.html
>
> ----------
> From: *Greg Hellings* <greg.hellings at gmail.com>
> Date: Fri, Jun 18, 2010 at 6:49 AM
> To: SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
>
>
> Personally I have a hyphen key I can push. I don't have an ndash key
> I can push. I vote for hyphens!
>
> ----------
> From: *David Haslam* <d.haslam at ukonline.co.uk>
> Date: Fri, Jun 18, 2010 at 7:03 AM
> To: sword-devel at crosswire.org
>
>
>
> Perhaps more to the point for CrossWire developers, should we create a new
> wiki page to address this subject?
>
> Within David IB's examples, which of these are not valid references in
> relation to our software?
>
> Assuming we could [eventually] make use of any of these referenced biblical
> texts within the SWORD API, i.e. even those for the Dead Sea Scrolls, etc.
>
> David
> --
> View this message in context:
> http://sword-dev.350566.n4.nabble.com/Non-Anglophone-Bible-references-tp2259480p2260248.html
>
> ----------
> From: *Troy A. Griffitts* <scribe at crosswire.org>
> Date: Fri, Jun 18, 2010 at 11:58 AM
> To: sword-devel at crosswire.org
>
>
> Regarding what we accept currently, you can try experimenting with:
>
> http://crosswire.org/study/examples/parsevs.jsp
>
>
> We do have the ability to provide alternate versification schemes which
> include other books (e.g., apoc.), or completely different book names
> like a versification of Josephus or DSS, but this tool defaults to the
> Protestant KJV v11n.
>
> Troy
>
>
>
>
>