[jsword-devel] New versification mapping - preload mappings to prevent pauses

Sat Jan 18 17:56:35 MST 2014

On Jan 18, 2014, at 5:47 PM, Chris Burrell <chris at burrell.me.uk> wrote:

> Presumably if we have a basic osis ref parser, we can easily construct a verse range directly. Meaning a lot of the parsing on the versification mapping is simplified.

We don't have a basic osisID parser. If we did then...

> 
> The slight complication with the versification mappings is that it currently copes for the case as follows:
> 
> Gen.1.1=Gen.1-2
> Gen.1.1=Gen.1.4

These should be written:
Gen.1.1=Gen.1.1-Gen.1.2
Gen.1.1=Gen.1.4
That is the right hand side of the range should be an osisID.

> 
> i.e. split ranges, meaning that Gen 1:1 would map to verses 1,2 and 4 (leaving verse 3 out).
> 
> So if we used a osis ref parser, we could two ranges very very quickly (Gen.1-2 and Gen.1-4).
Yes.

> However, what structure would we use to store the combination of both? I was thinking a RangedPassage but you suggested the construction of this would be heavy?
The construction of a ranged passage one verse at a time is heavy. Especially when the verses are not ordered.

Imagine if we added the verses for Gen 1:1 to Gen 1:6 one verse at a time, but in the following order 1, 3, 2, 4, 6, 7, 5, 9, 10, 8.

1 > Create a VerseRange for Gen 1:1 to Gen 1:1 and add it to the collection.
3 > Check to see if 3 is already represented or if it is adjacent to any existing VerseRange. Since it is not create a new VerseRange for Gen 1:3 to Gen 1:3.
2 > Do the checks, finding that it is adjacent to Gen 1:1. Extend that VerseRange to become Gen 1:1-Gen 1:2. Since it was also adjacent to the other, throw that one away and extend the first to encompass all the verses in that.
4> Extend the one VerseRange.
6> Create a new VerseRange.
7> Find and extend a VerseRange.
5> Find and extend a VerseRange. Find and another adjacent VerseRange and extend the other to include it. Then throw it away. (Now we have one verse range Gen 1:1-7.
9> Create a Verse Range.
10> Extend that Verse Range
8> Find, extend, find, merge, wack.

If the RangePassage is simply constructed (e.g. construct a verse range and add it to an empty RangedPassage) There is no work.

If the RangedPassage is "Gen.1.1-Gen.1.2 Gen.1.4", is built as a VerseRange for both Gen.1.1-Gen.1.2 and for Gen.1.4-Gen.1.4 and added to the RangedPassage, there will be little overhead.
(In OSIS a passage is a space separate list of osisIDs and/or osisRefs).

I just don't know if this is directly supported in the RangedPassage code, but it should.

> Presumably we could introduce a constructor that takes a collection of VerseRanges, meaning its construction would be also very quick?

Yes.

> 
> One more point, on the osis ref parser. Presumably we would want to use BibleBookNames but looks like it already normalizes the booknames to the English locale.
Not really. The toUpper and toLower should always be called with an explicit Locale. Using Locale.ENGLISH means that it won't use the user's Locale. Since we control the text, we know that Locale.ENGLISH is the correct thing to do.

The normalizing of the names is a concession of end user text entry, where they might not know that Gen is the one and only way to specify Genesis in OSIS.

It would be better to have the lookup be exact. Maybe a new hash and a new call?

> Is there any particular reason for the normalize() method in the static initializer?

> Seems redundant to me... (apart from the lowercase).

This could be optimized to call toLower directly.

The principle is that the index and the search should be normalized in exactly the same way. If we chose, we could change normalize to upper odd chars and lower even. The code would continue to work.

But if we change it then comments should be here and in normalize to note that a change to the one may require a change to the other.

> I suggest we optimize that, saving another hosts of pattern matches. (only a handful), but on Android it will save a number of method calls which tend to be more expensive than expected. Is there a strong reason to have the lowercase? or would it be possible to define them lowercase to start with?

These values are used to construct osisIDs and the Mixed case is defined by the standard.

> Presumably our osis parser would want to ignore case...

The osis parser should probably not want to ignore case. The specification is mixed case. Skipping toLower would be a bit faster.

> 
> Chris
> 
> 
> 
> On 18 January 2014 22:19, DM Smith <dmsmith at crosswire.org> wrote:
> See embedded.
> On Jan 18, 2014, at 4:05 PM, Chris Burrell <chris at burrell.me.uk> wrote:
> 
>> I think to encourage the use of the Passages directly we need to make those constructors public. That's the main reason I have stayed away from using it.
> 
> I see that TallyPassage is fully public and the rest are not. This must have been so that Lucene ranked searches could use them.
> 
> Maybe, we should modify PassageKeyFactory to take an optional PassageType?
> 
>> 
>> Obviously a fast way to construct the versification mappings would be to turn them into verse numbers. But that means losing readability.
> 
> I think readability is critical to maintenance. You'll note that the v11n tables are readable.
> 
>> The obvious thought here is to generate an intermediary file as part of the build from what we have already code (custom maven plugin/ant task). But that seems rather hacky, but would line up with the way the versifications are defined in code.
>> 
>> So basically what we want is:
>> * an easy way of reading in a verse range/passage from text form into Verse/VerseRanges
> 
> The are a couple of things here that we could benefit from:
> 1) An osisID parser and an osisRef parser. An osisRef is merely two osisIDs separated by a dash.
> Note: Gen.1.1-2 is not an osisRef as 2 is not an osisID. To parse "2" the parser has to know that the "basis" (the context) of "2" is Genesis 1:1. Thus 2 is understood as a verse and not a chapter. However, if the basis was Gen.1 then the 2 would be understood as a chapter.
> 
> By having fully formed OSIS references*, we gain the following:
> There are no spaces to trim.
> Splitting on - will split into two parts, both of which are fully formed osisIDs.
> Splitting an osisID on . will result in 1 to 3 parts. book, book and chapter or book, chapter and verse. The verse is always a number, never "ff" or a roman numeral.
> Lookup of the book names is simplified. No need to try to figure out that Jo means John (or does it mean Job, or Jonah???). Book names are fixed ascii, never internationalized, never have spaces, dashes, periods or roman numerals.
> 
> *Later, when we have references with works and grains, we may want to handle those. E.g. ignore them.
> 
> 2) Today, conceptually a Passage is an ordered list of Verses and VerseRanges that don't overlap. However, JSword does not have a way to build a Verse or a VerseRange from a String. Maybe we need a way to construct a Passage from bottom up. I.e. create a Verse or VerseRange from a String (from osisIDs and osisRefs) and then put them into an initially empty Passage. Since we control the input, we know that the VerseRanges do not overlap, so maybe we can have a smart constructor that takes a List of Verse and VerseRanges and adds them.
> 
>> 
>> Basically, the way the code works is to iterate through the properties file and map and reverse-map every verse present to the maximum size available (i.e. 1 verse to a Passage). The idea behind this is that we store less that way. 
>> 
>> So if a mapping specifies as follows: Gen.1.1=Gen.1.1-2, then we only need to store 
>> left to right: Gen.1.1=>Gen.1.1-2
>> right to left: Gen.1=>Gen.1.1, Gen.1.2=>Gen.1.1
>> 
>> In that way, we've saved a mapping, and therefore memory (assuming we represent those ranges correctly).
>> Chris
>> 
>> 
>> 
>> On 18 January 2014 20:48, DM Smith <dmsmith at crosswire.org> wrote:
>> 
>> On Jan 18, 2014, at 3:19 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>> 
>>> Great stuff - helps really explain things. So I think for the versification mappings, seen as they don't change much as they are mostly contiguous ranges a RangePassage might be the best bet.
>> 
>> A RangedPassage is really good once it is constructed. Building it takes a lot of time. Certainly not good for canonical ordered search results.
>> 
>> 
>>> 
>>> I think the other thing that's quite expensive is: PassageKeyFactory.instance().getKey(targetVersification, qualifiedKey.getKey().getOsisRef())
>> 
>> I think this is expensive because of how it is done.
>> From memory: getOsisRef() returns a string, which needs to be parsed into what getKey() already is. The parser does not handle osisRefs. It throws an exception and the caller then mungs the reference such that the parser can handle it. JSword needs an osisRef parser, which would be much simpler and faster than the current parser that tries to understand what an end user is trying to provide.
>> 
>> If the reference is a string (e.g. from a user, from a module, these methods are appropriate), these are what should be used. If not a string, then stay away. It is expensive to convert a Passage to a String and it is expensive to parse a String and to build the Passage.
>> 
>> It'd be better to get an empty key list and add qualifiedKey.getKey().
>> 
>> 
>>> 
>>> I'm not sure what's the best way of dealing with this but I suspect that we may want to have specific cases for Verses as I believe the above is quite heavy as well.
>>> 
>>> I think I also often use: PassageKeyFactory.instance().createEmptyKeyList(versification) which probably should use a RangePassage as well
>> 
>> It should use the PassageType that the program finds best. A server side application has very different requirements from a desktop and that from a mobile app.
>> 
>> If you need a specific implementation, don't use the factory. Use what is best for the particular location.
>> 
>> Just note that if two different implementations are used (like adding one passage to another) the result will probably be the program default.
>> 
>>> 
>>> Would that make sense?
>>> Chris
>>> 
>>> 
>>> 
>>> On 18 January 2014 20:11, DM Smith <dmsmith at crosswire.org> wrote:
>>> There are several different implementations for Passage. They have trade-offs.
>>> BitwisePassage is strictly a BitMap, with one bit for each verse in the v11n. It is very space heavy but fast for most operations. Especially adding or removing Verses. Iteration is fast as it uses NextSetBit (or whatever it is called to find the next set bit).
>>> RocketPassage is a very heavy implementation. Not appropriate for mobile. It occupies lots more space than a BitwisePassage
>>> PassageTally is basically a weighted set of verses such that you can order by weights. This is used to return weighted search results. It is probably not much use otherwise. It is used to return a ranked Lucene search.
>>> RangedPassage keeps ranges (start verse and length) ordered in a TreeSet. It probably is a good candidate for mobile. Maybe the best regarding storage. Works well if the RangedPassage is not modified much. If you add a verse (not in the RangedPassage) it may cause two VerseRanges to be merged into one, otherwise it'll extend one or create a new one. That computation takes time. Removing a verse is also complex.
>>> DistinctPassage keeps verses ordered in a TreeSet. This can be a good candidate for mobile, if there are not many verses in the Passage. If there are lots it gets bigger than BitwisePassage.
>>> 
>>> The PassageKeyFactory is used throughout JSword to create the preferred type of Passage. You can set it to one of the above via PassageType typically at program startup.
>>> 
>>> I wrote BitwisePassage; Joe wrote the others.
>>> 
>>> I've been thinking that there might be another variation of BitwisePassage that might make sense. One with testament or book bitmaps.
>>> 
>>> Hope this helps.
>>> 
>>> In Him,
>>> 	DM
>>> 
>>> 
>>> On Jan 18, 2014, at 2:42 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>> 
>>>> The reason I went for passages is for less memory consumption. I also thought it would provide better parsing and we're parsing less.
>>>> 
>>>> But I think the way we store passages is to represent every verse by a bit. means that iterating through ranges, calculating cardinality or initialising them means is expensive.
>>>> 
>>>> I wonder if we could have a passage that just stores boundaries rather than all its content. That would be very fast to parse and very quick for operations on it.
>>>> 
>>>> Chris
>>>> On 18 Jan 2014 19:17, "Martin Denham" <mjdenham at gmail.com> wrote:
>>>> I think I have found a workaround for this problem (using a background thread makes it run very quickly) but it seems to point toward a possible point of improvement in the new JSword mapping data structure.
>>>> 
>>>> In the old AB v11n mapping code I stored maps of Verses.  However the new JSword code seems to store maps of RocketPassage/BitwisePassage.  Can you verify this is the best data structure for what will normally be a single verse.  I don't know an awful lot about this so I am just asking, but maybe a Passage will always consume more memory than a Verse.
>>>> 
>>>> Thanks
>>>> Martin
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 18 January 2014 13:30, Martin Denham <mjdenham at gmail.com> wrote:
>>>> I tried your patch and it did noticeably improve things, so it is a worthwhile patch, but did not solve the actual problem.  Still viewing the RusSynodal straight after download takes well over a minute, whereas to boot straight into it takes about 2 secs.
>>>> 
>>>> One or 2 things I have noticed:
>>>> a huge amount of memory allocation and GC occurs when trying to view RusSynodal straight after download
>>>> I am not great with the profiler but here are my tentative observations
>>>> Seems to be spending about half the time in java.util.BitSet.cardinality() 
>>>> called by BitwisePassage.countVerses()<-RocketPassage.countVerses()<-AbstractPassage.getCardinality()<-VersificationToKJVMapper.add...
>>>> I tried caching AbstractPassage.getCardinality but that had no affect which would be the case if there were lots of different keys
>>>> Could it be that a module is in an unusual state after download?  I have noticed that a module cannot be deleted directly after download, until the app is restarted due to:
>>>> sbmd.getConfigFile()==null when called by SwordBookDriver.isDeletable directly after module download
>>>> Any suggestions are welcome.
>>>> 
>>>> Cheers
>>>> Martin
>>>> 
>>>> 
>>>> On 16 January 2014 23:13, Martin Denham <mjdenham at gmail.com> wrote:
>>>> I think I will use the old AB v11n mapping for the next release of AB and come back to this after the release which has quite a lot of changes in it already anyway.
>>>> 
>>>> The problem could be AB, Android, or JSword related and it is not trivial to reproduce involving repeated re-installs of RusSynodal.  I had a similar problem a year or two ago, it took a long time to investigate and ended up being a flaw in the Android VM which required a change to JSword.  
>>>> 
>>>> I keep trying to think what could cause these symptoms and drawing a blank.
>>>> 
>>>> Cheers
>>>> Martin
>>>> 
>>>> 
>>>> On 16 January 2014 21:35, Chris Burrell <chris at burrell.me.uk> wrote:
>>>> I'll try profiling it! Nothing comes to mind though.
>>>> 
>>>> On 16 Jan 2014 20:04, "Martin Denham" <mjdenham at gmail.com> wrote:
>>>> There is something unusual happening with the v11n mapping creation straight after downloading a file requiring mapping e.g. It takes over 2 mins to load the mapping (a single call to mapVerse) for RusSynodal straight after download but a few seconds to load the file normally.
>>>> 
>>>> 1. View ESV and anything else in split screen
>>>> 2. Download RusSynodal
>>>> 3. Exit download screen after download
>>>> 4. select RusSynodal for display (with ESV) in 1 half of split screen
>>>> 5. Mapping code now takes over 2 minutes to load mapping
>>>> 6. But there are no errors and then everything seems fine 
>>>> 
>>>> 7. Exit and kill And Bible
>>>> 8. Restart with ESV/RST in split screens
>>>> 9. This time everything is initialised in a few seconds
>>>> 
>>>> Any ideas?
>>>> 
>>>> Martin
>>>> 
>>>> 
>>>> On 16 January 2014 19:06, Chris Burrell <chris at burrell.me.uk> wrote:
>>>> Hi Martin
>>>> 
>>>> Fine by me. Think it should be opt in if possible.
>>>> 
>>>> A couple of thoughts on that note. Do you know of it s the io or the cpu time that's the issue?
>>>> 
>>>> We probably want to opt in with a list as well as all as there's no point in loading the v11n if not required.
>>>> 
>>>> I guess especially if we end up with mappings per module eventually.
>>>> 
>>>> On a separate note, I also found the books.installed call is very expensive. Thinking it may be worth partially loading these. With around 200 modules we spend almost 15 seconds loading them.
>>>> 
>>>> Finally you can apply for a open source license of jprofiler which helps massively to work out what's going on. Have got a couple of uncommitted fixes for books. Installed find with that.
>>>> 
>>>> Chris
>>>> 
>>>> On 16 Jan 2014 18:10, "Martin Denham" <mjdenham at gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I have integrated the new v11n mapping code and find I am getting a pause when doing the initial mapping between any 2 v11ns.
>>>> 
>>>> I originally had the same problem with the AB mapping code so pre-loaded all mapping required for the installed set of documents at the start in a background thread to prevent delays.  The code I used is in VersificationMappingFactory.initialiseRequiredMappings().
>>>> 
>>>> I am trying to think of the best way to do this in the new code.  Either i) you could add a method to do it which could be called or ii) you could proeload all required mappings automatically or iii) add a new method to allow AB (or other frontend) to trigger preload of all required mappings by adding a public method a bit like the new VersificationsMapper.ensure(v11ntopreload).
>>>> 
>>>> This is quite an issue for mobile users.  I have a fast mobile and loading a mapping causes a noticeable delay the first time a verse changes, but the preload fix is fairly simple and worked well.
>>>> 
>>>> Cheers
>>>> Martin
>>>> 
>>>> 
>>>> _______________________________________________
>>>> jsword-devel mailing list
>>>> jsword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>> 
>>>> 
>>>> _______________________________________________
>>>> jsword-devel mailing list
>>>> jsword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> jsword-devel mailing list
>>>> jsword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>> 
>>> 
>> 
>> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.crosswire.org/pipermail/jsword-devel/attachments/20140118/975a29b5/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4145 bytes
Desc: not available
URL: <http://www.crosswire.org/pipermail/jsword-devel/attachments/20140118/975a29b5/attachment-0001.p7s>