[sword-devel] Fast search index options
Jerry Hastings
sword-devel@crosswire.org
Sat, 02 Sep 2000 22:55:41 -0700
At 10:51 AM 9/2/2000 +0200, Nathan wrote:
>On 29th August 2000, Joe wrote:
> >DistinctPassage - a simple list of verses.
> >RangedPassage - stores a list of verse pairs (start and end)
> >BitwisePassage - stores a 31k long bitmap. One bit per verse
>
>I did some quick tests this morning on the 3 options.
>The bitmaps have fixed size (3888 bytes)
>The first 2 options depend on how many times the word occurs
>in the Bible.
Here are three more methods to try. Consider the bit map divided into
records, of say 32, 16, or 8 bits each.
1) Record a list of non-zero bit map records for a word. The list would
have 6 byte records of two fields. One to store a 32 bit record from the
bit map and the other, an integer to store the bit map record
number. This may be good for words like "Micaiah" which is in 18 verses,
but only two chapters. Depending on where the record boundaries fall this
may reduce to a two record list.
2) Similar to 1, but instead of the second field being a record number,
the second record indicates a number of blank records following the
non-zero record. Sort of a RLE compression. This second field could be 8
or16 bits for all records in the list depending on the largest break
between verses with the word.
3) Split the bit map into bytes. If the first byte is non-zero, count the
number of bytes to the first zero byte, but not more than 128 bytes. Add
128 to the count and save as the first byte of a file, followed by the
counted non-zero bytes. If the first byte was zero, count the number of
bytes until a non-zero byte, but not more than 128 bytes. Save the count as
the first byte of a file, but do not save the zero bytes. Start again at
the first byte after the bytes counted above and keep repeating. For some
words, like "the", NOT the bit map first--flip 1s and 0s.
These methods should allow for quick rebuilding of the bit maps and for
reproducing just a section of a bit map for ORing with another bit map.
Jerry