[sword-devel] Comming soon: new improved sword searching

Joel Mawhorter sword-devel@crosswire.org
Tue, 10 Sep 2002 22:38:24 -0700


On September 9, 2002 07:12, porton@narod.ru wrote:
> Bible is 31102 (if I counted correctly) verses. It is ~3.8Kbytes if a bit
> for every verse.

You counted all the verses in the Bible?! (grin)

> Searching for "Christ & (God | Father)" we can construct 3 such bit vectors
> (~10.6Kbytes) and then make logical operations over these.

Bit vectors have some nice properties such as the ability to do very fast 
logical operations. However, they have some significant downsides as well:

1. They are very large to store for the Bible. I did a quick calculation and I 
figured the indexes I've build would increase approx 10 x if I stored them as 
bit vectors. The reason for this is that the average word occurs only 100 
times, at least in the KJV (I assume other word based languages are in the 
same order of magnitude). This means that 4K bit vectors are very sparse.

2. Converion to and from them can be costly computationaly (especially 
converting from them). Since storing bit vectors and returning bit vectors to 
the frontends aren't options this would have to be considered.

3. Perhaps most significantly, bit vectors are only really a big improvement 
for logical operators. Verse and word proximity (i.e. within x verses, or 
within y words) are better done other ways. This could easily lead to 
multiple conversions to and from bit vectors just to complete one search 
expression.

That said, it would be worth investigating if the possible performance 
improvement is significant.

> Such the way we can implement reasonably fast searching in paragraphs
> instead of verses using indexes because with logical/shift operators with
> bit vectors these can be transformed in the necessary way. (We should
> transform bit vectors of verses to the corresponding bit vectors of
> paragraphs (oring the verses in one paragraph) before doing the
> operations.)
>
> I can (as will have time) even write necessary algorithms. If it will be
> too slow for 80386, I can remember its assembler!

Since Sword is a cross platform library, assembler isn't really an option (I 
know it is already compiled on at least 3 different CPU arcitectures). Plus, 
do you really think hand coded assembly would be much faster than what a good 
compiler could produce for a series of bitwise logical operations on arrays?

Thanks for the ideas,

Joel

> Moreover for often encountered words (like "the") we can even use an
> alternate index file format with bit vectors.
>
> P.S. On a Z80 3.5 Mhz ZX Spectrum I calculated 256x192 field for the well
> known life-game (we create a World with certain "physics" and like God see
> its development and can do "wonders") in 1.5 sec (independingly of input
> data). If wanting I could be able to make it even faster.