[sword-devel] Comming soon: new improved sword searching

sword-devel@crosswire.org sword-devel@crosswire.org
Wed, 11 Sep 2002 20:44:50 +0600


> On September 9, 2002 07:12, porton@narod.ru wrote:
> > Bible is 31102 (if I counted correctly) verses. It is ~3.8Kbytes if a bit
> > for every verse.
> 
> You counted all the verses in the Bible?! (grin)
> 
> > Searching for "Christ & (God | Father)" we can construct 3 such bit vectors
> > (~10.6Kbytes) and then make logical operations over these.
> 
> Bit vectors have some nice properties such as the ability to do very fast 
> logical operations. However, they have some significant downsides as well:
> 
> 1. They are very large to store for the Bible. I did a quick calculation and I 
> figured the indexes I've build would increase approx 10 x if I stored them as 
> bit vectors. The reason for this is that the average word occurs only 100 
> times, at least in the KJV (I assume other word based languages are in the 
> same order of magnitude). This means that 4K bit vectors are very sparse.

I don't suggest to store so for anything, but only for the most often 
encountered words (like "the").

> 2. Converion to and from them can be costly computationaly (especially 
> converting from them). Since storing bit vectors and returning bit vectors to 
> the frontends aren't options this would have to be considered.

If my memory is right, 80386 has a special command for searching ones in bit 
vectors. In any case searching non-zeor bytes is fast.

> 3. Perhaps most significantly, bit vectors are only really a big improvement 
> for logical operators. Verse and word proximity (i.e. within x verses, or 
> within y words) are better done other ways. This could easily lead to 
> multiple conversions to and from bit vectors just to complete one search 
> expression.

I'm not about verse proximity, but namely about paragraphs with specified 
borders!

> > I can (as will have time) even write necessary algorithms. If it will be
> > too slow for 80386, I can remember its assembler!
> 
> Since Sword is a cross platform library, assembler isn't really an option (I 
> know it is already compiled on at least 3 different CPU arcitectures). Plus, 
> do you really think hand coded assembly would be much faster than what a good 
> compiler could produce for a series of bitwise logical operations on arrays?

Isn't only 80386 slow?
-- 
Victor Porton (porton@ex-code.com)