[sword-devel] Coming soon: new improved sword searching
sword-devel@crosswire.org
Wed, 11 Sep 2002 20:44:50 +0600
> On September 9, 2002 07:12, porton@narod.ru wrote:
> > The Bible has 31102 verses (if I counted correctly). At one bit per
> > verse, that is ~3.8Kbytes.
>
> You counted all the verses in the Bible?! (grin)
>
> > To search for "Christ & (God | Father)" we can construct 3 such bit vectors
> > (~11.4Kbytes in total) and then perform logical operations on them.
>
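To make the idea concrete, here is a minimal C++ sketch of such a query. The names (VerseBits, kVerseCount) are made up for illustration and are not part of Sword; the verse count is simply the figure quoted above. With 32-bit words, each vector takes 972 words, i.e. 3888 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: verse count taken from the figure quoted in this thread.
constexpr std::size_t kVerseCount = 31102;
constexpr std::size_t kWordBits = 32;
constexpr std::size_t kWordCount = (kVerseCount + kWordBits - 1) / kWordBits;

// Hypothetical type: one bit per verse, set when the word occurs in that verse.
struct VerseBits {
    std::vector<std::uint32_t> words = std::vector<std::uint32_t>(kWordCount, 0);

    void set(std::size_t verse) {
        assert(verse < kVerseCount);
        words[verse / kWordBits] |= std::uint32_t{1} << (verse % kWordBits);
    }

    bool test(std::size_t verse) const {
        assert(verse < kVerseCount);
        return (words[verse / kWordBits] >> (verse % kWordBits)) & 1u;
    }
};

// Verses present in both vectors.
VerseBits operator&(const VerseBits& a, const VerseBits& b) {
    VerseBits out;
    for (std::size_t i = 0; i < kWordCount; ++i) out.words[i] = a.words[i] & b.words[i];
    return out;
}

// Verses present in either vector.
VerseBits operator|(const VerseBits& a, const VerseBits& b) {
    VerseBits out;
    for (std::size_t i = 0; i < kWordCount; ++i) out.words[i] = a.words[i] | b.words[i];
    return out;
}
```

With three vectors christ, god and father filled in from an index, the whole query is then just: VerseBits hits = christ & (god | father);
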
> Bit vectors have some nice properties such as the ability to do very fast
> logical operations. However, they have some significant downsides as well:
>
> 1. They are very large to store for the Bible. I did a quick calculation and
> figured the indexes I've built would grow approximately 10x if I stored them
> as bit vectors. The reason is that the average word occurs only 100
> times, at least in the KJV (I assume other word-based languages are in the
> same order of magnitude). This means that 4K bit vectors are very sparse.
I am not suggesting storing every word this way, only the most frequently
occurring words (like "the").
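A rough way to decide which words deserve a bit vector is to compare storage costs. The sketch below assumes a plain verse list costs 2 bytes per entry (31102 fits in 16 bits); that entry size is my assumption, not the format of the existing indexes, and the function names are hypothetical. Under that assumption, a word found in 100 verses needs 200 bytes as a list against 3888 bytes as a bit vector, and the bit vector only wins once a word appears in more than 1944 verses.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: verse count taken from the figure quoted in this thread.
constexpr std::size_t kVerseCount = 31102;
// One bit per verse, rounded up to whole bytes: 3888.
constexpr std::size_t kBitVectorBytes = (kVerseCount + 7) / 8;
// Assumption: each verse number in a plain list is stored in 16 bits.
constexpr std::size_t kBytesPerEntry = 2;

// Size of a plain list holding one entry per verse the word occurs in.
std::size_t listBytes(std::size_t verseHits) {
    return verseHits * kBytesPerEntry;
}

// True when the bit vector is the smaller of the two representations.
bool preferBitVector(std::size_t verseHits) {
    assert(verseHits <= kVerseCount);
    return listBytes(verseHits) > kBitVectorBytes;
}
```
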
> 2. Conversion to and from them can be computationally costly (especially
> converting from them). Since storing bit vectors and returning bit vectors to
> the frontends aren't options, this would have to be considered.
If I remember correctly, the 80386 has a special instruction for finding set
bits in bit vectors. In any case, scanning for non-zero bytes is fast.
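Here is a portable sketch of both conversions, with no assembler. The function names are hypothetical. The instruction I have in mind is presumably one of the 80386 bit-scan instructions (BSF/BSR); the code below does not rely on them and simply skips all-zero words with one comparison each, which is where a sparse vector spends most of its time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a bit vector (32 verses per word) from a list of verse numbers.
std::vector<std::uint32_t> toBits(const std::vector<std::uint32_t>& verses,
                                  std::size_t verseCount) {
    std::vector<std::uint32_t> words((verseCount + 31) / 32, 0);
    for (std::uint32_t v : verses) {
        assert(v < verseCount);
        words[v / 32] |= std::uint32_t{1} << (v % 32);
    }
    return words;
}

// Convert back: collect the indices of all set bits in ascending order.
std::vector<std::uint32_t> toVerseList(const std::vector<std::uint32_t>& words) {
    std::vector<std::uint32_t> verses;
    for (std::size_t w = 0; w < words.size(); ++w) {
        std::uint32_t bits = words[w];
        if (bits == 0) continue;  // skip 32 empty verses at once
        for (std::uint32_t bit = 0; bits != 0; ++bit, bits >>= 1) {
            if (bits & 1u) verses.push_back(static_cast<std::uint32_t>(w * 32 + bit));
        }
    }
    return verses;
}
```
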
> 3. Perhaps most significantly, bit vectors are only really a big improvement
> for logical operators. Verse and word proximity (i.e. within x verses, or
> within y words) are better done other ways. This could easily lead to
> multiple conversions to and from bit vectors just to complete one search
> expression.
I'm not talking about verse proximity, but specifically about paragraphs with
specified boundaries!
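To illustrate what I mean, here is a minimal sketch that finds the paragraphs in which two words both occur. It assumes paragraph boundaries are given as an ascending list of starting verse numbers and that each word comes with an ascending verse list; these inputs and the function names are hypothetical, not an existing Sword interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the paragraph containing `verse`. `starts` holds the first verse of
// each paragraph in ascending order and must begin with 0.
std::size_t paragraphOf(const std::vector<std::uint32_t>& starts, std::uint32_t verse) {
    assert(!starts.empty() && starts.front() == 0);
    auto it = std::upper_bound(starts.begin(), starts.end(), verse);
    return static_cast<std::size_t>(it - starts.begin()) - 1;
}

// Paragraphs in which both words occur; `a` and `b` are ascending verse lists.
std::vector<std::size_t> paragraphsWithBoth(const std::vector<std::uint32_t>& starts,
                                            const std::vector<std::uint32_t>& a,
                                            const std::vector<std::uint32_t>& b) {
    std::vector<std::size_t> result;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        const std::size_t pa = paragraphOf(starts, a[i]);
        const std::size_t pb = paragraphOf(starts, b[j]);
        if (pa < pb) {
            ++i;
        } else if (pb < pa) {
            ++j;
        } else {
            // Same paragraph: record it once, then advance both lists.
            if (result.empty() || result.back() != pa) result.push_back(pa);
            ++i;
            ++j;
        }
    }
    return result;
}
```
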
> > I can even write the necessary algorithms (as time permits). If it turns
> > out too slow on an 80386, I can recall its assembler!
>
> Since Sword is a cross-platform library, assembler isn't really an option (I
> know it is already compiled on at least 3 different CPU architectures). Plus,
> do you really think hand-coded assembly would be much faster than what a good
> compiler could produce for a series of bitwise logical operations on arrays?
Isn't it only the 80386 that is slow?
--
Victor Porton (porton@ex-code.com)