[sword-devel] Re: [bt-devel] fast search
David Burry
sword-devel@crosswire.org
Sat, 10 Feb 2001 04:11:04 -0800
At 01:59 PM 2/9/2001 +0000, Trevor Jenkins wrote:
>On Thu, 8 Feb 2001, David Burry <dburry@tagnet.org> wrote:
>
> > order-preserving hash indexing algorithm
>
>I don't understand what an "order-preserving hash" means.
It means the hash keys are stored in a way that they can be efficiently
retrieved in binary order (or whatever ordering routine you supply).
Therefore "ranges" of keys can be grabbed or checked for relevance without
loading in and OR'ing/AND'ing the whole word list for each additional
word.
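Here's a quick Python sketch of the idea (not my actual code, which is
Perl on Berkeley DB; the key bytes here are made up for the example).
Because the keys sort in plain binary order, all postings for one word
form a contiguous run, so a binary search plus a short walk replaces
loading and merging whole lists:

    import bisect

    # Keys sort in plain binary order, so all postings for one word
    # form a contiguous run in the sorted index.
    index = sorted([
        b"\x00\x07aaa", b"\x00\x07bbb",   # postings for word number 0x0007
        b"\x00\x09aaa",                   # posting for word number 0x0009
    ])

    def grab_range(keys, prefix):
        """All keys starting with `prefix`: binary-search to the first
        candidate, then walk forward while the prefix still matches."""
        i = bisect.bisect_left(keys, prefix)
        out = []
        while i < len(keys) and keys[i].startswith(prefix):
            out.append(keys[i])
            i += 1
        return out

    print(grab_range(index, b"\x00\x07"))   # both postings for word 0x0007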
> > allows complex searches such
> > as one operand within so many words/verses/chapters/etc of another
> > operand (this is not all yet functional but it's coming trust me).
>
>That's just the type of feature I think should be available as part of a
>"fast search". Plus the boolean operators. I'm not convinced of the need
>for relevance ranking (either in sword or in general).
"within 0 verses" is the same as the common boolean "and" operator. Yes,
it does "or" too.
> > The downside of what I've written there is that the indexes are about
> > 5 times the size of the original data.
>
>Sounds as if you're keeping the position pointers uncompressed. The
>commercial system I helped to develop got the total size of the inverted
>files (we had two for each base file) down to about .7 of the size of the
>base file. We did this partly by removing redundancy in position pointers
>and by not using fixed width integers.
It's uncompressed, yes, but simply being uncompressed would only make it a
little larger than the original data, since the pointers themselves aren't
much longer than the average word length. What takes up most of the space
is the holey hashing algorithm. It uses a Berkeley DB B+Tree
(http://www.sleepycat.com/), which gives a much bigger speed increase than
having to decompress, examine, and then AND/OR together the lists you
described. It could be compressed to a small fraction of the size by
taking out the holes, if I worked in an "order preserving minimal perfect
hash function generator" and made the index read-only, but I haven't gotten
around to that yet. (Just search for that phrase on Google to see what it
is; Managing Gigabytes also talks about it.)
In fact, the main word index is just a bunch of hash keys without any
values. I'll just have to write up a paper explaining it, I guess; I'm not
going to be able to explain it adequately tonight.
Some explanation (more notes to myself though) can be found at the bottom of:
http://beaver.dburry.com/cgi-perl/bible
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=in+the+beginning
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=fear+of+the+lord
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=god+so+loved
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=love+%26+god+%7C+love+%26+lord
>If we were to index this paragraph of text then the position pointers in
>the index for "index" would be:
<snip>
OK, let me give it a whirl...
Stored as ordered hash keys, each hash key in the form:

  2 bytes  word number (from a separate word list table)
  1 byte   book
  1 byte   chapter
  1 byte   verse
  1 byte   word index number (position within the verse)

Total: 6 bytes per hash key.
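In Python terms that key is just a fixed-width pack (a sketch again; the
field names are mine, and the real thing is Perl):

    import struct

    # ">HBBBB" = big-endian 2-byte word number + 1 byte each for book,
    # chapter, verse, and word index.  Big-endian matters: it makes the
    # byte-wise sort order of the keys match numeric order.
    def make_key(word_num, book, chapter, verse, word_idx):
        return struct.pack(">HBBBB", word_num, book, chapter, verse, word_idx)

    key = make_key(1234, 1, 1, 1, 2)  # hypothetical word #1234, book 1, 1:1, word 2
    assert len(key) == 6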
So to search for "in the beginning" I do this internally (sketched in
Python after the list):
1) grab all keys that start with the two bytes of the "beginning" word
number (since "beginning" is the least common word in the phrase).
2) loop through them and do a quick, efficient check to see if the word "in"
exists two words to the left of each one (i.e. calculate the expected key
and check whether it exists in the hash).
3) continue to narrow the search by checking for the word "the" between the
matches found.
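Roughly, continuing the same sketch (using grab_range and make_key from
above, with a keyset of all 6-byte keys standing in for the B-tree; all
names are mine):

    import struct

    def phrase_matches(keys_sorted, keyset, word_nums, rarest_pos):
        """word_nums: word numbers of the phrase, in phrase order.
        rarest_pos: offset of the rarest word, whose postings we scan."""
        hits = []
        prefix = struct.pack(">H", word_nums[rarest_pos])
        for key in grab_range(keys_sorted, prefix):            # step 1
            _, book, chap, verse, idx = struct.unpack(">HBBBB", key)
            ok = True
            for i, w in enumerate(word_nums):                  # steps 2 and 3
                if i == rarest_pos:
                    continue
                j = idx + (i - rarest_pos)  # expected slot in the same verse
                # (real code would also handle phrases crossing verse bounds)
                if not 0 <= j <= 255 or make_key(w, book, chap, verse, j) not in keyset:
                    ok = False
                    break
            if ok:
                hits.append((book, chap, verse, idx))
        return hits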
To search for "fear & lord & God" I do this internally (again sketched
below):
1) grab all keys for the word "fear", just like above.
2) loop through them, and for each match build the 5-byte prefix (the other
word's number plus the same book/chapter/verse bytes, dropping the 1-byte
word index on the end) and check for the existence of keys for the word
"lord", then again for the word "god", in order from least popular to most
popular word (yes, I have a word-count index created at indexing time; it's
very small since it only has a couple thousand entries, not millions).
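In the sketch that looks like this (the "within 0 verses" case from
earlier is exactly this test):

    import struct

    def and_search(keys_sorted, word_nums_by_rarity):
        """Verse-level AND.  word_nums_by_rarity is sorted least to most
        popular, so we scan the smallest posting list and only probe the rest."""
        hits = set()
        rarest, rest = word_nums_by_rarity[0], word_nums_by_rarity[1:]
        for key in grab_range(keys_sorted, struct.pack(">H", rarest)):
            _, book, chap, verse, _ = struct.unpack(">HBBBB", key)
            # 5-byte prefix: the other word's number + same book/chapter/
            # verse, ignoring the final word-index byte.
            if all(grab_range(keys_sorted, struct.pack(">HBBB", w, book, chap, verse))
                   for w in rest):
                hits.add((book, chap, verse))
        return sorted(hits)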
You can see it's easy enough to do slight permutations for words within so
many words/verses/chapters/etc of other words.
In the end I have references and word indexes I can use to embolden the
exact matching words in the text output! ;o)
You can see the coding is a little more complex than simple binary AND'ing
and OR'ing, but the speed and versatility are a lot greater for complex
searches and those containing stop words. Essential to all of this is the
ability of the Berkeley DB B+Tree to preserve the order of the keys, and
therefore to do partial-key-prefix lookups and ranges. The number of bytes
per key could be extended or shrunk a little depending on how large your
book is and how you index it (by chapter, paragraph, sentence, etc.).
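For what it's worth, that prefix-range trick against a real Berkeley DB
B+Tree looks roughly like this (a sketch using the Python bsddb3 bindings
rather than my Perl; "words.db" is a made-up filename):

    from bsddb3 import db   # Python bindings for Berkeley DB

    d = db.DB()
    d.set_get_returns_none(2)   # cursor methods return None at end of range
    d.open("words.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    def prefix_scan(d, prefix):
        """Yield every key starting with `prefix`, using the B-tree's
        ordered cursor: position at the first key >= prefix, then walk
        forward until the prefix no longer matches."""
        cur = d.cursor()
        item = cur.set_range(prefix)        # smallest key >= prefix
        while item is not None and item[0].startswith(prefix):
            yield item[0]
            item = cur.next()
        cur.close()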
> > yep yep, bought and devoured that book "Managing Gigabytes" ;o)
>
>You know as much as me then. :-) Except I read MG to find out about latest
>trends. They advocate compressing both inverted and base files.
I know, but I'm getting more speed this way, and to me it's worth it
because I'm trying to make a super duper high speed web site that can take
tons and tons of traffic efficiently, without regard to the extra space
this requires, as long as it's not taking dozens of gigs of course. In
fact, 15 Bible versions take less than half a gig. Other people may
disagree with me and think I'm crazy; that's fine, I don't mind. ;-D If
you convince me of it I'll change. ;o)
> > ... I thought it was interesting that they danced around it and almost
> > touched it a few times, but never really nailed my particular
> > algorithm,
>
>Do you have the first or second edition?
Second.
Dave