[sword-devel] Re: [bt-devel] fast search
David Burry
sword-devel@crosswire.org
Sat, 10 Feb 2001 04:11:04 -0800
At 01:59 PM 2/9/2001 +0000, Trevor Jenkins wrote:
>On Thu, 8 Feb 2001, David Burry <dburry@tagnet.org> wrote:
>
> > order-preserving hash indexing algorithm
>
>I don't understand what an "order-preserving hash" means.
It means the hash keys are stored in a way that they can be efficiently
retrieved in binary order (or whatever ordering routine you supply).
Therefore "ranges" of keys can be grabbed or checked for relevance without
loading in and OR'ing/AND'ing the whole word list for each additional
word.
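Here's a quick Python sketch of the idea (not my actual code, which is
Perl on Berkeley DB; the key bytes here are made up for the example).
Because the keys sort in plain binary order, all postings for one word
form a contiguous run, so a binary search plus a short walk replaces
loading and merging whole lists:

    import bisect

    # Keys sort in plain binary order, so all postings for one word
    # form a contiguous run in the sorted index.
    index = sorted([
        b"\x00\x07aaa", b"\x00\x07bbb",   # postings for word number 0x0007
        b"\x00\x09aaa",                   # posting for word number 0x0009
    ])

    def grab_range(keys, prefix):
        """All keys starting with `prefix`: binary-search to the first
        candidate, then walk forward while the prefix still matches."""
        i = bisect.bisect_left(keys, prefix)
        out = []
        while i < len(keys) and keys[i].startswith(prefix):
            out.append(keys[i])
            i += 1
        return out

    print(grab_range(index, b"\x00\x07"))   # both postings for word 0x0007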
> > allows complex searches such
> > as one operand within so many words/verses/chapters/etc of another
> > operand (this is not all yet functional but it's coming trust me).
>
>That's just the type of feature I think should be available as part of a
>"fast search". Plus the boolean operators. I'm not convinced of the need
>for relevance ranking (either in sword or in general).
"within 0 verses" is the same as the common boolean "and" operator. Yes,
it does "or" too.
> > The downside of what I've written there is that the indexes are about
> > 5 times the size of the original data.
>
>Sounds as if you're keeping the position pointers uncompressed. The
>commercial system I helped to develop got the total size of the inverted
>files (we had two for each base file) down to about .7 of the size of the
>base file. We did this partly by removing redundancy in position pointers
>and by not using fixed width integers.
It's uncompressed, yes, but simply being uncompressed would only make it a
little larger than the original data, since the pointers themselves aren't
much longer than the average word length. What takes up most of the space
is the holey hashing algorithm. It uses a Berkeley DB B+Tree
(http://www.sleepycat.com/), which gives a much bigger speed increase than
having to decompress, examine, and then AND/OR together the lists you
described. It could be compressed to a small fraction of the size by
taking out the holes, if I worked in an "order preserving minimal perfect
hash function generator" and made the index read-only, but I haven't gotten
around to that yet. (Just search for that phrase on Google to see what it
is; Managing Gigabytes also talks about it.)
In fact, the main word index is just a bunch of hash keys without any
values. I'll just have to write up a paper explaining it, I guess; I'm not
going to be able to explain it adequately tonight.
Some explanation (more notes to myself though) can be found at the bottom of:
http://beaver.dburry.com/cgi-perl/bible
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=in+the+beginning
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=fear+of+the+lord
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=god+so+loved
http://beaver.dburry.com/cgi-perl/bible?refstr=kjv&srchstr=love+%26+god+%7C+love+%26+lord
>If we were to index this paragraph of text then the position pointers in
>the index for "index" would be:
<snip>
OK, let me give it a whirl...
Stored as ordered hash keys, each hash key in the form:

  2 bytes  word number (from a separate word list table)
  1 byte   book
  1 byte   chapter
  1 byte   verse
  1 byte   word index number (position within the verse)

Total: 6 bytes per hash key.
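In Python terms that key is just a fixed-width pack (a sketch again; the
field names are mine, and the real thing is Perl):

    import struct

    # ">HBBBB" = big-endian 2-byte word number + 1 byte each for book,
    # chapter, verse, and word index.  Big-endian matters: it makes the
    # byte-wise sort order of the keys match numeric order.
    def make_key(word_num, book, chapter, verse, word_idx):
        return struct.pack(">HBBBB", word_num, book, chapter, verse, word_idx)

    key = make_key(1234, 1, 1, 1, 2)  # hypothetical word #1234, book 1, 1:1, word 2
    assert len(key) == 6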
So to search for "in the beginning" I do this internally (sketched in
Python after the list):
1) grab all keys that start with the two bytes of the "beginning" word
number (since "beginning" is the least common word in the phrase).
2) loop through them and do a quick, efficient check to see if the word "in"
exists two words to the left of each one (i.e. calculate the expected key
and check whether it exists in the hash).
3) continue to narrow the search by checking for the word "the" between the
matches found.
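Roughly, continuing the same sketch (using grab_range and make_key from
above, with a keyset of all 6-byte keys standing in for the B-tree; all
names are mine):

    import struct

    def phrase_matches(keys_sorted, keyset, word_nums, rarest_pos):
        """word_nums: word numbers of the phrase, in phrase order.
        rarest_pos: offset of the rarest word, whose postings we scan."""
        hits = []
        prefix = struct.pack(">H", word_nums[rarest_pos])
        for key in grab_range(keys_sorted, prefix):            # step 1
            _, book, chap, verse, idx = struct.unpack(">HBBBB", key)
            ok = True
            for i, w in enumerate(word_nums):                  # steps 2 and 3
                if i == rarest_pos:
                    continue
                j = idx + (i - rarest_pos)  # expected slot in the same verse
                # (real code would also handle phrases crossing verse bounds)
                if not 0 <= j <= 255 or make_key(w, book, chap, verse, j) not in keyset:
                    ok = False
                    break
            if ok:
                hits.append((book, chap, verse, idx))
        return hits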
To search for "fear & lord & God" I do this internally (again sketched
below):
1) grab all keys for the word "fear", just like above.
2) loop through them, and for each match build the 5-byte prefix (the other
word's number plus the same book/chapter/verse bytes, dropping the 1-byte
word index on the end) and check for the existence of keys for the word
"lord", then again for the word "god", in order from least popular to most
popular word (yes, I have a word-count index created at indexing time; it's
very small since it only has a couple thousand entries, not millions).
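In the sketch that looks like this (the "within 0 verses" case from
earlier is exactly this test):

    import struct

    def and_search(keys_sorted, word_nums_by_rarity):
        """Verse-level AND.  word_nums_by_rarity is sorted least to most
        popular, so we scan the smallest posting list and only probe the rest."""
        hits = set()
        rarest, rest = word_nums_by_rarity[0], word_nums_by_rarity[1:]
        for key in grab_range(keys_sorted, struct.pack(">H", rarest)):
            _, book, chap, verse, _ = struct.unpack(">HBBBB", key)
            # 5-byte prefix: the other word's number + same book/chapter/
            # verse, ignoring the final word-index byte.
            if all(grab_range(keys_sorted, struct.pack(">HBBB", w, book, chap, verse))
                   for w in rest):
                hits.add((book, chap, verse))
        return sorted(hits)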
You can see it's easy enough to do slight permutations for words within so
many words/verses/chapters/etc of other words.
In the end I have references and word indexes I can use to embolden the
exact matching words in the text output! ;o)
You can see the coding is a little more complex than simple binary AND'ing
and OR'ing, but the speed and versatility are a lot greater for complex
searches and those containing stop words. Essential to all of this is the
ability of the Berkeley DB B+Tree to preserve the order of the keys, and
therefore to do partial-key-prefix lookups and ranges. The number of bytes
per key could be extended or shrunk a little depending on how large your
book is and how you index it (by chapter, paragraph, sentence, etc.).
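For what it's worth, that prefix-range trick against a real Berkeley DB
B+Tree looks roughly like this (a sketch using the Python bsddb3 bindings
rather than my Perl; "words.db" is a made-up filename):

    from bsddb3 import db   # Python bindings for Berkeley DB

    d = db.DB()
    d.set_get_returns_none(2)   # cursor methods return None at end of range
    d.open("words.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    def prefix_scan(d, prefix):
        """Yield every key starting with `prefix`, using the B-tree's
        ordered cursor: position at the first key >= prefix, then walk
        forward until the prefix no longer matches."""
        cur = d.cursor()
        item = cur.set_range(prefix)        # smallest key >= prefix
        while item is not None and item[0].startswith(prefix):
            yield item[0]
            item = cur.next()
        cur.close()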
> > yep yep, bought and devoured that book "Managing Gigabytes" ;o)
>
>You know as much as me then. :-) Except I read MG to find out about latest
>trends. They advocate compressing both inverted and base files.
I know, but I'm getting more speed this way, and to me it's worth it
because I'm trying to make a super duper high speed web site that can take
tons and tons of traffic efficiently, without regard to the extra space
this requires, as long as it's not taking dozens of gigs of course. In
fact, 15 Bible versions take less than half a gig. Other people may
disagree with me and think I'm crazy; that's fine, I don't mind. ;-D If
you convince me of it I'll change. ;o)
> > ... I thought it was interesting that they danced around it and almost
> > touched it a few times, but never really nailed my particular
> > algorithm,
>
>Do you have the first or second edition?
Second.
Dave