[sword-devel] Re: [bt-devel] fast search

Thu, 08 Feb 2001 20:07:20 -0800

At 02:25 AM 2/9/2001 +0000, Trevor Jenkins wrote:
>On Thu, 8 Feb 2001, David Burry <dburry@tagnet.org> wrote:
>
>> It's not unreasonable to have two different indexing algorithms both
>> in use, ...
>
>I can think of one serious problem with two separate fast search
>algorithms: code maintenance. The more code the more likely there is of an
>error.

True, there are positives and negatives for everything.  The trick is finding the right balance for the particular application.

> On the commerical system I helped develop we bucked the industry
>trend and remored stop words
><snip>

My "Web Bible" site at http://beaver.dburry.com/bible/devel/ indexes stop words for searching as well.  They are especially useful when you kind of have a verse memorized but can't remember where it's located.

>> ... one more optimized faster one for static books that never
>> change, one more flexible slower one for dynamic books.  Better than
>> having a slow algorithm for everything.
>
>Hash tables. From Data Structures 101 there isn't anything much faster.
>And yes you can use them with dynamic data and no you don't have to find a
>prime number for the table size; forget that lecture in DS101 just use a
>power of two, which allows you to grow the indexes easily (you OR in
>another bit to the AND-mask). That's how we did.

Yes, hash tables are exactly what I've used in the above web site.  The most complex searches are lightening 200 ms fast on my slow old Pentium, any other slowness you see is probably due to my DSL line (deployment site has multiple OC12's into the co-lo).  My particular order-preserving hash indexing algorithm allows complex searches such as one operand within so many words/verses/chapters/etc of another operand (this is not all yet functional but it's coming trust me).  The downside of what I've written there is that the indexes are about 5 times the size of the original data.  For the extra flexibility and speed I personally view this as fine in a client-server situation where you can slap on a several gig drive onto the server and devote it to quickly serving the web site to the world (assuming the original data isn't larger than a couple gigs), but it's not really acceptable for something like SWORD where you're intending it to be installed on people's own local computers!
.

>> The trick is to make them both invisibly work the same in the API
>> without the user (and possibly not even the interface programmer)
>> having to know which book is using which.  This can be difficult if
>> your inverted list indexes all the words ignoring punctuation, but
>> your slower brute force search includes the punctuation, for instance.
>
>I think you've found another reason not to have two speeds of "fast
>algorithms". :-) It is unusual to keep the punctuation in the inverted
>file. However, if it is there (it is after all only a few character term
>and the computer doesn't know it's punctuation) then the original data
>file can be dispensed as the data can be recovered from the inverted
>lists. Okay so it'll cost CPU time but that's the reverse trade off one is
>trying to get from having an inverted file with a base file.

yep yep, bought and devoured that book "Managing Gigabytes"  ;o)  I thought it was interesting that they danced around it and almost touched it a few times, but never really nailed my particular algorithm, maybe I should write a masters thesis... lol

Dave