[sword-devel] Re: [bt-devel] fast search

Trevor Jenkins sword-devel@crosswire.org
Fri, 9 Feb 2001 02:25:46 +0000 (GMT)


On Thu, 8 Feb 2001, David Burry <dburry@tagnet.org> wrote:

> It's not unreasonable to have two different indexing algorithms both
> in use, ...

I can think of one serious problem with two separate fast search
algorithms: code maintenance. The more code the more likely there is of an
error. On the commerical system I helped develop we bucked the industry
trend and remored stop words---that is we did not implement a stop word
module. That's meant to remove common words but for sword would you really
want LORD, Jesus, Israel and other very common words to be removed from
the index? Our reason for not having that "feature" was simple: it
complicated the code. A previous product of ours did provide for stop
words so we knew the risks involved. In the 20 years I worked on/with both
products only once did we find a real reason for having stop words: top
secret data.

A second reason for not having stop word capability was that being a
European system we ealt with multi-lingual environments. What to you looks
like the definite article "the" was to the French term for a beverage
known in English as tea. Similar treatment of OR being French for gold
and as many of our customers were either banks or chemical companies this 
word was important to them. For sword the use of English, French, NT
Greek, Hebrew language modules in the presence of a stop word list could
give rise to false exclusions.

> ... one more optimized faster one for static books that never
> change, one more flexible slower one for dynamic books.  Better than
> having a slow algorithm for everything.

Hash tables. From Data Structures 101 there isn't anything much faster.
And yes you can use them with dynamic data and no you don't have to find a
prime number for the table size; forget that lecture in DS101 just use a
power of two, which allows you to grow the indexes easily (you OR in
another bit to the AND-mask). That's how we did.

> The trick is to make them both invisibly work the same in the API
> without the user (and possibly not even the interface programmer)
> having to know which book is using which.  This can be difficult if
> your inverted list indexes all the words ignoring punctuation, but
> your slower brute force search includes the punctuation, for instance.

I think you've found another reason not to have two speeds of "fast
algorithms". :-) It is unusual to keep the punctuation in the inverted
file. However, if it is there (it is after all only a few character term
and the computer doesn't know it's punctuation) then the original data
file can be dispensed as the data can be recovered from the inverted
lists. Okay so it'll cost CPU time but that's the reverse trade off one is
trying to get from having an inverted file with a base file.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living language.
Support the campaign for formal recognition by the British government now!

-- 

<>< Re: deemed!