[sword-devel] Comming soon: new improved sword searching
David Burry
sword-devel@crosswire.org
Sun, 8 Sep 2002 01:01:10 -0700
Awesome! Poor search capability and speed is the primary reason I've been
only lurking on this list too for so long instead of contributing. I don't
have the C experience to improve that very much easily, and don't have the
time to dive in and learn at the moment to invent better searching for sword
from scratch. But I have done some work on fast indexing and searching
algorithms, maybe we can work together... you can see some of my tests here:
http://beaver.dburry.com/cgi-perl/bible
You will see the framework and a lot of implementation of what you're
talking about is already there (albeit with different syntax and some added
things), but it's not in C it's in Perl, therefore of little use to sword at
the moment without porting it. Complicated searches generally take about
200 milliseconds or so (on an old 233 mhz PII).
You will notice I did implement full regexp searching on the indexed
version, by limiting regexps to only search the list of indexed words not
the full text. Yes, it's a bit slow (usually 2 seconds) to search the
entire word list with regexps, but hey still beats the pants off a brute
force full text regexp!
For instance, using your example:
http://beaver.dburry.com/cgi-perl/bible?encoding=utf-8&refstr=kjv&srchstr=%2
Fa.*b%2F&maxverses=30
this one is particularly long, and takes about 3.5 seconds to run the
regexp... and then 1.2 seconds to find all the bazillion verses that match
(see the speed notes at the bottom, this latter part can be optimized since
we're only showing the first 30 matches so who cares about locating the rest
internally)... don't everyone hit this all at once you could run my poor
home linux machine on a DSL out of memory or something, haha, no I don't
really think so I have the apache processes limited so you'd have to wait
your turns instead ;o)
I also have some ideas not yet implemented about creating a "phrase index"
that indexes word pairs instead of words... it should be able to be used
almost the same as my normal word index for most things, but makes phrase
searches faster still and eliminates the loooong failure case for when you
search for something like the phrase "the the" (this ties up the machine
for like 25 secs and then at last you get a failure notice) The phrase
index is no bigger than the normal word index, in fact it can be smaller
because the data is more evenly distributed among the hash buckets
internally, therefore requiring fewer of them. However, my particular
indexing algorithm is probably too large for desktop use, but not bad for a
web server because it's so faaaast! It uses a Berkeley DB B+Tree to store
the main part of the index. The phrase index will be important to improve
the speed for languages like Chinese, where you index each character. I
noticed that in such languages my algorithm starts bogging down because
everything is found so often, it can't start with an unpopular word to make
it quick, and the phrase index can help dramatically there. It would also
help in other languages if we ever indexed letter-by-letter instead of
word-by-word (which is probably useless in most languages, but... you never
know)
I've designed but not yet implemented a much more complicated index storage
system that should be smaller (about 5 megs per bible version) and still
preserve all the extended functionality of my existing one (which is about
25 megs per version) without reducing speed any (in fact it may increase
speed, not sure yet). This came out of the idea of trying to design an
order preserving minimal perfect hash function for my index, when it dawned
on me that my data is so predictable I don't need to use hashes for most
things just some complicated tables that reference tables that reference
tables that reference tables, etc. i.e. that large hash function by itself
would be as good as the index it was trying to describe, yet much smaller
because no holes between the hash values. Again, this better index is not
yet coded, only on the drawing board.
I've also done some work on prediction of how long a search will take
beforehand, so as to help eliminate DOS attacks if this were run on a web
server (for instance), so that you can refuse to run a search that will
display too many results and take too much resources. This is completely
disabled in my test script above, because I want to test the real thing...
(the non-existant phrase "the the" is currently the longest thing possible
in english I think).
Dave
----- Original Message -----
From: "Joel Mawhorter" <joel-ml@mawhorter.org>
To: <sword-devel@crosswire.org>
Sent: Saturday, September 07, 2002 10:42 PM
Subject: [sword-devel] Comming soon: new improved sword searching
> With 95% less time and 7 essential nutrients!
>
> Hi all,
>
> Most of you don't know me but I've been hanging out in this list for a few
> years. I've been working on a Bible search program that I started in my
last
> year of University as a guided project. My focus with this Bible program
was
> to implement full featured searching for non-Latin based languages. What I
> want to see is people all over the world able to study the Bible in their
own
> language. Several times in the past I have evaluated Sword and considered
> just putting my effort into that but the support for non-Latin languages
just
> wasn't there. However, it now seems to be getting much closer and I think
> Sword will be more useful than what I could produce on my own. Therefore,
> I've decided to join the Sword development project. My first priority is
to
> make a few improvements to the searching mechanism in Sword. I am writing
to
> the list to get feedback while I am still in the planning and early
> implementation stages of my work.
>
...