[sword-devel] Another Important Issue - fast searching framework index
Nathan
sword-devel@crosswire.org
Tue, 29 Aug 2000 08:35:06 +0200
Good day Troy, Martin & others
(I only joined the list about 2 weeks ago, so I am still trying to
find out who's who, who's doing what, what needs to be done, where
is the code, etc etc)
I am busy doing something like this for my website at the moment.
On a website, speed is even more important, as you have many users
and many requests at the same time.
I have developed some techniques for making it fast enough, and
they also seem to work well with large resultsets.
It makes provision for most search requirements, including
wildcards (mesch*), AND, OR, NOT, range, and I am also looking at
ranking (most "relevant" at the top -- if requested)
Maybe we should talk about this?
(I am still working on it, but I am finished with 80%+ of it, so
I know it works)
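To make the AND / OR / NOT part concrete: if each word maps to a set of verse references, the operators reduce to plain set operations. The Python below is only a minimal sketch of that idea, not the code running on my site. The `postings` table and the `refs` helper are hypothetical names, and the few references in it are illustrative sample data.

```python
# Hypothetical postings table: word -> set of verse references.
# The entries are illustrative sample data, not a real index.
postings = {
    "light": {"Gen 1:3", "Gen 1:4", "John 1:5"},
    "darkness": {"Gen 1:2", "Gen 1:4", "John 1:5"},
    "good": {"Gen 1:4"},
}


def refs(word):
    """Return the set of references for a word (empty set if unknown)."""
    return postings.get(word.lower(), set())


# AND: verses containing both words.
both = refs("light") & refs("darkness")

# OR: verses containing either word.
either = refs("light") | refs("darkness")

# NOT: verses containing the first word but not the second.
without = refs("light") - refs("good")
```

A nice side effect of this layout is that the cost of a multiword query depends on the sizes of the sets involved, not on the size of the text.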
1. In what format are the indexes that you are currently building?
(I assume it is something like a list of pointers to verses)
Are you also storing the number of times the word occurs in that
verse?
Are you working with ALL the words, or are you eliminating
"stopwords"? (something I see some Bible programs are doing --
most annoying imho)
2. I have tried to look at where you are doing the new fast search
in the Sword CVS, but time has not allowed me to explore this yet.
Can you point me to where/what you are doing at the moment?
(Or better, provide me with some quick high-level overview :-)
3. This brings up another point. Not all users know regex, etc.
But they will want to do complex searches. Are you looking at
making the search user interface more simple?
E.g. why ask the users to tell you that you must use regex when
they type "mesch*"? The * should tell you that automatically.
Or am I making it sound too easy?
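Here is roughly what I mean by detecting the wildcard automatically, as a small Python sketch. The function name `compile_query` is hypothetical, and treating `*` as "any run of word characters" is my own assumption about what users expect; it is not how Sword currently behaves.

```python
import re


def compile_query(term):
    """Compile a search term; a '*' in the term is treated as a wildcard.

    Assumption: '*' stands for zero or more word characters. Every other
    character is matched literally, so users never need regex syntax.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


# Sample word list (the first three words come from Martin's example).
words = ["mescha", "meschar", "meschelemja", "mose"]

pattern = compile_query("mesch*")
matches = [w for w in words if pattern.fullmatch(w)]

# Without the '*', only an exact word would match.
exact = [w for w in words if compile_query("mesch").fullmatch(w)]
```

With this, the user just types "mesch*" and never has to pick a search mode.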
God bless you,
nathan
http://www.nathan.co.za
PS. Where can I get hold of a Sword CD? I am in South Africa,
so I guess the normal outlets don't work. And the ISO image is
too big to download. I tried it! <grin>
-----Original Message-----
From: owner-sword-devel@crosswire.org
On Behalf Of Troy A. Griffitts
Sent: 29 August 2000 03:15
To: sword-devel@crosswire.org
Subject: Re: [sword-devel] Another Important Issue
Martin,
Thanks for the post. This is exactly what we are doing with the
reference implementation of a fast searching framework. We do one
search for each word in the text and create an index of every word with
verse references for each. We save this index and every time a search
is performed, we ask the index for the references for the word. And,
yes, as you said, we do multiword searches this way also.
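For readers who want to see the shape of this, the following Python sketch builds such a word index and answers a multiword query by intersecting the per-word reference sets. It is only an illustration under stated assumptions: the real framework is part of Sword, which is written in C++, and `build_index`, `search`, and the sample `verses` dictionary are hypothetical names, not Sword's API.

```python
import re
from collections import defaultdict


def build_index(verses):
    """Map every word to the set of verse references that contain it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(ref)
    return index


def search(index, query):
    """Return references containing every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Copy the first set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Small sample text, keyed by reference.
verses = {
    "Gen 1:1": "In the beginning God created the heaven and the earth",
    "Gen 1:3": "And God said Let there be light and there was light",
    "John 1:1": "In the beginning was the Word",
}

index = build_index(verses)
single = search(index, "beginning")
multi = search(index, "beginning God")
```

The index is built once and saved; each later search only touches the entries for the words in the query.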
Problems come with large result sets. You see, not only do we have to
find verse references for the word[s], we also have to verify that the
verse references are within the search range specified (valid for the
key used to specify the search bounds). This entails iterating through
the search results and asking the key if each one is valid. For
extremely large result sets, this takes just as long as searching the
entire text, actually sometimes longer than the default searching
mechanism.
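To illustrate the cost, the Python sketch below contrasts the per-result validity check described above with one possible alternative. The alternative rests on two assumptions that may not hold in Sword: that each verse reference can be stored as an integer position in the text, and that the search range is a single contiguous span. `filter_by_key` and `clip_to_range` are hypothetical names, and the numbers are made-up positions.

```python
from bisect import bisect_left, bisect_right


def filter_by_key(hits, is_valid):
    """The approach described above: ask the key about every hit."""
    return [h for h in hits if is_valid(h)]


def clip_to_range(sorted_hits, lo, hi):
    """Assumed alternative: binary-search a sorted hit list for the range.

    Only works if hits are sorted integers and the range is contiguous.
    """
    start = bisect_left(sorted_hits, lo)
    end = bisect_right(sorted_hits, hi)
    return sorted_hits[start:end]


# Made-up verse positions for one word, already sorted.
hits = [3, 17, 250, 900, 23145, 30000]
lo, hi = 100, 25000

slow = filter_by_key(hits, lambda v: lo <= v <= hi)
fast = clip_to_range(hits, lo, hi)
```

The first function does one check per hit, so its cost grows with the size of the result set; the second does two binary searches plus a slice regardless of how the key is implemented. A range made of several disjoint spans would need one clip per span.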
Any suggestions on how to speed up this process would be greatly
appreciated.
-Troy.
Martin Gruner wrote:
>
> Another feature request:
>
> At the moment you can use sword to retrieve text (a list of words) by a
> key (bible reference).
> Is it possible to retrieve keys (a list of) by a word? I am not talking
> about searching. I am talking about something like a concordance. This
> would involve creating a file for every module that contains information
> about the location of every single word in the module.
> For example, if I look up "mesch", sword tells me that this word is not in
> the module, but the words "mescha", "meschar", "meschelemja" ....
> If I look up "meschelemja", sword will give me 3 references to where this
> word occurs in the bible.
> Once this would be implemented, searches for a single word would be
> speeded up amazingly, because sword would just look them up in the
> concordance. You could even perform multi word searches using this
> mechanism.
> I do not know how realistic this is, but it is at least another
> (discussable) idea.
>
> Martin
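The lookup behaviour Martin describes (exact word gives references, unknown word gives nearby words) can be sketched with a sorted word list and a binary search. This Python example is a minimal sketch under assumptions: `concordance` and `lookup` are hypothetical names, and the reference strings are placeholders rather than real verse locations (only the count of three for "meschelemja" comes from the message).

```python
from bisect import bisect_left

# Hypothetical concordance: word -> list of references.
# The "ref-..." strings are placeholders, not real verse locations.
concordance = {
    "mescha": ["ref-1"],
    "meschar": ["ref-2"],
    "meschelemja": ["ref-3", "ref-4", "ref-5"],
    "mose": ["ref-6"],
}
sorted_words = sorted(concordance)


def lookup(word):
    """Return (references, suggestions) for a word.

    An exact hit returns its references and no suggestions. Otherwise the
    references are empty and the suggestions are the words that start
    with the given text.
    """
    if word in concordance:
        return concordance[word], []
    suggestions = []
    i = bisect_left(sorted_words, word)
    while i < len(sorted_words) and sorted_words[i].startswith(word):
        suggestions.append(sorted_words[i])
        i += 1
    return [], suggestions


found_refs, _ = lookup("meschelemja")
_, near = lookup("mesch")
```

Because the word list is sorted, all words sharing a prefix sit next to each other, so the suggestions cost one binary search plus a short scan.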