[sword-devel] Fast search -- some ideas

Troy A. Griffitts sword-devel@crosswire.org
Fri, 08 Sep 2000 14:28:36 -0700


I hope no one is getting discouraged as to what should be developed for
a fast search framework.  Seems like discussions have calmed down.

Just wanted to encourage everyone that some great discussion was going
on.  Don't feel overwhelmed by such a huge task.  I would love to see
some simple implementations of a few of these neat ideas so that we
might try them out-- see what kind of perfomance versus size that we
really get.

Anyway, just wanted to help everyone not be overwhelmed by thinking of
implementing all of these great ideas right away.  Small test
implementations are great for research.  And fun to play with, too! :)

		-Troy.




Stephen Denne wrote:
> 
> G'day,
> 
> I took a quick look at indexing searches for the purposes of getting some
> kind of complete app on a 1MB Palm Pilot... I decided that if I was to do
> any coding (I'm more likely to upgrade my hardware), I'd use what you're
> calling option 1, and simply leave out the most common words.
> 
> I was considering having a tree data structure where each record in the tree
> containg the next letter (or unique letters) in the word, a (4 byte) pointer
> to the next record in the same level, a (2 byte) pointer to the first record
> of the next level, and the (zero-terminated) verselist, (2 bytes per verse).
> The tree is ... er... climbed(?)... as the user enters the word, skipping as
> many records as required till the entered letter is found (It would be a
> simple matter to optimise this by ordering the index records within a level
> by the total number of related verses), and then the resulting verse list is
> immediately available. This also leads to an easy stemming implementation -
> including (ORing) everything below that point in the tree.
> 
> This stucture saves space in that it doesn't require every letter of every
> word to be stored (dictionary integrated with the concordance), but at the
> same time makes any implementation of multiple spellings/languages using the
> same index rather tricky. It also makes wildcard searches incredibly complex
> (obviously I wasn't going to implement them).
> 
> If space were a huge concern you could save most of the last 12560 by
> indicating that a bitmap is to be expected instead of a verselist with a
> special verse pointer of 0 or FFFF
> 
> Stephen Denne.
> 
> There is... a time to search and a time to give up as lost; A time to keep
> and a time to throw away.