[sword-devel] Searching and Lucene thoughts
Chris Little
chrislit at crosswire.org
Thu Mar 3 14:14:25 MST 2005
Stephen,
I didn't actually say anything about Boyer-Moore looking at the ends of
words. The issue with suffixing languages isn't caused by the
algorithm's interaction with the document being searched. It is caused
by the search patterns themselves. In other words--it's not because the
algorithm skips from word to word in the document--it's because search
patterns will predominantly be words and in English they are often suffixed.
Your summary of the algorithm itself is helpful, but I still wonder at
why the abstract itself specifically cites English data--maybe that's
not relevant, but I wonder why it was cited in the abstract if that's so.
My statement that I couldn't understand how this speeds search times
reflects that suffixing actually decreases the variety of characters you
will find towards the ends of English words. Cross-linguistically, you
tend to find the richest set of sounds (and hence characters) in the
roots of words, so I would expect that starting with the n/2 character
of a string of n characters would probably yield better search times
overall, using a Boyer-Moore-like algorithm.
But to address the real issue here: there are about 20 million things to
do for Sword that are more important than changing the unindexed search
algorithm. We already addressed search times by adding Lucene.
--Chris
Stephen Denne wrote:
> Chris,
>
> The Boyer-Moore algorithm is only language specific in that it works better
> in languages that have larger alphabets, and consequently less chance of a
> single character matching a searched for character.
>
> It is a general purpose search algorithm.
>
> It does not look at the end of each word, that is a misunderstanding of the
> algorithm. It is not based on word suffixes.
>
> Basically, if it is looking for a 10 character long sequence, it checks the
> 10th character, the 20th character, the 30th character, the 40th character,
> etc, until it finds a character that is in the 10 character long sequence it
> is looking for. If it isn't in there, then there was no point checking any
> of the previous 9 characters.
>
> Stephen Denne.
More information about the sword-devel
mailing list