[sword-devel] Searching and Lucene thoughts

Chris Little chrislit at crosswire.org
Thu Mar 3 14:14:25 MST 2005


Stephen,

I didn't actually say anything about Boyer-Moore looking at the ends of 
words. The issue with suffixing languages isn't caused by the 
algorithm's interaction with the document being searched. It is caused 
by the search patterns themselves. In other words--it's not because the 
algorithm skips from word to word in the document--it's because search 
patterns will predominantly be words and in English they are often suffixed.

Your summary of the algorithm itself is helpful, but I still wonder at 
why the abstract itself specifically cites English data--maybe that's 
not relevant, but I wonder why it was cited in the abstract if that's so.

My statement that I couldn't understand how this speeds search times 
reflects that suffixing actually decreases the variety of characters you 
will find towards the ends of English words. Cross-linguistically, you 
tend to find the richest set of sounds (and hence characters) in the 
roots of words, so I would expect that starting with the n/2 character 
of a string of n characters would probably yield better search times 
overall, using a Boyer-Moore-like algorithm.

But to address the real issue here: there are about 20 million things to 
do for Sword that are more important than changing the unindexed search 
algorithm. We already addressed search times by adding Lucene.

--Chris

Stephen Denne wrote:
> Chris,
> 
> The Boyer-Moore algorithm is only language specific in that it works better
> in languages that have larger alphabets, and consequently less chance of a
> single character matching a searched for character.
> 
> It is a general purpose search algorithm.
> 
> It does not look at the end of each word, that is a misunderstanding of the
> algorithm. It is not based on word suffixes.
> 
> Basically, if it is looking for a 10 character long sequence, it checks the
> 10th character, the 20th character, the 30th character, the 40th character,
> etc, until it finds a character that is in the 10 character long sequence it
> is looking for. If it isn't in there, then there was no point checking any
> of the previous 9 characters.
> 
> Stephen Denne.



More information about the sword-devel mailing list