[sword-devel] Searching and Lucene thoughts

Wed Mar 2 14:49:18 MST 2005

I have implemented Boyer-Moore before. I think that it is a bit 
biased toward common prefix languages as you stated, but it would work 
on any language. The problem that I encountered is that it is easy to 
write for 7-bit ASCII and later for 8-bit Latin-1, but the compiled FSA 
becomes difficult for a large character set (i.e. my simple algorithm 
won't work).
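
A minimal sketch of one way the table-size problem could be sidestepped, 
assuming a sparse skip table is acceptable: store entries only for the 
characters that occur in the search word and treat every other character 
as "skip the whole word length". This is not Sword code, and the names 
build_skip_table and skip_for are hypothetical, chosen for illustration.

```python
def build_skip_table(word):
    """Map each character of word (except the last) to its distance from the end.

    Only characters that occur in the word get an entry, so the table size
    depends on the word, not on the size of the character set.
    """
    last = len(word) - 1
    table = {}
    for i, ch in enumerate(word[:-1]):
        table[ch] = last - i  # a later occurrence overwrites an earlier one
    return table


def skip_for(table, word, ch):
    """Distance to skip when ch is the text character under the word's last letter."""
    # Any character not in the table cannot line up with the word at all
    # (hypothetical default: skip the full word length).
    return table.get(ch, len(word))


table = build_skip_table("blessing")
print(table)
print(skip_for(table, "blessing", "m"))
```

With this layout the table for "blessing" has six entries whether the 
text is ASCII, Latin-1 or decoded UTF-8.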

The reason it works is that when it fails on matching a word, it does 
not need to start at the next character, but it can skip ahead. The 
longer the word, the better the speed.

Here is the basic idea: if, in searching for the word "blessing", one 
were to match "bl" and then fail on an "a", you know that you do not 
need to restart on "l" or "a" but can start on a later letter. Where 
you start is based upon the letters in the word being searched for and 
the letter that you failed upon.

However, Boyer-Moore does not start from the beginning of the word, 
but from the end. So it tries to match the "g" first and then the "n". 
If it were looking at the word "blasphemy", the eighth letter of 
"blessing", the "g", would be checked against the eighth letter of 
"blasphemy", the "m". Since "g" does not occur in "blessing" more than 
once and since "m" does not occur in it at all, we can then skip forward 
8 letters. This is a tremendous savings.
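
The right-to-left comparison and the skip described above can be sketched 
as follows. This is the Horspool simplification (only the "failed letter" 
rule, keyed on the text character under the last letter of the word), not 
the full Boyer-Moore with its second shift rule, and not Sword's code. 
The function name horspool_find and the comparison counter are 
hypothetical, added only to make the savings visible.

```python
def horspool_find(text, word):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(word), len(text)
    if m == 0:
        return 0, 0
    # Distance from each letter (except the last) to the end of the word.
    skip = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        j = m - 1
        # Compare from the end of the word backwards.
        while j >= 0:
            comparisons += 1
            if text[pos + j] != word[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Shift by the entry for the text character under the word's last
        # letter; a character that is not in the word skips the whole length.
        pos += skip.get(text[pos + m - 1], m)
    return -1, comparisons


print(horspool_find("blasphemy", "blessing"))
print(horspool_find("blasphemy and blessing", "blessing"))
```

In the first call the "g" is compared against the "m" once, the search 
skips 8 letters, and there is no room left for another attempt.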

If the movement from Latin-1 to UTF-8 and from the old search to Lucene 
is at all slow, then I think it may be worth implementing Boyer-Moore.

Chris Little wrote:

> No. Standard Sword searches just start at the beginning and search to 
> the end, byte by byte.
>
> Just on the basis of the abstract you link to, I don't see how this 
> would be of any benefit. The Boyer-Moore algorithm is very 
> language-specific. It benefits from the fact that English is a 
> predominantly suffixing language, as are most European languages, I 
> would say. Personally, I have difficulty imagining how this actually 
> speeds search times, but I assume they've done testing and that their 
> claims are accurate.
>
> The standard linear search is the most general purpose search 
> algorithm, and I think general purpose is what we need to maintain. 
> For people who want faster searches, there is indexed searching 
> available.
>
> --Chris
>
>
> Lynn Allan wrote:
>
>> <alert comment="iwnacsmndipootv ... i was not a computer science major
>> ... ">
>>
>> Just curious ... does non-indexed sword-api searching use c.s.
>> algorithms like Boyer-Moore searching?
>> http://portal.acm.org/citation.cfm?id=359859&coll=ACM&dl=ACM&CFID=13545783&CFTOKEN=93236524 
>>
>>
>> Something I tried to read once (and it was waaaaaay over my head)
>> concerned very smart "state machine" searching when there is more than
>> one word being searched for. Seems like it involved Bell Lab
>> researchers? From one of the A or W or K dudes?
>> http://portal.acm.org/citation.cfm?id=360855&coll=ACM&dl=ACM&CFID=13626066&CFTOKEN=93658335 
>>
>>
>> Does D. Knuth discuss string matching optimizations?
>>
>> Would that be applicable to the sword-api?
>>
>> </alert>
>>
>>
>> _______________________________________________
>> sword-devel mailing list
>> sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel