[sword-devel] Searching and Lucene thoughts
dmsmith555 at yahoo.com
dmsmith555 at yahoo.com
Wed Mar 2 14:49:18 MST 2005
I have implemented the Boyer-Moore before. I think that it is a bit
biased toward common prefix languages as you stated. But it would work
on any. The problem that I encountered is that it is easy to write for 7
bit ascii and later for 8 bit latin-1, but the compiled fsa becomes
difficult for a large character set. (I.e. my simple algorithm won't work)
The reason it works is that when it fails on matching a word, it does
not need to start at the next character, but it can skip ahead. The
longer the word, the better the speed.
Here is a very basic understanding: if in searching for the word
"blessing" one were to match "bl" and then fail on an "a" you know that
you do not need to restart on "l" or "a" but you can start on a later
letter. Where you start is based upon the letters in the word being
searched and the letter that you failed upon.
However, the Boyer-Moyer does not start from the beginning of the word,
but from the end. So it tries to match the "g" first and then the "n".
If it were looking at the word blasphemy, the eigth letter of
"blessing", the "g", would have been checked against the eigth leter of
"blasphemy", the "m". Since "g" does not occur in "blessing" more than
once and since "m" does not occur in it at all, we can then skip forward
8 letters. This is a tremendous savings.
If the movement from latin-1 to utf-8 and old search to lucene is at all
slow, then I think it may be worth it to implement it.
Chris Little wrote:
> No. Standard Sword searches just start at the beginning and search to
> the end, byte by byte.
>
> Just on the basis of the abstract you link to, I don't see how this
> would be of any benefit. The Boyer-Moore algorithm is very
> language-specific. It benefits from the fact that English is a
> predominantly suffixing language, as are most European languages, I
> would say. Personally, I have difficulty imagining how this actually
> speeds search times, but I assume they've done testing and that their
> claims are accurate.
>
> The standard linear search is the most general purpose search
> algorithm, and I think general purpose is what we need to maintain.
> For people who want faster searches, there is indexed searching
> available.
>
> --Chris
>
>
> Lynn Allan wrote:
>
>> <alert comment="iwnacsmndipootv ... i was not a computer science major
>> ... ">
>>
>> Just curious ... does non-indexed sword-api searching use c.s.
>> algorithms like Boyer-Moore searching?
>> http://portal.acm.org/citation.cfm?id=359859&coll=ACM&dl=ACM&CFID=13545783&CFTOKEN=93236524
>>
>>
>> Something I tried to read once (and it was waaaaaay over my head)
>> concerned very smart "state machine" searching when there is more than
>> one word being searched for. Seems like it involved Bell Lab
>> researchers? From one of the A or W or K dudes?
>> http://portal.acm.org/citation.cfm?id=360855&coll=ACM&dl=ACM&CFID=13626066&CFTOKEN=93658335
>>
>>
>> Does D. Knuth discuss string matching optimizations?
>>
>> Would that be applicable to the sword-api?
>>
>> </alert>
>>
>>
>> _______________________________________________
>> sword-devel mailing list
>> sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>
>
> _______________________________________________
> sword-devel mailing list
> sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
>
More information about the sword-devel
mailing list