<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Assuming the behavior of CLucene is the same as Lucene (the Java
version), I think I can partially answer some of these:<br>
Sword is using the standard analyzer.<br>
<br>
The standard analyzer ignores a list of common words, such as (a, the, in,
an, on, ...). In Lucene terms these are stop words. This list may or
may not be appropriate for biblical research. The simple analyzer does
not have a stop list. The standard analyzer also does a lot of other
things, which have no net effect on Bibles. In terms of index size, the
difference between the two is 2M versus 2.6M. Both take about
the same length of time to build an index, though the simple analyzer is
just a touch faster.<br>
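<br>
To make the stop list concrete, here is a toy Python model of the two
analyzers. It is not Lucene or CLucene code: the function names are made
up, the stop list is a hypothetical, abbreviated one, and the real
standard analyzer does more than this.<br>
<pre>
```python
import re

# Hypothetical, abbreviated stop list; the real list is longer.
STOP_WORDS = {"a", "an", "the", "in", "on", "of"}

def simple_analyze(text):
    """Toy 'simple' analyzer: split on non-letters and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def standard_analyze(text):
    """Toy 'standard' analyzer: the same tokens, minus the stop words."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]

phrase = "The Son of God"
print(simple_analyze(phrase))    # all four words, lowercased
print(standard_analyze(phrase))  # only the words that are not stop words
```
</pre>
With this sketch, "The Son of God" keeps all four words under the simple
model but only "son" and "god" under the standard one, which is why a
stop word can never be found in an index built that way.<br>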
<br>
Lucene does case insensitive searching here. It always does. Both the
standard and the simple analyzer lowercase every word, so it does not
matter which of the two is used.<br>
<br>
Lucene only looks for what you tell it to look for. If you want partial
words you have to construct a search with wildcards. Lucene's query
parser does not accept a wildcard at the beginning of a word, so it
cannot find words ending with "ration". In addition to the typical
wildcards (* and ?), Lucene also has ~, which when added to the end of a
word will find words like the one you entered. This is very useful for
finding words whose spelling is close but not correct (e.g. abimeleck~).<br>
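<br>
As a rough illustration of the two kinds of match, here is a small Python
sketch. The ~ operator in Lucene is based on edit distance; everything
else here is an assumption for the example: the term list is made up,
the cutoff of 2 edits is arbitrary, and Lucene's own scoring and
thresholds differ.<br>
<pre>
```python
from fnmatch import fnmatchcase

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

# Hypothetical slice of an index's term list (already lowercased).
TERMS = ["abimelech", "generation", "generations", "regeneration"]

def wildcard(pattern):
    """Terms matching a trailing-wildcard pattern such as 'generat*'."""
    return [t for t in TERMS if fnmatchcase(t, pattern)]

def fuzzy(word, max_edits=2):
    """Terms within max_edits edits of word, like 'word~' in spirit."""
    return [t for t in TERMS if max_edits >= edit_distance(word, t)]

print(wildcard("generat*"))
print(fuzzy("abimeleck"))
```
</pre>
Here "generat*" picks up "generation" and "generations" but not
"regeneration", and the misspelled "abimeleck" is one edit away from
"abimelech".<br>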
<br>
By default, Lucene uses "OR" as a connector between words. To require
that two words both be in a search result, use "AND" between them or
prefix each word with "+" (like Google).<br>
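<br>
The difference between the two connectors can be sketched over a tiny
inverted index in Python. This is only a model of the idea, not how
Lucene stores or scores anything; the three "verses" and their ids are
invented, and this toy index does not drop stop words.<br>
<pre>
```python
# Hypothetical three-verse corpus keyed by made-up verse ids.
VERSES = {
    "v1": "the son of man",
    "v2": "the word of god",
    "v3": "the son of god",
}

def build_index(verses):
    """Map each lowercased word to the set of verse ids containing it."""
    index = {}
    for vid, text in verses.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(vid)
    return index

INDEX = build_index(VERSES)

def search_or(words):
    """Default behavior: a verse matches if it has any of the words."""
    hits = set()
    for w in words:
        hits |= INDEX.get(w, set())
    return sorted(hits)

def search_and(words):
    """Like 'son AND god' or '+son +god': every word must be present."""
    sets = [INDEX.get(w, set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

print(search_or(["son", "god"]))   # ['v1', 'v2', 'v3']
print(search_and(["son", "god"]))  # ['v3']
```
</pre>
The OR form returns every verse with either word, which is why a
multi-word search without "AND" or "+" can return far more hits than
expected.<br>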
<br>
Lynn Allan wrote:
<blockquote cite="mid041d01c56f0f$62f93920$0200a8c0@k2" type="cite">
<blockquote type="cite">
<pre wrap="">Here is my best attempt at a beta before I leave for the summer. Give
it a go and let me know what you think.
</pre>
</blockquote>
<pre wrap="">
Have a great summer.
Some questions about indexed/optimized searching:
* Does it always do case-insensitive searching even when the "Case
Sensitive" checkbox is checked? With the AKJV and "Case Sensitive"
checked, it finds 942 matches for "Jesus" and 942 matches for "jesus".
</pre>
</blockquote>
Lucene is case insensitive.<br>
<blockquote cite="mid041d01c56f0f$62f93920$0200a8c0@k2" type="cite">
<pre wrap="">
* Does indexed searching always do a match on the exact word? For
example, with "Phrase" or "Multi Word" or "Optimized", there are 2
matches for "regeneration" using the AKJV. Phrase and MultiWord find
275 matches for "ration", including the times it is within
"generation" and "regeneration". Optimized search finds 0. Perhaps
this is how it is supposed to work, but it seems like an end-user
might find it unexpected that Optimized Searching gives results that
are very different from "Phrase" and "MultiWord" searching. There
aren't "clues" that Optimized Searching has different behavior.
Perhaps the "Case Insensitive" checkbox should be unchecked and/or
disabled?
</pre>
</blockquote>
It always does an exact word match unless the request uses wildcards.<br>
<blockquote cite="mid041d01c56f0f$62f93920$0200a8c0@k2" type="cite">
<pre wrap="">
* Perhaps similarly unexpected, MultiWord searching for "son of god"
results in 294 case insensitive matches, "Phrase" found 47, and
"Optimized" found 5472. After this search, the Optimizing seemed
disabled, because searching for "son" took about 20 seconds. Then
the next search for "of" crashed ("floating point division by zero")
</pre>
</blockquote>
Lucene when given &lt;&lt;son of god&gt;&gt; will find all verses with
&lt;&lt;son&gt;&gt; OR &lt;&lt;god&gt;&gt;.<br>
BibleDesktop/JSword has the same performance problem when searching for
"son" when we show 1000 verses at a time, but showing 50 at a time fixes
the problem. Looking at JSword in the debugger, I find that the answer is
returned almost immediately; it is the processing of it that takes the
time. Part of the problem is that half of the time goes to fetching
verses from the module. Since the verses are spread out across the book,
getting them requires a lot of disk hits and, if the module is
compressed, a lot of CPU. (With one read we cache many adjacent verses
and serve them out of there.) When we list the hits based on score, I
find that the module read cache is invalidated very often and we have to
re-read from disk. I don't know how Sword does it, but since JSword is
based upon it, it might not be too far different.<br>
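<br>
The cache effect can be modeled with a few lines of Python. This is a
sketch of the behavior described above, not JSword or Sword code: the
block size, the single-block cache, and the hit list are all
assumptions made for illustration.<br>
<pre>
```python
BLOCK_SIZE = 10  # hypothetical: adjacent verses fetched per disk read

def count_reads(verse_numbers):
    """Count disk reads with a cache holding only the last block read."""
    reads = 0
    cached_block = None
    for n in verse_numbers:
        block = n // BLOCK_SIZE
        if block != cached_block:
            reads += 1            # cache miss: fetch the whole block
            cached_block = block
    return reads

# Made-up hit list in score order: consecutive hits are far apart.
hits_by_score = [3, 57, 5, 91, 58, 4, 93, 52]
hits_by_position = sorted(hits_by_score)

print(count_reads(hits_by_score))     # 8
print(count_reads(hits_by_position))  # 3
```
</pre>
In this model every hit fetched in score order misses the cache, while
the same hits fetched in book order need only one read per block.<br>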
<blockquote cite="mid041d01c56f0f$62f93920$0200a8c0@k2" type="cite">
<pre wrap="">
This was the second time it crashed ... sorry don't have repeatable
sequence of actions ... except that each time Searching was
effectively disabled. The button that should be "Search" was "Halt"
and stayed as "Halt" even when the Search dialog was dismissed and
reentered. I had to shut-down BibleCS to get searching to work again.
Here's a repeatable sequence to cause a crash: AKJV Optimize search
for "son of god", then search for "son", then search for "of" ...
crash.
Actually, it is simpler ... search for a very common word like "of" or
"the" or "a"
</pre>
</blockquote>
"of" is a stop word, so it is not indexed. It will return zero hits.<br>
I found that the crash did not happen when searching for "buzzard", which
is not in the KJV. Hmmm.<br>
Since that search also returns zero hits, I would have expected it to
fail too if the problem were in Sword's handling of the answer set.<br>
<blockquote cite="mid041d01c56f0f$62f93920$0200a8c0@k2" type="cite">
<pre wrap="">
In case the index needed rebuilding, I deleted the AKJV index and
clicked on the "Create Index" button. This caused a "C++ Exception"
message to show up???
Odd ... after the crash, the AKJV seemed to have "forgotten" that it
had an index file created ... that option wasn't available. I had to
switch to another module and back to AKJV for it to realize it had the
index file created.
Very odd .... while trying out different searches, it has twice
happened that the search source switched from AKJV to "Personal
Commentary." This was without the "Choose Module" showing, so I don't
think it was anything I did.
I'll rebuild the indices and see if the behavior is repeatable.
HTH
</pre>
</blockquote>
</body>
</html>