<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
From looking at Lucene, it appears that the only thing that it does not
do that JSword currently does is "blur". To do range searching in
Lucene, we would need to index the ordinal verse value along with the
verse. Then &gt; and &lt; can be used to do the range search. The
problem with storing the ordinal value is that it will make alternate
versification harder. Perhaps a better way would be to encode the
reference as something like:<br>
verse + chapter * 1000 + book * 1000000 + testament * 100000000 (not exactly
these powers of 10, but the smallest ones that would make it work).<br>
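<br>
A rough sketch of such an encoding, in Java. The class name VerseKey, the field widths, and the numbering (testament 0 or 1, books counted within a testament) are all assumptions for illustration, not values taken from JSword. The key is zero-padded on the assumption that the index compares terms as text:<br>
<br>
```java
// Hypothetical helper: packs (testament, book, chapter, verse) into one sortable int.
// The weights are illustrative assumptions, not the smallest workable ones.
final class VerseKey {
    // Assumed limits: up to 999 verses per chapter, 999 chapters per book, 99 books per testament.
    static final int CHAPTER_WEIGHT = 1000;
    static final int BOOK_WEIGHT = 1000 * 1000;
    static final int TESTAMENT_WEIGHT = 100 * 1000 * 1000;

    /** Testament is the most significant field, verse the least, so numeric order follows canonical order. */
    static int encode(int testament, int book, int chapter, int verse) {
        return verse + chapter * CHAPTER_WEIGHT + book * BOOK_WEIGHT + testament * TESTAMENT_WEIGHT;
    }

    /** Zero-pads the key so that text order and numeric order agree. */
    static String toIndexTerm(int key) {
        return String.format("%09d", key);
    }

    public static void main(String[] args) {
        int start = encode(0, 1, 1, 1);   // first book of the first testament, 1:1
        int end = encode(1, 4, 3, 16);    // fourth book of the second testament, 3:16
        System.out.println(toIndexTerm(start) + " .. " + toIndexTerm(end));
    }
}
```
<br>
Because each field has its own digits, an alternate versification would only need a different mapping into (testament, book, chapter, verse), not a renumbering of every verse.<br>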
<br>
Back on Blur: Lucene uses ~ as a word suffix operator. So if we
required ~ and ~n (where n is the blur factor) to be surrounded by
whitespace, we could use Lucene to do the "halves" and combine the
operations using logical AND.<br>
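<br>
A minimal sketch of that split, in plain Java with no Lucene calls. The class BlurSplitter is hypothetical, the default blur factor of 1 for a bare ~ is an assumption, and only the first blur operator in a query is handled:<br>
<br>
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: separates a query into the two "halves" around a blur operator.
final class BlurSplitter {
    // A blur operator is "~" or "~n" standing alone between whitespace.
    // A "~" glued to the end of a word is left alone, since Lucene reads that as a fuzzy suffix.
    private static final Pattern BLUR = Pattern.compile("\\s+~(\\d*)\\s+");

    /** Returns {left, right, blurFactor}, or null when the query has no standalone blur operator. */
    static String[] split(String query) {
        Matcher m = BLUR.matcher(query);
        if (!m.find()) {
            return null;
        }
        // Assumption: a bare "~" means a blur factor of 1.
        String factor = m.group(1).isEmpty() ? "1" : m.group(1);
        String left = query.substring(0, m.start()).trim();
        String right = query.substring(m.end()).trim();
        return new String[] { left, right, factor };
    }

    public static void main(String[] args) {
        String[] parts = split("bread ~2 life");
        System.out.println(parts[0] + " | " + parts[1] + " | blur " + parts[2]);
    }
}
```
<br>
Each half could then be submitted as its own search, with the blur applied to the verse results before they are combined.<br>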
<br>
Best match is also affected, but the effect may not be significant.
Currently it does a fuzzy match on the non-stop words in a phrase and
weights them differently than the straight search. I think that Lucene
already does weighting that takes the fuzziness into account. I think
it would be good to compare against a straight Lucene fuzzy match
to see if the results are significantly different. If they are not
significantly different, we would only need to account for blur and
ranges and can create a simple parser that would split searches with
blur and ranges into parts and submit each part.<br>
<br>
If this becomes the case, then JavaCC would be overkill.<br>
<br>
Advanced search would become more important as it would help build
complex searches. And the SearchSyntax becomes more helpful.<br>
<br>
I think I will start by creating SearchSyntax and applying it to the
existing code. Once that is done, we can then play with other search
engines to see what is better.<br>
<br>
Joe Walker wrote:
<blockquote cite="mid5dd4742605040816025a94661@mail.gmail.com"
type="cite"><br>
Having a SearchSyntax sounds like a good idea to me.<br>
<br>
It would be good if we could implement it using Lucene, we've talked
about using their query parser in the past.<br>
<br>
The problems of the search query parser probably come down to the way
it has evolved, which seems to be a common pitfall for any parser code
- the pattern seems to be that the parser evolves to the point where
squashing bugs becomes too regular and then someone sits down and
writes a grammar for it. I noticed that Groovy has just been through
this.<br>
I've dabbled with JavaCC successfully on a couple of projects, and once
tried to write a COBOL grammar - very unsuccessfully - so I know it can
be hard. This may well be overkill for our simple syntax?<br>
<br>
Other than that, go for it!<br>
<br>
Joe.<br>
<br>
<br>
<div><span class="gmail_quote">On Apr 8, 2005 12:52 PM, <b
class="gmail_sendername">DM Smith</b> &lt;<a
href="mailto:dmsmith555@gmail.com">dmsmith555@gmail.com</a>&gt; wrote:</span>
<blockquote class="gmail_quote"
style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I've
narrowed down some of the bugs of search. Seems that the tokenizer<br>
is not producing the correct stream of tokens.<br>
Specifically, the algorithm using the tokens goes something like this:<br>
<br>
while there are command tokens at the beginning of the stream, get the next one<br>
do<br>
have that command consume word tokens until it reaches a terminating condition<br>
done<br>
<br>
The problem of +[mat-rev]"bread of life" is that this produces a token<br>
stream where +[mat-rev] is not followed by a command token.<br>
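<br>
A toy Java version of that loop shows the failure. The token model is an assumption made for illustration: a command is a bare + or -, and its terminating condition is that it has consumed exactly one operand. The real tokenizer and commands differ:<br>
<br>
```java
// Hypothetical sketch of the command/word loop; not the JSword parser.
final class CommandLoopSketch {
    // Assumption: only "+" and "-" are command tokens; everything else is a word token.
    static boolean isCommand(String token) {
        return token.equals("+") || token.equals("-");
    }

    /** Pairs each command with the single operand it consumes, or fails on a malformed stream. */
    static String run(String[] tokens) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.length) {
            String command = tokens[i++];
            if (!isCommand(command)) {
                throw new IllegalStateException("expected a command token but found: " + command);
            }
            if (i >= tokens.length || isCommand(tokens[i])) {
                throw new IllegalStateException("command " + command + " has no operand");
            }
            // Assumed terminating condition: stop after one operand.
            out.append('(').append(command).append(' ').append(tokens[i++]).append(')');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] { "+", "[mat-rev]", "+", "\"bread of life\"" }));
        // The next call throws: the phrase arrives where a command token is expected.
        run(new String[] { "+", "[mat-rev]", "\"bread of life\"" });
    }
}
```
<br>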
<br>
In looking at this I noticed that there is what looks like a design<br>
problem. Consistently, elsewhere in JSword, an interface defines a wall<br>
that BibleDesktop and JSword do not look behind. However, this is not<br>
the case for searching.<br>
<br>
jsword.book.search<br>
provides the interfaces for Search and Index and factories to get<br>
implementations<br>
jsword.book.search.basic<br>
provides abstract/partial implementation of the interfaces<br>
jsword.book.search.parse<br>
provides an implementation of Searcher<br>
jsword.book.search.lucene<br>
provides an implementation of Indexer<br>
<br>
Based upon this I would have expected that no code (outside of the<br>
package) would have directly used jsword.book.search.parse code.<br>
<br>
The reason I noticed this was that I wanted to create another searcher<br>
and get it from the search factory. (Start with a copy and fix bugs,<br>
while retaining the ability to use BibleDesktop by changing the<br>
factory's properties.)<br>
<br>
What is being used is the syntax elements to programmatically construct<br>
a search. I'm thinking that we need YAI (yet another interface) for<br>
SearchSyntax. This would be able to:<br>
1) decorate individual words and phrases with appropriate syntax elements.<br>
SearchSyntax ss = SearchSyntaxFactory.getSearchSyntax();<br>
String decorated = ss.decorate(SyntaxType.STARTS_WITH, "bread of life");<br>
decorated = ss.decorate(SyntaxType.FIND_ALL_WORDS, "son of man");<br>
decorated = ss.decorate(SyntaxType.FIND_STRONG_NUMBERS, "1234 5678");<br>
decorated = ss.decorate(SyntaxType.BEST_MATCH, "....");<br>
decorated = ss.decorate(SyntaxType.PHRASE_SEARCH, "....");<br>
...<br>
<br>
2) create a token stream from a string.<br>
Token[] tokens = ss.tokenize("search string");<br>
or<br>
TokenStream tokens = ss.tokenize("search string");<br>
or<br>
...<br>
<br>
3) serialize a token stream to a string.<br>
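<br>
The three capabilities could be gathered into one interface, roughly as below. This is only a sketch: the String[] token type, the serialize name, and every marker that ToySearchSyntax emits are invented for illustration and are not the existing JSword syntax. Only three of the SyntaxType values are shown:<br>
<br>
```java
// Subset of the proposed syntax types, just enough for the sketch.
enum SyntaxType { STARTS_WITH, FIND_ALL_WORDS, PHRASE_SEARCH }

// Hypothetical shape of the proposed interface.
interface SearchSyntax {
    String decorate(SyntaxType type, String text);   // 1) add syntax elements to words or phrases
    String[] tokenize(String search);                // 2) string to token stream
    String serialize(String[] tokens);               // 3) token stream back to a string
}

// Toy implementation: the concrete markers (quotes, +, *) are assumptions, not JSword's.
final class ToySearchSyntax implements SearchSyntax {
    public String decorate(SyntaxType type, String text) {
        String[] words = text.trim().split("\\s+");
        switch (type) {
            case PHRASE_SEARCH:
                return "\"" + text + "\"";
            case FIND_ALL_WORDS:
                return "+" + String.join(" +", words);
            case STARTS_WITH:
                return String.join("* ", words) + "*";
            default:
                throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    public String[] tokenize(String search) {
        String trimmed = search.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public String serialize(String[] tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        SearchSyntax ss = new ToySearchSyntax();
        System.out.println(ss.decorate(SyntaxType.FIND_ALL_WORDS, "son of man"));
        System.out.println(ss.serialize(ss.tokenize("  bread   of life ")));
    }
}
```
<br>
With something like this behind a factory, callers such as BibleDesktop would build searches through the interface and never touch the parse package directly.<br>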
<br>
Input desired!<br>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>