<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

From looking at Lucene, it appears that the only thing that it does not

do that JSword currently does is "blur". To do range searching in

Lucene, we would need to index the ordinal verse value along with the

verse. Then &gt; and &lt; can be used to do the range search. The

problem with storing the ordinal value is that it will make alternate

versification harder. Perhaps a better way would be to encode the

reference as something like:<br>

verse + book * 10 + chapter * 1000 + testament * 1000000 (not exactly

these powers of 10, but the smallest ones that would make it work.)<br>
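<br>
As a rough sketch of that idea, the reference could be packed into a single sortable key. Everything here is an assumption for illustration: the class name VerseKey, the field widths, and the choice to weight the book above the chapter so that keys sort in canonical order. The key is zero-padded because a term-based range query compares terms as strings.<br>
<pre>
```java
// Hypothetical helper; not part of JSword or Lucene.
final class VerseKey {
    // Assumed field widths: room for 999 verses per chapter,
    // 999 chapters per book and 99 books per testament.
    static final int VERSE_SLOTS = 1000;
    static final int CHAPTER_SLOTS = 1000;
    static final int BOOK_SLOTS = 100;

    // Pack the reference so that numeric order follows
    // testament, then book, then chapter, then verse.
    static long encode(int testament, int book, int chapter, int verse) {
        long key = testament;
        key = key * BOOK_SLOTS + book;
        key = key * CHAPTER_SLOTS + chapter;
        key = key * VERSE_SLOTS + verse;
        return key;
    }

    // Zero-pad to a fixed width so that string comparison of the
    // indexed terms gives the same order as numeric comparison.
    static String toIndexTerm(long key) {
        return String.format("%010d", key);
    }
}
```
</pre>
A range search would then index toIndexTerm(encode(...)) for each verse and compare against the encoded start and end of the range. A different versification would only need different slot sizes or a different book numbering, not a renumbering of every verse.<br>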

<br>

Back on Blur: Lucene uses ~ as a word suffix operator. So if we

required ~ and ~n (where n is the blur factor) to be surrounded by

whitespace, we could use Lucene to do the "halves" and combine the

operations using logical AND.<br>
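<br>
A minimal sketch of that split, assuming a blur operator is recognized only when ~ or ~n stands alone between whitespace, so that a ~ attached to a word is left for Lucene. The class name BlurSplitter and the default factor of 1 for a bare ~ are assumptions, not existing JSword behavior.<br>
<pre>
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of JSword or Lucene.
final class BlurSplitter {
    // Matches "~" or "~n" only when surrounded by whitespace.
    private static final Pattern BLUR = Pattern.compile("\\s+~(\\d*)\\s+");

    // Returns { left, factor, right } when a blur operator is found,
    // otherwise a one-element array holding the trimmed query.
    static String[] split(String query) {
        Matcher m = BLUR.matcher(query);
        if (!m.find()) {
            return new String[] { query.trim() };
        }
        String left = query.substring(0, m.start()).trim();
        String right = query.substring(m.end()).trim();
        // Assumption: a bare "~" means a blur factor of 1.
        String factor = m.group(1).isEmpty() ? "1" : m.group(1);
        return new String[] { left, factor, right };
    }
}
```
</pre>
Each half could then be submitted to Lucene as its own query, with one result blurred by the factor before the two are combined.<br>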

<br>

Best match is also affected, but perhaps not significantly. Currently it

does a fuzzy match on the non-stop words in a phrase and

weights them differently than the straight search. I think that Lucene

already does weighting that takes the fuzziness into account. I think

it would be good to compare against a straight Lucene fuzzy match

to see if the results are significantly different. If they are not

significantly different, we would only need to account for blur and

ranges and can create a simple parser that would split searches with

blur and ranges into parts and submit each part.<br>

<br>

If that turns out to be the case, then javacc would be overkill.<br>

<br>

Advanced search would become more important as it would help build

complex searches. And the SearchSyntax becomes more helpful.<br>

<br>

I think I will start by creating SearchSyntax and applying it to the

existing code. Once that is done, we can then play with other search

engines to see what is better.<br>

<br>

Joe Walker wrote:

<blockquote cite="mid5dd4742605040816025a94661@mail.gmail.com"

 type="cite"><br>

Having a SearchSyntax sounds like a good idea to me.<br>

  <br>

It would be good if we could implement it using Lucene; we've talked

about using their query parser in the past.<br>

  <br>

The problems of the search query parser probably come down to the way

it has evolved, which seems to be a common pitfall for any parser code

- the pattern seems to be that the parser evolves to the point where

squashing bugs becomes too regular and then someone sits down and

writes a grammar for it. I noticed that Groovy has just been through

this.<br>

I've dabbled with javacc successfully on a couple of projects, and once

tried to write a COBOL grammar - very unsuccessfully, so I know it can

be hard. This may well be overkill for our simple syntax?<br>

  <br>

Other than that, go for it!<br>

  <br>

Joe.<br>

  <br>

  <br>

  <div><span class="gmail_quote">On Apr 8, 2005 12:52 PM, <b

 class="gmail_sendername">DM Smith</b> &lt;<a

 href="mailto:dmsmith555@gmail.com">dmsmith555@gmail.com</a>&gt; wrote:</span>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I've

narrowed down some of the search bugs. It seems that the tokenizer<br>

is not producing the correct stream of tokens.<br>

Specifically, the algorithm using the tokens goes something like this:<br>

    <br>

while there are command tokens at the beginning of the stream, get the next one<br>

do<br>

&nbsp;&nbsp;&nbsp;&nbsp;have that command consume word tokens until it reaches a terminating condition<br>

done<br>

    <br>

The problem with +[mat-rev]"bread of life" is that this produces a token<br>

stream where +[mat-rev] is not followed by a command token.<br>

    <br>

In looking at this I noticed that there is what looks like a design<br>

problem. Consistently, elsewhere in JSword, an interface defines a wall<br>

that BibleDesktop and JSword do not look behind. However, in the case<br>

of searching this is not the case.<br>

    <br>

jsword.book.search<br>

&nbsp;&nbsp;&nbsp;&nbsp;provides the interfaces for Search and Index and factories to get<br>

implementation<br>

jsword.book.search.basic<br>

&nbsp;&nbsp;&nbsp;&nbsp;provides abstract/partial implementation of the interfaces<br>

jsword.book.search.parse<br>

&nbsp;&nbsp;&nbsp;&nbsp;provides an implementation of Searcher<br>

jsword.book.search.lucene<br>

&nbsp;&nbsp;&nbsp;&nbsp;provides an implementation of Indexer<br>

    <br>

Based upon this I would have expected that no code (outside of the<br>

package) would have directly used jsword.book.search.parse code.<br>

    <br>

The reason I noticed this was that I wanted to create another searcher<br>

and get it from the search factory. (Start with a copy and fix bugs,<br>

while retaining the ability to use BibleDesktop by changing the<br>

factory's properties.)<br>

    <br>

The syntax elements are being used to programmatically construct<br>

a search. I'm thinking that we need YAI (yet another interface) for<br>

SearchSyntax. This would be able to:<br>

1) decorate individual words and phrases with appropriate syntax

elements.<br>

&nbsp;&nbsp;&nbsp;&nbsp;SearchSyntax ss = SearchSyntaxFactory.getSearchSyntax();<br>

&nbsp;&nbsp;&nbsp;&nbsp;String decorated = ss.decorate(SyntaxType.STARTS_WITH, "bread of life");<br>

&nbsp;&nbsp;&nbsp;&nbsp;decorated = ss.decorate(SyntaxType.FIND_ALL_WORDS, "son of man");<br>

&nbsp;&nbsp;&nbsp;&nbsp;decorated = ss.decorate(SyntaxType.FIND_STRONG_NUMBERS, "1234 5678");<br>

&nbsp;&nbsp;&nbsp;&nbsp;decorated = ss.decorate(SyntaxType.BEST_MATCH, "....");<br>

&nbsp;&nbsp;&nbsp;&nbsp;decorated = ss.decorate(SyntaxType.PHRASE_SEARCH, "....");<br>

&nbsp;&nbsp;&nbsp;&nbsp;...<br>

    <br>

2) create a token stream from a string.<br>

&nbsp;&nbsp;&nbsp;&nbsp;Token[] tokens = ss.tokenize("search string");<br>

&nbsp;&nbsp;&nbsp;&nbsp;or<br>

&nbsp;&nbsp;&nbsp;&nbsp;TokenStream tokens = ss.tokenize("search string");<br>

&nbsp;&nbsp;&nbsp;&nbsp;or<br>

&nbsp;&nbsp;&nbsp;&nbsp;...<br>

    <br>

3) serialize a token stream to a string.<br>
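<br>
Put together, the three responsibilities above might look roughly like the following. This is only a sketch under assumptions: the method names follow the calls shown above, tokens are plain strings rather than a Token type, and WhitespaceSearchSyntax with its quoting rule is a placeholder, not the real JSword search syntax.<br>
<pre>
```java
// Hypothetical sketch; these types do not exist in JSword as written here.
enum SyntaxType {
    STARTS_WITH, FIND_ALL_WORDS, FIND_STRONG_NUMBERS, BEST_MATCH, PHRASE_SEARCH
}

interface SearchSyntax {
    // 1) decorate words and phrases with the appropriate syntax elements
    String decorate(SyntaxType type, String text);

    // 2) create a token stream from a string
    String[] tokenize(String search);

    // 3) serialize a token stream back to a string
    String serialize(String[] tokens);
}

// Placeholder implementation: quotes phrases and splits on whitespace.
final class WhitespaceSearchSyntax implements SearchSyntax {
    public String decorate(SyntaxType type, String text) {
        // Assumption: only phrase searches get a decoration in this sketch.
        if (type == SyntaxType.PHRASE_SEARCH) {
            return "\"" + text + "\"";
        }
        return text;
    }

    public String[] tokenize(String search) {
        String trimmed = search.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public String serialize(String[] tokens) {
        return String.join(" ", tokens);
    }
}
```
</pre>
With an interface like this, client code would depend only on SearchSyntax, and a different searcher could supply its own implementation through the factory.<br>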

    <br>

Input desired!<br>

  </blockquote>

  </div>

</blockquote>

<br>

</body>

</html>