[jsword-devel] strongs search
DM Smith
dmsmith555 at yahoo.com
Sat May 17 04:38:35 MST 2008
On May 16, 2008, at 8:30 AM, Mullins, Steven wrote:
> DM,
>
> Thanks for the tips and direction, it is much appreciated!
> I'm going to work on these issues as time allows. I may
> still have to bug you with a question or two as I learn how
> jsword is structured. I'm very new to Java and object-oriented
> programming in general (unless you count Python). I tend to
> think and write procedurally (i.e., in C, Perl, and Fortran),
> but will try hard to fit the paradigm of the existing code.
>
> I'd really like to see jsword on par with BibleWorks:
> http://www.bibleworks.com/ in the area of searching and
> morphological analysis of greek texts. I think with some
> work we can get it there.
Yes, this would be great. Here are some ideas. (Some are in Jira, which
is down at the moment, so we can't get to our issues database.)
You may find the following of interest:
https://issues.apache.org/jira/browse/LUCENE-1284
This is a contribution to Lucene that allows words to be broken up
into their constituent parts for searching. This is very important for
languages that have compound words, such as German. Basically, a word
such as "hotdog" becomes searchable as both "hot" and "dog".
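The idea behind that contribution can be sketched in plain Java. This is only an illustration of dictionary-based compound splitting, not the actual Lucene filter or its API; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of dictionary-based compound splitting,
// not the Lucene contribution itself.
public class CompoundSplitter {
    // Emit the original token plus every dictionary word found inside
    // it, so "hotdog" also matches searches for "hot" and "dog".
    public static List<String> split(String word, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        parts.add(word);
        for (int i = 0; i < word.length(); i++) {
            for (int j = i + 2; j <= word.length(); j++) { // parts of length >= 2
                String sub = word.substring(i, j);
                if (!sub.equals(word) && dictionary.contains(sub)) {
                    parts.add(sub);
                }
            }
        }
        return parts;
    }
}
```

At indexing time each emitted part would become a searchable term at the same position as the compound.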
There is also some work going on regarding n-grams. The basic idea
here is that some languages (e.g. Thai and Japanese) do not have word
boundaries. Searching in these languages is the process of finding
substring matches.
This is discussed here:
https://issues.apache.org/jira/browse/LUCENE-1224
https://issues.apache.org/jira/browse/LUCENE-1225
and in some threads on jira-dev
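The n-gram idea can be sketched as follows: index every overlapping n-character window of the text, tokenize queries the same way, and substring search reduces to ordinary term matching. A minimal stdlib-only illustration (the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of character n-gram tokenization for languages without word
// boundaries: every overlapping window of n characters becomes a term.
public class NGramTokenizer {
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

For example, bigrams of "abcd" are "ab", "bc", "cd"; a query is broken up the same way, so any substring of the indexed text can be found.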
I don't know whether either of these has applicability to Greek or
Hebrew.
The other thing that we need is the ability to strip accents, vowel
points and cantillation.
Soon we will have a Greek text with accents. When we do, it will be
important to search with and without regard to accents.
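One way to get accent-insensitive search with the standard library is to decompose to NFD so accents become separate combining marks, strip the marks, and recompose. This is a sketch of the general technique, not code from jsword:

```java
import java.text.Normalizer;

// Sketch: NFD decomposition turns accented letters into base letter +
// combining mark; stripping the marks and recomposing yields the
// unaccented form. Applying this to both the indexed text and the
// query gives accent-insensitive search.
public class AccentStripper {
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String bare = decomposed.replaceAll("\\p{M}+", "");
        return Normalizer.normalize(bare, Normalizer.Form.NFC);
    }
}
```

With this, "λόγος" strips to "λογος", so a query typed without accents still matches the accented text.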
To search a Unicode text reliably, we need to normalize the text
before storing it and normalize search requests the same way before
doing the search. (Unicode has several normalization forms.) The texts
should already be in NFC, but that may not be the best form for
indexing and searching.
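Java's `java.text.Normalizer` handles this directly. The point is that canonically equivalent spellings, such as a precomposed accented letter versus a base letter plus a combining mark, only compare equal after both sides are normalized to the same form:

```java
import java.text.Normalizer;

// Sketch: normalize to one form (here NFC) before indexing, and
// normalize every query the same way, so canonically equivalent
// spellings produce identical terms.
public class NormalizeDemo {
    public static String toNFC(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

For instance, "e" + combining acute (U+0065 U+0301) and precomposed "é" (U+00E9) are different strings until both are normalized, after which they are identical.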
The same goes for Hebrew. With Hebrew it is also important to be able
to remove cantillation for the sake of readability.
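This is easier than it sounds because the Hebrew cantillation marks occupy their own Unicode range, U+0591 through U+05AF, separate from the vowel points that begin at U+05B0. A sketch that drops cantillation while keeping the vowels:

```java
// Sketch: Hebrew cantillation marks live in U+0591..U+05AF, while
// vowel points (niqqud) start at U+05B0, so cantillation can be
// stripped for readability without losing the vowels.
public class CantillationStripper {
    public static String strip(String text) {
        return text.replaceAll("[\\u0591-\\u05AF]", "");
    }
}
```

Removing the vowel points as well, for search purposes, would just mean widening the character class.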
I'd also like the ability added to transliterate these texts. Chris
Little has done some wonderful work here for the Sword engine. This
would help beginners learn how to read Greek and Hebrew texts. It
might also help as an additional normalization form to index.
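To make the idea concrete, here is a toy map-based transliterator. This is not Chris Little's Sword engine work, just a hedged illustration of the simplest possible scheme, with a deliberately tiny letter table:

```java
import java.util.Map;

// Toy sketch of table-driven transliteration (not the Sword engine's
// implementation): replace each Greek letter with a Latin equivalent,
// passing unknown characters through unchanged.
public class Transliterator {
    private static final Map<Character, String> GREEK = Map.of(
        'λ', "l", 'ο', "o", 'γ', "g", 'σ', "s", 'ς', "s",
        'θ', "th", 'ε', "e");

    public static String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(GREEK.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }
}
```

A real implementation would strip accents first (so each base letter hits the table) and handle multi-character rules, but even this shape shows how a transliterated form could double as another normalization to index.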
I think it would be interesting, in a work like the KJV, to do a
Strong's search that retrieves a list of the different translations of
a particular Strong's number.
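In data-structure terms this is just grouping: walk the tagged text, and for each (Strong's number, translated word) pair, collect the distinct renderings. A sketch, assuming such pairs can be pulled from a Strong's-tagged text (the class and method names are made up):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch, assuming a stream of (Strong's number, translated word)
// pairs extracted from a tagged text like the KJV: collect the
// distinct English renderings of each Strong's number.
public class StrongsIndex {
    private final Map<String, Set<String>> renderings = new LinkedHashMap<>();

    public void add(String strongsNumber, String translation) {
        renderings.computeIfAbsent(strongsNumber, k -> new LinkedHashSet<>())
                  .add(translation);
    }

    public Set<String> translationsOf(String strongsNumber) {
        return renderings.getOrDefault(strongsNumber, Set.of());
    }
}
```

For example, G26 (agape) would come back with both "love" and "charity" in the KJV.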
What are some of your ideas?
I will be focused on adding BookMarks for the next release (after the
one that is about to be done now) and won't be able to get to much of
anything else but bug fixes.
In Him,
DM