[jsword-devel] strongs search
DM Smith
dmsmith555 at yahoo.com
Sat May 17 04:38:35 MST 2008
On May 16, 2008, at 8:30 AM, Mullins, Steven wrote:
> DM,
>
> Thanks for the tips and direction, it is much appreciated!
> I'm going to work on these issues as time allows. I may
> still have to bug you with a question or two as I learn how
> jsword is structured. I'm very new to Java and object-oriented
> programming in general (unless you count Python). I tend to
> think and write procedurally (i.e., in C, Perl, and Fortran),
> but will try hard to fit the paradigm of the existing code.
>
> I'd really like to see jsword on par with BibleWorks:
> http://www.bibleworks.com/ in the area of searching and
> morphological analysis of greek texts. I think with some
> work we can get it there.
Yes, this would be great. Here are some ideas. (Some are in Jira, which
is down at the moment, so we can't get to our issues database.)
You may find the following of interest:
https://issues.apache.org/jira/browse/LUCENE-1284
This is a contribution to Lucene that allows words to be broken up
into their constituent parts for searching. This is very important for
languages that have compound words, such as German. Basically, a word
such as "hotdog" becomes searchable as both "hot" and "dog".
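The idea behind that contribution can be sketched in plain Java. This is only an illustration of dictionary-based compound splitting, not the actual Lucene filter or its API; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of dictionary-based compound splitting,
// not the Lucene contribution itself.
public class CompoundSplitter {
    // Emit the original token plus every dictionary word found inside
    // it, so "hotdog" also matches searches for "hot" and "dog".
    public static List<String> split(String word, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        parts.add(word);
        for (int i = 0; i < word.length(); i++) {
            for (int j = i + 2; j <= word.length(); j++) { // parts of length >= 2
                String sub = word.substring(i, j);
                if (!sub.equals(word) && dictionary.contains(sub)) {
                    parts.add(sub);
                }
            }
        }
        return parts;
    }
}
```

At indexing time each emitted part would become a searchable term at the same position as the compound.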
There is also some work going on regarding n-grams. The basic idea
here is that some languages (e.g. Thai and Japanese) do not have word
boundaries. Searching in these languages is the process of finding
substring matches.
This is discussed here:
https://issues.apache.org/jira/browse/LUCENE-1224
https://issues.apache.org/jira/browse/LUCENE-1225
and in some threads on jira-dev
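The n-gram idea can be sketched as follows: index every overlapping n-character window of the text, tokenize queries the same way, and substring search reduces to ordinary term matching. A minimal stdlib-only illustration (the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of character n-gram tokenization for languages without word
// boundaries: every overlapping window of n characters becomes a term.
public class NGramTokenizer {
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

For example, bigrams of "abcd" are "ab", "bc", "cd"; a query is broken up the same way, so any substring of the indexed text can be found.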
I don't know whether either of these has applicability to Greek or
Hebrew.
The other thing that we need is the ability to strip accents, vowel
points and cantillation.
Soon we will have a Greek text with accents. When we do, it will be
important to search with and without regard to accents.
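One way to get accent-insensitive search with the standard library is to decompose to NFD so accents become separate combining marks, strip the marks, and recompose. This is a sketch of the general technique, not code from jsword:

```java
import java.text.Normalizer;

// Sketch: NFD decomposition turns accented letters into base letter +
// combining mark; stripping the marks and recomposing yields the
// unaccented form. Applying this to both the indexed text and the
// query gives accent-insensitive search.
public class AccentStripper {
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String bare = decomposed.replaceAll("\\p{M}+", "");
        return Normalizer.normalize(bare, Normalizer.Form.NFC);
    }
}
```

With this, "λόγος" strips to "λογος", so a query typed without accents still matches the accented text.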
To search a Unicode text reliably, we need to normalize the text
before storing it and normalize search requests the same way before
doing the search. (Unicode has several normalization forms.) The texts
should already be in NFC, but that may not be the best form for
indexing and searching.
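Java's `java.text.Normalizer` handles this directly. The point is that canonically equivalent spellings, such as a precomposed accented letter versus a base letter plus a combining mark, only compare equal after both sides are normalized to the same form:

```java
import java.text.Normalizer;

// Sketch: normalize to one form (here NFC) before indexing, and
// normalize every query the same way, so canonically equivalent
// spellings produce identical terms.
public class NormalizeDemo {
    public static String toNFC(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

For instance, "e" + combining acute (U+0065 U+0301) and precomposed "é" (U+00E9) are different strings until both are normalized, after which they are identical.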
The same goes for Hebrew. With Hebrew it is also important to be able
to remove cantillation for the sake of readability.
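This is easier than it sounds because the Hebrew cantillation marks occupy their own Unicode range, U+0591 through U+05AF, separate from the vowel points that begin at U+05B0. A sketch that drops cantillation while keeping the vowels:

```java
// Sketch: Hebrew cantillation marks live in U+0591..U+05AF, while
// vowel points (niqqud) start at U+05B0, so cantillation can be
// stripped for readability without losing the vowels.
public class CantillationStripper {
    public static String strip(String text) {
        return text.replaceAll("[\\u0591-\\u05AF]", "");
    }
}
```

Removing the vowel points as well, for search purposes, would just mean widening the character class.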
I'd also like the ability added to transliterate these texts. Chris
Little has done some wonderful work here for the Sword engine. This
would help beginners learn how to read Greek and Hebrew texts. It
might also help as an additional normalization form to index.
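To make the idea concrete, here is a toy map-based transliterator. This is not Chris Little's Sword engine work, just a hedged illustration of the simplest possible scheme, with a deliberately tiny letter table:

```java
import java.util.Map;

// Toy sketch of table-driven transliteration (not the Sword engine's
// implementation): replace each Greek letter with a Latin equivalent,
// passing unknown characters through unchanged.
public class Transliterator {
    private static final Map<Character, String> GREEK = Map.of(
        'λ', "l", 'ο', "o", 'γ', "g", 'σ', "s", 'ς', "s",
        'θ', "th", 'ε', "e");

    public static String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(GREEK.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }
}
```

A real implementation would strip accents first (so each base letter hits the table) and handle multi-character rules, but even this shape shows how a transliterated form could double as another normalization to index.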
I think it would be interesting, in a work like the KJV, to do a
Strong's search that retrieves a list of the different translations of
a particular Strong's number.
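In data-structure terms this is just grouping: walk the tagged text, and for each (Strong's number, translated word) pair, collect the distinct renderings. A sketch, assuming such pairs can be pulled from a Strong's-tagged text (the class and method names are made up):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch, assuming a stream of (Strong's number, translated word)
// pairs extracted from a tagged text like the KJV: collect the
// distinct English renderings of each Strong's number.
public class StrongsIndex {
    private final Map<String, Set<String>> renderings = new LinkedHashMap<>();

    public void add(String strongsNumber, String translation) {
        renderings.computeIfAbsent(strongsNumber, k -> new LinkedHashSet<>())
                  .add(translation);
    }

    public Set<String> translationsOf(String strongsNumber) {
        return renderings.getOrDefault(strongsNumber, Set.of());
    }
}
```

For example, G26 (agape) would come back with both "love" and "charity" in the KJV.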
What are some of your ideas?
I will be focused on adding BookMarks for the next release (after the
one that is about to be done now) and won't be able to get to much of
anything else but bug fixes.
In Him,
DM