[bt-devel] Re: BibleTime

Fri Dec 16 12:25:03 MST 2005

On 16/12/05, Troy A. Griffitts <scribe at crosswire.org> wrote:
> Hey guys,
>         Just a quick note.  Are you all aware that SWORD does expose clucene
> searching in the API?  We have an interface to query if indexes have
> been created, and also to ask them to be created (reporting status) if
> they have not been.
>
>         Also, it is my impression that clucene does not yet work correctly with
> wide characters (wchar_t is also different sizes on different platforms,
> as mentioned below, and does not conform to any standard).

Have you tried it out? My impression is that they are just putting 16
bits of data into whatever wchar_t is but I haven't tested it yet so I
don't know if it works.
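
For anyone who wants to test this, something like the following could be
a starting point. It is only a sketch: utf8ToWide is a made-up name, not
a CLucene or SWORD function, and it assumes a 16-bit wchar_t should hold
UTF-16 (surrogate pairs above U+FFFF) while a wider wchar_t holds one
code point per element. It does not reject overlong UTF-8 forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CLucene or SWORD): decode UTF-8 into a
// std::wstring regardless of how wide wchar_t is on this platform.
std::wstring utf8ToWide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cc & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (e.g. Windows): split into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            // 32-bit wchar_t (e.g. Linux), or a BMP character: store directly.
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

Feeding it Hebrew or Greek text (all inside the 16-bit range) and comparing
the wide string against what comes back from the index would at least show
whether the 16-bit assumption holds for our modules.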

Regards,
Daniel

> Martin Gruner wrote:
> > Dear Lee,
> >
> > I'm more than excited! My comments below...
> >
> >
> >>I have been meaning to send regular updates, but I keep thinking, "No
> >>I'll do this one more thing then send an update."  Right now, in my
> >>local Bibletime tree, I have an index-based search going!  Currently, it
> >>simply uses the existing search dialog, but I ignore some of the fields.
> > >>  The results show up in the normal results tab of the search dialog.
> >>
> >>Here's what I've done.  I implemented the search as another function in
> >>CSwordModuleInfo.  Where the search dialog normally called search(), I
> >>call searchIndexed() which is my new function.  The results are returned
> >>to m_searchResults as normal.
> >>
> > >>I fought the Unicode issue again.  The search string came from Qt in
> >>UCS2.  CLucene uses TCHAR which is a wchar_t if built for Unicode and
> >>just char if built for ANSI.  To make matters worse, wchar_t is 2 bytes
> >>on Windows and 4 bytes on Linux.  Fortunately, I found some conversion
> >>utilities in CLucene that allowed me to convert from utf8 to wide-char
> >>strings.  So I use QString to convert to UTF8 then those utils to
> > >>convert to CLucene's wchar types.  Then I search my index and convert the
> >>results from wchar types to utf8 to stuff back into SWKey results. *Phew*
> >>:)
> >
> >
> > :-p
> >
> > I suggest that we _demand_ that users install clucene built for Unicode. Isn't
> > wchar_t UCS2? Perhaps we could speed up index creation if we have a direct
> > conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert
> > here. We could add that later, also.
> >
> >
> >>I am currently working on limiting the results to the search scope
> >>specified in the search dialog.  I came up with a list of questions I
> >>wanted to ask to go further.  I was going to send them tonight actually,
> >>but since you pinged me, here they are :)
> >>
> >>1. Search syntax.  As you know CLucene has a rich search syntax.  Do we
> >>want to expose that syntax directly (i.e. the user types their query in
> >>the syntax supported by CLucene) or do we want to break out the syntax
> >>into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
> >>
> >>2. Do we want index-based searching to be "the search method" or do we
> >>want it to be an option along with the search that's there now?
> >
> >
> > It will be the standard and the only method. =) And IMO we should directly
> > expose the search syntax and offer some nice help for users to learn it. This
> > means that we can remove many buttons/boxes in the search dialog. Going to be
> > easier for us and more flexible for the users.
> >
> >
> >>3. Index-building.  When do we want to build the index?  It almost makes
> >>sense to build the index when the user adds a module.  However, this is
> >>a potentially long operation.  We could kick off a thread to do it and
> >>keep the UI free for other purposes.  Also, we could do like most search
> >>engines and force the user to build the index the first time they search.
> >
> >
> > The last is what I'd suggest.
> > Another question: Will we be able to access the index directly, e.g. getting a
> > list of all words starting or ending with XY? I have plans for an "instant
> > concordance" function later which would operate on the index.
> > You could make a little blocking pop-up window that just says "(Re)building
> > index for module XY, this may take a while" and has a progress bar. No user
> > interaction needed.
> >
> >
> >>4. Index-location.  Where do we store the index?  Do we currently have a
> >>.bibletime or something to store such things? (I might be able to answer
> >>this myself, I haven't looked for it yet.)
> >
> >
> > You can use:
> > QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
> >
> > On my system, this will return ~/.kde/share/apps/bibletime/indices/, which
> > would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we
> > currently store the lexicon entry cache files (very simple logic). Indexes
> > also need to be rebuilt should the version of an installed module OR the way
> > we create indexes change. So I guess our module version number and the "index
> > layout" version number need to be stored somewhere. Whenever the index layout
> > changes, we increase the index layout version number, and all indices will be
> > rebuilt for the users.
> > We also perhaps need a button to "Delete all index and cache files", if a user
> > has disk space problems.
> >
> >
> >>Also, what about Bibles?
> >>Their indices are not going to change.  Should we distribute index files
> >>with the modules?  The user wouldn't have to build at all!
> >
> >
> > This is not possible, because Crosswire distributes the module files, and
> > we'll likely use a different index format than other Sword frontends. So I
> > guess we'll have to take care of it.
> >
> > How long does it take? How big do they get?
> >
> >
> >>5. Analyzers.  It seems that there are many different Analyzers that can
> >>be used to build an index.  (Some that differentiate between lower and
> >>uppercase, some that take into account grammar rules for certain
> >>languages, etc.)  Do we want this flexibility extended to the user?  Or
> >>do we just use the simple analyzer which simply breaks up words?
> >
> >
> > I don't know, have to read more. Perhaps we should start with the simple one?
> >
> >
> >>6. Exceptions.  When building my search in, CLucene code complained that
> >>C++ exceptions were turned off and CLucene requires them on.  Was there
> >>a reason for them being turned off?
> >
> >
> > Joachim, can you say something about it?
> >
> >
> >>I think that's it for the moment...  I'll try to send status updates
> >>more often :)
> >
> >
> >>[...]
> >
> >
> > Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3
> > release which just came out. Feel free to start working in cvs HEAD. Should
> > we need to make more bugfix releases in the meantime, we can create a branch
> > and work there. Once this works well and is documented, we can release 1.6.
> >
> >
> > So much for now.
> >
> > mg
> >
> >
> >>Thanks,
> >>
> >>Lee C.
> >>
> >>Martin Gruner wrote:
> >>
> >>>Hey Lee,
> >>>
> >>>just wanted to ask how you are and about your progress with BibleTime
> >>>coding / investigation. Here's a nice clucene-based project that I just
> >>>found: http://kioclucene.objectis.net/ (just as a demonstration).
> >>>
> >>>I hope you, Anna and your dear wife are doing well,
> >>>
> >>>Martin
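
On Martin's point above about rebuilding an index when either the module
version or the "index layout" version changes: the check itself could be
as small as the sketch below. Everything here is hypothetical
(IndexStamp, kIndexLayoutVersion and indexNeedsRebuild are illustrative
names, not existing BibleTime or SWORD code), and how the stamp gets
stored next to the index is left open.

```cpp
#include <string>

// Assumed constant: bumped by hand whenever the way indexes are built changes.
const int kIndexLayoutVersion = 1;

// Hypothetical record saved alongside each module's index.
struct IndexStamp {
    std::string moduleVersion;  // module version at the time of indexing
    int layoutVersion;          // index layout version used to build it
};

// True if the stored index no longer matches the installed module
// or was built with a different index layout.
bool indexNeedsRebuild(const IndexStamp& stored,
                       const std::string& installedModuleVersion) {
    return stored.moduleVersion != installedModuleVersion
        || stored.layoutVersion != kIndexLayoutVersion;
}
```

Running this check at search time would fit the "build on first search"
approach, since a stale index and a missing index could then go through
the same rebuild path.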

--
A: No.
Q: Should I include quotations after my reply?
A: Because it breaks the logical sequence of discussion
Q: Why is top posting bad?