[bt-devel] Re: BibleTime
Martin Gruner
mg.pub at gmx.net
Fri Dec 16 02:20:47 MST 2005
Dear Lee,
I'm more than excited! My comments below...
> I have been meaning to send regular updates, but I keep thinking, "No
> I'll do this one more thing then send an update." Right now, in my
> local BibleTime tree, I have an index-based search going! Currently, it
> simply uses the existing search dialog, but I ignore some of the fields.
> The results show up in the normal results tab of the search dialog.
>
> Here's what I've done. I implemented the search as another function in
> CSwordModuleInfo. Where the search dialog normally called search(), I
> call searchIndexed() which is my new function. The results are returned
> to m_searchResults as normal.
>
> I fought the Unicode issue again. The search string came from Qt in
> UCS2. CLucene uses TCHAR which is a wchar_t if built for Unicode and
> just char if built for ANSI. To make matters worse, wchar_t is 2 bytes
> on Windows and 4 bytes on Linux. Fortunately, I found some conversion
> utilities in CLucene that allowed me to convert from utf8 to wide-char
> strings. So I use QString to convert to UTF8 then those utils to
> convert to CLucene's wchar types. Then I search my index and convert the
> results from wchar types to utf8 to stuff back into SWKey results. *Phew*
> :)
:-p
I suggest that we _demand_ that users install clucene built for Unicode. Isn't
wchar_t UCS2? Perhaps we could speed up index creation if we have a direct
conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert
here. We could add that later, also.
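As a rough illustration of the direct conversion idea, here is a minimal sketch in standard C++ that turns 16-bit code units (the form in which, per the quoted text, Qt supplies the string) into a wide string without the UTF-8 detour. The function name utf16ToWide is hypothetical and this is not a CLucene or Qt routine. It assumes wchar_t is either 2 or 4 bytes, as Lee describes for Windows and Linux. Only in the 2-byte case is the conversion a plain copy; with a 4-byte wchar_t, surrogate pairs have to be combined.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-16 code units directly to a wide string.
// Assumes wchar_t is 2 bytes (e.g. Windows) or 4 bytes (e.g. Linux).
std::wstring utf16ToWide(const std::u16string& in) {
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t unit = in[i];
        if (sizeof(wchar_t) == 2) {
            // 2-byte wchar_t: the code units can be copied unchanged.
            out.push_back(static_cast<wchar_t>(unit));
            continue;
        }
        // 4-byte wchar_t: merge a high/low surrogate pair into one code point.
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < in.size()) {
            const char16_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                const char32_t cp = 0x10000
                    + ((static_cast<char32_t>(unit) - 0xD800) << 10)
                    + (static_cast<char32_t>(next) - 0xDC00);
                out.push_back(static_cast<wchar_t>(cp));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        // Ordinary BMP character (or an unpaired surrogate, passed through).
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}
```

Whether this is actually faster than the UTF-8 route would have to be measured; the sketch only shows that the direct path is short.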
> I am currently working on limiting the results to the search scope
> specified in the search dialog. I came up with a list of questions I
> wanted to ask to go further. I was going to send them tonight actually,
> but since you pinged me, here they are :)
>
> 1. Search syntax. As you know CLucene has a rich search syntax. Do we
> want to expose that syntax directly (i.e. the user types their query in
> the syntax supported by CLucene) or do we want to break out the syntax
> into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
>
> 2. Do we want index-based searching to be "the search method" or do we
> want it to be an option along with the search that's there now?
It will be the standard and the only method. =) And IMO we should directly
expose the search syntax and offer some nice help for users to learn it. This
means that we can remove many buttons/boxes in the search dialog. Going to be
easier for us and more flexible for the users.
> 3. Index-building. When do we want to build the index? It almost makes
> sense to build the index when the user adds a module. However, this is
> a potentially long operation. We could kick off a thread to do it and
> keep the UI free for other purposes. Also, we could do like most search
> engines and force the user to build the index the first time they search.
The last is what I'd suggest.
Another question: Will we be able to access the index directly, e.g. getting a
list of all words starting or ending with XY? I have plans for an "instant
concordance" function later which would operate on the index.
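To make the "words starting with XY" question concrete, here is a minimal sketch of a prefix lookup over a sorted term list. It assumes the index can hand us its terms in sorted order; the std::set stands in for that term dictionary and termsWithPrefix is a hypothetical name, not a CLucene call.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: collect all terms that begin with the given prefix.
// The sorted set is only a stand-in for the index's term dictionary.
std::vector<std::string> termsWithPrefix(const std::set<std::string>& terms,
                                         const std::string& prefix) {
    std::vector<std::string> result;
    // In sorted order, all terms sharing a prefix are adjacent, starting at
    // the first term that is not less than the prefix itself.
    for (auto it = terms.lower_bound(prefix);
         it != terms.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        result.push_back(*it);
    }
    return result;
}
```

Words ending with XY do not sit next to each other in a sorted list, so that case would need either a full scan or a second list of reversed terms.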
You could make a little blocking pop-up window that just says "(Re)building
index for module XY, this may take a while" and has a progress bar. No user
interaction needed.
> 4. Index-location. Where do we store the index? Do we currently have a
> .bibletime or something to store such things? (I might be able to answer
> this myself, I haven't looked for it yet.)
You can use:
QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
On my system, this will return ~/.kde/share/apps/bibletime/indices/, which
would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we
currently store the lexicon entry cache files (very simple logic). Indexes
also need to be rebuilt should the version of an installed module OR the way
we create indexes change. So I guess our module version number and the "index
layout" version number need to be stored somewhere. Whenever the index layout
changes, we increase the index layout version number, and all indices will be
rebuilt for the users.
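The rebuild rule could be sketched roughly like this. All names here (IndexStamp, kCurrentIndexLayoutVersion, indexNeedsRebuild) are hypothetical, and how the stamp is stored next to the index is left open.

```cpp
#include <string>

// Hypothetical record saved alongside each index when it is built.
struct IndexStamp {
    std::string moduleVersion; // module version at the time of indexing
    int layoutVersion;         // index layout version at the time of indexing
};

// Hypothetical constant, increased whenever the way indexes are created changes.
const int kCurrentIndexLayoutVersion = 1;

// An index is stale if the module was updated or the index layout changed.
bool indexNeedsRebuild(const IndexStamp& stored,
                       const std::string& installedModuleVersion) {
    return stored.moduleVersion != installedModuleVersion
        || stored.layoutVersion != kCurrentIndexLayoutVersion;
}
```

The check would run before a search, so a stale index is rebuilt the next time the user searches in that module.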
We also perhaps need a button to "Delete all index and cache files", if a user
has disk space problems.
> Also, what about Bibles?
> Their indices are not going to change. Should we distribute index files
> with the modules? The user wouldn't have to build at all!
This is not possible, because Crosswire distributes the module files, and
we'll likely use a different index format than other Sword frontends. So I
guess we'll have to take care of it.
How long does it take? How big do they get?
> 5. Analyzers. It seems that there are many different Analyzers that can
> be used to build an index. (Some that differentiate between lower and
> uppercase, some that take into account grammar rules for certain
> languages, etc.) Do we want this flexibility extended to the user? Or
> do we just use the simple analyzer which simply breaks up words?
I don't know, have to read more. Perhaps we should start with the simple one?
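For reference, "simply breaks up words" could look roughly like the sketch below. This is not CLucene's analyzer code; simpleTokenize is a hypothetical name, and the sketch assumes ASCII text only (in the default C locale, non-ASCII bytes are treated as separators, which would be wrong for most modules). It also lowercases the tokens, which is a choice made here so that searches ignore case.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text at non-letters and lowercase each token.
// ASCII-only illustration; real module text would need Unicode-aware rules.
std::vector<std::string> simpleTokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            // A non-letter ends the current word.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current); // flush the final word
    }
    return tokens;
}
```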
> 6. Exceptions. When building my search in, CLucene code complained that
> C++ exceptions were turned off and CLucene requires them on. Was there
> a reason for them being turned off?
Joachim, can you say something about it?
> I think that's it for the moment... I'll try to send status updates
> more often :)
>[...]
Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3
release which just came out. Feel free to start working in cvs HEAD. Should
we need to make more bugfix releases in the meantime, we can create a branch
and work there. Once this works well and is documented, we can release 1.6.
So much for now.
mg
> Thanks,
>
> Lee C.
>
> Martin Gruner wrote:
> > Hey Lee,
> >
> > just wanted to ask how you are and about your progress with BibleTime
> > coding / investigation. Here's a nice clucene-based project that I just
> > found: http://kioclucene.objectis.net/ (just as a demonstration).
> >
> > I hope you, Anna and your dear wife are doing well,
> >
> > Martin