[bt-devel] Re: BibleTime
Troy A. Griffitts
scribe at crosswire.org
Fri Dec 16 12:08:22 MST 2005
Hey guys,
Just a quick note. Are you all aware that SWORD already exposes clucene
searching in its API? We have an interface to query whether indexes have
been created, and to ask for them to be created (with status reporting)
if they have not been.
Also, it is my impression that clucene does not yet work correctly with
wide characters (wchar_t also has different sizes on different
platforms, as noted below, and does not conform to any standard).
Hope this adds a little,
-Troy.
Martin Gruner wrote:
> Dear Lee,
>
> I'm more than excited! My comments below...
>
>
>>I have been meaning to send regular updates, but I keep thinking, "No
>>I'll do this one more thing then send an update." Right now, in my
>>local BibleTime tree, I have an index-based search going! Currently, it
>>simply uses the existing search dialog, but I ignore some of the fields.
>> The results show up in the normal results tab of the search dialog.
>>
>>Here's what I've done. I implemented the search as another function in
>>CSwordModuleInfo. Where the search dialog normally called search(), I
>>call searchIndexed() which is my new function. The results are returned
>>to m_searchResults as normal.
>>
>>I fought the Unicode issue again. The search string came from Qt in
>>UCS2. CLucene uses TCHAR which is a wchar_t if built for Unicode and
>>just char if built for ANSI. To make matters worse, wchar_t is 2 bytes
>>on Windows and 4 bytes on Linux. Fortunately, I found some conversion
>>utilities in CLucene that allowed me to convert from UTF8 to wide-char
>>strings. So I use QString to convert to UTF8 then those utils to
>>convert to CLucene's wchar types. Then I search my index and convert the
>>results from wchar types to UTF8 to stuff back into SWKey results. *Phew*
>>:)
>
>
> :-p
>
> I suggest that we _demand_ that users install clucene built for Unicode. Isn't
> wchar_t UCS2? Perhaps we could speed up index creation if we have a direct
> conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert
> here. We could add that later, also.
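
A minimal sketch of the UTF-8 to wide-character step discussed above, written against the standard library only. The function name utf8ToWide is hypothetical; it is not one of CLucene's conversion utilities and not part of SWORD or Qt. It only illustrates why the 2-byte versus 4-byte wchar_t difference matters: with a 2-byte wchar_t, code points above U+FFFF have to be written as a surrogate pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for illustration, not a CLucene call):
// decode UTF-8 into a std::wstring. When wchar_t is 2 bytes, code points
// above U+FFFF are emitted as a UTF-16 surrogate pair; when it is 4 bytes,
// each element holds one code point. Overlong encodings are not rejected.
std::wstring utf8ToWide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // Split into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The same input therefore yields a wide string of different length depending on the platform, which is one reason a direct UCS2 to wchar_t copy is only safe where wchar_t is known to be 2 bytes.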
>
>
>>I am currently working on limiting the results to the search scope
>>specified in the search dialog. I came up with a list of questions I
>>wanted to ask to go further. I was going to send them tonight actually,
>>but since you pinged me, here they are :)
>>
>>1. Search syntax. As you know CLucene has a rich search syntax. Do we
>>want to expose that syntax directly (i.e. the user types their query in
>>the syntax supported by CLucene) or do we want to break out the syntax
>>into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
>>
>>2. Do we want index-based searching to be "the search method" or do we
>>want it to be an option along with the search that's there now?
>
>
> It will be the standard and the only method. =) And IMO we should directly
> expose the search syntax and offer some nice help for users to learn it. This
> means that we can remove many buttons/boxes in the search dialog. Going to be
> easier for us and more flexible for the users.
>
>
>>3. Index-building. When do we want to build the index? It almost makes
>>sense to build the index when the user adds a module. However, this is
>>a potentially long operation. We could kick off a thread to do it and
>>keep the UI free for other purposes. Also, we could do like most search
>>engines and force the user to build the index the first time they search.
>
>
> The last is what I'd suggest.
> Another question: Will we be able to access the index directly, e.g. getting a
> list of all words starting or ending with XY? I have plans for an "instant
> concordance" function later which would operate on the index.
> You could make a little blocking pop-up window that just says "(Re)building
> index for module XY, this may take a while" and has a progress bar. No user
> interaction needed.
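
A rough sketch of the "words starting with XY" lookup, assuming the index's terms can be obtained as a sorted list of strings. How that list would be read from a CLucene index is not shown here, and termsWithPrefix is a hypothetical name. On a sorted list, a binary search finds the first candidate and matching terms follow it contiguously.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: return every term that begins with `prefix`.
// Assumes `sortedTerms` is sorted in ascending byte order.
std::vector<std::string> termsWithPrefix(
        const std::vector<std::string>& sortedTerms,
        const std::string& prefix) {
    std::vector<std::string> hits;
    // First term that is not less than the prefix.
    auto it = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    // All terms sharing the prefix sit next to each other in sorted order.
    for (; it != sortedTerms.end() &&
           it->compare(0, prefix.size(), prefix) == 0; ++it) {
        hits.push_back(*it);
    }
    return hits;
}
```

Words ending with XY would not be contiguous in such a list; one option would be a second list of reversed terms queried the same way.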
>
>
>>4. Index-location. Where do we store the index? Do we currently have a
>>.bibletime or something to store such things? (I might be able to answer
>>this myself, I haven't looked for it yet.)
>
>
> You can use:
> QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
>
> On my system, this will return ~/.kde/share/apps/bibletime/indices/, which
> would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we
> currently store the lexicon entry cache files (very simple logic). Indexes
> also need to be rebuilt should the version of an installed module OR the way
> we create indexes change. So I guess our module version number and the "index
> layout" version number need to be stored somewhere. Whenever the index layout
> changes, we increase the index layout version number, and all indices will be
> rebuilt for the users.
> We also perhaps need a button to "Delete all index and cache files", if a user
> has disk space problems.
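
The rebuild rule described above could be sketched roughly as follows. All names here (IndexStamp, kCurrentLayoutVersion, needsRebuild) are hypothetical, and where the stamp is stored on disk is left open. The idea is only that an index is stale when either the module version or the index layout version differs from what was recorded at build time.

```cpp
#include <string>

// Hypothetical record kept next to each index (names are assumptions).
struct IndexStamp {
    std::string moduleVersion;  // module version at the time of indexing
    int layoutVersion;          // index layout version used to build it
};

// Increased by the developers whenever the way indexes are created changes.
const int kCurrentLayoutVersion = 1;

// True if the stored index no longer matches the installed module
// or was built with an older (or otherwise different) index layout.
bool needsRebuild(const IndexStamp& stored,
                  const std::string& installedModuleVersion,
                  int currentLayout = kCurrentLayoutVersion) {
    return stored.moduleVersion != installedModuleVersion ||
           stored.layoutVersion != currentLayout;
}
```

With this check run before a search, bumping the layout version in a release would cause every existing index to be rebuilt the next time it is needed.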
>
>
>>Also, what about Bibles?
>>Their indices are not going to change. Should we distribute index files
>>with the modules? The user wouldn't have to build at all!
>
>
> This is not possible, because Crosswire distributes the module files, and
> we'll likely use a different index format than other Sword frontends. So I
> guess we'll have to take care of it.
>
> How long does it take? How big do they get?
>
>
>>5. Analyzers. It seems that there are many different Analyzers that can
>>be used to build an index. (Some that differentiate between lower and
>>uppercase, some that take into account grammar rules for certain
>>languages, etc.) Do we want this flexibility extended to the user? Or
>>do we just use the simple analyzer which simply breaks up words?
>
>
> I don't know, have to read more. Perhaps we should start with the simple one?
>
>
>>6. Exceptions. When building my search in, CLucene code complained that
>>C++ exceptions were turned off and CLucene requires them on. Was there
>>a reason for them being turned off?
>
>
> Joachim, can you say something about it?
>
>
>>I think that's it for the moment... I'll try to send status updates
>>more often :)
>
>
>>[...]
>
>
> Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3
> release which just came out. Feel free to start working in cvs HEAD. Should
> we need to make more bugfix releases in the meantime, we can create a branch
> and work there. Once this works well and is documented, we can release 1.6.
>
>
> So much for now.
>
> mg
>
>
>>Thanks,
>>
>>Lee C.
>>
>>Martin Gruner wrote:
>>
>>>Hey Lee,
>>>
>>>just wanted to ask how you are and about your progress with BibleTime
>>>coding / investigation. Here's a nice clucene-based project that I just
>>>found: http://kioclucene.objectis.net/ (just as a demonstration).
>>>
>>>I hope you, Anna and your dear wife are doing well,
>>>
>>>Martin
>
> _______________________________________________
> bt-devel mailing list
> bt-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/bt-devel