[bt-devel] Re: BibleTime
Lee Carpenter
elc at carpie.net
Sat Dec 17 21:08:15 MST 2005
Martin Gruner wrote:
>>I fought the Unicode issue again. The search string came from QT in
>>UCS2. CLucene uses TCHAR which is a wchar_t if built for Unicode and
>>just char if built for ANSI. To make matters worse, wchar_t is 2 bytes
>>on Windows and 4 bytes on Linux. Fortunately, I found some conversion
>>utilities in CLucene that allowed me to convert from utf8 to wide-char
>>strings. So I use QString to convert to UTF8 then those utils to
>>convert to CLucene's wchar types. Then I search my index and convert the
>>results from wchar types to utf8 to stuff back into SWKey results. *Phew*
>
> I suggest that we _demand_ that users install clucene built for Unicode. Isn't
> wchar_t UCS2? Perhaps we could speed up index creation if we have a direct
> conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert
> here. We could add that later, also.
As you've seen from the other posts to the list, on Linux wchar_t is
essentially UCS4. The only reason I have to go from UCS2 to UCS4 is to
handle the input string from Qt, which comes natively in UCS2. I could
write a routine to stuff UCS2 chars directly into 4-byte variables,
but since it is an incredibly small amount of data, I just used the
convenience functions that were provided.
Since the SWORD modules are already UTF8, there is no "middle man" in
that conversion...
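
For reference, a direct routine would be small. The following is only a
minimal sketch, not what the code currently does: it assumes the query
is available as a sequence of 16-bit units, and the name utf16ToUcs4 is
made up for illustration. Plain UCS2 characters are simply widened. The
surrogate handling is an extra assumption, in case the string actually
holds UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: widen 16-bit units to 32-bit code points.
// Plain UCS2 characters are copied unchanged; a valid surrogate pair is
// combined into one code point; a lone surrogate becomes U+FFFD.
std::u32string utf16ToUcs4(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        if (isHigh && i + 1 < in.size()) {
            const char32_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the high and low surrogate into one code point.
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back((isHigh || isLow) ? char32_t(0xFFFD) : unit);
    }
    return out;
}
```

On Windows, where wchar_t is 2 bytes, no widening would be needed at
all, so a routine like this only matters for the Linux build.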
>
>>1. Search syntax. As you know CLucene has a rich search syntax. Do we
>>want to expose that syntax directly (i.e. the user types their query in
>>the syntax supported by CLucene) or do we want to break out the syntax
>>into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
>>
>>2. Do we want index-based searching to be "the search method" or do we
>>want it to be an option along with the search that's there now?
>
>
> It will be the standard and the only method. =) And IMO we should directly
> expose the search syntax and offer some nice help for users to learn it. This
> means that we can remove many buttons/boxes in the search dialog. Going to be
> easier for us and more flexible for the users.
Ok, I agree. It is a rich syntax.
>>3. Index-building. When do we want to build the index? It almost makes
>>sense to build the index when the user adds a module. However, this is
>>a potentially long operation. We could kick off a thread to do it and
>>keep the UI free for other purposes. Also, we could do like most search
>>engines and force the user to build the index the first time they search.
>
>
> The last is what I'd suggest.
> Another question: Will we be able to access the index directly, e.g. getting a
> list of all words starting or ending with XY? I have plans for an "instant
> concordance" function later which would operate on the index.
> You could make a little blocking pop-up window that just says "(Re)building
> index for module XY, this may take a while" and has a progress bar. No user
> interaction needed.
Ok. A seemingly simple way to get what you want from the index is to
perform a CLucene search and read the returned Hits directly (as
opposed to having them returned in SWKey lists).
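
To illustrate the "all words starting with XY" part independently of
CLucene: if the index terms were available as a sorted list, a prefix
lookup is a binary search followed by a short scan. This is only a
sketch under that assumption. It does not use the CLucene API, and
termsWithPrefix is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: return every term that starts with `prefix`.
// `sortedTerms` must be sorted in ascending order.
std::vector<std::string> termsWithPrefix(const std::vector<std::string>& sortedTerms,
                                         const std::string& prefix)
{
    // First term that is not less than the prefix.
    auto first = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    auto last = first;
    // Terms sharing the prefix are contiguous in a sorted list.
    while (last != sortedTerms.end() &&
           last->compare(0, prefix.size(), prefix) == 0) {
        ++last;
    }
    return std::vector<std::string>(first, last);
}
```

Words ending with XY could be handled the same way over a second list
holding the reversed terms.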
>
>>4. Index-location. Where do we store the index? Do we currently have a
>>.bibletime or something to store such things? (I might be able to answer
>>this myself, I haven't looked for it yet.)
>
>
> You can use:
> QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
>
> On my system, this will return ~/.kde/share/apps/bibletime/indices/, which
> would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we
> currently store the lexicon entry cache files (very simple logic). Indexes
> also need to be rebuilt should the version of an installed module OR the way
> we create indexes change. So I guess our module version number and the "index
> layout" version number need to be stored somewhere. Whenever the index layout
> changes, we increase the index layout version number, and all indices will be
> rebuilt for the users.
> We also perhaps need a button to "Delete all index and cache files", if a user
> has disk space problems.
Ok.
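
The rebuild rule could be sketched roughly like this. The struct, the
constant and the function name are all hypothetical, and how the stamp
is stored next to the index is left open:

```cpp
#include <cassert>
#include <string>

// Hypothetical stamp saved alongside each index when it is built.
struct IndexStamp {
    std::string moduleVersion; // module version at the time the index was built
    int layoutVersion;         // index layout version at that time
};

// Increased whenever the way we create indices changes.
const int kIndexLayoutVersion = 1;

// An index is stale if the module was updated or the layout changed.
bool indexNeedsRebuild(const IndexStamp& stored,
                       const std::string& installedModuleVersion)
{
    return stored.moduleVersion != installedModuleVersion
        || stored.layoutVersion != kIndexLayoutVersion;
}
```

A missing stamp would simply be treated the same as a stale one.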
>
>>Also, what about Bibles?
>>Their indices are not going to change. Should we distribute index files
>>with the modules? The user wouldn't have to build at all!
>
>
> This is not possible, because Crosswire distributes the module files, and
> we'll likely use a different index format than other Sword frontends. So I
> guess we'll have to take care of it.
>
> How long does it take? How big do they get?
My current test index, built from the KJV with the SimpleAnalyzer, is
42 MB. I didn't time it, but it seemed to take 2 to 3 minutes on my
Athlon 2.13 GHz.
>
>>5. Analyzers. It seems that there are many different Analyzers that can
>>be used to build an index. (Some that differentiate between lower and
>>uppercase, some that take into account grammar rules for certain
>>languages, etc.) Do we want this flexibility extended to the user? Or
>>do we just use the simple analyzer which simply breaks up words?
>
>
> I don't know, have to read more. Perhaps we should start with the simple one?
Ok. Seems to work fine.
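
As far as I understand it, the simple analyzer splits the text at
anything that is not a letter and lowercases what is left. A rough
ASCII-only illustration of that idea follows. It is not the CLucene
code, and simpleTokenize is a made-up name.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split at non-letters and lowercase each token.
// Only ASCII letters are recognised in the default "C" locale.
std::vector<std::string> simpleTokenize(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            // A non-letter ends the current token.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

With this, "In the beginning, God" breaks up into "in", "the",
"beginning" and "god".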
> Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3
> release which just came out. Feel free to start working in cvs HEAD. Should
> we need to make more bugfix releases in the meantime, we can create a branch
> and work there. Once this works well and is documented, we can release 1.6.
>
Ok.
Thanks,
Lee C.