[bt-devel] Re: BibleTime
Daniel Glassey
dglassey at gmail.com
Fri Dec 16 12:25:03 MST 2005
On 16/12/05, Troy A. Griffitts <scribe at crosswire.org> wrote:
> Hey guys,
> Just a quick note. Are you all aware that SWORD does expose clucene
> searching in the API? We have an interface to query if indexes have
> been created, and also to ask them to be created (reporting status) if
> they have not been.
>
> Also, it is my impression that clucene does not yet work correctly with
> wide characters (wchar_t also has different sizes on different platforms,
> as discussed below, and does not conform to any standard).
Have you tried it out? My impression is that they are just putting 16
bits of data into whatever wchar_t is but I haven't tested it yet so I
don't know if it works.
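To make that concrete, here is a rough sketch of the approach I am guessing at. It is not CLucene's code, only my assumption about what "16 bits in a wchar_t" would look like; the helper names widenUtf16Units and narrowToUtf16Units are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (NOT CLucene code): store each 16-bit code unit in one
// wchar_t, whatever size wchar_t happens to have on this platform.
std::wstring widenUtf16Units(const std::vector<std::uint16_t>& units) {
    static_assert(sizeof(wchar_t) >= 2, "wchar_t must hold a 16-bit code unit");
    std::wstring out;
    out.reserve(units.size());
    for (std::uint16_t u : units) {
        out.push_back(static_cast<wchar_t>(u));
    }
    return out;
}

// Reverse direction: read the 16-bit value back out of every wchar_t.
std::vector<std::uint16_t> narrowToUtf16Units(const std::wstring& wide) {
    std::vector<std::uint16_t> out;
    out.reserve(wide.size());
    for (wchar_t w : wide) {
        // Only 16 bits were ever stored, so nothing larger should appear.
        assert(static_cast<unsigned long>(w) <= 0xFFFFul);
        out.push_back(static_cast<std::uint16_t>(w));
    }
    return out;
}
```

If that is all that happens, the data itself survives on both 2-byte and 4-byte wchar_t; whether the rest of the library copes with it is the part that needs testing.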
Regards,
Daniel
> Martin Gruner wrote:
> > Dear Lee,
> >
> > I'm more than excited! My comments below...
> >
> >
> >>I have been meaning to send regular updates, but I keep thinking, "No
> >>I'll do this one more thing then send an update." Right now, in my
> > >>local BibleTime tree, I have an index-based search going! Currently, it
> > >>simply uses the existing search dialog, but I ignore some of the fields.
> > >> The results show up in the normal results tab of the search dialog.
> >>
> >>Here's what I've done. I implemented the search as another function in
> >>CSwordModuleInfo. Where the search dialog normally called search(), I
> >>call searchIndexed() which is my new function. The results are returned
> >>to m_searchResults as normal.
> >>
> > >>I fought the Unicode issue again. The search string came from Qt in
> >>UCS2. CLucene uses TCHAR which is a wchar_t if built for Unicode and
> >>just char if built for ANSI. To make matters worse, wchar_t is 2 bytes
> >>on Windows and 4 bytes on Linux. Fortunately, I found some conversion
> >>utilities in CLucene that allowed me to convert from utf8 to wide-char
> >>strings. So I use QString to convert to UTF8 then those utils to
> > >>convert to CLucene's wchar types. Then I search my index and convert the
> >>results from wchar types to utf8 to stuff back into SWKey results. *Phew*
> >>:)
> >
> >
> > :-p
> >
> > I suggest that we _demand_ that users install clucene built for Unicode. Isn't
> > wchar_t UCS2? Perhaps we could speed up index creation if we have a direct
> > conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert
> > here. We could add that later, also.
> >
> >
> >>I am currently working on limiting the results to the search scope
> >>specified in the search dialog. I came up with a list of questions I
> >>wanted to ask to go further. I was going to send them tonight actually,
> >>but since you pinged me, here they are :)
> >>
> >>1. Search syntax. As you know CLucene has a rich search syntax. Do we
> >>want to expose that syntax directly (i.e. the user types their query in
> >>the syntax supported by CLucene) or do we want to break out the syntax
> >>into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
> >>
> >>2. Do we want index-based searching to be "the search method" or do we
> >>want it to be an option along with the search that's there now?
> >
> >
> > It will be the standard and the only method. =) And IMO we should directly
> > expose the search syntax and offer some nice help for users to learn it. This
> > means that we can remove many buttons/boxes in the search dialog. Going to be
> > easier for us and more flexible for the users.
> >
> >
> >>3. Index-building. When do we want to build the index? It almost makes
> >>sense to build the index when the user adds a module. However, this is
> >>a potentially long operation. We could kick off a thread to do it and
> >>keep the UI free for other purposes. Also, we could do like most search
> >>engines and force the user to build the index the first time they search.
> >
> >
> > The last is what I'd suggest.
> > Another question: Will we be able to access the index directly, e.g. getting a
> > list of all words starting or ending with XY? I have plans for an "instant
> > concordance" function later which would operate on the index.
> > You could make a little blocking pop-up window that just says "(Re)building
> > index for module XY, this may take a while" and has a progress bar. No user
> > interaction needed.
> >
> >
> >>4. Index-location. Where do we store the index? Do we currently have a
> >>.bibletime or something to store such things? (I might be able to answer
> >>this myself, I haven't looked for it yet.)
> >
> >
> > You can use:
> > QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
> >
> > On my system, this will return ~/.kde/share/apps/bibletime/indices/, which
> > would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we
> > currently store the lexicon entry cache files (very simple logic). Indexes
> > also need to be rebuilt should the version of an installed module OR the way
> > we create indexes change. So I guess our module version number and the "index
> > layout" version number need to be stored somewhere. Whenever the index layout
> > changes, we increase the index layout version number, and all indices will be
> > rebuilt for the users.
> > We also perhaps need a button to "Delete all index and cache files", if a user
> > has disk space problems.
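The rebuild rule described above (rebuild when either the module version or the index layout version changes) could look roughly like this. All names here (IndexStamp, kIndexLayoutVersion, indexNeedsRebuild) are hypothetical, and how the stamp is stored next to the index is left open.

```cpp
#include <cassert>
#include <string>

// Hypothetical constant: increase it whenever the way indexes are created changes.
const int kIndexLayoutVersion = 1;

// Hypothetical record saved alongside each module's index when it is built.
struct IndexStamp {
    std::string moduleVersion;  // version of the module the index was built from
    int layoutVersion;          // index layout version in use at build time
};

// True if the stored index is stale: the installed module has a different
// version, or the index layout has changed since the index was built.
bool indexNeedsRebuild(const IndexStamp& stored,
                       const std::string& installedModuleVersion) {
    return stored.moduleVersion != installedModuleVersion
        || stored.layoutVersion != kIndexLayoutVersion;
}
```

The check is cheap enough to run before every search, so a stale index would be caught the first time the user searches after an update.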
> >
> >
> >>Also, what about Bibles?
> >>Their indices are not going to change. Should we distribute index files
> >>with the modules? The user wouldn't have to build at all!
> >
> >
> > This is not possible, because Crosswire distributes the module files, and
> > we'll likely use a different index format than other Sword frontends. So I
> > guess we'll have to take care of it.
> >
> > How long does it take? How big do they get?
> >
> >
> >>5. Analyzers. It seems that there are many different Analyzers that can
> >>be used to build an index. (Some that differentiate between lower and
> >>uppercase, some that take into account grammar rules for certain
> >>languages, etc.) Do we want this flexibility extended to the user? Or
> >>do we just use the simple analyzer which simply breaks up words?
> >
> >
> > I don't know, have to read more. Perhaps we should start with the simple one?
> >
> >
> >>6. Exceptions. When building my search in, CLucene code complained that
> >>C++ exceptions were turned off and CLucene requires them on. Was there
> >>a reason for them being turned off?
> >
> >
> > Joachim, can you say something about it?
> >
> >
> >>I think that's it for the moment... I'll try to send status updates
> >>more often :)
> >
> >
> >>[...]
> >
> >
> > Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3
> > release which just came out. Feel free to start working in cvs HEAD. Should
> > we need to make more bugfix releases in the meantime, we can create a branch
> > and work there. Once this works well and is documented, we can release 1.6.
> >
> >
> > So much for now.
> >
> > mg
> >
> >
> >>Thanks,
> >>
> >>Lee C.
> >>
> >>Martin Gruner wrote:
> >>
> >>>Hey Lee,
> >>>
> >>>just wanted to ask how you are and about your progress with BibleTime
> >>>coding / investigation. Here's a nice clucene-based project that I just
> >>>found: http://kioclucene.objectis.net/ (just as a demonstration).
> >>>
> >>>I hope you, Anna and your dear wife are doing well,
> >>>
> >>>Martin
--
A: No.
Q: Should I include quotations after my reply?
A. Because it breaks the logical sequence of discussion
Q. Why is top posting bad?