[bt-devel] Re: BibleTime
Daniel Glassey
dglassey at gmail.com
Sat Dec 17 08:22:46 MST 2005
On 17/12/05, Lee Carpenter <elc at carpie.net> wrote:
> I saw that SWORD had a clucene option to the search. Do you know which
> CLucene API it expects? (0.8.x or 0.9.x)
Sword 1.5.8 uses 0.8.x. The stuff I am doing for svn is for 0.9.x.
> CLucene 0.9.x series claims
> that it uses UCS2 internally. My inspection of it shows that it uses
> TCHAR which turns to wchar_t if UNICODE is defined during the build and
> a simple char otherwise. On Windows, wchar_t is 2 bytes and
> would essentially be UCS2. On Linux, however, wchar_t is 4 bytes
> and would be UCS4.
Well, my understanding is that CLucene puts UCS2 data into the
wchar_t. So where wchar_t is 4 bytes it just wastes space rather than
actually being UCS4. I don't know if it uses wchar functions like
wcslen; that would get confused at high codepoints. AFAIU you could
theoretically put UTF-8 into wchar_t if you really wanted, but a lot of
space would be wasted.
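
To make the width question concrete, here is a minimal C++ sketch. The helper names (fits_in_ucs2, utf16_units, padding_bytes) are made up for illustration and are not part of CLucene or SWORD. It shows which code points fit in one 16-bit UCS2 value, how many 16-bit units a code point needs, and how many bytes go unused when 16-bit data is stored one value per wchar_t.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the code point can be stored as a single
// 16-bit UCS2 value (Basic Multilingual Plane, excluding surrogates).
bool fits_in_ucs2(std::uint32_t cp) {
    return cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: number of 16-bit units needed to hold one code point
// as UTF-16. Anything above U+FFFF needs a surrogate pair, so code that
// counts array elements (as wcslen does) counts 2 for a single character.
std::size_t utf16_units(std::uint32_t cp) {
    return cp > 0xFFFF ? 2 : 1;
}

// Hypothetical helper: bytes left unused when n 16-bit values are stored one
// per wchar_t. This is 0 where wchar_t is 2 bytes and 2*n where it is 4 bytes.
std::size_t padding_bytes(std::size_t n) {
    return sizeof(wchar_t) > 2 ? n * (sizeof(wchar_t) - 2) : 0;
}
```

For example, U+00E9 fits in UCS2 and takes one unit, while U+1D11E does not fit and takes two units as UTF-16.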
> That is why I used the conversion functions which
> theoretically would handle either the 2-byte or 4-byte wchar_t.
>
> CLucene is working for me currently, but my language doesn't make use of
> many non-ASCII characters anyway, so I can't say at this point that it
> works correctly for wide characters. It should work (using the
> conversion routines) unless somewhere in CLucene they make assumptions
> about the width of wchar_t. Based on the way wchar_t is defined (or not
> defined as the case may be) they should not.
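
A conversion that handles either wchar_t width might look roughly like the following C++ sketch. This is not the conversion code discussed in this thread, which is not shown here; utf8_to_wide and append_code_point are hypothetical names. It assumes a 2-byte wchar_t should receive UTF-16 (with surrogate pairs) and a wider wchar_t should receive one code point per element. Validation is simplified: overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one code point, using a surrogate pair only
// when wchar_t is 2 bytes wide and the code point is above U+FFFF.
void append_code_point(std::wstring& out, std::uint32_t cp) {
    if (cp > 0x10FFFF) cp = 0xFFFD;  // out of Unicode range: replacement char
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<wchar_t>(cp));
    }
}

// Hypothetical helper: decode UTF-8 into a wchar_t string. Bytes that cannot
// start a sequence, or sequences with a bad continuation byte, yield U+FFFD.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(static_cast<wchar_t>(0xFFFD)); ++i; continue; }

        if (extra > in.size() - i - 1) {  // truncated sequence at end of input
            out.push_back(static_cast<wchar_t>(0xFFFD));
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(static_cast<wchar_t>(0xFFFD)); ++i; continue; }
        append_code_point(out, cp);
        i += extra + 1;
    }
    return out;
}
```

With this approach, a character above U+FFFF comes out as two elements where wchar_t is 2 bytes and as one element where it is 4 bytes, so any code downstream that assumes one element per character only holds for the wider type.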
>
> If you like, I can take a look at the SWORD built-in clucene search as
> well...
If you like I can send you my patch offlist.
Regards,
Daniel
> Daniel Glassey wrote:
> > On 16/12/05, Troy A. Griffitts <scribe at crosswire.org> wrote:
> >
> >>Hey guys,
> >> Just a quick note. Are you all aware that SWORD does expose clucene
> >>searching in the API? We have an interface to query if indexes have
> >>been created, and also to ask them to be created (reporting status) if
> >>they have not been.
> >>
> >> Also, it is my impression that clucene does not yet work correctly with
> >>wide characters (wchar_t is also different sizes on different platforms
> >>(as previously below) and does not conform to any standard).
> >
> >
> > Have you tried it out? My impression is that they are just putting 16
> > bits of data into whatever wchar_t is but I haven't tested it yet so I
> > don't know if it works.