[bt-devel] Re: BibleTime
Lee Carpenter
elc at carpie.net
Sat Dec 17 20:55:24 MST 2005
Daniel Glassey wrote:
>
> Well, my understanding is that CLucene puts UCS2 data into the
> wchar_t. So it just wastes space rather than actually being UCS4. I
> don't know if it uses wchar functions like wcslen - that would get
> confused at high codepoints. Afaiu theoretically you could put UTF-8
> into wchar if you really wanted but it would be a lot of space wasted.
>
This is my understanding as well; I just didn't state it clearly. UCS2
can never carry full UCS4 information, since it "throws away" anything
above 16 bits. If it were UTF-16, there would be a "correct" conversion
to UCS4 (via surrogate pairs). But I agree: as it stands, it's UCS2 in
a 4-byte variable. It seems to me that CLucene's native platform was
Windows, where wchar_t is 2 bytes, and that Linux support was added
later. Using wchar_t was probably the easiest way to get UCS2 support.
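
To make the UCS2 versus UTF-16 distinction concrete, here is a minimal
sketch of the "correct" UTF-16 to UCS4 conversion. The function name
utf16_to_ucs4 is hypothetical and is not taken from CLucene, Qt, or
Sword. Replacing unpaired surrogates with U+FFFD is an assumption of
this sketch, not something the thread specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into UCS4 code points.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::vector<std::uint32_t> utf16_to_ucs4(const std::vector<std::uint16_t>& in) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                const std::uint32_t low = in[i + 1];
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                ++i;  // the low surrogate has been consumed
            } else {
                out.push_back(0xFFFD);
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            out.push_back(0xFFFD);
        } else {
            // Plain 16-bit value: identical in UCS2, UTF-16, and UCS4.
            out.push_back(unit);
        }
    }
    return out;
}
```

A pure UCS2 reader has no such pairing step, so it can only ever hand
back the 16-bit values it was given, regardless of how wide the
variable holding them is.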
>>>> Also, it is my impression that clucene does not yet work correctly with
>>>>wide characters (wchar_t is also different sizes on different platforms
>>>>(as previously below) and does not conform to any standard).
>>>
>>>
>>>Have you tried it out? My impression is that they are just putting 16
>>>bits of data into whatever wchar_t is but I haven't tested it yet so I
>>>don't know if it works.
In my testing, it works fine going from Qt's UCS-2 to UTF-8 and then
from UTF-8 to wchar_t. All the data I've tested is English, so it's all
8-bit data, but nothing gets mangled or transposed. I don't know
whether the conversions from UTF-8 to wchar_t are correct for wider
characters. Perhaps someone who routinely uses 16-bit (or greater)
characters could test this?
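
For anyone who wants to try this, a rough sketch of the kind of check
meant here follows. It is a stand-alone UTF-8 decoder, not CLucene's
own conversion routine; the name utf8_to_wide is hypothetical. It
assumes UCS2-style storage, so it reports failure for anything above
U+FFFD's plane (code points over 0xFFFF), and it does not reject
overlong encodings. The idea is to feed the same non-English input to
both this and the real conversion and compare the results.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into one wchar_t per code point.
// Returns false on malformed input or on code points above 0xFFFF,
// which cannot be stored as a single UCS2 value.
bool utf8_to_wide(const std::string& in, std::wstring& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead

        if (i + extra >= in.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0xFFFF) return false;  // does not fit in UCS2
        out.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return true;
}
```

Hebrew and Greek text are natural test inputs, since every letter is a
two-byte UTF-8 sequence that should come out as a single 16-bit value.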
Lee C.