No subject
Thu Oct 29 11:12:40 MST 2009
It supports conversion of UTF-8 to a 32-bit Unicode character stream on
Linux (and other platforms that define wchar_t as 32 bits) just fine.
It will simply not work on Windows for code points above U+FFFF, which
do not fit in a 16-bit wchar_t.
I base this conclusion on the implementation of this function:
size_t lucene_utf8towc(wchar_t *pwc, const char *p, size_t n)
{
    int i, mask = 0;
    int result;
    unsigned char c = (unsigned char) *p;
    int len = 0;

    UTF8_COMPUTE (c, mask, len);
    if (len == -1)
        return 0;
    UTF8_GET (result, p, i, mask, len);

    *pwc = result;
    return len;
}
Notice that it assigns to *pwc (wchar_t) the value of result (int).
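
To illustrate the effect of that assignment when the destination is only
16 bits wide, here is a minimal sketch. It is not clucene code: the names
wchar16_demo and store_code_point are made up for this example, and
uint16_t stands in for a 16-bit wchar_t such as the one on Windows.

```c
#include <stdint.h>

/* Stand-in for a 16-bit wchar_t (assumption for illustration only). */
typedef uint16_t wchar16_demo;

/*
 * Mimics "*pwc = result" when *pwc is 16 bits wide: the int is converted
 * modulo 2^16, so any bits above bit 15 are silently discarded.
 */
static wchar16_demo store_code_point(int result)
{
    return (wchar16_demo) result;
}
```

With this stand-in, store_code_point(0x20AC) keeps its value, but
store_code_point(0x10400) comes back as 0x0400, which is a different
character altogether.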
Not sure what we should do about this.
We can use our methods to convert UTF-8 to UTF-32 (a.k.a. UCS-4) and
send that to clucene, which should work fine for clucene on systems that
define wchar_t as 32 bits, but will fail miserably on Windows.
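
One possible direction, sketched here only as an assumption about what a
fix could look like: on platforms with a 16-bit wchar_t, store code points
above U+FFFF as a UTF-16 surrogate pair instead of a single unit. The
helpers encode_utf16 and codepoint_to_wchar below are hypothetical and are
not part of clucene or SWORD. Whether clucene would handle surrogate pairs
correctly downstream is a separate question that this does not answer.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/*
 * Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF.
 */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;          /* BMP: one unit is enough */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/*
 * Hypothetical helper: store one code point as wchar_t units.
 * out must have room for 2 units. Returns the number of units written,
 * or 0 for an invalid code point.
 */
static size_t codepoint_to_wchar(uint32_t cp, wchar_t *out)
{
    uint16_t units[2];
    size_t n, i;

    if (sizeof(wchar_t) >= 4) {
        /* 32-bit wchar_t: the code point fits in a single unit. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        out[0] = (wchar_t) cp;
        return 1;
    }

    /* 16-bit wchar_t: fall back to UTF-16 units. */
    n = encode_utf16(cp, units);
    for (i = 0; i < n; i++)
        out[i] = (wchar_t) units[i];
    return n;
}
```

Note that a caller would then need to account for one character taking up
to two wchar_t slots, so a single-wchar_t output parameter like the one in
lucene_utf8towc would not be enough as it stands.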
Maybe we can get the clucene folks' opinion on this? Maybe I've
completely misunderstood the situation; otherwise, maybe we can offer to
clean this up for them.
Troy
Matthew Talbert wrote:
> OK, I am still not understanding why there is an issue, or what the
> real cause of the issue is. However, this line I think will work:
>
> const unsigned int MAX_CONV_SIZE = 65536 * sizeof(wchar_t) * sizeof(wchar_t);
>
> If somebody can come up with an actual explanation for why there is a
> problem, and a non-hackish solution, that would be great.
>
> Just for the record, wchar_t is 16 bits on win32 and 32 bits on *nix.
> So, if I'm thinking correctly (and I won't guarantee that right now),
> this should give the equivalent of 1024 * 1024.
>
> Matthew