[sword-devel] indexed search discrepancy

Fri Aug 28 15:38:16 MST 2009

TCHAR is even more ambiguous than wchar_t. If UNICODE is defined, then
TCHAR is wchar_t; otherwise, it is plain char. I'm away from my
computer, but clucene is definitely converting to UTF-16 or UTF-32
depending on platform, so I think it is always proper Unicode. One way
or another, the field needs to be converted to a wchar_t buffer
containing UTF-16/32.
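
A minimal sketch of what such a conversion could look like, assuming the
target is a wchar_t string that holds UTF-16 where wchar_t is 2 bytes and
UTF-32 where it is 4 bytes. The names decodeUtf8 and utf8ToWide are
hypothetical and are not taken from clucene or SWORD; the sketch also
assumes well-formed UTF-8 input and does no validation. Note that UTF-16
encodes code points above U+FFFF as a surrogate pair (two 16-bit units).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not clucene's or SWORD's code): decode the UTF-8
// sequence starting at s[i], advance i past it, and return the code point.
// Assumes well-formed input; malformed bytes are not detected.
static std::uint32_t decodeUtf8(const std::string &s, std::size_t &i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                       // single-byte (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    std::uint32_t cp = c & (0x3F >> extra);       // payload bits of lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: convert UTF-8 to a wide string with no fixed-size
// buffer. Emits UTF-32 when wchar_t is 4 bytes, UTF-16 when it is 2 bytes.
std::wstring utf8ToWide(const std::string &utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::uint32_t cp = decodeUtf8(utf8, i);
        assert(cp <= 0x10FFFF);                   // largest Unicode code point
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // Outside the BMP on a 16-bit wchar_t: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Whether the result can be handed to doc->Add unchanged depends on how
TCHAR is defined in the clucene build in question, which this sketch
does not settle.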

On 8/28/09, Troy A. Griffitts <scribe at crosswire.org> wrote:
> Thanks again Matthew.  Writing quick for lack of time right now.
>
> In general, we avoid the use of wchar_t because it is defined differently
> on different systems, making its intended use (as a Unicode character
> holder) at worst essentially useless for anything other than UTF-16, and
> at the very least confusing and ambiguous.
>
> I could probably look this up, but since you know where everything is in
> clucene by now...
>
> What EXACTLY is TCHAR defined as (i.e. what is sizeof(TCHAR))?  Same on
> all platforms?
>
> What does lucene_utf8towc return? TCHAR? wchar_t?
>
> What I'm trying to determine is:
>
> Is clucene expecting UTF-16
> (which represents code points up to U+FFFF in a single 2-byte unit and
> uses a 4-byte surrogate pair for anything above that)?
>
> ... or is clucene just saying 16 bits of unicode glyph space is good
> enough for government work; we're not gonna worry about the rest?
>
> From the prose in the definition of the method you gave, it sounds like
> knowing the sizeof the return value for lucene_utf8towc might tell us
> the answer.
>
> Thanks again for doing the legwork.
>
> 	-Troy.
>
>
>
>
> Matthew Talbert wrote:
>>>> We have methods to convert to both UTF-16 and UTF-32 in our engine,
>>>> which don't need a fixed length buffer, so I would like to replace:
>>>>
>>>> lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
>>>>
>>>> with a call to our code, if we can nail down exactly what clucene wants
>>>> in the resultant wcharBuffer
>>
>> lucene_utf8towcs calls lucene_utf8towc for every character; the
>> comment on the function is this:
>>
>> /**
>>  * lucene_utf8towc:
>>  * @p: a pointer to Unicode character encoded as UTF-8
>>  *
>>  * Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
>>  * If @p does not point to a valid UTF-8 encoded character, results are
>>  * undefined. If you are not sure that the bytes are complete
>>  * valid Unicode characters, you should use lucene_utf8towc_validated()
>>  * instead.
>>  *
>>  * Return value: the resulting character
>>  **/
>>
>> The call to doc->Add actually expects a TCHAR, so if your utf8 to
>> utf16 conversion can produce a TCHAR, then that's all that would be
>> necessary I think.
>>
>> Matthew
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page