[sword-devel] conf utf-8

Mon Feb 14 16:01:01 MST 2005

UTF-8 is a stream of bytes, so it has no endianness. Big- vs. 
little-endian indicates whether you store the bytes of a 2+ byte number 
starting with the high- or low-order byte.
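
To make the distinction concrete, here is a minimal Python sketch (not Sword code; the value and the sample character are arbitrary choices of mine). It packs one 4-byte number both ways, then shows that UTF-8 gives a single byte sequence while UTF-16 depends on the byte order chosen.

```python
import struct

value = 0x0A0B0C0D  # arbitrary 4-byte number, chosen only for illustration

little = struct.pack("<I", value)  # little-endian: low-order byte first
big = struct.pack(">I", value)     # big-endian: high-order byte first

char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, an arbitrary sample

# UTF-8 is defined directly as a byte sequence, so there is only one result.
utf8 = char.encode("utf-8")

# UTF-16 uses 2-byte code units, so a byte order has to be picked.
utf16_le = char.encode("utf-16-le")
utf16_be = char.encode("utf-16-be")

print(little.hex(), big.hex())
print(utf8.hex(), utf16_le.hex(), utf16_be.hex())
```

The two packed forms hold the same four bytes in opposite order, and the same is true of the two UTF-16 forms.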

You can use a BOM in any Unicode encoding (UTF-7, UTF-8, UTF-16BE, 
UTF-16LE, UTF-32BE, or UTF-32LE) since it encodes both the endianness of 
the stream and the encoding used, even though the endianness is not 
relevant to UTF-7 or UTF-8.
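
As a rough illustration of how the BOM identifies the encoding, the sketch below encodes U+FEFF in several encodings and matches the resulting signatures against the start of a buffer. The function name `detect_encoding` is hypothetical, and UTF-7 is left out because its signature has several variants.

```python
BOM = "\ufeff"  # the byte-order mark code point

# Byte signature produced by encoding U+FEFF in each encoding.
SIGNATURES = {
    name: BOM.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}


def detect_encoding(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none.

    Hypothetical helper for illustration only. Longer signatures are tried
    first because the UTF-32LE signature (FF FE 00 00) begins with the
    UTF-16LE signature (FF FE).
    """
    by_length = sorted(SIGNATURES.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, signature in by_length:
        if data.startswith(signature):
            return name
    return None


print(SIGNATURES["utf-8"].hex())  # the three bytes seen at the top of the file
print(detect_encoding(b"\xef\xbb\xbf[Section Name]"))
```

A UTF-16LE file whose first character after the BOM is U+0000 would be misread as UTF-32LE by this check, which is an accepted ambiguity of BOM sniffing.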

In any case, we decided years ago that we would use UTF-8 in Sword. And, 
where endianness matters (namely indexes stored in files), we use 
little-endian numbers because development started and continues to 
primarily take place on Intel architecture processors, which are 
little-endian. So, for us, the BOM is irrelevant unless we want to 
support .conf files in UTF-16--but I still think we should allow for its 
presence.
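
Allowing for the BOM could be as simple as checking for it and throwing it away before looking for the section header. This is only a sketch in Python, not our actual .conf reader; `strip_utf8_bom` and `first_section_name` are hypothetical names.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def first_section_name(data: bytes):
    """Return the first "[Section Name]" header in a .conf-style buffer.

    Hypothetical stand-in for a real parser: it only looks for the first
    line of the form [name] after removing any BOM.
    """
    text = strip_utf8_bom(data).decode("utf-8")
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            return line[1:-1]
    return None


with_bom = codecs.BOM_UTF8 + b"[Section Name]\nKey=Value\n"
print(first_section_name(with_bom))
```

Files without a BOM pass through `strip_utf8_bom` unchanged, so the check costs nothing for existing .conf files.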

I'm not sure what you mean by "Windows does it backward from the rest of 
the world." My guess is that Java uses big-endian because Sun's 
processors are big-endian, so they kept that endianness in the Java 
platform. Endianness is normally associated with processor 
architectures, though, so Intel is little-endian, while MIPS & PowerPC 
can run either way but are usually big-endian. As far as I know, Windows 
NT ran in little-endian mode even on its non-Intel platforms. Linux 
certainly uses native endianness.

--Chris

DM Smith wrote:
> UTF-8 has big and little endian byte orderings.
> If there is no byte mark, it will be significant to use a particular 
> byte ordering (either little-endian or big-endian).
> If there is a BOM, then it can be interrogated and the UTF can be 
> interpreted in either fashion.
> Even so, I think that it would be best to settle upon a particular byte 
> ordering.
> Windows does it backward from the rest of the world.
> 
> Chris Little wrote:
> 
>>
>>
>> Troy A. Griffitts wrote:
>>
>>>     My guess about the characters which keep the .conf file from 
>>> being recognized... try adding a few newlines to the beginning of the 
>>> file.  I would guess that XXX[Section Name] at the beginning is just 
>>> causing our .conf reader to not recognize the "Section Name".
>>
>>
>>
>> The three characters are the Unicode byte-order mark (BOM). See 
>> http://www.unicode.org/faq/utf_bom.html#BOM for full details. But, 
>> basically, it's the codepoint U+FEFF, encoded at the beginning of a 
>> file. From this character, you can tell whether you have UTF-16 
>> big-endian, UTF-16 little-endian, or UTF-8.
>>
>> I would recommend we go ahead and support it (to the extent that we 
>> check for it and throw it away) since it's not something that just 
>> Notepad adds to files. (No need to fix before the trip, though, I think.)
>>
>> --Chris
>>
>> _______________________________________________
>> sword-devel mailing list
>> sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>