[sword-devel] conf utf-8
Chris Little
chrislit at crosswire.org
Mon Feb 14 16:01:01 MST 2005
UTF-8 is a stream of bytes, so it has no endianness. Big vs. little
endian indicates whether you store the bytes of a 2+ byte number
starting with the low- or high-order byte.
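As a rough illustration of the difference (a small Python sketch, not Sword code; the value 0x0102 is just an arbitrary example), a 2-byte number has two possible byte layouts, while the UTF-8 encoding of U+FEFF has only one:

```python
import struct

value = 0x0102  # an arbitrary 2-byte number

# Same number, two byte layouts.
little = struct.pack("<H", value)  # low-order byte first
big = struct.pack(">H", value)     # high-order byte first

# U+FEFF as UTF-16 depends on the byte order chosen...
utf16le = "\ufeff".encode("utf-16-le")
utf16be = "\ufeff".encode("utf-16-be")

# ...but as UTF-8 it is always the same three bytes.
utf8 = "\ufeff".encode("utf-8")

print(little.hex(), big.hex(), utf16le.hex(), utf16be.hex(), utf8.hex())
```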
You can use a BOM in any Unicode encoding (UTF-7, UTF-8, UTF-16BE,
UTF-16LE, UTF-32BE, or UTF-32LE) since it encodes both the endianness of
the stream and the encoding used, even though the endianness is not
relevant to UTF-7 or UTF-8.
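A minimal sketch of how a reader might identify the encoding from the leading bytes, assuming Python and its standard codecs constants. The function name detect_bom is made up for this example, and UTF-7 is left out to keep it short:

```python
import codecs

# Longer marks are listed first, because the UTF-32LE mark (FF FE 00 00)
# begins with the UTF-16LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for mark, name in _BOMS:
        if data.startswith(mark):
            return name, len(mark)
    return None, 0
```

One caveat with this approach: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is rarely a problem for text files.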
In any case, we decided years ago that we would use UTF-8 in Sword. And,
where endianness matters (namely indexes stored in files), we use
little-endian numbers because development started and continues to
primarily take place on Intel architecture processors, which are
little-endian. So, for us, the BOM is irrelevant unless we want to
support .conf files in UTF-16--but I still think we should allow for its
presence.
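Allowing for its presence could be as simple as the following Python sketch. This is not the actual Sword .conf reader; the helper names strip_utf8_bom and first_section_name are hypothetical, and the section parsing is deliberately simplified:

```python
import codecs


def strip_utf8_bom(raw):
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return raw unchanged."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def first_section_name(raw):
    """Return the name inside the first [Section Name] line, or None if there is none."""
    text = strip_utf8_bom(raw).decode("utf-8")
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            return line[1:-1]
    return None
```

With the BOM discarded first, a file that begins with the three BOM bytes followed by [Section Name] yields the same section name as one without them.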
I'm not sure what you mean by "Windows does it backward from the rest of
the world." My guess is that Java uses big-endian because Sun's
processors are big-endian, so they kept that endianness in the Java
platform. Endianness is normally associated with processor
architectures, though, so Intel is little-endian but MIPS & PowerPC are
big-endian. I would guess that Windows NT on non-Intel platforms was
big-endian. Linux certainly uses native endianness.
--Chris
DM Smith wrote:
> UTF-8 has big and little endian byte orderings.
> If there is no byte order mark, it will be significant to use a particular
> byte ordering (either little-endian or big-endian).
> If there is a BOM, then it can be interrogated and the UTF can be
> interpreted in either fashion.
> Even so, I think that it would be best to settle upon a particular byte
> ordering.
> Windows does it backward from the rest of the world.
>
> Chris Little wrote:
>
>>
>>
>> Troy A. Griffitts wrote:
>>
>>> My guess about the characters which keep the .conf file from
>>> being recognized... try adding a few newlines to the beginning of the
>>> file. I would guess that XXX[Section Name] at the beginning is just
>>> causing our .conf reader to not recognize the "Section Name".
>>
>>
>>
>> The three characters are the Unicode byte-order mark (BOM). See
>> http://www.unicode.org/faq/utf_bom.html#BOM for full details. But,
>> basically, it's the codepoint U+FEFF, encoded at the beginning of a
>> file. From this character, you can tell whether you have UTF-16
>> big-endian, UTF-16 little-endian, or UTF-8.
>>
>> I would recommend we go ahead and support it (to the extent that we
>> check for it and throw it away) since it's not something that just
>> Notepad adds to files. (No need to fix before the trip, though, I think.)
>>
>> --Chris
>>
>> _______________________________________________
>> sword-devel mailing list
>> sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>