FW: [osis-core] character counting issue: proposed solution
Patrick Durusau
osis-core@bibletechnologieswg.org
Tue, 18 Jun 2002 16:02:24 -0400
Harry,
I will be working through your post and the W3C's Character Model for
the World Wide Web (http://www.w3.org/TR/charmod/). If you have the
time, could you look at that and see how it would fit into an OSIS
recommendation for character counting? (I am not sure I can get to
it today.)
Thanks!
Patrick
Harry Plantinga wrote:
>Yesterday I posted a problem with counting characters in
>Unicode: some accented characters can be encoded in different
>ways, and those encodings contain different numbers of characters.
>
>Here is a proposed solution.
>
>1. The _official_ character count is in a normalized version
>of the text, which uses minimal-length encodings of all
>characters.
>
>2. For a given grain, e.g. @char:52(Hello world!), if the
>52nd character isn't the start of the string "Hello world!",
>point to the first occurrence of "Hello world!" after the 52nd
>character.
>
>3. Recommend that accents and other modifiers not be counted when
>counting characters. This may slightly underestimate the number of
>Unicode characters if some accented combinations have no
>single-character representation, but in conjunction with (2) above
>it will normally give the right result, especially if the string
>is unique.
>
>4. For people who don't like counting characters and can identify
>unique strings, allow @char:0(Hello world!). Actually, this
>is implied by 3 above.
>
>5. (Extra credit). Allow @(Hello world!) as a shortcut for
>@char:0(Hello world!).
>
>-Harry
>
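A rough sketch of how points 1 through 5 might fit together, for
discussion only. It assumes that Unicode NFC is an acceptable stand-in
for the "minimal-length" normalized form, that "accents and other
modifiers" means characters in the Unicode mark categories, and that
the count is 1-based with 0 meaning "search from the start". The names
count_base_chars and resolve_grain are made up for this example and
are not part of any OSIS specification.

```python
import re
import unicodedata

# Hypothetical grain syntax: @char:N(string), or @(string) as a shortcut for N = 0.
GRAIN = re.compile(r"@(?:char:(\d+))?\((.*)\)", re.DOTALL)


def is_modifier(ch: str) -> bool:
    """Treat any Unicode mark (categories Mn, Mc, Me) as an uncounted modifier."""
    return unicodedata.category(ch).startswith("M")


def count_base_chars(text: str) -> int:
    """Count characters in the NFC-normalized text, skipping modifiers."""
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 for ch in normalized if not is_modifier(ch))


def resolve_grain(grain: str, text: str):
    """Return the offset in the NFC-normalized text where the grain's string
    starts, searching from the Nth counted character onward, or None."""
    match = GRAIN.fullmatch(grain)
    if match is None:
        raise ValueError(f"not a recognized grain: {grain!r}")
    n = int(match.group(1) or 0)
    needle = unicodedata.normalize("NFC", match.group(2))
    haystack = unicodedata.normalize("NFC", text)

    # Offsets of every counted (non-modifier) character.
    starts = [i for i, ch in enumerate(haystack) if not is_modifier(ch)]
    if n > len(starts):
        return None
    begin = starts[n - 1] if n > 0 else 0

    # If the string does not start exactly at character N, take the first
    # occurrence after it (point 2).
    pos = haystack.find(needle, begin)
    return pos if pos >= 0 else None


if __name__ == "__main__":
    decomposed = "Cafe\u0301 au lait. Hello world!"   # e + combining acute
    precomposed = "Caf\u00e9 au lait. Hello world!"   # single e-acute
    print(len(decomposed), len(precomposed))
    print(count_base_chars(decomposed), count_base_chars(precomposed))
    print(resolve_grain("@char:15(Hello world!)", decomposed))
    print(resolve_grain("@(Hello world!)", precomposed))
```

With this approach the two encodings of the same text give the same
count and the same resolved position, and an undercounted or zero
offset still lands on the right string as long as it is unique after
that point.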
--
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu