FW: [osis-core] character counting issue: proposed solution
Steve DeRose
osis-core@bibletechnologieswg.org
Wed, 19 Jun 2002 21:15:51 -0400
At 10:57 AM -0400 06/18/02, Harry Plantinga wrote:
>Yesterday I posted a problem with counting characters in
>Unicode: some accented characters can be encoded in
>different ways that use different numbers of characters.
>
>Here is a proposed solution.
>
>1. The _official_ character count is in a normalized version
>of the text, which uses minimal-length encodings of all
>characters.
>
>2. For a given grain, e.g. @char:52(Hello world!), if the
>52nd character isn't the start of the string "Hello world!",
>point to the first occurrence of "Hello world!" after the 52nd
>character.
>
>3. Recommend not counting accents and other modifiers when
>counting characters. This may slightly underestimate the number of
>Unicode characters if some accented combinations have no
>single-character representation, but in conjunction with (2)
>above it will normally give the right result, especially if the
>string is unique.
>
>4. For people who don't like counting characters and can identify
>unique strings, allow @char:0(Hello world!). Actually, this
>is implied by 3 above.
>
>5. (Extra credit). Allow @(Hello world!) as a shortcut for
>@char:0(Hello world!).
>
>-Harry
Kind of nice. The ligature you cited later remains a pain, though.
I'm not sure dropping the offset helps much since, as you pointed out,
the string matching still has to assume the same encoding.
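To make the quoted proposal concrete, here is a rough Python sketch of rules 2 through 4. The function names are hypothetical, offsets are assumed to be 0-based, and "accents and other modifiers" is assumed to mean characters with a nonzero canonical combining class; the proposal itself defines none of these.

```python
import unicodedata


def base_offset_to_index(text: str, offset: int) -> int:
    """Return the string index reached after skipping `offset` base characters.

    Combining marks are not counted (rule 3, under the assumption above).
    """
    seen = 0
    for i, ch in enumerate(text):
        if unicodedata.combining(ch):
            continue  # accents and other combining marks are skipped
        if seen == offset:
            return i
        seen += 1
    return len(text)


def resolve_grain(text: str, offset: int, target: str) -> int:
    """Return the index of the first occurrence of `target` at or after
    the given base-character offset (rule 2), or -1 if there is none.

    An offset of 0 searches from the start of the text (rule 4).
    """
    start = base_offset_to_index(text, offset)
    return text.find(target, start)


composed = "caf\u00e9 Hello world!"     # e-acute as one code point
decomposed = "cafe\u0301 Hello world!"  # e followed by combining acute
```

With these definitions, `resolve_grain(composed, 5, "Hello world!")` and `resolve_grain(decomposed, 5, "Hello world!")` both land on the "H", even though the raw indexes differ. The match itself is still encoding-sensitive: `resolve_grain(decomposed, 0, "caf\u00e9")` finds nothing, which is the same-encoding assumption mentioned above.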
I see two other possible solutions:
1) Change from 'character' to 'code point' ('cp:') and say it's
defined to be stupid: just count and compare Unicode code points,
which are well-defined. This wouldn't work across systems that insist
on changing the representation of data they import, but I would hope
that isn't too bad a problem.
2) Insist on Unicode Normalization Form C (NFC). It is a pain to
implement, although there are probably utilities and source around
to do it.
I think by Occam's Razor I'd go for #1.
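For comparison, a small Python sketch of how the two options behave on the same text. The helper names and the 0-based offset convention are assumptions for illustration only; Python's `str` indexes by code point, and `unicodedata.normalize` is the standard library's normalization routine.

```python
import unicodedata


def cp_grain_matches(text: str, offset: int, target: str) -> bool:
    """Option 1: compare raw code points at a 0-based code point offset."""
    return text.startswith(target, offset)


def nfc_grain_matches(text: str, offset: int, target: str) -> bool:
    """Option 2: normalize text and target to Form C first, then compare.

    The offset is counted in the normalized text.
    """
    norm_text = unicodedata.normalize("NFC", text)
    norm_target = unicodedata.normalize("NFC", target)
    return norm_text.startswith(norm_target, offset)


composed = "caf\u00e9 au lait"     # e-acute as one code point
decomposed = "cafe\u0301 au lait"  # e followed by combining acute

# Option 1 depends on the stored representation:
print(len(composed), len(decomposed))
print(cp_grain_matches(composed, 5, "au"), cp_grain_matches(decomposed, 5, "au"))

# Option 2 gives the same answer for both representations:
print(nfc_grain_matches(composed, 5, "au"), nfc_grain_matches(decomposed, 5, "au"))

# Form C leaves the "fi" ligature (U+FB01) as a single code point:
print(unicodedata.normalize("NFC", "\ufb01") == "\ufb01")
```

Option 1 is a plain index and comparison, but the same offset points at different places if a system re-encodes the text on import. Option 2 is stable across composed and decomposed input, at the cost of a normalization pass, and it still does not unify the ligature with plain "fi".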
--
Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sderose@mac.com, sjd@stg.brown.edu