FW: [osis-core] character counting issue: proposed solution

Harry Plantinga osis-core@bibletechnologieswg.org
Tue, 18 Jun 2002 17:18:38 -0400


Patrick,

I looked through the link below, and I think the issue is even 
more difficult than I had realized. For example, ligatures can
represent two characters with a single glyph and a single byte
sequence. In fact, because of the various meanings of "character,"
the w3.org web page recommends against using the term "character"
at all, if possible.

They do refer to "Unicode Normalization Form C", however, which ensures
identical byte coding of the same set of characters. We could 
refer to it in our definition of character counting. But that would 
make counting characters accurately very difficult--it would 
require reading reams of information about Unicode Normalization
Form C just to figure out if "First" is 5 characters or 4 (because
of an Fi ligature).
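
To make the counting ambiguity concrete, here is a minimal sketch in
Python. It assumes "character" means "Unicode code point" and uses the
lowercase fi ligature U+FB01 as the example, since that is the ligature
Unicode actually encodes. One caveat worth checking against the spec:
in Python's unicodedata module, Form C leaves U+FB01 untouched, and
only the compatibility form (NFKC) expands it to "f" + "i".

```python
import unicodedata

plain = "first"        # f, i, r, s, t as five separate code points
ligated = "\ufb01rst"  # U+FB01 LATIN SMALL LIGATURE FI, then r, s, t

# Counting code points gives different answers for the "same" word.
print(len(plain), len(ligated))

# Canonical normalization (NFC) keeps the ligature as one code point;
# compatibility normalization (NFKC) splits it into "f" and "i".
nfc = unicodedata.normalize("NFC", ligated)
nfkc = unicodedata.normalize("NFKC", ligated)
print(len(nfc), len(nfkc))
```

So a count is only well defined once both the unit (code point, byte,
glyph) and the normalization form have been fixed.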

However, since the W3C recommends against even using the term
"character", we may want to consider whether that term is too fuzzy
to be used in a definition of a grain identifier. Besides, it's a
nuisance (and somewhat error-prone) to count characters even if the
meaning were clear. 

N.B. the same sorts of problems arise in string matching; "First"
may not match "First" if one of the strings uses a ligature.

I believe that this sort of thing will be a rare problem but one
that is very hard to solve correctly in all circumstances. The 
only way to solve it correctly that I can think of is to insist
that the texts and matching strings be in Unicode Normalization Form C,
however arcane that may be.  (Then we can ignore the issue entirely
and it will work right for us most of the time :-)
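
A rough sketch of what "insist on a normalization form before matching"
could look like, in Python. The helper name same_text is made up for
illustration; unicodedata.normalize is the standard library call. The
example shows a case Form C does fix (precomposed versus combining
accent) and a case it does not (the fi ligature, which needs the
compatibility form NFKC).

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper, not part of any OSIS specification.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"     # e-acute as one precomposed code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent

print(composed == decomposed)           # raw comparison fails
print(same_text(composed, decomposed))  # equal once both are in NFC

# Ligatures are compatibility characters, so NFC does not unify them.
print(same_text("first", "\ufb01rst"))               # still different
print(same_text("first", "\ufb01rst", form="NFKC"))  # equal under NFKC
```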

So here's a revised proposal:

- Drop the character count in the grain. Just use strings, with 
  an optional parameter for the occurrence number of the string.

  @(Hello world!)
  @37(Hello world!)  (37th occurrence. Or use other syntax.)

- Recognize that this will only work correctly if the strings 
  are encoded the same way. 
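
A minimal sketch of how the proposed grain syntax could be resolved,
in Python. Everything here beyond the two example grains above is an
assumption: the function name resolve_grain, the choice to normalize
both sides to Form C first, counting occurrences from 1, allowing
overlapping occurrences, and returning a code point offset into the
normalized text.

```python
import re
import unicodedata

# "@(string)" or "@N(string)"; N defaults to 1 when omitted.
GRAIN = re.compile(r"@(\d*)\((.*)\)", re.DOTALL)

def resolve_grain(grain: str, text: str, form: str = "NFC") -> int:
    """Return the offset of the Nth occurrence of the grain string.

    Hypothetical helper; the offset refers to the normalized text.
    """
    m = GRAIN.fullmatch(grain)
    if m is None:
        raise ValueError(f"not a grain identifier: {grain!r}")
    n = int(m.group(1) or 1)
    # Normalize both sides so differently encoded strings can match.
    needle = unicodedata.normalize(form, m.group(2))
    haystack = unicodedata.normalize(form, text)
    if n < 1 or not needle:
        raise ValueError("occurrence must be >= 1 and string non-empty")
    pos = -1
    for _ in range(n):
        # Step one position past the previous hit (overlaps allowed).
        pos = haystack.find(needle, pos + 1)
        if pos == -1:
            raise LookupError(f"fewer than {n} occurrences of {needle!r}")
    return pos

verse = "Hello world! Hello world!"
print(resolve_grain("@(Hello world!)", verse))
print(resolve_grain("@2(Hello world!)", verse))
```

Asking for an occurrence that does not exist (for example
"@37(Hello world!)" against the short text above) raises LookupError
rather than returning a guess.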

-Harry




-----Original Message-----
From: owner-osis-core@bibletechnologieswg.org
[mailto:owner-osis-core@bibletechnologieswg.org]On Behalf Of Patrick
Durusau
Sent: Tuesday, June 18, 2002 4:02 PM
To: osis-core@bibletechnologieswg.org
Subject: Re: FW: [osis-core] character counting issue: proposed solution


Harry,

I will be trying to work through your post and the W3C's position on the 
character set model (http://www.w3.org/TR/charmod/). If you have the 
time, can you look at that and see how it would fit into a 
recommendation from OSIS for character counting? (Not sure I can reach 
it today.)

Thanks!

Patrick