[sword-devel] Coming soon: new improved sword searching

Chris Little sword-devel@crosswire.org
Sun, 8 Sep 2002 21:31:44 -0700 (MST)


On Sun, 8 Sep 2002, Joel Mawhorter wrote:

> Wouldn't it make more sense to use UTF-16 than UTF-8 in regular expressions? 
> At least with UTF-16, in most cases, 1 character == 1 symbol so regular 
> expressions would be more manageable (e.g. what does a dot mean in a regular 
> expression when being matched against symbols that can be represented in 1, 2 
> or 3 chars?). Does ICU have regular expression support? I know the regular 
> expression support in Java 1.4 is very nice and uses UTF-16 but alas we can't 
> really use that in Sword unless we come up with a CNNI (C non-native 
> interface :-).

Nope.  Sword is entirely UTF-8 internally.  Perl just happens to be the 
same.  Perl has a nice regex implementation built on UTF-8.  In Perl, a 
dot means a character.  Regexes should operate on characters, not bytes, 
after all.  No, ICU doesn't have any regex support.  It's almost entirely 
devoted to i18n/l10n stuff, though it does have a simple I/O library.  
Using code that works with UTF-8 also benefits us by not requiring that we 
convert to/from UTF-16.
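
To illustrate the character-versus-byte point, here is a minimal sketch in 
Python (not Perl and not Sword code; the sample word and the variable names 
are assumptions chosen for illustration).  It matches a dot against the same 
text twice: once as a decoded string, where the dot consumes one whole 
character, and once as raw UTF-8 bytes, where the dot consumes only the first 
byte of a multi-byte character.

```python
import re

# Hypothetical sample: the Greek word "logos", written with escapes so the
# code points are unambiguous. Each letter takes 2 bytes in UTF-8.
text = "\u03bb\u03cc\u03b3\u03bf\u03c2"
encoded = text.encode("utf-8")

# Character-aware matching: "." consumes one whole character.
char_match = re.match(r".", text)

# Byte-oriented matching on the same data: "." consumes a single byte,
# which is only the first half of the first letter.
byte_match = re.match(rb".", encoded)

print(len(text), len(encoded))         # 5 10
print(repr(char_match.group(0)))       # the single character U+03BB
print(repr(byte_match.group(0)))       # b'\xce'
```

A regex engine that understands the encoding gives the first behaviour 
whether the text is stored as UTF-8 or UTF-16; the second is what happens 
when an engine treats UTF-8 data as plain bytes.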

--Chris