[sword-devel] diatheke search type regex and the dot ?

Troy A. Griffitts scribe at crosswire.org
Mon Mar 6 18:17:04 MST 2017


Yeah, so this page shows that C++11 regex is still mostly unsupported in gcc:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1

(see section 7)

And I don't think the old-school GNU regex we use otherwise knows 
anything about wide chars.  It simply compares bytes and doesn't have a 
clue whether some of them should be considered part of the same 
character.  I suspect that's because nowhere do we tell it that we're 
giving it UTF-8.
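
To make the byte-versus-character point concrete, here is a minimal sketch. It is in Python rather than our C++ and does not touch SWORD or GNU regex at all; it only assumes that a byte-oriented matcher behaves like Python's re module when given bytes patterns. The variable names are made up for the illustration.

```python
import re

EN_DASH = "\u2013"              # U+2013 EN DASH
text = "a" + EN_DASH + "b"      # a string of 3 characters
raw = text.encode("utf-8")      # the same text as 5 UTF-8 bytes

# The en dash is three bytes in UTF-8; shown in hex and in octal
# (octal is what "od -c" prints for bytes outside 7-bit ASCII).
dash_bytes = EN_DASH.encode("utf-8")
hex_form = [format(b, "02X") for b in dash_bytes]
octal_form = [format(b, "03o") for b in dash_bytes]

# Byte-oriented matching: '.' consumes exactly one byte, so a single
# dot cannot span the three-byte en dash, but three dots can.
byte_one_dot = re.fullmatch(rb"a.b", raw) is not None
byte_three_dots = re.fullmatch(rb"a...b", raw) is not None

# Character-oriented matching: '.' consumes one code point, so a
# single dot is enough once the matcher knows the text is Unicode.
char_one_dot = re.fullmatch(r"a.b", text) is not None

print(hex_form, octal_form)
print(byte_one_dot, byte_three_dots, char_one_dot)
```

If our byte-comparing regex behaves the same way, a pattern with a single dot would fail to match across an en dash, while the same pattern would succeed in a matcher that decodes UTF-8 first.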

Ultimately my hope is that gcc will improve eventually and solve our 
problem for us.

We could add an option to use ICU RegexMatcher, but I'm still holding 
out for our compiler.

Troy


On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
> On 03/06/2017 05:25 PM, Greg Hellings wrote:
>> being off by 2 would seem strange to me
> I don't understand this question at all.
>
> 0xE2 = 226 = 0342
> 0x80 = 128 = 0200
> 0x93 = 147 = 0223
>
> There's no off-by error at all.
>
> "od" is the "octal dump" tool; given -c, it tries to dump characters, 
> but outside 7-bit ASCII, it still dumps octal.
>
> For those familiar with dc(1), this will make sense
> $ dc
> 8o
> 226p
> 342
> 128p
> 200
> 147p
> 223
> 16i
> 0XE2p
> 342
> 0X80p
> 200
> 0X93p
> 223
>
> The interesting questions are why C++11 regex can't find /en dash/, 
> and why non-C++11 regex doesn't understand multibyte.
>
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
