[sword-devel] diatheke search type regex and the dot ?
Troy A. Griffitts
scribe at crosswire.org
Mon Mar 6 18:17:04 MST 2017
Yeah, so this page shows that C++11 regex is still mostly unsupported in gcc:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
(see section 7)
And the old school GNU regex we use otherwise I don't think knows
anything about wide chars. It simply compares bytes and doesn't have a
clue if some should be considered part of the same character. I suspect
that's because nowhere do we tell it that we're giving it UTF-8.
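
To make the byte-versus-character point concrete, here is a minimal sketch (not SWORD code; the helper name utf8_code_points is made up for illustration, and it assumes well-formed UTF-8 input). It shows that the en dash is three bytes but one code point, which is why a matcher that only sees bytes would treat a single "." as covering a third of the character:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: counts UTF-8 code points
// by counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII starts a new code point
        }
    }
    return count;
}

// The en dash (U+2013) as raw UTF-8 bytes: E2 80 93.
std::string en_dash_utf8() {
    return std::string("\xE2\x80\x93");
}
```

Comparing en_dash_utf8().size() with utf8_code_points(en_dash_utf8()) gives the byte count a byte-oriented regex works with versus the character count a user expects.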
Ultimately my hope is that gcc will improve eventually and solve our
problem for us. We could add an option to use ICU RegexMatcher, but I'm
still holding out for our compiler.
Troy
On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
> On 03/06/2017 05:25 PM, Greg Hellings wrote:
>> being off by 2 would seem strange to me
> I don't understand this question at all.
>
> 0xE2 = 226 = 0342
> 0x80 = 128 = 0200
> 0x93 = 147 = 0223
>
> There's no off-by error at all.
>
> "od" is the "octal dump" tool; given -c, it tries to dump characters,
> but outside 7-bit ASCII, it still dumps octal.
>
> For those familiar with dc(1), this will make sense
> $ dc
> 8o
> 226p
> 342
> 128p
> 200
> 147p
> 223
> 16i
> 0XE2p
> 342
> 0X80p
> 200
> 0X93p
> 223
>
> The interesting questions are why C++11 regex can't find /en dash/,
> and why non-C++11 regex doesn't understand multibyte.
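
The hex-to-octal correspondence in the quoted dc session can also be checked with a few lines of C++. This is only a sketch; the function name to_octal is invented for the example, and the leading 0 follows the notation used in the quoted message:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, for illustration only: formats one byte as a
// zero-prefixed, three-digit octal string, the way the values are
// written in the quoted message (e.g. 0xE2 becomes "0342").
std::string to_octal(unsigned char byte) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "0%03o", static_cast<unsigned>(byte));
    return std::string(buf);
}
```

Calling to_octal on 0xE2, 0x80 and 0x93 reproduces the three octal values listed for the en dash bytes.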
>
>