[sword-devel] usfm2osis.py
Chris Little
chrislit at crosswire.org
Sat Aug 4 15:55:00 MST 2012
On 08/04/2012 10:22 AM, David Haslam wrote:
> Wow!
>
> What Peter means is that after all the ASCII stuff (up to the tilde), these
> are also counted:
>
> 0E0030 14 TAG DIGIT ZERO
> 0E0031 11 TAG DIGIT ONE
> 0E0032 10 TAG DIGIT TWO
> 0E0033 7 TAG DIGIT THREE
> 0E0034 6 TAG DIGIT FOUR
> 0E0035 5 TAG DIGIT FIVE
> 0E0042 18 TAG LATIN CAPITAL LETTER B
> 0E0043 11 TAG LATIN CAPITAL LETTER C
> 0E0044 16 TAG LATIN CAPITAL LETTER D
> 0E0046 28 TAG LATIN CAPITAL LETTER F
> 0E0056 7 TAG LATIN CAPITAL LETTER V
> 0E0070 21 TAG LATIN SMALL LETTER P
>
>
> David
Yes, these are intended and fall under the following line of the guidelines:
Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to
simplify Regex processing
They are inserted at various division boundaries to simplify regexes:
the B tag marks book boundaries, C is for chapter, D is for div, F is
for footnote, V is for verse, and p (which still needs to be
capitalized) represents paragraphs. The digit tags represent section
levels, IIRC.
Unfortunately, no one includes these in fonts, much less keyboards, so
they're a pain to work with, but they simplify regexes so drastically
that they're worth it. And I consider the probability that anyone would
use them in USFM so slim that I'm willing to risk the possibility of
false positives in my regex matching.
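To illustrate the general idea (this is only a rough sketch, not the
actual code in usfm2osis.py), here is how tag characters at chapter and
verse boundaries let a plain negated character class stand in for
lookaheads. The function names, the sample text, and the choice to put
each tag in front of its USFM marker are all assumptions made for this
example.

```python
import re

# Unicode tag characters used as boundary markers (sketch only).
TAG_C = "\U000E0043"  # TAG LATIN CAPITAL LETTER C: chapter boundary
TAG_V = "\U000E0056"  # TAG LATIN CAPITAL LETTER V: verse boundary

# Hypothetical sample input, not taken from any real USFM file.
SAMPLE = (
    "\\c 1\n"
    "\\v 1 In the beginning\n"
    "\\v 2 And the earth\n"
    "\\c 2\n"
    "\\v 1 Thus the heavens\n"
)


def mark_boundaries(usfm):
    """Put a tag character in front of every \\c and \\v marker."""
    usfm = re.sub(r"(\\c\b)", TAG_C + r"\1", usfm)
    usfm = re.sub(r"(\\v\b)", TAG_V + r"\1", usfm)
    return usfm


def split_verses(marked):
    """Return (verse number, text) pairs from marked-up USFM.

    Each verse body is simply "anything that is not a boundary tag",
    so no lookahead for the next \\v or \\c marker is needed.
    """
    pattern = rf"{TAG_V}\\v (\d+) ([^{TAG_V}{TAG_C}]*)"
    return [(num, text.strip()) for num, text in re.findall(pattern, marked)]


def strip_tags(text):
    """Remove every character in the tag range U+E0000..U+E007F."""
    return re.sub("[\U000E0000-\U000E007F]", "", text)


if __name__ == "__main__":
    for num, text in split_verses(mark_boundaries(SAMPLE)):
        print(num, text)
```

Because the tags are stripped again before output, they only ever exist
in the intermediate text that the regexes operate on.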
--Chris