[sword-devel] usfm2osis.py

Sat Aug 4 15:55:00 MST 2012

On 08/04/2012 10:22 AM, David Haslam wrote:
> Wow!
>
> What Peter means is that after all the ASCII stuff (up to the tilde), these
> are also counted:
>
> 0E0030	󠀰	14	TAG DIGIT ZERO
> 0E0031	󠀱	11	TAG DIGIT ONE
> 0E0032	󠀲	10	TAG DIGIT TWO
> 0E0033	󠀳	7	TAG DIGIT THREE
> 0E0034	󠀴	6	TAG DIGIT FOUR
> 0E0035	󠀵	5	TAG DIGIT FIVE
> 0E0042	󠁂	18	TAG LATIN CAPITAL LETTER B
> 0E0043	󠁃	11	TAG LATIN CAPITAL LETTER C
> 0E0044	󠁄	16	TAG LATIN CAPITAL LETTER D
> 0E0046	󠁆	28	TAG LATIN CAPITAL LETTER F
> 0E0056	󠁖	7	TAG LATIN CAPITAL LETTER V
> 0E0070	󠁰	21	TAG LATIN SMALL LETTER P
>
>
> David

Yes, these are intended and fall under the following line of the guidelines:

Use & abuse Unicode tags (http://unicode.org/charts/PDF/UE0000.pdf) to 
simplify Regex processing

They are inserted at various division boundaries to simplify regexes. 
The B tag marks book boundaries, C is for chapter, D is for div, F is 
for footnote, and V is for verse. The p tag (which needs to be 
capitalized) represents paragraphs. The digit tags represent section 
levels, IIRC.
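
As a rough illustration of why a single marker character helps (this is 
a sketch, not the actual usfm2osis.py code): once every chapter and 
verse start is prefixed with its tag character, "everything up to the 
next boundary" becomes a simple negated character class. The function 
names mark_boundaries and split_verses, and the choice to mark only \c 
and \v, are assumptions made for this example.

```python
import re

# Unicode tag characters from the U+E0000 block. Their roles follow the
# description above; how they are placed here is an assumption.
TAG_C = "\U000E0043"  # TAG LATIN CAPITAL LETTER C: chapter boundary
TAG_V = "\U000E0056"  # TAG LATIN CAPITAL LETTER V: verse boundary


def mark_boundaries(usfm):
    """Prefix each \\c and \\v marker with its tag character."""
    usfm = re.sub(r"(\\c\s+\d+)", TAG_C + r"\1", usfm)
    usfm = re.sub(r"(\\v\s+\d+)", TAG_V + r"\1", usfm)
    return usfm


def split_verses(marked):
    """Return (verse number, text) pairs from text that has been marked.

    A verse body is any run of characters that are not boundary tags,
    so no lookahead for multi-character markers is needed.
    """
    pattern = re.compile(r"\\v\s+(\d+)\s+([^%s%s]*)" % (TAG_C, TAG_V))
    return [(int(num), body.strip()) for num, body in pattern.findall(marked)]


sample = (
    "\\c 1\n"
    "\\v 1 In the beginning\n"
    "\\v 2 And the earth\n"
    "\\c 2\n"
    "\\v 1 Thus the heavens\n"
)
verses = split_verses(mark_boundaries(sample))
```

Without the markers, the same split would need a lookahead for each 
marker that can end a verse; with them, one character class covers 
every boundary type that has been tagged.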

Unfortunately, no one includes these in fonts, much less keyboards, so 
they're a pain to work with, but they simplify regexes so drastically 
that they're worth it. And I consider the probability that anyone would 
use them in USFM so slim that I'm willing to risk the possibility of 
false positives in my regex matching.
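
For anyone who wants to rule out that risk for a particular input, a 
pre-check along the following lines would do it. This is a hypothetical 
helper, not something usfm2osis.py is known to contain; the name 
find_tag_characters is made up for this example.

```python
import re

# The whole Unicode Tags block, U+E0000 through U+E007F.
TAG_BLOCK = re.compile("[\U000E0000-\U000E007F]")


def find_tag_characters(text):
    """Return the sorted, distinct code points of tag characters in text."""
    return sorted({ord(ch) for ch in TAG_BLOCK.findall(text)})


clean = find_tag_characters("\\v 1 In the beginning")
dirty = find_tag_characters("a\U000E0056b\U000E0042c\U000E0056")
```

An empty result for the raw USFM means none of the marker characters 
are already present, so any that show up later were inserted by the 
conversion itself.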

--Chris