[osis-core] Thinking out loud on the regex changes

Patrick Durusau osis-core@bibletechnologieswg.org
Thu, 19 Feb 2004 17:44:20 -0500


Guys,

More thinking out loud than anything else but:

Post-prefix regex for osisGenRegex reads:

((((\p{L})|(\p{N})|_)+)(((\.(\p{L}|\p{N}|_)+)*))?)

That is:

(\p{L})|(\p{N})|_)+) = Any number of valid XML Name Start characters 
(but at least one)

(\. = Followed by a "."

(\p{L}|\p{N}|_)+ = Followed by any number of Letters or Numbers or "_"

*) = Repeat the "." and what follows it any number of times

))? = But, the "." and all that follows is optional
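
For anyone who wants to poke at this outside a schema validator, here is a rough Python sketch of the current post-prefix pattern. Python's `re` module has no `\p{L}` or `\p{N}`, so `\w` (Unicode letters, digits and "_") stands in for `\p{L}|\p{N}|_`. That substitution, the names `CURRENT_POST_PREFIX` and `matches_current`, and the sample values are my assumptions, not part of the schema. Schema patterns must match the whole value, hence `fullmatch`.

```python
import re

# Approximation of ((((\p{L})|(\p{N})|_)+)(((\.(\p{L}|\p{N}|_)+)*))?)
# Assumption: \w stands in for \p{L}|\p{N}|_ because re lacks \p{...} classes.
CURRENT_POST_PREFIX = re.compile(r"\w+(\.\w+)*")


def matches_current(value):
    """Return True if the whole value matches the current post-prefix pattern."""
    return CURRENT_POST_PREFIX.fullmatch(value) is not None


# Illustrative values only; none of these come from a real OSIS document.
for sample in ["Amos", "Amos.1.2", "Amos.", "Amos.1-2", ".1"]:
    print(sample, matches_current(sample))
```

Only the first two samples should match: a trailing ".", an unescaped "-", and a leading "." are all rejected.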

What we want is:

(\p{L})|(\p{N})|_)+) = Any number of Letters or Numbers or "_" (but at 
least one)

(\. = Followed by a "."

<!-- proposed change on next line -->

(In prose, any valid XML character but if it is one of the ones we have 
reserved, such as the "-" (hyphen) then it must be preceded with a "\" 
as an escape character.)

or,

(\p{L}|\p{N}|_|(\\[^\s]))+ = Followed by any number of Letters or 
Numbers or "_" or any XML character except whitespace, when preceded by 
a "\" as an escape character.

Note that this works in testing but looks inelegant.
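
A minimal sketch of the proposed pattern, for trying values by hand. As an assumption, `\w` approximates `\p{L}|\p{N}|_` (Python's `re` has no `\p{...}` classes) and `\S` plays the role of `[^\s]`. The names `PROPOSED_POST_PREFIX` and `matches_proposed` and the sample values are made up for illustration.

```python
import re

# Approximation of the proposed pattern: the first segment is unchanged, and
# after each "." a character is a letter, number, "_", or a "\" followed by
# any non-whitespace character.
# Assumption: \w stands in for \p{L}|\p{N}|_.
PROPOSED_POST_PREFIX = re.compile(r"\w+(\.(\w|\\\S)+)*")


def matches_proposed(value):
    """Return True if the whole value matches the proposed pattern."""
    return PROPOSED_POST_PREFIX.fullmatch(value) is not None


# Raw strings keep the backslashes exactly as they would appear in the XML.
for sample in [r"Amos.1\-2", "Amos.1-2", r"Amos\-1", r"Amos.1\ 2"]:
    print(sample, matches_proposed(sample))
```

Under these assumptions only the first sample matches: the unescaped hyphen, the escape before the first ".", and the escaped space are all rejected.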

Something keeps nagging at me about the addition to the expression.

Obviously cannot go prior to the "." since then it would be possible to 
have invalid characters in the start of the name.
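
To make the placement point concrete, here is a small sketch comparing a hypothetical variant that allows the escape everywhere with one that allows it only after a ".". Both patterns are approximations (`\w` stands in for `\p{L}|\p{N}|_`), and the names `ESCAPE_EVERYWHERE` and `ESCAPE_AFTER_DOT` are invented for this illustration.

```python
import re

# Hypothetical variant, NOT the proposal: the escape alternative is also
# allowed in the first segment. \w approximates \p{L}|\p{N}|_.
ESCAPE_EVERYWHERE = re.compile(r"(\w|\\\S)+(\.(\w|\\\S)+)*")

# Escapes allowed only after a ".", as in the proposed change.
ESCAPE_AFTER_DOT = re.compile(r"\w+(\.(\w|\\\S)+)*")

# Print each sample with the result for each pattern.
for sample in [r"\-Amos.1", r"Amos.\-1"]:
    print(
        sample,
        ESCAPE_EVERYWHERE.fullmatch(sample) is not None,
        ESCAPE_AFTER_DOT.fullmatch(sample) is not None,
    )
```

The first sample starts the name with an escaped hyphen: the "everywhere" variant accepts it, while the "after the dot" variant rejects it.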

Suppose this gets us close to validation since anything that is not a 
Letter or Number or "_" must use the escape character. Doesn't mean 
that you can't use the "\" before any other character but why would you?

Suppose it is that last case that is bothering me. What if I have:

"Amos.\1\2\3"

All of those are valid characters anyway. What does the escape character 
mean here?

Note that is a different case from: "Amos.\\1\\2\\3" which represents 
the back-slashes as literals, perhaps part of a Sword universal 
reference system. ;-) (Apologies to Troy/Chris)

Are these the processing rules?

1. Any Letter or Number or "_" underscore may be preceded by a single 
"\" but that is meaningless and should be discarded by the processor?

2. Any non-Letter/Number/"_" must be preceded by a single "\".

3. A "\" preceded by another "\" is a literal (implied by #2 but wanted 
to be explicit)
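
One way a processor might apply rules 1-3, as a sketch only. The function name `split_and_unescape` is hypothetical, `str.isalnum()` is used as a stand-in for "Letter or Number", and the code assumes the value has already passed the regex (it does not re-check where escapes may appear).

```python
def split_and_unescape(value):
    """Split on unescaped "." and drop escape characters per rules 1-3."""
    segments = [""]
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\":
            # Rules 1 and 3: whatever follows "\" is taken literally, be it
            # a letter, a number, "_", another "\", or a reserved character.
            if i + 1 >= len(value) or value[i + 1].isspace():
                raise ValueError("dangling or invalid escape at index %d" % i)
            segments[-1] += value[i + 1]
            i += 2
        elif ch == ".":
            # An unescaped "." separates segments.
            segments.append("")
            i += 1
        elif ch.isalnum() or ch == "_":
            segments[-1] += ch
            i += 1
        else:
            # Rule 2: anything else must have been escaped.
            raise ValueError("unescaped %r at index %d" % (ch, i))
    return segments


print(split_and_unescape(r"Amos.\1\2\3"))     # escapes discarded (rule 1)
print(split_and_unescape(r"Amos.\\1\\2\\3"))  # literal back-slashes (rule 3)
print(split_and_unescape(r"Amos.1\-2"))       # escaped hyphen (rule 2)
```

Returning a list of segments rather than one string keeps an escaped "\." distinguishable from a separator "." after the escapes are removed.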

Sorry to go on so but we are gaining traction and I don't want to create 
a problem in the regexes that will bite us just after a public release.

Comments?

I will try to generate the modified regexes later today and ship them 
out in my regex.xsd file for your testing enjoyment. Just has a few 
elements so you can put in testing values on the attributes.

BTW, does the style of explanation I used above look like something that 
would be useful in the user's manual for the regexes?

Hope everyone is having a great day!

Patrick

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
Patrick.Durusau@sbl-site.org
Chair, V1 - Text Processing: Office and Publishing Systems Interface
Co-Editor, ISO 13250, Topic Maps -- Reference Model

Topic Maps: Human, not artificial, intelligence at work!