[sword-devel] OSIS links
DM Smith
dmsmith at crosswire.org
Wed Jan 19 09:26:28 MST 2011
On 01/19/2011 09:18 AM, Karl Kleinpaste wrote:
> DM Smith<dmsmith at crosswire.org> writes:
>> The problem with spaces in an osisRef is that a space is defined as the
>> separator between one reference and another.
>> E.g. osisRef="Rom.1.1 Rom.2.1-Rom.3.1"
> Then I would like to think that either URL-standard '+' or hex-encoded
> "%20" would work in place of a real space.
Regarding the thread as a whole, it might be good to see the actual
definition.
The osisRef is defined by the following regular expression:
(((\p{L}|\p{N}|_)+)((\.(\p{L}|\p{N}|_)+)*)?:)?((\p{L}|\p{N}|_|(\\[^\s]))+)(\.(\p{L}|\p{N}|_|(\\[^\s]))*)*(!((\p{L}|\p{N}|_|(\\[^\s]))+)((\.(\p{L}|\p{N}|_|(\\[^\s]))+)*)?)?(@(cp\[(\p{Nd})*\]|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?))?(\-((((\p{L}|\p{N}|_|(\\[^\s]))+)(\.(\p{L}|\p{N}|_|(\\[^\s]))*)*)+)(!((\p{L}|\p{N}|_|(\\[^\s]))+)((\.(\p{L}|\p{N}|_|(\\[^\s]))+)*)?)?(@(cp\[(\p{Nd})*\]|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?))?)?
See below for a breakdown...
The OSIS manual is a bit inconsistent with this regex:
This regex allows for a single reference or a single contiguous
range. It does not allow for multiple references separated by
whitespace. The manual has at least one example of multiple references,
but it also clearly says multiple references are not allowed.
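To make the single-versus-multiple point concrete, here is a minimal Python sketch. It is not the schema regex verbatim: Python's standard re module has no \p{...} classes, so I am assuming \w is a close enough stand-in for (\p{L}|\p{N}|_) and \d for the numeric classes. The names OSIS_REF, is_osis_ref and split_refs are made up for illustration.

```python
import re

# Assumption: \w approximates (\p{L}|\p{N}|_) and \d approximates \p{Nd};
# \d is also used where the schema has the broader \p{N}.
NAME = r"\w"                    # letter, number or underscore
ESC = r"(?:\w|\\\S)"            # the above, or backslash + non-whitespace
ALNUM = r"[^\W_]"               # letter or number, no underscore

WORK = rf"(?:{NAME}+(?:\.{NAME}+)*:)?"
REF = rf"{ESC}+(?:\.{ESC}*)*"
EXT = rf"(?:!{ESC}+(?:\.{ESC}+)*)?"
GRAIN = rf"(?:@(?:cp\[\d*\]|s\[{ALNUM}+\](?:\[\d+\])?))?"

# One reference, optionally followed by "-" and a range end without work id.
OSIS_REF = re.compile(rf"{WORK}{REF}{EXT}{GRAIN}(?:-{REF}{EXT}{GRAIN})?")


def is_osis_ref(value: str) -> bool:
    """True if the whole string matches the approximated osisRef regex."""
    return OSIS_REF.fullmatch(value) is not None


def split_refs(value: str) -> list[str]:
    """Hypothetical helper: treat whitespace as a separator between refs."""
    return value.split()


print(is_osis_ref("Rom.1.1"))                    # True
print(is_osis_ref("Rom.2.1-Rom.3.1"))            # True
print(is_osis_ref("Rom.1.1 Rom.2.1-Rom.3.1"))    # False
print([is_osis_ref(r) for r in split_refs("Rom.1.1 Rom.2.1-Rom.3.1")])
```

So a whitespace-separated list only validates if something splits it first and checks each piece on its own.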
Also, for the "grain", the manual says in one spot that () surround the
operator argument and in another that [] do, but the regex only allows [].
Interestingly, the osisID and the osisRef do not allow whitespace at
all, even escaped.
The osisRef requires that characters other than letters, numbers and
underscores be escaped, and '.' is the hierarchical separator. Thus,
osisRef="Shaw:Shaw/The Reformed Faith/Preface"
should be (Assuming that / is a hierarchical separator):
osisRef="Shaw:Shaw.The\ Reformed\ Faith.Preface"
I don't know how well SWORD or JSword handles this.
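A quick way to test individual path segments against the escaped-character rule is sketched below in Python. Again \w is only my stand-in for (\p{L}|\p{N}|_), and SEGMENT, is_valid_segment and escape_segment are hypothetical names, not anything from SWORD or JSword. Note that under this reading a backslash followed by a space does not count as an escape, since the regex requires a non-whitespace character after the backslash.

```python
import re

# Assumption: \w approximates (\p{L}|\p{N}|_); \\\S is the regex's \\[^\s].
SEGMENT = re.compile(r"(?:\w|\\\S)+")


def is_valid_segment(text: str) -> bool:
    """True if text is one or more allowed or escaped characters."""
    return SEGMENT.fullmatch(text) is not None


def escape_segment(text: str) -> str:
    """Hypothetical helper: backslash-escape everything that is not \\w.

    Whitespace is rejected because the regex offers no way to escape it.
    """
    out = []
    for ch in text:
        if ch.isspace():
            raise ValueError("the regex has no way to escape whitespace")
        out.append(ch if re.fullmatch(r"\w", ch) else "\\" + ch)
    return "".join(out)


print(is_valid_segment(r"The\ Reformed\ Faith"))   # False: backslash + space
print(is_valid_segment("The+Reformed+Faith"))      # False: bare '+'
print(is_valid_segment("The%20Reformed%20Faith"))  # False: bare '%'
print(is_valid_segment(r"The\+Reformed\+Faith"))   # True: escaped '+'
print(escape_segment("Faith/Hope"))                # Faith\/Hope
```

If that reading is right, '+' or '%20' would each need a backslash in front of the punctuation, and a literal space has no valid spelling at all.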
In His Service,
DM
Breaking the regex down:
# A couple of things to note:
# \p{L} is any letter in any script. It is not [A-Za-z].
# \p{N} is any number in any script. It is not merely [0-9].
# \p{Nd} is any decimal digit in any script. It excludes other numeric
# characters, such as ideographic (e.g. Chinese) numerals. It is not merely [0-9].
# \\[^\s] is a backslash followed by a non-whitespace,
# that is an escaped non-whitespace
# I think:
# there are unnecessary ()
# ((...)*)? can be more simply (...)*
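For anyone who wants to see the category distinctions in the notes above, Python's unicodedata module reports the Unicode general category of a character. This is only an illustration: the sample characters are my own choice, and the point about \d and \w describes Python's re module, not the XML Schema regex engine.

```python
import re
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "LATIN CAPITAL A": "A",
    "ASCII DIGIT FIVE": "5",
    "DEVANAGARI DIGIT FIVE": "\u096b",
    "ROMAN NUMERAL EIGHT": "\u2167",
    "VULGAR FRACTION ONE HALF": "\u00bd",
    "CJK IDEOGRAPH THREE": "\u4e09",
    "IDEOGRAPHIC NUMBER ZERO": "\u3007",
}

for label, ch in samples.items():
    category = unicodedata.category(ch)          # e.g. "Lu", "Nd", "Nl", "No"
    is_digit = re.fullmatch(r"\d", ch) is not None
    is_word = re.fullmatch(r"\w", ch) is not None
    print(f"{label:26} {category}  \\d={is_digit}  \\w={is_word}")
```

In Python, \d covers the Nd category only, while \w also accepts the Nl and No characters, which is why the sketches in this message use \w for (\p{L}|\p{N}|_) and \d for \p{Nd}.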
# A work id may stand at the beginning of an osisRef
(
# The osisRef starts with a sequence of 1 or more letters, numbers
# and underscores
(
(\p{L}|\p{N}|_)+
)
# optionally it is followed by 0 or more parts
(
# each of which begins with a dot followed by a sequence of 1 or more
# letters, numbers and underscores
(
\.(\p{L}|\p{N}|_)+
)*
)?
:
)?
# The reference follows the workId, if present
(
# The reference starts with a sequence of 1 or more letters, numbers,
# underscores and escaped non-whitespace
(\p{L}|\p{N}|_|(\\[^\s]))+
)
# and it is followed by 0 or more parts
(
# each of which begins with a dot and is followed by 0 or more
# letters, numbers, underscores and escaped non-whitespace
# Note: this allows for ...
\.(\p{L}|\p{N}|_|(\\[^\s]))*
)*
# Following the reference is an optional work-specific extension
(
# The extension begins with an exclamation mark
!
(
# and is followed by 1 or more letters, numbers, underscores and
# escaped non-whitespace
(\p{L}|\p{N}|_|(\\[^\s]))+
)
# the extension can be multipart
(
(
# each of which begins with a dot and is followed by 1 or more
# letters, numbers, underscores and escaped non-whitespace
\.(\p{L}|\p{N}|_|(\\[^\s]))+
)*
)?
)?
# An osisRef has an optional grain, which the osisID does not have
# Note the OSIS manual differs from this regex.
# The manual defines () to enclose the argument.
# Here we have []
# The manual does not specify, but the regex allows a trailing [n] on
# the "s" operator
(
# the grain starts with an @
@
(
# and is followed by the letters 'cp' and a bracketed number
# This is the "code point" operator, which indexes n characters
# into the referenced element
cp\[(\p{Nd})*\]
# or followed by the letter s, for string, and a bracketed sequence
# of 1 or more letters or numbers
# This is the "string" operator, which finds the given string in
# the referenced element
# and this can be followed by an optional bracketed number
|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?
)
)?
# The osisRef can be defined as a range
(
# starting with a hyphen
\-
# the following is identical to what was given before
(
# of one or more parts that are exactly like the reference part
# given above
# key thing to note, there is no allowance for a workId
(
(
# starting with a sequence of 1 or more letters, numbers,
# underscores and escaped non-whitespace
(\p{L}|\p{N}|_|(\\[^\s]))+
)
# and followed by 0 or more parts
(
\.(\p{L}|\p{N}|_|(\\[^\s]))*
)*
)+
)
# and the optional extension, repeated here as before
(
# starting with an exclamation mark
!
(
# and followed by a sequence of 1 or more letters, numbers,
# underscores and escaped non-whitespace
(\p{L}|\p{N}|_|(\\[^\s]))+
)
# optionally followed by parts
(
(
# each part starts with a dot and is followed by 1 or more
# letters, numbers, underscores and escaped non-whitespace
\.(\p{L}|\p{N}|_|(\\[^\s]))+
)*
)?
)?
# and as before it can be followed by grain
(
# the grain starts with an @
@
(
# and is followed by the letters 'cp' and a bracketed number
# This is the "code point" operator, which indexes n characters
# into the referenced element
cp\[(\p{Nd})*\]
# or followed by the letter s, for string, and a bracketed
# sequence of 1 or more letters or numbers
# This is the "string" operator, which finds the given string in
# the referenced element
# and this can be followed by an optional bracketed number
|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?
)
)?
)?
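The same structure can be put together with named groups, which makes it easier to see which piece of a reference matched what. This is a sketch in Python under the same assumptions as before (\w for (\p{L}|\p{N}|_), \d for the numeric classes). The group names and the parse function are invented for illustration, and the redundant outer ()+ around the range end is dropped.

```python
import re

ESC = r"(?:\w|\\\S)"                     # allowed or escaped character
REF = rf"{ESC}+(?:\.{ESC}*)*"            # parts after a dot may be empty
EXT = rf"{ESC}+(?:\.{ESC}+)*"            # extension parts may not be empty
GRAIN = r"cp\[\d*\]|s\[[^\W_]+\](?:\[\d+\])?"

OSIS_REF = re.compile(rf"""
    (?:(?P<work>\w+(?:\.\w+)*):)?        # optional work id, ends with ':'
    (?P<start>{REF})                     # the reference itself
    (?:!(?P<start_ext>{EXT}))?           # optional '!' extension
    (?:@(?P<start_grain>{GRAIN}))?       # optional '@' grain, brackets only
    (?:-                                 # optional range end, no work id
        (?P<end>{REF})
        (?:!(?P<end_ext>{EXT}))?
        (?:@(?P<end_grain>{GRAIN}))?
    )?
""", re.VERBOSE)


def parse(value: str):
    """Return the matched pieces as a dict, or None if value is rejected."""
    m = OSIS_REF.fullmatch(value)
    if m is None:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}


print(parse("Bible:Rom.1.1-Rom.2.1"))
print(parse("Bible:Rom.1.1-Bible:Rom.2.1"))   # None: no work id on range end
print(parse("Rom.1.1@cp[3]"))
print(parse("Rom.1.1@cp(3)"))                 # None: () is not accepted
print(parse("Rom.1.1!note.a@s[grace][2]"))
print(parse("Gen..1"))                        # accepted: empty part after dot
```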