[sword-devel] OSIS links

DM Smith dmsmith at crosswire.org
Wed Jan 19 09:26:28 MST 2011


On 01/19/2011 09:18 AM, Karl Kleinpaste wrote:
> DM Smith<dmsmith at crosswire.org>  writes:
>> The problem with spaces in an osisRef is that a space is defined as the
>> separator between one reference and another.
>> E.g. osisRef="Rom.1.1 Rom.2.1-Rom.3.1"
> Then I would like to think that either URL-standard '+' or hex-encoded
> "%20" would work in place of a real space.
Regarding the thread as a whole, it might be good to see the actual 
definition.

The osisRef is defined by the following regular expression:
(((\p{L}|\p{N}|_)+)((\.(\p{L}|\p{N}|_)+)*)?:)?((\p{L}|\p{N}|_|(\\[^\s]))+)(\.(\p{L}|\p{N}|_|(\\[^\s]))*)*(!((\p{L}|\p{N}|_|(\\[^\s]))+)((\.(\p{L}|\p{N}|_|(\\[^\s]))+)*)?)?(@(cp\[(\p{Nd})*\]|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?))?(\-((((\p{L}|\p{N}|_|(\\[^\s]))+)(\.(\p{L}|\p{N}|_|(\\[^\s]))*)*)+)(!((\p{L}|\p{N}|_|(\\[^\s]))+)((\.(\p{L}|\p{N}|_|(\\[^\s]))+)*)?)?(@(cp\[(\p{Nd})*\]|s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?))?)?

See below for a breakdown...

The OSIS manual is a bit inconsistent with this regex.
The regex allows a single reference or a single contiguous range. 
It does not allow multiple references separated by whitespace. 
The manual has at least one example of multiple references, 
but it also clearly says multiple references are not allowed.
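
As a rough illustration of the point above, here is a minimal Python sketch 
that checks a single osisRef and, separately, a whitespace-separated list 
of them. It is not the schema's own validator. Python's stdlib re module 
has no \p{...} classes, so \w, [^\W_] and \d are used as approximate 
stand-ins, and the names is_osis_ref and split_refs are hypothetical.

```python
import re

# Stdlib `re` has no \p{...}; these stand-ins are approximations (assumption):
#   \w ~ (\p{L}|\p{N}|_),  [^\W_] ~ (\p{L}|\p{N}),  \d ~ \p{Nd} and \p{N}
SEG = r"(?:\w|\\\S)"  # a plain character or a backslash-escaped non-whitespace
PART = (
    rf"{SEG}+(?:\.{SEG}*)*"                          # the reference itself
    rf"(?:!{SEG}+(?:\.{SEG}+)*)?"                    # optional extension
    r"(?:@(?:cp\[\d*\]|s\[[^\W_]+\](?:\[\d+\])?))?"  # optional grain
)
# Optional work id and colon, one reference, optional "-reference" range end.
OSIS_REF = re.compile(rf"(?:\w+(?:\.\w+)*:)?{PART}(?:-{PART})?")


def is_osis_ref(text):
    """True if the whole string is one reference or one contiguous range."""
    return OSIS_REF.fullmatch(text) is not None


def split_refs(attr_value):
    """Split an attribute value on whitespace and validate each token."""
    return [(token, is_osis_ref(token)) for token in attr_value.split()]


if __name__ == "__main__":
    value = "Rom.1.1 Rom.2.1-Rom.3.1"
    print(is_osis_ref(value))  # the value as a whole
    print(split_refs(value))   # each whitespace-separated token
```

The whole attribute value fails because nothing in the pattern can match 
a space, while each token on its own passes.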

Also, for the "grain", the manual surrounds the operator argument with () 
in one spot and [] in another, but the regex only allows [].

Interestingly, the osisID and the osisRef do not allow whitespace at 
all, even escaped.

The osisRef requires that characters other than letters, numbers and 
underscores be escaped, and '.' is the hierarchical separator. Thus,

osisRef="Shaw:Shaw/The Reformed Faith/Preface"
should be (assuming that / is a hierarchical separator):
osisRef="Shaw:Shaw.The\ Reformed\ Faith.Preface"

I don't know how well SWORD or JSword handles this.
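
The escaping rule can be sketched in Python as follows. This is only an 
illustration: the function names are hypothetical, \w is used as an 
approximation of (\p{L}|\p{N}|_), and nothing here reflects how SWORD or 
JSword actually behave. Note that, read strictly, \\[^\s] cannot escape a 
space, so a name containing spaces would not validate against the regex 
even with backslashes. The sketch therefore rejects whitespace rather than 
escaping it.

```python
import re


def escape_osis_segment(segment):
    """Backslash-escape every character that is not a letter, number or '_'.

    Whitespace is rejected because the regex's escape, \\\\[^\\s], only
    covers non-whitespace characters.
    """
    if re.search(r"\s", segment):
        raise ValueError("whitespace cannot appear in an osisRef, even escaped")
    return re.sub(r"[^\w]", lambda m: "\\" + m.group(0), segment)


def to_osis_ref(work, path, sep="/"):
    """Build 'work:part.part...' from a path, treating sep as the hierarchy."""
    parts = [escape_osis_segment(part) for part in path.split(sep)]
    return work + ":" + ".".join(parts)


if __name__ == "__main__":
    # Hypothetical key names, with underscores standing in for the spaces.
    print(to_osis_ref("Shaw", "Shaw/The_Reformed_Faith/Preface"))
    print(to_osis_ref("Shaw", "Shaw/Faith,Reformed/Preface"))
```

A literal dot inside a segment is escaped as well, which keeps it from 
being read as a hierarchical separator.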

In His Service,
     DM


Breaking the regex down:
# A couple of things to note:
#   \p{L} is any letter in any script. It is not merely [A-Za-z].
#   \p{N} is any number in any script. It is not merely [0-9].
#   \p{Nd} is any decimal digit in any script. It is not merely [0-9],
#      but unlike \p{N} it excludes letter-like and other numerics,
#      such as Roman numerals and the ideographic zero.
#   \\[^\s] is a backslash followed by a non-whitespace,
#      that is an escaped non-whitespace
# I think:
#    there are unnecessary ()
#    ((...)*)? can be more simply (...)*
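
To make the difference between \p{N} and \p{Nd} concrete, here is a small 
Python sketch using the stdlib unicodedata module. The sample characters 
and the classify helper are my own choices for illustration. A character 
is in \p{N} when its general category starts with "N", and in \p{Nd} only 
when the category is exactly "Nd".

```python
import unicodedata

# Sample characters chosen for illustration (assumption, not from the schema).
SAMPLES = [
    "7",       # DIGIT SEVEN
    "\u0663",  # ARABIC-INDIC DIGIT THREE
    "\u2163",  # ROMAN NUMERAL FOUR
    "\u00bd",  # VULGAR FRACTION ONE HALF
    "\u3007",  # IDEOGRAPHIC NUMBER ZERO
]


def classify(ch):
    """Report the Unicode general category and \\p{N} / \\p{Nd} membership."""
    cat = unicodedata.category(ch)
    return {
        "name": unicodedata.name(ch),
        "category": cat,
        "in_N": cat.startswith("N"),
        "in_Nd": cat == "Nd",
    }


if __name__ == "__main__":
    for ch in SAMPLES:
        print(classify(ch))
```

All five samples are in \p{N}, but only the first two are in \p{Nd}, so 
only those could appear inside cp[...].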

# A work id may stand at the beginning of an osisRef
(
   # The osisRef starts with a sequence of 1 or more letters, numbers
   # and underscores
   (
     (\p{L}|\p{N}|_)+
   )
   # optionally it is followed by 0 or more parts
   (
     # which begin with a dot and followed by a sequence of 1 or more
     # letters, numbers and underscores
     (
       \.(\p{L}|\p{N}|_)+
     )*
   )?
   :
)?

# The reference follows the workId, if present
(
   # The reference starts with a sequence of 1 or more letters, numbers,
   # underscores and escaped non-whitespace
   (\p{L}|\p{N}|_|(\\[^\s]))+
)

# and it is followed by 0 or more parts
(
   # each of which begins with a dot and is followed by 0 or more
   # letters, numbers, underscores and escaped non-whitespace
   # Note: this allows for ...
   \.(\p{L}|\p{N}|_|(\\[^\s]))*
)*

# Following the reference is an optional work-specific extension
(
   # The extension begins with an exclamation mark
   !
   (
     # and is followed by 1 or more letters, numbers, underscores and
     # escaped non-whitespace
     (\p{L}|\p{N}|_|(\\[^\s]))+
   )
   # the extension can be multipart
   (
     (
       # each of which begins with a dot and is followed by 1 or more
       # letters, numbers, underscores and escaped non-whitespace
       \.(\p{L}|\p{N}|_|(\\[^\s]))+
     )*
   )?
)?

# An osisRef has an optional grain, which the osisID does not have
# Note the OSIS manual differs from this regex.
# The manual defines () to enclose the argument.
# Here we have []
# The manual does not specify, but the regex allows a trailing [n] on
# the "s" operator
(
   # the grain starts with an @
   @
   (
     # and is followed by the letters 'cp' and a bracketed number
     # This is the "code point" operator, which indexes n characters
     # into the referenced element
     cp\[(\p{Nd})*\]
     # or followed by the letter s, for string, and a bracketed sequence
     # of 1 or more letters or numbers
     # This is the "string" operator, which finds the given string in
     # the referenced element
     # and this can be followed by an optional bracketed number
     |s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?
   )
)?

# The osisRef can be defined as a range
(
   # starting with a hyphen
   \-
   # the following is identical to what was given before
   (
     # of one or more parts that are exactly like the reference part
     # given above
     # key thing to note, there is no allowance for a workId
     (
       (
         # starting with a sequence of 1 or more letters, numbers,
         # underscores and escaped non-whitespace
         (\p{L}|\p{N}|_|(\\[^\s]))+
       )
       # and followed by 0 or more parts
       (
         \.(\p{L}|\p{N}|_|(\\[^\s]))*
       )*
     )+
   )
   # and the optional extension, repeated here as before
   (
     # starting with an exclamation mark
     !
     (
       # and followed by a sequence of 1 or more letters, numbers,
       # underscores and escaped non-whitespace
       (\p{L}|\p{N}|_|(\\[^\s]))+
     )
     # optionally followed by parts
     (
       (
         # each part starts with a dot and is followed by 1 or more
         # letters, numbers, underscores and escaped non-whitespace
         \.(\p{L}|\p{N}|_|(\\[^\s]))+
       )*
     )?
   )?
   # and as before it can be followed by grain
   (
     # the grain starts with an @
     @
     (
       # and is followed by the letters 'cp' and a bracketed number
       # This is the "code point" operator, which indexes n characters
       # into the referenced element
       cp\[(\p{Nd})*\]
       # or followed by the letter s, for string, and a bracketed
       # sequence of 1 or more letters or numbers
       # This is the "string" operator, which finds the given string in
       # the referenced element
       # and this can be followed by an optional bracketed number
       |s\[(\p{L}|\p{N})+\](\[(\p{N})+\])?
     )
   )?
)?
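
To see how the pieces above fit together, here is a minimal Python sketch 
that rebuilds the pattern with named groups and splits an osisRef into 
work id, reference, extension, grain and range end. It is an approximation, 
not the schema's regex: stdlib re has no \p{...}, so \w, [^\W_] and \d 
stand in for the Unicode classes, the redundant groups are dropped, and 
the group names and parse_osis_ref are hypothetical.

```python
import re

# Approximate stand-ins (assumption): \w ~ (\p{L}|\p{N}|_),
# [^\W_] ~ (\p{L}|\p{N}), \d ~ \p{Nd} and \p{N}.
SEG = r"(?:\w|\\\S)"                   # plain or backslash-escaped character
REF = rf"{SEG}+(?:\.{SEG}*)*"          # dotted reference; later parts may be empty
EXT = rf"{SEG}+(?:\.{SEG}+)*"          # extension after "!"; parts are non-empty
GRAIN = r"(?:cp\[\d*\]|s\[[^\W_]+\](?:\[\d+\])?)"  # only [] brackets are accepted

OSIS_REF = re.compile(
    r"(?:(?P<work>\w+(?:\.\w+)*):)?"
    rf"(?P<start>{REF})(?:!(?P<start_ext>{EXT}))?(?:@(?P<start_grain>{GRAIN}))?"
    # The range end repeats reference, extension and grain, but not the work id.
    rf"(?:-(?P<end>{REF})(?:!(?P<end_ext>{EXT}))?(?:@(?P<end_grain>{GRAIN}))?)?"
)


def parse_osis_ref(text):
    """Return the matched components as a dict, or None if text does not match."""
    m = OSIS_REF.fullmatch(text)
    if m is None:
        return None
    return {name: value for name, value in m.groupdict().items() if value is not None}


if __name__ == "__main__":
    for ref in (
        "KJV:Rom.1.1-Rom.3.1",
        "Gen.1.1@s[beginning][2]",
        "Gen.1.1@cp(4)",        # the manual's () form
        "Rom.1.1-KJV:Rom.3.1",  # work id on the range end
    ):
        print(ref, "->", parse_osis_ref(ref))
```

The last two inputs return None: the parenthesized grain argument and a 
work id on the end of a range are both outside what the regex allows.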



