[osis-core] morph regex error
Troy A. Griffitts
osis-core@bibletechnologieswg.org
Mon, 08 Dec 2003 01:07:50 -0700
I think you are sorely incorrect about historical facts, but in regard
to the current schema:
You may argue that it SHOULD conform to osisIDRegex, but NOW is not the
time to argue that.
The PROBLEM I have is not this:
<xs:attribute name="morph" type="osisIDType" use="optional"/>
it's NOT defined that way in the schema. If that's what you want, we
can talk/debate about changing it to the above at our next meeting.
The PROBLEM is that being defined correctly, like this:
<xs:attribute name="morph" type="osisGenType" use="optional"/>
(which is how it IS defined in the official schema)
osisGenType (osisGenRegex) SHOULD NOT BE RESTRICTED TO THE SAME THING AS
osisIDType (osisIDRegex) or we wouldn't have 2 types.
Does that make sense?
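
To make the distinction concrete, here is a rough sketch and not the schema itself. STRICT only approximates the pattern quoted in the MSV error further down the thread (Python's re module has no \p{L} or \p{N}, so \w stands in for "letter, digit, or underscore"). LOOSE is a purely hypothetical stand-in for a looser "prefix:any_string" type; it is NOT the actual osisGenRegex.

```python
import re

# Approximation of the pattern from the MSV error message.
# \w replaces (\p{L}|\p{N}|_), which Python's re cannot express directly.
STRICT = re.compile(r"\w+(\.\w)*:\w+(\.\w+)*")

# Hypothetical looser pattern (an assumption, not the real osisGenRegex):
# a prefix, a colon, then any run of non-space characters.
LOOSE = re.compile(r"\w+(\.\w)*:\S+")


def is_valid(pattern, value):
    """XSD patterns must match the whole value, so use fullmatch."""
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ("robinsons:V-PAM-2P", "robinsons:V_PAM_2P", "strongs:15"):
        print(value, is_valid(STRICT, value), is_valid(LOOSE, value))
```

Under these assumptions the hyphenated value fails the strict pattern and passes the loose one, which is the whole reason to keep two types.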
-Troy.
PS. Even if changing it to osisIDType were being proposed (which I think
you've done), I still believe that a serious flaw exists in this proposal:
There has to be a way programmatically to restore the encoding WITHOUT
the software knowing anything about the morph scheme or else we've
forced enumeration of the known morph schemes in software implementation.
E.g., you can't change 'N-[G]@5' to 'N__G_5'. If we ever decide to force
morph to conform to osisIDType, then we MUST provide a programmatic way
to restore the original morph code, e.g. 'N%2D%5BG%5D%405', which I think
still looks horrible and is not acceptable to me, but at least would
allow me to remove the ambiguity and programmatically reconstruct the
original code.
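
A minimal sketch of the difference, assuming a character-for-character underscore replacement on one side and URL-style percent-encoding on the other. The function names are made up for illustration and nothing here is part of OSIS.

```python
import re
from urllib.parse import unquote

# Anything that is not an ASCII letter or digit is treated as "offending".
OFFENDING = re.compile(r"[^A-Za-z0-9]")


def to_underscores(code):
    """Lossy: every offending character collapses to '_'."""
    return OFFENDING.sub("_", code)


def encode_morph(code):
    """Reversible: each offending character becomes %XX per UTF-8 byte."""
    def escape(match):
        return "".join(f"%{b:02X}" for b in match.group().encode("utf-8"))
    return OFFENDING.sub(escape, code)


def decode_morph(encoded):
    """Restore the original code without knowing the morph scheme."""
    return unquote(encoded)


if __name__ == "__main__":
    original = "N-[G]@5"
    encoded = encode_morph(original)
    print(encoded)                # N%2D%5BG%5D%405
    print(decode_morph(encoded))  # N-[G]@5
    # Two different codes become indistinguishable after underscoring:
    print(to_underscores("N-[G]@5") == to_underscores("N [G]-5"))  # True
```

The underscore version cannot be undone without scheme-specific knowledge, because distinct codes collide; the escaped version decodes mechanically.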
Chris Little wrote:
> As the person who actually requested this attribute, and the one who
> implemented it in 2 or 3 Bibles and 2 morphology tag indices back in
> OSIS 1.0....
>
> My recollection leading up to 2.0 was that we wanted to limit the format
> to the present regex. (That does not deny that there may have been
> further conversations on the subject to which I was not privy or that I
> do not recollect.)
>
> It's true that morphological tagging schemes do use characters that would
> violate the regex format's requirements--space and hyphen specifically.
> However, I'm unaware of any system in which it should actually matter
> what character represents these characters, if they get transcoded. In
> every system that I know, space and hyphen simply represent dividers and
> place holders. They never hold any actual content--they have empty
> semantics. So if they all get encoded to underscores, then decoded as
> hyphens, that should be fine. (Indeed, in point of fact, there are
> systems--Friberg comes to mind, but I might be wrong--that are rendered
> with spaces in some instances and hyphens in others, depending on the
> publisher & format.)
>
> My feeling is that it's actually more beneficial to match the osisID
> format in order to allow for linking. But more importantly, I thought
> we disallowed spaces in attributes unless they divide values in a list.
>
> So, taking a tag like, oh, say "N-NSF", I would just encode it as
> "N_NSF" (And did so 1613 times in tr.xml.)
>
> --Chris
>
> Troy A. Griffitts wrote:
>
>> Patrick,
>> This is a serious restriction/change. I specifically remember
>> discussing this with you and we agreed that these tags should NOT be
>> restricted to osisID-like syntax.
>>
>> Serious reasons:
>>
>> VERY REAL SCHEMES (probably the only ones that have ever been
>> marked in OSIS) USE OFFENDING CHARACTERS.
>>
>> We have defined no escape character.
>>
>> Without an escape character EVERY SOFTWARE needs to magically KNOW
>> the scheme used to recode these schemes, instead of just mindlessly
>> displaying them to the scholar (which is what should be allowed).
>> This is unreasonable.
>>
>> I have texts that I need to release with this morphological scheme
>> NOW, not when 3.0 is released.
>>
>> This is NOT a change that should have been applied without
>> everyone's consent.
>>
>>
>> Not to be a jerk, but being the one that asked for this attribute,
>> and being the only one using this attribute that I know of, I'm a
>> little ticked that it was changed.
>>
>>
>> -Troy.
>>
>>
>>
>> Patrick Durusau wrote:
>>
>>> Troy,
>>>
>>> I think the regex is correct, no hyphens are allowed. This does not
>>> mean that you should use a range in any of these, although that is
>>> possible. It does allow these to be used as osisRefs so that they can
>>> refer to other sources of information.
>>>
>>> Perhaps we should revisit at the January OSIS meeting but I don't
>>> think we will reach a different conclusion.
>>>
>>> Hope you are having a great day!
>>>
>>> Patrick
>>>
>>> Troy A. Griffitts wrote:
>>>
>>>> :)
>>>>
>>>> Unless I'm going senile-- which I've been suspecting for some time
>>>> now-- I believe that the last discussion on this subject, before
>>>> release of 2.0, concluded that lemma, xlit, gloss, and morph WOULD
>>>> NOT be restricted by osisRef syntax. We would make a separate
>>>> complexType for them, which basically would allow: prefix:any_string
>>>>
>>>> I think I wanted to allow spaces (especially for gloss); Patrick
>>>> found real-world occurrences of other systems that used prohibited
>>>> characters, as well.
>>>>
>>>> So the conclusion was either:
>>>>
>>>> prefix:any_string
>>>>
>>>> or
>>>>
>>>> prefix:any string
>>>>
>>>> I think Steve may have made some push for replacing the 'space' but
>>>> don't remember the conclusion on that one.
>>>>
>>>> But regardless, there are no spaces in my offending line that I
>>>> quoted earlier, and yet I still get an error.
>>>>
>>>> If I have to remove the cobwebs to defend this again, I will try,
>>>> but I think it's just an oversight in the .xsd.
>>>>
>>>> -Troy.
>>>>
>>>>
>>>>
>>>>
>>>> Chris Little wrote:
>>>>
>>>>> Okay, okay. No need to shout. Don't kill the messenger. Etc. :)
>>>>>
>>>>> The problem with changing the format is that we can no longer use
>>>>> morph, lemma, etc. values as osisRefs. As it stands, any of these
>>>>> attributes could double as an osisRef/osisID. So your lexicon,
>>>>> organized by lemma, could have divisions with osisIDs that are the
>>>>> same as their lemma values. Likewise, if you organize the Robinson
>>>>> morphology scheme as a sort of lexicon, you can look up entries and
>>>>> tag them with osisIDs that are identical to your morph value.
>>>>>
>>>>> --Chris
>>>>>
>>>>> Troy A. Griffitts wrote:
>>>>>
>>>>>> NO!
>>>>>>
>>>>>>
>>>>>> Chris Little wrote:
>>>>>>
>>>>>>> Troy A. Griffitts wrote:
>>>>>>>
>>>>>>>> Hey guys. It seems we may have messed up the regex on the morph
>>>>>>>> attribute of <w>.
>>>>>>>>
>>>>>>>> Here's my line:
>>>>>>>>
>>>>>>>> <w xml:lang="grc" lemma="strongs:15" morph="robinsons:V-PAM-2P"
>>>>>>>> xlit="la:agaqopoieite">GREEK UTF8 TEXT HERE</w>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Here's the MSV error output:
>>>>>>>>
>>>>>>>> Error at line:279, column:117 of
>>>>>>>> file:///space/home/scribe/msv/./lexcounts
>>>>>>>> attribute "morph" has a bad value: the value does not match
>>>>>>>> the regular expression
>>>>>>>> "((((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_))*:)((((\p{L})|(\p{N})|_)+)(((\.(\p{L}|\p{N}|_)+)*))?))".
>>>>>>>
>>>>>>> The value you give has never been valid. Hyphens have never been
>>>>>>> allowed in morph or lemma attributes (nor have spaces and various
>>>>>>> other characters). I think the decision we made before releasing
>>>>>>> 2.0 was to force folks to transcode these as '_'.
>>>>>>>
>>>>>>> Does that work for you?
>>>>>>>
>>>>>>> --Chris