[osis-core] Notes from OSIS meetings of 2004-01-31
Steven J. DeRose
osis-core@bibletechnologieswg.org
Sat, 31 Jan 2004 18:40:34 -0500
On examining USFM, Todd & Chris found that the only things that
didn't map easily to OSIS were qr and qc, for quadding poetry (only)
to right and center.
Proposal: +n for indents from left (actually start, to account for
r-to-l lgs), -n from right, and 0 for centered.
What would a translator accept as reasons for a translator to use
these -- 'it looks good' would be iffy; inconsistency would be
rejected; reason should have to do with linguistics.
Problem of automated conversion does not imply we can't have a
finer grained system -- these two codes are remnants of formatting
orientation in sf.
Can we extract a set of *reasons* for people using qc and qr?
are we overloading notion of "line" for typographic vs. structure?
Maybe for conversion purposes, we just let them use type attribute?
Or 'kind'?
We could enumerate some reasons, and allow center/right if you don't know.
possibility: type='unknown' subtype='center|right'
translators don't want to worry about all these distinctions.
yes, but they also often don't want to worry about lots of other
distinctions (being focused on getting the Bible published) -- but
letting them just use formatting (Word file, format macros....) in
fact costs them more time.
can checkers gradually develop a list of the qc and qr related types?
possible consensus:
line type attribute is for meaningful types, to be determined by agency
line type of 'unknown' with subtype for typography
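
A minimal sketch of how a converter might apply this possible consensus. The function name, the mapping table, and the sample value 'x-inscription' are illustrative assumptions, not part of USFM or OSIS:

```python
# Hypothetical table: USFM quadding markers -> typographic subtype.
# Only qc and qr are covered, since those are the two codes discussed.
QUAD_SUBTYPE = {"qc": "center", "qr": "right"}


def line_attrs(usfm_marker, meaning=None):
    """Return a (type, subtype) pair for a poetry line.

    If the agency has supplied a meaningful type, use it and drop the
    typography. Otherwise fall back to type='unknown' plus a subtype
    that only records the justification.
    """
    if meaning is not None:
        return (meaning, None)
    if usfm_marker in QUAD_SUBTYPE:
        return ("unknown", QUAD_SUBTYPE[usfm_marker])
    # Any other marker is outside this sketch.
    return (None, None)


print(line_attrs("qc"))                   # ('unknown', 'center')
print(line_attrs("qr"))                   # ('unknown', 'right')
print(line_attrs("qc", "x-inscription"))  # ('x-inscription', None)
```

The point of the fallback is that automated conversion never has to guess a meaning; a checker can replace 'unknown' later.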
can usfm do more? e.g. enumerate appropriate meaningful uses of
qc/qr, and then (a) at least add those to the doc for those tags; and
(b) if it makes sense, add tags for them.
Develop a usfm/osis manual -- guidance for mapping.
Can USFM define what qr means in right-to-left languages?
Can USFM define more meaningful alternatives to qc and qr?
Can they add an inscription tag?
Should we enumerate? start/center/end? Pro: validation.
consensus: add type=unknown; add enumeration (extensible) for
justification types. ql/qc/qr, left/center/right,
left/center/right/start/end; start/center/end
Narrowed by poll to l/c/r or l/c/r/s/e. Finish on list.
Note to editors: should we separate out all potentially-enumerable
attributes into a schema type?
Welcome David Haraburta, Baylor CS student working with Kirk L.
--------------
Switching to Linguistic Annotation
(intro linguistics summary)
Levels of analysis: phonology, morphology, syntax, discourse
morphology/part-of-speech annotation
sometimes determination requires arbitrary amounts of context
no neat 1:1 mapping from categories to features (like part-of-speech)
For example:
look up a hill
look up a word
but
look a hill up (wrong, at least for the meaning like 'look up a hill')
look a word up (fine, and means same as corresponding example above)
--> "look up" is a verb with a space in it (and which can be shifted
even further apart).
Also, conjoint forms like "don't", etc.
Hebrew 'melek', can't tell if it's construct or not without more context.
A single word instance may even have different parts of speech in
different clauses that include it, although this is rarer.
Lemma vs. morphology: lemma says what root "word"; morphology says
what grammatical form, etc.
Issue of recording obsolete systems accurately (e.g., Strong's lemma
numbers even when they're now deemed wrong), and also being able to
express modern consensus, and individual's annotations that may not
conform to any "standard" taxonomy.
(much harder above morphological level)
issue of ambiguity.
Consider EAGLES work on defining feature sets for EU languages.
Problems:
1) How to link in to OSIS texts
e.g. add analysis to a text that has <w> -- expand capability of w
2) Inline versus out of line markup
[[sjd: Is there a schema construct for saying "any" attribute permissible?
3) Should we introduce a <morpheme> level tag?
(question of using namespaces)
First approach: add <morpheme>, everything goes on attributes.
Second approach: un-flatten it into element structures
model: Provide a large set of features, hoping to cover vast majority
of lgs; but provide a way to subtract values inapplicable in any
given language.
[[sjd: provide a way to pull in the definition file and then
add/subtract features/values
[[problem: if we go with portmanteau references a la top level of TEI
fs, we have to enumerate all the combinations, and users have to
enumerate all their deletions -- couldn't practically delete "dual"
number with a single statement. We could provide a way of expressing
structure inside the values, like a token in the value for each
feature expressed, and a way to name each level and associate its
context in the attribute value/reference string, with the particular
feature name
E.g. n-n-m-s
would express a pattern of
category=noun
case=nominative
gender=masculine
number=singular
then you could get rid of the dual value for the number feature with
something like:
"delete n-*-*-d"
Or, someone could make a simple interface for making changes (which
might map a request to a whole lot of trivial cases). Or, we could
give them the TEI fslib we create, and let them literally delete/mod
as needed. Could get a tool built, too.
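
A rough sketch of the flat-pattern idea, under stated assumptions: the slot order follows the n-n-m-s example, but the code letters beyond those in the example (a, f, d, p) and all function names are invented for illustration. The standard-library fnmatch module stands in for whatever wildcard syntax would actually be adopted:

```python
from fnmatch import fnmatchcase
from itertools import product

# Hypothetical one-letter codes per slot, in the order of "n-n-m-s".
SLOTS = [
    ("category", {"n": "noun", "a": "adjective"}),
    ("case", {"n": "nominative", "a": "accusative"}),
    ("gender", {"m": "masculine", "f": "feminine"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
]


def all_tags():
    """Enumerate every combination, i.e. the 'portmanteau' list."""
    return ["-".join(codes) for codes in product(*(v for _, v in SLOTS))]


def decode(tag):
    """Expand a tag such as 'n-n-m-s' into feature=value pairs."""
    codes = tag.split("-")
    if len(codes) != len(SLOTS):
        raise ValueError("expected %d slots in %r" % (len(SLOTS), tag))
    return {name: values[code] for (name, values), code in zip(SLOTS, codes)}


def delete(tags, pattern):
    """Drop every tag matching a wildcard pattern such as 'n-*-*-d'.

    Note: fnmatch's '*' can span hyphens in general; it is safe here
    only because every slot is exactly one character.
    """
    return [t for t in tags if not fnmatchcase(t, pattern)]


tags = all_tags()
print(len(tags))                     # 24
print(len(delete(tags, "n-*-*-d")))  # 20: noun duals removed
print(len(delete(tags, "*-*-*-d")))  # 16: all duals removed
print(decode("n-n-m-s"))
```

One wildcard statement stands in for the whole family of trivial deletions, which is the mapping a simple interface would otherwise have to generate.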
Would we be able to keep to one sequence, fairly flat like here, or
do we need some kind of parenthesizing inside the attributes (if the
latter, it quickly gets complicated enough that it belongs in element
markup instead -- which, however, has the problem of forcing users to
touch the schema to change the tag vocab)?
Kirk: tried TEI fs's at the start. Hard to find actual examples,
usage guidance. There are 1844 distinct feature-spec strings in BHS: one
char for part-of-speech, etc.
To use fs's, seems you would have to have a GUI, because too many
features to memorize.
Much easier if you make the idrefs to the feature structures be
exactly the (for example) BHS mnemonics.
Could also split out lg universals (say, pos and context-Boundedness)
to separate attributes
Can separate question of whether to use TEI fs's, and whether to
provide multi-level (morpheme vs. word level) annot.
<seg granularity='word|morph|...'>....
(case of discontiguous words/morphemes, so need some pointing
mechanism -- can you do this in TEI? Like, binding a feature to a
word-instance value.)
fs are more palatable with the mnemonics -- still need a good UI for
real users to have a chance.
issue of things like TEI global attributes..... just select the fs
module, drop any global attrs we don't use (see current.
possibility: use namespace prefix to identify fslib in header, and
people can add their own attributes to w and m elements to add their
own features.
summary:
Define schema (per TEI) for fslibs
Declare such fslibs as works of class 'fsd':
<work osisWorkID="class">
<identifier type='osis'>fsd.he.WHI.2004...</>
...
</work>
Then refer to them via the prefixing mechanism:
<m feature='class:pro....'>...</m>
Next prob: combining discontiguous parts:
features can be referenced from w or m, or a generic <wordpart>.....
on discontiguous things, link them up
what about milestones? insufficient for discontiguous.
TEI defined <join> for this.... sits somewhere (a type of link) and
points to all the parts. goes into a joingroup in an anonymousBlock.
Problem of duplicate osisIDs -- several meanings with no way to tell
(for example a verse):
1) Discontiguous portions of a verse
2) Multiple distinct copies of a verse from different works (diglots,
parallels)
3) Multiple copies of the same verse from the same work (in a commentary, say)
4) Alternate readings of the same verse (end of Mark)
5) Combinations of the above.
We have a problem waiting out there when implementors have to decide
how to process duplicate osisIDs.
Seems like we're re-inventing TEI bit by bit....
Problem: how to mark up discontiguous constituents? Gotta have a
pointer across; but then, where do we hang properties of the whole?
Plan:
after lunch:
troy item
last few usfm issues
gorier bits of features
Documentation:
Comments on current state of doc, AND on pld's disposition of
prior comments, are due by Feb 15.
Hard date -- anything not raised by then is left to editors'
discretion (if any).
Comments on changes made after now, will be accepted later than Feb 15.
----- After lunch:
Troy's cases:
(cf notes from pld on first two issues)
3: How to mark up this in a lexicon:
This word occurs one hundred fifty seven times in the NT.
Need a way to get machine-processable numeric value.
<seg type="x-occ"> or similar is the right solution for the whole sentence.
Where to put "157"?
Should not go as content of an embedded seg, because it is not a
segment of the source content at all, but a property of it.
If it's an attribute, it should be on a <seg> surrounding the content
that it is a property (normalization) of, namely "one hundred fifty
seven".
None of the existing elements or attributes really fit (though many
are syntactically possible).
TEI <num> element would be nice for this. Takes type and value.
We should specify a single format for the 'value' attribute: some XSD
numeric type that covers floats and integers.
Types: card/ord/pct/frac -- plus x-
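
A sketch of what the lexicon sentence could look like with a TEI-style num element nested in the seg. This assumes num were adopted with type and value attributes as discussed; it was not part of the schema at the time of these notes, and the function name is illustrative:

```python
import xml.etree.ElementTree as ET

# The types listed above; anything else must use the x- extension prefix.
NUM_TYPES = {"card", "ord", "pct", "frac"}


def occurrence_seg(words, value, num_type="card"):
    """Wrap the spelled-out number in <num>, keeping the text as content
    and putting the normalized value on an attribute."""
    if num_type not in NUM_TYPES and not num_type.startswith("x-"):
        raise ValueError("unknown num type: %r" % num_type)
    seg = ET.Element("seg", {"type": "x-occ"})
    seg.text = "This word occurs "
    num = ET.SubElement(seg, "num", {"type": num_type, "value": str(value)})
    num.text = words
    num.tail = " times in the NT."
    return ET.tostring(seg, encoding="unicode")


print(occurrence_seg("one hundred fifty seven", 157))
# <seg type="x-occ">This word occurs <num type="card" value="157">one hundred fifty seven</num> times in the NT.</seg>
```

The source wording stays as element content; only the attribute carries the machine-processable value, so nothing is added to the text itself.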
Should we also add measure?
It has samples: weight, count, length, area, volume, currency.
We also need time durations (xsd has a duration type)
TEI timeRange doesn't really do duration
length, mass, charge, angle, solid angle, temperature, time,
meter, kg, sec, amp, K, mole, candela
Minimal set: length, time, currency, volume, count, area, mass.
reg is a pure numeric value.
add unit attribute: pick from somewhere.....
last issue: chars allowed in morph values. esp. hyphen, as in
"n-nm-s" etc. Right now we have the same regex controlling these and
osisRefs.... so you end up with the same distinction.
Could split off the definitions for at least lemma and morph, to: use
prefix, reserve space as top-level delim, but allow hyphen etc.
Problem: annotateRef is union of osisRef, osisID, and osisGen (lemma/morph).
[[sjd: was that really a good idea??
Prob: the last case means you're annotating metadata....
Todd: no, want to refer to a word in the abstract
Steve: that should be pointing to a lexicon entry via its osisID
Todd: That's even narrower, if lemmas have to be osisIDs
[[sjd:
how about dividing the ambiguity of what the value of annotateRef is
-- split into 3 attributes and simplify the regexes. much easier to
explain, and to maintain in schema, and can avoid interference
between osisRefs and lemmas and morphTags....
What additional things may we want to annotate?
-- the content, start/end location, gi, and attrs of any elements
at all (or content portions)
--
(very lengthy discussion of syntax of annotateRef)
In the end: lemmas and morphs etc. are semantically osisIDs into
other documents. It seems it would disturb an awful lot of
already-complex syntax, or require some ugly mode switch/double
prefix/something. In the end we ended up keeping the same reserved
characters, and deciding to add to the manual very clear statements
of the list, and recommendations of what to do with legacy
identifiers that include those characters. The reserved characters
are:
(space) - : [ ] @ !
Period can be used but not doubled.
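
A small sketch of a check a conversion tool could run on legacy lemma or morph identifiers, using only the reserved list above. The function name and the report format are assumptions; what to do with a flagged identifier is left to the manual's recommendations:

```python
# Reserved characters exactly as listed in these notes.
RESERVED = " -:[]@!"


def reserved_problems(identifier):
    """Return the reserved characters found in identifier, in order of
    first appearance, plus '..' if a period is doubled."""
    found = []
    for ch in identifier:
        if ch in RESERVED and ch not in found:
            found.append(ch)
    if ".." in identifier:
        found.append("..")
    return found


print(reserved_problems("n-nm-s"))   # ['-']
print(reserved_problems("Gen.1.1"))  # []
print(reserved_problems("a..b"))     # ['..']
```

An empty list means the identifier can be used unchanged; "n-nm-s" is flagged, which is exactly the legacy case raised above.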
Meeting in August?
At Calvin?
Prepare, submit, and review PSI sets
sjd and kel to prepare initial fsds, etc. for hebrew morphs
Priorities:
Finish manual
LAWG
Test sermons/commentaries/devotionals/etc
Versification system declarations and mapping
Liaison w/ Bible Forum audio/video stuff
Grant development for text prep
Items passed to list
extract list of schema-affecting issues and resolutions, release
*only* for testing of conversion software, etc.
2.1 schema:
--
Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@acm.org or steve@derose.net