[sword-devel] Creating a version of the BSB module with interlinear support

Sat Sep 30 05:54:48 EDT 2023

The Berean Standard Bible is available in two machine-readable formats:
USFM, and "translation tables", a 40MB Excel spreadsheet with a row for
every Hebrew or Greek word in their chosen source texts, alongside the
English text it's translated to. I would like to make one module with
the nice formatting of the USFM sources and the metadata from the
spreadsheet. I've spent the last few weeks writing a script that runs
through both in parallel and checks that everything lines up, so I'm
now confident that I have an accurate mapping between them.

My question now is, how can I translate the data from the spreadsheet 
into OSIS?

Here's the information the spreadsheet gives me:

  * *he_ordinal* (example: 1) "Hebrew Ordinal", increments for each
    spreadsheet row in the Old Testament, set to 999999 for each row in
    the New Testament
  * *el_ordinal* (example: 0) "Greek Ordinal", set to 0 for each row in
    the Old Testament, increments for each row in the New Testament,
    except for Mark 1:1 which has a word with the number 18379.5
    (presumably something needed to be inserted and they didn't want to
    renumber everything else)
  * *en_ordinal* (example: 1) "English Ordinal", increments for each
    spreadsheet row (except for that word in Mark 1:1)
  * *language* (example: Hebrew) "Hebrew", "Greek", or sometimes
    "Aramaic"
  * *verse_ordinal* (example: 1) Increments for each verse in the
    Bible, so every word in Genesis 1:1 has "1", etc.
  * *source_word* (example: בְּרֵאשִׁ֖ית) The word in the original source
    text. Sometimes includes fancy brackets to mark sources other than
    WLC or Nestle 1904: {TR} ⧼RP⧽ (WH) 〈NE〉 [NA] ‹SBL› [[ECM]]
  * *transliteration* (example: bə·rê·šîṯ) A transliteration of the
    source word into the Latin alphabet
  * *grammar_code* (example: Prep-b | N-fs) A code describing the
    grammatical form of the word; these don't appear to be Robinson
    codes, but their own custom thing for Hebrew
    (https://biblehub.com/hebrewparse.htm) and Greek
    (https://biblehub.com/abbrev.htm)
  * *grammar_description* (example: Preposition-b | Noun - feminine
    singular) The grammar code, unabbreviated
  * *strongs_number* (example: 7225) The Strong's number of the basic
    form of this word
  * *translation* (example: In the beginning) The English text that
    appears in the BSB
  * *gloss* A definition from the Brown-Driver-Briggs Hebrew Lexicon, or
    Thayer's Greek Definitions, as appropriate. Example:
        1) first, beginning, best, chief
        1a) beginning
        1b) first
        1c) chief
        1d) choice part
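
To make the source_word brackets concrete, here is a minimal Python sketch of how the markers might be peeled off a cell. It assumes each bracket pair wraps the whole cell (for example "{word}") and that the label shown inside each pair in the list above names the source that pair stands for. I haven't verified either assumption against the spreadsheet, and the function name is made up for illustration.

```python
# Bracket pairs listed in the post, mapped to the source each one marks.
# Assumption: a pair wraps the entire source_word cell.
# The doubled "[[ ]]" is listed first so it is tried before the single "[ ]".
BRACKETS = [
    ("[[", "]]", "ECM"),
    ("{", "}", "TR"),
    ("⧼", "⧽", "RP"),
    ("(", ")", "WH"),
    ("〈", "〉", "NE"),
    ("[", "]", "NA"),
    ("‹", "›", "SBL"),
]


def split_source_markers(source_word):
    """Return (bare_word, sigla) for one source_word cell (hypothetical helper)."""
    word = source_word.strip()
    sigla = []
    stripped = True
    while stripped:
        stripped = False
        for opener, closer, siglum in BRACKETS:
            wrapped = word.startswith(opener) and word.endswith(closer)
            # Require at least one character between the brackets.
            if wrapped and len(word) > len(opener) + len(closer):
                word = word[len(opener):-len(closer)].strip()
                sigla.append(siglum)
                stripped = True
                break  # restart so nested brackets are peeled outermost first
    return word, sigla
```

A cell with no brackets comes back unchanged with an empty list, which would correspond to the default WLC or Nestle 1904 reading.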

Judging from the OSIS 2.1.1 User's Manual (and from sniffing around in
the KJVA module), I should use the <w> element to represent this
information in OSIS. It supports the following attributes (copy/pasted
from the Manual):

  * *gloss* Record comments on a particular word or its usage.
  * *lemma* Use to record the base form of a word.
  * *morph* Use to record grammatical information for a word.
  * *POS* Use to record the function of a word according to a particular
    view of the language's syntax.
  * *src* Use to record origin of the word.
  * *xlit* Use to record a transliteration of a word.

The first problem is that sometimes multiple source words are translated 
into a single English span, and it's not made clear how to express that 
in these attributes. From poking around in the KJVA module, I get the 
impression these are supposed to be space-delimited lists. Is that correct?

Assuming that's the case, here are my guesses at how to fill out these
attributes for each span:

  * *gloss* can't be done, because each gloss contains spaces which
    means the displaying app can't figure out which part of the gloss
    goes with which word
  * *lemma* is where Strong's numbers go; Greek Strong's numbers should
    be prefixed with "G" and Hebrew/Aramaic ones with "H0"
  * *morph* might be used for the "grammar code" content, but I would
    probably need to figure out how to translate them into Robinson
    codes first, since that seems to be the only morphological
    dictionary module in the Crosswire repositories
  * *POS* is unclear to me; I don't see how it differs from the "morph"
    attribute
  * *src* is also unclear: is this for the word order (he_ordinal or
    el_ordinal, possibly numbered from the beginning of the verse rather
    than the beginning of the entire Bible) or the actual choice of
    source text (Nestle1904, TR, NA, SBL, etc.)?
  * *xlit* clearly comes from the "transliteration" field
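
To make these guesses concrete, here is a rough Python sketch of how one English span might be turned into a <w> element. It rests on assumptions: the values are space-delimited (the open question above), the "G"/"H0" prefixes are as guessed, "strong:" is the work prefix the KJV-family modules put in front of Strong's numbers, and src holds the word position within the verse, which is only one of the two readings mentioned above. The function names and dictionary keys are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def strongs_ref(number, language):
    """Format one Strong's number for the lemma attribute."""
    # Assumption: "G" for Greek, "H0" for Hebrew and Aramaic, as guessed above.
    prefix = "G" if language == "Greek" else "H0"
    return f"strong:{prefix}{number}"


def verse_positions(rows):
    """Number rows from 1 within each verse, grouping on verse_ordinal."""
    seen = {}
    positions = []
    for row in rows:
        verse = row["verse_ordinal"]
        seen[verse] = seen.get(verse, 0) + 1
        positions.append(seen[verse])
    return positions


def w_element(rows, text, positions=None):
    """Wrap one English span in <w>, joining per-word values with spaces."""
    attrs = {
        "lemma": " ".join(
            strongs_ref(r["strongs_number"], r["language"]) for r in rows
        ),
        "xlit": " ".join(r["transliteration"] for r in rows),
    }
    if positions is not None:
        # Assumption: src carries word order, not the choice of source text.
        attrs["src"] = " ".join(str(p) for p in positions)
    rendered = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<w {rendered}>{escape(text)}</w>"
```

For the example row in the table (7225, Hebrew, bə·rê·šîṯ, position 1) this yields <w lemma="strong:H07225" xlit="bə·rê·šîṯ" src="1">In the beginning</w>. The gloss, morph, and POS attributes are left out because the right values for them are still unclear.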

One thing that's clearly missing is where to put the source word. How 
does that work?

Is there another way to represent information that doesn't fit into the
<w> element? I'd like this module to be as useful as possible, so I'm
hesitant to toss out any information that can be usefully represented.

Is there anything else I've missed or misunderstood?

Timothy.