[sword-devel] XML versions of Thayer's or Strongs?
Sean
sean at semanticbible.com
Sat Mar 11 10:51:14 MST 2006
Thanks, your detailed instructions and example (and a little puzzling
about how Java works, since i'm not a Java guy) produced some useful
results, as well as (of course!) a few more questions related to running
this with Thayer's.
1) there are various complaints: i'm not sure if they're significant
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring unexpected entry in orthodoxy of sMinimumVersion
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightHolder=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: CopyrightDate=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in orthodoxy: DistributionNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightNotes=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: CopyrightContactEmail=
org.crosswire.jsword.book.sword.ConfigEntryTable(INFO): Ignoring empty entry in rsv: DistributionNotes=
org.crosswire.jsword.book.filter.thml.THMLFilter(INFO): Could not fix it by cleaning tags: Illegal character or entity reference syntax.
2) the results from Thayer's seem to have lost the Greek characters.
What's in the .imp file looks like some 8-bit chars
ωφελιμος
which i assume is some kind of representation of the Greek characters
(haven't quite figured out what: doesn't seem to be UTF-8). But this
winds up in the output as a string of '?'s.
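A quick way to test the encoding guess in point 2 is to look at the raw bytes directly. The sketch below (Python rather than Java) checks whether a byte string is valid UTF-8, and shows how pushing Greek text through a charset that has no Greek letters produces a run of '?'s. The function names are made up for illustration, and that a lossy charset conversion is what actually happens in the conversion step is only an assumption.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lossy_ascii(text: str) -> str:
    """Encode to ASCII, substituting '?' for every unrepresentable character."""
    return text.encode("ascii", errors="replace").decode("ascii")


greek = "ωφελιμος"
utf8_bytes = greek.encode("utf-8")            # multi-byte UTF-8 form
greek_8bit = greek.encode("iso-8859-7")       # one byte per letter (assumed example charset)

print(is_valid_utf8(utf8_bytes))              # True
print(is_valid_utf8(greek_8bit))              # False
print(lossy_ascii(greek))                     # ????????
```

If the bytes in the .imp file fail the UTF-8 check, the next step would be to try decoding them with candidate single-byte charsets until the Greek reads correctly.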
3) entry 5207 (huios) produces bad XML: looks like a TDNT reference
attribute in a sync tag doesn't get its terminating quote (after
"8:400"?) and slash+angle bracket ending the sync are also missing:
AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 (<sync
type="TDNT" value="8:400, 1210), Son of God
The fault seems to exist in the .imp file as well (which has these
<sync> tags embedded)
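To find other entries with the same defect as in point 3, one could scan the .imp text for `<sync` openings that do not form a complete tag. This is a minimal sketch in Python; the regular expression only assumes the attribute layout seen in the examples quoted here (`type` then `value`, closed with `/>`), and `find_broken_sync` is a hypothetical helper name.

```python
import re

# Well-formed tag as seen in the quoted examples: <sync type="..." value="..." />
WELL_FORMED_SYNC = re.compile(r'<sync\s+type="[^"<>]*"\s+value="[^"<>]*"\s*/>')
SYNC_START = re.compile(r'<sync\b')


def find_broken_sync(text: str) -> list:
    """Return a short snippet for each '<sync' that is not a complete tag."""
    # Collapse whitespace so tags wrapped across lines are matched as one unit.
    text = " ".join(text.split())
    broken = []
    for m in SYNC_START.finditer(text):
        if not WELL_FORMED_SYNC.match(text, m.start()):
            broken.append(text[m.start():m.start() + 40])
    return broken


sample = ('AV-son(s) 85, Son of Man +<sync type="Strongs" value="G444" /> 87 (<sync\n'
          'type="TDNT" value="8:400, 1210), Son of God')
for snippet in find_broken_sync(sample):
    print(snippet)
```

On the fragment from entry 5207, only the TDNT tag is reported; the Strongs tag before it passes.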
4) there are a number of bare "&" characters in the original which seem
to get dropped in the output instead of being replaced with &amp; (except for
one in #5207, one might suppose because of the unterminated
attribute/tag issue)
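For point 4, a pre-processing pass over the .imp text could escape the bare ampersands before any XML-aware tool sees them. The following Python sketch replaces every "&" that does not already start an entity or character reference; `escape_bare_ampersands` is a hypothetical name, and this is not a description of what the existing filter does.

```python
import re

# An '&' is left alone only if it begins something shaped like &name; &#123; or &#x1F;
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)')


def escape_bare_ampersands(text: str) -> str:
    """Replace each bare '&' with '&amp;', leaving existing references intact."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands('value="G5811" /> & <sync'))   # bare '&' is escaped
print(escape_bare_ampersands("already &amp; escaped"))      # unchanged
```

Note that this treats anything shaped like a named entity (for example `&nbsp;`) as already escaped, even though plain XML only predefines five names, so such references may still need separate handling.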
5) There are some issues with the synonym references around ampersands
(whether related to #4 i can't tell): the .imp file has
For Synonyms see entry <sync type="Strongs" value="G5811" /> & <sync
type="Strongs" value="G5889" />
but the OSISified output has
<w lemma='strong:G5811'>
For Synonyms see entry </w><w lemma='strong:G5889'> </w>
Hope this feedback is helpful, and thanks again for the pointers. Unless
there's a solution to the problem with the Greek characters, i'll have
to fall back to parsing the .imp file by hand, since getting these out
is important to me. By the way, what displays in the Sword Project for
Thayer's lacks accents and breathing marks, though by comparison i see
them in e-Sword's version: anyone happen to know why?
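As a starting point for the fallback of parsing the .imp file by hand, here is a minimal Python sketch. It assumes the usual import layout in which each entry starts with a key line prefixed by `$$$` and the entry text follows on the lines after it; that layout, the `parse_imp` name, and the sample lines are assumptions for illustration, not taken from the Thayer's file itself. The caller decides how to decode the file, since the encoding is still an open question.

```python
def parse_imp(lines):
    """Yield (key, text) pairs from lines in the assumed '$$$key' layout."""
    key, body = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("$$$"):
            if key is not None:
                yield key, "\n".join(body)
            key, body = line[3:].strip(), []
        elif key is not None:
            body.append(line)
    # Emit the final entry, which has no following key line to trigger it.
    if key is not None:
        yield key, "\n".join(body)


# Hypothetical sample input, only to show the shape of the result.
sample_lines = ["$$$KEY1\n", "first line\n", "second line\n", "$$$KEY2\n", "only line\n"]
for key, text in parse_imp(sample_lines):
    print(key, repr(text))
```

With a real file this could be driven by something like `parse_imp(open(path, encoding=...))` once the right encoding is known.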
His,
Sean