[sword-devel] Hesychius

Greg Hellings greg.hellings at gmail.com
Thu Nov 9 22:34:31 MST 2006


On further inspection, it appears that the only HTML formatting in the
above document is a <div....> .... </div> pair corresponding to each
<text> ... </text> element in the exported XML.  Thus every angle
bracket around anything other than text/title/page is a bracket placed,
apparently with some significance, around a Greek word.
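
If that observation holds, separating the two kinds of brackets might
look roughly like the Python sketch below.  It assumes the div wrapper
really is the only markup; the function name and the sample string are
invented for illustration and are not taken from the Hesychius data.

```python
import re

# Assumption: <div ...> and </div> are the only real markup in the text;
# any other <...> span is part of the lexicon content itself.
DIV_TAG = re.compile(r"</?div\b[^>]*>", re.IGNORECASE)
BRACKETED = re.compile(r"<([^<>]+)>")


def split_markup(text):
    """Drop the div wrapper and list the bracketed spans that remain."""
    without_divs = DIV_TAG.sub("", text)
    bracketed = BRACKETED.findall(without_divs)
    return without_divs, bracketed


# Made-up sample entry (hypothetical, not from the real export).
sample = '<div class="entry"><λόγος>· ῥῆμα</div>'
cleaned, words = split_markup(sample)
print(cleaned)
print(words)
```

This only tells markup apart from content; it says nothing about
whether a given bracket marks a keyword or an emphasized phrase.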

Perhaps this is the limit of where pure XSLT can take us.  It seems
that it would be better at this point to process the remaining text
with something like Python or Perl and have that generate the desired
OSIS text, since the OSIS has nothing to do with the XML structure of
the current document but rather with its textual content.
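
As a possible first step in that direction, here is a minimal sketch
that pulls the title and the decoded text out of each page with
Python's xml.etree.ElementTree.  The "{*}" wildcard (Python 3.8 or
later) matches a tag in any namespace, which sidesteps the default
namespace of the export.  The embedded sample, its namespace URI, and
the pages() helper are placeholders of mine, not the real export, and
generating the actual OSIS is left out entirely.

```python
import xml.etree.ElementTree as ET

# Placeholder document standing in for the export: the namespace URI and
# the entry text are invented; only the page/title/text nesting and the
# entity-encoded HTML follow what was described on the list.
SAMPLE = """<mediawiki xmlns="http://example.org/export-placeholder">
  <page>
    <title>Alpha</title>
    <revision>
      <text xml:space="preserve">&lt;div&gt;&lt;λόγος&gt;· ῥῆμα&lt;/div&gt;</text>
    </revision>
  </page>
</mediawiki>"""


def pages(xml_source):
    """Yield (title, text) for every page, ignoring the namespace."""
    root = ET.fromstring(xml_source)
    for page in root.findall(".//{*}page"):
        title = page.findtext("{*}title")
        # The parser decodes &lt; and &gt;, so text holds literal brackets.
        text = page.findtext(".//{*}text")
        yield title, text


result = list(pages(SAMPLE))
print(result)
```

Because the parser decodes the entities itself, the text arrives with
literal angle brackets and the XML file never has to be edited by hand.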

This really is my last e-mail tonight...

--Greg

On 11/10/06, Greg Hellings <greg.hellings at gmail.com> wrote:
> And I forgot to mention that I had posted it to the wxSword download
> site on SourceForge:
> https://sourceforge.net/project/showfiles.php?group_id=142229
>
> Sorry!
>
> --Greg
>
> On 11/10/06, Greg Hellings <greg.hellings at gmail.com> wrote:
> > Getting the output from their included wiki export page was the
> > trivial portion of the task (read: I had to guess completely judging
> > from the directions that were on Wikipedia's site and extrapolate
> > those to figure out what name WikiSource actually wanted for each
> > page).  Writing the XSLT is proving to be far more cumbersome.  I just
> > spent over an hour trying to figure out why my XSLT was not producing
> > any output, only to realize that the exported file had a default
> > namespace.
> >
> > It will be incredibly difficult to extract any structural information
> > from the files in an automated system.  For one, I am not familiar
> > with what Hesychius is, and while I took extensive Greek in my
> > undergrad course of study, reading through that massive document would
> > be unwieldy for me at this point, since I could not dedicate a huge
> > amount of time to the work.
> >
> > For now I have posted an XML file that is the filtered XML that comes
> > from the export, with everything except for the page, title and text
> > fields removed (since the rest of the information simply pertains to
> > who performed the latest modification to the page and when it happened
> > and their change log entry).  I have also modified all of the &gt; and
> > &lt; to be > and < in an effort to return the data to its display
> > format.
> >
> > Someone will need to figure out how to differentiate when the < or >
> > is pertinent to the HTML/XML or when it is pertinent to the more
> > specific data within.  The WikiSource document seems to make very poor
> > use of the < and > characters to both denote a keyword and to
> > emphasize certain words or phrases, thus making the data even more
> > difficult to parse.  I don't know that a fully automated solution will
> > be possible with this data or with the original data... but it's all
> > just a starting point.
> >
> > If you want other files, let me know.
> >
> > --Greg
> >
> > On 11/9/06, Troy A. Griffitts <scribe at crosswire.org> wrote:
> > > Greg,
> > >         You're amazing!!! I must have played with stuff for hours today trying
> > > to make sense from the wikimedia export docs.  I even downloaded some
> > > PyWikipediaBot python thingy but couldn't get it to run either (I am
> > > inept at python, so I wasn't surprised, though quite frustrated,
> > > nonetheless).   Thank you!!!  If this might make any difference, my
> > > personal interest in the lexicon, after it is usable by SWORD, is to
> > > build a synonyms database from the data.  If there is any indication in
> > > the data that a synonym for an entry is being listed, I would most
> > > appreciate a unique <seg type="x-synonym">, or some such.  Thank you
> > > again, so much, for your work.  I am very excited!
> > >
> > >         -Troy.
> > >
> > >
> > >
> > > Greg Hellings wrote:
> > > > So yeah... I managed to grab the XML file from the Export (it's fun
> > > > trying to do that on a webpage written in modern Greek when you're
> > > > used to ancient Greek and you can't remember what the Koine word for
> > > > "hyperlink" or "webpage" is :P).
> > > >
> > > > It comes to a mere 4.2 MB file, so now the trick will be parsing the
> > > > text that is wanted out of that and creating an OSIS from it.  The
> > > > main problem with that is that the text from the file is placed inside
> > > > a tag with an xml:space="preserve" attribute, and all of the HTML is
> > > > encoded as entities underneath that.  Therefore all of the
> > > > structure of the actual data (other than the large groupings under
> > > > alpha, beta, gamma, etc) is lost to an XML/XSL parsing combination.
> > > >
> > > > Wish me luck... ::dives into a pile of libxml2::
> > > >
> > > > --Greg Hellings
> > > >
> > > > On 11/9/06, Troy A. Griffitts <scribe at crosswire.org> wrote:
> > > >> We had a contributor on IRC, today, post this link:
> > > >>
> > > >> http://el.wikisource.org/wiki/%CE%93%CE%BB%E1%BF%B6%CF%83%CF%83%CE%B1%CE%B9
> > > >>
> > > >>
> > > >> It looks promising.
> > > >>
> > > >> I know there is a way to download content in XML of a mediawiki site,
> > > >> but have no experience doing so.
> > > >>
> > > >> Anyone want to take a shot at producing a SWORD Hesychius Lexicon (or
> > > >> even just a text file) from this link?
> > > >>
> > > >>
> > > >> Thanks for everyone's input and help.
> > > >>
> > > >>         -Troy.
> > > >>
> > > >>
> > > >>
> > > >> Peter von Kaehne wrote:
> > > >>> I spoke yesterday both to Prof Hansen and to Prof Ian Cunningham (who is a collaborator of Hansen)
> > > >>>
> > > >>> http://www.csad.ox.ac.uk/CSAD/Hesychius/Hansen.html
> > > >>>
> > > >>> Prof Hansen mentioned the TLG and Prof Cunningham confirmed this + said further there is no electronic version of Hansen's work available. I understand that Hansen's work is published in de Gruyters' Sammlung Griechischer and Lateinischer Altertuemer
> > > >>>
> > > >>> http://www.degruyter.com/rs/174_AT_E_ED_ENU_h.cfm?rc=19992&id=SER-M1-WDG-HESYCH-B-19992&fg=AT
> > > >>>
> > > >>> - a copy of which I found here to buy:
> > > >>>
> > > >>> http://www.basis-buch.de/main-173503.html
> > > >>>
> > > >>> WRT the TLG: I read the licence in detail and, bluntly said, they have no leg to stand on to deny us use of the texts:
> > > >>>
> > > >>> They already allowed us to do what we want to do on the basis of the licence, even if they now get cold feet on direct questioning. That said, at least Schmidt's edition is now public domain anyway, and unless there are DMCA restrictions everyone can copy it out of there. And outside of DMCA-like legislation only the public-domain status would apply anyway. But IANAL etc.
> > > >>>
> > > >>> Wrt Latte/Hansen- I am not sure how far Latte's work would constitute an original work in its own right - I presume it does - but again the TLG licence does allow text extraction for scholarly work which is non-commercial.
> > > >>>
> > > >>> Peter
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> -------- Original Message --------
> > > >>> Date: Fri, 03 Nov 2006 17:23:03 -0700
> > > >>> From: "Troy A. Griffitts" <scribe at crosswire.org>
> > > >>> To: SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
> > > >>> Subject: Re: [sword-devel] Hesychius
> > > >>>
> > > >>>> Peter,
> > > >>>>      Thank you for your time and info.  We have an ongoing dialog with UCI
> > > >>>> regarding the use of the data from TLG.  They have denied our request
> > > >>>> twice, but I am hoping a detailed third plea might solicit sympathy.
> > > >>>>
> > > >>>>      -Troy.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Peter von Kaehne wrote:
> > > >>>>> The TLG also has the older edition by Schmidt, which should by now
> > > >>>>> be public domain as it dates from 1861.
> > > >>>>> Peter
> > > >>>>>
> > > >>>>> -------- Original Message --------
> > > >>>>> Date: Fri, 03 Nov 2006 15:59:02 +0100
> > > >>>>> From: "Peter von Kaehne" <refdoc at gmx.net>
> > > >>>>> To: SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
> > > >>>>> Subject: Re: [sword-devel] Hesychius
> > > >>>>>
> > > >>>>>> The TLG indeed contains parts of the Hesychius - Latte's work only.
> > > >>>>>>
> > > >>>>>> Hansen's work is published on paper only in Germany. Electronic copies
> > > >>>>>> are not available.
> > > >>>>>>
> > > >>>>>> The TLG licence of the text is such that the work might be possible to
> > > >>>>>> integrate - i.e. commercial scholarly tools making use of the whole
> > > >>>>>> text are forbidden but crosswire might be possible.
> > > >>>>>>
> > > >>>>>> HTH
> > > >>>>>>
> > > >>>>>> Peter
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> -------- Original Message --------
> > > >>>>>> Date: Thu, 02 Nov 2006 16:38:36 -0700
> > > >>>>>> From: "Troy A. Griffitts" <scribe at crosswire.org>
> > > >>>>>> To: sword-devel at crosswire.org
> > > >>>>>> Subject: [sword-devel] Hesychius
> > > >>>>>>
> > > >>>>>>> If anyone has the time to research where we can find an electronic copy
> > > >>>>>>> of Hesychius' Greek Lexicon, your efforts would be extremely valuable to
> > > >>>>>>> me right now.  I believe the TLG has a copy of it, but I currently don't
> > > >>>>>>> have easy access to the TLG.  Thanks in advance.
> > > >>>>>>>
> > > >>>>>>>   -Troy.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> _______________________________________________
> > > >>>>>>> sword-devel mailing list: sword-devel at crosswire.org
> > > >>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
> > > >>>>>>> Instructions to unsubscribe/change your settings at above page


