[sword-devel] XSLT vs. C++
Martin Denham
mjdenham at gmail.com
Wed Dec 1 05:50:28 MST 2010
Excuse me for being pure Java and not knowing Sword C++ at all but can I add
(perhaps obviously) that an XSLT framework will perform noticeably slower
than a SAX-like framework.
Here
<http://java.sun.com/developer/technicalArticles/xml/JavaTechandXML_part2/> are
some performance comparisons. They are old and Java-centric and so XSLT
performance may have improved but these tests show that in the worst case
XSLT was 3 times slower than SAX and a good SAX processor was twice as fast
as a good XSLT processor. If pages are parsed at the chapter level then
users may notice a delay turning pages on smaller machines like mobile
phones.
Martin
On 1 December 2010 12:20, Troy A. Griffitts <scribe at crosswire.org> wrote:
> The logic to get from any Publisher Source Document to rendered HTML is
> a very complex task to solve.
>
> We conceptually create Plato's Form of, say, a Bible, and try to fit
> imperfect Publisher markup into this concept. A Bible has verses,
> headings between verses, chapter intros, footnotes, crossrefs, lemma
> information, etc.
>
> If we do not do this, then we become a PDF reader-- there are already
> PDF readers and we lose the ability to do Bible specific things with our
> software. For example, if we didn't normalize the concept of crossref
> across all Books, then we couldn't turn them on and off; we couldn't
> provide a crossref panel in the reader which fills according to which
> crossref is hovered over, etc. Same with notes, strongs, headings, etc.
>
> This causes us to impose our Form onto a publisher's text. I understand
> why some people may not like this, but it is very much to our end users'
> benefit that we do this. Without this, we become a web-browser or a PDF
> reader. Which are fine for their purpose, but we intend to provide
> common, familiar, and sometimes novel Bible study aids to our reader.
>
> The current processing model is dark magic and I apologize for this. It
> should be well documented and easy to modify. I will attempt to improve
> the dissemination of knowledge of exactly WHAT our Forms are, how we
> impose those Forms on publishers' texts and improve the documentation
> and code to help others understand and have the ability to improve the
> code.
>
> I'll attempt to post a few easy to swallow SWORD 101 classes in email,
> which will help us gather our thoughts and documents on how all this works.
>
>
> Troy
>
>
>
> On 12/01/2010 12:09 AM, Greg Hellings wrote:
> > On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <scribe at crosswire.org>
> wrote:
> >> Having finally returned from a hectic 2 weeks of conferences, and lots
> >> to do before leaving for Christmas, I'm not sure I'm up for a heated,
> >> passionate debate about technologies right now, but by all means, please
> >> commence the public discussion.
> >>
> >> Let me start by saying that everyone (I believe) agrees that we would
> >> like to have an HTML output from the engine which is more generic and
> >> would allow CSS to be applied if a frontend would like to do this.
> >> Currently HTMLHREF output from the engine is used by the widest number
> >> of frontends (to my knowledge) and would benefit everyone involved by
> >> becoming much more generic. e.g.,
> >>
> >> <title> -> <h1>
> >> rather than
> >> <title> -> <b><br />
> >>
> >> <transChange type="added"> -> <span class="tcAdded">
> >> rather than
> >> <transChange type="added"> -> <i>
> >>
> >> etc.
> >>
> >> I believe this will solve a number of issues and possibly get the BT and
> >> MacSword teams onboard to using the same HTML output filters as the
> >> other projects involved (or at least subclassing them and using the
> >> majority of their functionality).
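A minimal sketch of the more generic mapping described above, using only the standard library. The function name openTagFor and the fallback for unknown elements are assumptions made for illustration; this is not the SWORD filter API, only the title -> h1 and transChange -> span class="tcAdded" rules quoted above expressed as code.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of SWORD): maps an OSIS element name plus an
// optional "type" attribute to a generic, CSS-friendly HTML start tag.
std::string openTagFor(const std::string &name, const std::string &type = "") {
    if (name == "title") {
        return "<h1>";
    }
    if (name == "transChange") {
        if (type.empty()) {
            return "<span class=\"tc\">";  // assumption: no type, plain class
        }
        // Build a class name such as "tcAdded" from type="added".
        std::string cls = "tc" + type;
        cls[2] = static_cast<char>(std::toupper(static_cast<unsigned char>(cls[2])));
        return "<span class=\"" + cls + "\">";
    }
    // Assumption: unknown elements become a span classed by element name,
    // so a frontend can still style them with CSS.
    return "<span class=\"" + name + "\">";
}
```

A frontend that wants different markup could wrap or replace a function like this instead of patching hard-coded <b> and <i> output.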
> >
> > I think this is our pretty well accepted premise. The current filters
> > stink to various degrees and currently no one is willing to step up
> > and tackle them.
> >
> >>
> >>
> >> Now, as to the other issue of using XSLT internally in the engine to
> >> process OSIS -> HTML
> >>
> >> I will throw a few melons into the air for target practice, and let the
> >> shooting commence.
> >>
> >> _____________________________
> >> *Multiple Language*
> >>
> >> XSLT is a programming language in the same sense that C++ is a
> >> programming language.
> >>
> >> The SWORD Project C++ engine is written in C++. It is not a Python
> >> engine; it is not a Perl engine; it is not a Java engine; it is C++.
> >>
> >> One might say, "Well, you can use XSLT from C++. Doesn't JSword do this
> >> from Java?" Well, yes, of course you can, and DM can comment, if he
> >> feels the desire to recommend his decision to incorporate an XSLT engine
> >> into the JSword logic flow. But simply because one CAN doesn't mean one
> >> SHOULD. We COULD incorporate a Perl text processing engine in our C++
> >> code, or an Awk processing engine... that doesn't mean we SHOULD. I'm
> >> sure some would say we SHOULD. And obviously DM has thought he SHOULD
> >> incorporate XSLT processing for JSword, so I'm not intending to say it
> >> is a BAD decision, just that it is not a decision I would make; in the
> >> same way as our projects each chose C++ vs. Java to implement our
> objective.
> >
> > If a developer is going to develop OSIS -> HTML filters, for instance,
> > we are already assuming they know OSIS and HTML. OSIS is XML and HTML
> > is SGML (though most of our work is probably targeting a more
> > XML-dialect of HTML). XSLT is also XML. Formally, it is not even a
> > programming language, but just a set of formatting/processing
> > instructions in XML.
> >
> > Any developer using XML who is worth their salt should at least be
> > familiar with the basics of XSL - they may not be a guru of XPath
> > expressions or have every attribute of XSL memorized - and would
> > probably expect a library which handles XML as its preferred input
> > method to utilize one of the standard XML processing methods. I know
> > I'm not the only person who was surprised to look in the library
> > filters and see neither DOM, SAX nor XSLT technologies in use. That
> > was when I first ran and hid.
> >
> > Of course, this portion of the discussion is only relevant for the
> > from-OSIS filters.
> >
> >>
> >> _______________________
> >> *XSLT better than C++*
> >>
> >> One might say, "well, XSLT is better suited to process XML than C++."
> >> That's a loaded and unquantified statement.
> >>
> >> Certainly the C++ language specification doesn't include facilities to
> >> easily process XML, but that doesn't mean a plethora of C++ libraries
> >> don't exist for assisting in this task.
> >>
> >> The SWORD engine includes classes like XMLTag and SWBasicFilter which
> >> implement a SAX processing model.
> >>
> >> The current filters do not all use SWBasicFilter, nor XMLTag. They've
> >> been written over 15 years and many before these classes existed. Some
> >> are ugly and need to be rewritten for readability, certainly. But not
> >> necessarily in a different programming language.
> >
> > XSLT being "better" is, yes, a matter of complete subjectivity. And,
> > as I mentioned above, is only useful when our source is XML to begin
> > with. For GBF or Plaintext sources, XSLT is clearly not even
> > applicable.
> >
> > But the current C++ is so good that you seem the only person willing
> > to touch it. Peter just mentioned he tried once and couldn't get it.
> > I have gone into the filters before with a singular goal in mind and
> > was able to produce my desired changes, but it was long, drawn-out and
> > painful. Doing the same tasks in XSL would have taken me mere
> > seconds. I know a few other people, at least, have said they would
> > know how to do a task if XSLT was used instead of C++. Of course,
> > that is a hypothetical - I can't know that they would have done so,
> > but that was their claim at the time.
> >
> > Our recent discussion about the use of the "n" attribute for footnotes
> > in ThML is a perfect example. Maintaining the attribute in XSL would
> > have been a trivial task I could have handled in seconds. Instead, it
> > required you, myself and Karl and took about 10 days to get fixed.
> > You had to alert Karl and me to the presence of the attributes, I provided
> > him a preliminary patch to incorporate the values, then he had to
> > heavily modify the patch to operate correctly in non-ThML source and a
> > few other corner cases. And, in the end, the fix is only in Xiphos'
> > code base - I would have to go through 2 of those three steps again in
> > Bibletime, BPBible, MacSword and any other applications I wanted to
> > see proper behavior in. Alternatively I could tackle the filters -
> > but I'm not really inclined to do so.
> >
> > Is XSLT "better"? For me, it would be better because I could more
> > easily modify its behavior based on the fact that I know XML and could
> > easily locate the necessary processing directive. For you, maybe not.
> > Are there things you simply cannot do in XSL that C++ can? Yes. IMO
> > the benefits of XSL outweigh the benefits of C++ for this task, but
> > you clearly disagree. :) I would also say that DOM or SAX processing
> > would be better for all the same reasons - it shields the user from
> > having to see the XML parsing and handle inconsistencies in
> > whitespace, validation, etc and is still a decently well-known
> > technology among XML users (even if it's slightly less well-known than
> > XSL). And with a DOM or SAX parser, you could still happily employ
> > the full power of C++.
> >
> >>
> >> ________________________
> >> *COMPLEXITY*
> >>
> >> The task of enumerating all types of OSIS <title> tags, and deciding
> >> what to do with each, and how to classify all <title> tags from all
> >> possible OSIS documents into our enumeration is still going to be a
> >> complex task using XSLT. <title> is a complex example, but certainly
> >> not the most complex.
> >>
> >> It is a tall task to generalize all elements of all documents from all
> >> publishers into one conceptual model with one chosen output for a
> >> frontend-- whether that be for an audience on the Desktop, web-based, or
> >> a handheld.
> >>
> >> The complex processing required by the engine will require long, complex
> >> XSLT-- which likely will incorporate callbacks to C++. It will not be
> >> more simple-- only mixed language.
> >
> > I could also argue that the XSL would not require a developer to
> > mentally filter out the code that just identifies and locates XML
> > elements and attributes and parses them from the code that transforms
> > them and generates the output. Thus yes, it might include some
> > extension functions into C++ but it would be simpler. And it would
> > also be more expressive.
> >
> > The enumeration of every OSIS <title> tag is a moot point for the
> > decision. You need to enumerate them all in C++ as well and decide
> > what to do with them. That doesn't change in the XSL - just the
> > method used. An XSL match along the lines of <xsl:template
> > match="title[@type='psalm']"> still has to be done in C++ with some sort
> > of if(tag.name() == "title" && tag.attr("type") == "psalm") or whatever
> > the syntax is. And that is assuming the current filter is using
> > XMLTag and isn't comparing character strings directly.
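To make the comparison concrete, here is a rough, self-contained sketch of the C++ side of that match. The Tag struct and isPsalmTitle are hypothetical stand-ins written for this example; they are not SWORD's XMLTag and make no claim about its real method names.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a parsed start tag; this is NOT SWORD's XMLTag.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attrs;

    // Returns the attribute value, or an empty string when it is absent.
    std::string attr(const std::string &key) const {
        auto it = attrs.find(key);
        return it == attrs.end() ? std::string() : it->second;
    }
};

// C++ counterpart of the XSL pattern title[@type='psalm'].
bool isPsalmTitle(const Tag &tag) {
    return tag.name == "title" && tag.attr("type") == "psalm";
}
```

Either way the enumeration of cases is the same; the difference is whether the element and attribute lookup is written by hand or supplied by the XML tooling.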
> >
> >> _______________________
> >> *Semantic vs. Display*
> >>
> >> Some will say (and have), "well, let everything be display oriented and
> >> let the publisher decide". Fine, then you lose 2 things: the ability to
> >> display differently per user preference, per display device; and you
> >> also give up the promise to actually do any interesting research on the
> >> text. When you lose semantic markup, then you lose all interesting
> >> information about WHAT is being marked up.
> >
> > I just want to be clear that I'm not advocating the use of display
> > over semantics as a general choice. My statements are strictly based
> > around my specific task and the fact that OSIS support in SWORD and
> > the front ends is not as good as the support of ThML. Largely this is
> > because most applications display in HTML and my required task is
> > framed entirely in terms of the presentation and display - not the
> > semantics. I would love and prefer to use OSIS for this task, but I
> > simply cannot accomplish it with the state of SWORD at this time.
> >
> >>
> >> _______________________
> >> *More than a Rendering Engine*
> >>
> >> The SWORD C++ Engine is more than simply a text rendering engine-- it is
> >> a Biblical text research engine.
> >>
> >> If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU
> >> Greek text, the entire program to do such is:
> >>
> >> SWMgr library;
> >> SWModule *whnu = library.getModule("WHNU");
> >> whnu->setKey("2th.2.13");
> >> whnu->RenderText();
> >>
> >> cout << "The morphology of word three is: " <<
> >> whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;
> >>
> >>
> >> That reads nice (at least in my opinion). I don't need to know about
> >> XML, XSLT, care what markup the WHNU module uses, I don't even have to
> >> know how to make a SWORD filter. The current filters do all the work of
> >> breaking out these attributes and making them available in a nice and
> >> interesting map.
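For readers unfamiliar with the lookup above, the following is a sketch of how a filter could populate a three-level map keyed the same way ("Word" -> "003" -> "Morph"). AttributeMap and recordMorph are names invented for this example, and the real SWORD container type may differ; only the key layout is taken from the snippet above.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical alias for a three-level attribute map
// (category -> entry -> key -> value); the real SWORD type may differ.
using AttributeMap =
    std::map<std::string, std::map<std::string, std::map<std::string, std::string>>>;

// Hypothetical filter step: stores the morphology code for one word under a
// zero-padded, three-digit word number (3 becomes "003").
void recordMorph(AttributeMap &attrs, int wordNumber, const std::string &morph) {
    char key[16];
    std::snprintf(key, sizeof key, "%03d", wordNumber);
    attrs["Word"][key]["Morph"] = morph;
}
```

With a map filled this way, a caller reads attrs["Word"]["003"]["Morph"] without needing to know which markup the module was written in.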
> >
> > I'd like to be clear again, that XSL would only be useful for material
> > already in OSIS formats (or in valid ThML - I think TEI is also an XML
> > format?). I doubt many modules in ThML are strictly valid at their
> > import times, so XSL wouldn't be very useful, and GBF is a monster
> > unto itself. Doing the above in XSL from an OSIS source would not be
> > much different in complexity than what you have listed there.
> >
> > <xsl:template match="verse[@osisID='2thes.2.13']/w[@n=3]">
> > The morphology of word three is: <xsl:value-of select="@morph" />
> > </xsl:template>
> >
> > Or something similar (my knowledge of exact OSIS attribute names and
> > values wanes and it's been two or three weeks since I wrote an XPath
> > expression).
> >
> > Of course, the string processing portion of SWORD would continue to be
> > of great importance for any modules in GBF format or similar to bring
> > them into a useful form. In that way, SWORD would continue to be more
> > than just a text rendering engine. It would continue to offer all of
> > its features, its buffering from the system and from the format, its
> > indexing, its module fetching and storing, etc.
> >
> >> ______________________
> >>
> >>
> >> And finally, if bullets aren't flying already, I'll stir the heat up
> with...
> >>
> >> XSLT sucks. A good C++ programmer can do anything in C++ better than
> >> any XSLT programmer.
> >>
> >>
> >> :)
> >
> > A C++ programmer can definitely do more, since C++ is actually a
> > programming language and XSLT is a set of processing instructions.
> > Better? That depends on what the criteria are. For me, in my current
> > role as a module creator, the use of C++ is not currently better
> > because it is less flexible and extensible. For you, as the library
> > maintainer, perhaps C++ is better because it's what you are already
> > comfortable with and because it has largely been your hand in the
> > filters.
> >
> >>
> >> *duck*
> >> Have fun.
> >>
> >> Troy
> >>
> >> PS. In summary, I understand the current filters are sometimes overly
> >> complex and need cleanup, standardization, etc. It comes down to the
> >> fact that they mostly work, and other things which don't get priority,
> >> so they don't get much attention. But honestly, I think one might be
> >> oversimplifying the problem at hand without realizing it, if one simply
> >> thinks switching to XSLT will make things easier.
> >
> > I think one is also oversimplifying the options. My dreamlist is that
> > SWORD produce a well-formed, valid, complete OSIS document for an
> > arbitrary KeyList that I pass it with FMT_OSIS set. That basically
> > boils down to getting the *OSIS filters up to snuff and standardized.
> > The second item on the list is a readily extensible mechanism for
> > SWORD outputting HTML from that OSIS. If that choice is providing an
> > XSL stylesheet with the library, a C++ SAX processor that a front-end
> > can readily extend, a DOM interface that can be easily customized is
> > immaterial to me. I like all three of those, and can easily
> > understand and extend all of them.
> >
> > I think any of those technologies would be an improvement over all
> > in-house C++ for the second half of any such processing. If we are
> > using XML in Open Source Software, let's leverage the work of others
> > who have happily given us permission to use their libraries!
> >
> > --Greg
> >
> > _______________________________________________
> > sword-devel mailing list: sword-devel at crosswire.org
> > http://www.crosswire.org/mailman/listinfo/sword-devel
> > Instructions to unsubscribe/change your settings at above page
>
>