Excuse me for being pure Java and not knowing SWORD C++ at all, but can I add (perhaps obviously) that an XSLT framework will perform noticeably slower than a SAX-like framework?<div><br></div><div><a href="http://java.sun.com/developer/technicalArticles/xml/JavaTechandXML_part2/">Here</a> are some performance comparisons. They are old and Java-centric, so XSLT performance may have improved, but these tests show that in the worst case XSLT was 3 times slower than SAX, and a good SAX processor was twice as fast as a good XSLT processor. If pages are parsed at the chapter level, then users may notice a delay turning pages on smaller machines like mobile phones.</div>
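<div><br></div><div>For anyone unsure what "SAX-like" means here: the parser walks the input once and fires a callback per token, whereas an XSLT processor typically works on a whole tree held in memory. Below is a minimal C++ sketch of the single-pass idea. The names <code>Handlers</code>, <code>scan</code> and <code>plainText</code> are made up for illustration and are not part of SWORD or any real parser, and attributes, comments, CDATA and entities are deliberately ignored.</div>

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical callback set for a SAX-like pass (not a SWORD or library type).
struct Handlers {
    std::function<void(const std::string&)> startTag;
    std::function<void(const std::string&)> endTag;
    std::function<void(const std::string&)> text;
};

// Walk the input once, firing one callback per token. No tree is built.
// Simplification: the whole content between '<' and '>' is treated as the name.
void scan(const std::string& xml, const Handlers& h) {
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            std::size_t close = xml.find('>', i);
            if (close == std::string::npos) break;  // malformed tail: stop
            std::string name = xml.substr(i + 1, close - i - 1);
            if (!name.empty() && name[0] == '/') h.endTag(name.substr(1));
            else h.startTag(name);
            i = close + 1;
        } else {
            std::size_t next = xml.find('<', i);
            if (next == std::string::npos) next = xml.size();
            h.text(xml.substr(i, next - i));
            i = next;
        }
    }
}

// Example handler set: keep only the text content, ignoring all tags.
std::string plainText(const std::string& xml) {
    std::string out;
    Handlers h{
        [](const std::string&) {},
        [](const std::string&) {},
        [&out](const std::string& t) { out += t; }
    };
    scan(xml, h);
    return out;
}
```

<div>Calling <code>plainText</code> on a chapter's markup returns its text in one pass. This only shows the shape of the processing model and says nothing about how large the speed difference would be in practice.</div>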
<div><br></div><div>Martin<br><br><div class="gmail_quote">On 1 December 2010 12:20, Troy A. Griffitts <span dir="ltr"><<a href="mailto:scribe@crosswire.org">scribe@crosswire.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
The logic to get from any Publisher Source Document to rendered HTML is<br>
a very complex task to solve.<br>
<br>
We conceptually create Plato's Form of, say, a Bible, and try to fit<br>
imperfect Publisher markup into this concept. A Bible has verses,<br>
headings between verses, chapter intros, footnotes, crossrefs, lemma<br>
information, etc.<br>
<br>
If we do not do this, then we become a PDF reader-- there are already<br>
PDF readers and we lose the ability to do Bible specific things with our<br>
software. For example, if we didn't normalize the concept of crossref<br>
across all Books, then we couldn't turn them on and off; we couldn't<br>
provide a crossref panel in the reader which fills according to which<br>
crossref is hovered over, etc. Same with notes, strongs, headings, etc.<br>
<br>
This causes us to impose our Form onto a publisher's text. I understand<br>
why some people may not like this, but it is very much to our end users'<br>
benefit that we do this. Without this, we become a web browser or a PDF<br>
reader, which are fine for their purpose, but we intend to provide<br>
common, familiar, and sometimes novel Bible study aids to our reader.<br>
<br>
The current processing model is dark magic and I apologize for this. It<br>
should be well documented and easy to modify. I will attempt to improve<br>
the dissemination of knowledge of exactly WHAT our Forms are, how we<br>
impose those Forms on publishers' texts and improve the documentation<br>
and code to help others understand and have the ability to improve the code.<br>
<br>
I'll attempt to post a few easy to swallow SWORD 101 classes in email,<br>
which will help us gather our thoughts and documents on how all this works.<br>
<font color="#888888"><br>
<br>
Troy<br>
</font><div><div></div><div class="h5"><br>
<br>
<br>
On 12/01/2010 12:09 AM, Greg Hellings wrote:<br>
> On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <<a href="mailto:scribe@crosswire.org">scribe@crosswire.org</a>> wrote:<br>
>> Having finally returned from a hectic 2 weeks of conferences, and lots<br>
>> to do before leaving for Christmas, I'm not sure I'm up for a heated,<br>
>> passionate debate about technologies right now, but by all means, please<br>
>> commence the public discussion.<br>
>><br>
>> Let me start by saying that everyone (I believe) agrees that we would<br>
>> like to have an HTML output from the engine which is more generic and<br>
>> would allow CSS to be applied if a frontend would like to do this.<br>
>> Currently HTMLHREF output from the engine is used by the widest number<br>
>> of frontends (to my knowledge) and would benefit everyone involved by<br>
>> becoming much more generic. e.g.,<br>
>><br>
>> <title> -> <h1><br>
>> rather than<br>
>> <title> -> <b><br /><br>
>><br>
>> <transChange type="added"> -> <span class="tcAdded"><br>
>> rather than<br>
>> <transChange type="added"> -> <i><br>
>><br>
>> etc.<br>
>><br>
>> I believe this will solve a number of issues and possibly get the BT and<br>
>> MacSword teams onboard to using the same HTML output filters as the<br>
>> other projects involved (or at least subclassing them and using the<br>
>> majority of their functionality).<br>
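<br>
The two mappings proposed above can be sketched in C++ as follows. This is only an illustration: renderGeneric is a hypothetical helper, not SWORD's filter API, and it assumes the text passed in is already HTML-escaped.<br>

```cpp
#include <string>

// Hypothetical helper (not SWORD's filter API): wrap already-escaped text in
// the generic HTML suggested above for an OSIS element and its type attribute.
std::string renderGeneric(const std::string& element,
                          const std::string& type,
                          const std::string& text) {
    if (element == "title")
        return "<h1>" + text + "</h1>";
    if (element == "transChange" && type == "added")
        return "<span class=\"tcAdded\">" + text + "</span>";
    return text;  // elements outside this sketch pass through unchanged
}
```

A frontend that wants the old look could then restore italics with a single CSS rule on the tcAdded class, while other frontends remain free to style it differently.<br>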
><br>
> I think this is our pretty well accepted premise. The current filters<br>
> stink to various degrees and currently no one is willing to step up<br>
> and tackle them.<br>
><br>
>><br>
>><br>
>> Now, as to the other issue of using XSLT internally in the engine to<br>
>> process OSIS -> HTML<br>
>><br>
>> I will throw a few melons into the air for target practice, and let the<br>
>> shooting commence.<br>
>><br>
>> _____________________________<br>
>> *Multiple Language*<br>
>><br>
>> XSLT is a programming language in the same sense that C++ is a<br>
>> programming language.<br>
>><br>
>> The SWORD Project C++ engine is written in C++. It is not a Python<br>
>> engine; it is not a Perl engine; it is not a Java engine; it is C++.<br>
>><br>
>> One might say, "Well, you can use XSLT from C++. Doesn't JSword do this<br>
>> from Java?" Well, yes, of course you can, and DM can comment, if he<br>
>> feels the desire to recommend his decision to incorporate an XSLT engine<br>
>> into the JSword logic flow. But simply because one CAN doesn't mean one<br>
>> SHOULD. We COULD incorporate a Perl text processing engine in our C++<br>
>> code, or an Awk processing engine... that doesn't mean we SHOULD. I'm<br>
>> sure some would say we SHOULD. And obviously DM has thought he SHOULD<br>
>> incorporate XSLT processing for JSword, so I'm not intending to say it<br>
>> is a BAD decision, just that it is not a decision I would make; in the<br>
>> same way as our projects each chose C++ vs. Java to implement our objective.<br>
><br>
> If a developer is going to develop OSIS -> HTML filters, for instance,<br>
> we are already assuming they know OSIS and HTML. OSIS is XML and HTML<br>
> is SGML (though most of our work is probably targeting a more<br>
> XML-like dialect of HTML). XSLT is also XML. Formally, it is not even a<br>
> programming language, but just a set of formatting/processing<br>
> instructions in XML.<br>
><br>
> Any developer using XML who is worth their salt should at least be<br>
> familiar with the basics of XSL - they may not be a guru of XPath<br>
> expressions or have every attribute of XSL memorized - and would<br>
> probably expect a library which handles XML as its preferred input<br>
> method to utilize one of the standard XML processing methods. I know<br>
> I'm not the only person who was surprised to look in the library<br>
> filters and see neither DOM, SAX nor XSLT technologies in use. That<br>
> was when I first ran and hid.<br>
><br>
> Of course, this portion of the discussion is only relevant for the<br>
> from-OSIS filters.<br>
><br>
>><br>
>> _______________________<br>
>> *XSLT better than C++*<br>
>><br>
>> One might say, "well, XSLT is better suited to process XML than C++."<br>
>> That's a loaded and unquantified statement.<br>
>><br>
>> Certainly the C++ language specification doesn't include facilities to<br>
>> easily process XML, but that doesn't mean a plethora of C++ libraries<br>
>> don't exist for assisting in this task.<br>
>><br>
>> The SWORD engine includes classes like XMLTag and SWBasicFilter which<br>
>> implement a SAX processing model.<br>
>><br>
>> The current filters do not all use SWBasicFilter, nor XMLTag. They've<br>
>> been written over 15 years, and many predate these classes. Some<br>
>> are ugly and need to be rewritten for readability, certainly. But not<br>
>> necessarily in a different programming language.<br>
><br>
> XSLT being "better" is, yes, a matter of complete subjectivity. And,<br>
> as I mentioned above, is only useful when our source is XML to begin<br>
> with. For GBF or Plaintext sources, XSLT is clearly not even<br>
> applicable.<br>
><br>
> But the current C++ is so good that you seem the only person willing<br>
> to touch it. Peter just mentioned he tried once and couldn't get it.<br>
> I have gone into the filters before with a singular goal in mind and<br>
> was able to produce my desired changes, but it was long, drawn-out and<br>
> painful. Doing the same tasks in XSL would have taken me mere<br>
> seconds. I know a few other people, at least, have said they would<br>
> know how to do a task if XSLT was used instead of C++. Of course,<br>
> that is a hypothetical - I can't know that they would have done so,<br>
> but that was their claim at the time.<br>
><br>
> Our recent discussion about the use of the "n" attribute for footnotes<br>
> in ThML is a perfect example. Maintaining the attribute in XSL would<br>
> have been a trivial task I could have handled in seconds. Instead, it<br>
> required you, myself and Karl and took about 10 days to get fixed.<br>
> You had to alert Karl and me to the presence of the attributes, I provided<br>
> him a preliminary patch to incorporate the values, then he had to<br>
> heavily modify the patch to operate correctly in non-ThML source and a<br>
> few other corner cases. And, in the end, the fix is only in Xiphos'<br>
> code base - I would have to go through 2 of those three steps again in<br>
> Bibletime, BPBible, MacSword and any other applications I wanted to<br>
> see proper behavior in. Alternatively I could tackle the filters -<br>
> but I'm not really inclined to do so.<br>
><br>
> Is XSLT "better"? For me, it would be better because I could more<br>
> easily modify its behavior based on the fact that I know XML and could<br>
> easily locate the necessary processing directive. For you, maybe not.<br>
> Are there things you simply cannot do in XSL that C++ can? Yes. IMO<br>
> the benefits of XSL outweigh the benefits of C++ for this task, but<br>
> you clearly disagree. :) I would also say that DOM or SAX processing<br>
> would be better for all the same reasons - it shields the user from<br>
> having to see the XML parsing and handle inconsistencies in<br>
> whitespace, validation, etc and is still a decently well-known<br>
> technology among XML users (even if it's slightly less well-known than<br>
> XSL). And with a DOM or SAX parser, you could still happily employ<br>
> the full power of C++.<br>
><br>
>><br>
>> ________________________<br>
>> *COMPLEXITY*<br>
>><br>
>> The task of enumerating all types of OSIS <title> tags, and deciding<br>
>> what to do with each, and how to classify all <title> tags from all<br>
>> possible OSIS documents into our enumeration is still going to be a<br>
>> complex task using XSLT. <title> is a complex example, but certainly<br>
>> not the most complex.<br>
>><br>
>> It is a tall task to generalize all elements of all documents from all<br>
>> publishers into one conceptual model with one chosen output for a<br>
>> frontend-- whether that be for an audience on the Desktop, web-based, or<br>
>> a handheld.<br>
>><br>
>> The complex processing required by the engine will require long, complex<br>
>> XSLT-- which likely will incorporate callbacks to C++. It will not be<br>
>> more simple-- only mixed language.<br>
><br>
> I could also argue that the XSL would not require a developer to<br>
> mentally filter out the code that just identifies and locates XML<br>
> elements and attributes and parses them from the code that transforms<br>
> them and generates the output. Thus yes, it might include some<br>
> extension functions into C++ but it would be simpler. And it would<br>
> also be more expressive.<br>
><br>
> The enumeration of every OSIS <title> tag is a moot point for the<br>
> decision. You need to enumerate them all in C++ as well and decide<br>
> what to do with them. That doesn't change in the XSL - just the<br>
> method used. An XSL match along the lines of <xsl:template<br>
> match="title[@type='psalm']"> still has to be done in C++ with some sort<br>
> of if(tag.name() == "title" && tag.attr("type") == "psalm") or whatever<br>
> the syntax is. And that is assuming the current filter is using<br>
> XMLTag and isn't comparing character strings directly.<br>
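<br>
To make the point concrete, here is a rough C++ sketch of the same enumeration written as a rule table, so that each rule reads much like a template match. Tag, Rule, attr, openingHtml and kRules are stand-ins invented for this example (SWORD's real XMLTag class is not used), and the HTML in the rules is made up.<br>

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for a parsed tag; SWORD's real XMLTag class is not used here.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attrs;
};

// One rule: element name, required "type" value ("" = any), HTML to emit.
struct Rule {
    std::string name;
    std::string type;
    std::string html;
};

// Return the attribute value, or "" when the attribute is absent.
std::string attr(const Tag& tag, const std::string& key) {
    auto it = tag.attrs.find(key);
    return it == tag.attrs.end() ? "" : it->second;
}

// First matching rule wins, so specific rules must precede general ones.
std::string openingHtml(const Tag& tag, const std::vector<Rule>& rules) {
    for (const Rule& r : rules) {
        if (r.name == tag.name &&
            (r.type.empty() || r.type == attr(tag, "type")))
            return r.html;
    }
    return "";  // no rule matched: emit nothing
}

// Hypothetical rule set; the class name and heading levels are invented.
const std::vector<Rule> kRules = {
    {"title", "psalm", "<h3 class=\"psalm\">"},
    {"title", "",      "<h1>"},
};
```

Either way, someone still has to decide what goes in the table. The sketch only changes where the matching logic lives.<br>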
><br>
>> _______________________<br>
>> *Semantic vs. Display*<br>
>><br>
>> Some will say (and have), "well, let everything be display oriented and<br>
>> let the publisher decide". Fine, then you lose 2 things: the ability to<br>
>> display differently per user preference, per display device; and you<br>
>> also give up the promise to actually do any interesting research on the<br>
>> text. When you lose semantic markup, then you lose all interesting<br>
>> information about WHAT is being marked up.<br>
><br>
> I just want to be clear that I'm not advocating the use of display<br>
> over semantics as a general choice. My statements are strictly based<br>
> around my specific task and the fact that OSIS support in SWORD and<br>
> the front ends is not as good as the support of ThML. Largely this is<br>
> because most applications display in HTML and my required task is<br>
> framed entirely in terms of the presentation and display - not the<br>
> semantics. I would love and prefer to use OSIS for this task, but I<br>
> simply cannot accomplish it with the state of SWORD at this time.<br>
><br>
>><br>
>> _______________________<br>
>> *More than a Rendering Engine*<br>
>><br>
>> The SWORD C++ Engine is more than simply a text rendering engine-- it is<br>
>> a Biblical text research engine.<br>
>><br>
>> If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU<br>
>> Greek text, the entire program to do such is:<br>
>><br>
>> SWMgr library;<br>
>> SWModule *whnu = library.getModule("WHNU");<br>
>> whnu->setKey("2th.2.13");<br>
>> whnu->RenderText();<br>
>><br>
>> cout << "The morphology of word three is: " <<<br>
>> whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;<br>
>><br>
>><br>
>> That reads nice (at least in my opinion). I don't need to know about<br>
>> XML, XSLT, care what markup the WHNU module uses, I don't even have to<br>
>> know how to make a SWORD filter. The current filters do all the work of<br>
>> breaking out these attributes and making them available in a nice and<br>
>> interesting map.<br>
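<br>
The snippet above indexes three levels deep: attribute type, entry number, attribute name. As a sketch only, the helper below does the same lookup on a plain std::map stand-in. AttributeMap and lookup are hypothetical names, and SWORD's own container types are not used. It assumes the real container behaves like std::map, where operator[] inserts an empty entry for any missing key.<br>

```cpp
#include <map>
#include <string>

// Stand-in for the three-level shape used in the snippet above
// (attribute type -> entry number -> attribute name -> value).
using AttributeMap =
    std::map<std::string,
             std::map<std::string, std::map<std::string, std::string>>>;

// Look up one value with find() instead of operator[], so that asking
// about a missing word does not add empty entries to the map.
std::string lookup(const AttributeMap& attrs, const std::string& type,
                   const std::string& entry, const std::string& name) {
    auto t = attrs.find(type);
    if (t == attrs.end()) return "";
    auto e = t->second.find(entry);
    if (e == t->second.end()) return "";
    auto n = e->second.find(name);
    return n == e->second.end() ? "" : n->second;
}
```

With this shape, lookup(attrs, "Word", "003", "Morph") mirrors the chained indexing in the example, and a missing word simply yields an empty string.<br>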
><br>
> I'd like to be clear again, that XSL would only be useful for material<br>
> already in OSIS formats (or in valid ThML - I think TEI is also an XML<br>
> format?). I doubt many modules in ThML are strictly valid at their<br>
> import times, so XSL wouldn't be very useful, and GBF is a monster<br>
> unto itself. Doing the above in XSL from an OSIS source would not be<br>
> much different in complexity than what you have listed there.<br>
><br>
> <xsl:template match="verse[@osisID='2thes.2.13']/w[@n=3]"><br>
> The morphology of word three is: <xsl:value-of select="@morph" /><br>
> </xsl:template><br>
><br>
> Or something similar (my knowledge of exact OSIS attribute names and<br>
> values wanes and it's been two or three weeks since I wrote an XPath<br>
> expression).<br>
><br>
> Of course, the string processing portion of SWORD would continue to be<br>
> of great importance for any modules in GBF format or similar to bring<br>
> them into a useful form. In that way, SWORD would continue to be more<br>
> than just a text rendering engine. It would continue to offer all of<br>
> its features, its buffering from the system and from the format, its<br>
> indexing, its module fetching and storing, etc.<br>
><br>
>> ______________________<br>
>><br>
>><br>
>> And finally, if bullets aren't flying already, I'll stir the heat up with...<br>
>><br>
>> XSLT sucks. A good C++ programmer can do anything in C++ better than<br>
>> any XSLT programmer.<br>
>><br>
>><br>
>> :)<br>
><br>
> A C++ programmer can definitely do more, since C++ is actually a<br>
> programming language and XSLT is a set of processing instructions.<br>
> Better? That depends on what the criteria are. For me, in my current<br>
> role as a module creator, the use of C++ is not currently better<br>
> because it is less flexible and extensible. For you, as the library<br>
> maintainer, perhaps C++ is better because it's what you are already<br>
> comfortable with and because it has largely been your hand in the<br>
> filters.<br>
><br>
>><br>
>> *duck*<br>
>> Have fun.<br>
>><br>
>> Troy<br>
>><br>
>> PS. In summary, I understand the current filters are sometimes overly<br>
>> complex and need cleanup, standardization, etc. It comes down to the<br>
>> fact that they mostly work, and other things which don't work get priority,<br>
>> so the filters don't get much attention. But honestly, I think one might be<br>
>> oversimplifying the problem at hand without realizing it, if one simply<br>
>> thinks switching to XSLT will make things easier.<br>
><br>
> I think one is also oversimplifying the options. My dreamlist is that<br>
> SWORD produce a well-formed, valid, complete OSIS document for an<br>
> arbitrary KeyList that I pass it with FMT_OSIS set. That basically<br>
> boils down to getting the *OSIS filters up to snuff and standardized.<br>
> The second item on the list is a readily extensible mechanism for<br>
> SWORD outputting HTML from that OSIS. Whether that choice is providing an<br>
> XSL stylesheet with the library, a C++ SAX processor that a front-end<br>
> can readily extend, or a DOM interface that can be easily customized is<br>
> immaterial to me. I like all three of those, and can easily<br>
> understand and extend all of them.<br>
><br>
> I think any of those technologies would be an improvement over all<br>
> in-house C++ for the second half of any such processing. If we are<br>
> using XML in Open Source Software, let's leverage the work of others<br>
> who have happily given us permission to use their libraries!<br>
><br>
> --Greg<br>
><br>
> _______________________________________________<br>
> sword-devel mailing list: <a href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a><br>
> <a href="http://www.crosswire.org/mailman/listinfo/sword-devel" target="_blank">http://www.crosswire.org/mailman/listinfo/sword-devel</a><br>
> Instructions to unsubscribe/change your settings at above page<br>
<br>
<br>
</div></div></blockquote></div><br></div>