[sword-svn] r89 - in trunk: . modules modules/calvinscommentaries python python/swordutils python/swordutils/xml

lukeplant at www.crosswire.org
Thu Jul 19 15:51:33 MST 2007


Author: lukeplant
Date: 2007-07-19 15:51:32 -0700 (Thu, 19 Jul 2007)
New Revision: 89

Added:
   trunk/modules/calvinscommentaries/
   trunk/modules/calvinscommentaries/README
   trunk/modules/calvinscommentaries/calvinscommentaries.conf
   trunk/modules/calvinscommentaries/combine_calcom.py
   trunk/python/
   trunk/python/swordutils/
   trunk/python/swordutils/__init__.py
   trunk/python/swordutils/xml/
   trunk/python/swordutils/xml/__init__.py
   trunk/python/swordutils/xml/combine.py
   trunk/python/swordutils/xml/thml.py
   trunk/python/swordutils/xml/utils.py
Log:
Added a Python library of various tools for making modules, and a
specific script for creating a combined Calvin's Commentaries module


Added: trunk/modules/calvinscommentaries/README
===================================================================
--- trunk/modules/calvinscommentaries/README	                        (rev 0)
+++ trunk/modules/calvinscommentaries/README	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,45 @@
+
+Conversion of Calvin's commentaries into OSIS format and a Sword module
+
+Requirements:
+-------------
+- ThML sources: calcom??.xml files, as downloaded from CCEL.
+  For convenience, a recent version of the files can be downloaded here:
+  http://lukeplant.me.uk/misc/sword/calcom_sources.tar.bz2
+  Extract this archive.
+- thml2osis.xslt from
+  http://crosswire.org/svn/sword-tools/trunk/thml2osis/xslt/
+- xsltproc, for applying the above stylesheet
+- Python, to run the script that combines the calcom??.xml files
+- Python swordutils library:
+  http://crosswire.org/svn/sword-tools/trunk/python
+  A checkout of this directory should be in your PYTHONPATH
+
+Make the module
+---------------
+
+$ ./combine_calcom.py calcom_sources/calcom??.xml
+(output stored in calvinscommentaries.thml)
+$ xsltproc --novalid path/to/thml2osis.xslt calvinscommentaries.thml > calvinscommentaries.osis
+
+TODO
+- convert OSIS commentary to Sword module
+
+Explanation of these steps
+--------------------------
+1) 'Correct' some of the ThML files.  In particular, change the
+   'scripCom' tags so that they enclose the text they refer to,
+   rather than just come at the beginning of it.
+   This is done as part of combine_calcom.py
+
+2) Combine all the ThML files into one big one, and at the same time:
+   - modify the header information, using one of the calcom??.xml files
+     as a template
+   - make any corrections necessary to the ThML for the new context
+
+   Output: calvinscommentaries.thml
+
+3) Convert to OSIS, using thml2osis.xslt
+
+4) TODO - convert to Sword module.  The current osis2mod utility expects
+   commentaries to be marked up like Bibles.
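
Step 1 above can be illustrated with a small, self-contained sketch. It is
not the swordutils code: it assumes Python 3 and only the standard-library
xml.dom.minidom, and it handles just the flat case in which each marker-style
<scripCom> absorbs the siblings that follow it, up to the next <scripCom>.
The function name enclose_following_siblings and the sample markup are
hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample: two marker-style <scripCom/> elements in one div.
SAMPLE = (
    '<div2>'
    '<scripCom passage="Gen 1:1"/><p>In the beginning</p><p>more</p>'
    '<scripCom passage="Gen 1:2"/><p>And the earth</p>'
    '</div2>'
)

def enclose_following_siblings(parent):
    """Make each <scripCom> child of `parent` wrap the siblings that
    follow it, stopping at the next <scripCom>."""
    current = None
    # Iterate over a snapshot, because appendChild() mutates childNodes.
    for child in list(parent.childNodes):
        if child.nodeType == child.ELEMENT_NODE and child.nodeName == 'scripCom':
            current = child
        elif current is not None:
            # appendChild() detaches the node from its old parent first.
            current.appendChild(child)

doc = minidom.parseString(SAMPLE)
enclose_following_siblings(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

After the call, the <p> elements are children of the <scripCom> that
preceded them rather than its siblings.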

Added: trunk/modules/calvinscommentaries/calvinscommentaries.conf
===================================================================
--- trunk/modules/calvinscommentaries/calvinscommentaries.conf	                        (rev 0)
+++ trunk/modules/calvinscommentaries/calvinscommentaries.conf	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,17 @@
+[CalvinsCommentaries]
+DataPath=./modules/comments/zcom/calvinscommentaries/
+ModDrv=zCom
+BlockType=CHAPTER
+SourceType=OSIS
+CompressType=ZIP
+Lang=en
+Description=Calvin's Collected Commentaries
+About=John Calvin's commentaries on many books of the Bible, collected \
+into a single volume from material found at Christian Classics Ethereal Library \par \
+Converted to Sword module format by Luke Plant <L.Plant.98 at cantab.net>
+Version=1.0
+Encoding=UTF-8
+LCSH=Bible--Commentaries.
+DistributionLicense=Public Domain
+TextSource=http://www.ccel.org/
+MinimumVersion=1.5.2

Added: trunk/modules/calvinscommentaries/combine_calcom.py
===================================================================
--- trunk/modules/calvinscommentaries/combine_calcom.py	                        (rev 0)
+++ trunk/modules/calvinscommentaries/combine_calcom.py	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,78 @@
+#!/usr/bin/env python
+
+# Converts the source calcom??.xml files into a single
+# ThML file, with corrections made to allow it to be
+# used as a Sword module
+
+#------------------------------------------------------------
+# CONFIG
+
+PUBLISHERID = u"lukeplant.me.uk"
+
+#------------------------------------------------------------
+
+from xml.dom import minidom
+from xml import xpath
+from datetime import datetime
+from swordutils.xml import thml, utils
+from swordutils.xml.utils import RemoveNode, GeneralReplaceContents, ReplaceContents, do_replacements
+from swordutils.xml.combine import LazyNodes
+import sys
+
+
+now = datetime.now() # for general timestamping purposes
+
+
+def do_head_replacements(doc):
+    
+    corrections = {
+        "//DC.Title[@sub='Main']":        ReplaceContents(u"Calvin's Combined Commentaries"),
+        "//DC.Title[@sub='authTitle']":   RemoveNode(),
+        "//DC.Title[@sub='Alternative']": RemoveNode(),
+        "//printSourceInfo":              ReplaceContents(u"<published>Multiple printed works, Baker</published>"),
+        "//electronicEdInfo/bookID":      ReplaceContents(u"calvincommentaries"),
+        "//DC.Identifier":                RemoveNode(), # TODO - new identifier?
+        "//electronicEdInfo/editorialComments":
+          GeneralReplaceContents(lambda t: u"Multiple ThML files combined into single ThML file by a script.  Original editorial comments: " + t),
+        "//electronicEdInfo/revisionHistory":
+          GeneralReplaceContents(lambda t: unicode(now.strftime('%Y-%m-%d')) + u": Multiple ThML files combined into single ThML file by a script.  Original revision history: " + t),
+        "//electronicEdInfo/publisher": ReplaceContents(PUBLISHERID),
+
+    }
+    do_replacements(doc, corrections)
+
+def do_body_corrections(doc):
+    # Correct <scripCom>
+    rootNode = utils.getRoot(doc)
+    thml.expandScripComNodes(rootNode)
+    # Other corrections
+    corrections = {
+        # id attributes can now contain duplicates due to combination
+        # of multiple files, so we remove them all.
+        "//@id": RemoveNode(),
+
+    }
+    do_replacements(doc, corrections)
+
+def combine(templatefile, allfiles):
+    # Get the main one
+    templatexml = minidom.parse(templatefile)
+    mainBody = utils.getNodesFromXPath(templatexml, '//ThML.body')[0]
+    mainBody.childNodes = []
+    do_head_replacements(templatexml)
+    # The following childNodes will be lazily evaluated as
+    # templatexml.writexml iterates over them
+    mainBody.childNodes = LazyNodes(templatexml, allfiles, do_body_corrections, '//ThML.body')
+
+    fh = open('calvinscommentaries.thml', 'wb')
+    utils.writexml(templatexml, fh)
+    fh.close()
+
+def main(filenames):
+    combine(filenames[0], filenames)
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print "Usage: ./combine_calcom.py filename.xml [filename2.xml ...]"
+        sys.exit(1)
+    main(sys.argv[1:])


Property changes on: trunk/modules/calvinscommentaries/combine_calcom.py
___________________________________________________________________
Name: svn:executable
   + *
Name: svn:eol-style
   + native

Added: trunk/python/swordutils/__init__.py
===================================================================


Property changes on: trunk/python/swordutils/__init__.py
___________________________________________________________________
Name: svn:eol-style
   + native

Added: trunk/python/swordutils/xml/__init__.py
===================================================================


Property changes on: trunk/python/swordutils/xml/__init__.py
___________________________________________________________________
Name: svn:eol-style
   + native

Added: trunk/python/swordutils/xml/combine.py
===================================================================
--- trunk/python/swordutils/xml/combine.py	                        (rev 0)
+++ trunk/python/swordutils/xml/combine.py	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,29 @@
+# Utilities for combining multiple module source files
+# into one.
+
+from xml.dom import minidom
+from swordutils.xml import utils
+
+class LazyNodes(object):
+    # Pulling all the documents in at once uses up too much memory.
+    # This class acts as a replacement 'childNodes' sequence: it
+    # loads the documents one at a time, applies corrections to
+    # each, and yields the child nodes of its body element.
+    def __init__(self, maindoc, files, alterationfunc, nodepath):
+        self.maindoc = maindoc # Don't actually need this
+        self.files = files
+        self.iterated_count = 0
+        self.nodepath = nodepath
+        self.alterationfunc = alterationfunc
+
+    def __iter__(self):
+        self.iterated_count += 1
+        if self.iterated_count > 1:
+            # We've got a big performance bug if this happens.
+            raise Exception('Performance bug: LazyNodes iterated more than once')
+        for f in self.files:
+            doc = minidom.parse(f)
+            self.alterationfunc(doc)     
+            body = utils.getNodesFromXPath(doc, self.nodepath)[0]
+            for n in body.childNodes:
+                yield n


Property changes on: trunk/python/swordutils/xml/combine.py
___________________________________________________________________
Name: svn:eol-style
   + native
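
The one-document-at-a-time idea behind LazyNodes can be sketched with a
plain generator. This is an illustration only, assuming Python 3 and the
standard-library xml.dom.minidom; it parses in-memory strings rather than
files, and the names iter_body_children, strip_ids and SOURCES are
hypothetical.

```python
from xml.dom import minidom

# Hypothetical stand-ins for the calcom??.xml source files.
SOURCES = [
    '<ThML><ThML.body><p id="a">one</p></ThML.body></ThML>',
    '<ThML><ThML.body><p id="a">two</p><p>three</p></ThML.body></ThML>',
]

def iter_body_children(sources, alter, tag='ThML.body'):
    """Yield the children of each document's body element, parsing
    the next document only when the previous one is exhausted."""
    for text in sources:
        doc = minidom.parseString(text)
        alter(doc)  # per-document corrections, applied before yielding
        body = doc.getElementsByTagName(tag)[0]
        for node in body.childNodes:
            yield node

def strip_ids(doc):
    # Hypothetical correction: drop id attributes, which may collide
    # once several documents are merged into one.
    for el in doc.getElementsByTagName('*'):
        if el.hasAttribute('id'):
            el.removeAttribute('id')

combined = ''.join(n.toxml() for n in iter_body_children(SOURCES, strip_ids))
print(combined)
```

Because the generator is consumed as the output is serialized, a second
pass over it would re-parse every source, which is the situation the
iteration counter in LazyNodes guards against.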

Added: trunk/python/swordutils/xml/thml.py
===================================================================
--- trunk/python/swordutils/xml/thml.py	                        (rev 0)
+++ trunk/python/swordutils/xml/thml.py	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,87 @@
+# Utility functions for manipulating ThML
+
+from xml.dom import minidom
+from swordutils.xml import utils
+
+    
+def isScripCom(node):
+    return node.nodeName == u'scripCom'
+
+def findParentDiv(node):
+    pnode = node.parentNode
+    if pnode is None:
+        raise Exception("Cannot find parent div for node %r" % node)
+    if pnode.nodeType == minidom.Document.ELEMENT_NODE \
+        and pnode.nodeName.startswith(u'div'):
+        return pnode
+    else:
+        return findParentDiv(pnode)    
+
+def moveToParent(node, destParent):
+    if node.parentNode is destParent:
+        return
+    else:
+        pnode = node.parentNode
+        pnode.removeChild(node)
+        pnode.parentNode.insertBefore(node, pnode)
+        return moveToParent(node, destParent)
+
+def _findNextScripComNode(node, return_parent):
+    if node is None:
+        return None
+    if isScripCom(node):
+        if return_parent:
+            return node.parentNode
+        else:
+            return node
+        
+    else:
+        # Search deeper, but return node that is on the
+        # same level as our original node
+        descendent = _findNextScripComNode(node.firstChild, True)
+        if descendent is not None:
+            if return_parent:
+                return descendent.parentNode
+            else:
+                return descendent
+        else:
+            return _findNextScripComNode(node.nextSibling, False)
+
+def _expandScripComNode(scNode):
+    nextSCN = _findNextScripComNode(scNode.nextSibling, False)
+    collection = []
+    n = scNode.nextSibling
+    while (n is not None and n is not nextSCN):
+        collection.append(n)
+        n = n.nextSibling
+    for n in collection:
+        n.parentNode.removeChild(n)
+        scNode.appendChild(n)        
+
+def expandScripComNodes(node):
+    """Expands all empty <scripCom> nodes so that they contain
+       the nodes that they refer to, using neighboring <scripCom>
+       nodes and the structure of the XML as a guide,
+       starting at the supplied node"""
+
+    if isScripCom(node):
+        # Often placed as markers instead of enclosing
+        # the nodes to which they apply.
+        if not node.hasChildNodes():  # an element's nodeValue is always None in minidom
+            # Try to find scope over which the <scripCom> element
+            # should actually be placed.
+            # Rules:
+            #  - move the scripCom element 'up' the tree until it is
+            #    a direct child of a `divX' node, placing it before
+            #    each of its former parent nodes along the way
+            #  - make all its sibling nodes that are below it
+            #    into child nodes, up to the point where there
+            #    is another <scripCom> element
+            div = findParentDiv(node)
+            moveToParent(node, div)
+            _expandScripComNode(node)
+            
+    if node.childNodes.length > 0:
+        for n in node.childNodes:
+            expandScripComNodes(n)
+


Property changes on: trunk/python/swordutils/xml/thml.py
___________________________________________________________________
Name: svn:eol-style
   + native
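
The 'move up the tree' rule used by expandScripComNodes can be shown in
isolation. The following is a minimal sketch, assuming Python 3 and the
standard-library xml.dom.minidom; it uses loops where thml.py uses
recursion, and the names find_parent_div and hoist_to and the sample
markup are hypothetical.

```python
from xml.dom import minidom

def find_parent_div(node):
    """Return the nearest ancestor element whose name starts with 'div'."""
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.nodeName.startswith('div'):
            return parent
        parent = parent.parentNode
    raise ValueError('no enclosing div element for %r' % (node,))

def hoist_to(node, dest_parent):
    """Move `node` upward until it is a direct child of `dest_parent`,
    placing it just before each ancestor it leaves."""
    while node.parentNode is not dest_parent:
        old_parent = node.parentNode
        # insertBefore() detaches `node` from old_parent first.
        old_parent.parentNode.insertBefore(node, old_parent)

doc = minidom.parseString('<div1><p><b><scripCom/>text</b></p></div1>')
marker = doc.getElementsByTagName('scripCom')[0]
hoist_to(marker, find_parent_div(marker))
hoisted = doc.documentElement.toxml()
print(hoisted)
```

Once the marker sits directly under the div, the siblings that follow it
can be gathered into it as a separate step.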

Added: trunk/python/swordutils/xml/utils.py
===================================================================
--- trunk/python/swordutils/xml/utils.py	                        (rev 0)
+++ trunk/python/swordutils/xml/utils.py	2007-07-19 22:51:32 UTC (rev 89)
@@ -0,0 +1,65 @@
+# General XML utilities
+
+from xml.dom import minidom
+from xml import xpath
+import codecs
+
+def getFileWriter(fileHandle):
+    """Gets a 'writer' for a file object that encodes
+    as UTF-8"""
+    return codecs.lookup("UTF-8").streamwriter(fileHandle)
+
+def writexml(doc, fileHandle):
+    """Writes an XML document to a file handle"""
+    doc.writexml(getFileWriter(fileHandle), encoding="UTF-8")
+
+def getNodesFromXPath(document, path):
+    """Selects nodes specified by 'path' from 'document',
+    where path is a string or a compiled xpath object"""
+    if isinstance(path, basestring):
+        path = xpath.Compile(path)
+    return path.select(xpath.CreateContext(document))
+
+_rootxpath = xpath.Compile('/')
+def getRoot(doc):
+    """Returns the root node of a document"""
+    return getNodesFromXPath(doc, _rootxpath)[0]
+
+
+# Classes to help us with modifications
+class RemoveNode:
+    def act(self, node):
+        if isinstance(node, minidom.Attr):
+            node.ownerElement.removeAttribute(node.name)
+        else:
+            node.parentNode.removeChild(node)
+
+class GeneralReplaceContents:
+    """Replace the contents of a node, using a caller-supplied
+    function to calculate the replacement text
+    """
+    def __init__(self, replacefunc):
+        self.replacefunc = replacefunc
+    def act(self, node):
+        origText = u''.join(c.toxml() for c in node.childNodes)
+
+        # Usually replacefunc will just return text,
+        # but we allow it to return xml as well
+        newNodes = minidom.parseString(u'<dummy>' + self.replacefunc(origText) + u'</dummy>' )
+        # newNodes is a DOM Document whose root is a dummy
+        # element wrapping the nodes we actually want.
+        node.childNodes = newNodes.childNodes[0].childNodes
+
+class ReplaceContents(GeneralReplaceContents):
+    def __init__(self, replacementtext):
+        assert isinstance(replacementtext, unicode)
+        def _replacefunc(text):
+            return replacementtext
+        self.replacefunc = _replacefunc
+
+def do_replacements(doc, replacements):
+    ctx = xpath.CreateContext(doc)
+    for path, action in replacements.items():
+        xp = xpath.Compile(path)
+        for n in xp.select(ctx):
+            action.act(n)


Property changes on: trunk/python/swordutils/xml/utils.py
___________________________________________________________________
Name: svn:eol-style
   + native
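
The table-of-actions pattern used by do_replacements can be tried without
xml.xpath, which is not part of the Python 3 standard library. The sketch
below is an assumption-laden simplification: it targets Python 3, selects
elements by tag name instead of by XPath expression, and its ReplaceText
class is a hypothetical stand-in that only inserts plain text.

```python
from xml.dom import minidom

class RemoveNode:
    def act(self, node):
        node.parentNode.removeChild(node)

class ReplaceText:
    # Hypothetical, text-only counterpart of ReplaceContents.
    def __init__(self, text):
        self.text = text

    def act(self, node):
        # Drop the existing children, then add a single text node.
        while node.firstChild is not None:
            node.removeChild(node.firstChild)
        node.appendChild(node.ownerDocument.createTextNode(self.text))

def do_replacements(doc, replacements):
    """Apply each action to every element with the given tag name."""
    for tag, action in replacements.items():
        # Snapshot the matches so that removals do not disturb the loop.
        for node in list(doc.getElementsByTagName(tag)):
            action.act(node)

doc = minidom.parseString(
    '<head><bookID>calcom01</bookID><DC.Identifier>x</DC.Identifier></head>'
)
do_replacements(doc, {
    'bookID': ReplaceText('combined'),
    'DC.Identifier': RemoveNode(),
})
edited = doc.documentElement.toxml()
print(edited)
```

Keeping each correction as an object with an act() method means a new kind
of edit can be added to the table without changing the loop that applies it.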



