org.crosswire.common.xml
Class XMLUtil

java.lang.Object
  extended by org.crosswire.common.xml.XMLUtil

public final class XMLUtil
extends Object

Utilities for working with SAX XML parsing.

Author:
Joe Walker [joe at eireneh dot com], DM Smith [dmsmith555 at yahoo dot com]
See Also:
for license details.
The copyright to this program is held by its authors.

Field Summary
private static Map badEntities
           
private static Set goodEntities
           
private static Pattern invalidCharacterPattern
          Pattern that negates the allowable XML 4 byte unicode characters.
private static Logger log
          The log stream
private static Pattern validCharacterEntityPattern
          Pattern for numeric entities.
 
Constructor Summary
private XMLUtil()
          Prevent instantiation
 
Method Summary
static String cleanAllCharacters(String broken)
          Remove all invalid characters in the input.
static String cleanAllEntities(String broken)
          For each entity in the input that is not allowed in XML, replace the entity with its unicode equivalent or remove it.
static String cleanAllTags(String broken)
          XML parse failed, so we can try getting rid of all the tags and having another go.
static void debugSAXAttributes(Attributes attrs)
          Show the attributes of an element as debug
static String escape(String s)
          Normalizes the given string
static org.jdom.Document getDocument(String subject)
          Get and load an XML file from the classpath and a few other places into a JDOM Document object.
private static String handleEntity(String entity)
          Replace entity with its unicode equivalent, if it is not a valid XML entity.
static String writeToString(SAXEventProvider provider)
          Serialize a SAXEventProvider into an XML String
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

goodEntities

private static Set goodEntities

badEntities

private static Map badEntities

log

private static final Logger log
The log stream


validCharacterEntityPattern

private static Pattern validCharacterEntityPattern
Pattern for numeric entities.


invalidCharacterPattern

private static Pattern invalidCharacterPattern
Pattern that negates the allowable XML 4 byte unicode characters. Valid are: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Constructor Detail

XMLUtil

private XMLUtil()
Prevent instantiation

Method Detail

getDocument

public static org.jdom.Document getDocument(String subject)
                                     throws org.jdom.JDOMException,
                                            IOException
Get and load an XML file from the classpath and a few other places into a JDOM Document object.

Parameters:
subject - The name of the desired resource (without any extension)
Returns:
The requested resource
Throws:
IOException - if there is a problem reading the file
org.jdom.JDOMException - If the resource is not valid XML

writeToString

public static String writeToString(SAXEventProvider provider)
                            throws SAXException
Serialize a SAXEventProvider into an XML String

Parameters:
provider - The source of SAX events
Returns:
a serialized string
Throws:
SAXException
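
One way to turn a stream of SAX events into a String is to feed them to an identity TransformerHandler from the JDK. The sketch below is not the XMLUtil implementation; the EventSource interface is a hypothetical stand-in for SAXEventProvider, and its method name is an assumption.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

final class SaxToStringSketch {
    private SaxToStringSketch() {
    }

    /** Hypothetical stand-in for SAXEventProvider: pushes its events into a handler. */
    interface EventSource {
        void provideSAXEvents(ContentHandler handler) throws SAXException;
    }

    /** Collects the events fired by the source and returns them as XML text. */
    static String writeToString(EventSource source) throws SAXException {
        try {
            // The JDK's default factory supports SAX, so the cast is expected to succeed.
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            // A handler created without a stylesheet copies its input unchanged.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            source.provideSAXEvents(handler);
            return out.toString();
        } catch (TransformerConfigurationException e) {
            // Report setup problems through the one declared exception type.
            throw new SAXException(e);
        }
    }
}
```

The source is expected to call startDocument() first and endDocument() last, so that the handler flushes its output to the writer.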

debugSAXAttributes

public static void debugSAXAttributes(Attributes attrs)
Show the attributes of an element as debug
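
A minimal sketch of what such a dump might look like, using only the standard org.xml.sax.Attributes accessors. It returns the text instead of writing to the class's Logger, and the name=value output format is an assumption.

```java
import org.xml.sax.Attributes;

final class AttributeDumpSketch {
    private AttributeDumpSketch() {
    }

    /** Formats each attribute as "qName=value", one per line. */
    static String describe(Attributes attrs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < attrs.getLength(); i++) {
            sb.append(attrs.getQName(i))
              .append('=')
              .append(attrs.getValue(i))
              .append('\n');
        }
        return sb.toString();
    }
}
```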


escape

public static String escape(String s)
Normalizes the given string
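
The description above does not say what normalization involves. The sketch below assumes it means replacing the four XML markup characters with the entities listed elsewhere on this page; the actual method may behave differently.

```java
final class EscapeSketch {
    private EscapeSketch() {
    }

    /** Replaces <, >, & and " with their XML entities (assumed behavior). */
    static String escape(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(ch);
            }
        }
        return out.toString();
    }
}
```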


cleanAllEntities

public static String cleanAllEntities(String broken)
For each entity in the input that is not allowed in XML, replace the entity with its unicode equivalent or remove it. For each instance of a bare &, replace it with &amp;.
XML only allows 4 entities: &amp;, &quot;, &lt; and &gt;.

Parameters:
broken - the string to handle entities
Returns:
the string with entities appropriately fixed up
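
The rules above can be sketched with a single regular expression. This is an illustration, not the XMLUtil source: the GOOD and BAD tables are hypothetical stand-ins for goodEntities and badEntities (only two replacements are listed), and passing numeric entities through unchanged is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityCleanerSketch {
    // The four entities this page lists as allowed.
    private static final Set<String> GOOD =
        new HashSet<>(Arrays.asList("amp", "quot", "lt", "gt"));

    // Hypothetical sample of named entities with a unicode equivalent.
    private static final Map<String, String> BAD = new HashMap<>();
    static {
        BAD.put("nbsp", "\u00A0");
        BAD.put("copy", "\u00A9");
    }

    // An ampersand, optionally followed by a numeric or named entity body and ';'.
    private static final Pattern ENTITY =
        Pattern.compile("&(?:(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)?");

    private EntityCleanerSketch() {
    }

    static String cleanAllEntities(String broken) {
        Matcher m = ENTITY.matcher(broken);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement;
            if (name == null) {
                replacement = "&amp;";                   // bare ampersand
            } else if (name.startsWith("#") || GOOD.contains(name)) {
                replacement = m.group();                 // keep as written
            } else {
                replacement = BAD.getOrDefault(name, ""); // substitute or drop
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With these tables, "Tom & Jerry" becomes "Tom &amp; Jerry", "&nbsp;" becomes a no-break space, and an unknown entity such as "&bogus;" is removed.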

cleanAllCharacters

public static String cleanAllCharacters(String broken)
Remove all invalid characters in the input. XML has stringent requirements as to which characters are or are not allowed. The set of allowable characters is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note: Java handles to \uFFFF

Parameters:
broken - the string to be cleaned
Returns:
the cleaned string
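
A negated character class is one way to express "everything outside the allowed set". The sketch below is an assumption about how such a pattern could be written, not the actual invalidCharacterPattern. It relies on java.util.regex matching by code point, so a valid surrogate pair is treated as one character above #xFFFF.

```java
import java.util.regex.Pattern;

final class CharacterCleanerSketch {
    // Matches any code point outside:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID = Pattern.compile(
        "[^\\t\\n\\r\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    private CharacterCleanerSketch() {
    }

    /** Removes every character that XML does not allow. */
    static String cleanAllCharacters(String broken) {
        if (broken == null) {
            return null;
        }
        return INVALID.matcher(broken).replaceAll("");
    }
}
```

Control characters such as \u0000 and \u0008 are dropped, while tab, newline, carriage return and supplementary characters are kept.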

cleanAllTags

public static String cleanAllTags(String broken)
XML parse failed, so we can try getting rid of all the tags and having another go. We define a tag to start at a < and end at the end of the next word (where a word is what comes in between spaces) that does not contain an = sign, or at a >, whichever is earlier.
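
The tag definition above can be read in more than one way. The sketch below takes one reading: the first word after < is the tag name, later words containing = are attributes, and the tag ends at a > or at the end of the first later word without an =. Removing the tag outright, rather than substituting something for it, is also an assumption.

```java
final class TagStripperSketch {
    private TagStripperSketch() {
    }

    /** Returns the input with everything that looks like a tag removed. */
    static String cleanAllTags(String broken) {
        if (broken == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(broken.length());
        int i = 0;
        while (i < broken.length()) {
            char c = broken.charAt(i);
            if (c == '<') {
                i = endOfTag(broken, i); // skip the whole tag
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Returns the index just past the tag that starts at position lt. */
    private static int endOfTag(String s, int lt) {
        int wordStart = -1; // -1 while still inside the tag name
        for (int i = lt + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '>') {
                return i + 1; // normal end of tag
            }
            if (c == ' ') {
                boolean nonEmptyWord = wordStart != -1 && i > wordStart;
                if (nonEmptyWord && s.substring(wordStart, i).indexOf('=') < 0) {
                    return i; // word is not an attribute, so the tag ends here
                }
                wordStart = i + 1;
            }
        }
        return s.length(); // unterminated tag runs to the end of the input
    }
}
```

Under this reading, "<a href=x>link</a>" reduces to "link", and the unterminated "<p broken" in "x <p broken text" is dropped along with the word that ends it.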


handleEntity

private static String handleEntity(String entity)
Replace entity with its unicode equivalent, if it is not a valid XML entity. Otherwise strip it out. XML only allows 4 entities: &amp;, &quot;, &lt; and &gt;.

Parameters:
entity - the entity to be replaced
Returns:
the substitution for the entity, either itself, the unicode equivalent or an empty string.

Copyright © 2003-2007