[jsword-devel] Segs, comparing texts and indexing

Wed Sep 26 12:47:23 MST 2012

Hi

I have a question about the *seg* element in OSIS. Versions like the WLC
wrap each part of a word in its own seg (see below), and then separate the
words (sets of segs) with spaces.

<div><title type="x-gen">Genesis 1:1</title><verse osisID="Gen.1.1"><seg
type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> <seg
type="x-morph">בָּרָ֣א</seg> <seg type="x-morph">אֱלֹהִ֑ים</seg> <seg
type="x-morph">אֵ֥ת</seg> <seg type="x-morph">הַ</seg><seg
type="x-morph">שָּׁמַ֖יִם</seg> <seg type="x-morph">וְ</seg><seg
type="x-morph">אֵ֥ת</seg> <seg type="x-morph">הָ</seg><seg
type="x-morph">אָֽרֶץ׃</seg> </verse></div>

The JSword compare functionality (and indexing) uses
OSISUtil.getCanonicalText(). This seems to add spaces between the seg
elements, which then makes for inconsistent results in the diffing (extra
differences). The following comment is seen in the method:
 // make sure that adjacent text elements are separated by whitespace
 // TODO(dms): verify that the xml parser does not split words containing
 // entities.

Presumably, we want to add an exception for *seg* elements. I assume
indexing/searching is also going to be affected by this problem...
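
To make the idea concrete, here is a minimal, self-contained sketch of the kind of exception I have in mind. It does not use JDOM or any JSword code: the Piece class, the fromSeg flag and the join() method are all hypothetical names invented for illustration, and the transliterated strings stand in for the Hebrew text. It only models the "pad adjacent text with a space" rule and shows how skipping the padding between two adjacent segs changes the output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not JSword code: each Piece is a run of text plus a
// flag saying whether it came from a <seg> element.
final class SegJoinSketch {

    static final class Piece {
        final String text;
        final boolean fromSeg;

        Piece(String text, boolean fromSeg) {
            this.text = text;
            this.fromSeg = fromSeg;
        }
    }

    // Joins the pieces using the same whitespace test as getCanonicalText().
    // When segAware is true, no space is inserted between two adjacent segs.
    static String join(List<Piece> pieces, boolean segAware) {
        StringBuilder buffer = new StringBuilder();
        boolean lastFromSeg = false;
        for (Piece p : pieces) {
            // Ignore empty text, as the original method does.
            if (p.text.length() == 0) {
                continue;
            }
            int lastIndex = buffer.length() - 1;
            boolean needsSpace = lastIndex >= 0
                    && !Character.isWhitespace(buffer.charAt(lastIndex))
                    && !Character.isWhitespace(p.text.charAt(0));
            boolean adjacentSegs = segAware && lastFromSeg && p.fromSeg;
            if (needsSpace && !adjacentSegs) {
                buffer.append(' ');
            }
            buffer.append(p.text);
            lastFromSeg = p.fromSeg;
        }
        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        // Transliterated stand-in for the first segs of Gen.1.1:
        // <seg>be</seg><seg>reshit</seg> <seg>bara</seg>
        List<Piece> pieces = Arrays.asList(
                new Piece("be", true),
                new Piece("reshit", true),
                new Piece(" ", false),
                new Piece("bara", true));
        System.out.println(join(pieces, false)); // current behaviour
        System.out.println(join(pieces, true));  // with the seg exception
    }
}
```

With the flag off, the two segs of the first word come out separated by a space; with it on, they are joined and only the real space from the source survives. Whether the real fix belongs in getCanonicalText() or in getCanonicalContent() is exactly what I am unsure about.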

My question is whether to add a new block in the instanceof Element part
or in the instanceof Text part.
Also, are there ever any cases where we want additional spaces between
segs? Any other gotchas?

(see below for JSword function)
Cheers
Chris

Copy of the function in JSword:

    public static String getCanonicalText(Element root) {
        StringBuilder buffer = new StringBuilder();

        // Dig past osis, osisText, if present, to get to the real content.
        List<Content> frag = OSISUtil.getFragment(root);

        Iterator<Content> dit = frag.iterator();
        String sID = null;
        Content data = null;
        Element ele = null;
        while (dit.hasNext()) {
            data = dit.next();
            if (data instanceof Element) {
                ele = (Element) data;
                if (!isCanonical(ele)) {
                    continue;
                }

                if (ele.getName().equals(OSISUtil.OSIS_ELEMENT_VERSE)) {
                    sID = ele.getAttributeValue(OSISUtil.OSIS_ATTR_SID);
                }

                if (sID != null) {
                    getCanonicalContent(ele, sID, dit, buffer);
                } else {
                    getCanonicalContent(ele, null, ele.getContent().iterator(), buffer);
                }
            } else if (data instanceof Text) {
                // make sure that adjacent text elements are separated by
                // whitespace
                // TODO(dms): verify that the xml parser does not split words
                // containing entities.
                int lastIndex = buffer.length() - 1;
                String text = ((Text) data).getText();
                // Ignore empty text nodes.
                if (text.length() != 0) {
                    if (lastIndex >= 0
                            && !Character.isWhitespace(buffer.charAt(lastIndex))
                            && !Character.isWhitespace(text.charAt(0))) {
                        buffer.append(' ');
                    }
                    buffer.append(text);
                }
            }
        }

        return buffer.toString().trim();
    }
