[jsword-devel] Segs, comparing texts and indexing
DM Smith
dmsmith at crosswire.org
Wed Sep 26 15:57:09 MST 2012
The other factor is that block elements might not be containers at all; they may be milestoned instead.
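
A minimal sketch of the block/inline idea from the quoted message below, with milestones taken into account. This is not the JSword code: `SpacingSketch`, `Run`, and the `BLOCK` set are hypothetical names, the list of block elements is an incomplete assumption, and the flat list of runs stands in for walking a real JDOM tree. A milestone is modeled as a block-level run with empty text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SpacingSketch {
    // Assumption: an illustrative, incomplete set of block-level OSIS element names.
    private static final Set<String> BLOCK =
            new HashSet<>(Arrays.asList("div", "p", "l", "lg", "title", "list", "item"));

    /** One run of text plus the name of the element it came from ("#text" for bare text). */
    static final class Run {
        final String element;
        final String text;

        Run(String element, String text) {
            this.element = element;
            this.text = text;
        }
    }

    static boolean isBlock(String elementName) {
        return BLOCK.contains(elementName);
    }

    /** Joins runs, adding a space only where a block boundary separates two non-space characters. */
    static String join(List<Run> runs) {
        StringBuilder buffer = new StringBuilder();
        boolean pendingBreak = false;
        for (Run run : runs) {
            boolean block = isBlock(run.element);
            if (block) {
                // Start of a block, whether it is a container or an empty milestone.
                pendingBreak = true;
            }
            if (run.text.isEmpty()) {
                continue;
            }
            int last = buffer.length() - 1;
            if (pendingBreak && last >= 0
                    && !Character.isWhitespace(buffer.charAt(last))
                    && !Character.isWhitespace(run.text.charAt(0))) {
                buffer.append(' ');
            }
            buffer.append(run.text);
            // Text that ended inside a block container implies a break before whatever follows.
            pendingBreak = block;
        }
        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("title", "Genesis 1:1"),
                new Run("seg", "In"),
                new Run("seg", "the"),
                new Run("#text", " "),
                new Run("seg", "beginning"));
        // Adjacent segs are not separated; the title is.
        System.out.println(join(runs));
    }
}
```

With this rule, adjacent seg runs are concatenated as-is, so only the spaces already present in the text separate words, while text that follows a block element or a block milestone still gets a separating space.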
On Sep 26, 2012, at 6:39 PM, DM Smith <dmsmith at crosswire.org> wrote:
> The routine is not very bright. The inserted space is meant to handle the case where text in one block element is followed by text, perhaps in another block element.
>
> A block element implies a newline. The seg element is an inline element and does not imply added spacing. However, it is only an issue if elements split words.
>
> So the fix is to categorize each element as block or inline and to put the extra space only where it is needed.
>
> Need to verify how ruby is handled in OSIS. Otherwise changing this may break that.
>
> Please file a bug.
>
> In Him,
> DM
>
> On Sep 26, 2012, at 3:47 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>
>> Hi
>>
>> I have a question about the seg element in OSIS. Versions like the WLC have separated each part of the word as a seg (see below), and then separate the words (sets of segs) with spaces.
>>
>> <div><title type="x-gen">Genesis 1:1</title><verse osisID="Gen.1.1"><seg type="x-morph">בְּ</seg><seg type="x-morph">רֵאשִׁ֖ית</seg> <seg type="x-morph">בָּרָ֣א</seg> <seg type="x-morph">אֱלֹהִ֑ים</seg> <seg type="x-morph">אֵ֥ת</seg> <seg type="x-morph">הַ</seg><seg type="x-morph">שָּׁמַ֖יִם</seg> <seg type="x-morph">וְ</seg><seg type="x-morph">אֵ֥ת</seg> <seg type="x-morph">הָ</seg><seg type="x-morph">אָֽרֶץ׃</seg> </verse></div>
>>
>> The JSword compare functionality (and indexing) uses OSISUtil.getCanonicalText(). This seems to add spaces between the seg elements, which then produces inconsistent results when diffing (extra differences). The following comment is seen in the method:
>> // make sure that adjacent text elements are separated by whitespace
>> // TODO(dms): verify that the xml parser does not split words containing entities.
>>
>> Presumably, we want to add an exception for seg elements. I assume indexing/searching is also going to be affected by this problem...
>>
>> My question is whether to add a new block in the instanceof Element part or in the instanceof Text part.
>> Also, are there ever other cases where we want additional spaces between segs? Any other gotchas?
>>
>> (see below for JSword function)
>> Cheers
>> Chris
>>
>> Copy of the function in JSword:
>>
>> public static String getCanonicalText(Element root) {
>>     StringBuilder buffer = new StringBuilder();
>>
>>     // Dig past osis, osisText, if present, to get to the real content.
>>     List<Content> frag = OSISUtil.getFragment(root);
>>
>>     Iterator<Content> dit = frag.iterator();
>>     String sID = null;
>>     Content data = null;
>>     Element ele = null;
>>     while (dit.hasNext()) {
>>         data = dit.next();
>>         if (data instanceof Element) {
>>             ele = (Element) data;
>>             if (!isCanonical(ele)) {
>>                 continue;
>>             }
>>
>>             if (ele.getName().equals(OSISUtil.OSIS_ELEMENT_VERSE)) {
>>                 sID = ele.getAttributeValue(OSISUtil.OSIS_ATTR_SID);
>>             }
>>
>>             if (sID != null) {
>>                 getCanonicalContent(ele, sID, dit, buffer);
>>             } else {
>>                 getCanonicalContent(ele, null, ele.getContent().iterator(), buffer);
>>             }
>>         } else if (data instanceof Text) {
>>             // make sure that adjacent text elements are separated by
>>             // whitespace
>>             // TODO(dms): verify that the xml parser does not split words
>>             // containing entities.
>>             int lastIndex = buffer.length() - 1;
>>             String text = ((Text) data).getText();
>>             // Ignore empty text nodes.
>>             if (text.length() != 0) {
>>                 if (lastIndex >= 0 && !Character.isWhitespace(buffer.charAt(lastIndex)) && !Character.isWhitespace(text.charAt(0))) {
>>                     buffer.append(' ');
>>                 }
>>                 buffer.append(text);
>>             }
>>         }
>>     }
>>
>>     return buffer.toString().trim();
>> }
>>
>>
>> _______________________________________________
>> jsword-devel mailing list
>> jsword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/jsword-devel