[jsword-devel] Comparing texts

DM Smith dmsmith at crosswire.org
Sat Sep 1 07:26:32 MST 2012


If you want the option to have word based vs char based, then open a Jira issue so we can track it. That'd give the front-end and possibly the user the option to tune the diff.
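
For reference, the word-mode approach described on the diff-match-patch wiki page linked further down is, as far as I understand it, to map each distinct word to a single character, run the ordinary character diff on the encoded strings, and then map the result back to words. Here is a minimal sketch of just the encoding and decoding steps. The class and method names (WordEncoder, encode, decode) are hypothetical and are not part of JSword or diff-match-patch, and the existing char-based diff would still have to be run on the encoded strings in between.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps whole words to single chars so a char diff acts on words. */
final class WordEncoder {
    // Index in this list == char code assigned to the word.
    private final List<String> words = new ArrayList<>();
    private final Map<String, Character> codes = new HashMap<>();

    /** Replaces every whitespace-delimited word with one char; equal words get equal chars. */
    String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty token
            }
            Character code = codes.get(word);
            if (code == null) {
                // Assumption: stay below the surrogate range so each code is a valid lone char.
                if (words.size() >= 0xD800) {
                    throw new IllegalStateException("too many distinct words");
                }
                code = (char) words.size();
                words.add(word);
                codes.put(word, code);
            }
            out.append(code.charValue());
        }
        return out.toString();
    }

    /** Turns an encoded string (or an encoded diff fragment) back into space-separated words. */
    String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(words.get(encoded.charAt(i)));
        }
        return out.toString();
    }
}
```

Both texts must go through the same WordEncoder instance so that identical words share a code. Note that this sketch discards the original whitespace and treats attached punctuation as part of the word, which may or may not be what a front-end wants.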

Also, re accents, you could preprocess the inputs to the diff to do folding. There's a filter in Lucene that is able to do this (in the context of Lucene indexing and searching, so it's not directly applicable here, but it is at least example code). ICU can do it too. If this is something that you think should be in JSword, as opposed to a front-end, open a Jira issue to track it.
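
As a rough illustration of that preprocessing step, here is a minimal sketch that uses neither Lucene nor ICU, only java.text.Normalizer from the JDK (Java 6 and later, so it is an assumption that this is acceptable for the target runtime). The class name AccentFolder is hypothetical. It decomposes the text to NFD and then drops all combining marks.

```java
import java.text.Normalizer;

/** Hypothetical helper: strips accents before the text is handed to the diff. */
final class AccentFolder {
    /**
     * Decomposes each character into base letter plus combining marks (NFD),
     * then removes every Unicode mark (category M).
     */
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

You would call AccentFolder.fold() on both texts and diff the results. Two caveats: the diff then describes the folded text, so the output shown to the user loses its accents unless positions are mapped back to the original, and removing every mark also removes things like Hebrew vowel points, which may be more than is wanted.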

WRT priorities, it will be low for me at least until av11n is done.

In His Service,
	DM

On Sep 1, 2012, at 10:20 AM, Chris Burrell <chris at burrell.me.uk> wrote:

> Thanks for this. We've decided to stick with the letter option for now, as it highlights the subtle differences between words like saith and said rather well.
> 
> One thing I notice however, and I'm not sure how we would do this, is that the diffing takes account of the accents in the original text. I'm guessing there is no easy way to have that work out of the box, apart from changing the OSIS returned by the call and amending it prior to the diff occurring. 
> 
> Chris
> 
> 
> 
> 
> On 29 August 2012 19:00, DM Smith <dmsmith at crosswire.org> wrote:
> It was based upon an earlier version of diff-match-patch, which was written in JavaScript, not Java. My selection criterion was that it had to have a license compatible with JSword. When the original author was hired by Google, the code changed to a license that was incompatible for porting. Since then it has been ported to Java 5.
> 
> I ported the earlier version to Java 1.4, but I broke it out into multiple classes. (We might be able to eliminate our version and use the Google version directly.)
> 
> I think there is a way to have it do a word based match, but with code changes:
> http://code.google.com/p/google-diff-match-patch/wiki/LineOrWordDiffs
> 
> 
> On Aug 29, 2012, at 12:50 PM, Chris Burrell <chris at burrell.me.uk> wrote:
> 
>> Hi all
>> 
>> The current diffing produces some fairly strange results from time to time. I was wondering how much work it would be to make it work for a word by word diff, rather than letter by letter. I've had a quick scan through the diff-ing engine, but it looks fairly complicated and I can't figure out how much of this is a copy of http://code.google.com/p/google-diff-match-patch and how much has changed.
>> 
>> In the example below, 
>> 
>>            "And God saw that the light , that it was good : and God dividwas good. And God separated the light from the darkness          "
>> 
>> The new diff would hopefully not chop "that" and "the" in the first occurrence above. It would not chop "divid" off either, but rather have longer words, which would in turn make things slightly more readable.
>> 
>> (bold indicates strike through)
>> 
>> Chris
>> 
>> _______________________________________________
>> jsword-devel mailing list
>> jsword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/jsword-devel
> 
> 
