[sword-devel] Musings about the Cherokee NT module

Chris Little chrislit at crosswire.org
Sun Jul 1 13:44:05 MST 2012


On 7/1/2012 12:51 PM, refdoc at gmx.net wrote:
> While I must confess my interest in Cherokee is fairly limited, the
> process of proximity testing would be extremely helpful for study bible
> creation in any number of languages. Could you explain the algorithms
> in more detail? Are there CPAN or Python modules available?

You can find the definition of the algorithm for finding Levenshtein 
edit distance on Wikipedia, along with pseudocode. (Google Levenshtein.) 
It's also implemented as part of NLTK in Python, and I'm sure someone 
has written something and posted it to CPAN too.

The way I learned edit distances, from Dan Jurafsky in Coursera's NLP
course, insertions and deletions each cost 1 while substitutions cost 2
(a substitution being treated as a deletion plus an insertion). Wikipedia
and the NLTK docs seem to indicate that all three operations should
cost 1. So... there is some variety in the particulars of the
implementation, and there are other ways of computing edit distance.
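
To make the difference between the two cost schemes concrete, here is a
minimal sketch of the usual dynamic-programming computation with the
costs exposed as parameters. The function name edit_distance and the
parameter names ins_cost, del_cost and sub_cost are chosen for this
illustration only; this is not NLTK's code, just the textbook recurrence.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum edit distance between two sequences (illustrative sketch).

    The cost parameters are assumptions of this sketch: sub_cost=1 gives
    the uniform-cost variant, sub_cost=2 the variant where a substitution
    counts as a deletion plus an insertion.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string takes only deletions.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    # Building a prefix from the empty string takes only insertions.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match/substitute
            )
    return dist[-1][-1]


if __name__ == "__main__":
    print(edit_distance("intention", "execution"))              # uniform costs
    print(edit_distance("intention", "execution", sub_cost=2))  # substitutions cost 2
```

For the pair "intention" / "execution", the uniform-cost variant gives 5
and the substitution-costs-2 variant gives 8, so the two schemes can rank
candidate word pairs differently.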

Converting to Soundex (see Wikipedia again) or another phonetic
encoding before computing edit distances will probably produce better
results, since similar-sounding spellings then compare as close, but
Soundex-type representations are necessarily language-specific.
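
As a rough sketch of that idea, the following encodes each word with
American Soundex and then takes a uniform-cost edit distance between the
codes. The function names soundex, levenshtein and phonetic_distance are
made up for this example, and the letter table is the English one, so it
illustrates the technique rather than anything usable for Cherokee as is.

```python
# American Soundex letter groups (English-specific by design).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels (and unknown letters) get no code
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def levenshtein(a, b):
    """Uniform-cost edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete
                            curr[j - 1] + 1,             # insert
                            prev[j - 1] + (ca != cb)))   # match/substitute
        prev = curr
    return prev[-1]


def phonetic_distance(a, b):
    """Edit distance between the Soundex codes of two words."""
    return levenshtein(soundex(a), soundex(b))


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))
    print(levenshtein("Robert", "Rupert"), phonetic_distance("Robert", "Rupert"))
```

"Robert" and "Rupert" differ by two substitutions as raw strings but
share the code R163, so their phonetic distance is 0. For another
language, the letter table would need to be replaced with groupings that
reflect that language's sound system.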

If you just want to experiment a bit and are comfortable in Python, I 
would recommend grabbing NLTK and playing with its implementation.

--Chris


