[sword-devel] Musings about the Cherokee NT module
Chris Little
chrislit at crosswire.org
Sun Jul 1 13:44:05 MST 2012
On 7/1/2012 12:51 PM, refdoc at gmx.net wrote:
> While I must confess my interest in Cherokee is fairly limited, the
> process of proximity testing would be extremely helpful for study bible
> creation in any number of languages. Could you explain the algorithms
> in more detail? Are there CPAN or Python modules available?
You can find the definition of the algorithm for finding Levenshtein
edit distance on Wikipedia, along with pseudocode. (Google Levenshtein.)
It's also implemented as part of NLTK in Python, and I'm sure someone
has written something and posted it to CPAN too.
The way I learned edit distance, from Dan Jurafsky in Coursera's NLP
course, insertions and deletions have a cost of 1 while substitutions
have a cost of 2. Wikipedia and the NLTK docs seem to indicate that all
three should cost 1. So there is some variety in the particulars of the
implementation, and there are other ways of computing edit distance.
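To make the difference between the two cost schemes concrete, here is a
minimal sketch of the standard dynamic-programming approach in plain
Python. It is not NLTK's code; the function name and the sub_cost and
indel_cost parameters are made up for illustration. Setting sub_cost=1
gives the Wikipedia-style distance, and sub_cost=2 gives the variant
where a substitution costs the same as a deletion plus an insertion.

```python
def edit_distance(source, target, sub_cost=1, indel_cost=1):
    """Levenshtein-style edit distance with configurable costs.

    sub_cost and indel_cost are illustrative parameter names,
    not taken from any particular library.
    """
    # prev[j] holds the distance between the first i-1 characters of
    # source and the first j characters of target.
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # Turning i characters into the empty string takes i deletions.
        curr = [i * indel_cost]
        for j, t_ch in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for cost in (1, 2):
        print(cost, edit_distance("intention", "execution", sub_cost=cost))
```

Only two rows of the table are kept at a time, so memory grows with the
length of the second string rather than with the product of both
lengths.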
Converting to Soundex (see Wikipedia again) or another phonological
encoding before computing edit distances will probably produce better
results, but Soundex-type representations are necessarily language-specific.
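As an illustration of what such an encoding looks like, here is a rough
sketch of classic American Soundex in Python. It assumes English
spelling and plain ASCII letters, so it would not be usable as-is for
Cherokee or most other languages; the function name and the table
layout are illustrative only.

```python
import string

# Consonant groups of classic American Soundex (English-specific).
_GROUPS = (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return a 4-character Soundex code, or "" if word has no ASCII letters."""
    letters = [ch for ch in word.lower() if ch in string.ascii_lowercase]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    # The first letter's own code still suppresses an identical
    # code on the letter that follows it.
    last_code = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != last_code:
            encoded += code
        # Vowels have no code, so they reset last_code and allow
        # the same digit to appear again afterwards.
        last_code = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (encoded + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

The resulting codes could then be compared with whatever edit-distance
function you use, in place of the raw spellings.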
If you just want to experiment a bit and are comfortable in Python, I
would recommend grabbing NLTK and playing with its implementation.
--Chris