[sword-devel] Detecting Problem Characters

Greg Hellings greg.hellings at gmail.com
Fri Sep 23 10:30:17 MST 2011


Michael,

On Fri, Sep 23, 2011 at 12:20 PM, Michael Hart <just_mike_y at yahoo.com> wrote:
> I've got a couple of modules in the making, and in both I'm working on quote
> marks that either aren't displaying at all or are displaying as block
> "mystery" characters. I'm spending time trying to separate apostrophes from
> single quotes in both modules, in the hope that I can preserve or achieve
> the ability to use OSIS <Q> tags....
>
> HOWEVER
>
> In both modules, at some point I've lost control of a few characters, and
> now MS Excel and OpenOffice Calc can't see all the end-of-line characters.
> That is, when I try to open the VPL file, it almost but not quite works.
> Some verses are grouped together in either spreadsheet, while jEdit sees
> them as properly separated.
>
> Recently or not so recently I saw a comment in some post describing a way
> or a program which summarizes all 'non-ascii' or 'out of this encoding'
> characters that appear in a file.  I've spent time searching for this post
> but cannot locate it or any information about this step on the module
> creation wiki.
>
> Can someone enlighten me (again) as to the best method to find offending
> characters and deal with them?

I wrote the following script, which works well if your text is in
plain text format. Its output will be skewed if your text is in something
like OSIS or imp format, but it will still run.
http://dl.thehellings.com/count.py
It further assumes that your file is encoded in UTF-8, though you can
change that readily enough. The program will terminate abnormally if
the input file contains bytes that are not valid UTF-8; otherwise it
will print a list of all the characters it encountered, their
frequency, and their Unicode names.
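
For readers who cannot fetch the link, a script of this kind might look
roughly like the sketch below. This is not the count.py linked above;
it is a minimal Python 3 illustration of the same idea, assuming a
UTF-8 input file. The function names (count_characters, describe,
format_report, read_utf8) and the non_ascii_only option are made up for
this example.

```python
import sys
import unicodedata
from collections import Counter


def count_characters(text):
    """Return a Counter mapping each character in text to its frequency."""
    return Counter(text)


def describe(char):
    """Return the Unicode name of char, or a placeholder if it has none."""
    # Control characters such as newline and tab have no Unicode name.
    return unicodedata.name(char, "<unnamed>")


def format_report(counts, non_ascii_only=False):
    """Build one report line per character, most frequent first."""
    lines = []
    for char, freq in counts.most_common():
        if non_ascii_only and ord(char) < 128:
            continue
        lines.append("U+%04X\t%d\t%s" % (ord(char), freq, describe(char)))
    return lines


def read_utf8(path):
    """Read path as UTF-8, raising UnicodeDecodeError on invalid bytes."""
    # Reading bytes and decoding by hand keeps carriage returns and other
    # line-break characters exactly as they are stored in the file.
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8")


if __name__ == "__main__" and len(sys.argv) == 2:
    try:
        text = read_utf8(sys.argv[1])
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first invalid sequence.
        sys.exit("Not valid UTF-8 at byte offset %d" % err.start)
    print("\n".join(format_report(count_characters(text))))
```

Saved under a name of your choosing and run with the module's text file
as its only argument, this prints one tab-separated line per character:
code point, count, and Unicode name. Unlike the behaviour described
above, the sketch reports the byte offset of the first invalid UTF-8
sequence before exiting, which may help in locating the offending
bytes.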

>
> Thanks in advance,
>
> Mike
> ___________________________________________________________________
>
> PS.  Modules in progress are based on these documents:
>
> 1. Holy New Covenant (public domain on publication in 2004.)
> http://www.thomhackett.com/the-holy-new-covenant.htm
>
> (The "palm doc" file actually opens as an MS Word 97 or 2003 file.)  It is my
> intention to get this into sword to evaluate its readability and
> usability.  From my cursory review it is a fairly faithful treatment of
> scripture. The Galilee Translation Team mentioned appears to be affiliated
> with The Church of Christ in some way.
>
> 2. The Riverside New Testament (published 1923 and copyright renewed (1948?)
> according to Google, but even if still copyrighted should be distributable
> within the next decade... If I have my facts straight).
>
> http://sourceforge.net/projects/zefania-sharp/files/Zefania%20XML%20Modules%20%28old%29/Bibles%20ENG/The%20Riverside%20New%20Testament%20%281923%29/sf_Riverside_NT2.zip/download
>
> Came to me as a 'zefania' xml file.  Note that this file is now (after I
> started working on this last year) already available in OSIS format at:
>
> http://sourceforge.net/projects/zefania-sharp/files/Osis%20XML%20Modules%20%28raw%29/
>
> so this is really more of an exercise in 'what am I doing wrong' for me.

For reasons that are not entirely mine to go into, and not germane to
your questions, CrossWire policy is generally to ignore zefania files.
One of those reasons, as you point out, is that many of their files
have been found to violate copyright law.

--Greg


