[sword-devel] Detecting Problem Characters
Michael Hart
just_mike_y at yahoo.com
Fri Sep 23 15:08:38 MST 2011
Here's a link to the file as it was prior to import into OO calc 3.3,
which produces the problem:
http://www.archive.org/details/HolyNewCovenant
It's under the 'all files http' link, the file ending in .csv. (It takes
too much time to remove the server name from the final link.)
(*) I import this file as UTF-8, delimited on TABS only, with no 'text'
delimiter character. UTF-8 is what I loaded and saved it as in jedit.
(In later OO calc versions, to get no text delimiter, you have to delete
the character in that field and then click on some other field.)
The trouble for me starts on row 61.
An earlier version of this file (overwritten now) loaded properly into
OO calc. Based on flags that calc raised, I manually updated a mirror
jedit file for all the false verses (now appearing as _##_ in their
proper verses instead of as out-of-sequence verses on their own rows).
The import into OOO was not saved. I then further edited in jedit for
some or all of the quoting issues (block characters appearing in jedit
where quotes should be). Both sets of jedit edits (fixing the false
versification, and dealing with quotification) were saved.
I'll try to run some of the character summary scripts when I get to a
working Linux box. It might be a day or two. (Python isn't on my current
desktop, nor is Bash, and I didn't see any other methods.)
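
For reference, a character summary of the kind mentioned above could look
roughly like the following. This is only a minimal sketch, not one of the
scripts from the thread; the function name summarize_suspects and the
choice of what counts as "suspect" (non-ASCII, control characters other
than tab and newline, and plain ASCII quotes) are assumptions.

```python
import collections
import unicodedata


def summarize_suspects(text):
    """Count characters that a spreadsheet importer might treat specially.

    Assumption: "suspect" means non-ASCII, DEL, control characters other
    than tab/newline, or a plain ASCII quote character.
    """
    def is_suspect(ch):
        code = ord(ch)
        if code > 126:
            return True
        if code < 32 and ch not in "\t\n":
            return True
        return ch in "\"'"

    counts = collections.Counter(ch for ch in text if is_suspect(ch))
    # Report code point, Unicode name (if any), and number of occurrences.
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]
```

To run it on a real file, read the file with open(path, encoding="utf-8",
newline="") so that carriage returns are not silently translated, then
print each tuple returned by summarize_suspects().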
I suspect this has something to do with the spreadsheet import module
interpreting some character as an 'absolute text quote' and assuming
many lines are one because of it. BUT I can't see any logic to what's
happening. The way I'm importing, no character should be doing this and
the EOL should be respected. As far as I can tell, it's not happening on
any character I can see in jedit, but it is happening on some of the
verses I've searched/replaced with jedit, which suggests jedit is
hiding something from view on replace that OOO is seeing.
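
The suspected mechanism (a quote character swallowing several lines into
one cell) can be illustrated with Python's csv module. This is a sketch
of the general behaviour only: it is an assumption that OO calc's importer
behaves similarly, and the sample text and the helper name count_rows are
made up for the illustration.

```python
import csv
import io

# Made-up tab-delimited sample: the second field of the first line opens
# with a double quote that is only closed two lines later.
SAMPLE = 'a\t"He said\nb\tnext verse\nc\tthe end"\n'


def count_rows(text, quoting):
    """Count the rows a tab-delimited reader sees under a quoting mode."""
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter="\t", quoting=quoting
    )
    return sum(1 for _ in reader)


# With quote handling on, everything between the quotes is one field,
# so the three physical lines collapse into a single row.
rows_with_quotes = count_rows(SAMPLE, csv.QUOTE_MINIMAL)

# With quote handling off, each physical line is its own row.
rows_without_quotes = count_rows(SAMPLE, csv.QUOTE_NONE)
```

If the import really is running with no text delimiter, the second case is
what should happen, which is why the merged rows are surprising.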
________________________________________________
Re: EOL's as the source of the problem.
Since all EOL's are coming from JEDIT, I can assume they're all the same
structure? (Whatever jre 6 rev 26 under windows produces?)
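
Rather than assuming, the EOL styles actually present in the file can be
counted from the raw bytes. A minimal sketch (the function name count_eols
is hypothetical):

```python
def count_eols(data):
    """Count CRLF, bare LF and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF also contains one LF and one CR, so subtract them.
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }
```

The file has to be opened in binary mode (open(path, "rb").read()) for
this to be meaningful. If more than one of the three counts is nonzero,
the EOL's are not all the same structure.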
One of my steps in conversion is to remove ALL EOL characters from the
file and then insert EOL's with jedit prior to any tab character (placed
by jedit on chapter and booknames) or any exactly 2 digit number with
spaces preceding and following it. (In the case of the HNC, the space
following has already been further modified to a tab for nice import to
spreadsheet.) For a full bible this leaves a few verses in Psalms and
Isaiah that I have to deal with individually (unless the bible properly
spells out any numbers appearing in the text, in which case it becomes
easier to also insert EOL's before 3 character numbers).
After removing all return characters, my document is one row long,
somewhere around 5 megabytes wide (or 1.2 megabytes for NT only.)
The replace structure is like this
1. remove newlines
Search: \n
Replace:
2. add VPL new lines
Search: ( [0-9][0-9] )
Replace: \n$1
Search: \t
Replace: \n\t
All of this is with regexp enabled, in Windows (both Vista and XP, but
80% XP because the Sword Windows utilities won't run on Vista).
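
For anyone wanting to reproduce the replace structure above outside of
jedit, it maps onto Python's re module roughly as follows. This is a
sketch under assumptions: the function name to_vpl is made up, Python
writes the backreference as \1 where jedit uses $1, and the first step
also strips a carriage return before each newline in case the file has
Windows line endings.

```python
import re


def to_vpl(text):
    """Flatten the text to one row, then re-insert verse-per-line breaks."""
    # 1. remove newlines (replace with nothing, as in the jedit search)
    flat = re.sub(r"\r?\n", "", text)
    # 2. add VPL new lines before " NN " (exactly two digits, spaces around)
    flat = re.sub(r"( [0-9][0-9] )", r"\n\1", flat)
    # ... and before every tab (book and chapter names)
    flat = re.sub(r"\t", "\n\t", flat)
    return flat


sample = "Book\tChapter 1 01 First verse \ncontinues. 02 Second verse."
result = to_vpl(sample)
```

Note that any two digit number standing alone in the running text (for
example "about 40 days") gets a line break too, which is the false verse
problem described earlier.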
With the Holy New Covenant work, I replaced the original EOL's with the
text "<>", in order to preserve paragraphing. The document still has a
bunch of diamonds in it awaiting resolution at some point. But the EOL's
are still 100% inserted by me with jedit search/replace.
__________________
(*) - This file started as the 'palmdoc' word document at the
thomhackett.com site I referred to earlier. I've textified it and VPL'd
it (note that the text has paraphrased, grouped verses in it, so I will
later need to IMPortify it or OSISify it). The proper coding for me
starts with getting the text for each 'verse' into a single row and
building the verse declarations in a spreadsheet.
Other notes: In addition, the text file that came out of the word save
as UTF-8 had what appeared to be bulleted text on 4-5 verses, which I
reverted to straight text, no bullets, no return characters. In all
except one verse this appeared to be completely bogus, but I haven't
followed up with the original document to see if bullets were there or
not. They didn't convert properly even if they were present originally
(they kept on going well after any bulleted list would have stopped).