[sword-devel] Detecting Problem Characters

Michael Hart just_mike_y at yahoo.com
Fri Sep 23 15:08:38 MST 2011


Here's a link to the file, prior to import into OO calc 3.3, which 
produces the problem:

http://www.archive.org/details/HolyNewCovenant

It's under the 'all files http' link, ending in .csv. It takes too much 
time to remove the server name in the final link.

(*) I import this file as UTF-8, delimited on TABS only, with no 'text' 
delimiter character. UTF-8 is also what I loaded and saved it as in 
jedit. (In later OO calc versions, to get no text delimiter, you have to 
delete the character and then click on some other field.)
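
The import settings above (split on tabs only, no text delimiter) can be 
imitated outside the spreadsheet to see what row structure the file 
itself has. The following is only a sketch, assuming Python 3 is 
available; the function name field_count_report is made up for 
illustration and is not part of any Sword tool.

```python
from collections import Counter


def field_count_report(text, delimiter="\t"):
    """Tally tab-separated field counts per line, treating no character as a quote.

    Returns (tally, odd_lines): tally maps a field count to the number of
    lines having it; odd_lines lists (line_number, field_count) for lines
    whose count differs from the most common one.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a trailing newline

    per_line = [(lineno, line.count(delimiter) + 1)
                for lineno, line in enumerate(lines, start=1)]
    tally = Counter(n for _, n in per_line)
    if not tally:
        return tally, []

    usual = tally.most_common(1)[0][0]
    odd_lines = [(lineno, n) for lineno, n in per_line if n != usual]
    return tally, odd_lines
```

If the rows reported here look right but the spreadsheet still merges 
lines from some row onward, that would point at the import step rather 
than at the line structure of the file.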

The trouble for me starts on row 61.

An earlier version of this file (overwritten now) loaded properly into 
OO calc. Based on flags that calc raised, I manually updated a mirror 
jedit file for all the false verses (now appearing as _##_ in their 
proper verses instead of as out-of-sequence verses on their own rows). 
The import into OOO was not saved. I then further edited in jedit for 
some or all of the quoting issues (block characters appearing in jedit 
where quotes should be). Both sets of edits in jedit (fixing the false 
versification, and dealing with quotification) were saved.

I'll try to run some of the character summary scripts when I get to a 
working Linux box. Might be a day or two. (Python isn't on my current 
desktop, nor is Bash, and I didn't see any other methods.)

I suspect this has something to do with the spreadsheet import module 
interpreting some character as an 'absolute text quote' and assuming 
many lines are one because of it. BUT I can't see any logic to what's 
happening. The way I'm importing, no character should be doing this and 
the EOL should be respected. As far as I can tell, it's not happening on 
any character I can see in jedit, but it is happening on some of the 
verses I've searched/replaced with jedit, which suggests jedit is 
hiding something from view on replace that OOO is seeing.
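
One way to look for characters that an editor may not display is to 
tally every control, format, or Unicode line-separator character in the 
file. The following is a minimal sketch in Python 3, not one of the 
scripts mentioned on the list; the function name summarize_suspects is 
hypothetical. It reports each suspect character's code point, its 
Unicode name, how often it occurs, and the first line it appears on.

```python
import unicodedata
from collections import Counter


def summarize_suspects(text):
    """List invisible or unusual characters found in text.

    Returns tuples of (code point, Unicode name, count, first line number),
    most frequent first. Tabs are skipped because they are the intended
    field delimiter; "\n" is consumed by the line split.
    """
    counts = Counter()
    first_line = {}
    for lineno, line in enumerate(text.split("\n"), start=1):
        for ch in line:
            if ch == "\t":
                continue
            cat = unicodedata.category(ch)
            # C* covers control, format, private-use and unassigned
            # characters; Zl and Zp are the Unicode line and paragraph
            # separators (U+2028, U+2029).
            if cat.startswith("C") or cat in ("Zl", "Zp"):
                counts[ch] += 1
                first_line.setdefault(ch, lineno)
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), n, first_line[ch])
        for ch, n in counts.most_common()
    ]
```

To run it on a file, read the file with 
open(path, encoding="utf-8", newline="") so that Python does not 
translate line ends; stray carriage returns then show up in the report 
as U+000D.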

________________________________________________

Re: EOL's as the source of the problem.

Since all EOL's are coming from jedit, can I assume they all have the 
same structure? (Whatever JRE 6 rev 26 under Windows produces?)
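
Whether every EOL in the file really has the same structure can be 
checked directly on the raw bytes. This is a small sketch assuming 
Python 3; count_line_endings is an illustrative name. It works on 
UTF-8 data because the bytes for CR and LF never occur inside a 
multi-byte UTF-8 sequence.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF and lone CR in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}
```

Read the file with open(path, "rb").read() and pass the result in. If 
more than one of the three counts is non-zero, the file has mixed line 
endings.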

One of my steps in conversion is to remove ALL EOL characters from the 
file and then insert EOL's with jedit prior to any tab character (placed 
by jedit on chapter and book names) or any exactly 2-digit number with 
spaces preceding and following it. (In the case of the HNC, the space 
following has already been further modified to a tab for nice import to 
a spreadsheet.) For a full Bible this leaves a few verses in Psalms and 
Isaiah that I have to deal with individually (unless the Bible properly 
spells out any numbers appearing in the text, in which case it becomes 
easier to also insert EOL's before 3-character numbers).

After removing all return characters, my document is one row long, 
somewhere around 5 megabytes wide (or 1.2 megabytes for NT only.)


The replace structure is like this:
1. Remove newlines
Search: \n
Replace: (empty)

2. Add VPL new lines
Search: ( [0-9][0-9] )
Replace: \n$1

Search: \t
Replace: \n\t
All with regexp enabled, in Windows (both Vista and XP, but 80% XP 
because the Sword Windows utilities won't run on Vista).
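
For comparison, the same search/replace steps can be sketched in 
Python 3. This is an approximation of the jedit steps above, not the 
exact procedure: the function name to_verse_per_line is made up, 
Python writes the group reference as \1 where jedit uses $1, and step 1 
here also strips CR characters, which is an assumption about what a 
Windows file may contain.

```python
import re


def to_verse_per_line(text):
    """Collapse text to one line, then start a new line at each verse number or tab."""
    # Step 1: remove every existing line end (CRLF, lone CR, lone LF).
    one_line = re.sub(r"\r\n|\r|\n", "", text)

    # Step 2a: insert a newline before any two-digit number that has a
    # space on both sides. The surrounding spaces are kept.
    out = re.sub(r"( [0-9][0-9] )", r"\n\1", one_line)

    # Step 2b: insert a newline before every tab character.
    out = re.sub(r"\t", "\n\t", out)
    return out
```

Because regex matches do not overlap, two numbers that share a single 
space (as in " 01 02 ") would only get a newline before the first one.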

With the Holy New Covenant work, I replaced the original EOL's with the 
text "<>", in order to preserve paragraphing. The document still has a 
bunch of diamonds in it awaiting resolution at some point. But the EOL's 
are still inserted by me 100% with jedit search/replace.


__________________
(*) - This file started as the 'palmdoc' Word document at the 
thomhackett.com site I referred to earlier. I've textified it and VPL'd 
it (note that the text has paraphrased, grouped verses in it, so I will 
later need to IMPortify it or OSISify it). The proper coding for me 
starts with getting the text for each 'verse' into a single row and 
building the verse declarations in a spreadsheet.


Other notes: In addition, the text file that came out of the Word save 
as UTF-8 had what appeared to be bulleted text on 4-5 verses, which I 
reverted to straight text, no bullets, no return characters. In all 
except one verse this appeared to be completely bogus, but I haven't 
followed up with the original document to see if bullets were there or 
not. They didn't convert properly even if they were present originally 
(they kept on going well after any bulleted list would have stopped).


