[sword-devel] Turkish to English glossary problem
Caleb Maclennan
caleb at alerque.com
Tue Jan 5 11:56:14 MST 2016
Disregard my question about the module; I found it in a different section of
the module manager. So I have the ERtr_en module now, but as far as I can figure,
in Xiphos it's useless. Turkish is an agglutinative language and almost no
words in an actual text like the Bible appear in their root or stem form as
found in a dictionary. Ergo no words (except a handful of conjunctions,
numbers, etc. that sometimes have no suffixes) you click on to look up in
the dictionary even have a chance of coming up with an actual meaning. Even
if you know how to parse words and type the stems into the dictionary
lookup bar, it rarely has them and throws the closest match (Alphabetical?
Levenshtein distance?) which is less than useless.
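To illustrate the lookup problem, here is a minimal sketch. It assumes the fallback is "next key in sort order"; whether the engine really does that is exactly the open question above. The key list and the function name are hypothetical, not taken from the module.

```python
import bisect

# Hypothetical stem-form keys, sorted the way a dictionary index would be.
KEYS = sorted(["ev", "gel", "git", "göz", "kitap"])


def closest_alphabetical(word, keys=KEYS):
    """Return the exact key if present, otherwise the next key in sort order."""
    i = bisect.bisect_left(keys, word)
    if i < len(keys):
        return keys[i]
    return keys[-1]  # past the end of the index: fall back to the last key


# "evlerimizden" ("from our houses") is built on the stem "ev" ("house"),
# but the inflected form sorts after "ev" and lands on an unrelated key.
print(closest_alphabetical("ev"))
print(closest_alphabetical("evlerimizden"))
```

With this toy index the inflected form resolves to "gel" ("come") rather than "ev", which is the kind of miss described above.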
Unless I'm missing something, it might be just as well to disable the
module as insult anybody that tries to use it with data this useless. Am I
missing something? Is there a use-case that makes it worth trying to
clean up the character set issue? I'll still look into it if you say it's
worth some time to do.
Caleb
On Tue, Jan 5, 2016 at 8:39 PM, Caleb Maclennan <caleb at alerque.com> wrote:
> DM,
>
> Honestly I'm willing to put some effort into this if it will be beneficial
> to anybody using Turkish scriptures, but the Wayback Machine link you
> provided is not encouraging. Not only is the encoding garbage, but the data
> itself is rife with mistakes. Not a full minute of skimming it and I found
> several misspelled Turkish words (not just wrong encoding, actual
> misspellings) and outright bogus definitions. It's a very low quality data
> set. Is what's on the page representative of what is going to come out even
> if I dive down an archaic Windows rabbit hole and manage to surface with a
> properly encoded list? Is such a dictionary really helping anybody? It
> doesn't seem to have much in the way of Biblical/theological terminology
> anyway. Is this just for looking up word definitions while reading a
> text or does it serve some purpose for cross referencing translations?
>
> I have a copy of Xiphos handy, but for some reason Turkish isn't showing
> up in the dictionary modules available for download. Is this not in the
> default CrossWire repo?
>
> Caleb
>
> On Tue, Jan 5, 2016 at 8:11 PM, DM Smith <dmsmith at crosswire.org> wrote:
>
>> Thanks Caleb,
>>
>> I’m working on JSword which is the Java version of the SWORD engine. As
>> such I run all the modules I can get my hands on through a process that
>> reads all of each module reporting what it cannot handle. It was that
>> effort that made me look closer at the module. Either the problem was in
>> JSword or it was in the module.
>>
>> With Peter, David and your input, we can safely say that it is the
>> module’s problem.
>>
>> Most front-ends don’t display the module as a list (i.e. browse the
>> contents). Bible Desktop does. Most front-ends allow you to select a word
>> and look it up in a dictionary. The Glossary modules allow you to look up a
>> word in one language and display it in another. Bible Desktop doesn’t.
>>
>> If you let us know which front-end you use, we can explain how to
>> download the module for it and how to use it in that program.
>>
>> The SWORD utility mod2imp will dump a module’s content in imp format.
>> Since this module is a RawLD module, the *dat file is readable. In your
>> modules folder it would be:
>> modules/lexdict/rawld/glossaries/ertr_en/ertr_en.dat. The ertr_en.idx file
>> is not readable as it is in a proprietary binary format.
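A dump in imp format can be picked apart with a few lines of Python. This is only a sketch: it assumes each entry starts with a line of the form `$$$key` followed by the entry text, and the function name and sample data are made up for illustration.

```python
def parse_imp(text):
    """Parse imp-style text into a {key: body} dict.

    Splits on "\n" only, because str.splitlines() would also split on some
    of the control characters that show up in this module's keys.
    """
    entries = {}
    key = None
    lines = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if line.startswith("$$$"):
            if key is not None:
                entries[key] = "\n".join(lines).strip("\r\n")
            key = line[3:]
            lines = []
        elif key is not None:
            lines.append(line)
    if key is not None:
        entries[key] = "\n".join(lines).strip("\r\n")
    return entries


# Made-up sample in the assumed layout, using two keys from the listing.
SAMPLE = "$$$0RAN\n1. Iran; Persia<br />\n$$$1RMAK\n1. river<br />\n"
print(parse_imp(SAMPLE))
```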
>>
>> While it certainly is possible to take the dump from mod2imp, edit it and
>> rebuild the module, we prefer not to do that. What is best is to get the
>> source again and create a module from it. And if the source was not the
>> original location, it is best to identify the original and get it from
>> there. In the case of our source, we got it from:
>> http://www.wordgumbo.com/al/tur/ertureng.htm
>> Currently this site is down, so I found it via the Internet Wayback
>> Machine:
>>
>> https://web.archive.org/web/20131124010613/http://www.wordgumbo.com/al/tur/ertureng.htm
>>
>> I noted that WordGumbo sourced the files from Ergane. That is the
>> originator of the data and it can be found here:
>> http://download.travlang.com//
>>
>> Ergane is software that runs under Windows only. It doesn’t run under
>> Windows 10 (64-bit). I haven’t tried Windows 7 (64-bit). The software
>> requires various zips to be installed to be useful. I downloaded one of the
>> zip files and it contained an MDB file, which I’m pretty sure is a Windows
>> database file, perhaps Access. From the website’s description of the
>> program:
>>
>> Ergane is a multilingual translation
>> dictionary for Windows that uses the artificial language Esperanto to
>> translate words and short expressions from one natural language to another.
>> Ergane is a product of Majstro Aplikaĵoj
>> <http://www.majstro.com/Bedrijf/contact_eng.html>.
>>
>>
>> and
>>
>> You won't need a masters in computer science to
>> download Ergane, but make sure you do have Windows.
>>
>> Windows 95 or higher.
>>
>> Ideally, the output of the program for the Turkish to English needs to be
>> obtained from it, converted into UTF-8 if it isn’t already, and a module
>> source file created for it. Proof-reading is invaluable.
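The UTF-8 conversion step could look something like the sketch below. It assumes the exported data is in the Windows Turkish code page (cp1254), which has not been confirmed for Ergane's output; the function name and the sample bytes are hypothetical.

```python
def to_utf8(raw, source_encoding="cp1254"):
    """Decode bytes from the assumed source code page and re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Made-up sample: byte 0xDD is "İ" in cp1254 but "Ý" if misread as Latin-1.
raw = b"\xddSTANBUL"
print(raw.decode("latin-1"))          # what a wrong code page guess gives
print(to_utf8(raw).decode("utf-8"))   # what the cp1254 assumption gives
```

If the result of a guess like this still has misspelled keys, the code page assumption is wrong and another one needs to be tried before any proof-reading starts.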
>>
>> Let us know what you are willing to do.
>>
>> In Him,
>> DM
>>
>> On Jan 5, 2016, at 12:28 PM, Caleb Maclennan <caleb at alerque.com> wrote:
>>
>> Hey DM,
>>
>> I am fluent in Turkish and can help out here. That being said I'm a
>> little confused about what you're running into here. Can you point me at where to see the
>> source files for this in context and where it comes out in an app?
>>
>> It looks from the bits you pasted like a file somewhere along the line
>> got read and interpreted with the wrong code-page. Of the text you pasted,
>> all of it is wrong, but it is all off with a 1-to-1 character transpose
>> that could make it right. All the "0"s are "İ" and all the "1"s are "I" in
>> the dictionary list for example.
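A substitution like that is easy to sketch. In the listing the keys show the digit "0" where "İ" belongs and "1" where "I" belongs. The mapping of the 1F control character to "Ğ" is an assumption, inferred only from "Ağustos" being the Turkish word for August. The names here are hypothetical, and the table is meant for keys only, since the definitions contain genuine digits such as "1." and "2.".

```python
# Key-only repair table. "0" and "1" come from the observation above;
# "\x1f" -> "Ğ" is an assumption based on AĞUSTOS (August).
REMAP = str.maketrans({"0": "İ", "1": "I", "\x1f": "Ğ"})


def repair_key(key):
    """Apply the one-to-one character substitution to a dictionary key."""
    return key.translate(REMAP)


for broken in ["0RAN", "1L1K", "1RMAK", "A\x1fUSTOS"]:
    print(repr(broken), "->", repair_key(broken))
```

A table like this cannot restore anything the bad conversion already collapsed, so the output would still need proof-reading by a Turkish speaker.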
>>
>> Caleb
>>
>> On Tue, Jan 5, 2016 at 4:56 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>> Does anyone know Turkish that can help figure out a problem I am having?
>>>
>>> Background: In ASCII the first 32 characters (00 to 1F) are control
>>> characters and most are not valid for XML, but are valid for UTF-8.
>>>
>>> In one of our modules, ERtr_en, I am seeing data such as:
>>> For the 26th entry, the entry looks like
>>>
>>> AUSTOS 1. August<br />
>>>
>>> However, the key AUSTOS has a non-printable control character, with the
>>> hex value 1F, between A and U:
>>> ‘A’ ‘1F’ ‘U’ ‘S’ ‘T’ ‘O’ ‘S’
>>>
>>> What is the correct value?
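For finding these across a whole module, a check along these lines might do. It is a sketch that treats every character below hex 20 other than tab, line feed and carriage return as invalid, which is what XML 1.0 permits in that range; the function name is hypothetical.

```python
# Control characters below 0x20 that XML 1.0 does allow.
XML_OK_CONTROLS = {"\t", "\n", "\r"}


def invalid_xml_controls(s):
    """Return (position, hex value) for each control character XML rejects."""
    return [
        (i, f"{ord(ch):02X}")
        for i, ch in enumerate(s)
        if ord(ch) < 0x20 and ch not in XML_OK_CONTROLS
    ]


print(invalid_xml_controls("A\x1fUSTOS"))
print(invalid_xml_controls("1RMAK"))
```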
>>>
>>> Note: There are hundreds of such problems in this module. And I’m seeing
>>> such non-printables in many other modules from the same source (
>>> wordgumbo.com).
>>>
>>> For those that are interested, here are the first entries in the
>>> dictionary, none of which seem right to me (ran a few of the definitions
>>> through google translate):
>>> index offset size key value
>>> 0 33132 22 0NCIL 1. Bible<br />
>>> 1 33156 72 0NGILIZ 1. English<br />2. Englishman; Sassenach...
>>> 2 33260 32 0NGILIZ KAM1_1 1. bamboo<br />
>>> 3 33230 28 0NGILIZCE 1. English<br />
>>> 4 33294 44 0NGILTERE 1. England<br />2. England<br />
>>> 5 33340 28 0RAN 1. Iran; Persia<br />
>>> 6 33370 25 0RANL1 1. Iranian<br />
>>> 7 33397 26 0RLANDA 1. Ireland<br />
>>> 8 33425 43 0RLANDAL1 1. Irish<br />2. Irishman<br />
>>> 9 33470 21 0SA 1. Christ<br />
>>> 10 33493 22 0SLAM 1. Islam<br />
>>> 11 33517 24 0SPANYA 1. Spain<br />
>>> 12 33543 28 0SPANYOL 1. Spaniard<br />
>>> 13 33573 39 0SRAIL 1. Israel<br />2. Israel<br />
>>> 14 33614 28 0STANBUL 1. Istanbul<br />
>>> 15 33644 24 0SVEÇ 1. Sweden<br />
>>> 16 33670 41 0SVEÇLI 1. Swedish<br />2. Swede<br />
>>> 17 33713 31 0SVIÇRE 1. Switzerland<br />
>>> 18 33746 41 0SVIÇRELI 1. Swiss<br />2. Swiss<br />
>>> 19 33789 23 0TALYA 1. Italy<br />
>>> 20 33814 42 0TALYAN 1. Italian<br />2. Italian<br />
>>> 21 33858 44 0TALYANCA 1. Italian<br />2. Italian<br />
>>> 22 33904 26 0ZLANDA 1. Iceland<br />
>>> 23 33086 20 1L1K 1. warm<br />
>>> 24 33108 22 1RMAK 1. river<br />
>>> 25 7062 25 AUSTOS 1. August<br />
>>>
>>>
>>> Thanks in advance!
>>>
>>> In Him,
>>> DM Smith
>>>
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>>
>>
>>
>
>