[sword-devel] Chinese Bible search program

Joel Mawhorter sword-devel@crosswire.org
Wed, 13 Dec 2000 10:26:31 -0800


On Tuesday 12 December 2000 18:09, you wrote:
> Joel Mawhorter wrote:
> > Hello everyone,
> >
> > I have written to this list a few times in the past about supporting
> > various languages such as Chinese, Arabic, etc. in Bible search software.
> > I have
>
> Whooo!  Talk about taking up a challenge. :-)
>
> > decided that the best way to support some of these languages is to write
> > software specifically for that purpose rather than extending a project
> > such as Sword. Some of the requirements for these languages are very
> > different than for English-like languages. I am in my last year of my
> > computer science
>
> Please understand that I've never done I18N, but I have read about it in
> conjunction with my GUI work in the past; so take this with a large volume
> of salt... :-)
>
> You have 2 basic problems from a display standpoint:  holding the data, and
> "printing" the data in the proper direction.  (I am assuming the correct
> font exists. :-)  In the X-Window world, and Motif specifically, the data
> is held in arrays of wchar_t (wide char type), so you can put Unicode in
> it.  The Motif XmText widget (via the XmString type) also has direction
> (i.e. left to right or right to left), therefore, you can display Arabic,
> Hebrew, and so forth in it.  I don't think top to bottom is supported. :-( 
> My point is, you're going to need to find support for your I18N work in a
> GUI widget set of some sort, and if you can find the proper widget to do
> that, your life will be very easy from there on.  There are functions (at
> least in Motif) to help you manipulate wchar_t data, so look for something
> like that too in whatever you pick.

I have decided on Java because it has very good support for Unicode and 
Unicode fonts. It even does bidirectional text properly by default. If you 
give a Java widget a Hebrew string, it prints it right to left but if you 
give it an English string, it prints it from left to right. As well, Java's 
default char and String types use Unicode.
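
To illustrate the direction point, here is a minimal sketch using java.text.Bidi as found in current JDKs. The class and method names (DirectionDemo, isBaseRightToLeft) are made up for this example, and it only asks Java which base direction it would pick for a string; it is not code from my program.

```java
import java.text.Bidi;

class DirectionDemo {
    /** Returns true if Java resolves the string's base direction as right-to-left. */
    static boolean isBaseRightToLeft(String text) {
        // DIRECTION_DEFAULT_LEFT_TO_RIGHT: take the direction from the first
        // strongly directional character, falling back to left-to-right.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // Hebrew letters shin, lamed, vav, final mem
        String english = "peace";
        System.out.println(isBaseRightToLeft(hebrew));  // true
        System.out.println(isBaseRightToLeft(english)); // false
    }
}
```

The strings are written with Unicode escapes so the example does not depend on the source file's encoding.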

> ....
>
> > Also, is there anyone on this list who reads Chinese who would be willing
> > to assist me with suggestions, testing, etc.?
>
> Not me, but if you get stuck and can't find anyone else, let me know and I
> have a friend who might.
>
> > My goal is to make this program very simple (i.e. no texts other than the
> > Bible, no pictures, no formatted text, etc.). However, I want to make the
> > searching capability as powerful as possible. I have read a few good
> > discussions on this list in the past about searching so I thought I would
> > solicit some suggestions. My current plan is to implement AND, OR, NOT,
> > wildcard, proximity and phrase searching. I would love to hear any
> > suggestions that people might have about this. Specifically, I am unsure
> > whether to implement NOT as a general operator or only AND NOT. For
> > example, the former would allow a search such as "NOT (Love | Joy |
> > Peace)" which would find all verses not containing one of those three
> > words. The latter would only allow searches such as "Love AND NOT Peace".
> > My intent with the
> >
> From Boolean Algebra, you don't need just NOT, the same functionality can
> always be implemented with AND and OR.  So you could avoid that work, and
> put something in Help that tells them this rule [in case you don't know,
> reverse all operators, so your example becomes "Love & Joy & Peace"].  Your
> AND NOT operator can't get any simpler, so if you want that functionality,
> you'll have to put it in.  In a perfect world, AND NOT would be available.
> :-)

Actually, NOT (x | y | z) becomes (NOT x) & (NOT y) & (NOT z) by De Morgan's
law. I don't think there is any way to express NOT (x | y | z) without one of
NOT, NAND, or NOR.
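
Here is a rough sketch of how that could be checked, assuming each word's hits are stored as a java.util.BitSet with one bit per verse. That representation, the class name BooleanSearchDemo, and the verse numbers are all assumptions made up for illustration, not a description of what I have implemented.

```java
import java.util.BitSet;

class BooleanSearchDemo {
    /** Builds a hit set from a list of verse indexes. */
    static BitSet of(int... verses) {
        BitSet set = new BitSet();
        for (int v : verses) set.set(v);
        return set;
    }

    /** General NOT: every verse in [0, verseCount) that is not a hit. */
    static BitSet not(BitSet hits, int verseCount) {
        BitSet result = (BitSet) hits.clone();
        result.flip(0, verseCount);
        return result;
    }

    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** The restricted form: hits in a that are not hits in b. */
    static BitSet andNot(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }

    public static void main(String[] args) {
        int verseCount = 8; // a toy "Bible" of 8 verses, numbered 0..7
        BitSet love = of(0, 1, 4), joy = of(1, 2), peace = of(4, 5);

        BitSet lhs = not(or(or(love, joy), peace), verseCount);
        BitSet rhs = and(and(not(love, verseCount), not(joy, verseCount)),
                         not(peace, verseCount));
        System.out.println(lhs);                 // {3, 6, 7}
        System.out.println(lhs.equals(rhs));     // true
        System.out.println(andNot(love, peace)); // {0, 1}
    }
}
```

Note that the general NOT needs to know the total number of verses, while AND NOT only needs the two hit sets. That is one practical difference between the two designs.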

> > proximity operator is to allow people to search for two words which occur
> > within x verses of each other. Should I also allow people to search for
> > two words which occur within x words of each other? (This doesn't even
> > really make much sense for Chinese but I'm thinking ahead for other
> > languages).
>
> There's been a few times I could have used proximity. :-)  But it's
> probably not worth it if it's too hard to implement.

Do you mean verse proximity or word proximity? The first is fairly easy to
implement.
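
As a sketch of why verse proximity is the easy case: assume every verse has a running index across the whole Bible and that each word's hits are available as a sorted array of those indexes. Both assumptions, and the names VerseProximityDemo and within, are invented for this example. A single pass over the two arrays is then enough.

```java
import java.util.ArrayList;
import java.util.List;

class VerseProximityDemo {
    /**
     * Returns the verses containing word A that have a verse containing word B
     * at most maxDistance verses away. Both arrays must be sorted ascending.
     */
    static List<Integer> within(int[] versesA, int[] versesB, int maxDistance) {
        List<Integer> result = new ArrayList<>();
        int j = 0;
        for (int a : versesA) {
            // Skip B hits that are too far before this A hit; since A is
            // sorted, they are too far before every later A hit as well.
            while (j < versesB.length && versesB[j] < a - maxDistance) {
                j++;
            }
            // The first remaining B hit is the closest candidate at or after
            // the lower bound; accept if it is also within the upper bound.
            if (j < versesB.length && versesB[j] <= a + maxDistance) {
                result.add(a);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] versesA = {3, 10, 40};
        int[] versesB = {5, 39, 100};
        System.out.println(within(versesA, versesB, 2)); // [3, 40]
    }
}
```

With maxDistance set to 0 this reduces to a plain AND on the verse level. Word proximity would need word positions inside each verse, which is where the question of word boundaries in Chinese comes in.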

> > Also, how useful is XOR since most people have no idea what it is and
> > those who do probably know that "a XOR b" can be written as "(a AND NOT
> > b) | (b AND NOT a)".
>
> That's a correct transformation, and no, *I* wouldn't bother implementing
> XOR.
>
> Any other suggestions I would have are probably already on your list. 
> FWIW, QuickVerse implements:  AND, OR, NOT, XOR, * (0 or more chars), ? (1
> and only 1 char), and () for grouping.  It also does "case in/sensitivity"
> and "match all word endings" (which might be nice, but is easily done with
> "*").

That's good to know. I agree with you about "match all word endings". Why
have two simple ways of doing the same thing? Also, case in/sensitivity and
word endings don't make any difference in Chinese but would for some of the
languages I would like to support in the future.
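
For what it's worth, here is one way the two wildcards could be handled, sketched by translating the search term into a java.util.regex pattern. This is only an assumption about a possible approach; the names WildcardDemo, toPattern and matches are hypothetical. "*" stands for zero or more characters and "?" for exactly one, as in the QuickVerse description above.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    /** Translates a wildcard term ("*" and "?") into a regular expression. */
    static Pattern toPattern(String wildcard, boolean ignoreCase) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); ) {
            int cp = wildcard.codePointAt(i);
            if (cp == '*') {
                regex.append(".*");   // zero or more characters
            } else if (cp == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(new String(Character.toChars(cp))));
            }
            i += Character.charCount(cp);
        }
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE : 0;
        return Pattern.compile(regex.toString(), flags);
    }

    /** Returns true if the whole word matches the wildcard term. */
    static boolean matches(String wildcard, String word, boolean ignoreCase) {
        return toPattern(wildcard, ignoreCase).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("lov*", "loving", false)); // true
        System.out.println(matches("lov*", "Love", false));   // false
        System.out.println(matches("lov*", "Love", true));    // true
        System.out.println(matches("?", "\u611B", false));    // true
    }
}
```

A term like "lov*" covers "match all word endings", and the ignoreCase flag simply has no effect on Chinese text.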

Thanks for the comments.

Joel

> HTH,
> Kevin