[sword-devel] Chinese Bible search program
Joel Mawhorter
sword-devel@crosswire.org
Tue, 12 Dec 2000 12:55:36 -0800
Hello everyone,
I have written to this list a few times in the past about supporting various
languages such as Chinese, Arabic, etc. in Bible search software. I have
decided that the best way to support some of these languages is to write
software specifically for that purpose rather than extending a project such
as Sword. Some of the requirements for these languages are very different
from those for English-like languages. I am in the last year of my computer
science undergrad and am taking a project course. I decided to write a Chinese Bible
program for this course. I am still in early development (all I really have
so far is the Chinese Bible in an acceptable format and the full text index
completed). As an aside, Chinese is very interesting to index because written
Chinese has no spaces between words. In addition, manual segmentation of
Chinese into words can produce different results with different human
segmenters (i.e. ABCD might be segmented ABC D by one person and AB CD by
another). As a result, most of my work so far has been researching how best to
index Chinese. I hope to have something functional fairly soon.
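To make the indexing problem concrete, here is a minimal sketch of one approach that sidesteps word segmentation: indexing every overlapping pair of characters (a bigram index). This is only an illustration and not necessarily the approach taken in the program described here. The function names and the sample data are hypothetical, and Latin letters stand in for Chinese characters, as in the ABCD example above.

```python
from collections import defaultdict


def build_bigram_index(verses):
    """Map every overlapping two-character sequence to the verse ids containing it."""
    index = defaultdict(set)
    for verse_id, text in verses.items():
        for i in range(len(text) - 1):
            index[text[i:i + 2]].add(verse_id)
    return index


def search(index, verses, query):
    """Return ids of verses that contain `query` as a contiguous string."""
    if len(query) < 2:
        raise ValueError("this sketch only handles queries of two or more characters")
    bigrams = [query[i:i + 2] for i in range(len(query) - 1)]
    # A verse must contain every bigram of the query to be a candidate.
    candidates = set.intersection(*(index.get(b, set()) for b in bigrams))
    # Bigrams can match in scattered positions, so confirm against the text.
    return {v for v in candidates if query in verses[v]}


# Hypothetical data: Latin letters stand in for Chinese characters.
verses = {"v1": "ABCD", "v2": "CDAB", "v3": "BCDE", "v4": "ABXBC"}
index = build_bigram_index(verses)
print(sorted(search(index, verses, "ABC")))
```

Because no segmentation decision is made at index time, it does not matter whether a reader would split ABCD as ABC D or AB CD. The cost is a larger index and the extra confirmation step against the verse text.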
Troy, do you think this is something that could be brought under the umbrella
of Crosswire?
Also, is there anyone on this list who reads Chinese who would be willing to
assist me with suggestions, testing, etc.?
My goal is to make this program very simple (i.e. no texts other than the
Bible, no pictures, no formatted text, etc.). However, I want to make the
searching capability as powerful as possible. I have read a few good
discussions on this list in the past about searching so I thought I would
solicit some suggestions. My current plan is to implement AND, OR, NOT,
wildcard, proximity and phrase searching. I would love to hear any
suggestions that people might have about this. Specifically, I am unsure
whether to implement NOT as a general operator or only AND NOT. For example,
the former would allow a search such as "NOT (Love | Joy | Peace)" which
would find all verses containing none of those three words. The latter
would only allow searches such as "Love AND NOT Peace". My intent with the
proximity operator is to allow people to search for two words which occur
within x verses of each other. Should I also allow people to search for two
words which occur within x words of each other? (This doesn't make much
sense for Chinese, where word boundaries are ambiguous, but I'm thinking
ahead to other languages.)
Also, how useful is XOR, given that most people have no idea what it is and
those who do probably know that "a XOR b" can be written as "(a AND NOT b) |
(b AND NOT a)"?
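The operator questions above can be sketched with plain set operations, assuming each search term has already been resolved to a set of verse numbers. Everything here is hypothetical: the function names, the sample hit sets, and the assumption that verses are numbered sequentially through the whole Bible so that "within x verses" is a simple difference of numbers.

```python
def op_and_not(a, b):
    """Verses matching a but not b: needs only the two hit sets."""
    return a - b


def op_not(a, all_verses):
    """General NOT: needs the set of every verse to take a complement."""
    return all_verses - a


def op_xor(a, b):
    """Verses matching exactly one of a and b."""
    return a ^ b


def within_verses(a, b, x):
    """Hits from either set that have a hit from the other set at most x verses away."""
    near_a = {v for v in a if any(abs(v - w) <= x for w in b)}
    near_b = {w for w in b if any(abs(v - w) <= x for v in a)}
    return near_a | near_b


# Hypothetical hit sets in a ten-verse text (assumption: sequential numbering).
all_verses = set(range(1, 11))
love, joy, peace = {1, 2, 5}, {2, 7}, {2, 5, 9}

print(sorted(op_not(love | joy | peace, all_verses)))  # NOT (Love | Joy | Peace)
print(sorted(op_and_not(love, peace)))                 # Love AND NOT Peace
print(op_xor(love, joy) == (love - joy) | (joy - love))
print(sorted(within_verses(love, joy, 1)))
```

One practical difference this shows: AND NOT works from the two hit sets alone, while a general NOT has to know every verse in the text and can return most of the Bible for a rare word. XOR is a single set operation here, so supporting it costs little even if few people use it.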
Any other suggestions that people have, especially regarding searching, would
be appreciated.
Thanks,
Joel