[sword-devel] ICU and internationalization in sword

Troy A. Griffitts sword-devel@crosswire.org
Tue, 01 Oct 2002 11:14:27 -0700


Joel,
	Adding a dependency on a library is not a trivial thing.  You have not 
yet given me a task that you cannot implement unless the SWORD library 
ALWAYS requires icu.

	Your praise of the functionality of icu is rightly due, and I say 'use 
it'.  But we will not force people to use it.

	If your custom index search needs it and cannot work without it, that's 
fine.  People who want to use it will have to include icu.  I'm sure we'll 
have another implementation of custom index searching that can work 
without it.  And I'll happily use the current non-index searching on my 
handheld device or any other device that might not want the extra overhead.

	We have the best of both worlds already.  There is no need to force 
everyone into the ICU world.

	This has been our policy on all kinds of things, from compression 
algorithms to encryption, and is good modular design.  And if you are 
correct about the need for more than wordbreak functionality, perhaps a 
new class design will be necessary for these methods, instead of the 
current function library for strings.
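
A minimal sketch of what such a class design might look like, offered only as an illustration. The names TextOps, AsciiTextOps, makeTextOps and the macro HYPOTHETICAL_WITH_ICU are assumptions made for this example and are not part of the SWORD API. The idea is that callers talk to one interface, a dependency-free ASCII backend is always available, and an ICU-backed subclass could be selected at build time.

```cpp
#include <memory>
#include <string>

// Hypothetical interface: callers never see which backend is in use.
class TextOps {
public:
    virtual ~TextOps() = default;
    // Uppercase a UTF-8 string as well as the backend can.
    virtual std::string upper(const std::string &utf8) const = 0;
};

// Default backend with no external dependency: folds ASCII a-z only and
// passes every other byte (including multi-byte UTF-8 sequences) through.
class AsciiTextOps : public TextOps {
public:
    std::string upper(const std::string &utf8) const override {
        std::string out = utf8;
        for (char &c : out) {
            if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
        }
        return out;
    }
};

// Factory: an ICU-backed subclass would be returned here when the library
// is built with ICU support (the macro name is an assumption).
std::unique_ptr<TextOps> makeTextOps() {
#ifdef HYPOTHETICAL_WITH_ICU
    // return std::unique_ptr<TextOps>(new IcuTextOps());  // not shown here
#endif
    return std::unique_ptr<TextOps>(new AsciiTextOps());
}
```

With this shape, makeTextOps()->upper("god") gives "GOD" in every build, while accented or non-roman letters are only uppercased when a richer backend is compiled in.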

	-Troy.




Joel Mawhorter wrote:
> On October 1, 2002 00:46, Troy A. Griffitts wrote:
> 
>>My position on ICU:
>>
>>We use it in the engine.  It is never exposed in the engine and is
>>always optional.
>>
>>For example, we have a utf8_toupper function in the api that does its
>>best to change a utf8 string to uppercase.  If the api is configured to
>>use icu, it does a much better job for non-roman script languages.
>>
>>If you don't configure the api with unicode, the engine will not present
>>options to the user for transliteration, which also uses the icu library.
>>
>>I don't have a problem using the icu library where necessary.  And if
>>there is anything that we want to support for which icu already includes
>>rich support, then I think we should use it.
>>
>>I am not willing to make a dependency on icu.  We are not required to do
>>so.  With good design, we can take complete advantage of its benefits,
>>and still provide acceptable roman script functionality without it.  The
>>16 megs of flash memory in the Zaurus and iPAQ will not allow a mandatory
>>dependency.
> 
> 
> This is where static linkage could be used. In that case we should only link 
> in the functionality that we actually use and using ICU should only increase 
> the executable size by about the same as a custom implementation of the 
> functionality we want. That said, do we actually have devices with a 16 MB 
> limit? Certainly most handhelds have a slot for upgrading the flash. 
> Currently 64 and 128 MB SanDisks are very inexpensive and 1 GB SanDisks are 
> on the market. These kinds of significant memory limitations are a very short 
> term problem (especially when we are talking about just a few megabytes).
> 
> 
>>To use icu for your search problem, you may, for example, if you NEED
>>functionality for word breaks that icu provides you, add a function to
>>our string utilities called:
>>
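>>const char *utf8_getNextWordStart(const char *buf) {
>>#ifndef _ICU_
>>   // fallback, basically: skip the current word, then the spaces after it
>>   // (strtok() cannot be used here because it modifies its argument)
>>   while (*buf && *buf != ' ') buf++;
>>   while (*buf == ' ') buf++;
>>   return buf;
>>#else
>>   // do some fancy icu calls
>>   // to let it determine next word break
>>   // and return result.
>>#endif
>>}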
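
To make the non-ICU fallback path of the function above concrete, here is a small self-contained sketch. The names nextWordStart and countWords are hypothetical, and the code assumes that a plain ASCII space is the only word separator, which is exactly the simplification under debate in this thread.

```cpp
#include <cstddef>

// Fallback only: treats an ASCII space as the sole word separator.
// Returns a pointer to the start of the next word, or to the terminating
// '\0' when no further word exists.
const char *nextWordStart(const char *buf) {
    while (*buf && *buf != ' ') ++buf;  // skip the rest of the current word
    while (*buf == ' ') ++buf;          // skip the run of separators
    return buf;
}

// Counts words by repeatedly asking for the next word start.
std::size_t countWords(const char *text) {
    while (*text == ' ') ++text;  // ignore leading spaces
    std::size_t n = 0;
    while (*text) {
        ++n;
        text = nextWordStart(text);
    }
    return n;
}
```

Because UTF-8 multi-byte sequences never contain the byte 0x20, a text written without spaces (Chinese or Thai, for instance) is counted as a single word by this fallback, however long it is.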
> 
> 
> I wish it was that easy. (grin) But seriously, I foresee a growing number of 
> cases like this. I would suggest that just biting the bullet and using the 
> ICU functionality will be better in the long run as it will reduce 
> development time and keep us from having two Swords, one which can acceptably 
> use non-Latin based languages and one that can't.
> 
> 
>>I think if you investigate further, you will find that icu really
>>doesn't give you much language-specific word break support.  Does it
>>work on Chinese?
> 
> 
> Actually, ICU has amazing language-specific support for word and character 
> boundary detection. I've not used it on Chinese but I expect it makes the 
> simplifying assumption for Chinese that 1 ideograph = 1 word (which is not 
> exactly correct but is sufficient for most things). However, for Thai it 
> actually supports using a dictionary based scheme for detecting word 
> boundaries. To my knowledge ICU supports word and character boundary 
> detection for all scripts encoded by Unicode (I am sure this mostly consists 
> of having a lookup table of what characters are punctuation and what are 
> alphabetic).
> 
> 
>>	Do you have any other reason you would like to force a dependency?
> 
> 
> Yes, to get rid of strstr() and stristr() in the searching functionality. Byte 
> for byte string comparison will probably only produce good searching results 
> in a few languages and stristr() only works properly for English and other 
> languages that only use 7-bit ASCII characters. If you are interested, the 
> ICU docs have a brief overview of this here: 
> http://oss.software.ibm.com/icu/userguide/searchString.html
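
The limitation described above can be shown with a short sketch of a byte-wise, ASCII-only case-insensitive search. The function asciiFindNoCase is a hypothetical stand-in written for this example, not the actual stristr() used in SWORD.

```cpp
#include <cstddef>
#include <string>

// Folds ASCII A-Z to lowercase; every other byte is returned unchanged,
// which is why non-ASCII letters are never case-folded.
static unsigned char asciiLower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c - 'A' + 'a') : c;
}

// Byte-wise, ASCII-only case-insensitive search.
// Returns the byte offset of the first match, or std::string::npos.
std::size_t asciiFindNoCase(const std::string &hay, const std::string &needle) {
    if (needle.size() > hay.size()) return std::string::npos;
    for (std::size_t i = 0; i + needle.size() <= hay.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() &&
               asciiLower(static_cast<unsigned char>(hay[i + j])) ==
               asciiLower(static_cast<unsigned char>(needle[j]))) {
            ++j;
        }
        if (j == needle.size()) return i;
    }
    return std::string::npos;
}
```

Searching "In the beginning GOD created" for "god" succeeds, but searching the UTF-8 string "ÉGLISE" for "église" fails: the capital É is encoded as the bytes C3 89 and the lowercase é as C3 A9, and an ASCII-only fold leaves both untouched.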
> 
> ICU provides good functionality for Roman based scripts as well as non-Roman 
> based scripts. Since the only downside I'm aware of is the size issue and since 
> this is a short term problem that can be dealt with by statically linking 
> against ICU, is there any reason to keep Sword from depending on ICU?
> 
> Joel
> 
> 
>>	-Troy.
>>
>>Martin Gruner wrote:
>>
>>>I agree with Joel. ICU dependency is the price we have to pay for clean
>>>Unicode support. We also will need it for the locale handling (toupper).
>>>
>>>Martin
>>>
>>>On Tuesday, 1 October 2002 00:40, Joel Mawhorter wrote:
>>>
>>>>Hi all,
>>>>
>>>>I'm writing to get reactions to the idea of making sword dependent on
>>>>ICU. Currently we only have optional dependencies on ICU (at least for
>>>>transliteration but I'm not sure what else). I would like to suggest
>>>>making ICU required for sword. The reason I would like to see this
>>>>happen is that I would like to use functionality in ICU in the searching
>>>>and indexing code (and probably other things in the future). Dealing
>>>>with strings in a language specific way is far from trivial for many
>>>>operations. For example, doing a search for whole words only (e.g.
>>>>searching for God doesn't return godly) isn't too hard just for English
>>>>but to do this for all languages that are or can be supported by sword
>>>>requires a lot of special logic since punctuation and even the concept
>>>>of what a word is vary so much from language to language. Either we can
>>>>use third party code to do this or someone else or I can write this
>>>>specially for sword. I can't speak for others but I think that if I had
>>>>to write code like this it would likely not be as good as the ICU
>>>>implementation is. Another example of something we need is case
>>>>insensitive searching. Currently this is done with stristr() which only
>>>>handles ASCII. ICU allows this for any language supported by Unicode. I
>>>>have already concluded that index creation will need to depend on ICU
>>>>since the hardest part of indexing is breaking up a text into words
>>>>which is different from language to language.
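
As a rough illustration of the whole-word case mentioned above, the following sketch handles English-style text only. The names isAsciiWordChar and containsWholeWord are hypothetical, the match is case-sensitive, and the code assumes that only ASCII letters and digits can be part of a word. That assumption is precisely what does not carry over to other languages.

```cpp
#include <cstddef>
#include <string>

// Assumption for this sketch: only ASCII letters and digits form words.
static bool isAsciiWordChar(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

// Returns true if `word` occurs in `text` with no word character directly
// before or after it, so "god" matches in "thank god." but not in "godly".
bool containsWholeWord(const std::string &text, const std::string &word) {
    if (word.empty()) return false;
    std::size_t pos = text.find(word);
    while (pos != std::string::npos) {
        std::size_t end = pos + word.size();
        bool leftOk = pos == 0 ||
            !isAsciiWordChar(static_cast<unsigned char>(text[pos - 1]));
        bool rightOk = end == text.size() ||
            !isAsciiWordChar(static_cast<unsigned char>(text[end]));
        if (leftOk && rightOk) return true;
        pos = text.find(word, pos + 1);
    }
    return false;
}
```

Every byte outside the ASCII range is treated as a word boundary here, so a letter such as "é" next to the match would wrongly count as a separator, and scripts that do not separate words with spaces or punctuation are not handled at all.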
>>>>
>>>>Since ICU is well designed (IMO), open source, cross platform and
>>>>contains about everything you could think of for Unicode string
>>>>handling, the only downside I can see to requiring it is the added size
>>>>requirements for the ICU libraries. The default build of the three main
>>>>ICU 2.2 libraries on my machine total about 14 MB and they gzip to about
>>>>5.5 MB. For most platforms this is not a significant size increase.
>>>>Even downloading over a modem, this doesn't add too much download time.
>>>>For platforms where size is a significant issue, sword could be
>>>>statically linked against ICU so that we only linked in the parts of ICU
>>>>that we needed.
>>>>
>>>>I think that if we want to eventually have really good support for
>>>>non-Latin based languages in sword we will at some point have to start
>>>>using a library like ICU. I would rather do that now so that I don't have
>>>>to write a bunch of code for searching that I will just throw out later.
>>>>Another advantage of requiring ICU is that the front ends can start using
>>>>it as well for internationalization of the user interfaces. What do you
>>>>all think about this, especially regarding the advantages and
>>>>disadvantages? Obviously Troy has the final say on this one but I
>>>>thought an open discussion on this would be good.
>>>>
>>>>In Christ,
>>>>
>>>>Joel Mawhorter
>>>