[sword-devel] Languages without a space between words

Troy A. Griffitts scribe at crosswire.org
Mon Apr 17 20:08:24 EDT 2023


Great suggestions all.  One thing to interject: SWORD raw search simply 
looks for needles in a haystack; it doesn't break the haystack into 
words at all.  The multi-word search type breaks the search term into 
needles at each space, e.g., if you search for "God love world" and 
specify multi-word, you effectively get a search for 3 needles.  The 
"phrase" search type takes the whole search term as one needle.  Whether 
that would be more or less useful here, I'll let the language-informed 
determine.
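
To make the needle/haystack distinction concrete, here is a minimal 
sketch in Python. It is not SWORD code and does not use the SWORD API; 
the function name raw_search, the search_type strings, and the sample 
verses are all made up for illustration. It assumes that a multi-word 
search requires every needle to be present, and it ignores case 
handling and other search flags.

```python
def raw_search(verses, query, search_type="phrase"):
    """Return the keys of verses that match, using plain substring tests.

    Hypothetical helper, not part of SWORD:
      "phrase"    -> the whole query is a single needle
      "multiword" -> the query is split on spaces into several needles,
                     and (by assumption) all of them must occur
    The haystack text is never split into words.
    """
    if search_type == "multiword":
        needles = query.split()
    else:
        needles = [query]
    return [key for key, text in verses.items()
            if all(needle in text for needle in needles)]


# Illustrative sample data only.
verses = {
    "John 3:16": "For God so loved the world, that he gave his only begotten Son",
    "1 John 4:8": "He that loveth not knoweth not God; for God is love.",
}

print(raw_search(verses, "God love world", "multiword"))  # ['John 3:16']
print(raw_search(verses, "God love world", "phrase"))     # []
```

Note that the needle "love" matches inside "loved" because the haystack 
is not broken into words. By the same token, a needle typed without 
spaces would be matched as a plain substring of unspaced text.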

On 4/17/23 11:24, Greg Hellings wrote:
> Yes, that looks like the type of thing. Although that is for Lucene 
> (Java). I don't know the status of CLucene's implementation of that 
> nor of Xapian's. But that would be the proper place for such 
> processing to occur. If those libraries do not have one, interested 
> parties could submit one. They could probably develop it inside of the 
> SWORD library to be sure it's doing what they want it to do (I believe 
> those filters are designed to be pluggable by the calling application) 
> before submitting it to those projects for inclusion.
>
> --Greg
>
> On Mon, Apr 17, 2023 at 1:12 PM David Haslam <dfhdfh at protonmail.com> 
> wrote:
>
>     Thanks, Greg.
>
>     I just came across this
>
>     https://lucene.apache.org/core/3_2_0/api/contrib-analyzers/org/apache/lucene/analysis/th/ThaiWordFilter.html
>
>     Is that the kind of thing you were thinking of?
>
>     David
>
>
>     On Mon, Apr 17, 2023 at 17:51, Greg Hellings
>     <greg.hellings at gmail.com> wrote:
>>     I don't believe you're going to get that sort of feature directly
>>     in the engine's simple search.
>>
>>     However, if you're using a build of the library that utilizes
>>     CLucene or Xapian, then that should be the function of those
>>     libraries. They are supposed to be able to handle all of that
>>     type of functionality if the language has a corresponding
>>     contribution to that library. It might be better to check in with
>>     them.
>>
>>     --Greg
>>
>>     On Mon, Apr 17, 2023 at 11:46 AM David Haslam
>>     <dfhdfh at protonmail.com> wrote:
>>
>>         Unlike Hebrew and Arabic, etc, none of the names of the Thai
>>         Unicode characters contain the word FINAL. Likewise for
>>         Myanmar letters.
>>
>>         A possible way forward might be to run one of the several
>>         Word Segmentation programs on the text of the ThaiKJV.
>>
>>         Examples: KuCut, DeepCut, AttaCut
>>
>>         This should insert a Unicode zero width non-joiner (ZWNJ) as
>>         a word separator.
>>
>>         NB. The module would have to be updated using the segmented
>>         source text.
>>
>>         Visually, the resulting text would display the same as the
>>         original, but the module would be amenable to indexing for
>>         word searches.
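
A rough sketch, in Python, of the segment-then-index idea just quoted. 
Everything here is an assumption made for illustration: the toy 
longest-match segment() function only stands in for a real tool such as 
KuCut, DeepCut or AttaCut, the four-word lexicon and the sample phrase 
are invented, and none of this is SWORD code. The separator is a 
parameter, so a different zero-width character (for example U+200B 
ZERO WIDTH SPACE) could be tried in place of ZWNJ.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, the separator proposed above


def segment(text, lexicon):
    """Toy greedy longest-match segmenter (stand-in for a real tool).

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    words, i = [], 0
    longest = max(len(word) for word in lexicon)
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:
                words.append(piece)
                i += size
                break
    return words


def insert_separators(text, lexicon, sep=ZWNJ):
    """Return the text with an invisible separator between words."""
    return sep.join(segment(text, lexicon))


def index_words(segmented, sep=ZWNJ):
    """Split segmented text on the separator to get indexable words."""
    return set(segmented.split(sep))


# Illustrative data: a tiny made-up lexicon and an unspaced phrase.
words = ["พระเจ้า", "ทรง", "รัก", "โลก"]
lexicon = set(words)
text = "".join(words)

segmented = insert_separators(text, lexicon)
print(segmented.replace(ZWNJ, "") == text)  # same visible characters
print(sorted(index_words(segmented)))
```

Removing the separators gives back the original string, which is why 
the displayed text should look unchanged, while splitting on the 
separator yields the individual words an indexer would need.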
>>
>>         A difficulty that might then arise is how the front-end user
>>         might enter the search query for an exact phrase search type
>>         (containing more than one word). Other search types (all
>>         words, any word) might be OK as is.
>>
>>         Aside: The KuCut method developed in 2004 was originally
>>         trained using the text of the ThaiKJV.
>>
>>         Regards,
>>
>>         David
>>
>>
>>         On Mon, Apr 17, 2023 at 17:16, Peter Von Kaehne
>>         <refdoc at gmx.net> wrote:
>>>         Do Thai, Burmese, etc. use end forms for letters? If so,
>>>         are these encoded as such?
>>>         Peter
>>>         *Sent:* Monday, 17 April 2023 at 16:47
>>>         *From:* "David Haslam" <dfhdfh at protonmail.com>
>>>         *To:* sword-devel at crosswire.org
>>>         *Subject:* [sword-devel] Languages without a space between words
>>>         How (if at all) does the SWORD API generate a search index
>>>         for a module that is for a language without a space between
>>>         words?
>>>         Please consider how best to generate a useful search index
>>>         for modules that are for Bible translations in languages
>>>         that have no spaces between words. Example: CrossWire module
>>>         ThaiKJV. See
>>>         https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries
>>>         Has this ever been considered before?
>>>         Best regards,
>>>         David
>>>         _______________________________________________
>>>         sword-devel mailing list: sword-devel at crosswire.org
>>>         http://crosswire.org/mailman/listinfo/sword-devel
>>>         Instructions to unsubscribe/change your settings at above page