[sword-devel] Languages without a space between words
Troy A. Griffitts
scribe at crosswire.org
Mon Apr 17 20:08:24 EDT 2023
Great suggestions, all. One thing to interject: SWORD raw search simply
looks for a needle in a haystack; it doesn't break words at all in the
haystack. The multi-word search type breaks the search term into needles
at each space, e.g., if you search for "God love world" and specify
multi-word, then you effectively get a search for 3 needles. The "phrase"
search type takes the whole search term as one needle. Whether that would
be more or less useful here, I'll let the language-informed determine.
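
To make the difference between the two search types concrete, here is a minimal Python sketch of the behaviour described above. It is not SWORD's code: the function names are made up for illustration, the Thai string is an illustrative sample rather than text quoted from a module, and requiring every needle to be present in multi-word mode is an assumption.

```python
def phrase_match(haystack: str, term: str) -> bool:
    """Treat the whole search term as one needle (plain substring test)."""
    return term in haystack


def multiword_match(haystack: str, term: str) -> bool:
    """Split the term at spaces and look for each piece as its own needle.

    Assumption: every needle must occur somewhere in the haystack.
    """
    needles = [n for n in term.split(" ") if n]
    return all(n in haystack for n in needles)


english = "For God so loved the world"
thai = "พระเจ้าทรงรักโลก"  # illustrative unspaced Thai text

print(multiword_match(english, "God love world"))  # three separate needles
print(phrase_match(english, "God love world"))     # one needle, spaces included
print(phrase_match(thai, "รัก"))                    # substring of unspaced text
```

Because neither function looks at word boundaries, "love" is found inside "loved", and a Thai needle is found in unspaced text as long as it is typed exactly as it appears in the haystack.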
On 4/17/23 11:24, Greg Hellings wrote:
> Yes, that looks like the type of thing. Although that is for Lucene
> (Java). I don't know the status of CLucene's implementation of that
> nor of Xapian's. But that would be the proper place for such
> processing to occur. If those libraries do not have one, interested
> parties could submit one. They could probably develop it inside of the
> SWORD library to be sure it's doing what they want it to do (I believe
> those filters are designed to be pluggable by the calling application)
> before submitting it to those projects for inclusion.
>
> --Greg
>
> On Mon, Apr 17, 2023 at 1:12 PM David Haslam <dfhdfh at protonmail.com>
> wrote:
>
> Thanks, Greg.
>
> I just came across this
>
> https://lucene.apache.org/core/3_2_0/api/contrib-analyzers/org/apache/lucene/analysis/th/ThaiWordFilter.html
>
> Is that the kind of thing you were thinking of?
>
> David
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Apr 17, 2023 at 17:51, Greg Hellings
> <greg.hellings at gmail.com> wrote:
>> I don't believe you're going to get that sort of feature directly
>> in the engine's simple search.
>>
>> However, if you're using a build of the library that utilizes
>> CLucene or Xapian, then that should be the function of those
>> libraries. They are supposed to be able to handle all of that
>> type of functionality if the language has a corresponding
>> contribution to that library. It might be better to check in with
>> them.
>>
>> --Greg
>>
>> On Mon, Apr 17, 2023 at 11:46 AM David Haslam
>> <dfhdfh at protonmail.com> wrote:
>>
>> Unlike Hebrew, Arabic, etc., none of the names of the Thai
>> Unicode characters contain the word FINAL. Likewise for
>> Myanmar letters.
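
The observation about character names can be checked with Python's standard `unicodedata` module. This is a small sketch: the helper name `names_containing` is hypothetical, the block ranges are the basic Hebrew, Thai, and Myanmar blocks, and the result depends on the Unicode data shipped with the Python build in use.

```python
import unicodedata


def names_containing(word: str, start: int, end: int) -> list[str]:
    """Return Unicode names in [start, end] that contain `word` as a whole word."""
    found = []
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unassigned code points
        if word in name.split():
            found.append(name)
    return found


blocks = {
    "Hebrew": (0x0590, 0x05FF),
    "Thai": (0x0E00, 0x0E7F),
    "Myanmar": (0x1000, 0x109F),
}
for label, (start, end) in blocks.items():
    print(label, names_containing("FINAL", start, end))
```

Hebrew serves as the comparison case here, since names such as HEBREW LETTER FINAL KAF are encoded as separate characters.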
>>
>> A possible way forward might be to run one of the several
>> Word Segmentation programs on the text of the ThaiKJV.
>>
>> Examples: KuCut, DeepCut, AttaCut
>>
>> This should insert a Unicode zero width non-joiner (ZWNJ) as
>> a word separator.
>>
>> NB. The module would have to be updated using the segmented
>> source text.
>>
>> Visually, the resulting text would display the same as the
>> original, but the module would be amenable to indexing for
>> word searches.
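
As a rough illustration of the proposal, the sketch below joins already segmented words with an invisible separator and splits them back out for indexing. The word list stands in for the output of a segmentation program and is hypothetical; no KuCut, DeepCut, or AttaCut API is called, and the helper names are made up for this example.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, the separator named in the proposal


def join_segments(words, sep=ZWNJ):
    """Rebuild the text with an invisible separator between words."""
    return sep.join(words)


def split_words(text, sep=ZWNJ):
    """Recover the word list an indexer could work from."""
    return [w for w in text.split(sep) if w]


def strip_separator(text, sep=ZWNJ):
    """Return the text as it would look without any separators."""
    return text.replace(sep, "")


# Hypothetical segmenter output for an illustrative Thai sentence.
segments = ["พระเจ้า", "ทรง", "รัก", "โลก"]
segmented = join_segments(segments)

print(split_words(segmented))
print(strip_separator(segmented) == "".join(segments))
```

Unicode also defines U+200B ZERO WIDTH SPACE for marking invisible word boundaries; passing `sep="\u200b"` would switch the sketch to that character. A phrase typed without separators would not match the segmented text as a plain substring, so a front-end might need something like `strip_separator` (or the reverse) applied consistently to both query and text.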
>>
>> A difficulty that might then arise is how the front-end user
>> might enter the search query for an exact phrase search type
>> (containing more than one word). Other search types (all
>> words, any word) might be OK as is.
>>
>> Aside: The KuCut method, developed in 2004, was originally
>> trained using the text of the ThaiKJV.
>>
>> Regards,
>>
>> David
>>
>> Sent from Proton Mail for iOS
>>
>>
>> On Mon, Apr 17, 2023 at 17:16, Peter Von Kaehne
>> <refdoc at gmx.net> wrote:
>>> Do Thai, Burmese, etc. use end forms for letters? If so,
>>> are these encoded as such?
>>> Peter
>>> *Sent:* Monday, 17 April 2023 at 16:47
>>> *From:* "David Haslam" <dfhdfh at protonmail.com>
>>> *To:* sword-devel at crosswire.org
>>> *Subject:* [sword-devel] Languages without a space between words
>>> How (if at all) does the SWORD API generate a search index
>>> for a module that is for a language without a space between
>>> words?
>>> Please consider how best to generate a useful search index
>>> for modules that are for Bible translations in languages
>>> that have no spaces between words. Example: CrossWire module
>>> ThaiKJV. See
>>> https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries
>>> Has this ever been considered before?
>>> Best regards,
>>> David
>>> Sent from Proton Mail for iOS
>>> _______________________________________________ sword-devel
>>> mailing list: sword-devel at crosswire.org
>>> http://crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page