[mobile-devel] Accented Searching

Caleb Maclennan caleb at alerque.com
Fri May 28 04:42:12 MST 2010


2010/5/28 Tóth Tamás <tomika_nospam at freemail.hu>:
> It's clear that preprocessing the string to be found is not enough in this
> case. As I see a custom compare algorithm has to be implemented.

Tom,

I don't think you understand how pre-processing text with filters for
search applies to this problem. It does have its weaknesses, but the
example you give is exactly the kind of problem it solves gracefully.

Remember that both the data set and the search term are run through
the same filters. So whether you type in Jónás or jonas or JöNâŞ, the
engine is going to filter that and look through the text for jonas.
At the same time the text it is searching through has been filtered
the same way, so ALL instances in the text have been normalized to
jonas. When results are returned, they can be returned from the
original text, not the stripped / filtered version, so the proper
accents can be shown in the front-end.

In other words the engine will find all instances of a word even when
the query and the stored text don't match character for character,
because both have been normalized to the same intermediate form.
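
To make that concrete, here is a rough sketch in Python. It is only an
illustration of the idea, not how any particular search engine does
it. The function names normalize and search are made up for this
example, and I am assuming a filter that decomposes characters, drops
the combining accent marks and folds case:

```python
import unicodedata

def normalize(text):
    # Hypothetical filter: split accented letters into base letter plus
    # combining mark, drop the marks, then fold case.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

def search(term, words):
    # Compare filtered forms, but return the ORIGINAL words so the
    # front-end can still show the proper accents.
    key = normalize(term)
    return [w for w in words if normalize(w) == key]

names = ["Jónás", "János", "jonas"]
hits = search("JöNâŞ", names)  # matches "Jónás" and "jonas", not "János"
```

A real engine would filter the data set once at index time rather than
on every query, but the matching principle is the same.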

The limitations involve things that change the meaning of words and do
not normalize easily. For example in my language of Turkish there is a
problem with the letter i and the undotted variant ı. A user searching
for "kin" might actually want the word "kin" or they might be using a
keyboard without the ı letter and want to find the word "kın". As you
might guess these words have entirely different meanings. Basically
what ends up happening in a strip/filter scenario is that BOTH words
get returned all the time and it is impossible to search specifically
for only one variant. In general this is preferred over not returning
results at all.
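
Here is a small Python sketch of that trade-off. Again this is only an
illustration under my own assumptions: the names normalize and
EXTRA_FOLDS are made up, and the explicit ı to i mapping is something I
am adding by hand, because the undotted ı is a letter in its own right
and does not decompose into i plus an accent the way ó or ö do:

```python
import unicodedata

# Hypothetical extra mapping for letters that accent stripping alone
# does not touch.
EXTRA_FOLDS = str.maketrans({"ı": "i"})

def normalize(text):
    # Decompose, drop combining marks, fold case, then apply the
    # hand-written mapping above.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold().translate(EXTRA_FOLDS)

words = ["kin", "kın"]
# Searching for either spelling returns both words.
matches_dotted = [w for w in words if normalize(w) == normalize("kin")]
matches_undotted = [w for w in words if normalize(w) == normalize("kın")]
```

Once both spellings collapse to the same filtered form there is no way
for the query to tell them apart, which is exactly the limitation
described above.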

Regards,
Caleb


