[sword-devel] indexed search discrepancy
Troy A. Griffitts
scribe at crosswire.org
Fri Aug 28 10:38:51 MST 2009
Thanks for investigating this, Matthew. There shouldn't really be any
repercussions to increasing MAX_CONV_SIZE within reason, though I would
like to find a way to remove this code if we can.
Does anyone know if CLucene REALLY wants a wchar_t buffer, and if so,
what EXACTLY does it want?
wchar_t on Windows is 16 bits, and on Linux it is typically 32 bits.
This would likely mean that it expects UTF-16 on Windows? Or maybe it
just limits itself to 16-bit characters and doesn't support the full
Unicode range (at least on Windows)?
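
To make the 16-bit versus 32-bit question concrete, here is a minimal sketch of how many wchar_t units a single code point needs under each assumption. Both helper names are hypothetical and are not part of SWORD or CLucene. The sketch only illustrates the standard UTF-16 surrogate rule and says nothing about what CLucene actually expects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the UTF-16 code units for one Unicode code point.
std::u16string encodeUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range is a caller error
    std::u16string out;
    if (cp > 0xFFFF) {
        // Code points beyond the BMP are split into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<char16_t>(cp));
    }
    return out;
}

// Hypothetical helper: wchar_t units needed on the current platform,
// assuming 16-bit wchar_t carries UTF-16 and wider wchar_t carries UTF-32.
std::size_t wcharUnitsFor(char32_t cp) {
    return sizeof(wchar_t) == 2 ? encodeUtf16(cp).size() : 1;
}
```

If a 16-bit build treated every wchar_t as a whole character, anything above U+FFFF would either be split into two units or lost, which is the distinction the question above is trying to pin down.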
We have methods to convert to both UTF-16 and UTF-32 in our engine,
which don't need a fixed-length buffer, so I would like to replace:
lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
with a call to our code, if we can nail down exactly what CLucene wants
in the resultant wcharBuffer.
Anyway, for now, upping the buffer should be fine, or dynamically
allocating to, say, 2*source length should also be practically safe.
Some of our module drivers support a 4-byte entry size, though, so
retaining a static buffer with a fixed size would mean we'd need to make
it fairly large to support the full range of data.
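
As a rough sketch of the dynamically sized alternative, the conversion below grows its output from the source length instead of relying on a fixed MAX_CONV_SIZE. The function name utf8ToWide is hypothetical and is not an existing SWORD or CLucene call. It assumes that a 16-bit wchar_t should carry UTF-16 and a wider one UTF-32, which is exactly the open question above. Under that assumption the output never needs more wchar_t units than there are input bytes, so the source length is already a safe upper bound.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of SWORD or CLucene.
// Decodes UTF-8 into a std::wstring sized from the input. Malformed or
// truncated sequences become U+FFFD. Overlong forms are not rejected.
std::wstring utf8ToWide(const std::string &utf8) {
    std::wstring out;
    out.reserve(utf8.size());  // at most one wchar_t unit per input byte

    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;      extra = 0; }

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= utf8.size()) { ok = false; break; }
            unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF) cp = 0xFFFD;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

Because the result is a std::wstring, its c_str() could be handed to whatever currently receives wcharBuffer, and long genbook or commentary entries would no longer be cut off at a fixed limit.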
-Troy.
PS. I just typed my last command and looked at my history...
scribe@scribe-laptop:~/src/sword/src/modules$ svn blame swmodule.cpp > blame
scribe@scribe-laptop:~/src/sword/src/modules$ vi blame
scribe@scribe-laptop:~/src/sword/src/modules$ rm blame
...
and felt an all-encompassing Love and acceptance, being reminded of what
our God has done for us when I type:
rm blame
and solidly pressed return. :)
Matthew Talbert wrote:
> The problem is more universal and serious than I originally thought.
> SWORD indexed search is performing rather poorly against BT's. At any
> rate, the biggest issue is the size given to MAX_CONV_SIZE in
> swmodule.cpp. Here are some tests:
>
> //default value of 2047
> ./search Finney "good" | wc
> [0=================================50===============================100]
> ======================================================================
> 22 255 1549
>
>
> //MAX_CONV_SIZE = 10000
> ./search Finney "good" | wc
> [0=================================50===============================100]
> ======================================================================
> 51 576 3573
>
> //MAX_CONV_SIZE = 15000
> ./search Finney "good" | wc
> [0=================================50===============================100]
> ======================================================================
> 56 650 3985
>
> But even 15000 isn't high enough to get all occurrences of words at
> the end of long text sections. For Finney, a value of 20000 is
> probably required and it's entirely possible that other modules would
> require higher values.
>
> I don't know what the consequences are of changing this value, but
> currently we're missing a huge number of hits in genbook modules and a
> substantial number of hits in commentaries as well.
>
> If this gets fixed, I think searching and results should be added to
> the test suite. It would be simple to add; just run mkfastmod, then
> the search program (it would be nice to be able to change the search
> type without re-compiling so that different search types could be
> done).
>
> Matthew