[sword-devel] [Fwd: [CLucene-dev] clucene 0.9.0]

Daniel Glassey danglassey at ntlworld.com
Thu Mar 31 15:51:16 MST 2005


Hi,
If anyone would like to try out this new development version of clucene 
and let us know how well it works with sword (especially on non-English 
languages) that would be great.

Regards,
Daniel

-------- Original Message --------
Subject: [CLucene-dev] clucene 0.9.0
Date: Mon, 28 Mar 2005 01:57:44 +0200
From: Ben van Klinken


Hi All,

If anyone is interested in trying out my branch, please go to:

http://clucene.sourceforge.net/support/misc/

and download one of the clucene-0.9.0 files

I've changed a lot of things in it, including many performance
changes. You should notice a considerable performance increase. In
some places 20% and even up to 100% increases in speed... and more to
come.

I've done a major restructure of the configure system, so it should be
more cross-platform. I've managed to get it to compile on all the
compile farms except the ppc's, which are complaining about the _T's.
That shouldn't be too hard to fix, although I'm not sure about
endianness compatibility...

I've pasted below my change log for this branch. There are a lot of
changes now, including some fairly major header changes. There are
still a few more changes to go before I suggest people start using it,
but if you want to experiment please go for it.

Note: You'll have to bootstrap to use the linux version.

Also, I've got the beginnings of a documentation script running, and
an automated build script too. Go to clucene.sourceforge.net to see
an 'under construction' page.



----
Have changed the clucene interface significantly. All functions that
used references have been changed to pointers. Where a function
returns a reference to an internal variable, a reference is used. When
a function "consumes" a variable - i.e. takes responsibility for
deleting the object, the function will be (should be) a reference.
This has not been applied rigorously yet, but will be done over a
period of time.

Pyclene:
I have managed to compile pyclene. This is what I needed to do:
I had problems compiling pyclene because ProcessorNameString wasn't
defined in my registry. Putting in a semi-valid value seemed to fix it
(in my case "amd athlon").
I replaced all the relevant function names using the list in
CLBackwards as a guide.
By default now, clucene is built with unicode if supported, so an
_ASCII preprocessor definition had to be added.
I had to do some post-processing on the _clucene_wrap.cpp. I was
getting two static definitions for many of the functions. For example:
static static lucene_util_FileReader___eq__... Removed one of the
statics for each function. This was also occurring
float_t had to be defined in pyclene.i, not sure why -
HitCollector::collect would not generate its directory interface
Had to explicitly set up the typemap for TCHAR*



change log:
Large commit:
I realise that some of these changes will break people's code, but I
think in the interest of creating a clucene which is more 'accessible'
to new developers these changes should be made. Please let me know
what you think.

* In order to make clucene a more standard release that developers
can more easily begin programming with, some of the exotic functions
have been converted to their TCHAR equivalents, which should be more
familiar to some and will at least be a more standardised version of
the former functions. This change also renames char_t to TCHAR.
Include the file CLucene/CLBackwards.h after StdHeader.h to
(hopefully) maintain backwards compatibility. See notes in
CLBackwards.h for more information.

* Linux Unicode version. Removed UTF8 code in favour of real unicode.
Ensure _UNICODE is defined in config.h to enable unicode. _UNICODE
uses 2 bytes to store its characters. Note that this is only clucene's
internal representation, characters are still stored in the index as
UTF8. This brings the linux version of clucene up to the same
character capabilities as java-lucene.

* Implemented debug reference counting. This allows developers to see
which clucene classes have not been properly deleted. Define
LUCENE_ENABLE_LUCENEBASE to enable this functionality. Note that this
only counts clucene objects - it does not include undeleted
non-clucene memory (and strings returned from clucene), nor does it
guarantee that memory leaks within clucene objects haven't occurred.
Nonetheless, it is still useful for general clucene usage.
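
To illustrate the kind of debug counting described in the bullet above,
here is a minimal sketch. It is not CLucene's code: CountedBase,
FakeDocument and liveAfterPartialCleanup() are made-up names, and the
real base class enabled by LUCENE_ENABLE_LUCENEBASE may work quite
differently. The sketch is also not thread-safe.

```cpp
#include <cassert>

// Hypothetical stand-in for a debug-only base class that counts live objects.
class CountedBase {
public:
    CountedBase() { ++liveCount(); }
    CountedBase(const CountedBase&) { ++liveCount(); }
    virtual ~CountedBase() { --liveCount(); }

    // Number of CountedBase-derived objects that have not been destroyed.
    static int live() { return liveCount(); }

private:
    // Function-local static avoids needing a separate definition.
    static int& liveCount() {
        static int count = 0;
        return count;
    }
};

// Any class deriving from the base is counted automatically.
class FakeDocument : public CountedBase {};

// Creates two objects, deletes one, and reports how many are still
// alive at that point before cleaning up the other.
int liveAfterPartialCleanup() {
    FakeDocument* a = new FakeDocument();
    FakeDocument* b = new FakeDocument();
    delete b;
    int stillAlive = CountedBase::live();  // a has not been deleted yet
    delete a;
    return stillAlive;
}
```

Checking CountedBase::live() at program exit would then show whether
any counted objects were never deleted. As the bullet says, a counter
like this sees only objects of the counted classes, not other memory.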

* Closely related to the lucenebase functionality, there is a
pseudo-reference counting mechanism. This mechanism is not used
internally, but can be used by developers to ensure that their objects
are not deleted by clucene's internal handling. A Document returned
from a Hits object, for example, will only last as long as it is
valid in the hits cache. Calling __cl_addref() on the Document object
will ensure it is not deleted by clucene's internals. The Hits cache
will call _DELETE on the Document - which will not actually delete the
object - and only when the 'owner' calls _DECDELETE (or __cl_decref(),
then _DELETE) will the object truly be deleted. Define
LUCENE_ENABLE_REFCOUNT to enable this functionality.

* This change has made it important to call _CLNEW when creating
clucene objects. _DELETE should always be called for pointers, or
_LDELETE for l-values (returned from a function, for example).
_DECDELETE calls __cl_decref() and, if the refcount is 0, deletes the
object. (Note: don't pass a function call as the argument here, because
the function will be called twice, which may have undesirable results.)
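
The two bullets above can be sketched roughly as follows. This is a
sketch under assumptions, not CLucene's implementation: RefCounted,
addref(), decref(), SKETCH_DELETE and SKETCH_DECDELETE are hypothetical
stand-ins for __cl_addref(), __cl_decref(), _DELETE and _DECDELETE, and
it assumes a new object starts with a refcount of 0. It shows an object
surviving the cache's delete while the caller holds a reference, and
why a macro that mentions its argument twice calls a function argument
twice.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only. RefCounted, addref(),
// decref(), SKETCH_DELETE and SKETCH_DECDELETE are not CLucene's definitions.
static int destroyedCount = 0;  // how many RefCounted destructors have run

class RefCounted {
public:
    RefCounted() : refs_(0) {}  // assumption: objects start at 0
    virtual ~RefCounted() { ++destroyedCount; }
    void addref() { ++refs_; }
    int decref() { return --refs_; }
    int refs() const { return refs_; }
private:
    int refs_;
};

// Deletes only when nobody holds a reference.
#define SKETCH_DELETE(x) \
    do { if ((x)->refs() <= 0) delete (x); } while (0)

// Drops one reference, then deletes if none remain. Note that (x) appears
// twice, so an argument with side effects is evaluated twice.
#define SKETCH_DECDELETE(x) \
    do { if ((x)->decref() <= 0) delete (x); } while (0)

struct Outcome {
    int afterCache;  // objects destroyed after the cache let go
    int afterOwner;  // objects destroyed after the owner let go
};

Outcome cacheThenOwner() {
    const int before = destroyedCount;
    RefCounted* doc = new RefCounted();
    doc->addref();                // the caller claims the object
    RefCounted* cacheCopy = doc;  // the cache's pointer to the same object
    SKETCH_DELETE(cacheCopy);     // refcount is 1, so nothing is destroyed
    Outcome result;
    result.afterCache = destroyedCount - before;
    SKETCH_DECDELETE(doc);        // refcount drops to 0, object is destroyed
    result.afterOwner = destroyedCount - before;
    return result;
}

static int fetchCalls = 0;
static RefCounted* shared = nullptr;

// A function with a visible side effect: it counts its own calls.
RefCounted* fetch() { ++fetchCalls; return shared; }

int callsMadeByMacro() {
    shared = new RefCounted();
    shared->addref();
    fetchCalls = 0;
    SKETCH_DECDELETE(fetch());    // expands to two fetch() calls
    shared = nullptr;
    return fetchCalls;
}
```

Storing the function result in a local pointer first, and passing that
pointer to the macro, avoids the double call.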

* A fairly major rework of StdHeader.h has been done. This should help
in cross-platform compilation. I have used some ideas from the stl
port code to identify platforms. The configuration should be a lot
more accurate, more maintainable and cleaner (maybe *grin*).

* To use an alternative CLConfig.h, define OVERRIDE_DEFAULT_CLCONFIG.
A file called AltCLConfig.h will be included. Make sure you define all
the required definitions in this file. If anyone can think of a better
way of doing this, please let me know.

* I have made CLucene MSVC 6 compatible. Quite a few changes were
needed for this. Most of them are transparent, except for the VoidList
and VoidMap classes. The value and key types respectively for these
classes are now stored as pointers, and thus VoidList<Directory> will
be a list of Directory pointers (previously this was written
VoidList<Directory*>).

* The Reader class has been reworked. The Reader does not use the
FSInputStream anymore; instead the character encoding can be specified
as ASCII, UTF8, 8859_1, UNICODEBIG or UNICODELITTLE. This should fix
some of the 'read past EOF' errors that have been occurring because the
files are assumed to be utf8. PLATFORM_DEFAULT_READER_ENCODING can be
used as a default character encoding. LUCENE_OOR_CHAR is used when
converting between a larger character type and a smaller type.
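
As an illustration of the last point in the bullet above, here is a
small sketch of narrowing wide characters with a substitute for
anything that does not fit. It is not CLucene's converter:
kOutOfRangeChar and narrowToAscii() are made-up names, '?' is only an
assumed value for the substitute, and the target is taken to be 7-bit
ASCII for simplicity.

```cpp
#include <cassert>
#include <string>

// Hypothetical substitute character; the value CLucene uses for
// LUCENE_OOR_CHAR may differ.
const char kOutOfRangeChar = '?';

// Narrows a wide string to plain chars. Code points below 128 are copied
// as-is; everything else is replaced by kOutOfRangeChar.
std::string narrowToAscii(const std::wstring& wide) {
    std::string out;
    out.reserve(wide.size());
    for (wchar_t wc : wide) {
        // The cast makes negative values (where wchar_t is signed) fail
        // the range check as well.
        if (static_cast<unsigned long>(wc) < 128UL) {
            out.push_back(static_cast<char>(wc));
        } else {
            out.push_back(kOutOfRangeChar);
        }
    }
    return out;
}
```

For example, narrowToAscii(L"na\u00EFve") gives "na?ve", while a pure
ASCII string passes through unchanged.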

* Changed the character conversion function names to STRDUP_XtoX and
STRCPY_XtoX, where each X is A (ascii), W (unicode) or T (the current
character type).

* Changed CND_DEBUG functionality so that users can implement their
own debug function. See _CND_DEBUG_DONTIMPLEMENT_OUTDEBUG in the
CLConfig.h

* Changed float_t to double_t - TODO: is this better? I think it has
better accuracy???

* Changed the thread implementation so that users can implement their
own thread handling functionality in their own code. See
_LUCENE_DONTIMPLEMENT_THREADMUTEX in CLConfig.h

* Examples/Util has an example of using incremental indexing. You can
also use groups to increment only certain parts of the index. Use the
syncronize command to remove documents that have changed or that have
been deleted from the index. Merge has been fixed to use the proper
addIndexes function.

* To reduce incompatibilities, all 'using namespace lucene::*'
directives have been removed from headers.

=== Other things ===
* I've begun some of the process of gnu'ifying clucene, by making it a
more standard package. Please make suggestions on this. John Wheeler
(I think it's you working on some of the documentation), can you give
us an idea of how I can update my build script to include documentation
in the release? Is this a good idea? Or should we just provide help on
how to create the documentation - what tools, etc.?

* Please look through the files like AUTHORS, etc. and check for
mistakes and things that have been left out. I'd like to see the basic
documentation, at least, correct and 'helpful'.

* Added a 'monolithic' msvc project. This is based on David Rushby's
idea - all the .cpp files are compiled into one object, which speeds up
compilation considerably. It is good for quickly building the
project, but not as good for developing and debugging.

* Changed all internal file representations from TCHAR to char. Having
TCHAR representations of files is not necessary and only makes porting
to unicode for *nix more difficult. Some changes will need to be made
to client code for this to work.

* Implemented a Java-style String interning function. This is used in
Term (and for caching in the future). This will save some memory and
might increase performance a bit. Functions that compare field names
now compare with == instead of _tcscmp.
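
The interning idea in the bullet above can be sketched like this. It is
only an illustration, not CLucene's function: intern() and sameField()
are made-up names, and it uses std::string and std::unordered_set
rather than TCHAR buffers. Equal strings map to one shared copy, so
field names can be compared by pointer.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns a pointer to the single shared copy of text.
// Pointers to unordered_set elements stay valid when the set grows.
const std::string* intern(const std::string& text) {
    static std::unordered_set<std::string> pool;
    return &*pool.insert(text).first;
}

// With interned names, equality is a pointer comparison, not a
// character-by-character comparison.
bool sameField(const std::string* a, const std::string* b) {
    return a == b;
}
```

Two calls such as intern("contents") and intern(std::string("con") +
"tents") return the same pointer, so sameField() is true for them.
This sketch never releases pooled strings and is not thread-safe.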

* Option of pre-allocating memory for Terms. This can increase
performance *a lot*, but will increase memory use. See the Term.h file
for more information. This feature can be disabled to save memory (See
CLConfig.h)

* Fixed StringBuffer.append(double). This has the side effect of
changing the query.toString value, which should now be more similar to
the results that the java version returns.

* Made some significant changes to WildcardTermEnum. This should speed
up this query a lot.

-- 
Ben van Klinken


