[sword-devel] A possible way to speed up was Re: Search optimized (still too slow)

Troy A. Griffitts sword-devel@crosswire.org
Thu, 08 Apr 2004 14:24:34 -0700


William,
	CLucene should still be an option for index searches.  I haven't tried 
it lately, but it is the intended route if you would like to use that 
search engine.  It is not only fast, but also provides additional syntax 
options for search strings.  I don't know of anyone who has compiled it 
on the Mac.  It seems I always have a few small things I have to change 
in the code to get it to compile, depending on the system I'm 
compiling it for.  My hope is that we can supply some patches to improve the 
clucene build system.

	-Troy.



William Thimbleby wrote:
> What happened to clucene? I've been trying to get it to work, but no
> luck as yet. With all the talk of speeding up searches (and I don't
> know too much about searching), I think the only sensible way to
> search anything biggish is to create an index. Yes, with faster
> computers and more memory we can just read the bibles in and run
> through them fast. However, searches can get complicated, and modules
> bigger.
> 
> Perhaps an index could be created the first time a module is searched,
> much in the same way MacSword and BibleTime cache the contents of
> lexicons to speed them up. Ideally we wouldn't have to do this either.
> 
> Using indexes would not be helped by separating content and markup.
> Other things, such as rendering speed, might be - I don't know.
> 
> –Will
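
Will's idea of building an index the first time a module is searched can be sketched roughly as below. This is only an illustration of a simple inverted index, not how CLucene or SWORD implement it; the names Verses, WordIndex, buildIndex and lookup are made up for this example, and a module is assumed to be nothing more than a list of plain-text verses.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a module: entry i holds the plain text of verse i.
using Verses = std::vector<std::string>;

// Maps each lower-cased word to the set of verse numbers that contain it.
using WordIndex = std::map<std::string, std::set<int>>;

// Build the index once, e.g. the first time a module is searched.
WordIndex buildIndex(const Verses &verses) {
    WordIndex index;
    for (int i = 0; i < static_cast<int>(verses.size()); ++i) {
        std::string word;
        // The appended space makes sure the last word of a verse is flushed.
        for (char c : verses[i] + " ") {
            unsigned char uc = static_cast<unsigned char>(c);
            if (std::isalpha(uc)) {
                word += static_cast<char>(std::tolower(uc));
            } else if (!word.empty()) {
                index[word].insert(i);
                word.clear();
            }
        }
    }
    return index;
}

// Look a (lower-case) word up without scanning any verse text.
std::set<int> lookup(const WordIndex &index, const std::string &word) {
    auto it = index.find(word);
    if (it == index.end()) {
        return std::set<int>();
    }
    return it->second;
}
```

After the one-off cost of buildIndex, each lookup is a map search rather than a pass over the whole text. A real engine would also have to handle phrases, stemming and persistence of the index on disk, none of which is attempted here.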
> 
> On 8 Apr 2004, at 19:27, Daniel Glassey wrote:
> 
>> Hiya,
>> I was going to wait until I had thought this through (and had got
>> somewhere) but since it has been brought up I think I'd better mention
>> it. Quite a while back David White suggested that separating content
>> from markup would be a good idea. With the files getting big by using
>> raw OSIS (or is it pseudo-OSIS, I'm not sure) and the search being so
>> slow in these modules, I think it is worth doing - to aim for 1.6.0 or
>> 2.0.0 or whatever the next major version is.
>>
>> What I'm suggesting is to make a new module type that contains a binary
>> representation of OSIS with the text in one file and the markup in a
>> second file. I think the markup should be based on something like WBXML
>> (http://www.w3.org/TR/wbxml/) but have pointers into the text rather
>> than containing the text.
>> Suggested name SBXML (Sword Binary XML)
>> This would mean that the search could be made on just the plain text.
>> Most filters would only operate on the markup.
>>
>> If we think it's a good idea then let's try to design this using the
>> wiki. I've added a page for it[1].
>>
>> I think it should be possible to subclass the existing classes for use
>> by new module drivers and filters so that the current code will continue
>> to work.
>>
>> Until it is ready to become core, it would be optionally included via a
>> configure option.
>>
>> I don't think I've explained that very well so questions, discussion,
>> plain opinions and constructive criticism would be very welcome :)
>>
>> I'm starting from the bottom up, so I'm currently looking at changing
>> VerseKey (new class VerseKey2) to support multiple versification
>> systems. I'll explain that once I get far enough to do so. But it's
>> basically going to be based on the OSIS refsys system[2] and it is going
>> to lump all the books together rather than separating into testaments.
>> Chris, I see now you've already been doing something on the
>> versification stuff[3], how is that going?
>>
>> Regards,
>> Daniel
>>
>> [1]http://www.crosswire.org/ucgi-bin/twiki/view/Swordapi/SbXml
>> [2]http://www.ccel.org/refsys/refsys.html
>> [3]http://www.crosswire.org/ucgi-bin/twiki/view/Swordapi/AlternateVersification
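
Daniel's proposal of keeping the text in one place and markup that points into it can be illustrated with the rough sketch below. It is not the WBXML encoding and not an agreed SBXML design; MarkupSpan, SplitEntry, contains and render are hypothetical names, and the sketch assumes markup spans are sorted and do not overlap.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical markup record: an element name plus the half-open range
// [begin, end) of the plain text that it applies to.
struct MarkupSpan {
    std::string element;
    std::size_t begin;
    std::size_t end;
};

// One entry, split into searchable text and markup that points into it.
struct SplitEntry {
    std::string text;                // content only, no tags
    std::vector<MarkupSpan> markup;  // offsets into text
};

// A search touches only the plain text.
bool contains(const SplitEntry &e, const std::string &needle) {
    return e.text.find(needle) != std::string::npos;
}

// A filter works from the markup; this one re-inserts simple tags.
// Assumes spans are sorted by begin and do not overlap.
std::string render(const SplitEntry &e) {
    std::string out;
    std::size_t pos = 0;
    for (const MarkupSpan &m : e.markup) {
        out.append(e.text, pos, m.begin - pos);        // text before the span
        out += "<" + m.element + ">";
        out.append(e.text, m.begin, m.end - m.begin);  // text inside the span
        out += "</" + m.element + ">";
        pos = m.end;
    }
    out.append(e.text, pos, std::string::npos);        // remaining text
    return out;
}
```

With this layout a phrase that straddles a tag in the original XML is still one contiguous run in text, so a plain substring search finds it without any filter having to strip markup first.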
>>
>> On Thu, 2004-04-08 at 14:59, Joachim Ansorg wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Hi,
>>> I spent some time to optimize the search in CVS.
>>> The problem is/was, for example, the extensive use of XMLTag in the
>>> filters. I tried to avoid it in the filters where that was possible
>>> without having to rewrite them.
>>> I also used SWBuf::append directly where SWBuf::operator+ was used
>>> before.
>>>
>>> I see some good chances where we can optimize:
>>>     -Use XMLTag as little as possible
>>>     -Change the copy constructor of SWBuf to implicit sharing; we have
>>>      lots of SWBuf copy-constructor calls, I think
>>>     -Optimize SWBuf::append(char); maybe we can tweak the memory
>>>      allocation to alloc larger blocks but less often. The append(char)
>>>      function gets called more than any other function in a search
>>>
>>> But the best solution would be to parse the text only once and then
>>> do the right stuff with it. ATM each filter parses the text again,
>>> which will make modules with lots of filters slow (e.g. KJV).
>>>
>>> I got these results (with debug code and profiling code included):
>>> WEB:
>>> before:    0m8.233s
>>> after:    0m7.586s
>>>     
>>> KJV:
>>> before:    1m35.769s
>>> after:    0m21.874s
>>>
>>>
>>> I have not yet committed, because I have to make sure the code
>>> doesn't have some untested bugs.
>>>
>>> Joachim
>>
>> _______________________________________________
>> sword-devel mailing list
>> sword-devel@crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>