[jsword-devel] Lucene
DM Smith
dmsmith555 at yahoo.com
Tue Dec 19 06:07:14 MST 2006
While the design is recursive, it is probably not going to recurse
except for Raw GenBooks.
In JSword the interface for a Key allows for any Key to have
children. This would be akin to a book having chapters and chapters
having verses. However, in the case of a Bible, the key is a flat list.
With regard to the storage requirements of a Key for the whole Bible,
the amount of storage it takes depends upon which optimization is
used for the Key. It might be a:
* BitwisePassage, with one bit for each verse in the Key.
  BitwisePassage has a constant space requirement.
* RangedPassage, with very little storage overhead. Each range is
  stored separately. It is slower to iterate over than any of the
  other implementations.
* DistinctPassage, which uses way too much storage, with one Key
  object per verse.
* PassageTally, which keeps a weight for each of the keys it stores.
  It is used to prioritize search results.
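To illustrate why a bit-per-verse representation has a constant space requirement, here is a minimal sketch using java.util.BitSet. It is not JSword's BitwisePassage; the class name VerseBitmap and the verse count of 31102 are assumptions made for the example (JSword itself obtains the count from BibleInfo.versesInBible()).

```java
import java.util.BitSet;

/** Hypothetical illustration only; this is not JSword's BitwisePassage. */
final class VerseBitmap {
    // Assumed verse count for the example; the real value comes from BibleInfo.
    static final int VERSES_IN_BIBLE = 31102;

    // One bit per verse ordinal, allocated up front, so the size does not
    // depend on how many verses are actually in the set.
    private final BitSet bits = new BitSet(VERSES_IN_BIBLE);

    /** Marks the verse with the given 1-based ordinal as present. */
    void add(int ordinal) {
        check(ordinal);
        bits.set(ordinal - 1);
    }

    /** Returns true if the verse with the given 1-based ordinal is present. */
    boolean contains(int ordinal) {
        check(ordinal);
        return bits.get(ordinal - 1);
    }

    /** Number of verses currently in the set. */
    int size() {
        return bits.cardinality();
    }

    private static void check(int ordinal) {
        if (ordinal < 1 || ordinal > VERSES_IN_BIBLE) {
            throw new IllegalArgumentException("No such verse ordinal: " + ordinal);
        }
    }
}
```

With these assumptions the bitmap needs roughly 4 KB whether it holds one verse or all of them, which is the trade-off against a range list that is tiny for contiguous passages but grows with fragmentation.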
I have found that generating the search index is expensive, but I
have found ways to make it faster. The first thing to know is that
Lucene uses lots of temporary documents on disk to build the index.
Depending on the hardware, I can index an entire Bible in anywhere
from under 2 minutes to 5 minutes. However, on Windows I found that
it took in excess of 40 minutes. This was with an AMD 2400+. I did
two things that got it down to a few minutes. First, I turned off
Microsoft's "fast indexing". It turns out MS tried to index all of
these temporary documents; it should not have tried to index any.
Second, I was using a "smart" virus scanner that scanned every
document as it was deposited on the disk or perhaps as it was
accessed from the disk; I am not sure which. Turning both of these
off gave me an indexing time of about 4 minutes. Subsequently, I
replaced the virus scanner and never turned "fast indexing" back on.
However, I don't expect that we will build an index on a "small"
device. Rather, I would imagine we would pre-build it and load it.
With regard to the Job class, we should change it to a Job interface
and make the current class an implementation of it. Then we can
create a null implementation of the Job that does nothing in contexts
where there should be no reporting of progress. Or we can create an
appropriate implementation for the target device.
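A rough sketch of what that split might look like follows. Only setProgress(int, String) is taken from the quoted code; the names NullJob, ConsoleJob and IndexerSketch are hypothetical, and the real Job class has more to it than this.

```java
/** Sketch of the proposed interface; only setProgress appears in the quoted code. */
interface Job {
    void setProgress(int percent, String status);
}

/** Null implementation: accepts progress reports and discards them. */
final class NullJob implements Job {
    public void setProgress(int percent, String status) {
        // Intentionally empty: nothing is reported in this context.
    }
}

/** Hypothetical implementation for a target that can only print text. */
final class ConsoleJob implements Job {
    private int lastPercent = -1;

    public void setProgress(int percent, String status) {
        // Print only when the percentage changes, to avoid flooding the output.
        if (percent != lastPercent) {
            System.out.println(percent + "% " + status);
            lastPercent = percent;
        }
    }

    int getLastPercent() {
        return lastPercent;
    }
}

/** Stand-in for the indexer: it talks to whichever Job it is handed. */
final class IndexerSketch {
    static int indexAll(Job job, String[] names) {
        int done = 0;
        for (String name : names) {
            done++;
            job.setProgress(95 * done / names.length, name);
        }
        return done;
    }
}
```

The point of the design is that the indexing code keeps a single code path: callers that want no reporting pass a NullJob instead of maintaining a second copy of the method with the Job references removed.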
Please note that none of the code following the comment
"// report progress" is needed if we don't report progress.
Also, there are several opportunities for optimization here (e.g. the
number of verses in a bible does not change as progress is made).
Also, the implementation should be generalized a bit more. This does
not allow for indexing commentaries or dictionaries. It should allow
for indexing all books.
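One possible shape for both points, as a sketch only: compute the total once before the walk, and base progress on the number of leaf keys processed rather than on a verse ordinal, so the same loop would work for commentaries and dictionaries. Node is a hypothetical stand-in for a JSword Key, and ProgressWalk is not existing JSword code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Key: a named node that may have children. */
final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
        this.name = name;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}

final class ProgressWalk {
    private final int totalLeaves;  // computed once, not on every iteration
    private int done = 0;
    final List<String> log = new ArrayList<>();

    ProgressWalk(Node root) {
        this.totalLeaves = countLeaves(root);
    }

    static int countLeaves(Node n) {
        if (n.isLeaf()) {
            return 1;
        }
        int sum = 0;
        for (Node c : n.children) {
            sum += countLeaves(c);
        }
        return sum;
    }

    /** Recurses into children; "indexes" each leaf and records progress. */
    void walk(Node n) {
        if (!n.isLeaf()) {
            for (Node c : n.children) {
                walk(c);
            }
            return;
        }
        done++;
        int percent = 95 * done / totalLeaves;
        log.add(percent + "% " + n.name);
    }
}
```

Counting leaves costs one extra pass over the key tree, which should be small next to fetching and indexing the text of each entry.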
I'll see if I can make those changes.
In Him,
DM
On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:
> Here are the two methods, one original, one mirror.
>
> /**
> * Dig down into a Key indexing as we go.
> */
> private void newgenerateSearchIndexImpl( List errors,
> IndexWriter writer, Key key) throws BookException, IOException
> {
> int bookNum = 0;
> int oldBookNum = -1;
> int percent = 0;
> String name = ""; //$NON-NLS-1$
> String text = ""; //$NON-NLS-1$
> BookData data = null;
> Key subkey = null;
> Verse verse = null;
> Document doc = null;
> for (Iterator it = key.iterator(); it.hasNext(); )
> {
> subkey = (Key) it.next();
> if ( subkey.canHaveChildren())
> {
> newgenerateSearchIndexImpl( errors, writer, subkey);
> }
> else
> {
> data = null;
> try
> {
> data = book.getData(subkey);
> }
> catch (BookException e)
> {
> errors.add(subkey);
> continue;
> }
>
> text = data.getVerseText();
>
> // Do the actual indexing
> if (text != null && text.length() > 0)
> {
> doc = new Document();
> doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> doc.add(new Field(FIELD_BODY, new StringReader(text)));
> writer.addDocument(doc);
> }
>
> // report progress
> verse = KeyUtil.getVerse(subkey);
>
> try
> {
> percent = 95 * verse.getOrdinal() /
> BibleInfo.versesInBible();
> bookNum = verse.getBook();
> if (oldBookNum != bookNum)
> {
> name = BibleInfo.getBookName (bookNum);
> oldBookNum = bookNum;
> }
> }
> catch (NoSuchVerseException ex)
> {
> log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> assert false;
> name = subkey.getName();
> }
>
>
> }
> }
> }
>
> /**
> * Dig down into a Key indexing as we go.
> */
> private void generateSearchIndexImpl(Job job, List errors,
> IndexWriter writer, Key key) throws BookException, IOException
> {
> int bookNum = 0;
> int oldBookNum = -1;
> int percent = 0;
> String name = ""; //$NON-NLS-1$
> String text = ""; //$NON-NLS-1$
> BookData data = null;
> Key subkey = null;
> Verse verse = null;
> Document doc = null;
> for (Iterator it = key.iterator(); it.hasNext(); )
> {
> subkey = (Key) it.next();
> if (subkey.canHaveChildren())
> {
> generateSearchIndexImpl(job, errors, writer, subkey);
> }
> else
> {
> data = null;
> try
> {
> data = book.getData(subkey);
> }
> catch (BookException e)
> {
> errors.add(subkey);
> continue;
> }
>
> text = data.getVerseText();
>
> // Do the actual indexing
> if (text != null && text.length() > 0)
> {
> doc = new Document();
> doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> doc.add(new Field(FIELD_BODY, new StringReader(text)));
> writer.addDocument(doc);
> }
>
> // report progress
> verse = KeyUtil.getVerse(subkey);
>
> try
> {
> percent = 95 * verse.getOrdinal() /
> BibleInfo.versesInBible();
> bookNum = verse.getBook();
> if (oldBookNum != bookNum)
> {
> name = BibleInfo.getBookName(bookNum);
> oldBookNum = bookNum;
> }
> }
> catch (NoSuchVerseException ex)
> {
> log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> assert false;
> name = subkey.getName();
> }
>
> job.setProgress(percent, Msg.INDEXING.toString(name));
>
> // This could take a long time ...
> Thread.yield();
> if (Thread.currentThread().isInterrupted())
> {
> break;
> }
> }
> }
> }
>
>
> On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> Hi, Dear all,
>
> I am new to Lucene, so please help.
>
> I need to remove the job class from the current Lucene
> implementation. What I did is: create mirror method from
> generateSearchIndexImpl by removing any Job class reference. I
> tested it and it works.
>
> However, the speed is not good. In the design, it is a recursive
> call. How to do multithreading for this? I mean by usual thread
> class, not JSWORD Job api.
>
> Thanks!
>
> Zhaojun
>
>
>
>
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel