[jsword-devel] Lucene
Zhaojun Li
lzj369 at gmail.com
Tue Dec 19 09:58:05 MST 2006
Very good! I tried RAMDirectory to generate the index and then dump it to disk
once it is done in memory. It now takes only 20-30 seconds; it used to take
around 40 minutes!
/**
 * Generate an index to use, telling the job about progress as you go.
 * @throws BookException If we fail to read the index files
 */
public LuceneIndex(Book book, URL storage, boolean create, String f)
    throws BookException
{
    assert create;

    this.book = book;

    File finalPath = null;
    try
    {
        finalPath = NetUtil.getAsFile(storage);
        this.path = finalPath.getCanonicalPath();
    }
    catch (IOException ex)
    {
        throw new BookException(Msg.LUCENE_INIT, ex);
    }

    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    IndexStatus finalStatus = IndexStatus.UNDONE;
    try
    {
        synchronized (CREATING)
        {
            book.setIndexStatus(IndexStatus.CREATING);
            File tempPath = new File(path + '.' + IndexStatus.CREATING.toString());

            // Build the whole index in memory first (needs the
            // org.apache.lucene.store.RAMDirectory and Directory imports);
            // writing Lucene's many small temporary files to disk is what
            // made indexing so slow.
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(ramDir, analyzer, true);

            // The old, disk-based writer:
            // IndexWriter writer = new IndexWriter(tempPath.getCanonicalPath(), new SimpleAnalyzer(), true);

            List errors = new ArrayList();
            newgenerateSearchIndexImpl(errors, writer, book.getGlobalKeyList());
            writer.optimize();
            // Close the in-memory writer before merging so its segments are committed.
            writer.close();

            // Merge the in-memory index into a single on-disk index at the final path.
            IndexWriter fsWriter = new IndexWriter(path, analyzer, true);
            fsWriter.addIndexes(new Directory[] { ramDir });
            fsWriter.close();

            // Left over from the disk-based approach; the writer above already
            // wrote straight to the final path.
            tempPath.renameTo(finalPath);
            if (finalPath.exists())
            {
                finalStatus = IndexStatus.DONE;
            }

            if (errors.size() > 0)
            {
                // TODO: report these errors somewhere; buf is currently unused.
                StringBuffer buf = new StringBuffer();
                Iterator iter = errors.iterator();
                while (iter.hasNext())
                {
                    buf.append(iter.next());
                    buf.append('\n');
                }
            }
        }
    }
    catch (IOException ex)
    {
        throw new BookException(Msg.LUCENE_INIT, ex);
    }
    finally
    {
        book.setIndexStatus(finalStatus);
    }
}
On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
>
> Thanks, DM!
> Can we do indexing all in memory? That way we can avoid generating temp
> files. Most PCs have enough memory now.
>
> On 12/19/06, DM Smith <dmsmith555 at yahoo.com> wrote:
> >
> > While the design is recursive, it is probably not going to recurse
> > except for Raw GenBooks.
> > In JSword the interface for a Key allows for any Key to have children.
> > This would be akin to a book having chapters and chapters having verses.
> > However in the case of a Bible the key is a flat list. With regard to the
> > storage requirements of a Key to the whole bible, the amount of storage it
> > takes is dependent upon what kind of optimization is used for the Key. It
> > might be a:
> > BitwisePassage with one bit for each verse in the Key. BitwisePassage
> > has a constant space requirement.
> > RangedPassage with very little storage overhead. Each range is stored
> > separately. It is slower to iterate over than any of the other
> > implementations.
> > DistinctPassage uses way too much storage, with one Key object per
> > verse.
> > PassageTally keeps a weight for each of the keys it stores. It is used
> > to prioritize search results.
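
Just to picture that trade-off, here is a rough, self-contained sketch; these
are simplified stand-ins I made up for illustration, not the actual JSword
Passage classes:

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Hypothetical illustration of the storage trade-off, not JSword code.
    public class PassageSketch
    {
        // "Bitwise" style: one bit per verse in the whole Bible, so the space
        // used is constant (roughly 31,000 verses is only a few KB) no matter
        // how many verses are actually selected.
        public static BitSet bitwise(int[] ordinals, int versesInBible)
        {
            BitSet bits = new BitSet(versesInBible);
            for (int i = 0; i < ordinals.length; i++)
            {
                bits.set(ordinals[i]);
            }
            return bits;
        }

        // "Ranged" style: store each contiguous run as a [start, end] pair.
        // Very compact for a handful of ranges, but iterating means expanding
        // every range verse by verse.
        public static List ranged(int[] sortedOrdinals)
        {
            List ranges = new ArrayList();
            int start = -1;
            int prev = -2;
            for (int i = 0; i < sortedOrdinals.length; i++)
            {
                if (sortedOrdinals[i] != prev + 1)
                {
                    if (start >= 0)
                    {
                        ranges.add(new int[] { start, prev });
                    }
                    start = sortedOrdinals[i];
                }
                prev = sortedOrdinals[i];
            }
            if (start >= 0)
            {
                ranges.add(new int[] { start, prev });
            }
            return ranges;
        }
    }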
> >
> > I have found that generating the search index is expensive, but I have
> > found ways to make it faster. The first thing is that Lucene uses lots of
> > temporary documents on disk to build the index. Depending on what hardware
> > I use, I can index an entire Bible in anywhere from under 2 minutes to 5
> > minutes. However, on Windows I found that it took in excess of 40 minutes,
> > and that was with an AMD 2400+. I did two things that got it down to a few
> > minutes. First, I turned off Microsoft's "fast index"; it turns out MS
> > tried to index all of these temporary documents, when it should not have
> > tried to index any. Second, I was using a "smart" virus scanner that
> > scanned every document as it was deposited on the disk, or perhaps as it
> > was accessed from the disk; I am not sure which. Turning both of these off
> > gave me an indexing time of about 4 minutes. Subsequently, I replaced the
> > virus scanner and never turned "fast indexing" back on.
> >
> > However, I don't expect that we will build an index on a "small" device.
> > Rather, I would imagine we would pre-build it and load it.
> >
> > With regard to the Job class, we should change it to a Job interface and
> > make the current class an implementation of it. Then we can create a null
> > implementation of the Job that does nothing in contexts where there should
> > be no reporting of progress. Or we can create an appropriate implementation
> > for the target device.
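
A minimal sketch of that idea; the method name here is a guess at what the
interface would need, not the current JSword API:

    // Hypothetical Job interface extracted from the current progress-reporting class.
    public interface Job
    {
        void setProgress(int percent, String description);
    }

    // Null-object implementation for contexts where progress should not be
    // reported, e.g. a headless batch build or a small device.
    public class NullJob implements Job
    {
        public void setProgress(int percent, String description)
        {
            // intentionally does nothing
        }
    }

With a null object like that, there would be no need for a second, mirrored
generateSearchIndexImpl at all; the existing method could simply be handed a
NullJob.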
> >
> > Please note that, from the comment "// report progress" onward, none of
> > that code is needed if we don't report progress.
> >
> > Also, there are several opportunities for optimization here (e.g. the
> > number of verses in a Bible does not change as progress is made), and the
> > implementation should be generalized a bit more. It currently does not allow
> > for indexing commentaries or dictionaries; it should allow for indexing all
> > books.
> >
> > I'll see if I can make those changes.
> >
> > In Him,
> > DM
> >
> >
> > On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:
> >
> > Here are the two methods, one original, one mirror.
> >
> > /**
> >  * Dig down into a Key indexing as we go.
> >  */
> > private void newgenerateSearchIndexImpl(List errors, IndexWriter writer, Key key)
> >     throws BookException, IOException
> > {
> >     int bookNum = 0;
> >     int oldBookNum = -1;
> >     int percent = 0;
> >     String name = ""; //$NON-NLS-1$
> >     String text = ""; //$NON-NLS-1$
> >     BookData data = null;
> >     Key subkey = null;
> >     Verse verse = null;
> >     Document doc = null;
> >     for (Iterator it = key.iterator(); it.hasNext(); )
> >     {
> >         subkey = (Key) it.next();
> >         if (subkey.canHaveChildren())
> >         {
> >             newgenerateSearchIndexImpl(errors, writer, subkey);
> >         }
> >         else
> >         {
> >             data = null;
> >             try
> >             {
> >                 data = book.getData(subkey);
> >             }
> >             catch (BookException e)
> >             {
> >                 errors.add(subkey);
> >                 continue;
> >             }
> >
> >             text = data.getVerseText();
> >
> >             // Do the actual indexing
> >             if (text != null && text.length() > 0)
> >             {
> >                 doc = new Document();
> >                 doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> >                 doc.add(new Field(FIELD_BODY, new StringReader(text)));
> >                 writer.addDocument(doc);
> >             }
> >
> >             // report progress
> >             verse = KeyUtil.getVerse(subkey);
> >
> >             try
> >             {
> >                 percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
> >                 bookNum = verse.getBook();
> >                 if (oldBookNum != bookNum)
> >                 {
> >                     name = BibleInfo.getBookName(bookNum);
> >                     oldBookNum = bookNum;
> >                 }
> >             }
> >             catch (NoSuchVerseException ex)
> >             {
> >                 log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> >                 assert false;
> >                 name = subkey.getName();
> >             }
> >         }
> >     }
> > }
> >
> > /**
> >  * Dig down into a Key indexing as we go.
> >  */
> > private void generateSearchIndexImpl(Job job, List errors, IndexWriter writer, Key key)
> >     throws BookException, IOException
> > {
> >     int bookNum = 0;
> >     int oldBookNum = -1;
> >     int percent = 0;
> >     String name = ""; //$NON-NLS-1$
> >     String text = ""; //$NON-NLS-1$
> >     BookData data = null;
> >     Key subkey = null;
> >     Verse verse = null;
> >     Document doc = null;
> >     for (Iterator it = key.iterator(); it.hasNext(); )
> >     {
> >         subkey = (Key) it.next();
> >         if (subkey.canHaveChildren())
> >         {
> >             generateSearchIndexImpl(job, errors, writer, subkey);
> >         }
> >         else
> >         {
> >             data = null;
> >             try
> >             {
> >                 data = book.getData(subkey);
> >             }
> >             catch (BookException e)
> >             {
> >                 errors.add(subkey);
> >                 continue;
> >             }
> >
> >             text = data.getVerseText();
> >
> >             // Do the actual indexing
> >             if (text != null && text.length() > 0)
> >             {
> >                 doc = new Document();
> >                 doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> >                 doc.add(new Field(FIELD_BODY, new StringReader(text)));
> >                 writer.addDocument(doc);
> >             }
> >
> >             // report progress
> >             verse = KeyUtil.getVerse(subkey);
> >
> >             try
> >             {
> >                 percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
> >                 bookNum = verse.getBook();
> >                 if (oldBookNum != bookNum)
> >                 {
> >                     name = BibleInfo.getBookName(bookNum);
> >                     oldBookNum = bookNum;
> >                 }
> >             }
> >             catch (NoSuchVerseException ex)
> >             {
> >                 log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> >                 assert false;
> >                 name = subkey.getName();
> >             }
> >
> >             job.setProgress(percent, Msg.INDEXING.toString(name));
> >
> >             // This could take a long time ...
> >             Thread.yield();
> >             if (Thread.currentThread().isInterrupted())
> >             {
> >                 break;
> >             }
> >         }
> >     }
> > }
> >
> >
> > On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> > >
> > > Hi, Dear all,
> > >
> > > I am new to Lucene, so please help.
> > >
> > > I need to remove the Job class from the current Lucene implementation.
> > > What I did was create a mirror method of generateSearchIndexImpl with every
> > > Job class reference removed. I tested it and it works.
> > >
> > > However, the speed is not good. By design it is a recursive call. How
> > > could we do multithreading for this? I mean with the usual Thread class,
> > > not the JSword Job API.
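
One possible approach, for what it's worth: Lucene's IndexWriter can be shared
between threads for addDocument, so a flat key list could be fanned out to a
small worker pool. This is a rough sketch only; indexKey is a hypothetical
helper that would do the getData/addDocument work for one key:

    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Sketch: index a flat key list with two worker threads sharing one writer.
    private void indexInParallel(final IndexWriter writer, Key keys) throws InterruptedException
    {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (Iterator it = keys.iterator(); it.hasNext(); )
        {
            final Key subkey = (Key) it.next();
            pool.execute(new Runnable()
            {
                public void run()
                {
                    indexKey(writer, subkey); // hypothetical helper: book.getData + writer.addDocument
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

Nested keys (GenBooks) would still need the recursive walk, and much of the
time is probably spent in book.getData rather than in Lucene itself, so the
gain may be modest.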
> > >
> > > Thanks!
> > >
> > > Zhaojun
> > >
> > >
> > >
> > >
> > _______________________________________________
> > jsword-devel mailing list
> > jsword-devel at crosswire.org
> > http://www.crosswire.org/mailman/listinfo/jsword-devel
> >
>