[sword-devel] Comming soon: new improved sword searching
Joel Mawhorter
sword-devel@crosswire.org
Sun, 15 Sep 2002 00:23:12 -0700
On September 14, 2002 21:39, David White wrote:
> Joel,
>
> Thank you for your response; comments are below.
>
> > Did you use an index for regular expression searching or did you just
> > match the regex against blocks of text? This is an area that I'm still
> > not sure how to best handle.
>
> As I said below, I used an index for regular expressions where you can
> work out that a certain word, or certain set of words, must be present
> for the regular expression to match. Otherwise you have to do a full
> text search.
How did you determine whether or not to use the index. I can imagine regexps
that would be slower using an index than just brute force searching a text.
> > > I didn't have the "search within 3 verses" thing, and I'm not sure how
> > > useful that is; all I can say is try it and see.
> >
> > This is potentially very useful (and I can say that fairly since I didn't
> > come up with the idea :-). Often in the Bible, sentences span multiple
> > verses. Therefore, if you remember a few words from a sentence in the
> > Bible you may not find it if you restrict your search to individual
> > verses. Also people may want to search for some passage that they vaguely
> > remember and may need to extend the search to groups of verses.
>
> Yes, I can see how this could be useful.
>
> > > I would tend to recommend to use AND, OR, and NOT instead of &, |, and
> > > !; since AND, OR, and NOT are likely to be recognizable to people who
> > > have learned how to use any pre-Google search engine effectively, while
> > > &, |, and ! are likely to be recognizable only to people who know C.
> >
> > This is a good point but let me explain why I think the using AND, OR and
> > NOT isn't the best way:
>
> ok, I still think that for an English-language frontend, using AND, OR,
> and NOT is best; but I do agree that it is perhaps best to handle this
> in the front end.
I don't think most of the frontends are trying to be English only frontends.
But since I'm such a nice guy (grin), here is what I'm willing to do. If any
of the frontend writers want to use a different grammar than what I create,
I'll write the parser for that grammar for them.
> The symbols !, and perhaps & (I don't think |) can
> perceivably found in texts, so using them as literals should probably be
> handled somehow as well.
I don't intend to support searching for punctuation as I don't think people
would really need that. Do you think this should be supported?
> > This is good to know. However, I don't think for searching it is
> > necessary to store markup seperate from the text. Instead I would like to
> > ensure that the index has enough information that searching will not have
> > to be done on the text itself, thus removing the need for striping the
> > text. If the only disk I/O involved is reading from specific spots in the
> > index, searching should be several times faster even than reading an
> > entire markup-free text from disk and searching that.
>
> Well, remember that you are almost always going to have to load and
> strip some verses, to make sure the phrase matches. For instance, if
> someone searched for the phrase "Jesus said", then you are going to have
> to search all verses which contain both "Jesus", and "said", and see if
> the words appear sequentially. If someone searched for "Jesus & said",
> naturally you aren't going to have to load anything in at all, but I
> think this would be substantially less common.
The way you suggest for phrase searching is the way I implemented it in the
Bible search program I created. However, another way is to store the location
of each occurrence of each word. That way if someone searches for "Jesus
said" you just need to find all of the verses that have Jesus as the nth word
and said as the n+1st word. Based on suggestions for the list, this second
way is the way I am currently implementing the searching for Sword.
> Using an index is going to speed things up, but you don't *need* an
> index to get fast search speeds. Just try using grep :-)
Then try implementing Google using grep and you'll find out why it isn't done
that way! :-> But seriously, grep has a very simple task to do and so can do
it fairly quickly for small amounts of text.
> > > That said, my Bible program did use indexing, in combination with the
> > > approach I just mentioned. It would store the location of all words
> > > that occurred less than a certain amount of times (1000 I believe).
> > > Other words would have acknowledgement of their existence stored, but
> > > not the actual verses - it's not worth storing the 40,000 verses that
> > > the word "the" appears.
> >
> > I'm don't think I agree on that point. The cost in disk space to store
> > very common words isn't really that high and people may want to search
> > for them.
>
> I seriously doubt anyone is going to want to search just for a common
> word, like "and". If they do, then I don't think it's a big deal that
> it'll take a couple of seconds to show them their results.
>
> People are going to sometimes search for common words mixed with
> uncommon words; for instance the phrase "the beginning". Now, if you
> just store the word "beginning", and not "the", then you will get 104
> verses (in the KJV), each of which you will have to load and search to
> see if they match the exact phrase. If you also store "the" in your
> index, then you will have 24,094 verses which match. You will have to
> find the common subset of verses which contain "beginning" and "the",
> and end up loading, stripping, and searching the 101 verses that contain
> both. The cost of loading 24,094 verse references, and doing a search
> through these references to find the common subset with 104 verse
> references is going to far outstrip the cost of reading in 3 extra
> verses and searching them. In this case, having the common words indexed
> will end up being a pessimization instead of an optimization.
Based on how I intend to do phrase searching, it is more costly to remove
common words than it is for the above technique.
> As for space, in my program, I stored the index as a list of words with
> references to verses those words appeared in - a byte each for book,
> chapter, and verse. So 3 bytes per reference, in addition to the cost of
> the words themselves and a little meta-data. I don't think you'd be able
> to get it much more compressed than that.
This is exactly how I did it for the Bible program I was working on. It must
be that great minds think alike ;-) For Sword, things will be a little
different. I'll post a page soon with the description of what I am planning
to use as an index file format.
> Anyhow, if I stored the
> locations of all words, the index file ballooned out to over 2 megabytes
> (for the KJV, or any typical Bible). When I stored only words that
> appeared less than 2000 times, it was just over 1 megabyte. That's going
> to be a substantial difference on space-limited devices.
Do you really think 1 megabyte makes that much difference? How space-limited
is the most limited device that Sword currently exists on? Very space-limited
devices may want to have no index at all. Certianly on most PCs, +/- 1 Mb
isn't a big deal. Perhaps some of the handheld frontend maintainers can give
some specs.
> Also, remember that the file being smaller will give faster access, even
> if you are only reading the same amount of data. Less levels of
> indirection have to be used on an inode-based filesystem, and on FAT the
> allocation table is smaller and faster to traverse; also there is a
> greater chance of cache hits with subsequent access to different parts
> of the file.
OK, now were talking about a difference of milliseconds. I'm not going to get
too worried about that.
> So anyhow, that's my case for not storing too-common words. I think you
> should at least consider making it optional, so that the module creator
> can decide if they want to index all words, or set a threshold, above
> which existence-only is recorded.
I think making which terms get indexed optional is a good idea. I'm not sure
that frequency is the best way to determine this since common words may still
be useful search terms (e.g. God appears in nearly 4000 verses in the KJV).
Perhaps there could be two options at index creation time. Either specify a
maximum frequency to index or provide a list of words not to index.
> > This is a good idea. Perhaps once I have the framework in place for
> > indexing and searching you could extend it with something like what you
> > describe. Unfortunately any time the index isn't sufficient to completely
> > do the search you will have to pay for stripping the text. Perhaps you
> > could talk to Troy and Chris about ways to either store the markup
> > seperately (as you suggested) or optimize the process of stripping the
> > markup. I think the lack of regex support is perhaps the most significant
> > shortcoming of what I have proposed so far. Of course, the current regex
> > support will still be there but if it can be speed up that would be
> > great.
>
> Well I don't see how the separation of markup from the text has any
> effect on this suggestion, since you will still be having to load in
> potentially matched verses to confirm that they match anyhow. You would
> use the same sort of mechanism.
Not with the phrase matching technique I plan to use. Of all of the types of
searches I have imagined for the future, only regexp searches will need to go
to the module text.
> However, I am still convinced that separating markup and text is the
> best solution. I am unsure how easy or hard it would be to create a
> module type which does this though, so it may or may not be feasible.
> Also, how are you planning to handle searching of markup itself? Using
> separation, it would be fairly easy: just iterate over the markup too,
> and search any markup that is flagged as being "searchable". Did you
> have plans to handle searching of markup in your system?
I have no plans to search markup. Do you think this would be useful?
> > > I'd be interested in helping out with this, if you want any help; I'm
> > > trying to get involved with Sword development myself. Let me know if
> > > you want any assistance.
> >
> > I would love to have help. Once I get a bit further in the
> > implementation, I'll have a better idea of what help I need with.
> > Probably I could use help in optimizing the searching code I create. I'll
> > keep you posted on what I'm doing.
>
> ok, great; I look forward to hearing about your progress on this.
>
> -David.
Keep the suggestion coming. I would like Sword searching to be as good as
possible so, the more heads thinking about it, the better.
In Christ,
Joel