Search Applet Instructions
This applet allows searching for individual keywords in the text of
the papers contained in the collection. There are several parameters
for the applet:
- You can choose either an exact search or an approximate search.
If you choose an approximate search, then it will search for all
words that match the search string with an edit distance of at most
one or two. Approximate search allows you to find references that
are incorrectly recognized by the OCR process.
- You can search for pages that contain all of the words in your
search string or any of the words.
- Search strings can be any words of four or more letters, but some
very common words are ignored. Thus when searching for the words "key
refresh", the word key is ignored because it is too short, and the applet
will instead return all occurrences of the word "refresh".
- There is a limit of 500 results that can be displayed for a search.
If you receive more than 500 results, try varying the search terms to
narrow the search.
Once the search window is displayed, it will take some time to load
the hash table used for performing searches. Once your search has
completed, the results will be listed in the area below. If you
highlight one of the results, then it will direct your browser to
load the indicated paper. If a word appears several times in a paper,
then each page on which the word appears will be listed in the results.
Clicking on each of these will cause the PDF file to be loaded, but you
will need to manually scroll forward to the indicated page. The URL
that is generated shows the actual page number.
Potential problems
Unfortunately, several things can go wrong with this applet.
- It has only been tested with Microsoft Internet Explorer
version 4.01 and Netscape 4.06.
The applet may run on other platforms, but security restrictions
may prevent it from working (the applet needs to read files from the CD,
and uses features of Java 1.1). Versions of Netscape prior to 4.06 may
require you to download the Java 1.1 update. No guarantee can be made
that this will work with later versions because the security
restriction is outside the scope of the Java language specification.
- Java is still a rather unstable software platform, and minor variations
between platforms may cause unusual behavior.
- If you have changed your default security settings for Internet
Explorer, then the applet loaded from disk may no longer have access
to the disk it was loaded from.
- You may not have enough memory. The inverted keyword index and
other data structures used by this applet are approximately five
megabytes in size.
- You may not be patient enough. Check the Java console to see if
any error messages have been reported, and be patient while it loads
the large hash table at startup.
How reliable is the search capability?
The search capability on this CD is based on text produced by an
optical character recognition process. As such, the text contains
many errors in it, and searches performed on this text can miss some
occurences of a given text string, and can also produce false positive
search results.
In order to deal with the errors that are inherent in the text, the
applet provides the capability of searching for keywords that have a
small Levenshtein distance from the keywords that you enter (also
called edit distance). In this way, even if the scanning process
scanned a word like "linear" and mistakenly read this as "lineer" or
"liner", then the approximate search capability will still consider
this a match.
In order to minimize the size of the inverted keyword index, we consider
certain common words to be useless search terms. Unfortunately, words
such as "Diffie" appear so frequently in the text that the applet may label
them as too common.
What is the search algorithm used?
The algorithm being used in this version is rather simplistic. When
designing a text retrieval algorithm, it usually involves a balance of
preprocessing time, memory, storage, running time of queries, and
accuracy of the results. In addition, algorithms can be made to run
faster if it takes into account the statistics of the underlying text.
Approximate text retrieval algorithms are even more complicated, since
they are very sensitive to the types of errors that are introduced
into the text.
The algorithm used for performing these searches has not been
documented, but has a running time of O(mn), where is the total number
of keywords entered and and n is the number of words in the original
corpus of documents. There are asymptotically faster algorithms, but
they tend to be difficult to implement with small memory requirements.
At present the search capability is limited to Levenshtein distance of
at most two in each keyword.
Wish List
- The applet should be able to get by with less memory. The current
implementation consumes approximately five megabytes for storing the
inverted keyword list.
- Regular expression matches should be possible.
- Boolean expressions should be possible in the query. For example,
you might want to locate all pages that contain the word recovery but not
the word escrow.
- Exact string matching should be supported in addition to simple word
searches.
- There are some common errors that result from OCR, such as the
misidentification of the string "bum" as "burn" caused by the fact
that the letter m is visually similar to the digram "rn". These are
not taken into account by the search applet.
I would be interested in hearing suggestions for future improvement or
bug reports. Information on how to do this, and any further information
that becomes available will be posted at
www.iacr.org/cd.
Kevin McCurley
®Java is a registered trademark of Sun Microsystems, Inc.
Copyright © 1998, Springer-Verlag.
In addition, the applet is Copyright © 1998, IBM Corporation.