Search Applet Instructions

This applet allows searching for individual keywords in the text of the papers contained in the collection. There are several parameters for the applet: Once the search window is displayed, it will take some time to load the hash table used for performing searches. Once your search has completed, the results will be listed in the area below. If you highlight one of the results, then it will direct your browser to load the indicated paper. If a word appears several times in a paper, then each page on which the word appears will be listed in the results. Clicking on each of these will cause the PDF file to be loaded, but you will need to manually scroll forward to the indicated page. The URL that is generated shows the actual page number.

Potential problems

Unfortunately, several things can go wrong with this applet.

How reliable is the search capability?

The search capability on this CD is based on text produced by an optical character recognition process. As such, the text contains many errors in it, and searches performed on this text can miss some occurences of a given text string, and can also produce false positive search results.

In order to deal with the errors that are inherent in the text, the applet provides the capability of searching for keywords that have a small Levenshtein distance from the keywords that you enter (also called edit distance). In this way, even if the scanning process scanned a word like "linear" and mistakenly read this as "lineer" or "liner", then the approximate search capability will still consider this a match.

In order to minimize the size of the inverted keyword index, we consider certain common words to be useless search terms. Unfortunately, words such as "Diffie" appear so frequently in the text that the applet may label them as too common.

What is the search algorithm used?

The algorithm being used in this version is rather simplistic. When designing a text retrieval algorithm, it usually involves a balance of preprocessing time, memory, storage, running time of queries, and accuracy of the results. In addition, algorithms can be made to run faster if it takes into account the statistics of the underlying text. Approximate text retrieval algorithms are even more complicated, since they are very sensitive to the types of errors that are introduced into the text.

The algorithm used for performing these searches has not been documented, but has a running time of O(mn), where is the total number of keywords entered and and n is the number of words in the original corpus of documents. There are asymptotically faster algorithms, but they tend to be difficult to implement with small memory requirements. At present the search capability is limited to Levenshtein distance of at most two in each keyword.

Wish List

I would be interested in hearing suggestions for future improvement or bug reports. Information on how to do this, and any further information that becomes available will be posted at www.iacr.org/cd.

Kevin McCurley


®Java is a registered trademark of Sun Microsystems, Inc.


Copyright © 1998, Springer-Verlag.

In addition, the applet is Copyright © 1998, IBM Corporation.