Foreword Preface Contents Author Index Keyword Index Search Help

Foreword

One of the challenges of embracing the information age is to enhance and carry forward the enormous amount of information that is archived in paper format. In this collection we have collected together the 14692 pages of information from the 32 volumes of conference proceedings of CRYPTO and EUROCRYPT. In addition, we have derived textual information that can be used to index and search this archive.

Compressing this much information onto a single CDROM required significant effort, but it was felt that this would enhance the usability of the collection with current technology. As a rough estimate we might assume that one printed volume of cryptology proceedings contains in the average about 460 pages. If we assume that a volume of 460 pages is 3.5 centimeters thick, one has to store 1.12 meters of paper proceedings. Suppose one page of a proceedings volume contains in the average 380 words or, including punctuation, 2500 characters (e.g. one page of volume 963 of LNCS contains 482 words or 3200 characters in the average whereas volume 196 contains only 253 words or 1710 characters per page). In this case we have to store 5.582.960 words or 36.730.000 characters or in computer terms about 40 megabytes if we store it as ASCII text.

Unfortunately, producing such text is nearly impossible, and we have chosen to provide information in the form of PDF files containing images. This is dictated by the content of the volumes, which are predominantly text, but are also mathematical in nature, containing many formulas and mathematical expressions. Over the years the fonts and typefaces changed from typewriter styles to DVI files, and particularly the quality of some early printed source documents is rather poor (especially the proceedings of CRYPTO 81 and EUROCRYPT 86). These factors contribute to a very high error rate for optical character recognition (OCR). Since mathematical content is of no value if the accuracy is compromised, we chose to deliver an electronic product that is as faithful as possible to the original material.

Given that a CDROM has a capacity of approximately 650 MB, this implies that the size of one proceedings page should not be much larger than about 40 KB, in order to leave room for a Keyword Index, an Author Index, the Table of Contents and a search engine for efficient and convenient retrieval of the documents.

By experimentation we learned that 400 dpi is a resolution where the OCR software could be trained to produce reasonable results. One page, scanned with a resolution of 400 dpi, has an average size of 140 KB when stored as 4636x3232 resolution TIF file. The TIF files served as the basis for the OCR process, because we need the text versions to produce indices. Once the TIF images were produced, we used an automatic process to crop white space from the borders, and transformed into PDF files using some of the software in the IBM database of US Patents. We experimented a great deal with different settings to balance the space requirement against the quality of the result. The final process took several days of processing on a personal computer.

Creating a search engine for OCR scanned text is a challenge in itself, from both an algorithmic and software point of view. We experimented with various approaches to this, and Kevin McCurley finally decided to write a Java applet for incorporation into the CDROM. This has several advantages:

Unfortunately Java is still rather slow, consumes substantial memory, and has not yet reached full maturity as a programming language. As a result, we expect that some users may have trouble using the Applet, but perhaps this situation will improve with time.

From an algorithmic point of view, the problem of searching OCR data for keywords is the dual problem of spell checking - in the case of spell checking you assume the dictionary is correct, and compare a possibly incorrect word against the dictionary. In the case of searching OCR data, you assume the errors are in the dictionary (unless these can be removed by reference to a dictionary appropriate to the context), and look for occurences of the (presumably correct) search words in your approximate data. A great deal of work has been done in this field in the last few years, but we decided to adopt a simple approach for the applet. The method used by the applet is simply to check each string that is an edit distance of at most one from the target string, and see whether it appears in the text. For this purpose we use a hash table to locate all references to a given string. Note that if this method would not scale well to allow an edit distance of two, since the complexity of the algorithm is exponential in the maximal edit distance d.

In addition, we encountered further questions concerning quality control:

We are satisfied that our process properly addressed the third point, but the first two remain a concern. When working with the CDROM you will certainly find errors, rough patches, and deficiencies. We invite you to tell us about them and send us suggestions for improvements. Any further information that we can provide to enhance the usability of this CD will be placed at the IACR web site (http://www.iacr.org/cd/).

The process of creating this work has been a collaboration between several people. We would like to particularly thank Andy Clark, Alfred Hofmann, Thomas Berson, Whitfield Diffie, Joan Feigenbaum, Bart Preneel, Tom Griffin, Jason Zien, Sridhar Rajagopalan, and our student workers. Although a curious series of accidents during this project delayed the publication, we are quite satisfied that the result will be of use to the research community.




Claus Dieter Ziegler
Kevin S. McCurley
September 1998


Copyright © 1998, Springer-Verlag.