Papers Past relaunched on Greenstone
September 10th, 2007
Today we announced that our beloved Papers Past website has been relaunched with a fantastic full-text search facility. Papers Past first launched in 2001 with digital image scans of pages, and a lovely chunky Java applet for zooming in and out of articles... but alas no full-text search. Well, the wait is now over: the project team has built a new web application and completed the first phase of extracting the text from the image scans using Optical Character Recognition (OCR). A mammoth exercise.
Browse by region interface for Papers Past in 2007.
The OCR process itself is super-interesting, but today we wanted to talk about Greenstone, the application behind the website. Greenstone is a digital library platform that came out of a University of Waikato research project about 10 years ago. There have been some quite amazing technical developments as part of this project, and it's worth making a big deal about them because Greenstone is open source and we hope the work will be reused. A big up to the team at DL Consulting who have been leading the Greenstone work for us (plus Click Suite for their user interface work, and Gordon Paynter, Tracy Powell and the rest of the team at the National Library).
The main challenge was really about scale. The collection contains 46,927 newspaper issues with searchable text (comprising 254,295 pages and 3,161,143 articles) and 160,740 newspaper issues without searchable text (comprising 865,071 pages). As the OCR process continues the number of full-text pages will rise to over one million... and that presented some challenges for Greenstone.
A significant amount of "streamlining" was done to ensure the system worked well with such a large amount of data. First, the systems for importing the data into Greenstone were optimised to be as fast as possible, as even a very minor bottleneck became significant when processing millions of articles. Likewise, the display-time code was carefully streamlined for optimum performance.
The size of the metadata is also quite extraordinary. Aside from the typical content metadata, we also needed to store the bounding co-ordinates (the location) of every single word and article within a page. This is to support new interface features that highlight search phrases on top of the image scans, and also to allow users to switch between page and article level views. This bounding metadata was extracted from the digital images into the METS/ALTO XML format, and support for it had to be custom-built for Greenstone. It seems crazy... but the metadata for each page is typically bigger in size than the source TIFF image file.
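To make the highlighting idea concrete, here is a minimal sketch of looking up word boxes in ALTO-style XML. ALTO records each word as a `String` element with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample fragment, the co-ordinates and the `find_word_boxes` helper are all made up for illustration, the XML namespace that real ALTO files carry is left out for brevity, and this is not the Greenstone code itself.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-style fragment (illustrative only, not real Papers Past data).
# Real ALTO files declare an XML namespace, omitted here to keep the sketch short.
ALTO_SAMPLE = """
<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Wellington" HPOS="120" VPOS="340" WIDTH="210" HEIGHT="38"/>
            <String CONTENT="Harbour" HPOS="345" VPOS="340" WIDTH="160" HEIGHT="38"/>
          </TextLine>
          <TextLine>
            <String CONTENT="wellington" HPOS="120" VPOS="390" WIDTH="205" HEIGHT="36"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def find_word_boxes(alto_xml, term):
    """Return (x, y, width, height) for each word equal to term, ignoring case."""
    root = ET.fromstring(alto_xml)
    wanted = term.lower()
    boxes = []
    for node in root.iter("String"):
        if node.get("CONTENT", "").lower() == wanted:
            # Co-ordinates may be written as decimals, so parse as float first.
            box = tuple(int(float(node.get(name)))
                        for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            boxes.append(box)
    return boxes


if __name__ == "__main__":
    # Each box could be drawn as a translucent rectangle over the page image.
    print(find_word_boxes(ALTO_SAMPLE, "Wellington"))
```

With one such element per word, it is easy to see how the metadata for a dense newspaper page can end up larger than the page image.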
We worked through all these issues, and were very pleased to hear that DL Consulting have been granted research funding from the Foundation for Research, Science and Technology (since merged into MSI) to continue their work, allowing Greenstone to scale to even larger collections.
While the redevelopment grew from the need to provide better text search facilities, we also recognise that, for reading online, many users will prefer the image version to the HTML text. The OCR process is not perfect, so some letters can get muddled in the HTML text version. For this reason image viewing is still an important part of the site. A new image server was built using Perl to dynamically extract and cache web-friendly GIF versions.
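The "extract and cache" pattern can be sketched roughly as follows. The real image server is written in Perl and we are not reproducing it here; this is a Python illustration in which the `DerivativeCache` class, the `render` callable, the page identifier and the cache layout are all assumptions made up for the example.

```python
import hashlib
import os
import tempfile


class DerivativeCache:
    """Generate a derived image on first request, then serve the stored copy."""

    def __init__(self, cache_dir, render):
        self.cache_dir = cache_dir
        # Hypothetical callable: (page_id, region, scale) -> image bytes.
        # In a real server this would crop and convert the source scan.
        self.render = render
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, page_id, region, scale):
        # One cache file per distinct request, named by a hash of its parameters.
        key = f"{page_id}|{region}|{scale}".encode("utf-8")
        return os.path.join(self.cache_dir, hashlib.sha1(key).hexdigest() + ".gif")

    def get(self, page_id, region, scale):
        path = self._path(page_id, region, scale)
        if not os.path.exists(path):
            data = self.render(page_id, region, scale)
            tmp = path + ".tmp"
            with open(tmp, "wb") as fh:
                fh.write(data)
            # Rename into place so a reader never sees a half-written file.
            os.replace(tmp, path)
        with open(path, "rb") as fh:
            return fh.read()


if __name__ == "__main__":
    def fake_render(page_id, region, scale):
        # Stand-in for the expensive image extraction step.
        return b"placeholder image bytes"

    cache = DerivativeCache(tempfile.mkdtemp(), fake_render)
    cache.get("page-0001", (0, 0, 500, 500), 0.5)  # rendered and stored
    cache.get("page-0001", (0, 0, 500, 500), 0.5)  # served from the cache
```

The point of the design is that the expensive extraction from the source scan happens once per distinct view, and repeat requests are served straight from disk.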
Another enhancement to Greenstone involved improving the integration with Lucene. One of the reasons we went with Lucene is that it supports fuzzy search, which helps when searching across imperfect OCR text. Greenstone already had Lucene support available, but it needed some tweaking so it could scale to the estimated 50 GB of text that the site will grow to.
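To show why fuzzy search helps with OCR errors, here is a small sketch of edit-distance matching, the general idea behind Lucene's fuzzy queries (written as `term~` in Lucene's query syntax). The function names, the sample words and the two-edit threshold are assumptions for illustration only; Lucene's own implementation is far more efficient and is not shown here.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, words, max_edits=2):
    """Return the words within max_edits of the query, ignoring case."""
    q = query.lower()
    return [w for w in words if edit_distance(q, w.lower()) <= max_edits]


if __name__ == "__main__":
    # Made-up examples of the kind of slips OCR makes (1 for l, c for o).
    ocr_words = ["Wel1ington", "Wellingtcn", "Washington", "harbour"]
    print(fuzzy_matches("Wellington", ocr_words))
    # prints ['Wel1ington', 'Wellingtcn']
```

An exact-match search for "Wellington" would miss both misread words, while a fuzzy search picks them up without also pulling in "Washington".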
The Lucene development work has already been contributed back to the open source community, and there are plans to also contribute the METS/ALTO import process, the image server, and the core enhancements to the scalability of Greenstone. We're feeling good about the relaunch of the site and it will be interesting to see how it goes down with users.