Papers Fast

I've just finished writing up a project we completed earlier this year: Papers Fast.

Some background

Papers Past was re-launched in 2007 with a new look and new features – particularly search – and quickly became the National Library's most popular website. In the first year the number of visits per month increased 20-fold, and then it kept growing. But even when it was re-launched, Papers Past was not a fast website. And as time passed, and the number of users grew, and the number of pages increased, we noticed it was becoming slower and slower.

To start with, we had an easy solution: when we noticed the site was slowing down, we added another web server to share the load. We started with three web servers. By the time we got to eight, this approach had stopped working: adding new web servers did not make Papers Past any faster. Worse, we had built up a backlog of almost half a million pages of searchable text that we could not put online because we were worried the whole system would grind to a halt.

Drastic action was necessary.

So the Papers Fast project was launched. Its goal: to make Papers Past fast.

What's the problem?

After talking to people who might know, we identified four factors that could be causing problems:

  • Application. As far as we know, Papers Past is the biggest and most-used Greenstone installation in the world. Maybe Greenstone cannot scale up far enough?
  • CPU. Papers Past was running on old Sun SPARC servers that were due for a refresh. Maybe new servers would do the trick?
  • NFS. Most of the Papers Past data is served up using the Network File System (NFS) protocol. Is this a good choice for Greenstone?
  • Network. The Papers Past data is stored on a different part of the network from the web servers, behind a firewall. Is this a problem?

Which was it? To find out, we borrowed a massive computer with 24 terabytes of disk from GEN-i, copied over all our digitised newspaper data, and asked DL Consulting to install a fresh copy of Greenstone, setting up an entirely separate copy of Papers Past.

Then we built a fake collection with 2.5 million searchable pages, used JMeter and our Apache logs to put the test system under twice as much load as we'd ever seen before, and watched to see what would happen.
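
For the technically curious: the load test itself was a JMeter test plan, but the idea is simple enough to sketch in a few lines of Python. This is only an illustration – the test host and log file names are made up, and it is not the script we actually ran:

    # Sketch: replay request paths from an Apache access log against a test
    # host at a multiple of the recorded rate. Illustration only – the real
    # load test was a JMeter test plan, and the host and file names are made up.
    import re
    import time
    import urllib.request

    TEST_HOST = "http://paperspast-test.example.org"  # hypothetical test server
    LOG_FILE = "access.log"                           # Apache combined-format log
    SPEEDUP = 2.0                                     # aim for twice the recorded load

    # Pull the request path out of a combined-format log line.
    REQUEST_RE = re.compile(r'"GET (\S+) HTTP/1\.[01]"')

    paths = []
    with open(LOG_FILE) as log:
        for line in log:
            match = REQUEST_RE.search(line)
            if match:
                paths.append(match.group(1))

    # Fixed pacing as a placeholder; a real replay would pace requests from
    # the log timestamps, compressed by SPEEDUP.
    interval = 1.0 / SPEEDUP
    for path in paths:
        try:
            with urllib.request.urlopen(TEST_HOST + path, timeout=30) as response:
                response.read()
        except Exception as error:
            print(f"{path}: {error}")
        time.sleep(interval)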

We found the problem was... all of the above.

So what to do?

The first fix was to upgrade Papers Past search to use Apache Solr instead of Apache Lucene. The second was to replace our eight aging web servers with two new Sun Blade servers with AMD CPUs. Third, we switched to local disk for the metadata and indexes (we'll upgrade to a fibre-attached SAN by the end of the year).
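
Roughly speaking, a Solr-backed search is just an HTTP request to Solr's standard select handler. Here's a minimal Python sketch – the Solr host, core name and query are made up, not our actual Greenstone configuration:

    # Sketch: query a Solr core over HTTP and print the matching documents.
    # The Solr URL and core name are hypothetical.
    import json
    import urllib.parse
    import urllib.request

    SOLR_SELECT = "http://solr.example.org:8983/solr/newspapers/select"

    def search(query, rows=20):
        """Run a full-text query and return the list of matching documents."""
        params = urllib.parse.urlencode({
            "q": query,    # the user's search terms
            "rows": rows,  # results per page
            "wt": "json",  # ask Solr for a JSON response
        })
        with urllib.request.urlopen(f"{SOLR_SELECT}?{params}", timeout=10) as response:
            data = json.load(response)
        return data["response"]["docs"]

    if __name__ == "__main__":
        for doc in search("Kai Tiaki"):
            print(doc)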

Then we built a new fully-searchable collection (including three new titles) and re-launched on 22 June 2009, two days ahead of schedule!

And no technology project would be complete without a little scope creep. In this case, we had to support the METS/ALTO journal profile so we could add Kai Tiaki: the Journal of the Nurses of New Zealand to the collection, and extend the image server to support new titles digitised in greyscale. DL Consulting made these changes, and a few more, along the way.
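
For the curious: ALTO is just XML describing the OCR'd words on each page, so extracting the searchable text from a page file is straightforward. Here's a rough Python sketch – the file name is made up, and real ALTO files differ slightly between schema versions:

    # Sketch: pull the plain text out of an ALTO XML page file, one line of
    # output per TextLine element. Illustration only.
    import xml.etree.ElementTree as ET

    def alto_page_text(path):
        """Return the page text from an ALTO file, one line per TextLine."""
        tree = ET.parse(path)
        lines = []
        # Match on local tag names so this works whichever ALTO namespace
        # version the file declares.
        for element in tree.iter():
            if element.tag.endswith("TextLine"):
                words = [child.get("CONTENT", "")
                         for child in element
                         if child.tag.endswith("String")]
                lines.append(" ".join(words))
        return "\n".join(lines)

    if __name__ == "__main__":
        # Hypothetical file name for illustration only.
        print(alto_page_text("kai-tiaki-page-0001.xml"))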

Did it work?

Yes. We've been serving more traffic, and response times have been faster.

For Papers Past, we track traffic from Google separately from everyone else (it's a long story, but the core problem is that we serve so much data to Google that our aging web statistics package can't crunch the numbers).
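
Conceptually the split is nothing fancy: partition the access logs by user agent before the statistics run. A rough Python sketch of the idea (not our actual reporting pipeline):

    # Sketch: split an Apache access log into Googlebot traffic and everything
    # else by matching on the user-agent string. Illustration only.
    with open("access.log") as log, \
         open("google.log", "w") as google, \
         open("other.log", "w") as other:
        for line in log:
            # The user agent is the last quoted field in the combined format.
            if "Googlebot" in line:
                google.write(line)
            else:
                other.write(line)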

Traffic from everyone is way up – especially Google, who have slurped up about 700,000 pages per day lately, peaking at over a million. Before the upgrade, we had a lot of trouble getting Papers Past fully indexed in Google News Archive, but now it is pretty much all there.

Despite this increased traffic, Papers Past response times are much improved. We have been monitoring response times since 2007, and set out very clear performance targets before we kicked off Papers Fast. Here are the performance targets, and the times we observed before and since the changes were made. (All times are in milliseconds.)

Performance measure | Target | Before changes | Since changes
Average response time for a generated page request, measured by Google Webmaster Tools | 1000 | 3000-5000 | 600-800
Average response time for a generated page request, measured by the Library | 1000 | > 3500 | 402
Average response time for a search page request, measured by the Library | 1500 | 6639 | 1055
Average response time for an image server request, measured by the Library | 6000 | 11158 | 3574

Finally, it has made a big difference to our infrastructure. The NFS traffic to one of our fileservers dropped when we moved the Papers Past metadata and search indexes onto local disk, which has also freed up the corresponding network capacity.

Summary

On 22 June 2009 Papers Past users not only got half a million more searchable pages, they also got a big speed boost. Traffic is up since then, but response times have remained low, and we have a plan to handle more data (the SAN) and more users (extra front-end servers).

By Gordon Paynter

Gordon likes libraries.
