Papers Past data has been set free
September 22nd, 2020 By Melanie Lovell-Smith
On Monday 24 August the National Library of New Zealand Te Puna Mātauranga o Aotearoa quietly let a set of data from Papers Past loose into the world.
Data from 78 New Zealand newspapers
Papers Past is the National Library’s fully text searchable website containing over 150 newspapers from New Zealand and the Pacific, as well as magazines, journals and government reports.
As a result of the release, people can now access the data from 78 New Zealand newspapers, from the Albertland Gazette to the Victoria Times, all published before 1900. The data itself consists of the METS/ALTO XML files for each issue. These XML files sit behind Papers Past and are what allow you to locate keywords within articles.
Work on this began back in 2015, so it is wonderful this has finally happened, and to have it live just as Greig Roulston, who was the main staff member behind the project, finishes up at the Library.
You can now download these files from the Dataset page on the website.
If you’d like to know more about METS/ALTO there’s some useful information, including a great diagram, on the Data Standards page.
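For anyone curious how ALTO files make keyword location possible, here is a minimal sketch in Python. It relies on the ALTO convention that each recognised word is a String element carrying CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes. The sample XML is invented for illustration and is not taken from the Papers Past dataset, the function names are hypothetical, and the namespace version in the real files may differ, which is why the code ignores namespaces.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration only; real Papers Past files are much
# larger and may use a different ALTO namespace version.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="The" HPOS="100" VPOS="200" WIDTH="40" HEIGHT="20"/>
            <SP/>
            <String CONTENT="Gazette" HPOS="150" VPOS="200" WIDTH="90" HEIGHT="20"/>
          </TextLine>
          <TextLine>
            <String CONTENT="gazette," HPOS="100" VPOS="230" WIDTH="95" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def find_keyword(alto_xml, keyword):
    """Return every word matching `keyword`, with its position on the page.

    Matching ignores case and surrounding punctuation.
    """
    root = ET.fromstring(alto_xml)
    wanted = keyword.lower()
    hits = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        word = elem.get("CONTENT", "")
        if word.strip(".,;:!?\"'()").lower() == wanted:
            hits.append({
                "text": word,
                "hpos": float(elem.get("HPOS", "0")),
                "vpos": float(elem.get("VPOS", "0")),
                "width": float(elem.get("WIDTH", "0")),
                "height": float(elem.get("HEIGHT", "0")),
            })
    return hits


if __name__ == "__main__":
    for hit in find_keyword(SAMPLE_ALTO, "gazette"):
        print(hit)
```

Because each hit carries its own coordinates, a viewer can draw a highlight box over the matching word on the page image. How Papers Past itself does this is not described here, so treat the sketch as one plausible approach only.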
Copyright and reuse
The data has been released openly, so people can do anything they like with it (although we do ask people to please credit the source). Hopefully, it will be a fantastic addition to the Papers Past dataset that is already available via the DigitalNZ API.
Why is it important?
So why is this important? Well, in recent years, there’s been a growth in the number of researchers wanting to access large sets of data to explore topics using computational techniques. Access to historic newspaper data offers new opportunities for research into areas such as history, language, and global culture, and over the years, the Library has had several requests for access to Papers Past data to support such research.
You can see some examples of what has already been done using overseas sets of newspaper data on the Projects page.
It’s a pilot and we want to know what you think
As it’s the first time the Library has done this, we’re hoping to use the pilot to find out a few things, including how people prefer to access the data and whether it is easily usable in this format. The data will be available for 12 months, and then, after reviewing the pilot, we’ll decide whether to maintain, retire or expand the offering.
So, if you are interested and would like to have a play, go to the Papers Past Open Data Pilot — and be sure to let us know what you do.
What’s happened so far?
Since the data was first released, we can see that the .csv file that lists all the titles has been downloaded 24 times and that the individual datasets have been downloaded 423 times in total. The starter kit we provided has also been downloaded 24 times.
Unsurprisingly, most of the downloads occurred in the first two days after the data was released. The most popular title so far has been the Albertland Gazette of 1862, but we suspect this is only because it sits at the top of the list.
We’ve had some great feedback
We’ve also received quite a bit of feedback. From the data wranglers, the main feedback has been that they need a way of downloading all the data at once. Security considerations meant that we couldn’t provide researchers the ability to run programmes such as Wget to grab the files, so we are working on other ways of providing access to the whole 235GB at once.
In the meantime, one researcher has worked out, and shared with us, that a browser download manager plugin makes it possible to download all the data at once. He has successfully used Firefox with the DownThemAll plugin to do this. We added this information to the website last week and can already see that at least two other people have also downloaded the entire set.
Feedback from the library community rapidly segued into a conversation about the quality of the Optical Character Recognition (OCR) on Papers Past. In many ways this is part of an on-going conversation about whether Text Encoding Initiative (TEI) or METS/ALTO standards are the best way to provide fully text searchable documents.
Generally, TEI works well for small scale projects, but it is costly for bulk digitisation; METS/ALTO doesn’t reach the level of accuracy that TEI can, but it is much more cost-efficient for bulk digitisation. It is now the standard not just for Papers Past but also Trove (National Library of Australia) and Chronicling America (Library of Congress) amongst others.
Following on from that, people have been asking for the ability to correct the OCR text on Papers Past (in the way that one can on Trove). We’re still working on this — as supermodel Rachel Hunter said in the 1990s Pantene shampoo advert ‘It won’t happen overnight, but it will happen’.