
Data standards — Papers Past newspaper open data pilot

The Papers Past open data pilot has finished. We are reviewing the pilot and aim to complete the review and make decisions by January 2022. The data will remain available during the review.

Get in touch if you used the data. Tell us how it worked for you, what you were able to do or not do, and your thoughts on the future of open data from Papers Past. Email us at paperspast@natlib.govt.nz

Information about the data standards we use for the Papers Past newspaper open data pilot.

Papers Past data standards

The Papers Past Newspapers data consists of METS/ALTO XML files from 78 historic New Zealand newspapers published before 1 January 1900. It does not contain the page images of the newspapers.

How the Papers Past data is created

Newspapers are digitised for Papers Past from microfilm. This means that the images you see on the website are second-generation images: they are created from a film image of the original newspaper page. The film is scanned, then cropped and de-skewed as necessary, and saved as two separate TIFF files:

  • Preservation Master — the raw capture straight from the scanner, and
  • Modified Master — the cropped and de-skewed version, which is then put through an automated image conversion process.

Each page has an ALTO file created during image conversion, using Optical Character Recognition (OCR) technology to capture the text. The image conversion process creates blocks of text with coordinates, accuracy, and font information.
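
As a rough illustration of reading text blocks from an ALTO file, here is a minimal Python sketch. It assumes the standard ALTO element and attribute names (TextBlock, TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT, WC); the sample XML, the namespace version and the function name read_text_blocks are hypothetical and not taken from the Papers Past files, so check them against the real data before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="60">
          <TextLine>
            <String CONTENT="Mt" WC="0.98"/>
            <SP/>
            <String CONTENT="Benger" WC="0.91"/>
            <SP/>
            <String CONTENT="Mail" WC="0.87"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_text_blocks(alto_xml):
    """Return one dict per TextBlock with its text, coordinates and mean word confidence."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines, confidences = [], []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            words = []
            for string in line.iter():
                if local_name(string.tag) != "String":
                    continue
                words.append(string.get("CONTENT", ""))
                if string.get("WC") is not None:
                    confidences.append(float(string.get("WC")))
            lines.append(" ".join(words))
        blocks.append({
            "id": block.get("ID"),
            "text": "\n".join(lines),
            # Missing coordinates become NaN rather than raising an error.
            "hpos": float(block.get("HPOS", "nan")),
            "vpos": float(block.get("VPOS", "nan")),
            "width": float(block.get("WIDTH", "nan")),
            "height": float(block.get("HEIGHT", "nan")),
            "mean_wc": sum(confidences) / len(confidences) if confidences else None,
        })
    return blocks


blocks = read_text_blocks(SAMPLE_ALTO)
print(blocks[0]["text"])
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, because the ALTO namespace differs between schema versions.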

Each issue of a newspaper has one METS file. The METS file acts as a guide to the ALTO files and page images. It contains the corrected headlines and text blocks organised into reading order.
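
To show how reading order and headlines might be pulled from a METS file, here is a minimal Python sketch. The METS namespace and the structMap, div, TYPE, ORDER and LABEL names come from the METS schema, but the sample XML, the "LOGICAL" and "ARTICLE" values, the headlines and the function name list_articles are assumptions made for illustration. The actual Papers Past METS files may be organised differently.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily trimmed METS fragment for illustration only.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/">
  <structMap TYPE="LOGICAL">
    <div TYPE="NEWSPAPER">
      <div TYPE="ARTICLE" ORDER="2" LABEL="Shipping"/>
      <div TYPE="ARTICLE" ORDER="1" LABEL="Local news"/>
    </div>
  </structMap>
</mets>"""


def list_articles(mets_xml):
    """Return (order, headline) pairs from the logical structMap, sorted by order."""
    root = ET.fromstring(mets_xml)
    # Assumption: articles are <div TYPE="ARTICLE"> inside a LOGICAL structMap.
    path = ".//mets:structMap[@TYPE='LOGICAL']//mets:div[@TYPE='ARTICLE']"
    articles = []
    for div in root.iterfind(path, METS_NS):
        articles.append((int(div.get("ORDER", "0")), div.get("LABEL", "")))
    return sorted(articles)


articles = list_articles(SAMPLE_METS)
for order, headline in articles:
    print(order, headline)
```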

As the METS/ALTO data is created by an automated process, it is not always 100% accurate. It is not manually corrected after creation except in the case of headlines.

The METS/ALTO files are used to generate the indexes and link to the images you see on Papers Past.

Digitised newspaper technical makeup.


Structure of the data

Be wary about moving individual XML files around as they are not uniquely named.

In the example below, you can see the directory structure that the data has been provided in.

Example of an open data directory structure.
  • Each newspaper title has a folder named with an acronym. In the example above, the acronym MTBM stands for the Mt Benger Mail.
  • For every newspaper title, there is one folder for each year.
  • Within the year, each issue has a separate folder.
  • Issue folders are named in the format “ACRO_yyyymmdd”, where ACRO is the title acronym, yyyy is the year of publication, and mmdd is the month and day the issue was published.
  • For every issue, there is a folder for the modified masters (MM_01). The METS and ALTO files sit within the MM_01 folder.
  • The METS and ALTO file names are the same across all titles and issues (mets.xml, 0001.xml, 0002.xml and so on). Only the directory structure is unique to each issue.
  • As well as the XML formats, bibliographic metadata has been provided at the title level as a MARC record and as a Readme file in YAML format.
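
Because the file names repeat in every issue, it can help to derive an identifier from the folder names before copying any file elsewhere. The Python sketch below parses an issue folder name and builds a unique file name from a path. The function names, the acronym pattern (uppercase letters and digits), the year folder name and the example date are assumptions for illustration, not part of the dataset documentation.

```python
import re
from datetime import date, datetime
from pathlib import PurePosixPath

# Assumption: acronyms are uppercase letters or digits, followed by "_yyyymmdd".
ISSUE_RE = re.compile(r"^(?P<acronym>[A-Z0-9]+)_(?P<ymd>\d{8})$")


def parse_issue_folder(name):
    """Split an issue folder name such as 'MTBM_18810105' into (acronym, date)."""
    match = ISSUE_RE.match(name)
    if match is None:
        raise ValueError(f"not an issue folder name: {name!r}")
    issued = datetime.strptime(match["ymd"], "%Y%m%d").date()
    return match["acronym"], issued


def unique_name(path):
    """Build a collision-free file name from .../ACRO_yyyymmdd/MM_01/file.xml."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"path is too short: {path!r}")
    issue, masters, filename = parts[-3:]
    parse_issue_folder(issue)  # raises ValueError if the layout is unexpected
    return f"{issue}_{masters}_{filename}"


# Hypothetical path; the issue date is made up for this example.
example = "MTBM/1881/MTBM_18810105/MM_01/0001.xml"
print(parse_issue_folder("MTBM_18810105"))
print(unique_name(example))
```

With a name like this, a file can be moved out of its issue folder without losing track of which title, issue and page it came from.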

MARC stands for Machine-Readable Cataloging. It is a Library of Congress standard that describes library collections.

YAML stands for YAML Ain't Markup Language. It is a human-readable, structured data format. Although YAML is usually used in configuration files, here it embeds structured metadata in each Readme file in a form that both humans and machines can read easily.

Get in touch

Let us know how you've found the pilot, what's gone well and what hasn't worked.

If you have any questions about the data or would like to let us know about projects you have been working on with it, please get in touch.

Email us — paperspast@natlib.govt.nz


Feature image at top of page: Image created by Greig Roulston from pictures from the pilot dataset.