War reports in all CAPITALS

Thanks to the centenary commemorations, it’s been impossible not to be aware of the First World War. Events have been re-examined through the distancing lens of a hundred years across a range of media, performances, public spaces, and proclamations.

One of the online projects the National Library has contributed to is the Life 100 Years Ago initiative, spearheaded by the Ministry for Culture and Heritage. It aggregates multiple Twitter feeds to deliver quotes and snippets from the diaries, letters, and newspapers of New Zealanders.

Daily life of a century ago is revealed through contributions from well-known figures like Lt. Col. W.G. Malone, alongside lesser-known diarists like the labourer James Cox. Contributions from newspapers provide a current events pulse as a backdrop to these personal accounts.

You can learn a lot from the paper

Papers Past has been providing the newspaper tweets for this project. Queueing up three or four daily tweets meant reading a lot of old newspapers, and reflecting those times both neutrally and in a way that occasionally intersected with the diary entries of the other contributors. This ongoing daily close-read of New Zealand papers slowly revealed consistent trends and patterns in the writing and editorial lines of each paper.

Beginning in 2013, the project ticked through the background movements that contributed to the war. Then, with the escalation of Austro-Hungarian tensions around the assassination of Archduke Franz Ferdinand and his wife, things got real.

Stakes got higher in the newspapers. Editorial lines subtly started to reflect the scale and seriousness of the situation. Some of the reporting in the very early phases of the conflict carefully juggled the need for journalistic balance with the need to reflect the rapidly escalating emotions of the events.

Then, in a very short period of time, the nature of reports relating to the conflict and its participants changed again. They seemed to quite abruptly lose the balance and detail that the earlier reporting had. This entire process took from July to September 1914 – a fairly short time for what seemed like such entrenched patterns to shift so pervasively.

What caused this change? Was it a reflection of the public mood, and the editorial policy shifting to where newspaper sales were best aligned? Was it the influence of wartime policy? Was this just a reflection of a modern conflict being fought on the informational level?

Newspapers + data analysis = research!

Stepping forward in time a hundred years, we arrive at the 2014 National Digital Forum (NDF) conference. One fascinating session there was delivered by Douglas Bagnall, entitled “Spying on the past”.

Douglas detailed his approach to disambiguating authorship across texts with unclear creator attributions by training neural nets to identify writing patterns. He began by explaining that he had been thinking about testing his methodology on the Papers Past corpus when along came the election, and the Whale Oil blog provided a more topical source of text to analyse. The video of his talk is fascinating, and anyone with a side-interest in AI, neural nets, or data analysis (that’s everyone here, right?) should watch it now. We’ll wait.

Baby Robinson sitting on a pillar, ca 1870s-1880s. Ref: .

Following this fairly mind-blowing presentation, the Library approached Douglas about the possibility of using the same approach to analyse the Papers Past corpus around the early period of World War One, to see if his methodology would hold true as a way to interrogate old information for new insights. Automating the inquiry process is an increasingly important part of the learning process, at all levels, and we were keen to explore this field.

We helped create a research proposal around a hypothesis, provided Douglas with access to the relevant full-text metadata, and waited for the magic to happen...

Only it absolutely wasn’t that straightforward

Initial explorations revealed the significant analytical differences in working with a nigh-on 100% accurate text body (which articles online tend to have), versus the inconsistent characters contained in an OCR-derived body of text sourced from tiny microfilm frames, made from images of old newspapers printed a hundred years ago.

When you’re undertaking a procedural analysis of something, having control over the consistency of the stuff you’re analysing is important. When you’re training a neural net to help you do that analysis, source consistency becomes absolutely key.

A human can read a digitised newspaper page pretty easily. Reading a slightly garbled OCR text is also possible, as human brains are adept at dealing with a few inconsistencies in data. However, when you’re establishing rules for a simple AI, and the source data looks like

We have mads arrangy—b■ fc"» Space for FARMERS STOCK a» |h» Local Work*

then you have a problem. Douglas raised the issue with us, and we understood that our initial intended exploration had morphed into something quite different: How could imperfect datasets be mined for meaningful information?

This is a much harder problem to solve. Douglas’s response involved several steps:

  • Text transformation
  • Selection of newspapers for higher OCR quality
  • Dataset size – too much matching data could result in overfitting, which would bias the results
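To give a feel for the first two steps, here is a minimal sketch of one possible text transformation and a crude OCR quality score. This is our own illustration, not Douglas’s actual pipeline: the function names, the tiny word list, and the scoring rule are all assumptions made for the example.

```python
import re

def normalise(text):
    """Lowercase, turn anything that is not a-z or whitespace into a space,
    then collapse runs of whitespace. (Assumed transformation, for illustration.)"""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def quality_score(text, vocabulary):
    """Fraction of normalised tokens found in a known-word list.
    A rough proxy for OCR quality, not a measured accuracy."""
    tokens = normalise(text).split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocabulary) / len(tokens)

# Hypothetical word list; a real one would be a full dictionary.
VOCAB = {"we", "have", "made", "space", "for", "farmers",
         "stock", "local", "work", "the"}

garbled = 'We have mads arrangy—b■ fc"» Space for FARMERS STOCK a» |h» Local Work*'
clean = "We have made space for farmers stock"

print(quality_score(garbled, VOCAB))
print(quality_score(clean, VOCAB))
```

A score like this could be averaged per newspaper title to pick out the better-scanned papers, though where to set the cut-off would be a judgement call.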

This started to yield analytical results – not the sort of definitive pattern-matching we were hoping for going into the exercise, but the beginnings of correlations between historic items and the JSON metadata we had derived from newspaper microfilm scans.

Douglas was able to start outputting t-SNE maps (two-dimensional projections in which each text is represented by the frequencies of its character n-grams, that is, sequences of n characters), like the image below:

A t-SNE representation of articles from Papers Past, coloured by year, with topic annotations. A significant number of the articles within each rectangle relate to the annotated subjects.

t-SNE visualisations place texts with similar relative frequencies of character sequences close together, and this mapping does reveal matches in subject matter and style.
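As a rough illustration of the idea, the sketch below turns each text into a profile of character n-gram frequencies and compares profiles with cosine similarity. It stops short of t-SNE itself, which would take vectors like these and project them into two dimensions (scikit-learn’s TSNE class is one commonly used implementation). The function names and sample sentences are made up for the example, and the choice of n = 3 is an assumption.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams in a text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}

def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Made-up sample texts: two war reports and one advertisement.
war_a = "the troops advanced on the western front"
war_b = "troops on the western front advanced again"
advert = "fresh butter and eggs for sale cheap"

print(cosine_similarity(ngram_profile(war_a), ngram_profile(war_b)))
print(cosine_similarity(ngram_profile(war_a), ngram_profile(advert)))
```

Texts on similar subjects tend to share more character sequences, so their profiles sit closer together, which is the kind of structure a t-SNE map makes visible.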

This visualisation approach also provided an interesting view of advertisements vs articles, below:

t-SNE representation of ads and articles, colour coded by type, with the text of five isolated 'advertisements' found within a region of concentrated articles. The second item (about Ford cars) seems to have an advertising component, while the fifth contains two items conjoined in error. The others appear to be articles.

The data started to run back into its limitations when a multi-headed recurrent neural network (MHRNN) language model, used to measure dissimilarity between articles and known texts, was applied. OCR errors make this method an impractical way to identify individual writing styles, as the errors overwhelm the small nuances that might disambiguate or correlate authorship across texts.

In the image below, the data comes from a range of texts by the author Elsdon Best – the spread of the score (2-14) indicates difficulty correlating the author to the works due to the OCR errors. The errors are also the main factor influencing the 2-dimensional clustering.

t-SNE of items by Elsdon Best, processed by a multi-headed recurrent neural network trained on multiple authors, coloured by cross-entropy vs Elsdon Best (rounded down to integers).
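To show why OCR noise pushes these scores around, here is a toy version of the cross-entropy measure. It swaps the recurrent neural network for a much simpler character bigram model with add-one smoothing, so it is a stand-in for the idea and not the method from the write-up. The training text, the sample strings, and the function names are all invented for the example.

```python
from collections import Counter
import math

def train_bigram_model(text):
    """Count character pairs, and how often each character starts a pair."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts, set(text)

def cross_entropy(model, text):
    """Average bits per character transition under the model.
    Add-one smoothing gives unseen pairs a small non-zero probability."""
    pair_counts, first_counts, alphabet = model
    vocab_size = len(alphabet | set(text))
    transitions = list(zip(text, text[1:]))
    if not transitions:
        return 0.0
    total_bits = 0.0
    for a, b in transitions:
        prob = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab_size)
        total_bits -= math.log2(prob)
    return total_bits / len(transitions)

# Invented training text standing in for an author's known writing.
training = "we have made arrangements for space for farmers stock at the local works " * 20
model = train_bigram_model(training)

clean = "we have made space for farmers stock"
noisy = "wc havc mads arrangy b fc spacc f0r farmcrs st0ck"  # made-up OCR-style errors

print(cross_entropy(model, clean))
print(cross_entropy(model, noisy))
```

The noisy string scores worse simply because of its character errors, regardless of who wrote it, which mirrors the way OCR errors swamp the stylistic signal in the Elsdon Best example.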

Why not have a read of Douglas’s write-up of his investigation, which provided the images above? (pdf, 3.5MB)

So what can we improve?

We came out of this with an increased understanding of how analysis with neural nets can be applied to content metadata produced by automated processes at scale. Some analytical techniques can withstand the OCR noise of scanned texts, others not so much.

The obvious improvement we can deliver to aid this approach is to provide a way to correct the OCR text of the corpus (hmmm, I feel a project coming on).

The other obvious improvement is to think about how we expose the data in future for large-scale analysis – it’s clear that there’s something potentially very useful for research workflows in doing so.

Less obvious lessons included the viability of the JSON data format for this work, and the usefulness of the SHA1 cryptographic hashes each item included, as a means of distributing the data into sets for analysis.
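Here is a small sketch of how a per-item SHA1 hash can be used to split data into sets. Because the hash is effectively random but fixed for each item, the same item always lands in the same set without anyone keeping a list. The record layout, field names, and the 10% test fraction are assumptions for illustration; in the real data each item already carried its hash, whereas here we compute one from made-up text.

```python
import hashlib

def assign_set(sha1_hex, test_fraction=0.1):
    """Map a SHA1 hex digest to 'test' or 'train' using its first 8 hex digits."""
    position = int(sha1_hex[:8], 16) / 16 ** 8  # a value in [0, 1)
    return "test" if position < test_fraction else "train"

# Hypothetical records standing in for the JSON items.
items = [{"id": f"article-{i}", "text": f"Sample article text number {i}"}
         for i in range(1000)]

for item in items:
    # Computed here only because these sample records have no hash of their own.
    item["sha1"] = hashlib.sha1(item["text"].encode("utf-8")).hexdigest()
    item["set"] = assign_set(item["sha1"])

test_count = sum(1 for item in items if item["set"] == "test")
print(test_count, len(items) - test_count)
```

The split is reproducible across machines and runs, which is handy when several people are working from the same corpus.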

Lastly, we’ve also learned that there’s value in bringing fresh eyes to the resources produced by our workflows – we’d like to thank Douglas for all the insight he brought to this problem, and we hope we can build on the increased understanding he has helped bring about.

By Emerson Vandy

Emerson is Digital Services Manager, taking good care of Papers Past, and occasionally a beard.

Michael Brown November 23rd at 9:52AM

Fascinating - although a pity that OCR was not accurate enough to use for all the possible analyses. It would be interesting to hear more detail on how the articles on different topics were identified, e.g., entertainment, British spirit.