Papers Past satisfaction survey 2021
July 21st, 2021 By Tracy Powell
Find out more about common topics on users’ wish lists, including user text correction, extending the date range of newspapers in Papers Past, filtering out duplicate search results and more content types to use in filtering.
Thank you to Melanie Lovell-Smith and Emerson Vandy for assistance with this blog about the Papers Past satisfaction survey in 2021.
Thank you for your feedback
This year’s Papers Past satisfaction survey gave us many useful questions and comments, including some helpful ideas about improving the way the site works. Thank you for the feedback.
Feedback on common topics raised in the survey
We thought it might be useful to provide a response to some of the most common newspaper topics that came up, including:
- ability for users to correct OCR text
- extending the date range covered
- duplicate search results
- adding more content types
See also our blog looking at the results of the 2019 Papers Past satisfaction survey for tips on how to get the best out of the functionality.
User text correction
We were hoping to add the ability to correct the computer-generated (OCR) text ‘like Trove’ in 2020-2021. However, to support the increased overhead on Papers Past that this would generate, we have needed to upgrade the infrastructure behind Papers Past. This work will continue throughout 2021-2022; once complete, we will begin detailed planning for how and when we will deliver text correction.
Extending newspaper coverage beyond 1950
We have been negotiating with publishers and other rights-holders to deliver more recent content. We are delighted to announce that, with the permission of Stuff Ltd, we will be able to extend first the Press (Christchurch) up to 1995, and then the Auckland Star up to 1991. This will take some years. More recent newspapers have larger pages and more of them, which increases the cost and time needed to process them. We will also continue to add new titles to Papers Past, filling in gaps in geographic spread and extending the date range up to 1950 for other titles.
Duplicate content in search results
Several users commented that they would like to be able to filter out duplicate content, in particular advertisements. The same advertisement may be published many times over long periods of time. If you remove “advertisement” from the Content Type filter this should reduce the number of advertisements in your search results, but won’t help if you would like to display one instance of each advertisement/article.
Newspapers have always reprinted articles, and the growth in news agencies, such as Reuters, and locally, the New Zealand Press Association, means this becomes more common as time passes. Unfortunately there is no effective way at the moment of identifying content that ‘looks’ identical, given the variations in the computer recognition of text that occur across content. The text is automatically extracted from the scanned pages, and the level of accuracy depends on factors such as the quality of the original source material, small print, mixed font, multiple column layouts or damaged pages. Advertisement text is usually particularly poor because of the use of varied fonts, print size and decorative flourishes.
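To illustrate why this is hard, here is a minimal Python sketch. It is not a description of how Papers Past works: the sample strings are invented, and the function names `normalise` and `similarity` are hypothetical. It uses the standard library's `difflib.SequenceMatcher` to show that two OCR readings of the same notice can fail an exact comparison while still scoring as highly similar.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 to 1.0 for how alike two strings are."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


# Two invented OCR readings of the same advertisement, with typical
# character confusions (o/0, l/1, m/rn, e/c).
ocr_a = "WANTED, a good General Servant. Apply Mrs Smith, High Street."
ocr_b = "WANTED, a g00d Genera1 Servant. App1y Mrs Srnith, High Strect."

# An unrelated invented notice of similar length.
ocr_c = "FOR SALE, two draught horses. Apply J. Brown, Main Road."

exact_match = normalise(ocr_a) == normalise(ocr_b)
score_same = similarity(ocr_a, ocr_b)
score_other = similarity(ocr_a, ocr_c)

print("Exact match between the two readings:", exact_match)
print("Similarity, same advertisement:", round(score_same, 2))
print("Similarity, unrelated notice:", round(score_other, 2))
```

Even in this toy case, someone would still have to choose a cut-off score that separates genuine reprints from notices that merely share stock phrases, and comparing every item against every other item would be costly across a large collection.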
Adding more content types
There were multiple requests for more content types to use in filtering search results, in particular breaking the ‘article’ content type down into categories such as births, deaths, marriages and shipping.
The current content types of article, advertisement and illustration are automatically identified, and are not always accurate. Some years ago we ran a pilot that tested applying more granular categories manually. This involved providing detailed rules about the newspapers themselves and about where and how such content normally appeared. The results indicated that too many articles were misidentified or left unidentified to justify introducing this approach, given the increased effort and cost.
Since then, developments in machine learning have made an automated approach much more viable, but this is not something that we are investigating at the moment. Our focus is on increasing the range of content.