HeritagePreserve
June 30th, 2014
On Friday the 13th of June, the National Library ran its first codefest or hackathon type event, which brought technical specialists into unprecedented contact with some of our digital collections.
This post covers what we set out to do with the event, what we achieved, what we learned, and what we plan to do next. It’s a record of our experience and a guide you might like to use if you want to do the same. This one’s a long read, so settle in...
- The concept
- Preparing for the event
- Getting data and collections
- Setting up the technology
- Day one
- Day two
- What we made
- Lessons learned
- Next steps
The concept

A codefest is a loosely organised event which brings together problems, technical experts, and domain specialists. It provides a space to explore problems, build solutions, or otherwise address the event theme within a set time.
Our theme was Digital Preserve. The National Library has a large digital preservation program, running for more than 8 years now. Our world-class digital preservation repository has been a routine business system for more than 4 years.
That application, Rosetta, has become a staple tool in the National Library’s digital armoury, and has allowed us to develop a collective view of what it means to do digital preservation, and to understand the key challenges that we face day to day as we work on the digital objects that come into the Library’s care.
We wanted to put this domain knowledge to use, and hackfests in the digital preservation space around the globe over the past few years provided a model. We saw no reason why we shouldn’t be opening our doors in the same way to our collections and issues.
This was our home for two days.
Preparing for the event
In preparation for the event, we needed to ensure we had:
- enough content accessible to attendees to explore,
- technology in place to assist the attendees in their work,
- a good understanding of the type of challenges that could be addressed in this type of event, and most importantly,
- enough people attending to make the event viable.
First, we needed to ensure that we were comfortable with the proposed access to the national heritage taonga we hold. We arranged a round of discussions and proposals to management and Library leaders to ensure that we would have the right support, and that we were comfortable with the attendant risks that come with opening up collections in this way.
We also reached outside the Library, and spent some time talking to people who had either run these types of events before, or who wanted to get involved and help out. Catalyst IT very kindly gave us some very useful insight on preparing for the event, and offered to feed and water the attendees.
We spent quite some time making sure that we had the right technology. We worked very closely with our internal technology provider in the Department, and one of our technology vendors, Revera, who also offered to stand up most of the technology stack we wanted to provide. Essentially, we got a special secured network, with a significant amount of storage to allow us to provide safe, unhindered access to our content without putting our corporate platform and services at risk.
This network was then given access to the public internet in a safe way, and finally topped off with secure WiFi in the venue to allow attendees to easily get to the content, and research their chosen challenges.
Once we knew the technology was going to be in place, and there was an appetite for this type of event, we started to promote the event and see who would attend. Using the Eventbrite event management website, it was pretty straightforward to build the event narrative, and arrange the tracking of ticket sales.
The event was free to attend, but by using the ticketing system we were able to classify two types of attendees, “Coder / Technical” and “Content Custodian”, helping us get a mix of the two types. While we knew that the event was geared towards coders, we also wanted to ensure that we were open and accessible to other organisations that have a preservation requirement, and wanted help exploring their own preservation challenges.
Lunch time. Time for a catchup and a review of progress
Getting data and collections
As signups came in, we moved into the content preparation phase. Working closely with the National Library and Alexander Turnbull Library, we found several collections and challenges that lent themselves to this type of event.
We ended up with three main areas that suited our theme: The Internet, Collections and Formats.
The Internet as a category was (unsurprisingly) large, but we found a few specific areas that warranted some exposure.
The Library has some whole-of-domain web harvests (including a copy of Geocities) that represent the national archive of New Zealand’s web presence over the past 5 years. These collections are vast, and the tools we currently have access to don’t really offer meaningful insight into the sets at scale. Of these collections, we’re interested in knowing how to segment New Zealand content from non-New Zealand resources, and what research quality access looks like for multi-terabyte collections.
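One crude first cut at the segmentation question, assuming the harvest records are keyed by URL, is to filter on the .nz ccTLD. This is a sketch, not anything we built for the event; the function names are illustrative, and it obviously misses New Zealand content hosted under .com or elsewhere.

```python
from urllib.parse import urlsplit

def is_nz_url(url: str) -> bool:
    """Crude heuristic: treat a resource as New Zealand content if its
    host sits under the .nz ccTLD. Misses NZ content hosted elsewhere."""
    host = urlsplit(url).hostname or ''
    return host == 'nz' or host.endswith('.nz')

def segment(urls):
    """Split an iterable of harvested URLs into (nz, other) lists."""
    nz, other = [], []
    for url in urls:
        (nz if is_nz_url(url) else other).append(url)
    return nz, other
```

A real segmentation would also need to consider hosting location, registrant details, and content signals, which is exactly where the open research question lies.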
The internet also houses many interesting resources that are difficult for contemporary tools to collect. Social media content, especially, is routinely difficult for us to gather in a viable state for long term preservation. We’re extremely interested in addressing this so we can preserve the New Zealand presence for future generations.
Finally, the internet is increasingly used as a publishing platform. We’re interested in developing methods that help our colleagues in collections to efficiently gather large quantities of published materials that fall under our Legal Deposit mandate.
We identified several collections that have some preservation issue, or that could be processed to assist future researchers in engaging with the content.
With these collections, we were interested in the same overall question as with the web harvests – what should we be doing with this content that would make it more accessible to Library users?
We of course had to be aware of any donor restrictions placed on our content, and make sure that we only gave access to content that donors would be happy with, and that we were not breaching any copyright requirements.
The result was a collection of text items (some simple text files, some more structured TEI/XML), images from a photographic collection, large scale digitised versions of some historic maps, and an exemplar collection of files that demonstrates the technical complexity facing Library colleagues when we ingest digital collections.
The final area we identified was formats, or file types. File types matter hugely in the preservation space, simply because if we don’t know what a binary object is supposed to be, we can’t make sure we are looking after it properly.
The basic concept is well understood – you need to give a Microsoft Office Word file to an application that understands Microsoft Office Word files, to ensure that the content of the file is properly rendered on screen. If you gave the same file to an MP3 playing application, it wouldn’t know what to make of it, and would most likely fail to load it.
Following this argument, if we don’t know what type a given file purports to be, we can never be sure that we are able to assess the heritage value of the file. To this end, we have a strict set of controls at the ingest side of our digital collection process, which means we do not simply ingest files without being able to identify the file type with reasonable accuracy.
Elastic search. Elastic search. Elastic search. Elastic search.
When we’re dealing with older collections, a known file type for each file we have becomes more important, and often more difficult. We’re pretty good at identifying individual file types using a variety of tools, resources and brute force, however there is a lot that could be done by building better tools, solving individual format issues where we can’t tell exactly what the original format was, or developing a deeper understanding of the nuanced preservation risks of specific file types.
We collected a variety of files with known issues to present to the event and asked open questions about what we ought to be doing with them. These included email mailbox files; ISO disk images for video DVD based content; DSLR image files that are underrepresented in the preservation space; MARC-based record sets; and other files of an unknown file type that we wish to assess.
Setting up the technology
Having identified the collections and content we could make available, we started working with our technology partners to extract the content from its current storage, and copy it safely into the event’s temporary environment. This was not a trivial exercise, and we learned lots about how to do this better next time.
We met a few times with interested parties to make sure we had some reasonable expectations for the event. What would attendees expect to see when they arrived on day one? What should we do in advance to try and ensure a smooth-running couple of days? And most importantly, how could we measure the success (or otherwise) of the event once it concluded?
All too soon it was showtime. We had moved content, set up the network, prepared attendees for the event, prepared the room that was to be our home for the next couple of days, and collected the collateral that we wanted to make available at go time...
Day one

The first day passed in a blur of activity. The first of the attendees arrived in good time to set up camp, and get familiar with the technology we had set up and the sources of coffee in the Library.
By our formal start time, we had over twenty attendees from outside the Library, from Wellington, Auckland and Christchurch. We also had a good range of National Library colleagues in the room, who had been individually tasked with looking after various parts of the collection/digital preservation paradigm, and with working alongside attendees who were tackling the attendant challenges.
We wanted to avoid a rigid schedule, expecting the event to be relatively organic in its own organisation. The basic structure we formed was quite sparse: an opening, “speed dating” the collections / problems and attendees, a tour of some Library facilities, and a wrap up session each day.
We invited Alison Elliot, the National Library’s Director of Content Services, to formally open the event. Alison addressed the gathered masses, and we listened as she described the importance of digital preservation for the National Library, setting the scene for the next couple of days’ work.
We moved on to introduce the various people in the room to each other, and then to describe the collections and problems we had ring-fenced for the event. This was a surprisingly complicated task, and one that took quite some time to get right.
The Library colleagues have a vast understanding of the collections they are charged with looking after, or the problem spaces they occupy on a daily basis as they go about their work. Presenting a shopping list of ideas to a room of eager attendees was probably not the best way of getting the attendees focused on particular problems, and the following couple of hours were spent ensuring that everyone had good sight of the challenges and content on the table. This settling down phase was punctuated by the facilities tour, an activity that appeared to be the main highlight for a number of our attendees!
Our staff showed the attendees some of the equipment and processes we have to help us look after our physical collection. This included the audio/video preservation kit, the basement stacks, and some physical conservation areas where the Library’s conservators work diligently on the physical artefacts.
The groups returned from their tours, enthused and encouraged by seeing first-hand the extent of the effort we go to when looking after our physical taonga, and making some really useful parallels with our digital collections.
This is what D3, clever people, and a decent set of data gets you.
For the rest of the day attendees settled into small working groups, or worked individually to unravel their chosen problems. This process was organic as expected, and it was extremely useful having Library colleagues on hand to help describe the collections and challenges as we understand them. There was a lot of match making, which involved talking iteratively to the various attendees and pointing them towards people, content, or resources as their various questions dictated.
By lunch time most people were involved in at least one project, and the rest of the event (excluding the wrapping sessions) ran as expected – enthusiastic volunteers working hard on collections that they found interesting, or on challenges they found compelling.
Two discussions particularly stood out. One explored how we might use gamification to help structure future events, especially as a method of drawing together people from various skillsets and domains. The other covered the abstraction of software/platform in the preservation domain, and the inherent complexity of tool-making against specific versus generic/abstracted requirements; this conversation drew specifically on ideas in Stewart Brand’s book How Buildings Learn: What Happens After They’re Built.
Day two

Day two ran in much the same vein, with working parties busy concentrating on their chosen problems. One of the first discussions we had as a group was about what success looked like for any products of the event, and how to document what was attempted and completed. One of the attendees took up the role of record keeper, and spent the day interviewing the various attendees to capture their ideas and outputs. This effort was invaluable as a working record of the event, as the relative chaos of 15 different individual projects came to a conclusion.
D3 again. This time looking at an email inbox.
What we made
The following list is as complete as we could make it. There were some people working on things that they weren't ready to share at the end of the event, which was entirely okay, and we’ve not included their work in the round up.
Metadata extraction tool
Target: Moving the NLNZ Metadata Extraction tool from SourceForge into a new home in Github, including all the SourceForge history.
Progress: Take a look!
Target: The content of this collection discoverable via DigitalNZ and items able to be retrieved from a publicly accessible URL.
Target: Entity recognition for the dataset, such as people and places.
Progress: A proof of concept using Apache Stanbol to perform entity recognition over a sample set (1,000 items). However, we need better authority lists to work from.
Target: Make a GUI that allows users to engage with large datasets in a useful way.
Progress: Wrapped some visualisation around the data, and demonstrated a working example to the group where filters can be applied to refine searching of specific elements.
Workflow dashboard for taking digital items into custody
Target: Have a tool that allows colleagues to share progress of digital collections as they go through the appraisal, ingest and description process.
Progress: Proof of concept built that can be deployed and tested as a work in progress.
PST format extraction
Target: Extract the contents of these files, emails and attachments, into open formats.
Progress: Contents extracted as described, basic search interface built to aid appraisal and discovery.
PST format visualisation
Target: A visualisation the NLNZ team can use to make informed decisions about the contents of the PST files in the collections.
Progress: Cut the data down to a workable sample (e.g. linkages that have more than 100 connections), and showed some visualisations exposing links between discovered entities.
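The sample-cutting step can be sketched generically. Assuming the correspondence data is reduced to (sender, recipient) edge pairs, a degree count picks out the hubs; the function name and threshold are illustrative, not the team’s actual code:

```python
from collections import Counter

def prune_to_hubs(edges, min_degree=100):
    """Keep only edges whose endpoints each touch at least min_degree
    edges, cutting a huge correspondence graph down to its hubs so a
    visualisation stays readable."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [(a, b) for a, b in edges
            if degree[a] >= min_degree and degree[b] >= min_degree]
```

This kind of pruning trades completeness for legibility, which is acceptable when the goal is appraisal rather than analysis.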
GeoCities – a WARC in progress
Target: Extract New Zealand content from the Geocities dataset, so it can be archived within NLNZ collections.
Progress: Currently working through a number of errors in code, ongoing. Finding a needle in one heck of a haystack...
Firefox plugin for infinite scroll
Target: Scrape or capture infinitely scrolling pages for their content.
Progress: Basic model built.
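The plugin’s internals weren’t shared, but the core loop of any infinite-scroll capture can be sketched in the abstract: trigger a scroll, let new content load, and stop once the page height stabilises. The two callbacks here are stand-ins for real browser operations (e.g. injected JavaScript running window.scrollTo and reading document.body.scrollHeight):

```python
def capture_until_exhausted(scroll_once, get_height, max_rounds=100):
    """Repeatedly trigger scrolling until the page height stops growing,
    i.e. the infinite scroll has run out of new content. Returns the
    number of rounds performed and the final height. max_rounds guards
    against pages that genuinely never end."""
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        height = get_height()
        if height == last_height:
            return rounds, height  # stable: nothing new loaded
        last_height = height
    return max_rounds, last_height
```

A production capture also needs a wait between scroll and measurement (content loads asynchronously) and a serialisation step for the fully-expanded DOM, which is where the real difficulty lives.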
Troopship magazines (PDF)
Target: Convert image-based PDFs containing full text into the Papers Past/AJHR format (METS/ALTO).
Progress: A solid attempt was made, but it looks like this is not a simple task. While unsuccessful, the attempt proved the conversion is not as trivial as first thought, which means it can be discounted as a way forward for the digitisation team.
Augmenting SuppleJack to support crawling and binary collection
Target: Test the viability of the DNZ metadata harvester as a digital object harvester.
Progress: The basic challenge was explored, and the collection of binary objects demonstrated. Website crawling to expose the binaries was proving to be a decent challenge for the SuppleJack based approach.
The final outcome was the great exposure we had for the Library, its collections, and the work of its staff. We’ve never opened up collections in this way before, and it was great to walk around the room watching people talk about the collections and start to engage.
Unswervingly focused. This is how the professionals work.
Lessons learned

From the Library side, we learned a number of really useful lessons, which will all feed back into the planning process for any future events.
Data preparation

We’re used to dealing with data in our various library systems as they are part of our day to day work. It turned out that we could have structured some of this data in a simpler form to help attendees engage quickly with the issues in hand. This preparation will include liberating information from esoteric structures prior to any future events, providing some written context or problem statements with each of the collection sets, and providing ready access to any known tools that would be useful in relation to the problem or collection.
Authority lists

Without entering into a philosophical discussion around what form these lists should take, we learned that it would be useful to prepare some notion of authorities from a Library perspective as an available resource. These authority lists include people, places and other concepts that would be useful as seed points for searches, entity extractions or other processes on collections. It’s not really critical that these lists are the authority; it’s more that they exist in a tangible form and can be used with tools to add value to proofs and models that are being developed.
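The kind of seeding this enables can be sketched with a deliberately naive gazetteer match: given a small authority list of names and places, scan a text for whole-word occurrences. Real entity recognisers (such as the Stanbol proof of concept from the event) also handle variants and disambiguation; this only shows why a tangible list is the prerequisite.

```python
import re

def find_entities(text, authority_list):
    """Return the authority-list entries that appear in the text as
    whole words or phrases (case-insensitive). A naive gazetteer
    match; no variant handling or disambiguation."""
    found = []
    for entry in authority_list:
        pattern = r'\b' + re.escape(entry) + r'\b'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(entry)
    return found
```

Even this trivial matcher turns an authority list into seed points for search and extraction, which is exactly the "tangible form" the lesson above calls for.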
A home for any produced works
We noted that we need to consider what the output stages are for work produced during the event. This covers two spaces: one is where we locate any sharable code that’s written; the other is how to push any newly refined structures or entity extractions back into Library systems. This could mean feeding back into DNZ, or adding data to Library catalogues. This represents the long-tail value of the event, and as such it’s essential that we get this right in the future.
Working on “things” rather than “a thing”
Closely related to data preparation, we learned very quickly that we’re not ideally positioned in the Library to deal with the concept of “things”. We’re not bad at “a thing” – in Library-land, this would be the equivalent of someone getting a book out of the stacks. If a customer walked into a library and requested a whole shelf of books, or 100 books, each coming from a different shelf, we would struggle. We simply don’t have systems in place to allow this to happen without some significant changes to current norms. The same could be argued about digital objects. Events like this thrive on data being presented to attendees in simple ways, and free of domain specific wrappers and bindings that would otherwise slow down their useful application against tools and workflows that are from outside the Library context.
We learned, very quickly, that Elasticsearch is an incredibly powerful and versatile tool...
Next steps

The million dollar question is: will we run the event again? We really hope so. It was well attended, and we hope the attendance demonstrated the public’s interest in New Zealand digital heritage collections.
In terms of what we set out to achieve, we didn’t get into half of the problems or collections that we thought would be attractive to attendees, so with some refinement (as discussed above) we hope that simply re-running a similar event in a few months would see the same positive outcomes for the rest of the identified issues.
Of course, we’re really interested to hear your thoughts on the subject. If you’re a coder, what would you want to see or do or achieve at a hackathon? If you’re a content curator, what do you think would be useful? Please do get in touch if you have any suggestions.
Teaser image: HeritagePreserve's very own Feijoa Jam – thanks Jonathan, it was delicious!
Artwork Greig Roulston produced to support the event. Derived from Eph-F-MEAT-Gear-035.