On the Commons

Put a bird on it

A big old owl face! Detail of Morepork, Spiloglaux novae-zealandiae, and Laughing owl, Sceloglaux novae-zelandiae, by John Gerrard Keulemans. See it on Flickr.

Our collection of entirely free images on Flickr just got 3,500 high-resolution additions. Thanks, Python!

Everything we’ve uploaded to that account is yours to use in whatever way you want. These are images that have no copyright, donor restrictions, or anything else that might get in the way. They’ve also been uploaded at the highest resolution we have.

You can download the images, use them online, print them, turn them into lovely tea towels or linoleum.

(We’d love it if you linked back to us when you do, but you don’t have to.)

The new uploads came from our free download pool, which is full of – you guessed it – images you can download for free. So why did I spend far too long writing terrible code to pipe them all up to Flickr?

First, there’s a ton of people on Flickr who are never going to come to our site. Why not get our stuff right in front of their faces?

Secondly, our interface for downloading these images isn’t great. You need to log in with RealMe and take the image you want through our whole image ordering process. On Flickr, zooming in to see details or downloading the size you want is far easier.

Lastly, I wanted to show that sharing our open content is possible at a large scale. I’m hoping this is going to make it easier for us to open up and share more from the collections.

Details of five images uploaded to Flickr: Greymouth and Kumara Tramway; Wellington Corporation Tramways ticket; Sarah Ann Featon, Yellow Kowhai; Soldiers repairing a car; Wellington Physical Training School.

What’s next?

We’ll be releasing more images, at the collection level, as they’re checked and cleared. That’ll probably mean repeatedly uploading dozens (or thousands?) of images in one go.

When that happens I’ll have a much cleaner script that’s easier to use and more reliable. From start to finish, it should (there’s a rough sketch after this list):

  • Go over the list of images we’ve picked out for inclusion
  • Check if an image has already been uploaded, discard it if so
  • Get the list of ID numbers we’re working with and send it off to NDHA so we can get the high-res tiffs
  • Rename the files with their DigitalNZ identifiers
  • Smoothly handle authentication with Flickr
  • Grab all the metadata
  • Sort images into their collections
  • Upload!
  • Mark each upload as done so we can run it in batches
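In rough Python, that pipeline might look like the sketch below. Every helper name here is a placeholder for a step in the list, not working code:

# A sketch only: none of these helpers exist yet
def run_batch(image_list):
    for dnz_id in image_list:
        if already_uploaded(dnz_id):       # skip images we've already done
            continue
        IE_number = get_IE_number(dnz_id)  # ask DNZ for the NDHA identifier
        fetch_tiff(IE_number)              # request the high-res tiff from NDHA
        rename_to_dnz_id(IE_number, dnz_id)
        authenticate_with_flickr()
        metadata = get_metadata(dnz_id)
        collection = sort_into_collection(metadata)
        upload_to_collection(dnz_id, IE_number, collection)
        mark_done(dnz_id)                  # so batches can resume cleanly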

Have a look at the code on GitHub

Read on for technical muckery!

Bleep bloop

Moving the files and all their info had several steps:

  • Identify the free downloads
  • Check if any were already on Flickr, and see what was on Flickr but not in the Free Downloads
  • Get the source files
  • Get their metadata
  • Upload them with all that metadata

And to avoid getting lethal clicking finger strain or dying of boredom, do it all automatically 3,500 times over.

2 APIs and a bunch of messy Python

To write this code I picked Python, because it’s the language I’m most familiar with, and it’s well supported with helpful modules and advice around the web.

Step one was building a list of what’s actually in the free download pool. Happily, natlib.govt.nz runs off the DigitalNZ API, so that information was findable programmatically.

import requests

dnz_api_key = 'My DigitalNZ API key'

def create_free_download_list():
    # Filter the search down to items flagged as free downloads
    facet = "and[atl_free_download]=True"
    result_count_url = "http://api.digitalnz.org/v3/records.json?api_key=%s&%s&per_page=0" % (dnz_api_key, facet)
    results_count_json = requests.get(result_count_url).json()
    count = results_count_json['search']['result_count']
    # Integer division, plus slack to cover the final partial page
    pages = count // 100 + 2
    with open('freedownloads.txt', 'w') as f:
        for page in range(1, pages):
            search_url = "http://api.digitalnz.org/v3/records.json?api_key=%s&%s&per_page=100&page=%d" % (dnz_api_key, facet, page)
            search_json = requests.get(search_url).json()
            for result in search_json['search']['results']:
                dnz_id = result['id']
                f.write('%s\n' % dnz_id)

(My query was a simple lookup of items within a specific ‘collection’. For more complex stuff, you might like to use Chris McDowall’s pydnz wrapper.)

I now had a list of DigitalNZ identifiers, and needed to check if any of them were already up on Flickr. I knew that the existing uploads included descriptions that linked back to their source record. If I found a match, I could strike it from the list.

This was a two-step process. First, I pulled down the description data for each item, using Flickr’s own API:

import untangle

def get_photo_info(i):
    # IDs of everything in the free download pool (helper sketched below)
    all_ids = get_pool_ids()
    # URL for a flickr.photos.getInfo call (helper sketched below)
    request_url = build_flickr_request(str(i))
    data = untangle.parse(request_url)
    for description in data.rsp.photo.description:
        photo_description = description.cdata
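Those two helpers aren’t shown in the post, so here’s a guess at their shape. The flickr.photos.getInfo method is real, but the file layout and environment variable name are my assumptions:

import os

def get_pool_ids():
    # Assumes freedownloads.txt holds one DigitalNZ ID per line
    with open('freedownloads.txt') as f:
        return set(int(line) for line in f if line.strip())

def build_flickr_request(photo_id):
    # Build a Flickr REST URL for the flickr.photos.getInfo method
    return ('https://api.flickr.com/services/rest/'
            '?method=flickr.photos.getInfo&api_key=%s&photo_id=%s'
            % (os.environ['FLICKR_API_KEY'], photo_id))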

Then, I processed the image’s description to look for a URL, using a regular expression. If it was a natlib.govt.nz URL, I checked the identifier against my list.

import re

def get_photo_info(i):
    # ...continued from above
    natlib_id = find_natlib_id(photo_description)

def find_natlib_id(description):
    # Pull the first URL out of the description text
    match = re.search('(?P<url>https?://[^"]+)', description)
    if not match:
        return False
    natlib_url = match.group('url')
    if '/records/' in natlib_url:
        # The natlib record ID is the last eight characters of the URL
        return natlib_url[-8:]
    return False

From there I worked out if the ID was already in my list of images in the pool and wrote it to the appropriate place.

def get_photo_info(i):
    # ...continued from above
    natlib_id = find_natlib_id(photo_description)
    if natlib_id:
        test_photo(natlib_id, all_ids)
    else:
        # No natlib URL in the description: flag it for a manual look
        with open('nonatlibrecord.txt', 'a') as f:
            f.write('%d\n' % int(i))

def test_photo(natlib_id, all_ids):
    if int(natlib_id) in all_ids:
        print "YAY"
        # In both the free download pool and Flickr Commons
        with open('imagesinboth.txt', 'a') as f:
            f.write('%d\n' % int(natlib_id))
    else:
        # On Flickr Commons but not in the free download pool
        with open('flickrcommons.txt', 'a') as f:
            f.write('%d\n' % int(natlib_id))

In the end, I had four lists:

  • In both places (Already done! Yay!)
  • In free downloads but not Flickr Commons (Let’s get this uploaded!)
  • In Flickr Commons but not free downloads (These need adding to the free downloads pool!)
  • Unknown (Take a closer look – it’s either a detail or missing a link to the source!)

My main concern was the second list, the 3,500 images that needed uploading with their metadata.
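In set terms, with the pool IDs and the Flickr-derived IDs loaded as two Python sets (the variable names here are mine), the buckets boil down to simple set arithmetic:

# pool_ids: from freedownloads.txt; flickr_ids: from the matching pass above
in_both     = pool_ids & flickr_ids   # already done, yay
to_upload   = pool_ids - flickr_ids   # the 3,500 to push up
flickr_only = flickr_ids - pool_ids   # need adding to the free downloads pool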

Let’s get the images

Our digitised images are much bigger than the versions you see on this site, and those high-res versions are what I wanted to share.

The versions held by the National Digital Heritage Archive are very large, very detailed tiffs, exactly right for pushing out to Flickr.

This is when I realised that the DigitalNZ identifiers I’d used in my list of files weren’t going to be any help. Those identifiers are applied after the NDHA has already taken the items in and preserved them (which is also after the cataloguing librarians have added their reference numbers). The NDHA has its own identification schema: Intellectual Entities, or IEs.

To get the image files, I needed to tell the NDHA which IE numbers I wanted. That information lives in the DNZ record metadata, in the dc_identifier field, so a quick call to the DNZ API, asking for that specific field, digs it out:

dnz_api_key = 'My DigitalNZ API key'

def get_IE_number(dnz_id):
    api_url = 'http://api.digitalnz.org/v3/records/%s.json?api_key=%s&fields=dc_identifier' % (str(dnz_id), dnz_api_key)
    response = requests.get(api_url).json()
    for i in response['record']['dc_identifier']:
        if 'ndha' in i:
            # Slice off the NDHA prefix to leave the bare IE number
            IE_number = i[7:]
            return IE_number
    return None

And after a few thousand iterations I could ask for the right files.

(As an aside, the NDHA usually delivers access copies as JPEG 2000 (jp2) files. Before asking for the tiffs, I grabbed some jp2s and turned them into jpgs using ImageMagick. However, it turns out jp2s don’t retain the metadata tiffs keep. That could be valuable information for a user one day, so I went back for the tiffs.)
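That conversion step was a simple shell-out to ImageMagick’s convert tool. Roughly this, assuming convert is on the PATH and the files sit in a local files/ folder:

import subprocess

def jp2_to_jpg(IE_number):
    # ImageMagick infers the formats from the file extensions
    subprocess.call(['convert',
                     'files/%s.jp2' % IE_number,
                     'files/%s.jpg' % IE_number])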

Authentication is a pain, let someone else do it

That might be bad advice. I’m new to this.

At this point I hit a brick wall trying to get my head around Flickr’s API. First, because I was unknowingly looking at out-of-date documentation. Then, on finding the current documentation, I had no idea how to implement its authentication requirements.

I got lucky: there are many Python wrappers for Flickr, giving you access to the functions of the API without having to think as hard about it. I used python-flickr-api and was up and running right away.

(Another aside: along with the DNZ API key, we’ve now got a Flickr API key and a Flickr secret key. Keeping that stuff in your script isn’t a great idea, so I stored them as environment variables. Here’s some info on setting environment variables in Mac OS Yosemite; they can then be read in a Python script like so: dnz_api_key = os.environ['DNZ_API_KEY'])
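Pieced together from python-flickr-api’s documentation, the one-time authorisation dance goes roughly like this (the environment variable names are my own convention):

import os
import flickr_api

flickr_api.set_keys(api_key = os.environ['FLICKR_API_KEY'],
                    api_secret = os.environ['FLICKR_API_SECRET'])

# One-time, interactive authorisation: open the URL, approve the app,
# then paste the verifier code back in
a = flickr_api.auth.AuthHandler()
print a.get_authorization_url('write')
a.set_verifier(raw_input('Verifier code: '))
flickr_api.set_auth_handler(a)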

Finally, the time had come to upload some images.

Metadata is better data

I wanted to make sure the images kept their context: basic catalogue information and a link back to the source.

Within the uploading process I looked this information up with another DigitalNZ query:

def get_metadata(dnz_id):
    # Only ask the API for the fields we actually need
    fields = ['title', 'description']
    api_url = 'http://api.digitalnz.org/v3/records/%s.json?api_key=%s&fields=%s' % (str(dnz_id), dnz_api_key, ','.join(fields))
    response = requests.get(api_url).json()
    title = response['record']['title']
    description = response['record']['description']
    return {'title': title, 'description': description}

I also scraped the record’s page for the existing, nicely formatted citation, extracting it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def add_citation(description, dnz_id):
    natlib_url = 'http://natlib.govt.nz/records/%s' % str(dnz_id)
    natlib_page = requests.get(natlib_url)
    natlib_html = natlib_page.text
    soup = BeautifulSoup(natlib_html, 'html.parser')
    # The citation lives in a span inside the 'usage' div on the record page
    citation = soup.find("div", {"class": "usage"})
    citation = citation.span.text
    description = '%s \n\n %s' % (description, citation)
    return description

Those functions are used in the course of actually uploading each item:

import sys

def upload(dnz_id, IE_number):
    try:
        photo_path = 'files/%s.tiff' % IE_number
        metadata = get_metadata(dnz_id)
        title = metadata['title']
        description = add_citation(metadata['description'], dnz_id)
        flickr_api.upload(photo_file = photo_path, title = title, description = description)
        print '%s uploaded!' % title
        # Touch a status file so reruns can skip this image
        open('files/%s.status' % IE_number, 'a').close()
        return True
    except:
        print "Unexpected error!", sys.exc_info()[0]
        # Touch a failure file so we can retry these later
        open('files/%s.failure' % IE_number, 'a').close()
        return False
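Stitching it all together, the driver amounted to something like the loop below. The file name toupload.txt is my stand-in for the “in free downloads but not Flickr” list built earlier:

# Hypothetical driver over the to-upload list, one ID per line
with open('toupload.txt') as f:
    for line in f:
        dnz_id = line.strip()
        IE_number = get_IE_number(dnz_id)
        if IE_number:
            upload(dnz_id, IE_number)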

And bam: 100 gigs of images make it out into the wider world!

Now, go explore, download, and turn these fantastic pictures into whatever you want.


I got a bunch of help from:

  • Chris McDowall
  • Michael Lascarides
  • Dan Charles
  • Cynthia Wu
  • Greig Roulston
  • Jay Buzenberg

By Reuben Schrader

Reuben is your friendly neighbourhood online editor.

Comments
Harry Chapman November 23rd at 9:02PM

Hi Reuben - thanks for the post and for your efforts. I strongly encourage you to also upload the images to Wikimedia Commons too if possible :-)

Siobhan Leachman November 26th at 6:59PM

Hi Reuben

I've been doing some machine tagging of the images, following the "how to" in both http://blpublicdomain.wikispaces.com/Machine+tags and http://biodivlib.wikispaces.com/file/view/Tagging_Tutorial_and_FAQ.pdf/514108856/Tagging_Tutorial_and_FAQ.pdf. Hopefully this will ensure the images are more widely used. Thanks for doing this, and may the Library continue to add to Flickr Commons!

Tara Calishain November 29th at 7:27AM

Great stuff! Thanks for taking the time to document how you used Python to get this done. It'll be in today's (Saturday's) ResearchBuzz.

Jessamyn November 30th at 11:03AM

This is the BEST. Thanks for not only doing it but explaining how and sharing the code.

Jane December 1st at 12:26PM

If these are "free to download" why does yahoo require me to login to view the various sizes? Or create a yahoo account.

Reuben Schrader (National Library) December 1st at 1:30PM

Hi Jane, that's really annoying - it never occurred to me the images would be walled off like that, so I didn't test it.

I guess we have to consider this a step in the open access direction, and think about some other ways we can make the images available.

Amy Joseph December 1st at 2:46PM

Great post, Reuben. Awesome outcome, and cool to see your process set out.