“I could have confronted a million monkeys in the Himalayas…”
June 27th, 2012. By maxine
I’m getting a bit sick of all these Dacobra movies and TV shows
“DACOBRA, like Sherlock Holmes, Captain Kettle, Dr. Nikola, and other celebrated names in modern fiction, will become a household name all the world over, and you will do well not to miss the opportunity our columns will afford of making the early acquaintance of this great new character.”
DACOBRA! The name was splashed across the pages of the Saturday supplement to the Auckland Star, promising adventure and something vaguely Eastern. The edition of March 8, 1902, proudly touted its new serial, a sure “sensation in the literary world” and source of “unusual pleasure”.
Serialised fiction was a staple of both metropolitan and local papers, particularly in the decades around the turn of the century. The stories were the length of hefty novels – often because they were, chopped up for regular reading.
The newspapers of New Zealand had hundreds of them, and now so does Papers Past.
From the printing press to your screen
I was curious about what it would take to pull one of these stories out of Papers Past, where the original pages have been scanned and transcribed with OCR, and turn it into something you could read all together.
Turns out, it still takes a lot. No one’s going to be piping these stories straight to their Kindle any time soon. Even with all the technological underpinnings, the path from A to B requires an awful lot of human intervention.
The trouble with w’s, and other problems
Papers Past isn’t just scanned – it comes with a text equivalent that makes it searchable. That’s been created by running the scans through optical character recognition, or OCR. It does pretty well, but it’s not a perfect process, and in some cases you get very odd results.
Though the sheer variety of mistakes makes it hard to do anything programmatic (like automatically turning every instance of nr:u|=#\j into monkey), there are some patterns that can smooth the path.
Large chunks of this story were having real trouble with w, frequently turning it into Av and producing sentences like “Ave Avish you Avould Awash Avith Avater”. I could try replacing every Av with w, but of course there are entirely legitimate words like avuncular and flavour and moshav.
More piecemeal, but less likely to lead to other problems, was replacing particular words throughout the text. Just swapping out Avould, Avich, and Avhere with would, which, and where saved me possibly hundreds of edits. What I couldn’t avoid were the hundreds of dots and scratches the OCR process had decided were commas, periods, or semicolons.
On occasion, columns hadn’t been properly separated before the pages were run through the transcription. The text became a solid clump of half sentences running into each other, and it’s really annoying. At a couple of points I found it easier to transcribe the text myself instead of fixing the machine-made version. Robot uprising avoided.
Even more troubling, there’s a whole Saturday supplement (where the story ran in the Star) missing from the database. The collections can only be as good as what’s collected, of course, and apparently that week’s bonus pages never made it to the Library in the first place.
Luckily, the Auckland Star wasn’t as widely available as the Sunday Star-Times of today. The Hawera & Normanby Star therefore had a reason to bring DACOBRA to a whole new audience, and their supplements made it into Papers Past.
Actually, Papers Past is really really good
It’s not all bad news, of course. It’s phenomenal to me that this 110-year-old newspaper is even more accessible to me than it was to its original readers. The processes, software, and flat-out work that have gone into Papers Past blow me away. As someone who was doing research with historic materials 10 years ago, I’m envious of what the kids today have to play with. For example, you could go in, yank out and consolidate a serialised story, clean it up, and fire it out to the world…
Papers Past sits on a platform called Veridian, which also underlies a lot of full-text newspaper collections out there. Later versions (which we’re eying up, and which others like Trove are currently using) include some very nice features that would help my book-extraction out, like user corrections.
Papers Past could be an even better resource if users could correct errors they find. Right now, messed-up text isn’t just ugly; it also makes search results less useful. It could be made a whole lot better with the contributions of our amazing and dedicated users. It’d be like Distributed Proofreaders, but with newspapers.
Mr. Maxwell, monkey-shooter
So was it worth it? Is the story all I’d hoped for? Good grief no. It’s actually quite bad. Unsympathetic characters with muddled motivations, a mystery that delivers little interest when revealed, and an utter lack of economy in the writing (yes, I’m one to talk).
Lionel Maxwell’s an unsympathetic character, but I can’t tell if that’s the author’s intent. Aside from the depressingly unsurprising racism, sexism, colonialism, and ignore-the-plebs-ism that come naturally to characters of the day, our sculptor-hero is also arrogant in his art, unreflective in his understanding of the world, and generally a bad friend.
Maybe the plot or world-building could have made up for it. As I made my way through the early chapters, I thought I was in for something almost Lovecraftian: a well-travelled scholarly type finds some truth in those weird old tales in a strange, out-of-the-way part of the world. Well, Scotland. But it just never gets that weird, or that horror-laden.
Even the inherent funniness of monkeys, which pop up throughout the story, is chilled somewhat by the inclination of characters to kill them every chance they get.
Still, please do give it a go, especially since I spent so long cleaning it up for you. Maybe on a quiet Saturday morning, when you’re done with the paper.
I made the PDF in Apple’s iBooks Author, before realising it doesn’t export to ePub. Darn it.
This isn’t a pure transcript – I fixed a lot of typos from the original newspaper. Ideally the text version on Papers Past would be faithful to the source, though.