Web Harvest 2010: One week

How much can you download in a week?

After seven days of harvesting we've collected over 2.6 TB of data from more than 50 million URLs. The current average crawl rate is 149 URLs per second.
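
For a sense of scale, these round figures can be turned into averages with some back-of-envelope arithmetic. A minimal sketch in Python, using the rounded totals reported above (decimal units assumed):

    SECONDS_PER_WEEK = 7 * 24 * 3600      # 604,800 seconds

    urls_harvested = 50_000_000           # "more than 50 million URLs"
    data_collected_tb = 2.6               # terabytes collected so far
    current_rate = 149                    # URLs per second at the moment

    # Average rate implied by the weekly totals; the crawl speeds up and
    # slows down, so this differs from the instantaneous rate quoted above.
    avg_rate_urls_s = urls_harvested / SECONDS_PER_WEEK
    avg_throughput_mb_s = data_collected_tb * 1e6 / SECONDS_PER_WEEK
    avg_resource_kb = data_collected_tb * 1e9 / urls_harvested

    print(f"Week-long average: {avg_rate_urls_s:.0f} URLs/s "
          f"(current rate: {current_rate} URLs/s)")
    print(f"Average throughput: {avg_throughput_mb_s:.1f} MB/s")
    print(f"Average resource size: {avg_resource_kb:.0f} KB per URL")

That works out to roughly 83 URLs per second averaged over the whole week, about 4.3 MB per second of sustained throughput, and an average resource size of around 52 KB.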

We now estimate the final collection will be between 4 and 5 TB compressed (compared with about 3 TB compressed in 2008).

On a technical level, everything is going well, except that a hard disk failed over the weekend and we lost a log file. No harvested content was lost, because all downloaded data is immediately backed up to a data repository. We hope to recover the log file too, once the harvest is over.

There's so much data coming in that it is hard to track exactly what is being harvested in real time, but here are the top ten reported media types (a sketch of how such a tally can be produced follows the table):

      URLs   Media type
31,567,006   text/html
 6,908,737   image/jpeg
 1,642,548   image/gif
   563,463   image/png
   510,400   application/pdf
   311,524   text/xml
   247,833   text/plain
   196,715   text/css
   178,576   application/rss+xml
   123,638   no-type
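
A tally like the one above can be produced by counting the MIME-type column of the crawler's logs. Here is a minimal sketch, assuming Heritrix-style crawl.log files in which the reported MIME type is the seventh whitespace-separated field; it is illustrative only, not necessarily how the report above was generated:

    from collections import Counter
    import sys

    def top_media_types(log_paths, n=10):
        """Tally MIME types from Heritrix-style crawl.log files.

        Assumes the standard crawl.log layout, where the reported MIME
        type is the seventh whitespace-separated field on each line.
        """
        counts = Counter()
        for path in log_paths:
            with open(path, errors="replace") as log:
                for line in log:
                    fields = line.split()
                    if len(fields) >= 7:
                        counts[fields[6]] += 1
        return counts.most_common(n)

    if __name__ == "__main__":
        # Usage: python top_media_types.py crawl.log [crawl.log.1 ...]
        for mime, count in top_media_types(sys.argv[1:]):
            print(f"{count:>12,} {mime}")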

By Gordon Paynter

Gordon likes libraries.
