Web Harvest 2010: One week
May 21st, 2010 · By Gordon Paynter
How much can you download in a week?
After seven days of harvesting, we've collected over 2.6 TB of data from more than 50 million URLs. The current average crawl rate is 149 URLs per second.
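For a back-of-the-envelope sense of these figures, here is a quick sketch. It assumes decimal units (1 TB = 10¹² bytes) and takes the rounded numbers reported above at face value, so the results are approximate:

```python
# Rough per-URL and per-second figures from the one-week totals above.
# Assumes decimal units and the rounded counts from the post.

TB = 10**12
data_bytes = 2.6 * TB          # "over 2.6 TB of data"
urls = 50_000_000              # "more than 50 million URLs"
seconds = 7 * 24 * 3600        # seven days of harvesting

avg_bytes_per_url = data_bytes / urls
avg_urls_per_sec = urls / seconds

print(f"~{avg_bytes_per_url / 1000:.0f} KB per URL on average")   # ~52 KB
print(f"~{avg_urls_per_sec:.0f} URLs/s averaged over the week")   # ~83 URLs/s
```

The whole-week average (about 83 URLs/s) is well below the current 149 URLs/s, which suggests the crawl rate has increased as the harvest progressed.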
We now estimate the final collection will be between 4 and 5 TB compressed (compared with about 3 TB compressed in 2008).
On a technical level, everything is going well, except that a hard disk failed over the weekend and we lost a log file. No data was lost, because all downloaded content is immediately backed up to a data repository. We hope to recover the log file too, once the harvest is over.
There's so much data coming in that it is hard to track exactly what is being harvested in real time, but here are the top ten reported media types: