[A recent post, cross-posted from Peter’s own blog.]
Towards the end of 2013 the UK saw a public controversy seemingly made to showcase the value of web archives. The Conservative Party, in what I still think was nothing more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of their site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:
Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.
At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.
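To make the mechanism in that passage concrete, here is a minimal sketch in Python of what retrospective observance of robots.txt amounts to. It is an illustration under stated assumptions, not the Internet Archive's actual implementation: the robots.txt content, the `example-archive-bot` user agent, the `example.org` URLs and the `playback_allowed` helper are all hypothetical. The only real API used is the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not any real site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /speeches/
"""


def playback_allowed(current_robots_txt: str, url: str,
                     user_agent: str = "example-archive-bot") -> bool:
    """Return True if the site's *current* robots.txt permits this URL.

    Under a retrospective policy, today's rules are applied to every
    archived copy, whenever it was captured. The capture date therefore
    plays no part in the decision.
    """
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A copy captured years before the rule existed is still withheld...
speech_visible = playback_allowed(
    ROBOTS_TXT, "https://www.example.org/speeches/2005/conference.html")
# ...while content outside the disallowed path stays available.
news_visible = playback_allowed(
    ROBOTS_TXT, "https://www.example.org/news/latest.html")
# Removing the rule makes the withheld copy visible again.
speech_restored = playback_allowed(
    "", "https://www.example.org/speeches/2005/conference.html")
```

The point of the sketch is that nothing is deleted: the archived copy is untouched throughout, and only the answer to "may this be shown?" changes with the live site's current file.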
As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.
This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, one whose values I deplore but which we, as scholars, need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.
Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, that consciousness begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative Party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.
Put simply: the content I’ve been citing this week could disappear from view later today if the organisation concerned wanted it to and came to understand how to make that happen. It is possible, in short, effectively to delete the archive – which is rather terrifying.
In the UK, at least, this danger is removed for content published after 2013, thanks to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide.) In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.