[A guest post from Federico Nanni, who is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna.]
The University of Bologna is considered by historians to be the world’s oldest university in terms of continuous operation. At the university’s Centre for the History of Universities and Science, our research group works on different projects focused on understanding the history of this academic institution and its long-term socio-political relationships. Several sources and methods have been used for studying its past, from quantitative data analysis to online databases of scientific publications.
Since the introduction of the World Wide Web, a new and different kind of primary source has become available for researchers: born digital documents, materials which have been shared primarily online and which will become increasingly useful for historians interested in studying recent history.
However, these sources are already more difficult to preserve than traditional ones, and this is especially true of the University of Bologna’s digital past. Italy does not have a national archive for the preservation of its web-sphere and, furthermore, “Unibo.it” has been excluded from the Wayback Machine.
For this reason, I have focused my research (a forthcoming piece on this will be available in Digital Humanities Quarterly) primarily on understanding how to deal with and solve this specific issue, in order to reconstruct the University of Bologna’s digital past and to understand whether these materials can offer us a new perspective on the recent history of this institution.
In order to understand the reasons for the removal of Unibo.it, my first step was to look, in the exclusion policy of the Internet Archive, for information related to the message “This URL has been excluded from the Wayback Machine”, which appeared when searching for “http://www.unibo.it”.
As described in the Internet Archive’s FAQ section, the most common reason for this exclusion is that a website explicitly requests not to be crawled by adding the lines “User-agent: ia_archiver” and “Disallow: /” to its robots.txt file. However, it is also explained that “Sometimes a website owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a “blocked site error” message, that means that a site owner has made such a request and it has been honoured. Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only. When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent”.
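As a rough illustration of how such a robots.txt rule behaves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content is the two-line rule quoted above; the `is_blocked` helper, the `SomeOtherBot` user agent and the `example.org` URL are hypothetical and only serve the example. It says nothing about how the Internet Archive itself evaluates these rules.

```python
from urllib.robotparser import RobotFileParser

# The exclusion rule quoted in the FAQ, written as two robots.txt lines.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # Parse the rules from an in-memory string instead of fetching them.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Placeholder URL; any path on the site is covered by "Disallow: /".
blocked_for_ia = is_blocked(ROBOTS_TXT, "ia_archiver", "http://www.example.org/")
blocked_for_others = is_blocked(ROBOTS_TXT, "SomeOtherBot", "http://www.example.org/")

print("ia_archiver blocked:", blocked_for_ia)
print("other crawler blocked:", blocked_for_others)
```

The rule only targets the crawler named in the `User-agent` line, so other crawlers remain unaffected. This kind of exclusion is declared publicly in the site's own robots.txt file, unlike a removal requested directly by email.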
When a website has not been archived due to robots.txt limitations, a specific message is displayed. This is different from the one that appeared when searching for the University of Bologna website, as you can see in the figures below. Therefore, the only possible conclusion is that someone explicitly requested the removal of the University of Bologna website (or more likely, only a specific part of it) from the Internet Archive.
For this reason, I decided to consult CeSIA, the team that has supervised Unibo.it during the last few years, regarding this issue. However, they had not submitted any removal request to the Internet Archive and they were not aware of anyone having submitted one.
To clarify this issue and discover whether the website of this institution has been somehow preserved during the last twenty years, I further decided to contact the Internet Archive team at firstname.lastname@example.org (as suggested in the FAQ section).
Thanks to the efforts of Mauro Amico (CeSIA), Raffaele Messuti (AlmaDL – Unibo), Christopher Butler (Internet Archive) and Giovanni Damiola (Internet Archive), we began to collaborate at the end of March 2015. As Butler told us, this case was very similar to another one that involved the New York Government Websites.
With their help, I discovered that a removal request regarding the main website and a list of specific subdomains had been submitted to the Wayback Machine in April 2002.
Thanks to our efforts, the university’s main website became available again on the Wayback Machine on the 13th of April 2015. However, neither the Internet Archive nor CeSIA has any trace of the email requests. For this reason, CeSIA decided to keep the other URLs in the list excluded from the Wayback Machine for the moment, as it is possible that this request was made for a specific legal reason.
In 2002 the administration of Unibo.it changed completely, during a general re-organization of the university’s digital presence. Therefore, it is entirely unclear who, in that very same month, could have sent this specific request, and for what reason.
However, it is evident that this request was made by someone who knew how the Internet Archive exclusion policy works, as he/she explicitly declared a specific list of subdomains to remove (in fact, the Internet Archive excludes based on URLs and their subsections – not subdomains). It is possible that the author obtained this specific knowledge by contacting the Internet Archive directly and asking for clarification.
Even though thirteen years have passed, my assumption was that someone involved in the administration of the website would have remembered at least this email exchange with a team of digital archivists in San Francisco. So, between April and June 2015 I conducted a series of interviews with several of the people involved in the Unibo.it website, before and after the 2002 reorganization. However, no one had memories or old emails related to this specific issue.
As the specificity of the request is the only hint that could help me identify its author, I decided to analyze the different URLs in more detail. The majority of them are server addresses (identified by “alma.unibo”), while the other pages are subdomains of the main website, for example estero.unibo.it (probably dedicated to international collaborations).
My questions now are: why did someone want to exclude exactly these pages and not all the department pages, which had an extremely active presence at that time? Why exactly these four subdomains and not the digital magazine Alma2000 (alma2000.unibo.it) or the e-learning platform (www.elearning.unibo.it)? It is possible that this precise selection was made for a specific reason, which could offer us a better understanding of the use and the purpose of this platform in those years.
To conclude, I would also like to point out how strange this specific impasse is: given that we don’t know the reason for the request, I cannot obtain permission from CeSIA, the current administrator, to analyze the snapshots of these URLs. At the same time, however, we are not able to find anyone who remembers sending the request, and no evidence of it has been preserved. In my opinion, this perfectly depicts a new level of difficulty that future historians will encounter while investigating our past in the archives.
Federico Nanni is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna. His research is focused on understanding how to combine methodologies from different fields of study in order to face both the scarcity and the abundance of born digital sources related to the recent history of Italian universities.