Nick Ruest, Anna St-Onge, and I wrote a piece in the open-access journal Digital Studies / Le champ numérique, “The Great WARC Adventure: Using SIPS, AIPS, and DIPS to Document SLAPPS.” Behind the deliberately acronym-heavy title is a piece that does the following:
takes readers through the process of creating a web archive using open-source tools;
shows how to preserve and provide access to the web archive;
and demonstrates some basic analysis of the collection from the perspective of a historian.
While the long publishing time meant that some of our more recent approaches to analyzing web archives – warcbase, for example – didn’t make it in, the article hopefully provides a useful conceptual approach to working with web archives.
We thought that this post from December 2015 was still relevant today. In short, it shows how you can take web archive network files generated by our research team and analyze them yourselves using the open-source Gephi package.
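If you want a sense of what those network files look like before opening Gephi, here is a minimal sketch of the general idea: a link graph boiled down to a CSV edge list, which Gephi can load through its spreadsheet importer. The link pairs below are hypothetical stand-ins, not data from our research team's actual files.

```python
import csv

# Hypothetical (source, target) link pairs of the kind a web archive
# link graph contains; real files would hold millions of such edges.
links = [
    ("liberal.ca", "conservative.ca"),
    ("ndp.ca", "liberal.ca"),
    ("liberal.ca", "ndp.ca"),
]

# Gephi imports a plain CSV edge list with "Source,Target" headers
# via File > Import Spreadsheet.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    writer.writerows(links)
```

From there, the workflow described in the post (layouts, community detection, and so on) happens inside Gephi itself.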
Let us assume that the internet is here to stay, and that it becomes still more pivotal to have solid scholarly knowledge about the development of the internet of the past with a view to understanding the internet of the present and of the future. On the one hand, past events constitute important preconditions for today's internet; on the other, the mechanisms behind past developments may prove very helpful for understanding what is about to happen with the internet today.
For more than four decades the internet has continued to grow and spread, to the extent that today it is an indispensable element in the communicative infrastructure of many countries. Although the history of the internet has not been prominent within the academic literature, an increasing number of books and journal articles over the last decade attest to the fact that internet historiography is an emerging field of study within internet studies, as well as within studies of culture, media, communication, and technology.
However, historical studies of the internet have mainly been published in journals related to a variety of disciplines, and these journals only rarely publish articles with a clear historical focus. Therefore, the editors of Internet Histories found that there was a need for a journal where the history of the internet and digital cultures is the main focus: a journal where historical studies are presented, and theoretical and methodological issues are debated, with a view to constituting the history of the internet as a field of study in its own right.
Internet Histories embraces empirical as well as theoretical and methodological studies within the field of the history of the internet broadly conceived — from early computer networks, Usenet, and Bulletin Board Systems, through the everyday internet of the web, to the emergence of new forms of internet via mobile phones and tablet computers, social media, and the internet of things. The journal will also provide the premier outlet for cutting-edge research in the closely related area of histories of digital cultures.
The title of the journal, Internet Histories, suggests there is not one single and fixed Internet history going straight from Arpanet to the Internet as we know it today, from the United States to a world-wide network. Rather, there are multiple local, regional and national paths and a variety of ways that the internet has been imagined, designed, used, shaped, and regulated around the world. Internet Histories aims to publish a range of scholarship that examines the global and internetworked nature of the digital world as well as situated histories that account for diverse local contexts.
They were both fascinating talks, available via YouTube above. Richard's talk explored what Big Data means for historians and recounted his experience of working with the Archive Team torrent. To me, it really underscored the importance of doing web history: the web really is the record of our lives today, and we need to hope that there are people there to back up this sort of information!
It was followed by David Bohnett, who explained the idea behind GeoCities, some of the technical challenges he faced, and really what it was like to preside over such explosive growth during the dot com era. As somebody who has explored ideas of GeoCities as a community before, I was interested to hear how much emphasis his talk placed upon the neighbourhood structure, volunteer community leaders, and what this all meant for bringing people together. As a writer on this topic, it was pretty interesting and reassuring to hear that my own ideas weren't off kilter!
I was also surprised, although perhaps I shouldn't have been, by his attitude towards the closure of GeoCities in 2009 by Yahoo! (which bought it in 1999) – that it was “better shut down than to go on as this abandoned version of its former self.” Fair enough, I suppose, but again – to echo Richard's opening talk – thank god that Archive Team and the Internet Archive were there to preserve this information…
Anyways, check the video out for yourself if you’re interested.
Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!
The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be explored as well. Within the last decade, considerable interest in the history of the Web has emerged. However, there is…
The web archiving community is a great one, but it can sometimes be a bit confusing to enter. Unlike communities such as the Digital Humanities, which has developed aggregation services like DH Now, the web archiving community is a bit more dispersed. But fear not, there are a few places to visit to get a quick sense of what’s going on. Here I just want to give a quick rundown of how you can learn about web archiving on social media, from technical walkthroughs, and from blogs.
I’m sure I’m missing stuff – let us all know in the comments!
A substantial amount of web archiving scholarship happens online. I use Twitter (I'm at @ianmilligan1), for example, as a key way to share research findings and ideas as my project comes together. I usually try to tag them with #webarchiving, which means that all tweets using that hashtag show up in a single timeline.
Since the introduction of the World Wide Web, a new and different kind of primary source has become available for researchers: born digital documents, materials which have been shared primarily online and which will become increasingly useful for historians interested in studying recent history.
However, these sources are already more difficult to preserve than traditional ones. This is especially true of the University of Bologna's digital past: Italy does not have a national archive for the preservation of its web-sphere, and, furthermore, “Unibo.it” has been excluded from the Wayback Machine.
For this reason, I have focused my research (a forthcoming piece on this will be available in Digital Humanities Quarterly) primarily on understanding how to address this specific issue in order to reconstruct the University of Bologna's digital past, and to understand whether these materials can offer us a new perspective on the recent history of this institution.
In order to understand the reasons for the removal of Unibo.it, my first step was to look in the Internet Archive's exclusion policy for information related to the message “This URL has been excluded from the Wayback Machine”, which appeared when searching “http://www.unibo.it”.
As described in the Internet Archive’s FAQ section, the most common reason for this exclusion is when a website explicitly requests to not be crawled by adding the string “User-agent: ia_archiver Disallow: /” to its robots.txt file. However, it is also explained that “Sometimes a website owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a “blocked site error” message, that means that a site owner has made such a request and it has been honoured. Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only. When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent”.
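For reference, the directive quoted above is normally written as two lines in a site's robots.txt file, like so:

```
# Asks the Internet Archive's crawler (ia_archiver) to skip the whole site
User-agent: ia_archiver
Disallow: /
```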
When a website has not been archived due to robots.txt limitations, a specific message is displayed. This differs from the one that appeared when searching the University of Bologna website, as you can see in the figures below. Therefore, the only possible conclusion is that someone explicitly requested the removal of the University of Bologna website (or, more likely, only a specific part of it) from the Internet Archive.
For this reason, I consulted CeSIA, the team that has supervised Unibo.it over the last few years, about this issue. However, they had not submitted any removal request to the Internet Archive, and they were not aware of anyone who had.
To clarify this issue and discover whether the website of this institution has been somehow preserved during the last twenty years, I further decided to contact the Internet Archive team at email@example.com (as suggested in the FAQ section).
Thanks to the efforts of Mauro Amico (CeSIA), Raffaele Messuti (AlmaDL – Unibo), Christopher Butler (Internet Archive), and Giovanni Damiola (Internet Archive), we began to collaborate at the end of March 2015. As Butler told us, this case was very similar to another one involving New York government websites.
With their help, I discovered that a removal request regarding the main website and a list of specific subdomains had been submitted to the Wayback Machine in April 2002.
With our efforts, the university's main website became available again on the Wayback Machine on 13 April 2015. However, neither the Internet Archive nor CeSIA has any trace of the email requests. For this reason, CeSIA decided to keep the other URLs in the list excluded from the Wayback Machine for the moment, as it is possible that the request was made for a specific legal reason.
In 2002 the administration of Unibo.it changed completely, during a general re-organization of the university's digital presence. Therefore, it is entirely unclear who, in that very same month, could have sent this specific request, and for what reason.
However, it is evident that the request was made by someone who knew how the Internet Archive's exclusion policy works, as they explicitly specified a list of subdomains to remove (in fact, the Internet Archive excludes based on URLs and their subsections – not subdomains). The author may have obtained this specific knowledge by contacting the Internet Archive directly and asking for clarification.
Even though thirteen years had passed, my assumption was that someone involved in the administration of the website would remember at least this email exchange with a team of digital archivists in San Francisco. So, between April and June 2015, I conducted a series of interviews with several of the people involved in the Unibo.it website, before and after the 2002 reorganization. However, no one had any memories or old emails related to this specific issue.
As the specificity of the request is the only hint that could help me identify its author, I decided to analyze the different URLs in more detail. The majority of them are server addresses (identified by “alma.unibo”), while the other pages are subdomains of the main website, for example estero.unibo.it (probably dedicated to international collaborations).
My questions now are: why did someone want to exclude exactly these pages, and not the department pages, which had an extremely active presence at the time? Why exactly these four subdomains, and not the digital magazine Alma2000 (alma2000.unibo.it) or the e-learning platform (www.elearning.unibo.it)? This precise selection may be tied to a specific reason, one that could offer us a better understanding of the use and purpose of this platform in those years.
To conclude, I would like to point out how strange this specific impasse is: because we do not know the reason for the request, I cannot obtain permission from CeSIA, the current administrator, to analyze the snapshots of these URLs. At the same time, we cannot find anyone who remembers sending the request, and not a single trace of it has been preserved. In my opinion, this perfectly illustrates a new level of difficulty that future historians will encounter while investigating our past in the archives.
Federico Nanni is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna. His research is focused on understanding how to combine methodologies from different fields of study in order to face both the scarcity and the abundance of born digital sources related to the recent history of Italian universities.
I'd like to imagine that humanists or social scientists who want to use web archives are often in the same position I was four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to provide easy-to-follow walkthroughs that allow users to do the basic things to get started:
Link visualizations to explore networks, finding central hubs, communities, and so forth;
Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
Overall statistics to find over- and under-represented domains, platforms, or content types;
And basic n-gram-style navigation to monitor and explore change over time.
All of this is relatively easy for web archive experts to do, but still difficult for end users.
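As a toy illustration of the “overall statistics” idea above, here is a minimal sketch that tallies captures per domain from a list of URLs. The URLs are hypothetical stand-ins for what you might extract from a collection's index; it is not warcbase itself, just the underlying counting logic.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URLs of the kind a web archive index contains;
# a real collection would have millions of entries.
urls = [
    "http://www.geocities.com/EnchantedForest/1234/",
    "http://www.geocities.com/Heartland/5678/",
    "http://www.example.org/page.html",
]

# Tally captures per domain to spot over- and under-represented hosts.
domains = Counter(urlparse(u).netloc for u in urls)
for domain, count in domains.most_common():
    print(domain, count)
```

The same pattern scales up: swap the toy list for URLs streamed out of a real collection and the frequency table becomes the domain-level overview described above.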
The Warcbase wiki, still under development, aims to fix that. Please visit, comment, and fork – we hope to develop it alongside all of you.
I had a fantastic time at the International Internet Preservation Consortium’s Annual General Meeting this year, held on the beautiful campus of Stanford University (with a day trip down to the Internet Archive in San Francisco). It’s hard to write these sorts of recaps: I had such an amazing time, my head filled with great ideas, that it’s difficult to give everything the justice that they deserve. Many of the presentation slide decks are available on the schedule, and videos will be forthcoming.
My main takeaways: we’re continuing to see the development of sophisticated access tools to these repositories, coupled with increasingly exciting and sophisticated researcher use of them. There’s a recognition that context matters when understanding archived webpages, a phrase that came up a few times throughout the event. Crucially, there was a lot of energy in the room: there’s a real enthusiasm towards making these as accessible as possible and facilitating their use. I wasn’t exaggerating when I noted to one of the organizers that I wish every conference was like this: leaving me on my flight home with lots of fantastic ideas, hope for the future, and excitement about what can be done. As the recent “Conference Manifesto” in the New York Times noted, that’s not the experience at all conferences!
Historians who work with, or who are thinking about working with, web archives will be excited about the announcement that Archive-It Research Services made on March 17th. They’re widely expanding the sort of data that they provide to researchers. As they put it in their announcement:
The service will allow any Archive-It partner to give users, researchers, scholars, developers, and other patrons easily-analyzed datasets that contain key metadata elements, link graphs, named entities, and other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS will facilitate new types of use, research, and analysis of the significant historical records from the web that Archive-It partners are working to collect, preserve, and make accessible.
They’re making three types of datasets available, the first of which are WAT files, which contain metadata about websites.
From WATs, you can get metadata descriptions for websites, the links that they point towards, the anchor text of those links, and crawl information. Ian Milligan, one of the co-authors of this blog, has been using WAT files to analyze Canadian political history websites: see some results here and here (including a guided video tour of some results).
LGA and WANE files are unfamiliar to the two authors of this blog, although they look to be very useful! LGA files accelerate the ability to do longitudinal link analysis from WAT files. The examples they give are actually from the Canadian Political Interest Groups collection! Finally, WANE files use Stanford NER to extract information relating to people, organizations, and locations. Using derived text from a web archive, Milligan plotted all the locations mentioned in GeoCities – you can see the results here.
To get these files, consult the service details page. In short, if you're an Archive-It partner you can order them internally through your dashboard. For the rest of us researchers, you just need to send in an e-mail with some information to start the process.
In short: an amazing move that's really going to unlock these files. WARC files are really big – too big for most systems – whereas the LGA, WAT, and WANE model is going to make web archive research accessible. Kudos to Archive-It.