Monthly Archives: July 2015

Conference dispatches from Aarhus: Web Archives as Scholarly Sources

Some belated reflections on the excellent recent conference at Aarhus University in Denmark, on Web Archives as Scholarly Sources: Issues, Practices and Perspectives (see the abstracts on the conference site).

As well as an opportunity to speak myself, it was a great chance to catch up with what is a genuinely global community of specialists, even if (as one might expect) the European countries were particularly well represented this time. It was also particularly pleasing to see a genuine intermixing of scholars with the librarians and archivists whose task it is to provide scholars with their sources. As a result, the papers were an eclectic mix of method, tools, infrastructure and research findings; a combination not often achieved.

Although there were too many excellent papers to mention them all here, I draw out a few to illustrate this eclecticism. There were discussions of research method as applied both to the close reading of small amounts of material (Gebeil, Nanni) and to very large datasets (Goel and Bailey). We also heard about emerging tools for better harvesting of web content, and for providing access to the archived content (Huurdeman).

Particularly good to see were the first signs of work that was beginning to go beyond discussions of method (“the work I am about to do”) to posit research conclusions, even if still tentative at this stage (Musso amongst others), and critical reflection on the way in which the archived web is used (Megan Sapnar Ankerson). It was also intriguing to see an increased focus on understanding the nature of a national domain, particularly in Anat Ben-David’s ingenious reconstruction of the defunct .yu domain of the former Yugoslavia. Good to see too were the beginnings of a reintegration of social networks into the picture (Milligan, Weller, McCarthy), difficult to archive though they are, and some attention to the web before 1996 and the Internet Archive (Kevin Driscoll on BBS).

All in all, it was an excellent conference, and congratulations are due to Niels Brügger and the organising team for pulling it off.

Have Web Collections? Want Link and Text Analysis?

(x-posted with

The Warcbase wiki in action!

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.

One collaboration has been really fruitful. Together with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David Cheriton School of Computer Science, we’ve been developing scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!

I’d like to imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that by providing easy-to-follow walkthroughs that show users how to do the basic things they need to get started:

  • Link visualizations to explore networks, finding central hubs, communities, and so forth;
    [Figure: A dynamic visualization generated with warcbase and Gephi]

  • Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
  • Overall statistics to find over- and under-represented domains, platforms, or content types;
  • And basic n-gram-style navigation to monitor and explore change over time.
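
To make the first and third items a little more concrete, here is a minimal Python sketch of the kind of aggregation involved. It is not warcbase code and does not use the warcbase API. It assumes the links have already been extracted from the ARC/WARC files into (crawl date, source URL, target URL) records. The sample records and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample records: (crawl date as YYYYMMDD, source URL, target URL).
# In practice these would come from a link-extraction step over ARC/WARC files.
SAMPLE_LINKS = [
    ("20050601", "http://www.example.ca/index.html", "http://news.example.org/story"),
    ("20050601", "http://www.example.ca/about.html", "http://news.example.org/"),
    ("20060715", "http://www.example.ca/index.html", "http://blog.example.net/post"),
    ("20060715", "http://blog.example.net/post", "http://www.example.ca/"),
]


def domain(url):
    """Return the host part of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_link_counts(records):
    """Count links between pairs of domains (an edge list for a network graph)."""
    return Counter((domain(src), domain(dst)) for _, src, dst in records)


def sources_per_domain_by_year(records):
    """Count link-source records per (year, domain) to follow change over time."""
    return Counter((date[:4], domain(src)) for date, src, _ in records)


if __name__ == "__main__":
    for (src, dst), n in sorted(domain_link_counts(SAMPLE_LINKS).items()):
        print(f"{src} -> {dst}: {n}")
    for (year, dom), n in sorted(sources_per_domain_by_year(SAMPLE_LINKS).items()):
        print(f"{year} {dom}: {n}")
```

The domain-to-domain counts form a weighted edge list that could be saved as a CSV file and opened in a graph tool such as Gephi, while the per-year counts give a rough sense of which domains dominate a collection at different times.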

All of this is relatively easy for web archive experts to do, but still difficult for end users.

The Warcbase wiki, still under development, aims to fix that. Please visit, comment, and fork; we hope to develop it alongside all of you.