Monthly Archives: April 2015

A Heisenberg Principle of web archiving?

It’s been great to see the historical perspective being represented at this week’s General Assembly of the IIPC in Stanford. Following the Twitter hashtag at #iipcGA15, this older post came to mind. The comprehensive domain-wide archiving under UK Non-Print Legal Deposit that it refers to is now two years old; and 2015 has seen a significant upswing in attention being paid to web archiving in the press. So: do we yet know what the effect of widespread web archiving will be on the behaviour of those being archived? I don’t think we do; and historians of the future will need to know.


Whatever it means to real scientists, the famous ‘uncertainty principle’ of Werner Heisenberg is sometimes popularly taken to mean that it is impossible to observe something closely without in some way altering it. It’s also a conundrum that has faced anthropologists when observing cultures far removed from their own: how far does the consciousness of being observed alter the behaviour of the subject?

I’ve been publishing in print in the traditional way for some years now, and everyone knows that books are (in theory) permanent, that they find their way into libraries; and so one writes conscious that the words cannot be unwritten. Writing for the web, however, has had a more transient aesthetic: I can write with the freedom that comes from knowing that (in a site I control) I can retrospectively edit at will, should I choose to. There are good scholarly reasons not to, to do…


Archive-It Research Services: Exciting New Developments

Named Entity Recognition results on a corpus of tens of thousands of web archived pages: possible now with Archive-It’s WANE File

Historians who work with, or who are thinking about working with, web archives will be excited about the announcement that Archive-It Research Services made on March 17th. They’re greatly expanding the range of data that they provide to researchers. As they put it in their announcement:

The service will allow any Archive-It partner to give users, researchers, scholars, developers, and other patrons easily-analyzed datasets that contain key metadata elements, link graphs, named entities, and other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS will facilitate new types of use, research, and analysis of the significant historical records from the web that Archive-It partners are working to collect, preserve, and make accessible.

They’re making three types of datasets available. The first is WAT files, which contain metadata about websites.

From WATs, you can get metadata descriptions for websites, the links that they point towards, the anchor text of those links, and crawl information. Ian Milligan, one of the co-authors of this blog, has been using WAT files to analyze Canadian political history websites: see some results here and here (including a guided video tour of some results).
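As a rough illustration of what working with this kind of metadata can look like, here is a minimal Python sketch that pulls outgoing links and their anchor text from the JSON payload of a single WAT record. The field names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) follow the layout commonly seen in WAT files, but treat them as an assumption and check them against your own data; the sample record and its URLs are invented for the example.

```python
import json
from urllib.parse import urljoin

# Invented sample payload; the key names are an assumption about the WAT layout.
SAMPLE = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/page.html"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "/about", "text": "About us"},
                        {"path": "A@/href", "url": "http://example.com/", "text": "Partner"},
                        {"path": "IMG@/src", "url": "logo.png"},
                    ]
                }
            }
        },
    }
})


def extract_links(wat_payload):
    """Return (source, target, anchor_text) tuples for one WAT JSON payload."""
    envelope = json.loads(wat_payload).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html_meta.get("Links", []):
        url = link.get("url")
        if not url or source is None:
            continue  # skip entries without a usable URL
        # Resolve relative links against the page they were found on.
        links.append((source, urljoin(source, url), link.get("text", "")))
    return links


for source, target, text in extract_links(SAMPLE):
    print(source, "->", target, repr(text))
```

In practice the payloads would be read out of the WAT file with a WARC-reading library rather than from a string; that step is left out here to keep the sketch to the standard library.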

LGA and WANE files are unfamiliar to the two authors of this blog, although they look to be very useful! LGA files accelerate the ability to do longitudinal link analysis from WAT files. The examples they give are actually from the Canadian Political Interest Groups collection! Finally, WANE files use Stanford NER to extract information relating to people, organizations, and locations. Using derived text from a web archive, Milligan plotted all the locations mentioned in GeoCities – you can see the results here.

To get these files, consult the service details page. In short, if you’re an Archive-It partner you can order them internally through your dashboard. The rest of us researchers just need to send an e-mail with some information to start the process.

In short: an amazing move that’s really going to unlock these files. WARC files are really big – too big for most systems – whereas the LGA, WAT, and WANE model is going to make web archive research far more accessible. Kudos to Archive-It.

ReSAW: Research Infrastructure for the Study of Archived Web Materials

Historians based in Europe in particular should know about ReSAW, a Europe-wide network of scholars and institutions interested in the archived web. Co-ordinated by Niels Brügger (Aarhus University, NetLab & the Centre for Internet Studies), at present it is largely sustained by the efforts of its members, but there are plans for expansion in the next few years.

The next ReSAW event is a major conference on Web Archives as Scholarly Sources: Issues, Practices and Perspectives, which will take place in Denmark in June, and at which both Peter and Ian will be presenting papers. Booking is now open.

As well as the conference, there is a cluster of pilot research projects which may be of interest to historians. These range from examinations of patterns of commemoration online through to the traces left by the Eurovision Song Contest.