Category Archives: Uncategorized

A Heisenberg Principle of web archiving ?

It’s been great to see the historical perspective being represented at this week’s General Assembly of the IIPC in Stanford. Following the Twitter hashtag at #iipcGA15, this older post came to mind. The comprehensive domain-wide archiving under UK Non-Print Legal Deposit that it refers to is now two years old; and 2015 has seen a significant upswing in attention being paid to web archiving in the press. So: do we yet know what the effect of widespread web archiving will be on the behaviour of those being archived? I don’t think we do; and historians of the future will need to know.


Whatever it means to real scientists, the famous ‘uncertainty principle’ of Werner Heisenberg is sometime popularly taken to mean that it is impossible closely to observe something without in some way altering it. It’s also a conundrum that has faced anthropologists when observing cultures far removed from their own: how far does the consciousness of being observed alter the behaviour of the subject ?

I’ve been publishing in print in the traditional way for some years now, and everyone knows that books are (in theory) permanent, that they find their way into libraries; and so one writes conscious that the words cannot be unwritten. Writing for the web, however, has had a more transient aesthetic: I can write with the freedom that comes from knowing that (in a site I control) I can retrospectively edit at will, should I choose to. There are good scholarly reasons not to, to do…

View original post 236 more words

Archive-It Research Services: Exciting New Developments

Named Entity Recognition results on a corpus of tens of thousands of web archived pages: possible now with Archive-It’s WANE File

Historians who work with, or who are thinking about working with, web archives will be excited about the announcement that Archive-It Research Services made on March 17th. They’re widely expanding the sort of data that they provide to researchers. As they put it in their announcement:

The service will allow any Archive-It partner to give users, researchers, scholars, developers, and other patrons easily-analyzed datasets that contain key metadata elements, link graphs, named entities, and other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS will facilitate new types of use, research, and analysis of the significant historical records from the web that Archive-It partners are working to collect, preserve, and make accessible.

They’re making three types of datasets available, the first of which are WAT files, which contain metadata about websites.

From WATs, you can get metadata descriptions for websites, the links that they point towards, the anchor text of those links, and crawl information. Ian Milligan, one of the co-authors of this blog, has been using WAT files to analyze Canadian political history websites: see some results here and here (including a guided video tour of some results).

LGA and WANE files are unfamiliar to the two authors of this blog, although they look to be very useful! LGA files accelerate the ability to do longitudinal link analysis from WAT files. The examples they give are actually from the Canadian Political Interest Groups collection! Finally, WANE files use Stanford NER to extract information relating to people, organizations, and locations. Using derived text from a web archive, Milligan plotted all the locations mentioned in GeoCities – you can see the results here.

To get these files, consult the service details page. In short, if you’re an Archive-It partner you can order it internally through your dashboard. For the rest of us researchers, you just need to send an e-mail in with some information, and start the process.

In short: an amazing move that’s really going to unlock these files. WARC files are really big – too big, for most systems – whereas the LGA, WAT, and WANE model is going to unlock accessible web archive research. Kudos to Archive-It.

ReSAW: Research Infrastructure for the Study of Archived Web Materials

Historians based in Europe in particular should know about ReSAW, a Europe-wide network of scholars and institutions interested in the archived web. Co-ordinated by Niels Brügger (Aarhus University, NetLab & the Centre for Internet Studies), at present it is largely sustained by the efforts of its members, but there are plans for expansion in the next few years.

The next ReSAW event is a major conference on Web Archives as Scholarly Sources: Issues, Practices and Perspectives, which will take place in Denmark in June, and at which both Peter and Ian will be presenting papers. Booking is now open.

As well as the conference, there is a cluster of pilot research projects which may be of interest to historians. These include examinations of patterns of commemoration online, through to the traces left by the Eurovision Song Contest.

How fast does the web change and decay? Some evidence

One of the prime motivations behind Web Archives for Historians is our consciousness of how quickly the web changes and decays, and of the particular shape this gives to the archived web with which historians will need to work. However, just how this happens is not very well documented, and so I draw attention here to some introductory resources about the problem.

As historians, we need to know about patterns in which content disappears, but also about the rate at which it disappears.  One recent paper (2012) by Hany M. SalahEldeen and Michael L. Nelson looked at how quickly resources shared on social media about particular news events had disappeared, and found that:

after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.

Taking a different approach, using ten year’s worth of archived content in the UK Web Archive, the British Library’s technical lead Andy Jackson took a sampling approach to plot not only the rate of disappearance of content, but also the degree to which it had changed between 2004 and 2014. Readers may be interested in both the methods Andy used, and some important caveats about how the selection of content in the archive may have influenced the trends. But, the headline is that the fraction of content that is both still online and unchanged after those ten years is so small it hardly be seen on the graph. Even for content that was archived only a year before, the proportion that is live and unchanged is less than 10%.

In their different ways, both studies point to the same issue: that the live web changes and disappears very quickly. Historians need both to grasp how it happens, as well as to begin to think about what kind of archive this leaves us with.

The Historian of the Web : Crawler, Browser or Lurker?

[A special guest post by Valérie Schafer and Francesca Musiani.]

“They program us, we re-program them. They segment us, we move around. They accelerate, we linger. We can always be smarter than our machines.”
[Louise Merzeau, « Le Flâneur Impatient », Médium, Rythmes, 
n°41, 2014/4, p. 20-29]

In his blog post of January 22, 2015, “The Promise of WebARChive Files”, Ian Milligan noted:

Not only does the Agence nationale de la recherche project Web90 take this idea seriously – but its opposite, as well: “You can’t do justice to the World Wide Web if you do not consider the 1990s”, we argue.

Building on this core idea, the project aims at providing elements of reflection about the context of Web development in the Nineties (e.g. tariffs, strategies and offers put forward by ISPs, the birth of web design, the emergence of e-commerce and personal pages, the transition from the Minitel to the Web, the notorious legal controversies that ISPs and hosting services needed to face, or national State-driven policies). Also, Web90 wishes to map the “French” Web (defined as the .fr domain, even if, we are aware, this does not account for all French websites on its own, let alone French Web browsing patterns), and to reconstruct Web browsing users’ experience in light of such factors as the emergence of “graphic scenarios” that emerged as a result of the evolution of interfaces. These issues call for the simultaneous adoption of several methods, very different but nonetheless complementary, and not equally suited to provide answers to all questions.

Web archives as big data…
This was the title of the conference organised in December 2014 by the Big UK Domain Data for the Arts and Humanities project. The information ‘deluge’ may appear less threatening for the scholar of the Web of the Nineties, despite an important growth, in the second half of the decade, of the number of domain names and hosts. However, what was already an abundance needs to be managed, as do the missing pieces – images or sites that were not preserved, or very fleetingly or superficially so. The Digital Humanities and their tools will prove useful to face the massive amounts of Web data, provided that historians are ready to enter the “black box” of tools and instruments, as Ian Milligan showed, and also the “black box” of Web archiving, as Axis 3 of the Web90 project shows. Indeed, beyond the understanding of tools there is the nature of collecting procedures, its periodicity and its actors to engage with as well as the representations underpinning the constitution of archives.

Black boxes …
Let us take two examples. The first is developed in the article “Quand la communication devient patrimoine…” [When Communication Becomes Heritage], co-written by Camille Paloque-Berges and Valérie Schafer, forthcoming in Hermès. The article addresses, amongst other things, the vision of Web and digital heritage that informs the actions and strategies of the Archive Team. In stating, on its home page, that “History is our future… And we’ve been trashing our history”, and describing itself as “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage”, Archive Team clearly states its interest in “ordinary” forms of communication that are, nonetheless, already qualified as digital heritage.

Mélanie Dulong de Rosnay, member of the Web90 team, has shown in Réseaux de production collaborative de connaissances, building on Elinor Ostrom’s theories, how the notion of common good is today “materialized” a core feature of peer production networks. We can probably observe, in the communitarian implementation of this preservation infrastructure, an alternative management of informational common goods as heritage, as well as a movement of re-appropriation by users.

Born-digital heritage, as well as digital heritage, as we have recently noted in our Call for Papers for the RESET journal, “ […] call for empirical investigations on both its publics – existing or expected/envisaged – and its promoters, producers, preservers. […] The controversies that these policies raise (e.g. those that concern the ‘right to be forgotten’ and the right to memory), as well as the interactions of public authorities with preservation institutions (or among these institutions themselves), are interesting to analyze for the light they shed on the socio-technical and political dimensions of ‘digital heritage’, as it becomes institutionalized. The practices and procedures contributing to the shaping and the legitimization of digital heritage entail a number of choices, trials, tests, intertwined ‘scales of action’, and a social ‘work’ undertaken by a variety of actors, including professional associations, amateurs, the public at large, libraries, museums, research groups volunteering to be in charge of specific archiving tasks or initiating preservation policies, international institutions or clusters of entities such as UNESCO or the International Internet Preservation Consortium”.

The second example is drawn from Louise Merzeau’s work. On the occasion of the general assembly of the IIPC, in May 2014, she showed the link between archiving models, epistemological models and research models, and the retro-active feedback between these elements. As such, entering the black box seems crucial; however, historians will need to avoid several obstacles.

… and their temptations
Trying to transform historians into computer scientists is, in our opinion, an idea as risky as its contrary, i.e., internalist and machine-focused approaches that informed the early days of computing history written by practitioners. However, not to improve historians’ digital literacy would be just as disastrous.

Similarly, it seems very important to us not to mingle different roles to the point of confusion: if historians and their colleagues from other social and human sciences are not computer scientists, they are not archivists either. Institutional mediation (while frustrating at times, as it implies choices and gaps) guarantees sustainability and accessibility: a long-term vision that researchers’ archiving practices can only partially satisfy, unless access to data, and their deposit is completely re-thought, in history and other social and human sciences – not unlike what other communities have previously done (e.g. GenBank).

The naïve way of data-driven science, which has led some to believe that we are assisting in the “end of theory”, should also be avoided. The related risks have been underlined in other disciplines, for example by Bruno Strasser. As Antoine Prost remarked in his Twelve Lessons on History, there is no document without an underlying question. The questions asked by historians turn the traces left by the past into sources and documents. Big data and computational methods are not always, or not entirely, appropriate to provide answers and more so, to formulate questions.

Web explorers… “Small is beautiful”!
In the Web90 project, so as to account for the set of conditions that have shaped the Web experience of Internet users in the Nineties – especially their browsing habits – we have chosen to study on one hand the general framework and the power of ‘massification’. But on the other hand, and in parallel, we wish to ‘stay close to the archive’, and eventually to follow paths previously traced by others (e.g. directories or closed spaces such as Infonie), open doors such as ISP portals or the Yahoo! Directory, or those sites that were recommended by the press or by guidebooks, such as the 1998 Guide du Routard de l’Internet.) Of course, in this domain, Web archives are precious assets, but the word “browser” takes on here its full original etymological sense: to encompass our mobilization of printed sources (press archives, State-driven reports, guidebooks for the general public) but also audio-visual archives, oral testimonies, or newsgroups. These sources invite historians to become lurkers around exchanges past. But very often, they soon need to emerge from this “passive” status…

A Usenet newsgroups research focused on female presence on the Web of the Nineties (carried out for this conference) has allowed us to retrace a post by a “Guillermito El Loco”, followed by twenty-five other posts, on the subject: “The Web is by far too masculine. What should we do?” With the objective of encouraging feminine presence and visibility on the Web, he proposes to list “girl Web pages” authored by French women. Two days later, as he has contacted the women he wishes to list on his site, reactions are quite nuanced and mixed: “For now, I had around thirty responses, four or five of them negative, sometimes with fairly violent reactions… it might actually disgust you from being a feminist!” The site, the link of which is mentioned in the post, has luckily been preserved by the Internet Archive, and opens a door towards “feminine Web pages” and the profiles of their authors. Guillermito has compiled an alphabetical database of these profiles and collected the links, a fairly good proportion of which are active in the Wayback Machine. There is, we believe, no need to argue further for the potential of such a corpus – peculiar, modest but unique – for our subject of study.

“The historian of tomorrow will be a programmer, or will be no more”, stated historian Emmanuel Le Roy Ladurie in 1973. Soon after, he left for the Occitan village Montaillou, of which he recomposed the day-to-day history, while distancing himself from measures and statistics… Do we, can we, see in this anecdote – as argued by Valérie Schafer and Benjamin G. Thierry in a forthcoming article – a harbinger for the use of Web archives? The promise of digital archives and tools leaves the door wide open for historians as regards methodology and its plurality. We do not wish to assume anything about which approaches will ultimately be privileged. However, between quantitative and qualitative, subjectivism and scientific requirements, sampling or claims of comprehensiveness, will we witness ancient quarrels “reloaded”? Or, on the contrary, will the legacy of historiography allow us to move beyond these dichotomies… and beyond binary oppositions?

Fascinating Interplay About Discovering Content in Web Archives

Web archives have arrived, at least in the pages of high-profile publications such as the Washington Post and the New Yorker.

An especially fascinating exchange took place in mid-February. Gareth Millward, a research fellow in the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine, published “I tried to use the Internet to do historical research. It was nearly impossible” with the Washington Post. In it, he explained the difficulties of navigating extremely large web archives: search queries returned useless results, not sorted in an ideal fashion (or at all), and that instead historians may need to find smaller circumscribed corpuses or explore metadata.

The response by Andy Jackson, Web Archiving Technical Lead at the British Library, on the British Library’s Web Archive blog was equally illuminating. His piece, “Building a ‘Historical Search Engine’ is No Easy Thing,” is a must-read. He pointed out the different use cases that historians have: simply replicating Google (which excels at letting us know what we need to know in an extremely contemporary context) won’t make sense when querying large bodies of web archived material. He walks us through the various steps of the search engine, and concludes by arguing that we need to think of Macroscopes rather than of search engines (sidenote: having just finished copyedits on a co-authored book subtitled The Historian’s Macroscope, I’m inclined to agree with this metaphor!).

These two pieces join a third high-profile piece, “The Cobweb: Can the Internet be Archived?” by Harvard historian Jill Lepore. This was a fascinating exploration of the current state and recent history of web archiving, and is well worth your time.

Religion, social media and the web archive

Peter reblogs here a post on the ways in which his own study of contemporary religious history needs to come to terms with the ways in which social media content is (and is not) captured by traditional web archiving. As historians, we will need to understand how social media content is being archived, and the ways in which different archives of web-delivered content will need to be interrogated *together* to reconstruct the communication of individuals and organisations.


Late last year I was delighted to be invited to be one of four keynote speakers at a workshop on religion and social media at the International AAAI Conference on Web and Social Media in Oxford in May. Here are some initial thoughts on what I intend to say.

There has been an interesting upswing recently in scholarly interest in the ways in which religious people, and the organisations in which they gather together, represent themselves and communicate with others on social media. However, this work has been conducted relatively independently from the emerging body of scholarship on the archived web. Image by , CC BY 2.0 , CC BY 2.0

There are some reasons for this. First is the fact that much of the scholarship on social media tends to be focussed very firmly on the present. As such, data tends to be gathered directly from social media platforms “to order”, to match the…

View original post 353 more words

Milligan Presentation: “The Promise of WebARChive Files”

This paper was given at the American Historical Association’s annual meeting in New York City on January 5th, 2015. It was part of the Text Analysis, Visualization, and Historical Interpretation panel. My thanks to my co-presenters and especially Micki Kaufman who organized the panel.

The text that follows may not be exactly what I said, but is based on my speaking notes with a bit of memory filling in here and there.

AHA Talk.001

AHA Talk.002

Hello everybody, I’d like to begin with a somewhat provocative opening:

I believe that historians are unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. Web archives are going to transform the work we do for a few main reasons: Continue reading Milligan Presentation: “The Promise of WebARChive Files”

Welcome to our Blog!

By Peter Webster and Ian Milligan

The first stage of our project, Web Archives for Historians, has concluded. In just under a year, we’ve amassed a healthy bibliography (about twenty works) that fall within the scope of our bibliography – works written by historians covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources.

We’ve probably reached the ceiling on this front, however! There aren’t that many historians who are actively working in this area (yet, we dare say). And so we now want Web Archives for Historians to transition into an active blog that will:

    • Aggregate content by historians or for historians on web archives (similar to the Web Archiving Roundtable) – some of this will come from our own blogs (Peter and Ian), but also from a list of blogs that we’ll be following;
    • draw attention to talks and slides that we spot at scholarly conferences or in publication venues;


  • carry commissioned posts (eventually).

Our mandate will be to include:

  1. examples of work done using web archives;
  2. historical method in the web archive;
  3. news of significant new projects, tools, data or web services;
  4. contemporary history using the live web (as core source material, rather than just incidentally).

We hope that you join us by following along with your RSS feed, on Twitter, or just by popping by now and again.