Monthly Archives: March 2015

How fast does the web change and decay? Some evidence

One of the prime motivations behind Web Archives for Historians is our consciousness of how quickly the web changes and decays, and of the particular shape this gives to the archived web with which historians will need to work. However, just how this happens is not very well documented, and so I draw attention here to some introductory resources about the problem.

As historians, we need to know about the patterns in which content disappears, but also about the rate at which it disappears. One recent paper (2012) by Hany M. SalahEldeen and Michael L. Nelson looked at how quickly resources shared on social media about particular news events had disappeared, and found that:

after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.
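To make the arithmetic behind these figures concrete, here is a minimal sketch in Python that reads the quoted finding as a simple linear model. Two things are assumptions of this sketch rather than claims from the paper: that loss builds up evenly over the first year, and that the 0.02% per day is measured in percentage points of the original set of shared resources. The function name `estimated_loss_percent` is purely illustrative.

```python
def estimated_loss_percent(days_since_sharing: int) -> float:
    """Rough estimate of the percentage of shared resources lost after N days."""
    first_year_days = 365
    first_year_loss = 11.0    # percent lost after one year (quoted figure)
    daily_loss_after = 0.02   # percentage points per day after year one (quoted figure)

    if days_since_sharing < 0:
        raise ValueError("days_since_sharing must be non-negative")

    if days_since_sharing <= first_year_days:
        # Assumption: loss accrues evenly across the first year.
        loss = first_year_loss * days_since_sharing / first_year_days
    else:
        # After the first year, add the quoted daily rate for each extra day.
        loss = first_year_loss + daily_loss_after * (days_since_sharing - first_year_days)

    # A percentage lost cannot exceed 100.
    return min(loss, 100.0)


if __name__ == "__main__":
    for years in (1, 2, 5):
        print(years, "year(s):", round(estimated_loss_percent(365 * years), 1), "%")
```

On this reading, roughly 18% of shared resources would be gone two years after sharing (11% plus 365 days at 0.02% per day).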

Taking a different approach, using ten years’ worth of archived content in the UK Web Archive, the British Library’s technical lead Andy Jackson used sampling to plot not only the rate at which content disappeared, but also the degree to which it had changed between 2004 and 2014. Readers may be interested in both the methods Andy used, and some important caveats about how the selection of content in the archive may have influenced the trends. But the headline is that the fraction of content that is both still online and unchanged after those ten years is so small that it can hardly be seen on the graph. Even for content that was archived only a year before, the proportion that is live and unchanged is less than 10%.

In their different ways, both studies point to the same issue: that the live web changes and disappears very quickly. Historians need both to grasp how it happens and to begin to think about what kind of archive this leaves us with.

The Historian of the Web: Crawler, Browser or Lurker?

[A special guest post by Valérie Schafer and Francesca Musiani.]

“They program us, we re-program them. They segment us, we move around. They accelerate, we linger. We can always be smarter than our machines.”
[Louise Merzeau, « Le Flâneur Impatient », Médium, Rythmes, n°41, 2014/4, p. 20-29]

In his blog post of January 22, 2015, “The Promise of WebARChive Files”, Ian Milligan noted:

Not only does the Agence nationale de la recherche project Web90 take this idea seriously – but its opposite, as well: “You can’t do justice to the World Wide Web if you do not consider the 1990s”, we argue.

Building on this core idea, the project aims to provide elements of reflection about the context of Web development in the Nineties (e.g. tariffs, strategies and offers put forward by ISPs, the birth of web design, the emergence of e-commerce and personal pages, the transition from the Minitel to the Web, the notorious legal controversies that ISPs and hosting services needed to face, or national State-driven policies). Web90 also wishes to map the “French” Web (defined as the .fr domain, even if, as we are aware, this does not account for all French websites on its own, let alone French Web browsing patterns), and to reconstruct users’ Web browsing experience in light of such factors as the “graphic scenarios” that emerged as a result of the evolution of interfaces. These issues call for the simultaneous adoption of several methods, very different but nonetheless complementary, and not equally suited to provide answers to all questions.

Web archives as big data…
This was the title of the conference organised in December 2014 by the Big UK Domain Data for the Arts and Humanities project. The information ‘deluge’ may appear less threatening for the scholar of the Web of the Nineties, despite significant growth, in the second half of the decade, in the number of domain names and hosts. However, what was already an abundance needs to be managed, as do the missing pieces: images or sites that were not preserved, or only very fleetingly or superficially so. The Digital Humanities and their tools will prove useful in facing the massive amounts of Web data, provided that historians are ready to enter the “black box” of tools and instruments, as Ian Milligan showed, and also the “black box” of Web archiving, as Axis 3 of the Web90 project shows. Indeed, beyond understanding the tools, historians need to engage with the nature of collecting procedures, their periodicity and their actors, as well as the representations underpinning the constitution of archives.

Black boxes …
Let us take two examples. The first is developed in the article “Quand la communication devient patrimoine…” [When Communication Becomes Heritage], co-written by Camille Paloque-Berges and Valérie Schafer, forthcoming in Hermès. The article addresses, amongst other things, the vision of Web and digital heritage that informs the actions and strategies of the Archive Team. In stating, on its home page, that “History is our future… And we’ve been trashing our history”, and describing itself as “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage”, Archive Team clearly states its interest in “ordinary” forms of communication that are, nonetheless, already qualified as digital heritage.

Mélanie Dulong de Rosnay, member of the Web90 team, has shown in Réseaux de production collaborative de connaissances, building on Elinor Ostrom’s theories, how the notion of common good is today “materialized” as a core feature of peer production networks. We can probably observe, in the communitarian implementation of this preservation infrastructure, an alternative management of informational common goods as heritage, as well as a movement of re-appropriation by users.

Born-digital heritage, as well as digital heritage, as we have recently noted in our Call for Papers for the RESET journal, “ […] call for empirical investigations on both its publics – existing or expected/envisaged – and its promoters, producers, preservers. […] The controversies that these policies raise (e.g. those that concern the ‘right to be forgotten’ and the right to memory), as well as the interactions of public authorities with preservation institutions (or among these institutions themselves), are interesting to analyze for the light they shed on the socio-technical and political dimensions of ‘digital heritage’, as it becomes institutionalized. The practices and procedures contributing to the shaping and the legitimization of digital heritage entail a number of choices, trials, tests, intertwined ‘scales of action’, and a social ‘work’ undertaken by a variety of actors, including professional associations, amateurs, the public at large, libraries, museums, research groups volunteering to be in charge of specific archiving tasks or initiating preservation policies, international institutions or clusters of entities such as UNESCO or the International Internet Preservation Consortium”.

The second example is drawn from Louise Merzeau’s work. On the occasion of the general assembly of the IIPC, in May 2014, she showed the link between archiving models, epistemological models and research models, and the retro-active feedback between these elements. As such, entering the black box seems crucial; however, historians will need to avoid several obstacles.

… and their temptations
Trying to transform historians into computer scientists is, in our opinion, an idea as risky as its contrary, i.e., internalist and machine-focused approaches that informed the early days of computing history written by practitioners. However, not to improve historians’ digital literacy would be just as disastrous.

Similarly, it seems very important to us not to mingle different roles to the point of confusion: if historians and their colleagues from other social and human sciences are not computer scientists, they are not archivists either. Institutional mediation (while frustrating at times, as it implies choices and gaps) guarantees sustainability and accessibility: a long-term vision that researchers’ archiving practices can only partially satisfy, unless access to data and their deposit are completely re-thought in history and other social and human sciences, not unlike what other communities have previously done (e.g. GenBank).

The naïve view of data-driven science, which has led some to believe that we are witnessing the “end of theory”, should also be avoided. The related risks have been underlined in other disciplines, for example by Bruno Strasser. As Antoine Prost remarked in his Twelve Lessons on History, there is no document without an underlying question. The questions asked by historians turn the traces left by the past into sources and documents. Big data and computational methods are not always, or not entirely, appropriate for providing answers and, even less so, for formulating questions.

Web explorers… “Small is beautiful”!
In the Web90 project, so as to account for the set of conditions that shaped the Web experience of Internet users in the Nineties, especially their browsing habits, we have chosen to study on the one hand the general framework and the power of ‘massification’. On the other hand, and in parallel, we wish to ‘stay close to the archive’, and possibly to follow paths previously traced by others (e.g. directories or closed spaces such as Infonie), open doors such as ISP portals or the Yahoo! Directory, or those sites that were recommended by the press or by guidebooks, such as the 1998 Guide du Routard de l’Internet. Of course, in this domain, Web archives are precious assets, but the word “browser” takes on its full original etymological sense here, encompassing our use of printed sources (press archives, State-driven reports, guidebooks for the general public) but also audio-visual archives, oral testimonies, or newsgroups. These sources invite historians to become lurkers around exchanges past. But very often, they soon need to emerge from this “passive” status…

Research in Usenet newsgroups on the female presence on the Web of the Nineties (carried out for this conference) allowed us to trace a post by a “Guillermito El Loco”, followed by twenty-five other posts, on the subject: “The Web is by far too masculine. What should we do?” With the objective of encouraging feminine presence and visibility on the Web, he proposes to list “girl Web pages” authored by French women. Two days later, after he has contacted the women he wishes to list on his site, reactions are quite nuanced and mixed: “For now, I had around thirty responses, four or five of them negative, sometimes with fairly violent reactions… it might actually disgust you from being a feminist!” The site, which is linked in the post, has luckily been preserved by the Internet Archive, and opens a door towards “feminine Web pages” and the profiles of their authors. Guillermito compiled an alphabetical database of these profiles and collected the links, a fairly good proportion of which are active in the Wayback Machine. There is, we believe, no need to argue further for the potential of such a corpus (peculiar, modest but unique) for our subject of study.

“The historian of tomorrow will be a programmer, or will be no more”, stated historian Emmanuel Le Roy Ladurie in 1973. Soon after, he left for the Occitan village Montaillou, of which he recomposed the day-to-day history, while distancing himself from measures and statistics… Do we, can we, see in this anecdote – as argued by Valérie Schafer and Benjamin G. Thierry in a forthcoming article – a harbinger for the use of Web archives? The promise of digital archives and tools leaves the door wide open for historians as regards methodology and its plurality. We do not wish to assume anything about which approaches will ultimately be privileged. However, between quantitative and qualitative, subjectivism and scientific requirements, sampling or claims of comprehensiveness, will we witness ancient quarrels “reloaded”? Or, on the contrary, will the legacy of historiography allow us to move beyond these dichotomies… and beyond binary oppositions?

Fascinating Interplay About Discovering Content in Web Archives

Web archives have arrived, at least in the pages of high-profile publications such as the Washington Post and the New Yorker.

An especially fascinating exchange took place in mid-February. Gareth Millward, a research fellow in the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine, published “I tried to use the Internet to do historical research. It was nearly impossible” with the Washington Post. In it, he explained the difficulties of navigating extremely large web archives: search queries returned useless results that were not sorted in an ideal fashion (or at all). He suggested that historians may instead need to find smaller, circumscribed corpora or explore metadata.

The response by Andy Jackson, Web Archiving Technical Lead at the British Library, on the British Library’s Web Archive blog was equally illuminating. His piece, “Building a ‘Historical Search Engine’ is No Easy Thing,” is a must-read. He pointed out the different use cases that historians have: simply replicating Google (which excels at letting us know what we need to know in an extremely contemporary context) won’t make sense when querying large bodies of web archived material. He walks us through the various steps of the search engine, and concludes by arguing that we need to think of Macroscopes rather than of search engines (sidenote: having just finished copyedits on a co-authored book subtitled The Historian’s Macroscope, I’m inclined to agree with this metaphor!).

These two pieces join a third high-profile piece, “The Cobweb: Can the Internet be Archived?” by Harvard historian Jill Lepore. This was a fascinating exploration of the current state and recent history of web archiving, and is well worth your time.