Conference dispatches from Aarhus: Web Archives as Scholarly Sources

Some belated reflections on the excellent recent conference at Aarhus University in Denmark, on Web Archives as Scholarly Sources: Issues, Practices and Perspectives (see the abstracts on the conference site).

As well as an opportunity for me to speak, it was a great chance to catch up with what is a genuinely global community of specialists, even if (as one might expect) the European countries were particularly well represented this time. It was also particularly pleasing to see a genuine intermixing of scholars with the librarians and archivists whose task it is to provide scholars with their sources. As a result, the papers were an eclectic mix of method, tools, infrastructure and research findings; a combination not often achieved.

Although there were too many excellent papers to mention them all here, I draw out a few to illustrate this eclecticism. There were discussions of research method as applied both in close reading of small amounts of material (Gebeil, Nanni) and to very large datasets (Goel and Bailey). We also heard about emerging new tools for better harvesting of web content, and for providing access to the archived content (Huurdeman).

Particularly good to see were the first signs of work that was beginning to go beyond discussions of method (“the work I am about to do”) to posit research conclusions, even if still tentative at this stage (Musso amongst others), and critical reflection on the way in which the archived web is used (Megan Sapnar Ankerson). It was also intriguing to see an increased focus on understanding the nature of a national domain, particularly in Anat Ben-David’s ingenious reconstruction of the defunct .yu domain of the former Yugoslavia. Good to see too were the beginnings of a reintegration of social networks into the picture (Milligan, Weller, McCarthy), difficult to archive though they are; and some attention to the web before 1996 and the Internet Archive (Kevin Driscoll on BBS).

All in all, it was an excellent conference in all its aspects, and congratulations to Niels Brügger and the organising team for pulling it off.

Have Web Collections? Want Link and Text Analysis?

The Warcbase wiki in action!

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.

One collaboration has been really fruitful. Together with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David R. Cheriton School of Computer Science, we’ve been developing scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!

I’d like to imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloaded onto their computers, with no real idea what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to provide easy-to-follow walkthroughs that let users do the basic things to get started:

  • Link visualizations to explore networks, finding central hubs, communities, and so forth (for example, a dynamic visualization generated with warcbase and Gephi);

  • Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
  • Overall statistics to find over- and under-represented domains, platforms, or content types;
  • And basic n-gram-style navigation to monitor and explore change over time.
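As a rough sketch of the first of these, assuming pages have already been extracted from the archive as plain HTML (the URLs and pages below are toy stand-ins, not warcbase output), link structures can be pulled out and aggregated with nothing more than the standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from collections import Counter

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outlinks(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Two toy "archived pages" standing in for records pulled from a collection.
pages = {
    "http://example.org/a": '<a href="http://example.com/x">x</a> <a href="http://example.net/y">y</a>',
    "http://example.org/b": '<a href="http://example.com/z">z</a>',
}

# Count how often each target domain is linked to: the central "hubs".
indegree = Counter()
for source, html in pages.items():
    for link in outlinks(html):
        indegree[urlparse(link).netloc] += 1

print(indegree.most_common())  # [('example.com', 2), ('example.net', 1)]
```

The same edge list, written out as source/target pairs, is exactly what Gephi ingests for the network visualizations mentioned above.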

All of this is relatively easy for web archive experts to do, but still difficult for end users.
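For readers wondering what those opaque files actually look like inside: a WARC is, at heart, a sequence of records, each with a block of header lines and a payload whose size is given by Content-Length. The toy parser below works on a synthetic in-memory record and is only a sketch of the layout; real collections are usually gzip-compressed and should be read with a dedicated library such as warcio rather than by hand.

```python
# A single toy WARC record, built in memory; real files hold thousands
# of these. Header lines, a blank line, then a payload whose size is
# given by the Content-Length header.
payload = b"<html><body>Hello from 2009</body></html>"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.geocities.com/example/\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n"
) + payload + b"\r\n\r\n"

def parse_record(data):
    """Split one uncompressed WARC record into (headers, payload)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:   # skip the WARC/1.0 version line
        name, _, value = line.partition(b": ")
        headers[name.decode()] = value.decode()
    length = int(headers["Content-Length"])
    return headers, rest[:length]

headers, body = parse_record(record)
print(headers["WARC-Target-URI"])  # http://www.geocities.com/example/
print(body.decode())
```

Once the payloads are out, everything in the list above (links, text, statistics, n-grams) is ordinary text processing.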

The Warcbase wiki, still under development, aims to fix that. Please visit, comment, and fork; we hope to develop it alongside all of you.

The UK Web Archive, born-digital sources, and rethinking the future of research

[A guest post by Professor Tim Hitchcock of the University of Sussex. It is derived from a short talk given at a doctoral training event at the British Library in May 2015, focused on using the UK Web Archive. It was written with PhD students in mind, but really forms a meditation on the opportunities created when working with the archived web rather than print. While lightly edited, the text retains the tics and repetitions of public presentation. We’re very grateful to Tim for permission to repost this, which first appeared on Historyonics. Tim is to be found on Twitter @TimHitchcock ]

I normally work on properly dead people of the sort that do not really appear in the UK Web Archive – most of them eighteenth-century beggars and criminals. And in many respects the object of study for people like me – interlocutors of the long dead – has not changed that much in the last twenty years.  For most of us, the ‘object of study’ remains text.  Of course the ‘digital’ and the online has changed the nature of that text.  How we find things – the conundrums of search, which in turn shape the questions we ask – has been transformed by digitisation.  And a series of new conundrums have been added to all the old ones – do, for instance, ‘big data’ and new forms of visualisation imply a new ‘open eyed’ interrogation of data?  Are we being subtly encouraged to abandon older social science ‘models’ for something new?  And if we are, should these new approaches take the form of ‘scientific’ interrogation, looking for ‘natural’ patterns – following the lead of the Culturomics movement; or perhaps take the form of a re-engagement with the longue durée – in answer to the pleas of the History Manifesto?  Or perhaps we should be seeking a return to ‘close reading’ combined with a radical contextualisation – looking at the individual word, person and thing – in its wider context, preserving focus across the spectrum.

And of course, the online and the digital also raise issues about history writing as a genre and form of publication.  Open access, linked data, open data, the ‘crisis’ of the monograph, and the opportunities of multi-modal forms of publication, all challenge us to think again about the kind of writing we do, as a literary form.  Why not do your PhD as a graphic novel? Why not insist on publishing the research data with your literary overlay?  Why not do something different?  Why not self-publish?

These are conundrums all – but conundrums largely of the ‘textual humanities’.  Ironically, all these conundrums have not had much effect on the academy and the kind of scholarship the academy values.  The world of academic writing is largely, and boringly, the same as it was thirty years ago.  How we do it has changed, but what it looks like feels very familiar.

But the born digital is different.  Arguably, the sorts of things I do, history writing focused on the properly dead, look ‘conservative’ because they necessarily engage with the categories of knowing that dominated the nineteenth and twentieth centuries – these were centuries of text, organised into libraries of books, and commentated on by cadres of increasingly professional historians.  The born digital – and most importantly the UK web archive – is just different.  It sings to a different tune, and demands different questions – and if anywhere is going to change practice, it should be here.

Somewhat to my frustration, I don’t work on the web as an ‘object of study’ –  and therefore feel uncertain about what it can answer and how its form is shaping the conversation; but I did want to suggest that the web itself and more particularly the UK Web Archive provides an opportunity to re-think what is possible, and to rethink what it is we are asking; how we might ask it, and to what purpose.

And I suppose the way I want to frame this is to suggest that the web itself brings on to a single screen a series of forms of data that can be subject to lots of different forms of analysis.  A few years ago, when APIs were first being advocated as a component of web design, the comment that really struck me was that the web itself is a form of API, and that by extension the Web Archive is subject to the same kind of ‘re-imagination’ and re-purposing that an API allows for a single site or source.

As a result, you can – if you want – treat a web page as simple text – and apply all the tools of distant reading of text – that wonderful sense that millions of words can be consumed in a single gulp.   You can apply ‘topic modelling’ and Latent Semantic Analysis; or term frequency/inverse document frequency (TF-IDF) measures.  Or, even more simply, you can count words, and look for outliers – stare hard at the word on the web!
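To make that concrete: the simplest of these measures fits in a few lines. This is a sketch over three toy ‘documents’ rather than the extracted text of real archived pages, but the arithmetic of term frequency/inverse document frequency is exactly this.

```python
import math

# Three toy "documents"; in practice these would be the extracted
# text of thousands of archived pages.
docs = [
    "the web archive preserves the web",
    "historians read the archive",
    "the web changes quickly",
]
tokenised = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    """Term frequency in one document, weighted down by how many
    documents of the corpus contain the term at all."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "web" appears in two of the three documents, "historians" in only
# one, so the rarer word scores higher where it does appear.
print(tf_idf("web", tokenised[0], tokenised))
print(tf_idf("historians", tokenised[1], tokenised))
```

The outliers Hitchcock mentions are simply the terms whose scores stand far above the rest of a page’s vocabulary.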

But you can also go well beyond this.  In performance art, in geography and archaeology, in music and linguistics, new forms of reading are emerging with each passing year that seem to me to significantly challenge our sense of the ‘object of study’ – both traditional text and web page.  In part, this is simply a reflection of the fact that all our senses and measures are suddenly open to new forms of analysis and representation. When everything is digital – when all forms of stuff come to us down a single pipeline –  everything can be read in a new way.

Consider for a moment the ‘LIVE’ project from the Royal Veterinary College in London, and their ‘haptic simulator’.  In this instance they have developed a full-scale ‘haptic’ representation of a cow in labour, facing a difficult birth, which allows students to physically engage with and experience the process of manipulating a calf in situ.  I haven’t had a chance to try this, but I am told that it is a mind-altering experience.  It suggests that reading can be different; and should include the haptic – the feel and heft of a thing in your hand.  This is being coded for millions of objects through 3D scanning; but we do not yet have an effective way of incorporating that 3D text into how we read the past.

The same could be said of the aural – that weird world of sound on which we continually impose the order of language, music and meaning; but which is in fact a stream of sensations filtered through place and culture.
Projects like the Virtual St Paul’s Cross, which allows you to ‘hear’ John Donne’s sermons from the 1620s from different vantage points around the yard, change how we imagine them, and move from ‘text’ to something much more complex and powerful.  They begin to navigate that normally unbridgeable space between text and the material world.  And if you think about this in relation to music and speech online – you end up with something different on a massive scale.

One of my current projects is to create a soundscape of the courtroom at the Old Bailey – to re-create the aural experience of the defendant – what it felt like to speak to power, and what it felt like to have power spoken at you from the bench.  And in turn, to use that knowledge to assess who was more effective in their dealings with the court, and whether having a bit of shirt to you, for instance, affected your experience of transportation or imprisonment.  And the point of the project is simply to add a few more variables to the ones we can securely derive from text.

It is an attempt to add just a couple more columns to a spreadsheet of almost infinite categories of knowing.  And you could keep going – weather, sunlight, temperature, the presence of the smells and reeks of other bodies.  Ever more layers to the sense of place.  In part, this is what the gaming industries have been doing from the beginning, but it also becomes possible to turn that creativity on its head, and make it serve a different purpose.

In the work of people such as Ian Gregory, we can see the beginnings of new ways of reading both the landscape, and the textual leavings of the dead.  Bob Shoemaker, Matthew Davies and I (with a lot of other people) tried to do something similar with Old Bailey material, and the geography of London, in the Locating London’s Past project.

This map is simply colours blue, red and yellow mapped against brown and green.  I have absolutely no idea what this mapping actually means, but it did force me to think differently about the feel and experience of the city.  And I want to be able to do the same for all the text captured in the UK domain.

All of which is to state the obvious.  There are lots of new readings that change how we connect with historical evidence – whether that is text, or something more interesting.  In creating new digital forms of inherited culture – the stuff of the dead – we naturally innovate, and naturally enough, discover ever-changing readings.  But the Web Archive challenges us to do a lot more; and to begin to unpick what you might start pulling together from this near-infinite archive.

In other words, the tools of text are there, and arguably moving in the right direction, but there are several more dimensions we can exploit when the object of study is itself an encoding.

Each web page, for instance, embodies a dozen different forms.  Text is obvious, but it is important to remember that each component of the text – each word and letter on a web page – is itself a complex composite.  What happens when you divide text by font, font size, weight, colour, kerning, formatting, and so on?  By location – in the header, or the body, or wherever the CSS sends it; or more subtly by where it appears to a user’s eye – in the middle of a line, or at the end?  Suddenly, to all the forms of analysis we have associated with ‘distant reading’ there are five or six further columns in the spreadsheet – five or six new variables to investigate in that ‘big data’, eye-opened sort of way.

And that is just the text.  The page itself is both a single image, and a collection of them – each with their own properties.  And one of the great things that is coming out of image research is that we can begin to automate the process of analysing those screens as ‘images’.  Colour, layout, face recognition etc.  Each page is suddenly ten images in one – all available as a new variable; a new column in the spreadsheet of analysis.  And, of course, the same could be said of embedded audio and video.

And all of that is before we even look under the bonnet.  The code, the links, the metadata for each page – in part we can think of these as just another iteration of the text; but more imaginatively, we can think of them as more variables in the mix.

But, of course, that in itself misunderstands the web and the Web Archive.  The commonplace metaphor I have been using up till now is of a ‘page’ – and is the intellectual equivalent of skeuomorphism – relying on material-world metaphors to understand the online.

But these aren’t pages at all; they are collections of code and data that generate an experience in real time.  They do not exist until they are used – if a website in the forest is never accessed, it does not exist.  The web archive therefore is not an archive of ‘objects’ in the traditional sense, but a snapshot from a moving film of possibilities.  At its most abstract, what the UK Web Archive has done is to spirit into being the very object it seeks to capture – and of course, we all know that in doing so, the capturing itself changes the object.  Schrödinger’s cat may be alive or dead, but its box is definitely open, and we have visited our observations upon its content.

So to add to all the layers of stuff that can fill your spreadsheet, there also needs to be columns for time and use; re-use and republication.  And all this is before we seek to change the metaphor and talk about networks of connections, instead of pages on a website.

Where I end up is seriously jealous of the possibilities; and seriously wondering what the ‘object of study’ might be.  In the nature of an archive, the UK Web Archive imagines itself as an ‘object of study’, created in the service of an imaginary scholar.  The question it raises is how do we turn something we really can’t understand, cannot really capture as an object of study, to serious purpose?  How do we think at one and the same time of the web as alive and dead, as code, text, and image – all in dynamic conversation one with the other?  And even if we can hold all that at once, what is it we are asking?

IIPC 2015 Recap

I had a fantastic time at the International Internet Preservation Consortium’s Annual General Meeting this year, held on the beautiful campus of Stanford University (with a day trip down to the Internet Archive in San Francisco). It’s hard to write these sorts of recaps: I had such an amazing time, my head so filled with great ideas, that it’s difficult to give everything the justice it deserves. Many of the presentation slide decks are available on the schedule, and videos will be forthcoming.

My main takeaways: we’re continuing to see the development of sophisticated access tools to these repositories, coupled with increasingly exciting and sophisticated researcher use of them. There’s a recognition that context matters when understanding archived webpages, a phrase that came up a few times throughout the event. Crucially, there was a lot of energy in the room: there’s a real enthusiasm towards making these as accessible as possible and facilitating their use. I wasn’t exaggerating when I noted to one of the organizers that I wish every conference was like this: leaving me on my flight home with lots of fantastic ideas, hope for the future, and excitement about what can be done. As the recent “Conference Manifesto” in the New York Times noted, that’s not the experience at all conferences!

Read on for a short day-by-day breakdown, with apologies for presentations I couldn’t include or didn’t give full justice to.

A Heisenberg Principle of web archiving?


It’s been great to see the historical perspective being represented at this week’s General Assembly of the IIPC in Stanford. Following the Twitter hashtag at #iipcGA15, this older post came to mind. The comprehensive domain-wide archiving under UK Non-Print Legal Deposit that it refers to is now two years old; and 2015 has seen a significant upswing in attention being paid to web archiving in the press. So: do we yet know what the effect of widespread web archiving will be on the behaviour of those being archived? I don’t think we do; and historians of the future will need to know.

Originally posted on Webstory: Peter Webster's blog:

Whatever it means to real scientists, the famous ‘uncertainty principle’ of Werner Heisenberg is sometimes popularly taken to mean that it is impossible closely to observe something without in some way altering it. It’s also a conundrum that has faced anthropologists when observing cultures far removed from their own: how far does the consciousness of being observed alter the behaviour of the subject?

I’ve been publishing in print in the traditional way for some years now, and everyone knows that books are (in theory) permanent, that they find their way into libraries; and so one writes conscious that the words cannot be unwritten. Writing for the web, however, has had a more transient aesthetic: I can write with the freedom that comes from knowing that (in a site I control) I can retrospectively edit at will, should I choose to. There are good scholarly reasons not to, to do…

View original 236 more words

Archive-It Research Services: Exciting New Developments

Named Entity Recognition results on a corpus of tens of thousands of web archived pages: possible now with Archive-It’s WANE File

Historians who work with, or who are thinking about working with, web archives will be excited about the announcement that Archive-It Research Services made on March 17th. They’re widely expanding the sort of data that they provide to researchers. As they put it in their announcement:

The service will allow any Archive-It partner to give users, researchers, scholars, developers, and other patrons easily-analyzed datasets that contain key metadata elements, link graphs, named entities, and other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS will facilitate new types of use, research, and analysis of the significant historical records from the web that Archive-It partners are working to collect, preserve, and make accessible.

They’re making three types of datasets available, the first of which are WAT files, which contain metadata about websites.

From WATs, you can get metadata descriptions for websites, the links that they point towards, the anchor text of those links, and crawl information. Ian Milligan, one of the co-authors of this blog, has been using WAT files to analyze Canadian political history websites: see some results here and here (including a guided video tour of some results).
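WAT files are JSON-formatted metadata records, one per capture. The sketch below uses a deliberately simplified record: in real WAT files this information sits nested under Envelope/Payload-Metadata headers, so treat the field names and URLs here as assumptions for illustration only.

```python
import json

# A simplified record in the spirit of a WAT entry: page URL plus the
# links found on the page, each with its anchor text.
wat_record = json.loads("""
{
  "url": "http://example.ca/politics/",
  "links": [
    {"url": "http://party-a.ca/", "text": "Party A"},
    {"url": "http://party-b.ca/", "text": "Party B"}
  ]
}
""")

# Pull out link-graph edges (source, target, anchor text): the raw
# material for the kinds of link analyses described above.
edges = [(wat_record["url"], link["url"], link["text"])
         for link in wat_record["links"]]
for src, dst, anchor in edges:
    print(f"{src} -> {dst} ({anchor})")
```

Repeated over every record in a collection, this yields the link graph and anchor-text corpus without ever touching the full-size WARCs.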

LGA and WANE files are unfamiliar to the two authors of this blog, although they look to be very useful! LGA files speed up longitudinal link analysis derived from WAT files. The examples they give are actually from the Canadian Political Interest Groups collection! Finally, WANE files use the Stanford Named Entity Recognizer to extract information relating to people, organizations, and locations. Using derived text from a web archive, Milligan plotted all the locations mentioned in GeoCities – you can see the results here.

To get these files, consult the service details page. In short, if you’re an Archive-It partner you can order it internally through your dashboard. For the rest of us researchers, you just need to send an e-mail in with some information, and start the process.

In short: an amazing move that’s really going to unlock these files. WARC files are really big – too big for most systems – whereas the LGA, WAT, and WANE model is going to make web archive research accessible. Kudos to Archive-It.

ReSAW: Research Infrastructure for the Study of Archived Web Materials

Historians based in Europe in particular should know about ReSAW, a Europe-wide network of scholars and institutions interested in the archived web. Co-ordinated by Niels Brügger (Aarhus University, NetLab & the Centre for Internet Studies), at present it is largely sustained by the efforts of its members, but there are plans for expansion in the next few years.

The next ReSAW event is a major conference on Web Archives as Scholarly Sources: Issues, Practices and Perspectives, which will take place in Denmark in June, and at which both Peter and Ian will be presenting papers. Booking is now open.

As well as the conference, there is a cluster of pilot research projects which may be of interest to historians. These include examinations of patterns of commemoration online, through to the traces left by the Eurovision Song Contest.

How fast does the web change and decay? Some evidence

One of the prime motivations behind Web Archives for Historians is our consciousness of how quickly the web changes and decays, and of the particular shape this gives to the archived web with which historians will need to work. However, just how this happens is not very well documented, and so I draw attention here to some introductory resources about the problem.

As historians, we need to know about patterns in which content disappears, but also about the rate at which it disappears.  One recent paper (2012) by Hany M. SalahEldeen and Michael L. Nelson looked at how quickly resources shared on social media about particular news events had disappeared, and found that:

after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.

Taking a different approach, the British Library’s technical lead Andy Jackson sampled ten years’ worth of archived content in the UK Web Archive to plot not only the rate of disappearance of content, but also the degree to which it had changed between 2004 and 2014. Readers may be interested in both the methods Andy used, and some important caveats about how the selection of content in the archive may have influenced the trends. But the headline is that the fraction of content that is both still online and unchanged after those ten years is so small that it can hardly be seen on the graph. Even for content that was archived only a year before, the proportion that is live and unchanged is less than 10%.
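Taken at face value, the SalahEldeen and Nelson figures can be projected forward with a back-of-the-envelope calculation. Whether the 0.02% daily loss compounds on what remains (as assumed below) or applies to the original corpus is not something the headline figures alone settle, so read this as a rough sketch rather than a prediction.

```python
# Projection from SalahEldeen and Nelson's figures: ~11% of shared
# resources lost in the first year, then ~0.02% per day thereafter.
# Assumption: the daily loss compounds on the surviving remainder.
def surviving(years):
    frac = 0.89                       # fraction left after the first year
    days = max(0, (years - 1) * 365)
    frac *= (1 - 0.0002) ** days      # 0.02% per day after year one
    return frac

for y in (1, 2, 5, 10):
    print(f"after {y:2d} years: {surviving(y):.1%} still resolvable")
```

Under that assumption, well under half of a year's shared links would still resolve a decade on, which is the scale of loss historians of the web should plan around.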

In their different ways, both studies point to the same issue: that the live web changes and disappears very quickly. Historians need both to grasp how it happens, as well as to begin to think about what kind of archive this leaves us with.

The Historian of the Web: Crawler, Browser or Lurker?

[A special guest post by Valérie Schafer and Francesca Musiani.]

“They program us, we re-program them. They segment us, we move around. They accelerate, we linger. We can always be smarter than our machines.”
[Louise Merzeau, « Le Flâneur Impatient », Médium, Rythmes, n°41, 2014/4, p. 20-29]

In his blog post of January 22, 2015, “The Promise of WebARChive Files”, Ian Milligan noted:

Not only does the Agence nationale de la recherche project Web90 take this idea seriously, but its opposite as well: “You can’t do justice to the World Wide Web if you do not consider the 1990s”, we argue.

Building on this core idea, the project aims at providing elements of reflection about the context of Web development in the Nineties (e.g. tariffs, strategies and offers put forward by ISPs, the birth of web design, the emergence of e-commerce and personal pages, the transition from the Minitel to the Web, the notorious legal controversies that ISPs and hosting services needed to face, or national State-driven policies). Also, Web90 wishes to map the “French” Web (defined as the .fr domain, even if, we are aware, this does not account for all French websites on its own, let alone French Web browsing patterns), and to reconstruct users’ experience of Web browsing in light of such factors as the “graphic scenarios” that emerged as interfaces evolved. These issues call for the simultaneous adoption of several methods, very different but nonetheless complementary, and not equally suited to provide answers to all questions.

Web archives as big data…
This was the title of the conference organised in December 2014 by the Big UK Domain Data for the Arts and Humanities project. The information ‘deluge’ may appear less threatening for the scholar of the Web of the Nineties, despite an important growth in the number of domain names and hosts in the second half of the decade. However, what was already an abundance needs to be managed, as do the missing pieces – images or sites that were not preserved, or only very fleetingly or superficially so. The Digital Humanities and their tools will prove useful in facing the massive amounts of Web data, provided that historians are ready to enter the “black box” of tools and instruments, as Ian Milligan showed, and also the “black box” of Web archiving, as Axis 3 of the Web90 project shows. Indeed, beyond the understanding of tools, there are the nature of collecting procedures, their periodicity and their actors to engage with, as well as the representations underpinning the constitution of archives.

Black boxes …
Let us take two examples. The first is developed in the article “Quand la communication devient patrimoine…” [When Communication Becomes Heritage], co-written by Camille Paloque-Berges and Valérie Schafer, forthcoming in Hermès. The article addresses, amongst other things, the vision of Web and digital heritage that informs the actions and strategies of the Archive Team. In stating, on its home page, that “History is our future… And we’ve been trashing our history”, and describing itself as “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage”, Archive Team clearly states its interest in “ordinary” forms of communication that are, nonetheless, already qualified as digital heritage.

Mélanie Dulong de Rosnay, member of the Web90 team, has shown in Réseaux de production collaborative de connaissances, building on Elinor Ostrom’s theories, how the notion of common good is today “materialized” as a core feature of peer production networks. We can probably observe, in the communitarian implementation of this preservation infrastructure, an alternative management of informational common goods as heritage, as well as a movement of re-appropriation by users.

Born-digital heritage, as well as digital heritage, as we have recently noted in our Call for Papers for the RESET journal, “ […] call for empirical investigations on both its publics – existing or expected/envisaged – and its promoters, producers, preservers. […] The controversies that these policies raise (e.g. those that concern the ‘right to be forgotten’ and the right to memory), as well as the interactions of public authorities with preservation institutions (or among these institutions themselves), are interesting to analyze for the light they shed on the socio-technical and political dimensions of ‘digital heritage’, as it becomes institutionalized. The practices and procedures contributing to the shaping and the legitimization of digital heritage entail a number of choices, trials, tests, intertwined ‘scales of action’, and a social ‘work’ undertaken by a variety of actors, including professional associations, amateurs, the public at large, libraries, museums, research groups volunteering to be in charge of specific archiving tasks or initiating preservation policies, international institutions or clusters of entities such as UNESCO or the International Internet Preservation Consortium”.

The second example is drawn from Louise Merzeau’s work. On the occasion of the general assembly of the IIPC, in May 2014, she showed the link between archiving models, epistemological models and research models, and the retro-active feedback between these elements. As such, entering the black box seems crucial; however, historians will need to avoid several obstacles.

… and their temptations
Trying to transform historians into computer scientists is, in our opinion, an idea as risky as its contrary, i.e., internalist and machine-focused approaches that informed the early days of computing history written by practitioners. However, not to improve historians’ digital literacy would be just as disastrous.

Similarly, it seems very important to us not to mingle different roles to the point of confusion: if historians and their colleagues from other social and human sciences are not computer scientists, they are not archivists either. Institutional mediation (while frustrating at times, as it implies choices and gaps) guarantees sustainability and accessibility: a long-term vision that researchers’ archiving practices can only partially satisfy, unless access to data and their deposit are completely re-thought in history and the other social and human sciences – not unlike what other communities have previously done (e.g. GenBank).

The naïve way of data-driven science, which has led some to believe that we are witnessing the “end of theory”, should also be avoided. The related risks have been underlined in other disciplines, for example by Bruno Strasser. As Antoine Prost remarked in his Twelve Lessons on History, there is no document without an underlying question. The questions asked by historians turn the traces left by the past into sources and documents. Big data and computational methods are not always, or not entirely, appropriate to provide answers and, even more so, to formulate questions.

Web explorers… “Small is beautiful”!
In the Web90 project, to account for the set of conditions that shaped Internet users’ experience of the Web in the Nineties – especially their browsing habits – we have chosen to study, on the one hand, the general framework and the power of ‘massification’. On the other hand, and in parallel, we wish to ‘stay close to the archive’, at times following paths previously traced by others (e.g. directories or closed spaces such as Infonie, open doors such as ISP portals or the Yahoo! Directory, or sites recommended by the press or by guidebooks such as the 1998 Guide du Routard de l’Internet). In this domain, Web archives are of course precious assets, but the word “browse” here recovers its full etymological sense: our exploration also mobilizes printed sources (press archives, State-driven reports, guidebooks for the general public), as well as audio-visual archives, oral testimonies, and newsgroups. These sources invite historians to become lurkers on past exchanges. But very often, they soon need to shed this “passive” status…

Research into Usenet newsgroups on women’s presence on the Web of the Nineties (carried out for this conference) allowed us to retrace a post by one “Guillermito El Loco”, followed by twenty-five replies, on the subject: “The Web is by far too masculine. What should we do?” Hoping to encourage feminine presence and visibility on the Web, he proposed to list the “girl Web pages” authored by French women. Two days later, having contacted the women he wished to list on his site, he reported nuanced and mixed reactions: “For now, I have had around thirty responses, four or five of them negative, sometimes with fairly violent reactions… it could almost put you off being a feminist!” The site, whose link is given in the post, has luckily been preserved by the Internet Archive, and it opens a door onto “feminine Web pages” and the profiles of their authors. Guillermito compiled an alphabetical database of these profiles and collected the links, a good proportion of which are still active in the Wayback Machine. There is, we believe, no need to argue further for the potential of such a corpus – peculiar, modest, but unique – for our subject of study.

“The historian of tomorrow will be a programmer, or will be no more”, stated historian Emmanuel Le Roy Ladurie in 1973. Soon after, he left for the Occitan village of Montaillou, whose day-to-day history he recomposed while distancing himself from measures and statistics… Can we see in this anecdote – as Valérie Schafer and Benjamin G. Thierry argue in a forthcoming article – a harbinger for the use of Web archives? The promise of digital archives and tools leaves the door wide open for historians as regards methodology and its plurality, and we do not wish to presume which approaches will ultimately be privileged. But between quantitative and qualitative, subjectivism and scientific rigour, sampling and claims of comprehensiveness, will we witness ancient quarrels “reloaded”? Or, on the contrary, will the legacy of historiography allow us to move beyond these binary oppositions?

A Fascinating Exchange About Discovering Content in Web Archives

Web archives have arrived, at least in the pages of high-profile publications such as the Washington Post and the New Yorker.

An especially fascinating exchange took place in mid-February. Gareth Millward, a research fellow in the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine, published “I tried to use the Internet to do historical research. It was nearly impossible” in the Washington Post. In it, he explained the difficulties of navigating extremely large web archives: search queries returned useless results, poorly sorted (or not sorted at all), and historians may instead need to work with smaller, circumscribed corpora or explore metadata.

The response by Andy Jackson, Web Archiving Technical Lead at the British Library, on the British Library’s Web Archive blog, was equally illuminating. His piece, “Building a ‘Historical Search Engine’ is No Easy Thing,” is a must-read. He points out the different use cases historians have: simply replicating Google (which excels at telling us what we need to know in an extremely contemporary context) makes little sense when querying large bodies of archived web material. He walks us through the various stages of building a search engine, and concludes by arguing that we should think in terms of macroscopes rather than search engines (sidenote: having just finished copyedits on a co-authored book subtitled The Historian’s Macroscope, I’m inclined to agree with the metaphor!).

These two pieces join a third high-profile piece, “The Cobweb: Can the Internet be Archived?” by Harvard historian Jill Lepore. It is a fascinating exploration of the current state and recent history of web archiving, and well worth your time.