All posts by peterwebster

About peterwebster

Historian of twentieth century Britain; interested in digital history, open access publishing, web archives. Tweets @pj_webster

The UK Web Archive, born-digital sources, and rethinking the future of research

[A guest post by Professor Tim Hitchcock of the University of Sussex. It is derived from a short talk given at a doctoral training event at the British Library in May 2015, focused on using the UK Web Archive. It was written with PhD students in mind, but really forms a meditation on the opportunities created when working with the archived web rather than print. While lightly edited, the text retains the tics and repetitions of public presentation. We’re very grateful to Tim for permission to repost this, which first appeared on Historyonics. Tim is to be found on Twitter @TimHitchcock ]

I normally work on properly dead people of the sort that do not really appear in the UK Web Archive – most of them eighteenth-century beggars and criminals. And in many respects the object of study for people like me – interlocutors of the long dead – has not changed that much in the last twenty years. For most of us, the ‘object of study’ remains text. Of course the ‘digital’ and the online have changed the nature of that text. How we find things – the conundrums of search – that in turn shape the questions we ask – has been transformed by digitisation. And a series of new conundrums has been added to all the old ones – do, for instance, ‘big data’ and new forms of visualisation imply a new ‘open eyed’ interrogation of data? Are we being subtly encouraged to abandon older social science ‘models’ for something new? And if we are, should these new approaches take the form of ‘scientific’ interrogation, looking for ‘natural’ patterns – following the lead of the Culturomics movement; or perhaps take the form of a re-engagement with the longue durée – in answer to the pleas of the History Manifesto? Or perhaps we should be seeking a return to ‘close reading’ combined with a radical contextualisation – looking at the individual word, person and thing – in its wider context, preserving focus across the spectrum.

And of course, the online and the digital also raise issues about history writing as a genre and form of publication. Open access, linked data, open data, the ‘crisis’ of the monograph, and the opportunities of multi-modal forms of publication, all challenge us to think again about the kind of writing we do, as a literary form. Why not do your PhD as a graphic novel? Why not insist on publishing the research data with your literary over-lay? Why not do something different? Why not self-publish?

These are conundrums all – but conundrums largely of the ‘textual humanities’.  Ironically, all these conundrums have not had much effect on the academy and the kind of scholarship the academy values.  The world of academic writing is largely, and boringly, the same as it was thirty years ago.  How we do it has changed, but what it looks like feels very familiar.

But the born digital is different. Arguably, the sorts of things I do, history writing focused on the properly dead, look ‘conservative’ because they necessarily engage with the categories of knowing that dominated the nineteenth and twentieth centuries – these were centuries of text, organised into libraries of books, and commentated on by cadres of increasingly professional historians. The born digital – and most importantly the UK Web Archive – is just different. It sings to a different tune, and demands different questions – and if anywhere is going to change practice, it should be here.

Somewhat to my frustration, I don’t work on the web as an ‘object of study’ –  and therefore feel uncertain about what it can answer and how its form is shaping the conversation; but I did want to suggest that the web itself and more particularly the UK Web Archive provides an opportunity to re-think what is possible, and to rethink what it is we are asking; how we might ask it, and to what purpose.

And I suppose the way I want to frame this is to suggest that the web itself brings on to a single screen, a series of forms of data that can be subject to lots of different forms of analysis.  A few years ago, when APIs were first being advocated as a component of web design, the comment that really struck me, was that the web itself is a form of API, and that by extension the Web Archive is subject to the same kind of ‘re-imagination’ and re-purposing that an API allows for a single site or source.

As a result, you can – if you want – treat a web page as simple text – and apply all the tools of distant reading of text – that wonderful sense that millions of words can be consumed in a single gulp. You can apply ‘topic modelling’, and Latent Semantic Analysis; or Term Frequency/Inverse Document Frequency measures. Or, even more simply, you can count words, and look for outliers – stare hard at the word on the web!
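To make the simplest of these measures concrete, here is a minimal sketch of the frequency/inverse-document-frequency calculation (usually abbreviated TF-IDF), written with nothing but the Python standard library. The three "pages" are invented stand-ins for text extracted from archived web pages, and the tokeniser is deliberately crude; a real project would more likely reach for an established text-analysis library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    token_lists = [tokenize(doc) for doc in documents]

    # Count, for each word, how many documents it appears in at least once.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty page
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical stand-ins for the text of three archived pages.
pages = [
    "the archive of the web",
    "the sermon in the yard",
    "the web archive and the web",
]
scores = tf_idf(pages)
```

A word that appears on every page (here, "the") scores zero, while a word confined to a single page (here, "sermon") floats to the top of that page's list. That is the sense in which the measure helps you look for outliers.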

But you can also go well beyond this.  In performance art, in geography and archaeology, in music and linguistics, new forms of reading are emerging with each passing year that seem to me to significantly challenge our sense of the ‘object of study’ – both traditional text and web page.  In part, this is simply a reflection of the fact that all our senses and measures are suddenly open to new forms of analysis and representation. When everything is digital – when all forms of stuff come to us down a single pipeline –  everything can be read in a new way.

Consider for a moment the ‘LIVE’ project from the Royal Veterinary College in London, and their ‘haptic simulator’. In this instance they have developed a full scale ‘haptic’ representation of a cow in labour, facing a difficult birth, which allows students to physically engage and experience the process of manipulating a calf in situ. I haven’t had a chance to try this, but I am told that it is a mind-altering experience. It suggests that reading can be different; and should include the haptic – the feel and heft of a thing in your hand. This is being coded for millions of objects through 3d scanning; but we do not yet have an effective way of incorporating that 3d text into how we read the past.

The same could be said of the aural – that weird world of sound on which we continually impose the order of language, music and meaning; but which is in fact a stream of sensations filtered through place and culture. Projects like the Virtual St Paul’s Cross, which allows you to ‘hear’ John Donne’s sermons from the 1620s from different vantage points around the yard, change how we imagine them, and move from ‘text’ to something much more complex and powerful. And they begin to navigate that normally unbridgeable space between text and the material world. And if you think about this in relation to music and speech online – you end up with something different on a massive scale.

One of my current projects is to create a soundscape of the courtroom at the Old Bailey – to re-create the aural experience of the defendant – what it felt like to speak to power, and what it felt like to have power spoken at you from the bench. And in turn, to use that knowledge to assess who was more effective in their dealings with the court, and whether having a bit of shirt to you, for instance, affected your experience of transportation or imprisonment. And the point of the project is to simply add a few more variables to the ones we can securely derive from text.

It is an attempt to add just a couple more columns to a spreadsheet of almost infinite categories of knowing. And you could keep going – weather, sunlight, temperature, the presence of the smells and reeks of other bodies. Ever more layers to the sense of place. In part, this is what the gaming industries have been doing from the beginning, but it also becomes possible to turn that creativity on its head, and make it serve a different purpose.

In the work of people such as Ian Gregory, we can see the beginnings of new ways of reading both the landscape, and the textual leavings of the dead. Bob Shoemaker, Matthew Davies and I (with a lot of other people) tried to do something similar with Old Bailey material, and the geography of London in the Locating London’s Past project.

This map is simply the colours blue, red and yellow mapped against brown and green. I have absolutely no idea what this mapping actually means, but it did force me to think differently about the feel and experience of the city. And I want to be able to do the same for all the text captured in the UK domain name.

All of which is to state the obvious. There are lots of new readings that change how we connect with historical evidence – whether that is text, or something more interesting. In creating new digital forms of inherited culture – the stuff of the dead – we naturally innovate, and naturally enough, discover ever changing readings. But the Web Archive challenges us to do a lot more; and to begin to unpick what you might start pulling together from this near infinite archive.

In other words, the tools of text are there, and arguably moving in the right direction, but there are several more dimensions we can exploit when the object of study is itself an encoding.

Each web page, for instance, embodies a dozen different forms. Text is obvious, but it is important to remember that each component of the text – each word and letter, on a web page – is itself a complex composite. What happens when you divide text by font or font size; weight, colour, kerning, formatting etc.? Or by location – in the header, or the body, or wherever the CSS sends it; or more subtly by where it appears to a user’s eye – in the middle of a line – or at the end? Suddenly, to all the forms of analysis we have associated with ‘distant reading’ there are five or six further columns in the spreadsheet – five or six new variables to investigate in that ‘big data’ eye-opened sort of way.
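As a rough illustration of dividing text "by location", the sketch below counts each word according to the HTML element that immediately encloses it, using only Python's built-in `html.parser`. It makes a large simplifying assumption: location here means position in the markup, not where the CSS finally puts the word on screen, and font, colour or kerning would need a rendering engine to recover. The class name, the list of void elements and the sample page are all invented for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class LocatedWords(HTMLParser):
    """Count words by the innermost element that contains them."""

    # Void elements never get a closing tag, so they stay off the stack.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open elements, outermost first
        self.counts = Counter()  # (tag, word) -> occurrences

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        location = self.stack[-1] if self.stack else "(none)"
        if location in ("script", "style"):
            return  # code and styling are not reading text
        for word in re.findall(r"\w+", data.lower()):
            self.counts[(location, word)] += 1


# A hypothetical page, standing in for one record from a web archive.
page = (
    "<html><head><title>Old Bailey</title></head>"
    "<body><h1>Old Bailey trials</h1>"
    "<p>Trials at the Old Bailey.</p></body></html>"
)
parser = LocatedWords()
parser.feed(page)
counts = parser.counts
```

Each `(tag, word)` pair is, in effect, one of the extra columns: the same word can now be counted separately as a title word, a heading word or a body word.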

And that is just the text. The page itself is both a single image, and a collection of them – each with their own properties. And one of the great things that is coming out of image research is that we can begin to automate the process of analysing those screens as ‘images’. Colour, layout, face recognition etc. Each page is suddenly ten images in one – all available as a new variable; a new column in the spreadsheet of analysis. And, of course, the same could be said of embedded audio and video.

And all of that is before we even look under the bonnet. The code, the links, the metadata for each page – in part we can think of these as just another iteration of the text; but more imaginatively, we can think about them as more variables in the mix.
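Looking under the bonnet can start very modestly. The sketch below, again using only the Python standard library, pulls the outgoing links and the named `<meta>` tags out of a page so that they can be treated as further variables. The class name, the sample markup and the example.co.uk address are invented for the illustration; real archived pages are messier, and a web archive will usually supply the original URL alongside the captured content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinksAndMeta(HTMLParser):
    """Collect outgoing links and named <meta> tags from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]


# Hypothetical page and address, for illustration only.
base = "http://example.co.uk/index.html"
page = (
    '<head><meta name="Author" content="A. Historian"></head>'
    '<body><a href="/about">About</a> '
    '<a href="http://example.org/x">Elsewhere</a></body>'
)
parser = LinksAndMeta(base)
parser.feed(page)

# Hosts linked to, other than the page's own host.
external_hosts = {urlparse(link).netloc for link in parser.links} - {urlparse(base).netloc}
```

Collected across many pages, the list of external hosts is the raw material for thinking about networks of connections rather than individual pages.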

But, of course, that in itself misunderstands the web and the Web Archive. The commonplace metaphor I have been using up till now is of a ‘page’ – and is the intellectual equivalent of skeuomorphism – relying on material world metaphors to understand the online.

But these aren’t pages at all, they are collections of code and data that generate an experience in real time. They do not exist until they are used – if a website in the forest is never accessed, it does not exist. The web archive therefore is not an archive of ‘objects’ in the traditional sense, but a snapshot from a moving film of possibilities. At its most abstract, what the UK Web Archive has done is spirit into being the very object it seeks to capture – and of course, we all know that in doing so, the capturing itself changes the object. Schrödinger’s cat may be alive or dead, but its box is definitely open, and we have visited our observations upon its content.

So to add to all the layers of stuff that can fill your spreadsheet, there also needs to be columns for time and use; re-use and republication.  And all this is before we seek to change the metaphor and talk about networks of connections, instead of pages on a website.

Where I end up is seriously jealous of the possibilities; and seriously wondering what the ‘object of study’ might be. In the nature of an archive, the UK Web Archive imagines itself as an ‘object of study’; created in the service of an imaginary scholar. The question it raises is how do we turn something we really can’t understand, cannot really capture as an object of study, to serious purpose? How do we think at one and the same time of the web as alive and dead, as code, text, and image – all in dynamic conversation one with the other? And even if we can hold all that at once, what is it we are asking?


A Heisenberg Principle of web archiving?

It’s been great to see the historical perspective being represented at this week’s General Assembly of the IIPC in Stanford. Following the Twitter hashtag at #iipcGA15, this older post came to mind. The comprehensive domain-wide archiving under UK Non-Print Legal Deposit that it refers to is now two years old; and 2015 has seen a significant upswing in attention being paid to web archiving in the press. So: do we yet know what the effect of widespread web archiving will be on the behaviour of those being archived? I don’t think we do; and historians of the future will need to know.

Webstory: Peter Webster's blog

Whatever it means to real scientists, the famous ‘uncertainty principle’ of Werner Heisenberg is sometimes popularly taken to mean that it is impossible closely to observe something without in some way altering it. It’s also a conundrum that has faced anthropologists when observing cultures far removed from their own: how far does the consciousness of being observed alter the behaviour of the subject?

I’ve been publishing in print in the traditional way for some years now, and everyone knows that books are (in theory) permanent, that they find their way into libraries; and so one writes conscious that the words cannot be unwritten. Writing for the web, however, has had a more transient aesthetic: I can write with the freedom that comes from knowing that (in a site I control) I can retrospectively edit at will, should I choose to. There are good scholarly reasons not to, to do…

ReSAW: Research Infrastructure for the Study of Archived Web Materials

Historians based in Europe in particular should know about ReSAW, a Europe-wide network of scholars and institutions interested in the archived web. Co-ordinated by Niels Brügger (Aarhus University, NetLab & the Centre for Internet Studies), at present it is largely sustained by the efforts of its members, but there are plans for expansion in the next few years.

The next ReSAW event is a major conference on Web Archives as Scholarly Sources: Issues, Practices and Perspectives, which will take place in Denmark in June, and at which both Peter and Ian will be presenting papers. Booking is now open.

As well as the conference, there is a cluster of pilot research projects which may be of interest to historians. These include examinations of patterns of commemoration online, through to the traces left by the Eurovision Song Contest.

How fast does the web change and decay? Some evidence

One of the prime motivations behind Web Archives for Historians is our consciousness of how quickly the web changes and decays, and of the particular shape this gives to the archived web with which historians will need to work. However, just how this happens is not very well documented, and so I draw attention here to some introductory resources about the problem.

As historians, we need to know about patterns in which content disappears, but also about the rate at which it disappears.  One recent paper (2012) by Hany M. SalahEldeen and Michael L. Nelson looked at how quickly resources shared on social media about particular news events had disappeared, and found that:

after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.
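The arithmetic implied by that sentence is worth spelling out. The sketch below reads it in the simplest way I can: about 11% of resources gone at the one-year mark, then a further 0.02 percentage points of the original set lost on each following day. That straight-line reading, the function name and the refusal to guess at what happens inside the first year are my assumptions, not claims made in the paper.

```python
def percent_lost(days_since_publication):
    """Estimated percentage of shared resources lost, on a linear reading.

    Only defined from day 365 onwards: the quoted sentence gives a total
    for the first year but does not say how loss accumulates within it.
    """
    if days_since_publication < 365:
        raise ValueError("the quoted figures do not cover the first year")
    return 11.0 + 0.02 * (days_since_publication - 365)


two_years = percent_lost(2 * 365)   # 11 + 0.02 * 365
five_years = percent_lost(5 * 365)  # 11 + 0.02 * 1460
```

On this reading roughly 18% of shared resources are gone after two years and roughly 40% after five.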

Taking a different approach, using ten years’ worth of archived content in the UK Web Archive, the British Library’s technical lead Andy Jackson took a sampling approach to plot not only the rate of disappearance of content, but also the degree to which it had changed between 2004 and 2014. Readers may be interested in both the methods Andy used, and some important caveats about how the selection of content in the archive may have influenced the trends. But the headline is that the fraction of content that is both still online and unchanged after those ten years is so small it can hardly be seen on the graph. Even for content that was archived only a year before, the proportion that is live and unchanged is less than 10%.

In their different ways, both studies point to the same issue: that the live web changes and disappears very quickly. Historians need both to grasp how it happens and to begin to think about what kind of archive this leaves us with.

The Historian of the Web: Crawler, Browser or Lurker?

[A special guest post by Valérie Schafer and Francesca Musiani.]

“They program us, we re-program them. They segment us, we move around. They accelerate, we linger. We can always be smarter than our machines.”
[Louise Merzeau, « Le Flâneur Impatient », Médium, Rythmes, n°41, 2014/4, p. 20-29]

In his blog post of January 22, 2015, “The Promise of WebARChive Files”, Ian Milligan noted:

Not only does the Agence nationale de la recherche project Web90 take this idea seriously – but its opposite, as well: “You can’t do justice to the World Wide Web if you do not consider the 1990s”, we argue.

Building on this core idea, the project aims at providing elements of reflection about the context of Web development in the Nineties (e.g. tariffs, strategies and offers put forward by ISPs, the birth of web design, the emergence of e-commerce and personal pages, the transition from the Minitel to the Web, the notorious legal controversies that ISPs and hosting services needed to face, or national State-driven policies). Also, Web90 wishes to map the “French” Web (defined as the .fr domain, even if, we are aware, this does not account for all French websites on its own, let alone French Web browsing patterns), and to reconstruct Web browsing users’ experience in light of such factors as the “graphic scenarios” that emerged as a result of the evolution of interfaces. These issues call for the simultaneous adoption of several methods, very different but nonetheless complementary, and not equally suited to provide answers to all questions.

Web archives as big data…
This was the title of the conference organised in December 2014 by the Big UK Domain Data for the Arts and Humanities project. The information ‘deluge’ may appear less threatening for the scholar of the Web of the Nineties, despite an important growth, in the second half of the decade, of the number of domain names and hosts. However, what was already an abundance needs to be managed, as do the missing pieces – images or sites that were not preserved, or very fleetingly or superficially so. The Digital Humanities and their tools will prove useful to face the massive amounts of Web data, provided that historians are ready to enter the “black box” of tools and instruments, as Ian Milligan showed, and also the “black box” of Web archiving, as Axis 3 of the Web90 project shows. Indeed, beyond understanding the tools, historians need to engage with the nature of collecting procedures, their periodicity and their actors, as well as the representations underpinning the constitution of archives.

Black boxes …
Let us take two examples. The first is developed in the article “Quand la communication devient patrimoine…” [When Communication Becomes Heritage], co-written by Camille Paloque-Berges and Valérie Schafer, forthcoming in Hermès. The article addresses, amongst other things, the vision of Web and digital heritage that informs the actions and strategies of the Archive Team. In stating, on its home page, that “History is our future… And we’ve been trashing our history”, and describing itself as “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage”, Archive Team clearly states its interest in “ordinary” forms of communication that are, nonetheless, already qualified as digital heritage.

Mélanie Dulong de Rosnay, member of the Web90 team, has shown in Réseaux de production collaborative de connaissances, building on Elinor Ostrom’s theories, how the notion of common good is today “materialized” as a core feature of peer production networks. We can probably observe, in the communitarian implementation of this preservation infrastructure, an alternative management of informational common goods as heritage, as well as a movement of re-appropriation by users.

Born-digital heritage, as well as digital heritage, as we have recently noted in our Call for Papers for the RESET journal, “ […] call for empirical investigations on both its publics – existing or expected/envisaged – and its promoters, producers, preservers. […] The controversies that these policies raise (e.g. those that concern the ‘right to be forgotten’ and the right to memory), as well as the interactions of public authorities with preservation institutions (or among these institutions themselves), are interesting to analyze for the light they shed on the socio-technical and political dimensions of ‘digital heritage’, as it becomes institutionalized. The practices and procedures contributing to the shaping and the legitimization of digital heritage entail a number of choices, trials, tests, intertwined ‘scales of action’, and a social ‘work’ undertaken by a variety of actors, including professional associations, amateurs, the public at large, libraries, museums, research groups volunteering to be in charge of specific archiving tasks or initiating preservation policies, international institutions or clusters of entities such as UNESCO or the International Internet Preservation Consortium”.

The second example is drawn from Louise Merzeau’s work. On the occasion of the general assembly of the IIPC, in May 2014, she showed the link between archiving models, epistemological models and research models, and the retro-active feedback between these elements. As such, entering the black box seems crucial; however, historians will need to avoid several obstacles.

… and their temptations
Trying to transform historians into computer scientists is, in our opinion, an idea as risky as its contrary, i.e., internalist and machine-focused approaches that informed the early days of computing history written by practitioners. However, not to improve historians’ digital literacy would be just as disastrous.

Similarly, it seems very important to us not to mingle different roles to the point of confusion: if historians and their colleagues from other social and human sciences are not computer scientists, they are not archivists either. Institutional mediation (while frustrating at times, as it implies choices and gaps) guarantees sustainability and accessibility: a long-term vision that researchers’ archiving practices can only partially satisfy, unless access to data, and their deposit is completely re-thought, in history and other social and human sciences – not unlike what other communities have previously done (e.g. GenBank).

The naïve way of data-driven science, which has led some to believe that we are witnessing the “end of theory”, should also be avoided. The related risks have been underlined in other disciplines, for example by Bruno Strasser. As Antoine Prost remarked in his Twelve Lessons on History, there is no document without an underlying question. The questions asked by historians turn the traces left by the past into sources and documents. Big data and computational methods are not always, or not entirely, appropriate to provide answers and, even more so, to formulate questions.

Web explorers… “Small is beautiful”!
In the Web90 project, so as to account for the set of conditions that have shaped the Web experience of Internet users in the Nineties – especially their browsing habits – we have chosen to study on the one hand the general framework and the power of ‘massification’. But on the other hand, and in parallel, we wish to ‘stay close to the archive’, and eventually to follow paths previously traced by others (e.g. directories or closed spaces such as Infonie), open doors such as ISP portals or the Yahoo! Directory, or those sites that were recommended by the press or by guidebooks, such as the 1998 Guide du Routard de l’Internet. Of course, in this domain, Web archives are precious assets, but the word “browser” takes on here its full original etymological sense: to encompass our mobilization of printed sources (press archives, State-driven reports, guidebooks for the general public) but also audio-visual archives, oral testimonies, or newsgroups. These sources invite historians to become lurkers around exchanges past. But very often, they soon need to emerge from this “passive” status…

A search of Usenet newsgroups focused on female presence on the Web of the Nineties (carried out for this conference) has allowed us to retrace a post by a “Guillermito El Loco”, followed by twenty-five other posts, on the subject: “The Web is by far too masculine. What should we do?” With the objective of encouraging feminine presence and visibility on the Web, he proposes to list “girl Web pages” authored by French women. Two days later, as he has contacted the women he wishes to list on his site, reactions are quite nuanced and mixed: “For now, I had around thirty responses, four or five of them negative, sometimes with fairly violent reactions… it might actually disgust you from being a feminist!” The site, the link of which is mentioned in the post, has luckily been preserved by the Internet Archive, and opens a door towards “feminine Web pages” and the profiles of their authors. Guillermito has compiled an alphabetical database of these profiles and collected the links, a fairly good proportion of which are active in the Wayback Machine. There is, we believe, no need to argue further for the potential of such a corpus – peculiar, modest but unique – for our subject of study.

“The historian of tomorrow will be a programmer, or will be no more”, stated historian Emmanuel Le Roy Ladurie in 1973. Soon after, he left for the Occitan village Montaillou, of which he recomposed the day-to-day history, while distancing himself from measures and statistics… Do we, can we, see in this anecdote – as argued by Valérie Schafer and Benjamin G. Thierry in a forthcoming article – a harbinger for the use of Web archives? The promise of digital archives and tools leaves the door wide open for historians as regards methodology and its plurality. We do not wish to assume anything about which approaches will ultimately be privileged. However, between quantitative and qualitative, subjectivism and scientific requirements, sampling or claims of comprehensiveness, will we witness ancient quarrels “reloaded”? Or, on the contrary, will the legacy of historiography allow us to move beyond these dichotomies… and beyond binary oppositions?

Religion, social media and the web archive

Peter reblogs here a post on the ways in which his own study of contemporary religious history needs to come to terms with the ways in which social media content is (and is not) captured by traditional web archiving. As historians, we will need to understand how social media content is being archived, and the ways in which different archives of web-delivered content will need to be interrogated *together* to reconstruct the communication of individuals and organisations.

Webstory: Peter Webster's blog

Late last year I was delighted to be invited to be one of four keynote speakers at a workshop on religion and social media at the International AAAI Conference on Web and Social Media in Oxford in May. Here are some initial thoughts on what I intend to say.

There has been an interesting upswing recently in scholarly interest in the ways in which religious people, and the organisations in which they gather together, represent themselves and communicate with others on social media. However, this work has been conducted relatively independently from the emerging body of scholarship on the archived web.

There are some reasons for this. First is the fact that much of the scholarship on social media tends to be focussed very firmly on the present. As such, data tends to be gathered directly from social media platforms “to order”, to match the…
