So You’re a Historian Who Wants to Get Started in Web Archiving

By Ian Milligan (University of Waterloo)

(Cross-posted and adapted from an earlier post I wrote for the IIPC’s blog)

The web archiving community is a great one, but it can sometimes be a bit confusing to enter. Unlike communities such as the Digital Humanities, which has developed aggregation services like DH Now, the web archiving community is a bit more dispersed. But fear not, there are a few places to visit to get a quick sense of what’s going on. Here I just want to give a quick rundown of how you can learn about web archiving on social media, from technical walkthroughs, and from blogs.

I’m sure I’m missing stuff – let us all know in the comments!

Social Media

A substantial amount of web archiving scholarship happens online. I use Twitter (I’m at @ianmilligan1), for example, as a key way to share research findings and ideas as my project comes together. I usually try to tag them with #webarchiving, which means that every tweet using that hashtag shows up in the same timeline.

For best results, a Twitter client like Tweetdeck, Tweetbot, or Echofon can help you keep apprised of things. There may be Facebook groups – I actually don’t use Facebook (!) so I can’t provide much guidance there.

On the trace of a website’s lost past

[A guest post from Federico Nanni, who is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna.]

Result from a Wayback Search for

The University of Bologna is considered by historians to be the world’s oldest university in terms of continuous operation. At the university’s Centre for the History of Universities and Science, our research group works on different projects focused on understanding the history of this academic institution and its long-term socio-political relationships. Several sources and methods have been used for studying its past, from quantitative data analysis to online databases of scientific publications.

Since the introduction of the World Wide Web, a new and different kind of primary source has become available for researchers: born digital documents, materials which have been shared primarily online and which will become increasingly useful for historians interested in studying recent history.

However, these sources are already more difficult to preserve than traditional ones, and this is especially true of the University of Bologna’s digital past. In fact, Italy does not have a national archive for the preservation of its web-sphere and furthermore “” has been excluded from the Wayback Machine.

For this reason, I have focused my research (a forthcoming piece on this will be available in Digital Humanities Quarterly) primarily on understanding how to deal with and solve this specific issue, in order to reconstruct the University of Bologna’s digital past and to understand whether these materials can offer us a new perspective on the recent history of this institution.

In order to understand the reasons for the removal of, my first step was to look in the Internet Archive’s exclusion policy for information related to the message “This URL has been excluded from the Wayback Machine”, which appeared when searching “”.

As described in the Internet Archive’s FAQ section, the most common reason for this exclusion is when a website explicitly requests to not be crawled by adding the string “User-agent: ia_archiver Disallow: /” to its robots.txt file. However, it is also explained that “Sometimes a website owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a “blocked site error” message, that means that a site owner has made such a request and it has been honoured. Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only. When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent”.
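As a rough illustration of the robots.txt rule quoted above, the following Python sketch uses the standard library's urllib.robotparser to check whether a set of rules would turn away the ia_archiver user agent. The example rules, the example.org URL and the second bot name are assumptions made for illustration; they are not taken from any site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the two-line rule described in the FAQ.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The Internet Archive's crawler is blocked; a crawler not named in the file is not.
blocked_for_ia = is_blocked(ROBOTS_TXT, "ia_archiver", "http://example.org/index.html")
blocked_for_other = is_blocked(ROBOTS_TXT, "SomeOtherBot", "http://example.org/index.html")
print(blocked_for_ia, blocked_for_other)
```

Note that this only models the robots.txt route to exclusion. A removal requested directly by a site owner leaves no trace in robots.txt at all, which is why it is so much harder to investigate.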

When a website has not been archived due to robots.txt limitations a specific message is displayed. This is different from the one that appeared when searching the University of Bologna website, as you can see in the figures below. Therefore, the only possible conclusion is that someone explicitly requested to remove the University of Bologna website (or more likely, only a specific part of it) from the Internet Archive.


For this reason, I decided to consult CeSIA, the team that has supervised during the last few years, regarding this issue. However, they did not submit any removal request to the Internet Archive and they were not aware of anyone submitting it.

To clarify this issue and discover whether the website of this institution has been somehow preserved during the last twenty years, I further decided to contact the Internet Archive team at (as suggested in the FAQ section).

Thanks to the efforts of Mauro Amico (CeSIA), Raffaele Messuti (AlmaDL – Unibo), Christopher Butler (Internet Archive) and Giovanni Damiola (Internet Archive), we began to collaborate at the end of March 2015. As Butler told us, this case was really similar to another one that involved the New York Government Websites.

With their help, I discovered that a removal request regarding the main website and a list of specific subdomains had been submitted to the Wayback Machine in April 2002.


With our efforts, the university’s main website became available again on the Wayback Machine on the 13th of April 2015. However, neither the Internet Archive nor CeSIA has any trace of the email requests. For this reason, CeSIA decided to keep the other URLs in the list excluded from the Wayback Machine for the moment, as it is possible that this request was made for a specific legal reason.

In 2002 the administration of changed completely, during a general re-organization of the university’s digital presence. Therefore, it remains entirely unclear who, in that very same month, could have sent this specific request, and for what reason.

However, it is evident that this request was made by someone who knew how the Internet Archive exclusion policy works, as he or she explicitly declared a specific list of subdomains to remove (in fact, the Internet Archive excludes based on URLs and their subsections – not subdomains). It is possible that the author obtained this specific knowledge by contacting the Internet Archive directly and asking for clarification.

Even if thirteen years have passed, my assumption was that someone involved in the administration of the website would have remembered at least this email exchange with a team of digital archivists in San Francisco. So, between April and June 2015 I conducted a series of interviews with several of the people involved in the website, pre and post the 2002 reorganization. However, no one had memories or old emails related to this specific issue.

As the specificity of the request is the only hint that could help me identify its author, I decided to analyze the different URLs in more detail. The majority of them are server addresses (identified by “alma.unibo”), while the other pages are subdomains of the main website, for example (probably dedicated to international collaborations).

My questions now are: why did someone want to exclude exactly these pages and not all the department pages, which had an extremely active presence at that time? Why exactly these four subdomains and not the digital magazine Alma2000 or the e-learning platform? It is possible that this precise selection was made for a specific reason, one that could offer us a better understanding of the use and purpose of this platform in those years.

To conclude, I would like to point out how strange this specific impasse is: given that we don’t know the reason for the request, I cannot get permission from CeSIA, the current administrator, to analyze the snapshots of these URLs. At the same time, we are not able to find anyone who remembers sending the request, and no proof of it has been preserved. In my opinion, this perfectly illustrates a new kind of difficulty that future historians will encounter while investigating our past in the archives.

Federico Nanni is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna. His research is focused on understanding how to combine methodologies from different fields of study in order to face both the scarcity and the abundance of born digital sources related to the recent history of Italian universities.

The rise and fall of text on the Web: a study using Web archives

[The following is a guest post from Anthony Cocciolo (@acocciolo), Associate Professor at Pratt Institute School of Information and Library Science, on a recently published research study]

In the summer of 2014, I became interested in studying whether it was more than my mere impression that websites were beginning to present less text to end-users. Websites such as were gaining enormous popularity and using a communicative style that had more in common with children’s books (large graphics and short segments of text) than with the traditional newspaper column. I wondered if I could measure this change in any systematic way. I was interested in this change primarily for what it implied about literacy and what we ought to teach students, and more broadly about what this change meant for how humans communicate and share information, knowledge and culture.

Teaching students to become archivists at a graduate school of information and library science, and focusing on a variety of digital archiving challenges, I was quite familiar with web archives. It was immediately clear to me if I were to study this issue I would be relying on web archives, and primarily on the Internet Archive’s Wayback Machine, since it had collected such a wide scope of web pages since the 1990s.

The method devised was to select 100 popular and prominent homepages in the United States from a variety of sectors that were present in the late 1990s and are still used today. I also decided to select homepages every three years beginning in 1999, resulting in six captures per site, or 600 homepages in total. The reason for this decision is that by 1999 the Internet Archive’s web archiving efforts were fully underway, and three years would be enough to show changes without requiring a hugely repetitive dataset. URLs for webpages in the Internet Archive were selected using the Memento web service. Full webpages were saved as static PNG files.
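The sampling design above can be sketched in a few lines of Python. This is only an illustration: the study selected archived URLs through the Memento web service, whereas this sketch simply builds Wayback Machine URLs of the form https://web.archive.org/web/<timestamp>/<url>, which the Wayback Machine resolves to the capture nearest the timestamp. The 1 July target date and the two sample sites are assumptions.

```python
# Sample every three years from 1999 through 2014, as described above.
CAPTURE_YEARS = list(range(1999, 2015, 3))


def wayback_url(site: str, year: int) -> str:
    """Build a Wayback Machine URL asking for the capture closest to mid-year."""
    timestamp = f"{year}0701000000"  # assumed target date: 1 July, midnight
    return f"https://web.archive.org/web/{timestamp}/{site}"


# Hypothetical two-site sample; the study used 100 homepages.
sites = ["http://example.com/", "http://example.org/"]
targets = [wayback_url(site, year) for site in sites for year in CAPTURE_YEARS]

print(CAPTURE_YEARS)
print(len(targets))
```

With 100 sites in place of the two placeholders, the same loop would yield the 600 capture targets mentioned above.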

To distinguish text blocks from non-text blocks, I modified a Firefox extension called Project Naptha. This extension separates text from non-text using an algorithm called the Stroke Width Transform. The percentage of text per webpage was calculated and stored in a database. A sample of the detection is shown in the figure below; the page shown is 46.10% text.

Text detection on the White House site

Once the percentage of text for each webpage and year was computed, I used a statistical technique called a one-way ANOVA to determine whether the percentage of text on a webpage was a chance occurrence, or instead dependent on the year the website was produced. I found that these percentages were not random occurrences but dependent on the year of production (what we would call statistically significant).
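For readers unfamiliar with the technique, here is a minimal sketch of the F statistic at the heart of a one-way ANOVA, written in plain Python. The percentages are made up for illustration and are not the study's data; the function name is likewise hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up text percentages for three capture years (NOT the study's data).
text_pct = {
    1999: [20.0, 22.0, 24.0],
    2005: [30.0, 32.0, 34.0],
    2014: [25.0, 27.0, 29.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(text_pct.values()))
print(f_stat, df_b, df_w)
```

A large F means the yearly averages differ by more than the spread within each year would suggest. A statistics package then converts F and the two degrees of freedom into the p-value used to judge significance.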

The major finding is that the amount of text rose each year from 1999 to 2005, at which point it peaked, and it has been on a decline ever since. Thus, website homepages in 2014 have 5.5% less text than they did in 2005. This is consistent with other research using web archives that indicates a decrease of text on the web. This pattern is illustrated below.

Mean percentage of text on pages over time

This study raises the question: what has caused this decrease in the percentage of text on the Web? Although it is difficult to draw definitive conclusions, one suggestion is that the first Web boom of the late 1990s and early 2000s brought about significant enhancements to internet infrastructure, allowing non-textual media such as video to be streamed more easily to end-users. (Interestingly, 2005 was also the year that YouTube was launched.) This is not to suggest that text was replaced with YouTube videos, but rather that multiple modes of communication, such as video and audio, became more feasible as their delivery became easier, which may have helped unseat text from its primacy on the World Wide Web.

I think the study raises a number of interesting issues. If the World Wide Web is presenting less text to users relative to other elements, does this mean that the World Wide Web is becoming a place where deep reading is less likely to occur? Is deep reading now only happening in other places, such as e-readers or printed books (some research indicates this might be the case)? The early web was the great delivery mechanism of text, but might text be further unseated from its primacy and the web become primarily a platform for delivering audiovisual media?

If interested in this study, you can read it on the open-access publication Information Research.

The net, the web, the archive and the historian

[A guest post from Dr Gareth Millward (@MillieQED), who is Research Fellow at the London School of Hygiene and Tropical Medicine.]

One of the first things you need to get your head around when you dive into the history of the internet is that “the internet” and “the web” are not the same thing. That sounds trivial to most people who have worked in the sector for any period of time. But trust me – it isn’t.

It’s a problem because we have been archiving the web systematically for quite a long time. The British Library’s archive has pages stored from 1996 onwards. So, as someone relatively new to using web archives as a scholarly source, I can access a lot of information.

As someone whose family got their first internet connection in 2000, however, I also know that there’s a lot that won’t be stored. And there is a lot that will be stored that I won’t be able to access. Internet Relay Chat, for example, was very popular when I first got access to the ‘net. From those MSN chat rooms (that were eventually shut down due to the… er… “unpleasantness”), to the use of purpose-made clients to connect with friends, chat was by its nature ephemeral. Perhaps some user would have kept a log of the conversation (and I probably have a few of those on text file somewhere). More than likely, they didn’t. Or even if they did, the chances of them surviving are slim.

The advent of Facebook and Twitter and their ilk in the mid-2000s has also complicated matters. Pretty quickly it became apparent that these social networks were culturally important and would probably need to be preserved. But the ethics of such an undertaking are complicated to say the least. It’s one thing to do a “big data” analysis of the rise and fall of the term “hope” over the 2004 US General Election. It’s another to do a “close reading” analysis of the behaviour of teenagers. Since it’s all held behind password-protected pages and servers, our old web-crawling techniques aren’t going to help. The Library of Congress is collecting Twitter. But how we will actually use it in the future remains to be seen.

Moreover, with social media, chat logs, e-mails, and various other “non-web” internet data, we cannot be certain about how systematic or representative our source base is. There is great potential for our research findings to be skewed. (Not, of course, that the web archive is objective and clean either. But I digress.)

This matters to me as a historian because I am not a computer scientist. I wouldn’t even consider myself a historian of the internet. Much like I use biographies, diaries, government papers and objects to build a story of the past, internet sources are yet another way of finding out what people said and did. A good historian would never assume a diary to be an accurate, objective account of past events, just as she would understand that, regardless of the number of sources she collates, there will always be gaps in the evidence. There is always an inherent bias in which data survive.

The problem, really, is twofold. First, there is so much material available it gives both the illusion of completeness and the temptation to try to use it all. Second, because it lacks the human curation element so central to “traditional” archives, it can be difficult to sift through the white noise and home in on the data that matters to our research questions.

The first part is relatively difficult to get over, but not impossible. It simply requires some discipline and better training on what internet archives can and cannot do. From there, we can apply our knowledge and discretion to only focus on the parts of the archive that will actually help us – and/or adapt our research questions accordingly.

But that second bit is always going to be a problem. Again, discipline can help. We can simply accept our fate – that we will never have it all – and focus our histories on the scraps that remain. Like Ian Milligan’s work on the archive of GeoCities. Or Kevin Driscoll’s on the history of Bulletin Board Systems. At the same time, how does a historian of the 1990s try to use these archives to try to access the people of the period? How on earth can this material be narrowed down? Will we always have to keep our “online” and “offline” research separate?

The exciting thing is that we don’t have fully developed answers to these questions yet. The scary thing is that it’s our generation of scholars that are going to have to come up with the solutions. This seems like a lot of work. If anyone is willing to do it for me, I would be forever grateful!

When just using a web archive could place it in danger

[A recent post, cross-posted from Peter’s own blog.]

Towards the end of 2013 the UK saw a public controversy seemingly made to showcase the value of web archives. The Conservative Party, in what I still think was nothing more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of their site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:

Firstly, the copies held by the Internet Archive were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.

At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.

Courtesy of wfryer on, CC BY-SA 2.0 :

As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.

This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, the values of which I deplore but which as scholars we need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.

Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, that consciousness begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.

Put simply: the content I’ve been citing this week could later today disappear from view if the organisation concerned wanted it to, and were to come to understand how to make it happen. It is possible, in short, effectively to delete the archive – which is rather terrifying.

In the UK, at least, the danger of this is removed for content published after 2013, due to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide). In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.

What does the web remember of its deleted past?

[A special guest post from Dr Anat Ben-David (@anatbd) of the Open University of Israel ]

On March 30 2010, the country-code top-level domain of the former Yugoslavia, .yu, was deleted from the Internet. It is said to have been the largest ccTLD ever removed. In terms of Internet governance, the domain had lost any entitlement to be part of the Internet’s root zone, after Yugoslavia dissolved. With the exception of Kosovo, all former Yugoslav republics received new ccTLDs. Technically, it was neither necessary nor possible to keep a domain of a country that no longer exists.

The consequence of the removal of the domain, which at its peak hosted about 70,000 websites, is the immediate deletion of any evidence that it was part of the Web. The oblivious live Web has simply rerouted around it. Since the .yu ccTLD is no longer part of the DNS, even if .yu websites are still hosted somewhere on a forgotten server, they cannot be recalled; search engines do not return results to queries for Websites in the .yu domain; references to old URLs on Wikipedia are broken.

My recent research uses the case of the deleted .yu domain to problematize the ties between the live and archived Web, and to both question and demonstrate the utility of Web archives as a primary source for historiography. The first problem I address relates to the politics of the live Web, which, arguably, create a structural preference for sovereign and stable states. The DNS protocol enforces ICANN’s domain delegation policy, which is derived from the ISO-3166 list of countries and territories officially recognized by the United Nations. Countries and territories recognized by the UN are therefore delegated ccTLDs, but unstable, unrecognized, dissolving or non-sovereign states cannot enjoy such formal presence on the Web, marked by the national country-code suffix. It is for this reason that the former republics of Yugoslavia (Bosnia, Macedonia, Slovenia, Croatia, Serbia and Montenegro) received new ccTLDs, but Kosovo, which is not recognized by the United Nations, did not.

While such policy influences the Web of the present, it also denies unstable and non-sovereign countries the possibility of preserving evidence of their digital past. To illustrate my point, consider an imaginary scenario whereby the top-level domain of a Western and wealthy state – say Germany, or the UK – is to be removed from the DNS system in two years. It is difficult to imagine that a loss of digital cultural heritage at such scale would go unnoticed. To prevent such imaginary scenarios from taking place, national libraries around the world work tirelessly to preserve their country’s national Webs. Yet for non-sovereign states, or in case of war-torn states that once existed but have since dissolved, such as the former Socialist Federal Republic of Yugoslavia, the removal of the country’s domain is not treated in terms of cultural heritage and preservation, but instead as a bureaucratic and technical issue.

Technically, the transition from .yu to the Serbian .rs and the Montenegrin .me was perfectly coordinated between ICANN, Serbia and Montenegro. In 2008, a two-year transitional phase was announced to allow webmasters ample time to transfer their old .yu websites to the new national domains. It is reported that migration rates were rather high. But what about the early days of the .yu domain – the websites that describe important historical events such as the NATO Bombing, the Kosovo War, the fall of Milosevic? What about the historical significance of the mailing lists and newsgroups that contributed for the first time to online reporting of war from the ground? The early history of the .yu domain – the domain that existed prior to the establishment of Serbia and Montenegro as sovereign states – was gone forever.


Thankfully, the Internet Archive has kept snapshots of the .yu domain throughout the years. However, a second problem hinders historians from accessing the rare documents that can no longer be found online. That second problem relates to the structural dependence of Web archives on the live Web. Despite some critical voices in the Web archiving community, most Web archiving initiatives and most researchers still assume that the live Web is the primary access point that leads to the archive. The Wayback Machine’s interface is an example of that; one has to know the URL in order to view its archived version. The archive validates the existence of URLs of the live Web, and allows for examining their history. However, if all URLs of a certain domain are removed from the live Web and leave no trace, what could lead historians, researchers, or individuals to the archived snapshots of that domain?

Taking both problems into account, I set out to reconstruct the history of the .yu domain from the Internet Archive. The challenge is guided by a larger question about the utility of Web archives for historiography. Can the Web be used as a primary source for telling its own history? What does the Web remember of its deleted past? If the live Web has no evidence of the past existence of any .yu URL, would I be able to find the former Yugoslav Web in the Internet Archive, demarcate it, and reconstruct its networked structure?

I began digging. Initially, I used various advanced search techniques to find old Websites that may contain broken links to .yu Websites. I also scraped online aggregators of scholarly articles to find old references to .yu Websites in footnotes and bibliographies. My attempts yielded about 200 URLs, certainly not enough to reconstruct the history of the entire domain from the Internet Archive.

The second option was to use offline sources – newspaper archives, printed books, and physical archives. But doing so would not rely on the Web as a primary source of narrating its history.

My digging eventually led me to old mailing lists. In one of them I found a treasure. On 17 February 2009, Nikola Smolenski, a Wikipedian and a Web developer, posted a message to Wikimedia’s Wikibots-L mailing list, asking fellow Wikipedians to help him replace all references to .yu URLs in the various pages of the Wikimedia project. The risk, wrote Smolenski, was ‘that readers of Wikimedia projects will not be able to access information that is now available to them’, and that ‘with massive link loss, a large number of references could no longer be evaluated by the readers and editors’. He used a Python script to generate a list of 46,102 URLs in the .yu domain that were linked from Wikimedia projects and that had to be replaced. A day before the removal of the domain, he also systematically queried Google for all URLs in the .yu domain per sub-domain, which yielded several thousand results. Smolenski’s lists are a last snapshot of the presence of the Yugoslav domain on the live Web. The day after he conducted the search, the .yu ccTLD was no longer part of the Internet root, resulting in the link loss he had anticipated.

Smolenski kindly agreed to send me the lists he generated in 2010. Using the URLs in the lists as seeds, my research assistant Adam Amram and I built another Python script to fetch the URLs from the Internet Archive, extract all the outlinks from each archived resource, and keep from that set of links those which belonged to the .yu domain. We repeated the method four times until no new .yu content was found. Our dataset now contains 1.92 million unique pages that were once hosted in the .yu domain between 1996 and 2010.
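The iterative procedure described above (fetch, extract outlinks, keep the .yu ones, repeat until nothing new turns up) can be sketched as follows. This is not the script we actually used: the function names are hypothetical, and the Internet Archive is replaced by a small in-memory dictionary of made-up pages so that the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_yu_domain(url: str) -> bool:
    """True if the URL's host name ends in the .yu country-code suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(".yu")


def expand(seeds, fetch_archived, max_rounds=10):
    """Follow outlinks round by round, keeping only newly found .yu URLs."""
    known = set(seeds)
    frontier = set(seeds)
    rounds = 0
    while frontier and rounds < max_rounds:
        rounds += 1
        discovered = set()
        for url in frontier:
            html = fetch_archived(url)  # returns None when nothing is archived
            if html is None:
                continue
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                target = urljoin(url, href)  # resolve relative links
                if in_yu_domain(target) and target not in known:
                    discovered.add(target)
        known |= discovered
        frontier = discovered
    return known, rounds


# Made-up stand-in for archived snapshots (hypothetical pages, not real data).
TOY_ARCHIVE = {
    "http://a.co.yu/": '<a href="http://b.co.yu/">b</a> <a href="http://example.com/">x</a>',
    "http://b.co.yu/": '<a href="/news.html">news</a>',
    "http://b.co.yu/news.html": '<a href="http://a.co.yu/">back</a>',
}
pages, rounds = expand(["http://a.co.yu/"], TOY_ARCHIVE.get)
print(sorted(pages), rounds)
```

The loop stops on its own once a round discovers no new .yu URLs, which mirrors the stopping rule described above. The max_rounds cap is only a safety net added for this sketch.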

While the full analysis of our data is beyond the scope of this blog post, I would like to present the following visualization of the rise and fall of the networked structure of the .yu domain over time. The figure below shows the evolution of the linking structure of .yu websites in the entire reconstructed space from 1996 to 2010. Websites in the .yu domain are marked in blue, websites in all other domains are marked in gray, and the visualization shows the domain’s hyperlinked structure per year.

As can be clearly seen, the internal linking structure of the domain became dense only after the end of the Milosevic regime in 2000, and it is only after the final split between Serbia and Montenegro in 2006 that the .yu domain stabilized both in terms of the number of websites and network density, followed shortly after by the dilution of the network in preparation for the replacement of the .yu domain with the new ccTLDs .rs and .me. In other words, the intra-domain linking patterns of the .yu domain are closely tied with stability and sovereignty.

As time goes by, Web archives are likely to hold more treasures of our deleted digital pasts. This makes Web archives all the more intriguing and important primary sources for historical research, despite the structural problems of the oblivious medium that they attempt to preserve.

Conference dispatches from Aarhus: Web Archives as Scholarly Sources

Some belated reflections on the excellent recent conference at Aarhus University in Denmark, on Web Archives as Scholarly Sources: Issues, Practices and Perspectives (see the abstracts on the conference site).

As well as an opportunity to speak myself, it was a great chance to catch up with what is a genuinely global community of specialists, even if (as one might expect) the European countries were particularly well represented this time. It was also particularly pleasing to see a genuine intermixing of scholars with the librarians and archivists whose task it is to provide scholars with their sources. As a result, the papers were an eclectic mix of method, tools, infrastructure and research findings; a combination not often achieved.

Although there were too many excellent papers to mention them all here, I draw out a few to illustrate this eclecticism. There were discussions of research method as applied both in close reading of small amounts of material (Gebeil, Nanni), and to very large datasets (Goel and Bailey). As well as this, we heard about emerging new tools for better harvesting of web content, and of providing access to the archived content (Huurdeman).

Particularly good to see were the first signs of work that was beginning to go beyond discussions of method (“the work I am about to do”) to posit research conclusions, even if still tentative at this stage (Musso amongst others), and critical reflection on the way in which the archived web is used (Megan Sapnar Ankerson). It was also intriguing to see an increased focus on the understanding of the nature of a national domain, particularly in Anat Ben-David’s ingenious reconstruction of the defunct .yu domain of the former Yugoslavia. Good to see too were the beginnings of a reintegration of social networks into the picture (Milligan, Weller, McCarthy), difficult to archive though they are; and some attention to the web before 1996 and the Internet Archive (Kevin Driscoll on BBS).

All in all, it was an excellent conference in all its aspects, and congratulations to Niels Brügger and the organising team for pulling it off.

Have Web Collections? Want Link and Text Analysis?

(x-posted with

The Warcbase wiki in action!

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.

One collaboration has been really fruitful. Working with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David Cheriton School of Computer Science, we’ve been working on scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!

I’d like to imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to give easy-to-follow walkthroughs that allow users to do the basic things to get started:

  • Link visualizations to explore networks, finding central hubs, communities, and so forth;
    [Figure: A dynamic visualization generated with warcbase and Gephi]

  • Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
  • Overall statistics to find over- and under-represented domains, platforms, or content types;
  • And basic n-gram-style navigation to monitor and explore change over time.

All of this is relatively easy for web archive experts to do, but still difficult for end users.

The Warcbase wiki, still under development, aims to fix that. Please visit, comment, fork, and we hope to develop it alongside all of you.
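As a rough illustration of the "overall statistics" and link items in the list above, here is a minimal Python sketch. It is not Warcbase code and does not use the Warcbase API. It assumes some other tool has already pulled page URLs and (source, target) link pairs out of the ARC/WARC files; the function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how many captured pages belong to each domain."""
    return Counter(domain_of(u) for u in urls)


def domain_links(links):
    """Collapse page-to-page links into weighted domain-to-domain edges."""
    edges = Counter()
    for source, target in links:
        src, dst = domain_of(source), domain_of(target)
        # Ignore links that stay inside one domain or lack a host.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges


# Hypothetical sample data standing in for what a WARC tool would extract.
pages = [
    "http://www.example.ca/index.html",
    "http://www.example.ca/about.html",
    "http://archive.example.org/page",
]
links = [
    ("http://www.example.ca/index.html", "http://archive.example.org/page"),
    ("http://www.example.ca/about.html", "http://archive.example.org/page"),
    ("http://www.example.ca/index.html", "http://www.example.ca/about.html"),
]

print(domain_counts(pages).most_common())
print(domain_links(links))
```

The weighted edges could then be written out as a CSV edge list, which is one of the formats a tool like Gephi can import.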

The UK Web Archive, born-digital sources, and rethinking the future of research

[A guest post by Professor Tim Hitchcock of the University of Sussex. It is derived from a short talk given at a doctoral training event at the British Library in May 2015, focused on using the UK Web Archive. It was written with PhD students in mind, but really forms a meditation on the opportunities created when working with the archived web rather than print. While lightly edited, the text retains the tics and repetitions of public presentation. We’re very grateful to Tim for permission to repost this, which first appeared on Historyonics. Tim is to be found on Twitter @TimHitchcock.]

I normally work on properly dead people of the sort that do not really appear in the UK Web Archive – most of them eighteenth-century beggars and criminals. And in many respects the object of study for people like me – interlocutors of the long dead – has not changed that much in the last twenty years. For most of us, the ‘object of study’ remains text. Of course the ‘digital’ and the online have changed the nature of that text. How we find things – the conundrums of search – that in turn shape the questions we ask – has been transformed by digitisation. And a series of new conundrums have been added to all the old ones – do, for instance, ‘big data’ and new forms of visualisation imply a new ‘open eyed’ interrogation of data? Are we being subtly encouraged to abandon older social science ‘models’ for something new? And if we are, should these new approaches take the form of ‘scientific’ interrogation, looking for ‘natural’ patterns – following the lead of the Culturomics movement; or perhaps take the form of a re-engagement with the longue durée – in answer to the pleas of the History Manifesto? Or perhaps we should be seeking a return to ‘close reading’ combined with a radical contextualisation – looking at the individual word, person and thing – in its wider context, preserving focus across the spectrum.

And of course, the online and the digital also raise issues about history writing as a genre and form of publication. Open access, linked data, open data, the ‘crisis’ of the monograph, and the opportunities of multi-modal forms of publication, all challenge us to think again about the kind of writing we do, as a literary form. Why not do your PhD as a graphic novel? Why not insist on publishing the research data with your literary overlay? Why not do something different? Why not self-publish?

These are conundrums all – but conundrums largely of the ‘textual humanities’.  Ironically, all these conundrums have not had much effect on the academy and the kind of scholarship the academy values.  The world of academic writing is largely, and boringly, the same as it was thirty years ago.  How we do it has changed, but what it looks like feels very familiar.

But the born digital is different. Arguably, the sorts of things I do, history writing focused on the properly dead, look ‘conservative’ because they necessarily engage with the categories of knowing that dominated the nineteenth and twentieth centuries – these were centuries of text, organised into libraries of books, and commented on by cadres of increasingly professional historians. The born digital – and most importantly the UK Web Archive – is just different. It sings to a different tune, and demands different questions – and if anywhere is going to change practice, it should be here.

Somewhat to my frustration, I don’t work on the web as an ‘object of study’ –  and therefore feel uncertain about what it can answer and how its form is shaping the conversation; but I did want to suggest that the web itself and more particularly the UK Web Archive provides an opportunity to re-think what is possible, and to rethink what it is we are asking; how we might ask it, and to what purpose.

And I suppose the way I want to frame this is to suggest that the web itself brings on to a single screen, a series of forms of data that can be subject to lots of different forms of analysis.  A few years ago, when APIs were first being advocated as a component of web design, the comment that really struck me, was that the web itself is a form of API, and that by extension the Web Archive is subject to the same kind of ‘re-imagination’ and re-purposing that an API allows for a single site or source.

As a result, you can – if you want – treat a web page as simple text – and apply all the tools of distant reading of text – that wonderful sense that millions of words can be consumed in a single gulp. You can apply ‘topic modelling’, and Latent Semantic Analysis; or Word Frequency/Inverse Document Frequency measures. Or, even more simply, you can count words, and look for outliers – stare hard at the word on the web!
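To make the simplest of these measures concrete, here is a minimal Python sketch of word counting and of the frequency weighting usually abbreviated TF-IDF. It assumes the page text has already been extracted from the archive; the three sample ‘pages’ and the function names are invented for illustration, and the weighting shown (raw term frequency times the log of inverse document frequency) is only one of several common variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word / total words in the document
    idf = log(number of documents / documents containing the word)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical page texts standing in for extracted web archive content.
docs = [
    "the court heard the case",
    "the web archive holds the page",
    "the archive of the web",
]

print(Counter(tokenize(docs[0])).most_common(3))  # plain word counts
scores = tf_idf(docs)
print(sorted(scores[0], key=scores[0].get, reverse=True))  # outliers first
```

A word that appears on every page scores zero, so the measure pushes the distinctive vocabulary of each page to the top, which is one way of ‘looking for outliers’.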

But you can also go well beyond this.  In performance art, in geography and archaeology, in music and linguistics, new forms of reading are emerging with each passing year that seem to me to significantly challenge our sense of the ‘object of study’ – both traditional text and web page.  In part, this is simply a reflection of the fact that all our senses and measures are suddenly open to new forms of analysis and representation. When everything is digital – when all forms of stuff come to us down a single pipeline –  everything can be read in a new way.

Consider for a moment the ‘LIVE’ project from the Royal Veterinary College in London, and their ‘haptic simulator’. In this instance they have developed a full scale ‘haptic’ representation of a cow in labour, facing a difficult birth, which allows students to physically engage and experience the process of manipulating a calf in situ. I haven’t had a chance to try this, but I am told that it is a mind-altering experience. It suggests that reading can be different; and should include the haptic – the feel and heft of a thing in your hand. This is being coded for millions of objects through 3D scanning; but we do not yet have an effective way of incorporating that 3D text into how we read the past.

The same could be said of the aural – that weird world of sound on which we continually impose the order of language, music and meaning; but which is in fact a stream of sensations filtered through place and culture. Projects like the Virtual St Paul’s Cross, which allows you to ‘hear’ John Donne’s sermons from the 1620s from different vantage points around the yard, change how we imagine them, move from ‘text’ to something much more complex and powerful, and begin to navigate that normally unbridgeable space between text and the material world. And if you think about this in relation to music and speech online – you end up with something different on a massive scale.

One of my current projects is to create a soundscape of the courtroom at the Old Bailey – to re-create the aural experience of the defendant – what it felt like to speak to power, and what it felt like to have power spoken at you from the bench. And in turn, to use that knowledge to assess who was more effective in their dealings with the court, and whether having a bit of shirt to you, for instance, affected your experience of transportation or imprisonment. And the point of the project is to simply add a few more variables to the ones we can securely derive from text.

It is an attempt to add just a couple more columns to a spreadsheet of almost infinite categories of knowing. And you could keep going – weather, sunlight, temperature, the presence of the smells and reeks of other bodies. Ever more layers to the sense of place. In part, this is what the gaming industries have been doing from the beginning, but it also becomes possible to turn that creativity on its head, and make it serve a different purpose.

In the work of people such as Ian Gregory, we can see the beginnings of new ways of reading both the landscape, and the textual leavings of the dead. Bob Shoemaker, Matthew Davies and I (with a lot of other people) tried to do something similar with Old Bailey material, and the geography of London in the Locating London’s Past project.

This map is simply colours blue, red and yellow mapped against brown and green. I have absolutely no idea what this mapping actually means, but it did force me to think differently about the feel and experience of the city. And I want to be able to do the same for all the text captured in the UK domain name.

All of which is to state the obvious. There are lots of new readings that change how we connect with historical evidence – whether that is text, or something more interesting. In creating new digital forms of inherited culture – the stuff of the dead – we naturally innovate, and naturally enough, discover ever changing readings. But the Web Archive challenges us to do a lot more; and to begin to unpick what you might start pulling together from this near infinite archive.

In other words, the tools of text are there, and arguably moving in the right direction, but there are several more dimensions we can exploit when the object of study is itself an encoding.

Each web page, for instance, embodies a dozen different forms. Text is obvious, but it is important to remember that each component of the text – each word and letter, on a web page – is itself a complex composite. What happens when you divide text by font or font size; weight, colour, kerning, formatting etc.? By location – in the header, or the body, or wherever the CSS sends it; or more subtly by where it appears to a user’s eye – in the middle of a line – or at the end? Suddenly, to all the forms of analysis we have associated with ‘distant reading’ there are five or six further columns in the spreadsheet – five or six new variables to investigate in that ‘big data’ eye-opened sort of way.
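One of those extra columns, location in the markup, is easy to sketch. The Python below uses only the standard library’s HTML parser to file each word under the tag that immediately encloses it. It is an illustration under stated assumptions rather than a finished method: the class name and the sample page are invented, and it looks only at the HTML structure, not at where CSS finally places the text on screen.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose contents are code rather than readable text.
SKIPPED = {"script", "style"}


class TextByTag(HTMLParser):
    """Group the words of a page by their immediately enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags
        self.words = defaultdict(list)  # tag name -> words found inside it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack or self.stack[-1] in SKIPPED:
            return
        self.words[self.stack[-1]].extend(data.split())


# Hypothetical sample page.
html = ("<html><head><title>Old Bailey Online</title>"
        "<style>p {color: red}</style></head>"
        "<body><h1>Trials</h1>"
        "<p>Read the <b>full</b> text<br>here.</p></body></html>")

parser = TextByTag()
parser.feed(html)
for tag, words in parser.words.items():
    print(tag, words)
```

Each tag then becomes a column in its own right: the same word counts can be run separately on titles, headings, body paragraphs or emphasised text.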

And that is just the text. The page itself is both a single image, and a collection of them – each with their own properties. And one of the great things that is coming out of image research is that we can begin to automate the process of analysing those screens as ‘images’. Colour, layout, face recognition etc. Each page is suddenly ten images in one – all available as a new variable; a new column in the spreadsheet of analysis. And, of course, the same could be said of embedded audio and video.

And all of that is before we even look under the bonnet. The code, the links, the metadata for each page – in part we can think of these as just another iteration of the text; but more imaginatively, we can think about it as more variables in the mix.

But, of course, that in itself misunderstands the web and the Web Archive. The commonplace metaphor I have been using up till now is of a ‘page’ – and is the intellectual equivalent of skeuomorphism – relying on material world metaphors to understand the online.

But these aren’t pages at all, they are collections of code and data that generate an experience in real time. They do not exist until they are used – if a website in the forest is never accessed, it does not exist. The web archive therefore is not an archive of ‘objects’ in the traditional sense, but a snapshot from a moving film of possibilities. At its most abstract, what the UK Web Archive has done is spirit into being the very object it seeks to capture – and of course, we all know that in doing so, the capturing itself changes the object. Schrödinger’s cat may be alive or dead, but its box is definitely open, and we have visited our observations upon its content.

So to add to all the layers of stuff that can fill your spreadsheet, there also need to be columns for time and use; re-use and republication. And all this is before we seek to change the metaphor and talk about networks of connections, instead of pages on a website.

Where I end up is seriously jealous of the possibilities; and seriously wondering what the ‘object of study’ might be. In the nature of an archive, the UK Web Archive imagines itself as an ‘object of study’; created in the service of an imaginary scholar. The question it raises is how do we turn something we really can’t understand, cannot really capture as an object of study, to serious purpose? How do we think at one and the same time of the web as alive and dead, as code, text, and image – all in dynamic conversation one with the other. And even if we can hold all that at once, what is it we are asking?

IIPC 2015 Recap

I had a fantastic time at the International Internet Preservation Consortium’s Annual General Meeting this year, held on the beautiful campus of Stanford University (with a day trip down to the Internet Archive in San Francisco). It’s hard to write these sorts of recaps: I had such an amazing time, my head filled with great ideas, that it’s difficult to do everything the justice it deserves. Many of the presentation slide decks are available on the schedule, and videos will be forthcoming.

My main takeaways: we’re continuing to see the development of sophisticated access tools to these repositories, coupled with increasingly exciting and sophisticated researcher use of them. There’s a recognition that context matters when understanding archived webpages, a phrase that came up a few times throughout the event. Crucially, there was a lot of energy in the room: there’s a real enthusiasm towards making these as accessible as possible and facilitating their use. I wasn’t exaggerating when I noted to one of the organizers that I wish every conference was like this: leaving me on my flight home with lots of fantastic ideas, hope for the future, and excitement about what can be done. As the recent “Conference Manifesto” in the New York Times noted, that’s not the experience at all conferences!

Read on for a short day-by-day breakdown, with apologies for presentations I couldn’t include or didn’t do full justice to.