All posts by Ian Milligan

About Ian Milligan

Ian Milligan is Associate Vice-President, Research Oversight and Analysis and professor of history at the University of Waterloo.

Archive-It Research Services: Exciting New Developments

Named Entity Recognition results on a corpus of tens of thousands of web archived pages: possible now with Archive-It’s WANE File

Historians who work with, or who are thinking about working with, web archives will be excited about the announcement that Archive-It Research Services made on March 17th. They’re greatly expanding the sort of data that they provide to researchers. As they put it in their announcement:

The service will allow any Archive-It partner to give users, researchers, scholars, developers, and other patrons easily-analyzed datasets that contain key metadata elements, link graphs, named entities, and other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS will facilitate new types of use, research, and analysis of the significant historical records from the web that Archive-It partners are working to collect, preserve, and make accessible.

They’re making three types of datasets available: WAT, LGA, and WANE files. The first, WAT files, contain metadata about websites.

From WATs, you can get metadata descriptions for websites, the links that they point towards, the anchor text of those links, and crawl information. Ian Milligan, one of the co-authors of this blog, has been using WAT files to analyze Canadian political history websites: see some results here and here (including a guided video tour of some results).
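To make the link and anchor-text data concrete, here is a minimal Python sketch of pulling links out of a single WAT-style JSON record. The sample record is hand-written for illustration, not taken from a real crawl, and the nesting it uses (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) is an assumption about the layout that you should check against the files you actually receive. The function names are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, hand-written record in an assumed WAT-style layout.
SAMPLE_RECORD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/platform",
                         "text": "Party platform"},
                        {"path": "A@/href", "url": "http://example.com/news",
                         "text": "News"},
                        {"path": "IMG@/src", "url": "http://example.net/logo.png"},
                        {"path": "A@/href"},  # no URL recorded: skipped below
                    ]
                }
            }
        },
    }
})


def extract_links(record_json):
    """Return (source_url, target_url, anchor_text) tuples from one record."""
    envelope = json.loads(record_json).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html_meta.get("Links", []):
        url = link.get("url")
        if url:  # ignore entries that carry no target URL
            links.append((source, url, link.get("text", "")))
    return links


def count_target_hosts(links):
    """Count how often each host is linked to (relative URLs yield an empty host)."""
    return Counter(urlparse(target).netloc for _, target, _ in links)


if __name__ == "__main__":
    found = extract_links(SAMPLE_RECORD)
    for source, target, text in found:
        print(source, "->", target, repr(text))
    print(count_target_hosts(found))
```

Run across a whole collection, tuples like these are the raw material for a link graph: which sites point to which, and with what anchor text.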

LGA and WANE files are unfamiliar to the two authors of this blog, although they look to be very useful! LGA files speed up longitudinal link analysis from WAT files. The examples they give are actually from the Canadian Political Interest Groups collection! Finally, WANE files use Stanford NER to extract information relating to people, organizations, and locations. Milligan has done similar work with text derived from a web archive, plotting all the locations mentioned in GeoCities – you can see the results here.
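As a rough illustration of what working with named-entity output could look like, the Python sketch below tallies locations across a few records. It assumes one JSON object per line with a `named_entities` field holding `persons`, `organizations`, and `locations` lists. That layout, the sample lines, and the function name are all assumptions made for illustration, so confirm the real structure against the WANE files you receive.

```python
import json
from collections import Counter

# Hypothetical sample lines in an assumed one-JSON-object-per-line layout.
SAMPLE_WANE_LINES = [
    '{"url": "http://example.org/a", "timestamp": "20150317000000", '
    '"named_entities": {"persons": ["Jane Doe"], '
    '"organizations": ["Example Party"], "locations": ["Ottawa", "Toronto"]}}',
    '{"url": "http://example.org/b", "timestamp": "20150317000100", '
    '"named_entities": {"persons": [], "organizations": [], '
    '"locations": ["Ottawa"]}}',
    "",  # blank lines are skipped
]


def count_entities(lines, kind="locations"):
    """Tally entity mentions of one kind across all records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts.update(record.get("named_entities", {}).get(kind, []))
    return counts


if __name__ == "__main__":
    for place, n in count_entities(SAMPLE_WANE_LINES).most_common():
        print(place, n)
```

Counts like these could then be geocoded and plotted, much like the GeoCities location map mentioned above.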

To get these files, consult the service details page. If you’re an Archive-It partner, you can order them internally through your dashboard. The rest of us researchers just need to send an e-mail with some information to start the process.

In short: an amazing move that’s really going to open up these collections. WARC files are really big (too big for most systems), whereas the LGA, WAT, and WANE model is going to make web archive research accessible. Kudos to Archive-It.

Fascinating Interplay About Discovering Content in Web Archives

Web archives have arrived, at least in the pages of high-profile publications such as the Washington Post and the New Yorker.

An especially fascinating exchange took place in mid-February. Gareth Millward, a research fellow in the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine, published “I tried to use the Internet to do historical research. It was nearly impossible” with the Washington Post. In it, he explained the difficulties of navigating extremely large web archives: search queries returned useless results that were not sorted in an ideal fashion (or at all). He suggested that historians may instead need to find smaller, circumscribed corpuses or explore metadata.

The response by Andy Jackson, Web Archiving Technical Lead at the British Library, on the British Library’s Web Archive blog was equally illuminating. His piece, “Building a ‘Historical Search Engine’ is No Easy Thing,” is a must-read. He pointed out the different use cases that historians have: simply replicating Google (which excels at letting us know what we need to know in an extremely contemporary context) won’t make sense when querying large bodies of web archived material. He walks us through the various steps of the search engine, and concludes by arguing that we need to think of Macroscopes rather than of search engines (sidenote: having just finished copyedits on a co-authored book subtitled The Historian’s Macroscope, I’m inclined to agree with this metaphor!).

These two pieces join a third high-profile piece, “The Cobweb: Can the Internet be Archived?” by Harvard historian Jill Lepore. This was a fascinating exploration of the current state and recent history of web archiving, and is well worth your time.

Milligan Presentation: “The Promise of WebARChive Files”

This paper was given at the American Historical Association’s annual meeting in New York City on January 5th, 2015. It was part of the Text Analysis, Visualization, and Historical Interpretation panel. My thanks to my co-presenters and especially to Micki Kaufman, who organized the panel.

The text that follows may not be exactly what I said, but is based on my speaking notes with a bit of memory filling in here and there.

Hello everybody, I’d like to begin with a somewhat provocative opening:

I believe that historians are unprepared to engage with the quantity of digital sources that will fundamentally transform their trade. Web archives are going to transform the work we do for a few main reasons:

Welcome to our Blog!

By Peter Webster and Ian Milligan

The first stage of our project, Web Archives for Historians, has concluded. In just under a year, we’ve amassed a healthy bibliography of about twenty works written by historians, covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources.

We’ve probably reached the ceiling on this front, however! There aren’t that many historians who are actively working in this area (yet, we dare say). And so we now want Web Archives for Historians to transition into an active blog that will:

  • aggregate content by historians or for historians on web archives (similar to the Web Archiving Roundtable) – some of this will come from our own blogs (Peter and Ian), but also from a list of blogs that we’ll be following;
  • draw attention to talks and slides that we spot at scholarly conferences or in publication venues; and
  • carry commissioned posts (eventually).

Our mandate will be to include:

  1. examples of work done using web archives;
  2. historical method in the web archive;
  3. news of significant new projects, tools, data or web services;
  4. contemporary history using the live web (as core source material, rather than just incidentally).

We hope that you will join us by following along via RSS, on Twitter, or just by popping by now and again.