Have Web Collections? Want Link and Text Analysis?

(x-posted with ianmilligan.ca)

The Warcbase wiki in action!

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.

One collaboration has been really fruitful. Together with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David Cheriton School of Computer Science, we’ve been developing scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!

I’d like to imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to provide easy-to-follow walkthroughs that let users do the basic things they need to get started:

  • Link visualizations to explore networks, finding central hubs, communities, and so forth;
    (Figure: A dynamic visualization generated with warcbase and Gephi)
  • Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
  • Overall statistics to find over- and under-represented domains, platforms, or content types;
  • And basic n-gram-style navigation to monitor and explore change over time.
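
To make the first and third items a bit more concrete, here is a minimal sketch in plain Python. It is not warcbase code and does not use the warcbase API. It assumes you have already extracted (source URL, target URL) pairs from your collection by some other means; the sample links and the helper names `domain_of` and `domain_link_counts` are hypothetical, for illustration only. It rolls individual links up into domain-to-domain counts and prints them as a simple weighted edge list, the kind of table that network tools such as Gephi can typically import.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_link_counts(links):
    """Count links between distinct domains from (source, target) URL pairs."""
    counts = Counter()
    for source, target in links:
        src, dst = domain_of(source), domain_of(target)
        # Skip unparseable URLs and links that stay within one domain.
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


# Hypothetical sample data standing in for links extracted from a web archive.
links = [
    ("http://www.example.org/a.html", "http://example.com/"),
    ("http://example.org/b.html", "http://www.example.com/page"),
    ("http://example.org/b.html", "http://example.org/c.html"),
    ("http://example.net/", "http://example.org/"),
]

edges = domain_link_counts(links)

# Print a weighted edge list as CSV.
print("Source,Target,Weight")
for (src, dst), weight in sorted(edges.items()):
    print(f"{src},{dst},{weight}")
```

Summing the weights per source or target domain from the same counts gives a rough first look at which domains are over- or under-represented.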

All of this is relatively easy for web archive experts to do, but still difficult for end users.

The Warcbase wiki, still under development, aims to fix that. Please visit, comment, and fork; we hope to develop it alongside all of you.

About Ian Milligan

Ian Milligan is Associate Vice-President, Research Oversight and Analysis and professor of history at the University of Waterloo.
