(x-posted with ianmilligan.ca)

The Web Archives for Historical Research Group has been busy: working on getting the Shine front end running on Archive-It collections (a soft launch is underway here if you want to play with old Canadian websites), setting up Warcbase on our collections, and digging manually through the GeoCities torrent for close readings of various neighbourhoods.
One collaboration has been really fruitful. Together with Jimmy Lin, a computer scientist who has just joined the University of Waterloo’s David Cheriton School of Computer Science, we’ve been developing scripts, workflows, and implementations of his warcbase platform. Visit the warcbase wiki here. Interdisciplinary collaboration is amazing!
I’d like to imagine that humanists or social scientists who want to use web archives are often in the same position I was in four years ago: confronted with opaque ARC and WARC files, downloading them onto their computers, and not really knowing what to do with them (apart from maybe unzipping them and exploring them manually). Our goal is to change that: to provide easy-to-follow walkthroughs that allow users to do the basic things to get started:
(Figure: a dynamic visualization generated with warcbase and Gephi)
- Link visualizations to explore networks, finding central hubs, communities, and so forth;
- Textual analysis to extract specific text, facilitating subsequent topic modelling, entity extraction, keyword search, and close reading;
- Overall statistics to find over- and under-represented domains, platforms, or content types;
- And basic n-gram-style navigation to monitor and explore change over time.
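To make the first and third items a little more concrete, here is a minimal Python sketch of the general idea, not warcbase's own code or API. It assumes you have already extracted (source URL, target URL) link pairs from your ARC or WARC files by some other means; the sample data, function names, and the `Source,Target,Weight` column layout (a common edge-list shape for spreadsheet import into Gephi) are all illustrative assumptions.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def aggregate_links(link_pairs):
    """Count how often each source domain links to each other domain."""
    edges = Counter()
    for source_url, target_url in link_pairs:
        source, target = domain_of(source_url), domain_of(target_url)
        # Skip unparseable URLs and links that stay within one domain.
        if source and target and source != target:
            edges[(source, target)] += 1
    return edges


def to_edge_csv(edges):
    """Render the domain-to-domain counts as a weighted edge-list CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return out.getvalue()


# Hypothetical link pairs, standing in for links extracted from a collection.
SAMPLE_LINKS = [
    ("http://www.example.ca/index.html", "http://example.org/about"),
    ("http://www.example.ca/news.html", "http://example.org/"),
    ("http://example.org/about", "http://example.net/"),
    ("http://www.example.ca/a.html", "http://www.example.ca/b.html"),
]

if __name__ == "__main__":
    print(to_edge_csv(aggregate_links(SAMPLE_LINKS)), end="")
```

Summing the weights per target domain in the same table would give a rough count of the most linked-to domains, which is one simple way into the over- and under-representation question.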
All of this is relatively easy for web archive experts to do, but still difficult for end users.
The Warcbase wiki, still under development, aims to fix that. Please visit, comment, and fork; we hope to develop it alongside all of you.