Interdisciplinary Event Documentation: “The Great WARC Adventure”

Nick Ruest, Anna St-Onge, and I wrote a piece in the open-access journal Digital Studies / Le champ numérique, “The Great WARC Adventure: Using SIPS, AIPS, and DIPS to Document SLAPPS.” The deliberately acronym-heavy title fronts a piece that does the following:

  • takes readers through the process of creating a web archive using open-source tools;
  • shows how to preserve and provide access to the web archive;
  • and enables some basic analysis of the collection from the perspective of a historian.

While the long publishing time meant that some of our more recent approaches to analyzing web archives – warcbase, for example – didn’t make it in, the article hopefully provides a useful conceptual approach to working with web archives.

You can find the article here, and the abstract below. We hope that you enjoy it!

From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough

We thought that this post from December 2015 was still relevant today. In short, it shows how you can take web archive network files generated by our research team and analyze them yourselves using the open-source Gephi package.

Even more excitingly, there are many more Gephi files available today for your analysis. To find them, visit our network data page here: https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/12040. It grows on a regular basis!

Ian Milligan

Do you want to make this link graph yourself from our data? Read on.

As part of our commitment to open data – in keeping with the spirit and principles of our funding agency, as well as our core scholarly principles – our research team is beginning to release derivative data. The first dataset that we are releasing is the Canadian Political Parties and Interest Groups link graph data, available in our Scholars Portal Dataverse space.

The first file is all-links-cpp-link.graphml, a file consisting of all of the links between websites found in our collection. It was generated using warcbase’s function that extracts links and writes them to a graph network file, documented here. The exact script used can be found here.

However, releasing data is only useful if we show people how they can use it. So! Here you go.
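For readers who would rather start from a script than from Gephi, here is a minimal sketch of reading a GraphML link graph with Python's standard library and counting inbound links per node. It assumes only the standard GraphML namespace and standard `edge` elements with `source` and `target` attributes; the function name is illustrative, and the sketch makes no assumption about whether the node ids in the released file are domain names or opaque identifiers.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace; elements in a GraphML file live under it.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def count_inbound_links(root):
    """Return a Counter mapping each edge target id to its number of inbound edges.

    `root` is the parsed <graphml> element. Nodes that nothing links to
    do not appear in the result (a Counter reports 0 for them).
    """
    counts = Counter()
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        counts[edge.get("target")] += 1
    return counts


# Usage with a downloaded copy of the file (path is an assumption):
#   root = ET.parse("all-links-cpp-link.graphml").getroot()
#   for node_id, inbound in count_inbound_links(root).most_common(10):
#       print(inbound, node_id)
```

This only ranks nodes by how often they are linked to; layout, filtering and visualisation are where Gephi comes in.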

Video Walkthrough

This video…


Web traffic analytics as a historical source

[This is a guest post from Marcin Wilkowski, first published at wilkowski.org. Marcin Wilkowski is a member of the Digital Humanities Laboratory at the University of Warsaw.]

I have recently got into researching the digital remains of a free Polish hosting service from the late 1990s – free.polbox.pl. Among the copies available via the Wayback Machine, I found some pages containing historical web traffic data. How can this data be used as a historical source?

First of all, following Niels Brügger,

when studying the web – today or in a historical perspective – we should focus our study on five different web strata: the web element, the web page, the website, the web sphere and the web as such.

The traffic analytics data of Polbox can thus be interpreted not only as an historical source for the websites hosted on that service, but also as a base for describing the technical and social environment of the websites seen beyond the interface of a browser (a web sphere). But – just as when working with any other historical sources – some evaluation should follow.

Browser summary

For example, the browser summary published on that archived page is not just information on the most commonly used browsers in the free.polbox.pl domain. In fact, that summary collects information on User-Agent headers from the HTTP protocol, so among some historical browser names we can find information on the other tools used to access web pages in the late 1990s.

The graphical Netscape browser, popular at the time, is in first place on the Polbox list. But look at the third and fourth entries: Teleport Pro and Wget are both tools for copying web content so that users could read it offline. This might seem puzzling if one does not know how high the costs of internet connections were in Poland in the late 1990s. In order to avoid large bills from the monopolistic internet provider, users chose to read Web content offline rather than online. One could scrape a website at work or at the university library, then copy it to a disk to read later on a home computer. In 1998, Polbox was giving its users only 2 MB of hosting space to publish a site – while at that time, websites could easily be downloaded to 3½-inch (1.44 MB) floppy disks.

Analysed requests from January 26, 1998 to February 2, 1998 (7 days):

2382307 Netscape (compatible)
1232973 Netscape
19358 Teleport Pro
7549 Wget
7384 Java1.0.2
5720 contype
4233 GAIS Robot
2998 Lynx
2845 IBrowse
2796 Microsoft Internet Explorer
2203 Infoseek
2029 ArchitextSpider
1915 Scooter
1539 MSProxy
1298 AmigaVoyager
1228 Amiga-AWeb
1185 Web21 CustomCrawl
679 RealPlayer 5.0
655 Mosaic
590 ICValidator

World of the Web without centralisation

Some other remarks about browsers on the list: IBrowse, AWeb, and AmigaVoyager are browsers for Amiga OS. Mosaic is an early graphical web browser, first released in 1993 and already obsolete five years later, its final release having come in January 1997. Lynx is a text-based web browser for Unix that is still in use today.

And not only browsers and website scraping tools can be found on that historical user-agent list. Because in early 1998 there was no Google at all (it was founded in September that year), the list can be a useful document showing the environment of web searching tools before the times when one engine became dominant; a world of the Web without centralisation on its huge contemporary scale. On the list one can find Infoseek, WebCrawler and Excite, Scooter (for AltaVista) or Web21 CustomCrawl. However, bear in mind that the list shows only the first 20 user-agents, sorted by number of requests.

Traffic

Information about user-agents can help to examine the methods and purposes behind the usage of selected historical websites. But on the Web Server Statistics for Free Polbox WWW Server from 1998 we can also find other interesting historical data on transfer volumes. Sources that let us compare changes in transfer volume over a given period allow us to illustrate the evolution of a historical website. What is more, by comparing weekend transfer with weekday transfer we could try to interpret how access to the Web was shaped by the place from which it was accessed (work vs home, just like in this study from 2012).
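The weekday and weekend comparison could be sketched as follows. This is only an illustration of the arithmetic: the function name and the input format (a mapping from calendar date to bytes transferred) are assumptions, and no figures from the Polbox statistics are used here.

```python
from datetime import date
from statistics import mean


def weekday_weekend_means(daily_bytes):
    """Return (mean weekday transfer, mean weekend transfer).

    `daily_bytes` maps datetime.date -> bytes transferred that day.
    Monday to Friday count as weekdays; Saturday and Sunday as weekend.
    Raises statistics.StatisticsError if either group is empty.
    """
    weekday = [v for d, v in daily_bytes.items() if d.weekday() < 5]
    weekend = [v for d, v in daily_bytes.items() if d.weekday() >= 5]
    return mean(weekday), mean(weekend)


# Usage (values would come from the archived statistics page):
#   weekday_avg, weekend_avg = weekday_weekend_means({date(1998, 1, 26): ..., ...})
```

A single week gives only five weekday and two weekend observations, so any difference found this way should be read as a hint rather than as evidence.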

And of course, historical data of web traffic can demonstrate the Web getting larger and websites becoming increasingly complex.

Thus, historical Web traffic analytics can be useful to historians of the Web, but they should be used carefully. First of all, the data is hardly complete, so the interpretations must be conservative (if any). Secondly, we cannot be certain that the data has been aggregated and then captured correctly – it’s Brügger’s document of Web fate. And, thirdly, some terms used originally in the historical analytics could be misleading – like users or browsers.

Inspirations: Niels Brügger, ‘Web History and the Web as a Historical Source’, in: Zeithistorische Forschungen/Studies in Contemporary History, Online-Ausgabe, 9 (2012), H. 2, URL: http://www.zeithistorische-forschungen.de/2-2012/id=4426, Druckausgabe: S. 316-325.

Internet Histories — a new journal

By Niels Brügger

Let us assume that the internet is here to stay. And that it becomes still more pivotal to have solid scholarly knowledge about the development of the internet of the past with a view to understanding the internet of the present and of the future; on the one hand, past events constitute important preconditions for today’s internet, and, on the other, the mechanisms behind the developments in the past may prove very helpful for understanding what is about to happen with the internet today.

Based on this rationale, the new scholarly journal Internet Histories: Digital Technology, Culture and Society (Taylor & Francis/Routledge) has just been founded.

For more than four decades the internet has continued to grow and spread to an extent where today it is an indispensable element in the communicative infrastructure of many countries. Although the history of the internet has not been very prominent within the academic literature, an increasing number of books and journal articles within the last decade attest to the fact that internet historiography is an emerging field of study within internet studies, as well as within studies of culture, media, communication, and technology.

However, in the main the historical studies of the internet have been published in journals related to a variety of disciplines, and these journals only rarely publish articles with a clear historical focus. Therefore, the editors of Internet Histories found that there was a need for a journal where the history of the internet and digital cultures is the main focus, a journal where historical studies are presented, and theoretical and methodological issues are debated with a view to constituting the history of the internet as a field of study in its own right.

Internet Histories embraces empirical as well as theoretical and methodological studies within the field of the history of the internet broadly conceived — from early computer networks, Usenet and Bulletin Board Systems, to everyday Internet with the web through the emergence of new forms of internet with mobile phones and tablet computers, social media, and the internet of things. The journal will also provide the premier outlet for cutting-edge research in the closely related area of histories of digital cultures.

The title of the journal, Internet Histories, suggests there is not one single and fixed Internet history going straight from Arpanet to the Internet as we know it today, from the United States to a world-wide network. Rather, there are multiple local, regional and national paths and a variety of ways that the internet has been imagined, designed, used, shaped, and regulated around the world. Internet Histories aims to publish a range of scholarship that examines the global and internetworked nature of the digital world as well as situated histories that account for diverse local contexts.

Managing Editor
Niels Brügger, Aarhus University, Denmark

Editors
Megan Ankerson, University of Michigan, USA
Gerard Goggin, University of Sydney, Australia
Valérie Schafer, National Center for Scientific Research, France

Reviews Editor
Ian Milligan, University of Waterloo, Canada

Editorial Assistant
Asger Harlung, Aarhus University, Denmark

More information about the journal can be found at the journal website: http://www.tandfonline.com/loi/rint20

A Tale of Deleted Cities: GeoCities Event at Computer History Museum

I recently had the opportunity to attend – via Beam Telepresence robot – a talk by Richard Vijgen, creator of the 2011 “Deleted City” art exhibit, and GeoCities founder David Bohnett. The “Deleted City” was hosted in the lobby of the Computer History Museum in Mountain View, California, and the talks marked the end of the exhibit. I won’t give a full recap of the talks, as I always find those difficult to both write and read, but will give a few impressions alongside the video!

They were both fascinating talks, available via YouTube above. Richard’s talk was fascinating in that it explored what Big Data means for historians – and recounted his experience of working with the Archive Team torrent. To me, the talk really underscored the importance of doing web history: the web really is the record of our lives today, and we need to hope that there are people there to back up this sort of information!

It was followed by David Bohnett, who explained the idea behind GeoCities, some of the technical challenges he faced, and really what it was like to preside over such explosive growth during the dot com era. As somebody who’s explored ideas of GeoCities as a community before, I was interested to hear so much emphasis placed in his talk upon the neighbourhood structure, volunteer community leaders, and what this all meant for bringing people together. As a writer on this topic, it was pretty interesting and reassuring to hear that my own ideas weren’t off kilter!

I was also surprised, although perhaps I shouldn’t have been, by his attitude towards the closure of GeoCities in 2009 by Yahoo! (which bought it in 1999) – that it was “better shut down than to go on as this abandoned version of its former self.” Fair enough, I suppose, but again – to echo Richard’s opening talk – thank god that Archive Team and the Internet Archive were there to preserve this information…

Anyways, check the video out for yourself if you’re interested.

Web archive conferences in 2017

2017 offers not one but two international conferences for scholars interested in the way we use the archived web.

There are calls for papers open now for both.

Curation and research use of the past Web
(The Web Archiving Conference of the International Internet Preservation Consortium)
Lisbon, 29-30 March 2017
Call for Papers now open, closing date 20 October 2016.

Researchers, practitioners and the archived Web
(2nd conference of RESAW, the Europe-wide Research Infrastructure for the Study of Archived Web Materials)
London, 14-15 June 2017
Call for Papers now open, closing date 9 December 2016.

CFP: SAGE Handbook of Web History

We wanted to boost the signal here – a great opportunity for historians who have thoughts on Internet or Web histories! If you have any questions, please let Ian know.

Ian Milligan

Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be examined as well. Within the last decade, considerable interest in the history of the Web has emerged. However, there is…


Born-digital data and methods for history: new research network

Both Ian and Peter are delighted to be part of a new research network in the UK, funded by the Arts and Humanities Research Council for twelve months. There are further details on the network website, some of which are given below.

“This new research network will bring together researchers and practitioners […] to discern if there is a genuine humanities approach to born-digital data, and to establish how this might inform, complement and draw on other disciplines and practices. Over the course of three workshops […] the network will address the current state of the field; establish the most appropriate tools and methods for humanities researchers for whom born-digital material is an important primary source; discuss the ways in which researchers and archives can work together to facilitate big data research; identify the barriers to engagement with big data, particularly in relation to skills; and work to build an engaged and lasting community of interest. The focus of the network will be on history, but it will also encompass other humanities and social science disciplines. The network will also include representatives of non-humanities disciplines, including the computer, social and information sciences. Interdisciplinarity and collaborative working are essential to digital research, and particularly in such a new and complex area of investigation.

“During the 12 months of the project all members of the network will contribute to a web resource, which will present key themes and ideas to both an academic and wider audience of the interested general public. External experts from government, the media and other relevant sectors will also be invited to contribute, to ensure that the network takes account of a range of opinions and needs. The exchange of knowledge and experience that takes place at the workshops will also be distilled into a white paper, which will be published under a CC-BY licence in month 12 of the network.”

What’s in a (top-level) domain name?

[This post first appeared on Peter Webster’s own blog]

I think there would be general agreement amongst web archivists that the country code top-level domain alone is not the whole of a national web. Implementations of legal deposit for the web tend to rely at least in part on the ccTLD (.uk, or .fr) as the means of defining their scope, even if supplemented by other means of selection.

However, efforts to understand the scale and patterns of national web content that lies outside national ccTLDs are in their infancy. An indication of the scale of the question is given by a recent investigation by the British Library. The @UKWebArchive team found more than 2.5 million hosts that were physically located in the UK without having .uk domain names. This would suggest that as much as a third of the UK web may lie outside its ccTLD.

And this is important to scholars, because we often tend to study questions in national terms – and it is difficult to generalise about a national web if the web archive we have is mostly made up of the ccTLD. And it is even more difficult if we don’t really understand how much national content there is outside that circle, and also which kinds of content are more or less likely to be outside the circle. Day to day, we can see that in the UK there are political parties, banks, train companies and all kinds of other organisations that ‘live’ outside .uk – but we understand almost nothing about how typical that is within any particular sector. We also understand very little about what motivates individuals and organisations to register their site in a particular national space.

So as a community of scholars we need case studies of particular sectors to understand their ‘residence patterns’, as it were: are British engineering firms (say) more or less likely to have a web domain from the ccTLD than nurseries, or taxi firms, or supermarkets? And so here is a modest attempt at just such a case study.

Anglican Ireland. (Church of Ireland, via Wikimedia Commons, CC BY-SA 3.0)

All the mainstream Christian churches in the island of Ireland date their origins to many years before the current political division of the island in 1921. As such, all the churches are organised on an all-Ireland basis, with organisational units that do not recognise the political border. In the case of the Church of Ireland (Anglican), although Northern Ireland lies entirely within the province of Armagh (the other province being Dublin), several of the dioceses of the province span the border, such that the bishop must cross the political border on a daily basis to minister to his various parishes.

How is this reflected on the web? In particular, where congregations in the same church are situated on either side of the border, where do their websites live – in .uk, or in .ie, or indeed in neither?

I have been assembling lists of individual congregation websites as part of a larger modelling of the Irish religious webspace, and one of these is for the Presbyterian Church in Ireland (PCI). My initial list contains just over two hundred individual church sites, the vast majority of which are in Northern Ireland (as is the bulk of the membership of the church). Looking at Northern Ireland, the ‘residence pattern’ is:

.co.uk – 23%
.org.uk – 20%
.com – 17%
.org – 37%
Other – 3%

In sum, less than half of these sites – of church congregations within the United Kingdom – are ‘resident’ within the UK ccTLD. A good deal of research would need to be done to understand the choices made by individual webmasters. However, it is noteworthy that, for Protestant churches in a part of the world where religious and national identity are so closely identified, to have a UK domain seems not to be all that important.
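For anyone wanting to repeat this kind of tally on their own list of sites, here is a minimal sketch. The suffix categories mirror the table above; the function names are illustrative, and the sketch assumes, as the summary sentence appears to, that the "Other" row contains no further .uk domains (even if it did, the .uk total would still be under half).

```python
from collections import Counter

# Categories from the table above; anything else is counted as "other".
SUFFIXES = (".co.uk", ".org.uk", ".com", ".org")


def residence(hostname):
    """Classify a hostname by the domain suffix it 'lives' under."""
    host = hostname.lower().rstrip(".")
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return "other"


def residence_pattern(hostnames):
    """Return the percentage of hostnames in each category, rounded to whole numbers."""
    counts = Counter(residence(h) for h in hostnames)
    total = sum(counts.values())
    return {category: round(100 * n / total) for category, n in counts.items()}


# Published percentages for the Northern Ireland congregations.
PUBLISHED = {".co.uk": 23, ".org.uk": 20, ".com": 17, ".org": 37, "other": 3}
uk_share = PUBLISHED[".co.uk"] + PUBLISHED[".org.uk"]
print(f"Resident in the UK ccTLD: {uk_share}%")
```

Matching on the end of the hostname is deliberately crude: it would file a .me.uk or plain .uk registration under "other", so a fuller study would want a proper list of public suffixes.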

Notes
1. My initial list (derived from one published by the PCI itself) represents only sites which the central organisation of the denomination knew existed at the time of compilation, and there are more than twice as many congregations as there are sites listed. However, it seems unlikely that that in itself can have skewed the proportions.

2. For the very small number of PCI congregations in the Republic of Ireland (that appear in the list), the situation is similar, with fewer than 30% of churches opting for a domain name within the .ie ccTLD. However, the number is too small (26 in all) to draw any conclusions from it.