Web traffic analytics as a historical source

[This is a guest post from Marcin Wilkowski, first published at wilkowski.org. Marcin Wilkowski is a member of the Digital Humanities Laboratory at the University of Warsaw.]

I have recently been researching the digital remains of a free Polish hosting service from the late 1990s – free.polbox.pl. Among the copies available via the Wayback Machine, I found some pages containing historical web traffic data. How can this data be used as a historical source?

First of all, following Niels Brügger,

when studying the web – today or in a historical perspective – we should focus our study on five different web strata: the web element, the web page, the website, the web sphere and the web as such.

The traffic analytics data of Polbox can thus be interpreted not only as a historical source for the websites published on that hosting service, but also as a basis for describing the technical and social environment of those websites as seen beyond the interface of a browser (a web sphere). But – just as when working with any other historical source – some evaluation should follow.

Browser summary

For example, the browser summary published on that archived page is not just information on the most commonly used browsers in the free.polbox.pl domain. In fact, the summary aggregates the User-Agent headers sent in HTTP requests, so among some historical browser names we also find the other tools used to access web pages in the late 1990s.

The graphical Netscape browser, popular at the time, is in first place on the Polbox list. But look at points three and four: Teleport Pro and Wget are both tools for copying web content so that users could read it offline. This might seem enigmatic if one does not know how high the cost of internet connections was in Poland in the late 1990s. In order to avoid large bills from the monopolistic internet provider, users chose to read Web content offline rather than online. One could scrape a website at work or at the university library, then copy it to disk to read later on a home computer. In 1998, Polbox gave its users only 2 MB of hosting space to publish a site – while at that time, websites could easily be downloaded to 3½-inch (1.44 MB) floppy disks.

Analysed requests from January 26, 1998 to February 2, 1998 (7 days):

2382307 Netscape (compatible)
1232973 Netscape
19358 Teleport Pro
7549 Wget
7384 Java1.0.2
5720 contype
4233 GAIS Robot
2998 Lynx
2845 IBrowse
2796 Microsoft Internet Explorer
2203 Infoseek
2029 ArchitextSpider
1915 Scooter
1539 MSProxy
1298 AmigaVoyager
1228 Amiga-AWeb
1185 Web21 CustomCrawl
679 RealPlayer 5.0
655 Mosaic
590 ICValidator
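
To make the provenance of such a table concrete, here is a minimal sketch of how a browser summary of this kind could be reproduced today from a raw web server access log. It is only illustrative: it assumes the combined log format, in which the User-Agent string is the last quoted field, and it groups agents by their first product token – a cruder grouping than whatever the original Polbox analytics package used.

```python
import re
from collections import Counter

# In the combined log format the User-Agent string is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def browser_summary(log_path, top_n=20):
    """Count requests per coarse User-Agent family."""
    counts = Counter()
    with open(log_path, encoding="latin-1") as log:
        for line in log:
            match = UA_PATTERN.search(line)
            if not match:
                continue
            user_agent = match.group(1)
            # Keep only the product token before the first "/" or "(",
            # so "Wget/1.5.3" becomes "Wget" and "Teleport Pro/1.29"
            # becomes "Teleport Pro".
            family = re.split(r"[/(]", user_agent, maxsplit=1)[0].strip() or "unknown"
            counts[family] += 1
    return counts.most_common(top_n)

if __name__ == "__main__":
    for agent, requests in browser_summary("access.log"):
        print(f"{requests:>8} {agent}")
```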

World of the Web without centralisation

Some other remarks about browsers from the list: IBrowse, AWeb, and AmigaVoyager are browsers for AmigaOS. Mosaic is an early graphical web browser, first released in 1993 and already obsolete five years later, its final release having appeared in January 1997. Lynx is a text-based web browser for Unix which is still in use today.

And it is not only browsers and website-scraping tools that can be found on this historical user-agent list. Because in early 1998 there was no Google at all (it was founded in September that year), the list is a useful document of the environment of web search tools before one engine became dominant; a world of the Web without centralisation on its huge contemporary scale. On the list one can find Infoseek, WebCrawler and Excite, Scooter (AltaVista's crawler) or Web21 CustomCrawl. However, bear in mind that the list shows only the first 20 user-agents, sorted by number of requests.


Information about user-agents can help us examine the methods and purposes behind the use of selected historical websites. But the Web Server Statistics for Free Polbox WWW Server page from 1998 also contains other interesting historical data, on transfer volumes. Sources that let us compare changes in transfer over a given period allow us to illustrate the evolution of a historical website. What is more, by comparing weekend transfer with weekday transfer we could try to interpret how access to the Web was determined by the place from which it was accessed (work vs. home, just as in this study from 2012).

And of course, historical data of web traffic can demonstrate the Web getting larger and websites becoming increasingly complex.

Thus, historical Web traffic analytics can be useful to historians of the Web, but they should be used carefully. First of all, the data is hardly complete, so interpretations must be conservative (if any are made). Secondly, we cannot be certain that the data was aggregated and then captured correctly – such is the fate of the web document in Brügger's terms. And, thirdly, some terms used in the original analytics could be misleading – like users or browsers.

Inspiration: Niels Brügger, ‘Web History and the Web as a Historical Source’, Zeithistorische Forschungen/Studies in Contemporary History, online edition, 9 (2012), no. 2, URL: http://www.zeithistorische-forschungen.de/2-2012/id=4426; print edition, pp. 316–325.

Internet Histories — a new journal

By Niels Brügger

Let us assume that the internet is here to stay, and that it becomes ever more pivotal to have solid scholarly knowledge about the development of the internet of the past with a view to understanding the internet of the present and of the future; on the one hand, past events constitute important preconditions for today's internet, and, on the other, the mechanisms behind developments in the past may prove very helpful for understanding what is about to happen with the internet today.

Based on this rationale, the new scholarly journal Internet Histories: Digital Technology, Culture and Society (Taylor & Francis/Routledge) has just been founded.

For more than four decades the internet has continued to grow and spread, to the extent that today it is an indispensable element in the communicative infrastructure of many countries. Although the history of the internet has not been very prominent within the academic literature, an increasing number of books and journal articles in the last decade attest to the fact that internet historiography is an emerging field of study within internet studies, as well as within studies of culture, media, communication, and technology.

However, historical studies of the internet have in the main been published in journals related to a variety of disciplines, and these journals only rarely publish articles with a clear historical focus. Therefore, the editors of Internet Histories felt that there was a need for a journal in which the history of the internet and digital cultures is the main focus, a journal where historical studies are presented, and theoretical and methodological issues are debated, with a view to constituting the history of the internet as a field of study in its own right.

Internet Histories embraces empirical as well as theoretical and methodological studies within the field of the history of the internet broadly conceived — from early computer networks, Usenet and Bulletin Board Systems, to the everyday internet of the web, through to the emergence of new forms of internet with mobile phones and tablet computers, social media, and the internet of things. The journal will also provide the premier outlet for cutting-edge research in the closely related area of histories of digital cultures.

The title of the journal, Internet Histories, suggests that there is not one single and fixed Internet history running straight from Arpanet to the Internet as we know it today, from the United States to a world-wide network. Rather, there are multiple local, regional and national paths and a variety of ways in which the internet has been imagined, designed, used, shaped, and regulated around the world. Internet Histories aims to publish a range of scholarship that examines the global and internetworked nature of the digital world, as well as situated histories that account for diverse local contexts.

Managing Editor
Niels Brügger, Aarhus University, Denmark

Editors
Megan Ankerson, University of Michigan, USA
Gerard Goggin, University of Sydney, Australia
Valérie Schafer, National Center for Scientific Research, France

Reviews Editor
Ian Milligan, University of Waterloo, Canada

Editorial Assistant
Asger Harlung, Aarhus University, Denmark

More information about the journal can be found at the journal website: http://www.tandfonline.com/loi/rint20

A Tale of Deleted Cities: GeoCities Event at Computer History Museum

I recently had the opportunity to attend – via Beam telepresence robot – a talk by Richard Vijgen, creator of the 2011 “Deleted City” art exhibit, and by GeoCities founder David Bohnett. The “Deleted City” was hosted in the lobby of the Computer History Museum in Mountain View, California, and the talks marked the end of the exhibit. I won’t give a full recap of the talk, as I always find those difficult both to write and to read, but will give a few impressions alongside the video!

They were both great talks, available via YouTube above. Richard’s was fascinating in that it explored what Big Data means for historians – and recounted his experience of working with the Archive Team torrent. To me, the talk really underscored the importance of doing web history: the web really is the record of our lives today, and we need to hope that there are people there to back up this sort of information!

It was followed by David Bohnett, who explained the idea behind GeoCities, some of the technical challenges he faced, and what it was really like to preside over such explosive growth during the dot-com era. As somebody who has explored ideas of GeoCities as a community before, I was interested to hear so much emphasis placed in his talk on the neighbourhood structure, volunteer community leaders, and what this all meant for bringing people together. As a writer on this topic, it was pretty interesting and reassuring to hear that my own ideas weren’t off kilter!

I was also surprised, although perhaps I shouldn’t have been, by his attitude towards the closure of GeoCities in 2009 by Yahoo! (which had bought it in 1999) – that it was “better shut down than to go on as this abandoned version of its former self.” Fair enough, I suppose, but again – to echo Richard’s opening talk – thank god that Archive Team and the Internet Archive were there to preserve this information…

Anyways, check the video out for yourself if you’re interested.

Web archive conferences in 2017

2017 offers not one but two international conferences for scholars interested in the way we use the archived web.

There are calls for papers open now for both.

Curation and research use of the past Web
(The Web Archiving Conference of the International Internet Preservation Consortium)
Lisbon, 29-30 March 2017
Call for Papers now open, closing date 20 October 2016.

Researchers, practitioners and the archived Web
(2nd conference of ReSAW, the Europe-wide Research Infrastructure for the Study of Archived Web Materials)
London, 14-15 June 2017
Call for Papers now open, closing date 9 December 2016.

CFP: SAGE Handbook of Web History

We wanted to boost the signal here – a great opportunity for historians who have thoughts on Internet or Web histories! If you have any questions, please let Ian know.

Ian Milligan

Niels Brügger and I have sent this out to a few listservs, so I decided to cross-post it here on my blog as well. Do let me know if you have any questions!

The web has now been with us for almost 25 years: new media is simply not that new anymore. It has developed to become an inherent part of our social, cultural, and political lives, and is accordingly leaving behind a detailed documentary record of society and events since the advent of widespread web archiving in 1996. These two key points lie at the heart of our in-preparation SAGE Handbook of Web History: that the history of the web itself needs to be studied, but also that its value as an incomparable historical record needs to be inquired into as well. Within the last decade, considerable interest in the history of the Web has emerged. However, there is…


Born-digital data and methods for history: new research network

Both Ian and Peter are delighted to be part of a new research network in the UK, funded by the Arts and Humanities Research Council for twelve months. There are further details on the network website, some of which are given below.

“This new research network will bring together researchers and practitioners […] to discern if there is a genuine humanities approach to born-digital data, and to establish how this might inform, complement and draw on other disciplines and practices. Over the course of three workshops […] the network will address the current state of the field; establish the most appropriate tools and methods for humanities researchers for whom born-digital material is an important primary source; discuss the ways in which researchers and archives can work together to facilitate big data research; identify the barriers to engagement with big data, particularly in relation to skills; and work to build an engaged and lasting community of interest. The focus of the network will be on history, but it will also encompass other humanities and social science disciplines. The network will also include representatives of non-humanities disciplines, including the computer, social and information sciences. Interdisciplinarity and collaborative working are essential to digital research, and particularly in such a new and complex area of investigation.

“During the 12 months of the project all members of the network will contribute to a web resource, which will present key themes and ideas to both an academic and wider audience of the interested general public. External experts from government, the media and other relevant sectors will also be invited to contribute, to ensure that the network takes account of a range of opinions and needs. The exchange of knowledge and experience that takes place at the workshops will also be distilled into a white paper, which will be published under a CC-BY licence in month 12 of the network.”

What’s in a (top-level) domain name?

[This post first appeared on Peter Webster’s own blog]

I think there would be general agreement amongst web archivists that the country code top-level domain alone is not the whole of a national web. Implementations of legal deposit for the web tend to rely at least in part on the ccTLD (.uk, or .fr) as the means of defining their scope, even if supplemented by other means of selection.

However, efforts to understand the scale and patterns of national web content that lies outside national ccTLDs are in their infancy. An indication of the scale of the question is given by a recent investigation by the British Library. The @UKWebArchive team found more than 2.5 million hosts that were physically located in the UK without having .uk domain names. This would suggest that as much as a third of the UK web may lie outside its ccTLD.

And this is important to scholars, because we often tend to study questions in national terms – and it is difficult to generalise about a national web if the web archive we have is mostly made up of the ccTLD. And it is even more difficult if we don’t really understand how much national content there is outside that circle, and also which kinds of content are more or less likely to be outside the circle. Day to day, we can see that in the UK there are political parties, banks, train companies and all kinds of other organisations that ‘live’ outside .uk – but we understand almost nothing about how typical that is within any particular sector. We also understand very little about what motivates individuals and organisations to register their site in a particular national space.

So as a community of scholars we need case studies of particular sectors to understand their ‘residence patterns’, as it were: are British engineering firms (say) more or less likely to have a web domain from the ccTLD than nurseries, or taxi firms, or supermarkets? And so here is a modest attempt at just such a case study.

Anglican Ireland. (Church of Ireland, via Wikimedia Commons, CC BY-SA 3.0)

All the mainstream Christian churches in the island of Ireland date their origins to many years before the current political division of the island in 1921. As such, all the churches are organised on an all-Ireland basis, with organisational units that do not recognise the political border. In the case of the Church of Ireland (Anglican), although Northern Ireland lies entirely within the province of Armagh (the other province being Dublin), several of the dioceses of the province span the border, such that the bishop must cross the political border on a daily basis to minister to his various parishes.

How is this reflected on the web? In particular, where congregations of the same church are situated on either side of the border, where do their websites live – in .uk, in .ie, or indeed in neither?

I have been assembling lists of individual congregation websites as part of a larger modelling of the Irish religious webspace, and one of these lists is for the Presbyterian Church in Ireland. My initial list contains just over two hundred individual church sites, the vast majority of which are in Northern Ireland (as is the bulk of the membership of the church). Looking at Northern Ireland, the ‘residence pattern’ is:

.co.uk – 23%
.org.uk – 20%
.com – 17%
.org – 37%
Other – 3%

In sum, less than half of these sites – of church congregations within the United Kingdom – are ‘resident’ within the UK ccTLD. A good deal of research would be needed to understand the choices made by individual webmasters. However, it is noteworthy that, for Protestant churches in a part of the world where religious and national identity are so closely identified, having a UK domain seems not to be all that important.
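
For anyone who wants to repeat this kind of tally on their own lists, here is a minimal sketch of how a ‘residence pattern’ can be tabulated from a set of site URLs. The example URLs are hypothetical placeholders, not the actual PCI congregation sites.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample; the real list was compiled from the PCI's own directory.
sites = [
    "http://www.examplechurch.co.uk",
    "http://www.examplechurch.org.uk",
    "http://www.examplechurch.org",
    "http://www.examplechurch.com",
    "http://www.examplechurch.ie",
]

# More specific suffixes must come before the bare ccTLD they belong to.
SUFFIXES = (".co.uk", ".org.uk", ".ac.uk", ".uk", ".ie", ".com", ".org", ".net")

def residence_pattern(urls):
    """Tabulate the share of sites under each domain suffix."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        suffix = next((s for s in SUFFIXES if host.endswith(s)), "other")
        counts[suffix] += 1
    total = sum(counts.values())
    return {suffix: round(100 * n / total) for suffix, n in counts.items()}

print(residence_pattern(sites))
```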

1. My initial list (derived from one published by the PCI itself) represents only those sites which the central organisation of the denomination knew existed at the time of compilation, and there are more than twice as many congregations as there are sites listed. However, it seems unlikely that this in itself can have skewed the proportions.

2. For the very small number of PCI congregations in the Republic of Ireland (that appear in the list), the situation is similar, with less than 30% of churches opting for a domain name within the .ie ccTLD. However, the number is too small (26 in all) to draw any conclusions from it.

So You’re a Historian Who Wants to Get Started in Web Archiving

By Ian Milligan (University of Waterloo)

(Cross-posted and adapted from an earlier post I wrote for the IIPC’s blog)

The web archiving community is a great one, but it can sometimes be a bit confusing to enter. Unlike communities such as the Digital Humanities, which has developed aggregation services like DH Now, the web archiving community is a bit more dispersed. But fear not, there are a few places to visit to get a quick sense of what’s going on. Here I just want to give a quick rundown of how you can learn about web archiving on social media, from technical walkthroughs, and from blogs.

I’m sure I’m missing stuff – let us all know in the comments!

Social Media

A substantial amount of web archiving scholarship happens online. I use Twitter (I’m at @ianmilligan1), for example, as a key way to share research findings and ideas that I have as my project comes together. I usually try to hashtag them with #webarchiving. This means that all tweets tagged with “#webarchiving” will show up in that specific timeline.

For best results, using a Twitter client like Tweetdeck, Tweetbot, or Echofon can help you keep apprised of things. There may be Facebook groups – I actually don’t use Facebook (!) so I can’t provide much guidance there.

On the trace of a website’s lost past

[A guest post from Federico Nanni, who is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna.]

Result from a Wayback Machine search for Unibo.it

The University of Bologna is considered by historians to be the world’s oldest university in terms of continuous operation. At the university’s Centre for the History of Universities and Science, our research group works on different projects focused on understanding the history of this academic institution and its long-term socio-political relationships. Several sources and methods have been used for studying its past, from quantitative data analysis to online databases of scientific publications.

Since the introduction of the World Wide Web, a new and different kind of primary source has become available to researchers: born-digital documents, materials which have been shared primarily online and which will become increasingly useful for historians interested in studying recent history.

However, these sources are already more difficult to preserve than traditional ones, and this is especially true of the University of Bologna’s digital past. Italy does not have a national archive for the preservation of its web sphere, and furthermore “Unibo.it” has been excluded from the Wayback Machine.

For this reason, I have focused my research (a forthcoming piece on this will appear in Digital Humanities Quarterly) primarily on understanding how to deal with and solve this specific issue, in order to reconstruct the University of Bologna’s digital past and to understand whether these materials can offer us a new perspective on the recent history of this institution.

In order to understand the reasons for the removal of Unibo.it, my first step was to look in the Internet Archive’s exclusion policy for information related to the message “This URL has been excluded from the Wayback Machine”, which appeared when searching for “http://www.unibo.it”.

As described in the Internet Archive’s FAQ section, the most common reason for this exclusion is when a website explicitly requests not to be crawled by adding the string “User-agent: ia_archiver Disallow: /” to its robots.txt file. However, it is also explained that “Sometimes a website owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a “blocked site error” message, that means that a site owner has made such a request and it has been honoured. Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only. When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent”.

When a website has not been archived due to robots.txt limitations, a specific message is displayed. This is different from the one that appeared when searching for the University of Bologna website, as you can see in the figures below. Therefore, the only possible conclusion is that someone explicitly requested the removal of the University of Bologna website (or, more likely, only a specific part of it) from the Internet Archive.
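
As an aside, one programmatic way to check whether the Wayback Machine will serve any capture of a URL is its public availability API. The sketch below is only illustrative: the API reports whether a playable snapshot exists (and the one closest to a given date), but it does not say why a capture is missing – robots.txt, owner request, or simply never crawled.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def closest_snapshot(url, timestamp=None):
    """Ask the Wayback Machine availability API for the closest capture."""
    api = "https://archive.org/wayback/available?url=" + quote(url, safe="")
    if timestamp:  # YYYYMMDD - ask for the capture nearest to this date
        api += "&timestamp=" + timestamp
    with urlopen(api) as response:
        data = json.load(response)
    return data.get("archived_snapshots", {}).get("closest")

snapshot = closest_snapshot("www.unibo.it", "20020101")
if snapshot:
    print(snapshot["timestamp"], snapshot["url"])
else:
    print("No playable capture reported for this URL")
```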


For this reason, I decided to consult CeSIA, the team that has supervised Unibo.it over the last few years, about this issue. However, they had not submitted any removal request to the Internet Archive, and they were not aware of anyone else having submitted one.

To clarify this issue and discover whether the website of this institution had somehow been preserved over the last twenty years, I decided to contact the Internet Archive team at info@archive.org (as suggested in the FAQ section).

Thanks to the efforts of Mauro Amico (CeSIA), Raffaele Messuti (AlmaDL – Unibo), Christopher Butler (Internet Archive) and Giovanni Damiola (Internet Archive), we began to collaborate at the end of March 2015. As Butler told us, this case was very similar to another one involving New York government websites.

With their help, I discovered that a removal request regarding the main website and a list of specific subdomains had been submitted to the Wayback Machine in April 2002.


Thanks to our combined efforts, the university’s main website became available again on the Wayback Machine on 13 April 2015. However, neither the Internet Archive nor CeSIA has any trace of the email requests. For this reason, CeSIA decided to keep the other URLs on the list excluded from the Wayback Machine for the moment, as it is possible that the request was made for a specific legal reason.

In 2002 the administration of Unibo.it changed completely, during a general reorganization of the university’s digital presence. Therefore, it is entirely obscure who, in that very same month, could have sent this specific request, and for what reason.

However, it is evident that the request was made by someone who knew how the Internet Archive’s exclusion policy works, as he or she explicitly declared a specific list of subdomains to remove (in fact, the Internet Archive excludes based on URLs and their subsections – not subdomains). It is possible that the author obtained this specific knowledge by contacting the Internet Archive directly and asking for clarification.

Even though thirteen years had passed, my assumption was that someone involved in the administration of the website would remember at least this email exchange with a team of digital archivists in San Francisco. So, between April and June 2015 I conducted a series of interviews with several of the people involved in the Unibo.it website, before and after the 2002 reorganization. However, no one had memories or old emails related to this specific issue.

As the specificity of the request is the only hint that could help me identify its author, I decided to analyse the different URLs in more detail. The majority of them are server addresses (identified by “alma.unibo”), while the other pages are subdomains of the main website, for example estero.unibo.it (probably dedicated to international collaborations).

My questions now are: why did someone want to exclude exactly these pages and not, for example, all the department pages, which had an extremely active presence at that time? Why exactly these four subdomains and not the digital magazine Alma2000 (alma2000.unibo.it) or the e-learning platform (www.elearning.unibo.it)? It is possible that this precise selection was made for a specific reason, one that could offer us a better understanding of the use and purpose of this platform in those years.

To conclude, I would like to point out how strange this specific impasse is: given that we do not know the reason for the request, I cannot get permission from CeSIA, the current administrator, to analyse the snapshots of these URLs. At the same time, however, we cannot find anyone who remembers sending the request, and not a single trace of it has been preserved. In my opinion, this perfectly illustrates a new level of difficulty that future historians will encounter while investigating our past in the archives.

Federico Nanni is a PhD student in Science, Technology and Society at the Centre for the History of Universities and Science of the University of Bologna. His research focuses on understanding how to combine methodologies from different fields of study in order to confront both the scarcity and the abundance of born-digital sources related to the recent history of Italian universities.

The rise and fall of text on the Web: a study using Web archives

[The following is a guest post from Anthony Cocciolo (@acocciolo), Associate Professor at Pratt Institute School of Information and Library Science, on a recently published research study]

In the summer of 2014, I became interested in studying whether it was more than my mere impression that websites were beginning to present less text to end-users. Websites such as Buzzfeed.com were gaining enormous popularity and using a communicative style that had more in common with children’s books (large graphics and short segments of text) than with the traditional newspaper column. I wondered whether I could measure this change in any systematic way. I was interested in this change primarily for what it implied about literacy and what we ought to teach students, and more broadly for what it meant for how humans communicate and share information, knowledge and culture.

Teaching students to become archivists at a graduate school of information and library science, and focusing on a variety of digital archiving challenges, I was quite familiar with web archives. It was immediately clear to me that if I were to study this issue I would be relying on web archives, and primarily on the Internet Archive’s Wayback Machine, since it has collected such a wide scope of web pages since the 1990s.

The method devised was to select 100 popular and prominent homepages in the United States, from a variety of sectors, that were present in the late 1990s and are still used today. I also decided to sample homepages every three years beginning in 1999, resulting in 6 captures per site, or 600 homepages in total. The reason for this decision is that by 1999 the Internet Archive’s web archiving efforts were fully underway, and three-year intervals would be enough to show changes without requiring a hugely repetitive dataset. URLs for webpages in the Internet Archive were selected using the Memento web service. Full webpages were saved as static PNG files.
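
The study itself used the Memento service to locate snapshots; as an alternative illustration of the same selection step (an assumption of mine, not the author’s pipeline), the Wayback Machine’s CDX API can be queried for one successful capture of a homepage per sampling year:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

YEARS = [1999, 2002, 2005, 2008, 2011, 2014]

def one_capture(url, year):
    """Return the timestamp and replay URL of one 200-status capture from a year."""
    query = urlencode({
        "url": url,
        "from": f"{year}0101",
        "to": f"{year}1231",
        "filter": "statuscode:200",
        "limit": "1",
        "output": "json",
    })
    with urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
        rows = json.load(resp)
    if len(rows) < 2:          # first row is the header
        return None
    timestamp, original = rows[1][1], rows[1][2]
    return timestamp, f"https://web.archive.org/web/{timestamp}/{original}"

for year in YEARS:
    print(year, one_capture("whitehouse.gov", year))
```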

To distinguish text blocks from non-text blocks, I modified a Firefox extension called Project Naptha. This extension detects text using an algorithm called the Stroke Width Transform. The percentage of text per webpage was calculated and stored in a database. A sample of the text detection is shown in the figure below, where 46.10% of the page is text.

Text detection on the White House site
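
The percentage itself is straightforward to compute once the detector has returned text-block rectangles. Here is a minimal sketch, assuming non-overlapping (x, y, width, height) boxes in pixels; the study’s actual data structures are not described here, so this is only illustrative:

```python
def text_percentage(text_boxes, page_width, page_height):
    """Share of the page area covered by detected text blocks.

    text_boxes: iterable of (x, y, width, height) rectangles, assumed
    non-overlapping, as a text detector might return them (hypothetical format).
    """
    text_area = sum(w * h for _, _, w, h in text_boxes)
    return 100.0 * text_area / (page_width * page_height)

# Toy example: two text blocks on a 1024 x 4000 px homepage screenshot.
boxes = [(40, 120, 600, 300), (40, 500, 600, 1400)]
print(round(text_percentage(boxes, 1024, 4000), 2))
```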

Once the percentage of text for each webpage and year had been computed, I used a statistical technique called a one-way ANOVA to determine whether the percentage of text on a webpage was a chance occurrence, or instead depended on the year the website was produced. I found that these percentages were not random occurrences but depended on the year of production (what we would call statistically significant).
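
For readers unfamiliar with the test, a one-way ANOVA compares the means of several groups – here, text percentages grouped by capture year – and asks whether the between-group differences are larger than chance variation would explain. A minimal sketch with hypothetical numbers (not the study’s data), using SciPy:

```python
from scipy import stats

# Hypothetical text-percentage samples grouped by capture year; in the study
# each group held the measurements for the ~100 homepages of that year.
by_year = {
    1999: [22.1, 30.4, 18.9, 27.3],
    2005: [35.2, 41.0, 29.8, 38.6],
    2014: [24.5, 33.1, 21.7, 28.0],
}

f_statistic, p_value = stats.f_oneway(*by_year.values())
print(f"F = {f_statistic:.2f}, p = {p_value:.4f}")
# A small p-value (conventionally < 0.05) indicates that the mean percentage
# of text differs across years more than chance alone would explain.
```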

The major finding is that the amount of text rose each year from 1999 to 2005, at which point it peaked, and it has been on the decline ever since. Thus, website homepages in 2014 had 5.5% less text than they did in 2005. This is consistent with other research using web archives that indicates a decrease of text on the web. This pattern is illustrated below.

Mean percentage of text on pages over time

This study naturally raises the question: what has caused this decrease in the percentage of text on the Web? Although it is difficult to draw definitive conclusions, one suggestion is that the first Web boom of the late 1990s and early 2000s brought about significant enhancements to internet infrastructure, allowing non-textual media such as video to be streamed more easily to end-users (interestingly, 2005 was also the year that YouTube launched). This is not to suggest that text was replaced with YouTube videos, but rather that a rise in multiple modes of communication – such as video and audio – became more possible with their easier delivery, which may have helped unseat text from its primacy on the World Wide Web.

I think the study raises a number of interesting issues. If the World Wide Web is presenting less text to users relative to other elements, does this mean that the World Wide Web is becoming a place where deep reading is less likely to occur? Is deep reading now happening only in other places, such as e-readers or printed books (some research indicates this might be the case)? The early web was the great delivery mechanism for text, but might text be further unseated from its primacy, and the web become primarily a platform for delivering audiovisual media?

If you are interested in this study, you can read it in the open-access journal Information Research.