Web traffic analytics as a historical source

[This is a guest post from Marcin Wilkowski, first published at wilkowski.org. Marcin Wilkowski is a member of the Digital Humanities Laboratory at the University of Warsaw.]

I have recently started researching the digital remains of a free Polish hosting service from the late 1990s: free.polbox.pl. Among the copies available via the Wayback Machine, I found some pages containing historical web traffic data. How can this data be used as a historical source?

First of all, following Niels Brügger,

when studying the web – today or in a historical perspective – we should focus our study on five different web strata: the web element, the web page, the website, the web sphere and the web as such.

The traffic analytics data of Polbox can thus be interpreted not only as a historical source for the websites published on that hosting service, but also as a basis for describing the technical and social environment of those websites beyond what is seen through the interface of a browser (a web sphere). But, just as when working with any other historical source, some evaluation should follow.

Browser summary

For example, the browser summary published on that archived page is more than a ranking of the most commonly used browsers in the free.polbox.pl domain. The summary actually aggregates the User-Agent headers sent with HTTP requests, so alongside historical browser names we can find other tools used to access web pages in the late 1990s.

The graphical Netscape browser, popular at the time, is in first place on the Polbox list. But look at positions three and four: Teleport Pro and Wget are both tools for copying web content, which enabled users to read it offline. This might seem puzzling if one does not know how high the costs of internet connections were in Poland in the late 1990s. In order to avoid large bills from the monopolistic internet provider, users chose to read Web content offline rather than online. One could copy a website at work or at the university library, save it to disk, and read it later on a home computer. In 1998, Polbox gave its users only 2 MB of hosting space to publish a site, so at that time websites could easily be downloaded to 3½ inch (1.44 MB) floppy disks.

Analysed requests from January 26, 1998 to February 2, 1998 (7 days):

2382307 Netscape (compatible)
1232973 Netscape
19358 Teleport Pro
7549 Wget
7384 Java1.0.2
5720 contype
4233 GAIS Robot
2998 Lynx
2845 IBrowse
2796 Microsoft Internet Explorer
2203 Infoseek
2029 ArchitextSpider
1915 Scooter
1539 MSProxy
1298 AmigaVoyager
1228 Amiga-AWeb
1185 Web21 CustomCrawl
679 RealPlayer 5.0
655 Mosaic
590 ICValidator

World of the Web without centralisation

A few more remarks on the browsers in the list. IBrowse, AWeb, and AmigaVoyager are browsers for Amiga OS. Mosaic is an early graphical web browser, first released in 1993 and already obsolete five years later, with its final release in January 1997. Lynx is a text-based web browser for Unix that is still in use today.

Browsers and website copying tools are not the only things to be found on that historical user-agent list. Because in early 1998 there was no Google at all (it was founded in September that year), the list is a useful document of the web search environment before one engine became dominant: a world of the Web without centralisation on its huge contemporary scale. On the list one can find Infoseek, ArchitextSpider (WebCrawler and Excite), Scooter (AltaVista) and Web21 CustomCrawl. However, bear in mind that the list shows only the first 20 user-agents, sorted by number of requests.
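To make the proportions in the browser summary easier to read, the counts can be grouped by the kind of tool behind each user-agent. The following is only a minimal sketch in Python: the counts are copied from the list quoted in this post, but the grouping (CATEGORIES), the function names, and the "unclassified" bucket are my own assumptions for illustration, not part of the original Polbox report. In particular, "Netscape (compatible)" is counted as a browser although it may cover several different programs, and ArchitextSpider is counted as a crawler on the assumption that it worked for Excite.

```python
from collections import defaultdict

# Request counts copied from the archived Polbox browser summary
# (January 26 to February 2, 1998, top 20 user-agents only).
REQUESTS = {
    "Netscape (compatible)": 2382307, "Netscape": 1232973,
    "Teleport Pro": 19358, "Wget": 7549, "Java1.0.2": 7384,
    "contype": 5720, "GAIS Robot": 4233, "Lynx": 2998,
    "IBrowse": 2845, "Microsoft Internet Explorer": 2796,
    "Infoseek": 2203, "ArchitextSpider": 2029, "Scooter": 1915,
    "MSProxy": 1539, "AmigaVoyager": 1298, "Amiga-AWeb": 1228,
    "Web21 CustomCrawl": 1185, "RealPlayer 5.0": 679,
    "Mosaic": 655, "ICValidator": 590,
}

# ASSUMPTION: this grouping is an illustrative reading of the list,
# not something stated in the original report. Anything not named
# here ends up in "unclassified".
CATEGORIES = {
    "browser": [
        "Netscape (compatible)", "Netscape", "Lynx", "IBrowse",
        "Microsoft Internet Explorer", "AmigaVoyager", "Amiga-AWeb",
        "Mosaic",
    ],
    "offline copying tool": ["Teleport Pro", "Wget"],
    "search engine crawler": [
        "Infoseek", "ArchitextSpider", "Scooter", "Web21 CustomCrawl",
    ],
}


def summarise(requests, categories):
    """Sum request counts per category."""
    lookup = {name: cat for cat, names in categories.items() for name in names}
    totals = defaultdict(int)
    for name, count in requests.items():
        totals[lookup.get(name, "unclassified")] += count
    return dict(totals)


def shares(totals):
    """Express each category as a fraction of all listed requests."""
    grand_total = sum(totals.values())
    return {cat: count / grand_total for cat, count in totals.items()}


if __name__ == "__main__":
    totals = summarise(REQUESTS, CATEGORIES)
    ranked = sorted(shares(totals).items(), key=lambda kv: kv[1], reverse=True)
    for cat, share in ranked:
        print(f"{cat:<24}{totals[cat]:>10}{share:>10.3%}")
```

The resulting shares refer only to the 20 user-agents shown in the report, so they say nothing about the long tail of less frequent clients.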

Traffic

Information about user-agents can help to examine how, and for what purposes, selected historical websites were used. But in the Web Server Statistics for Free Polbox WWW Server from 1998 we can also find other interesting historical data, namely on transfer volumes. Sources that let us compare changes in transfer volume over a given period allow us to illustrate the evolution of a historical website. What is more, by comparing weekend transfer with weekday transfer we could try to interpret how access to the Web was shaped by the place from which it was accessed (work vs home, just like in this study from 2012).
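The weekday versus weekend comparison could be sketched along the following lines. This is only an illustration under stated assumptions: the function name and the input format (a mapping from date to transferred bytes) are hypothetical, and the sample values are placeholders invented to show the mechanics, not figures from the Polbox report.

```python
from datetime import date, timedelta
from statistics import mean


def weekday_weekend_means(daily_bytes):
    """Return (mean weekday transfer, mean weekend transfer).

    daily_bytes maps datetime.date to bytes transferred on that day.
    A side with no days in the sample is returned as None.
    """
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekdays = [v for d, v in daily_bytes.items() if d.weekday() < 5]
    weekends = [v for d, v in daily_bytes.items() if d.weekday() >= 5]
    return (
        mean(weekdays) if weekdays else None,
        mean(weekends) if weekends else None,
    )


if __name__ == "__main__":
    # PLACEHOLDER values, invented for illustration only; they are
    # NOT taken from the archived statistics.
    start = date(1998, 1, 26)  # a Monday
    placeholder = [900, 950, 1000, 980, 920, 400, 450]
    sample = {start + timedelta(days=i): v for i, v in enumerate(placeholder)}

    weekday_avg, weekend_avg = weekday_weekend_means(sample)
    print("mean weekday transfer:", weekday_avg)
    print("mean weekend transfer:", weekend_avg)
```

With a single week of data such a comparison is at best suggestive; a longer series of archived reports would be needed before reading anything into the difference.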

And of course, historical web traffic data can show the Web growing and websites becoming increasingly complex.

Thus, historical Web traffic analytics can be useful to historians of the Web, but they should be used carefully. First of all, the data is far from complete, so any interpretation must be conservative. Secondly, we cannot be certain that the data was aggregated and then captured correctly – it’s Brügger’s document of Web fate. And, thirdly, some terms used in the original analytics, like users or browsers, could be misleading.

Inspirations: Niels Brügger, ‘Web History and the Web as a Historical Source’, in: Zeithistorische Forschungen/Studies in Contemporary History, online edition, 9 (2012), no. 2, URL: http://www.zeithistorische-forschungen.de/2-2012/id=4426, print edition: pp. 316-325.


