All posts by peterwebster

About peterwebster

Historian of twentieth century Britain; interested in digital history, open access publishing, web archives. Tweets @pj_webster

Web traffic analytics as a historical source

[This is a guest post from Marcin Wilkowski, first published at wilkowski.org. Marcin Wilkowski is a member of the Digital Humanities Laboratory at the University of Warsaw.]

I have recently begun researching the digital remains of a free Polish hosting service from the late 1990s – free.polbox.pl. Among the copies available via the Wayback Machine, I found some pages containing historical web traffic data. How can this data be used as a historical source?

First of all, following Niels Brügger,

when studying the web – today or in a historical perspective – we should focus our study on five different web strata: the web element, the web page, the website, the web sphere and the web as such.

The traffic analytics data of Polbox can thus be interpreted not only as an historical source for the websites published within that hosting, but also as a base for describing the technical and social environment of the websites seen beyond the interface of a browser (a web sphere). But – just as when working with any other historical sources – some evaluation should follow.

Browser summary

For example, the browser summary published on that archived page is not just information on the most commonly used browsers in the free.polbox.pl domain. In fact, the summary aggregates the User-Agent headers sent with HTTP requests, so among the historical browser names we can also find other tools used to access web pages in the late 1990s.

The graphical Netscape browser, popular at the time, is in first place on the Polbox list. But look at entries three and four: Teleport Pro and Wget are both tools for copying web content, which enabled users to read offline. This might seem puzzling if one does not know how high the costs of internet connections were in Poland in the late 1990s. In order to avoid large bills from the monopolistic internet provider, users chose to read Web content offline rather than online. One could scrape a website at work or at the university library, then copy it to disk to read later on a home computer. In 1998, Polbox gave its users only 2 MB of hosting space to publish a site – while at that time websites could easily be downloaded to 3½ inch (1.44 MB) floppy disks.

Analysed requests from January 26, 1998 to February 2, 1998 (7 days):

2382307 Netscape (compatible)
1232973 Netscape
19358 Teleport Pro
7549 Wget
7384 Java1.0.2
5720 contype
4233 GAIS Robot
2998 Lynx
2845 IBrowse
2796 Microsoft Internet Explorer
2203 Infoseek
2029 ArchitextSpider
1915 Scooter
1539 MSProxy
1298 AmigaVoyager
1228 Amiga-AWeb
1185 Web21 CustomCrawl
679 RealPlayer 5.0
655 Mosaic
590 ICValidator

World of the Web without centralisation

Some further remarks about the browsers on the list: IBrowse, AWeb and AmigaVoyager are browsers for Amiga OS. Mosaic is an early graphical web browser, first released in 1993 and already obsolete five years later, its final release having appeared in January 1997. Lynx is a text-based web browser for Unix which is still in use today.

Browsers and website scraping tools are not the only things to be found on that historical user-agent list. Because in early 1998 there was no Google at all (it was founded in September that year), the list is a useful document of the environment of web search tools before one engine became dominant; a world of the Web without centralisation on its huge contemporary scale. On the list one can find Infoseek, WebCrawler and Excite, Scooter (for AltaVista) or Web21 CustomCrawl. However, bear in mind that the list shows only the first 20 user-agents, sorted by number of requests.
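To make the reading of the list a little more concrete, here is a minimal Python sketch that tallies the requests by kind of tool. The counts are copied from the list above; the grouping is an assumption that follows only what this post says about each agent, and everything else (including the names `GROUPS` and `group_totals`) is illustrative. Agents the post does not describe are left unclassified rather than guessed at.

```python
# Request counts copied from the Polbox list above (26 Jan - 2 Feb 1998).
REQUESTS = {
    "Netscape (compatible)": 2382307, "Netscape": 1232973,
    "Teleport Pro": 19358, "Wget": 7549, "Java1.0.2": 7384,
    "contype": 5720, "GAIS Robot": 4233, "Lynx": 2998,
    "IBrowse": 2845, "Microsoft Internet Explorer": 2796,
    "Infoseek": 2203, "ArchitextSpider": 2029, "Scooter": 1915,
    "MSProxy": 1539, "AmigaVoyager": 1298, "Amiga-AWeb": 1228,
    "Web21 CustomCrawl": 1185, "RealPlayer 5.0": 679,
    "Mosaic": 655, "ICValidator": 590,
}

# Assumed grouping, based only on how the post describes each agent.
GROUPS = {
    "browser": ["Netscape (compatible)", "Netscape", "Lynx", "IBrowse",
                "AmigaVoyager", "Amiga-AWeb", "Mosaic"],
    "offline copier": ["Teleport Pro", "Wget"],
    "search crawler": ["Infoseek", "Scooter", "Web21 CustomCrawl"],
}


def group_totals(requests, groups):
    """Sum request counts per group; leftovers go to 'unclassified'."""
    totals = {name: sum(requests[agent] for agent in agents)
              for name, agents in groups.items()}
    classified = {agent for agents in groups.values() for agent in agents}
    totals["unclassified"] = sum(count for agent, count in requests.items()
                                 if agent not in classified)
    return totals


totals = group_totals(REQUESTS, GROUPS)
for name, count in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name:15} {count:>9}")
```

Even a rough tally like this should be read with the caveats that follow: the labels come from the analytics software, not from the people or programs that actually made the requests.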

Traffic

Information about user-agents can help us to examine the methods and purposes behind the use of selected historical websites. But in the Web Server Statistics for Free Polbox WWW Server from 1998 we can also find other interesting historical transfer data. Sources that let us compare the volume of transfer across a given period allow us to illustrate the evolution of a historical website. What is more, by comparing weekend transfer with weekday transfer we could try to interpret how access to the Web was shaped by the place from which it was accessed (work vs home, just like in this study from 2012).

And of course, historical data of web traffic can demonstrate the Web getting larger and websites becoming increasingly complex.

Thus, historical Web traffic analytics can be useful to historians of the Web, but they should be used carefully. First of all, the data is hardly complete, so the interpretations must be conservative (if any). Secondly, we cannot be certain that the data has been aggregated and then captured correctly – it’s Brügger’s document of Web fate. And, thirdly, some terms used originally in the historical analytics could be misleading – like users or browsers.

Inspirations: Niels Brügger, ‘Web History and the Web as a Historical Source’, in: Zeithistorische Forschungen/Studies in Contemporary History, Online-Ausgabe, 9 (2012), H. 2, URL: http://www.zeithistorische-forschungen.de/2-2012/id=4426, Druckausgabe: S. 316-325.

Web archive conferences in 2017

2017 offers not one but two international conferences for scholars interested in the way we use the archived web.

There are calls for papers open now for both.

Curation and research use of the past Web
(The Web Archiving Conference of the International Internet Preservation Consortium)
Lisbon, 29-30 March 2017
Call for Papers now open, closing date 20 October 2016.

Researchers, practitioners and the archived Web
(2nd conference of ReSAW, the Europe-wide Research Infrastructure for the Study of Archived Web Materials)
London, 14-15 June 2017
Call for Papers now open, closing date 9 December 2016.

Born-digital data and methods for history: new research network

Both Ian and Peter are delighted to be part of a new research network in the UK, funded by the Arts and Humanities Research Council for twelve months. There are further details on the network website, some of which are given below.

“This new research network will bring together researchers and practitioners […] to discern if there is a genuine humanities approach to born-digital data, and to establish how this might inform, complement and draw on other disciplines and practices. Over the course of three workshops […] the network will address the current state of the field; establish the most appropriate tools and methods for humanities researchers for whom born-digital material is an important primary source; discuss the ways in which researchers and archives can work together to facilitate big data research; identify the barriers to engagement with big data, particularly in relation to skills; and work to build an engaged and lasting community of interest. The focus of the network will be on history, but it will also encompass other humanities and social science disciplines. The network will also include representatives of non-humanities disciplines, including the computer, social and information sciences. Interdisciplinarity and collaborative working are essential to digital research, and particularly in such a new and complex area of investigation.

“During the 12 months of the project all members of the network will contribute to a web resource, which will present key themes and ideas to both an academic and wider audience of the interested general public. External experts from government, the media and other relevant sectors will also be invited to contribute, to ensure that the network takes account of a range of opinions and needs. The exchange of knowledge and experience that takes place at the workshops will also be distilled into a white paper, which will be published under a CC-BY licence in month 12 of the network.”

What’s in a (top-level) domain name?

[This post first appeared on Peter Webster’s own blog]

I think there would be general agreement amongst web archivists that the country code top-level domain alone is not the whole of a national web. Implementations of legal deposit for the web tend to rely at least in part on the ccTLD (.uk, or .fr) as the means of defining their scope, even if supplemented by other means of selection.

However, efforts to understand the scale and patterns of national web content that lies outside national ccTLDs are in their infancy. An indication of the scale of the question is given by a recent investigation by the British Library. The @UKWebArchive team found more than 2.5 million hosts that were physically located in the UK without having .uk domain names. This would suggest that as much as a third of the UK web may lie outside its ccTLD.

And this is important to scholars, because we often tend to study questions in national terms – and it is difficult to generalise about a national web if the web archive we have is mostly made up of the ccTLD. And it is even more difficult if we don’t really understand how much national content there is outside that circle, and also which kinds of content are more or less likely to be outside the circle. Day to day, we can see that in the UK there are political parties, banks, train companies and all kinds of other organisations that ‘live’ outside .uk – but we understand almost nothing about how typical that is within any particular sector. We also understand very little about what motivates individuals and organisations to register their site in a particular national space.

So as a community of scholars we need case studies of particular sectors to understand their ‘residence patterns’, as it were: are British engineering firms (say) more or less likely to have a web domain from the ccTLD than nurseries, or taxi firms, or supermarkets? And so here is a modest attempt at just such a case study.

Anglican Ireland. (Church of Ireland, via Wikimedia Commons, CC BY-SA 3.0)

All the mainstream Christian churches in the island of Ireland date their origins to many years before the current political division of the island in 1921. As such, all the churches are organised on an all-Ireland basis, with organisational units that do not recognise the political border. In the case of the Church of Ireland (Anglican), although Northern Ireland lies entirely within the province of Armagh (the other province being Dublin), several of the dioceses of the province span the border, such that the bishop must cross the political border on a daily basis to minister to his various parishes.

How is this reflected on the web? In particular, where congregations in the same church are situated on either side of the border, where do their websites live – in .uk, or in .ie, or indeed in neither?

I have been assembling lists of individual congregation websites as part of a larger modelling of the Irish religious webspace, and one of these lists is for the Presbyterian Church in Ireland. My initial list contains just over two hundred individual church sites, the vast majority of which are in Northern Ireland (as is the bulk of the membership of the church). Looking at Northern Ireland, the ‘residence pattern’ is:

.co.uk – 23%
.org.uk – 20%
.com – 17%
.org – 37%
Other – 3%

In sum, less than half of these sites – of church congregations within the United Kingdom – are ‘resident’ within the UK ccTLD. A good deal of research would need to be done to understand the choices made by individual webmasters. However, it is noteworthy that, for Protestant churches in a part of the world where religious and national identity are so closely identified, to have a UK domain seems not to be all that important.
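A residence pattern of this kind can be tallied mechanically from a list of site URLs. The following is a minimal Python sketch of one way to do it, not the procedure actually used for the figures above; the function names and the sample URLs are hypothetical placeholders.

```python
from collections import Counter
from urllib.parse import urlsplit

# The suffixes reported in the table above; anything else counts as "other".
SUFFIXES = [".co.uk", ".org.uk", ".com", ".org"]


def residence(url):
    """Return the suffix a site 'lives' under, or 'other'."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return "other"


def residence_pattern(urls):
    """Percentage of sites under each suffix, rounded to whole numbers."""
    counts = Counter(residence(url) for url in urls)
    return {suffix: round(100 * n / len(urls)) for suffix, n in counts.items()}


# Hypothetical sample, for illustration only.
sample = [
    "http://firstchurch.example.co.uk/",
    "http://secondchurch.example.org.uk/",
    "http://www.example.org/",
    "http://www.example.com/",
]
print(residence_pattern(sample))

# From the published figures: the share inside the UK ccTLD.
uk_share = 23 + 20
print(f"{uk_share}% of sites are within .uk")
```

The last two lines simply add the two .uk rows of the table, which is where the "less than half" observation comes from.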

Notes
1. My initial list (derived from one published by the PCI itself) represents only sites which the central organisation of the denomination knew existed at the time of compilation, and there are more than twice as many congregations as there are sites listed. However, it seems unlikely that that in itself can have skewed the proportions.

2. For the very small number of PCI congregations in the Republic of Ireland (that appear in the list), the situation is similar, with less than 30% of churches opting for a domain name within the .ie ccTLD. However, the number is too small (26 in all) to draw any conclusions from it.

The rise and fall of text on the Web: a study using Web archives

[The following is a guest post from Anthony Cocciolo (@acocciolo), Associate Professor at Pratt Institute School of Information and Library Science, on a recently published research study]

In the summer of 2014, I became interested in studying whether it was more than my mere impression that websites were beginning to present less text to end-users. Websites such as Buzzfeed.com were gaining enormous popularity and using a communicative style that had more in common with children’s books (large graphics and short segments of text) than with the traditional newspaper column. I wondered if I could measure this change in any systematic way. I was interested in this change primarily for what it implied about literacy and what we ought to teach students, and more broadly about what this change meant for how humans communicate and share information, knowledge and culture.

Teaching students to become archivists at a graduate school of information and library science, and focusing on a variety of digital archiving challenges, I was quite familiar with web archives. It was immediately clear to me if I were to study this issue I would be relying on web archives, and primarily on the Internet Archive’s Wayback Machine, since it had collected such a wide scope of web pages since the 1990s.

The method devised was to select 100 popular and prominent homepages in the United States from a variety of sectors that were present in the late 1990s and are still used today. I also decided to select homepages every three years beginning in 1999, resulting in 6 captures or 600 homepages. The reason for this decision is that by 1999 the Internet Archive’s web archiving efforts were fully underway, and three years would be enough to show changes but not require a hugely repetitive dataset. URLs for webpages in the Internet Archive were selected using the Memento web service. Full webpages were saved as static PNG files.

To distinguish text blocks from non-text blocks, I modified a Firefox extension called Project Naptha. This extension separates text from non-text using an algorithm called the Stroke Width Transform. The percentage of text per webpage was calculated and stored in a database. A sample of the detection is shown in the figure below; the page shown is 46.10% text.

Text detection on the White House site

Once the percentage of text for each webpage and year were computed, I used a statistical technique called a one-way ANOVA to determine whether the percentage of text on a webpage was a chance occurrence, or instead dependent on the year the Website was produced. I found that these percentages were not random occurrences but dependent on the year of production (what we would call statistically significant).
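For readers unfamiliar with the technique, a one-way ANOVA compares the variation between groups (here, capture years) with the variation within them. The Python sketch below computes only the F statistic, using the standard textbook formula; it is not the study's own code, and the percentages in it are invented purely for illustration. Turning F into a significance level additionally requires the F distribution, which a statistics package would supply.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a list of groups of numbers."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented text percentages for three capture years (illustration only).
text_share = {
    1999: [18.0, 22.5, 20.1, 25.3],
    2005: [27.4, 30.2, 24.8, 29.9],
    2014: [19.6, 23.1, 21.7, 24.0],
}
f_stat = one_way_anova_f(list(text_share.values()))
print(f"F = {f_stat:.2f}")
```

A large F means that the differences between years are big relative to the spread among pages captured in the same year.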

The major finding is that the amount of text rose each year from 1999 to 2005, at which point it peaked, and it has been in decline ever since. Thus, website homepages in 2014 have 5.5% less text than they did in 2005. This is consistent with other research using web archives that indicates a decrease of text on the web. This pattern is illustrated below.

Mean percentage of text on pages over time

This study necessarily raises the question: what has caused this decrease in the percentage of text on the Web? Although it is difficult to draw definitive conclusions, one suggestion is that the first Web boom of the late 1990s and early 2000s brought about significant enhancements to internet infrastructure, allowing non-textual media such as video to be more easily streamed to end-users. (Interestingly, 2005 was also the year that YouTube was launched.) This is not to suggest that text was replaced with YouTube videos, but rather that multiple modes of communication, such as video and audio, became more feasible once they were easier to deliver, which may have helped unseat text from its primacy on the World Wide Web.

I think the study raises a number of interesting issues. If the World Wide Web is presenting less text to users relative to other elements, does this mean that the World Wide Web is becoming a place where deep reading is less likely to occur? Is deep reading now only happening in other places, such as e-readers or printed books (some research indicates this might be the case)? The early web was the great delivery mechanism of text, but might text be further unseated from its primacy and the web become primarily a platform for delivering audiovisual media?

If you are interested in this study, you can read it in the open-access publication Information Research.

The net, the web, the archive and the historian

[A guest post from Dr Gareth Millward (@MillieQED), who is Research Fellow at the London School of Hygiene and Tropical Medicine.]

One of the first things you need to get your head around when you dive into the history of the internet is that “the internet” and “the web” are not the same thing. That sounds trivial to most people who have worked in the sector for any period of time. But trust me – it isn’t.

It’s a problem because we have been archiving the web systematically for quite a long time. The British Library’s archive has pages stored from 1996 onwards. So as someone relatively new to using web archives as a scholarly source, I can access a lot of information.

As someone whose family got their first internet connection in 2000, however, I also know that there’s a lot that won’t be stored. And there is a lot that will be stored that I won’t be able to access. Internet Relay Chat, for example, was very popular when I first got access to the ‘net. From those MSN chat rooms (that were eventually shut down due to the… er… “unpleasantness”), to the use of purpose-made clients to connect with friends, chat was by its nature ephemeral. Perhaps some users kept logs of their conversations (and I probably have a few of those in a text file somewhere). More than likely, they didn’t. And even if they did, the chances of those logs surviving are slim.

The advent of Facebook and Twitter and their ilk in the mid-2000s has also complicated matters. Pretty quickly it became apparent that these social networks were culturally important and would probably need to be preserved. But the ethics of such an undertaking are complicated to say the least. It’s one thing to do a “big data” analysis of the rise and fall of the term “hope” over the 2004 US General Election. It’s another to do a “close reading” analysis of the behaviour of teenagers. Since it’s all held behind password-protected pages and servers, our old web-crawling techniques aren’t going to help. The Library of Congress is collecting Twitter. But how we will actually use it in the future remains to be seen.

Moreover, with social media, chat logs, e-mails, and various other “non-web” internet data, we cannot be certain about how systematic or representative our source base is. There is great potential for our research findings to be skewed. (Not, of course, that the web archive is objective and clean either. But I digress.)

This matters to me as a historian because I am not a computer scientist. I wouldn’t even consider myself a historian of the internet. Much as I use biographies, diaries, government papers and objects to build a story of the past, internet sources are yet another way of finding out what people said and did. A good historian would never assume a diary to be an accurate, objective account of past events, just as she would understand that, regardless of the number of sources she collates, there will always be gaps in the evidence. There is always an inherent bias in which data survive.

The problem, really, is twofold. First, there is so much material available it gives both the illusion of completeness and the temptation to try to use it all. Second, because it lacks the human curation element so central to “traditional” archives, it can be difficult to sift through the white noise and home in on the data that matters to our research questions.

The first part is relatively difficult to get over, but not impossible. It simply requires some discipline and better training on what internet archives can and cannot do. From there, we can apply our knowledge and discretion to only focus on the parts of the archive that will actually help us – and/or adapt our research questions accordingly.

But that second bit is always going to be a problem. Again, discipline can help. We can simply accept our fate – that we will never have it all – and focus our histories on the scraps that remain. Like Ian Milligan’s work on the archive of GeoCities. Or Kevin Driscoll’s on the history of Bulletin Board Systems. At the same time, how does a historian of the 1990s try to use these archives to try to access the people of the period? How on earth can this material be narrowed down? Will we always have to keep our “online” and “offline” research separate?

The exciting thing is that we don’t have fully developed answers to these questions yet. The scary thing is that it’s our generation of scholars that are going to have to come up with the solutions. This seems like a lot of work. If anyone is willing to do it for me, I would be forever grateful!

When just using a web archive could place it in danger

[A recent post, cross-posted from Peter’s own blog.]

Towards the end of 2013 the UK saw a public controversy seemingly made to showcase the value of web archives. The Conservative Party, in what I still think was nothing more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of their site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:

Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.

At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.
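To see what such a robots.txt rule looks like in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The path, host and crawler name are hypothetical and are not the party's actual rules; the sketch shows only how a well-behaved crawler interprets a `Disallow` line, not how the Internet Archive applies the rule retrospectively to playback.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all crawlers to stay out of one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /speeches-archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleArchiveBot" is a made-up crawler name.
speech_ok = parser.can_fetch(
    "ExampleArchiveBot", "http://www.example.org/speeches-archive/2009.html")
news_ok = parser.can_fetch(
    "ExampleArchiveBot", "http://www.example.org/news/")

print("speech page may be crawled:", speech_ok)
print("news page may be crawled:", news_ok)
```

Nothing in the protocol enforces the answer: the file is a request, and it is the crawler (or, in this case, the archive's access policy) that chooses to honour it.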

Courtesy of wfryer on flickr.com, CC BY-SA 2.0 : https://www.flickr.com/photos/wfryer/

As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.

This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, the values of which I deplore but which as scholars we need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.

Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, the consciousness of that fact begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative Party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.

Put simply: the content I’ve been citing this week could disappear from view later today if the organisation concerned wanted it to, and were to come to understand how to make it happen. It is possible, in short, effectively to delete the archive – which is rather terrifying.

In the UK, at least, the danger of this is removed for content published after 2013, due to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide). In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.

What does the web remember of its deleted past?

[A special guest post from Dr Anat Ben-David (@anatbd) of the Open University of Israel]

[Update (Jan 2017): this research has recently been published in New Media and Society. A free version is available on Academia.edu]

On March 30 2010, the country-code top-level domain of the former Yugoslavia, .yu, was deleted from the Internet. It is said to have been the largest ccTLD ever removed. In terms of Internet governance, the domain had lost any entitlement to be part of the Internet’s root zone, after Yugoslavia dissolved. With the exception of Kosovo, all former Yugoslav republics received new ccTLDs. Technically, it was neither necessary nor possible to keep a domain of a country that no longer exists.

The consequence of the removal of the domain, which at its peak hosted about 70,000 websites, is the immediate deletion of any evidence that it was part of the Web. The oblivious live Web has simply rerouted around it. Since the .yu ccTLD is no longer part of the DNS, even if .yu websites are still hosted somewhere on a forgotten server, they cannot be recalled; search engines do not return results to queries for Websites in the .yu domain; references to old URLs on Wikipedia are broken.

My recent research uses the case of the deleted .yu domain to problematize the ties between the live and archived Web, and to both question and demonstrate the utility of Web archives as a primary source for historiography. The first problem I address relates to the politics of the live Web, which, arguably, create a structural preference for sovereign and stable states. The DNS protocol enforces ICANN’s domain delegation policy, which is derived from the ISO-3166 list of countries and territories officially recognized by the United Nations. Countries and territories recognized by the UN are therefore delegated ccTLDs, but unstable, unrecognized, dissolving or non-sovereign states cannot enjoy such formal presence on the Web, marked by the national country-code suffix. It is for this reason that the former republics of Yugoslavia (Bosnia, Macedonia, Slovenia, Croatia, Serbia and Montenegro) received new ccTLDs, but Kosovo, which is not recognized by the United Nations, did not.

While such policy influences the Web of the present, it also denies unstable and non-sovereign countries the possibility of preserving evidence of their digital past. To illustrate my point, consider an imaginary scenario whereby the top-level domain of a Western and wealthy state – say Germany, or the UK – is to be removed from the DNS system in two years. It is difficult to imagine that a loss of digital cultural heritage at such scale would go unnoticed. To prevent such imaginary scenarios from taking place, national libraries around the world work tirelessly to preserve their country’s national Webs. Yet for non-sovereign states, or in case of war-torn states that once existed but have since dissolved, such as the former Socialist Federal Republic of Yugoslavia, the removal of the country’s domain is not treated in terms of cultural heritage and preservation, but instead as a bureaucratic and technical issue.

Technically, the transition from .yu to the Serbian .rs and the Montenegrin .me was perfectly coordinated between ICANN, Serbia and Montenegro. In 2008, a two-year transitional phase was announced to allow webmasters ample time to transfer their old .yu websites to the new national domains. It is reported that migration rates were rather high. But what about the early days of the .yu domain – the websites that describe important historical events such as the NATO Bombing, the Kosovo War, the fall of Milosevic? What about the historical significance of the mailing lists and newsgroups that contributed for the first time to online reporting of war from the ground? The early history of the .yu domain – the domain that existed prior to the establishment of Serbia and Montenegro as sovereign states – was gone forever.

Almost.

Thankfully, the Internet Archive has kept snapshots of the .yu domain throughout the years. However, a second problem hinders historians from accessing the rare documents that can no longer be found online. That second problem relates to the structural dependence of Web archives on the live Web. Despite some critical voices in the Web archiving community, most Web archiving initiatives and most researchers still assume that the live Web is the primary access point that leads to the archive. The Wayback Machine’s interface is an example of that; one has to know a URL in order to view its archived version. The archive validates the existence of URLs of the live Web, and allows for examining their history. However, if all URLs of a certain domain are removed from the live Web and leave no trace, what could lead historians, researchers, or individuals to the archived snapshots of that domain?

Taking both problems into account, I set out to reconstruct the history of the .yu domain from the Internet Archive. The challenge is guided by a larger question about the utility of Web archives for historiography. Can the Web be used as a primary source for telling its own history? What does the Web remember of its deleted past? If the live Web has no evidence of the past existence of any .yu URL, would I be able to find the former Yugoslav Web in the Internet Archive, demarcate it, and reconstruct its networked structure?

I began digging. Initially, I used various advanced search techniques to find old Websites that may contain broken links to .yu Websites. I also scraped online aggregators of scholarly articles to find old references to .yu Websites in footnotes and bibliographies. My attempts yielded about 200 URLs, certainly not enough to reconstruct the history of the entire domain from the Internet Archive.

The second option was to use offline sources – newspaper archives, printed books, and physical archives. But doing so would not rely on the Web as a primary source of narrating its history.

My diggings have eventually led me to old mailing lists. In one of them I found a treasure. On 17 February 2009, Nikola Smolenski, a Wikipedian and a Web developer, posted a message to Wikimedia’s Wikibots-L mailing list, asking fellow Wikipedians to help him replace all references to .yu URLs in the various pages of the Wikimedia project. The risk, wrote Smolenski, was ‘that readers of Wikimedia projects will not be able to access information that is now available to them’, and that ‘with massive link loss, a large number of references could no longer be evaluated by the readers and editors’. He used a Python script to generate a list of 46,102 URLs in the .yu domain that were linked from Wikimedia projects and that had to be replaced. A day before the removal of the domain, he also systematically queried Google for all URLs in the .yu domain per sub-domain, which yielded several thousand results. Smolenski’s lists are a last snapshot of the presence of the Yugoslav domain on the live Web. The day after he conducted the search, the .yu ccTLD was no longer part of the Internet root, resulting in the link loss he had anticipated.

Smolenski kindly agreed to send me the lists he generated in 2010. Using the URLs in the lists as seeds, my research assistant Adam Amram and I built another Python script to fetch the URLs from the Internet Archive, extract all the outlinks from each archived resource, and extract from that set of links those which belonged to the .yu domain. We repeated the method four times until no new .yu content was found. Our dataset now contains 1.92 million unique pages that were hosted in the .yu domain between 1996 and 2010.
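To make the iterative method more concrete, here is a minimal sketch in Python of this kind of snowball procedure. It is not the script we actually used: the names `snowball`, `in_yu_domain`, `extract_outlinks` and the `fetch_archived_html` callback are all hypothetical, and the callback is assumed to return the original HTML of an archived capture (or `None` if there is no capture), however you choose to obtain it from the Internet Archive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_outlinks(html, base_url):
    """Return the set of absolute URLs linked from one page."""
    parser = LinkExtractor()
    parser.feed(html)
    return {urljoin(base_url, href) for href in parser.links}


def in_yu_domain(url):
    """True if the URL's host sits under the .yu ccTLD."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".yu")


def snowball(seeds, fetch_archived_html, max_rounds=10):
    """Expand a seed list of .yu URLs by following archived outlinks.

    fetch_archived_html is a hypothetical callback: it takes a URL and
    returns the archived page's HTML, or None if no capture exists.
    Returns the set of known .yu URLs and the number of rounds run.
    """
    known = {url for url in seeds if in_yu_domain(url)}
    frontier = set(known)
    rounds = 0
    while frontier and rounds < max_rounds:
        discovered = set()
        for url in frontier:
            html = fetch_archived_html(url)
            if html is None:
                continue  # never captured, nothing to extract
            for link in extract_outlinks(html, url):
                if in_yu_domain(link) and link not in known:
                    discovered.add(link)
        known |= discovered
        frontier = discovered  # only newly found URLs are fetched next
        rounds += 1
    return known, rounds
```

The loop stops when a round discovers no new .yu URLs, which mirrors the stopping rule described above. Passing the fetcher in as a callback keeps the sketch independent of any particular way of querying the archive and makes it easy to try out on a small in-memory set of pages.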

While the full analysis of our data is beyond the scope of this blog post, I would like to present the following visualization of the rise and fall of the networked structure of the .yu domain over time. The figure below shows the evolution of the linking structure of .yu websites in the entire reconstructed space from 1996 to 2010. Websites in the .yu domain are marked in blue, websites in all other domains are marked in gray, and the visualization shows the domain’s hyperlinked structure per year.

[Figure: networked structure of the .yu domain, 1996-2010]

As can be clearly seen, the internal linking structure of the domain became dense only after the end of the Milosevic regime in 2000. It was only after the final split between Serbia and Montenegro in 2006 that the .yu domain stabilized, both in the number of websites and in network density. Shortly afterwards the network thinned out in preparation for the replacement of the .yu domain with the new ccTLDs .rs and .me. In other words, the intra-domain linking patterns of the .yu domain are closely tied to stability and sovereignty.
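For readers unfamiliar with the term, one common way to measure network density is the number of links actually observed divided by the number of links that would be possible between the same set of sites. The sketch below computes that figure per year. It is an illustration only, not necessarily the measure behind the visualization: the function name `yearly_density` and the input format of `(year, source_host, target_host)` tuples are assumptions made for the example.

```python
from collections import defaultdict


def yearly_density(links):
    """Directed graph density per year.

    links is an iterable of (year, source_host, target_host) tuples.
    Density is edges / (n * (n - 1)), where n is the number of hosts
    that appear in at least one link that year. Self-links are ignored.
    """
    edges = defaultdict(set)
    nodes = defaultdict(set)
    for year, source, target in links:
        if source == target:
            continue  # a site linking to itself says nothing about the network
        edges[year].add((source, target))  # duplicates collapse in the set
        nodes[year].update((source, target))
    result = {}
    for year in sorted(nodes):
        n = len(nodes[year])
        result[year] = len(edges[year]) / (n * (n - 1))
    return result
```

A value close to 1 means that almost every site links to almost every other site; a value close to 0 means a sparse network in which most sites are not connected to one another.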

As time goes by, Web archives are likely to hold more treasures of our deleted digital pasts. This makes Web archives all the more intriguing and important primary sources for historical research, despite the structural problems of the oblivious medium that they attempt to preserve.

Conference dispatches from Aarhus: Web Archives as Scholarly Sources

Some belated reflections on the excellent recent conference at Aarhus University in Denmark, on Web Archives as Scholarly Sources: Issues, Practices and Perspectives (see the abstracts on the conference site).

As well as an opportunity to speak myself, it was a great chance to catch up with what is a genuinely global community of specialists, even if (as one might expect) the European countries were particularly well represented this time. It was also particularly pleasing to see a genuine intermixing of scholars with the librarians and archivists whose task it is to provide scholars with their sources. As a result, the papers were an eclectic mix of method, tools, infrastructure and research findings; a combination not often achieved.

Although there were too many excellent papers to mention them all here, I draw out a few to illustrate this eclecticism. There were discussions of research method as applied both to close reading of small amounts of material (Gebeil, Nanni) and to very large datasets (Goel and Bailey). We also heard about emerging tools for better harvesting of web content and for providing access to the archived content (Huurdeman).

Particularly good to see were the first signs of work beginning to go beyond discussions of method (“the work I am about to do”) to posit research conclusions, even if still tentative at this stage (Musso amongst others), and critical reflection on the way in which the archived web is used (Megan Sapnar Ankerson). It was also intriguing to see an increased focus on understanding the nature of a national domain, particularly in Anat Ben-David’s ingenious reconstruction of the defunct .yu domain of the former Yugoslavia. Good to see too were the beginnings of a reintegration of social networks into the picture (Milligan, Weller, McCarthy), difficult to archive though they are, and some attention to the web before 1996 and the Internet Archive (Kevin Driscoll on BBS).

All in all, it was an excellent conference, and congratulations to Niels Brügger and the organising team for pulling it off.