One of the first things you need to get your head around when you dive into the history of the internet is that “the internet” and “the web” are not the same thing. That sounds trivial to most people who have worked in the sector for any period of time. But trust me – it isn’t.
It’s a problem because we have been archiving the web systematically for quite a long time. The British Library’s archive has pages stored from 1996 onwards. So for someone relatively new to using web archives as a scholarly source, I can access a lot of information.
As someone whose family got their first internet connection in 2000, however, I also know that there’s a lot that won’t be stored. And there is a lot that will be stored that I won’t be able to access. Internet Relay Chat, for example, was very popular when I first got access to the ‘net. From those MSN chat rooms (that were eventually shut down due to the… er… “unpleasantness”), to the use of purpose-made clients to connect with friends, chat was by its nature ephemeral. Perhaps some user would have kept a log of the conversation (and I probably have a few of those on text file somewhere). More than likely, they didn’t. Or even if they did, the chances of them surviving are slim.
The advent of Facebook and Twitter and their ilk in the mid-2000s has also complicated matters. Pretty quickly it became apparent that these social networks were culturally important and would probably need to be preserved. But the ethics of such an undertaking are complicated to say the least. It’s one thing to do a “big data” analysis of the rise and fall of the term “hope” over the 2004 US General Election. It’s another to do a “close reading” analysis of the behaviour of teenagers. Since it’s all held behind password-protected pages and servers, our old web-crawling techniques aren’t going to help. The Library of Congress is collecting Twitter. But how we will actually use it in the future remains to be seen.
Moreover, with social media, chat logs, e-mails, and various other “non-web” internet data, we cannot be certain about how systematic or representative our source base is. There is great potential for our research findings to be skewed. (Not, of course, that the web archive is objective and clean either. But I digress.)
This matters to me as a historian because I am not a computer scientist. I wouldn’t even consider myself a historian of the internet. Much like I use biographies, diaries, government papers and objects to build a story of the past, internet sources are yet another way of finding out what people said and did. A good historian would never assume a diary to be an accurate, objective account of past events. There is always an inherent bias in which data survive. Just as she would also understand that regardless of the amount of sources she collates, there will always be gaps in the evidence.
The problem, really, is twofold. First, there is so much material available it gives both the illusion of completeness and the temptation to try to use it all. Second, because it lacks the human curation element so central to “traditional” archives, it can be difficult to sift through the white noise and home in on the data that matters to our research questions.
The first part is relatively difficult to get over, but not impossible. It simply requires some discipline and better training on what internet archives can and cannot do. From there, we can apply our knowledge and discretion to only focus on the parts of the archive that will actually help us – and/or adapt our research questions accordingly.
But that second bit is always going to be a problem. Again, discipline can help. We can simply accept our fate – that we will never have it all – and focus our histories on the scraps that remain. Like Ian Milligan’s work on the archive of GeoCities. Or Kevin Driscoll’s on the history of Bulletin Board Systems. At the same time, how does a historian of the 1990s try to use these archives to try to access the people of the period? How on earth can this material be narrowed down? Will we always have to keep our “online” and “offline” research separate?
The exciting thing is that we don’t have fully developed answers to these questions yet. The scary thing is that it’s our generation of scholars that are going to have to come up with the solutions. This seems like a lot of work. If anyone is willing to do it for me, I would be forever grateful!