The rise and fall of text on the Web: a study using Web archives

[The following is a guest post from Anthony Cocciolo (@acocciolo), Associate Professor at Pratt Institute School of Information and Library Science, on a recently published research study]

In the summer of 2014, I became interested in studying whether it was more than my mere impression that websites were beginning to present less text to end-users. Some websites were gaining enormous popularity while using a communicative style that had more in common with children’s books (large graphics and short segments of text) than with the traditional newspaper column. I wondered whether I could measure this change in any systematic way. I was interested in this change primarily for what it implied about literacy and what we ought to teach students, and more broadly for what it meant for how humans communicate and share information, knowledge and culture.

Teaching students to become archivists at a graduate school of information and library science, and focusing on a variety of digital archiving challenges, I was quite familiar with web archives. It was immediately clear to me that if I were to study this issue, I would be relying on web archives, and primarily on the Internet Archive’s Wayback Machine, since it has collected such a wide scope of web pages since the 1990s.

The method I devised was to select 100 popular and prominent homepages in the United States, drawn from a variety of sectors, that were present in the late 1990s and are still used today. I also decided to sample each homepage every three years beginning in 1999, resulting in 6 captures per site (1999, 2002, 2005, 2008, 2011 and 2014), or 600 homepages in total. The reason for this decision is that by 1999 the Internet Archive’s web archiving efforts were fully underway, and a three-year interval would be enough to show changes without producing a hugely repetitive dataset. URLs for webpages in the Internet Archive were selected using the Memento web service. Full webpages were saved as static PNG files.
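The sampling schedule can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names are hypothetical, `http://example.com/` stands in for a real homepage, and aiming at January 1 of each year is an assumption. It also builds timestamped Wayback Machine URLs directly (the Wayback Machine resolves such a URL to the capture nearest that time) rather than calling the Memento web service the study used.

```python
from datetime import datetime, timezone


def capture_years(start=1999, end=2014, step=3):
    """Return the sampling years: every `step` years from `start` through `end`."""
    return list(range(start, end + 1, step))


def wayback_url(homepage, year):
    """Build a timestamped Wayback Machine URL for a homepage.

    Assumption: we aim at January 1 of the given year; the study does not
    say which date within each year was targeted.
    """
    target = datetime(year, 1, 1, tzinfo=timezone.utc)
    return "https://web.archive.org/web/" + target.strftime("%Y%m%d%H%M%S") + "/" + homepage


if __name__ == "__main__":
    # Hypothetical stand-in for one of the 100 selected homepages.
    for year in capture_years():
        print(year, wayback_url("http://example.com/", year))
```

With 100 homepages and the six years returned by `capture_years()`, this schedule yields the 600 captures described above.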

To distinguish text blocks from non-text blocks, I modified a Firefox extension called Project Naptha. This extension separates text from non-text using an algorithm called the Stroke Width Transform. The percentage of text per webpage was calculated and stored in a database. A sample page with its detected text regions is shown in the figure below; that page is 46.10% text.

Text detection on the White House site
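To make the "percentage of text" measure concrete, here is a minimal sketch of one way such a figure could be computed once text regions have been detected. It assumes the detector reports rectangular bounding boxes in pixel coordinates and that the percentage is the share of the page image's area those boxes cover; both are assumptions for illustration, and the function name is hypothetical. It does not reproduce Project Naptha or the Stroke Width Transform.

```python
def text_percentage(width, height, text_boxes):
    """Percentage of a width x height image covered by text boxes.

    Each box is (left, top, box_width, box_height) in pixels. Overlapping
    boxes are counted once, and boxes are clipped to the image bounds.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # One byte per pixel: 1 if the pixel lies inside any text box.
    covered = bytearray(width * height)
    for left, top, box_w, box_h in text_boxes:
        x0, y0 = max(left, 0), max(top, 0)
        x1, y1 = min(left + box_w, width), min(top + box_h, height)
        if x1 <= x0:
            continue  # box lies entirely outside the image horizontally
        for y in range(y0, y1):
            row = y * width
            covered[row + x0:row + x1] = b"\x01" * (x1 - x0)

    return 100.0 * sum(covered) / (width * height)


if __name__ == "__main__":
    # Hypothetical boxes on a 100 x 100 image; the two boxes overlap.
    print(text_percentage(100, 100, [(0, 0, 50, 50), (25, 25, 50, 50)]))
```

Marking pixels rather than summing box areas keeps overlapping detections from being counted twice, at the cost of memory proportional to the image size.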

Once the percentage of text for each webpage and year was computed, I used a statistical technique called a one-way ANOVA to determine whether the percentage of text on a webpage was a chance occurrence or instead depended on the year the website was produced. I found that these percentages were not random occurrences but depended on the year of production (what we would call statistically significant).
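For readers unfamiliar with the technique, a one-way ANOVA compares the variation between group means (here, between years) with the variation within each group. The sketch below computes the F statistic from scratch using only the standard library. The function name and the numbers in the example are made up for illustration and are not the study's data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of groups.

    Each group is a list of numbers, e.g. the text percentages of all
    homepages captured in one year.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Made-up toy values, one list per capture year (not the study's data).
    toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(one_way_anova_f(toy_groups))
```

A large F relative to the F distribution with these degrees of freedom suggests the group means differ by more than chance. In practice a library routine such as `scipy.stats.f_oneway` would typically be used, since it also returns the p-value.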

The major finding is that the amount of text rose each year from 1999 to 2005, at which point it peaked, and it has been in decline ever since. Thus, website homepages in 2014 have 5.5% less text than they did in 2005. This is consistent with other research using web archives that indicates a decrease of text on the web. This pattern is illustrated below.

Mean percentage of text on pages over time

This study necessarily raises the question: what has caused this decrease in the percentage of text on the Web? Although it is difficult to draw definitive conclusions, one suggestion is that the first Web boom of the late 1990s and early 2000s brought about significant enhancements to internet infrastructure, allowing non-textual media such as video to be more easily streamed to end-users. (Interestingly, 2005 was also the year that YouTube was launched.) This is not to suggest that text was replaced with YouTube videos, but rather that easier delivery made other modes of communication, such as video and audio, more practical, which may have helped unseat text from its primacy on the World Wide Web.

I think the study raises a number of interesting issues. If the World Wide Web is presenting less text to users relative to other elements, does this mean that the World Wide Web is becoming a place where deep reading is less likely to occur? Is deep reading now only happening in other places, such as e-readers or printed books (some research indicates this might be the case)? The early web was the great delivery mechanism of text, but might text be further unseated from its primacy and the web become primarily a platform for delivering audiovisual media?

If you are interested in this study, you can read it in the open-access publication Information Research.