Thursday, July 23, 2009

How to archive the web

The following are my notes and thoughts from the Web Archiving Conference held at the British Library on July 21st, 2009.

The meeting was organized jointly by JISC, DPC and the UK Web Archiving Consortium and attracted more than 100 participants. It was chaired by William Kilbride, Executive Director of the DPC, and Neil Grindley, programme manager for digital preservation at JISC. The presentations are available here.

Adrian Brown of UK Parliamentary Archives raised the interesting issue of how to preserve dynamic websites, ones that personalize on the fly. If every page on a website is individually created per user, then what version do you archive?

He also talked about versions across time. For instance, what is the best way to archive a wiki: take a snapshot every so often, or archive the full audit trail? Versioning is also an issue when a site is harvested over a period of time, so that there is a chance the site has been updated between harvests. He called this a lack of temporal cohesion, or temporal inconsistency.

Someone from the BBC noted that "the BBC used to only record the goals in football matches and not the whole match". Now they realize how stupid this was; hence we should avoid the same pitfall of applying too much collection decision-making to archiving. This touches on one of the main issues facing web archivists: what to collect and what to discard? Most seem to make this decision on pragmatic grounds, e.g. do we have permission to crawl or archive? How much budget do we have? Do we have a mandate to collect a particular domain?

It strikes me that this is only a problem when there is a single collection point. The reality is that all sorts of people all over the world are archiving the web, from many different perspectives, all at the same time. If enough people and organizations do this, then all of the web will be archived somewhere, sometime. If, for instance, there were a referee foundation archiving football matches for training purposes, and a football coaching organization, and the two clubs playing, then it wouldn't matter that the BBC only saved the goals. The problem was that the BBC were the only ones filming the matches - a single collection point.

This touches on another main issue: the relationship between the content creator and the archivist. More on that later.

Peter Murray-Rust was quoted several times during the meeting. This is intriguing, since he mostly seems to advocate against building digital archives, which he thinks are effectively impossible and a waste of time. Instead, we should disseminate data as widely as possible: if people are interested enough, they will take copies somehow. Or, as he puts it, "Create and release herds of cows, not preserve hamburgers in a deep-freeze". The wider point here is that web archives should be part of the web themselves, rather than hidden away in offline storage systems.

Another big issue here: access. If the archive is fully accessible, then how do you know whether what you find through Google is the archived version or the live version? And suppose there are multiple copies of the entire web, archived by different institutions, all accessible at the same time? Sounds like chaos to me. A chaos that only metadata can solve. Or so it seems to me.

I think it would help if there were metadata standards for archiving of websites. It could be a minimum set of data that is always recorded along with the archived contents. Archives could then be made interoperable either by using the same metadata schema or by exposing their metadata in some sort of data dictionary that is addressable in a standard way. If the standards are adhered to it would be possible to de-duplicate archived websites and easily identify the "live" version. It would also be easy to keep track of the versions of a website across time so that a single link could resolve to the multiple versions in the archive.
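To make the idea concrete, here is a minimal sketch of what such a "minimum set of data" might look like. The field names and archive identifiers are my own invention for illustration, not an existing standard; the point is that a content fingerprint plus a capture timestamp is enough to de-duplicate copies and line up versions across archives.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ArchiveRecord:
    """A hypothetical minimum metadata record for one archived page."""
    original_url: str    # where the live version was captured from
    capture_time: str    # ISO 8601 timestamp of the harvest
    content_sha256: str  # fingerprint of the archived bytes
    archive_id: str      # which institution holds this copy

def make_record(url: str, body: bytes, archive_id: str) -> ArchiveRecord:
    return ArchiveRecord(
        original_url=url,
        capture_time=datetime.now(timezone.utc).isoformat(),
        content_sha256=hashlib.sha256(body).hexdigest(),
        archive_id=archive_id,
    )

# Two institutions harvest the same page; the shared hash reveals the duplicate.
a = make_record("http://example.org/", b"<html>hello</html>", "bl-ukwac")
b = make_record("http://example.org/", b"<html>hello</html>", "nla-pandora")
print(a.content_sha256 == b.content_sha256)  # True: same content, de-duplicable
```

If every archive exposed records like these in an addressable way, resolving one URL to all its archived versions would be a simple query over (original_url, capture_time).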

Kevin Ashley made the point that we should not only collect the contents of the web, but also content about the web, if future generations are to make sense of the archive. One simple example is the words used on websites that are archived today. Perhaps we need to archive dictionaries along with the content so that 100 years from now people will know what the content means.

There seems to be a consensus in the web archiving community to use the WARC format to capture and store web pages. As I understand it, this is a format to package and compress the data, including embedded images, PDFs, videos and so forth. When the record is accessed, it is presumably unpacked and delivered back as web pages. But what if the embedded file formats are no longer compatible with modern operating systems or browsers? One answer to this problem is to upgrade the archive files to keep pace with new software releases. Presumably this means unpacking the WARC file, converting the embedded formats to the new versions, then repacking.
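The pack/unpack idea can be sketched in a few lines. This is a deliberately simplified imitation of a WARC record (a block of named headers, a blank line, then the raw payload bytes), not the real format: actual WARC files have mandatory record IDs, digests and per-record gzip compression, so a proper library should be used in practice.

```python
# Simplified sketch of a WARC-style record: headers, blank line, payload.

def pack_record(target_uri: str, payload: bytes) -> bytes:
    """Package harvested bytes with minimal WARC-like headers."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode("utf-8") + payload + b"\r\n\r\n"

def unpack_record(record: bytes) -> tuple[dict, bytes]:
    """Split a packed record back into its header fields and payload."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    length = int(fields["Content-Length"])
    return fields, rest[:length]

raw = pack_record("http://example.org/", b"<html>archived page</html>")
fields, payload = unpack_record(raw)
print(fields["WARC-Target-URI"], payload)
```

The "upgrade" strategy described above would operate at the unpack step: read each record out, migrate the payload to a current format, and write a fresh record back.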

Jeffrey van der Hoeven believes that emulation is a solution to this problem. He is part of the project team that developed the Dioscuri emulator. He is currently working to provide emulation as a web service as part of the KEEP project.

If you would like to dig into the history of browsers, there is an archive of web browsers available online, including the one Tim Berners-Lee built in 1991, called simply "WorldWideWeb".

Probably the single biggest issue facing web archivists is permissions. Obtaining permission to crawl and archive is time-consuming and fraught with legal complications. Large institutions like the British Library take great care to respect the rights of content creators; as a result, UKWAC is unable to harvest up to 70% of the sites it selects. Others operate a remove-upon-request policy. Edgar Cook of the National Library of Australia reported that they have decided to collect even without permission; they just keep the content dark if no permission to archive is granted. Edgar challenged the group: "Are we being too timid - hiding behind permissions as an explanation for why archives cannot be complete?" Several people noted that it was difficult to reach out to content creators; Helen Hockx-Yu said "communication with content creators is a luxury".

I wonder if this is perhaps the most important issue of all: connecting the creator to the archivist. It seems to me that to be successful, both need to care about digital preservation. I think Edgar Cook is right: the danger in hiding behind permissions, or hoping for strong legal deposit legislation, is that it avoids the issue. Content creators need to understand that they have a part to play in keeping their own work accessible for future generations. Archive organizations have a big role to play in helping them understand that. For instance, archives could issue badges for content creators to place on their websites to show that their work has been considered worthy of inclusion in an archive.

Kevin Ashley set me thinking about another idea. Suppose there was a simple self-archiving service that anyone could use for their own digital content. In return for using this tool, content creators would agree to donate their content to an archive. It would be a little like someone donating their personal library or their collection of photos upon their death. Except this would be a living donation, archiving as the content is created, in a partnership between creator and archive. Mind you, I am sure that a simple self-archiving tool will be anything but simple to create.

Indeed it is clear that web archiving is not at all easy. There are lots of questions, problems, issues and challenges and this meeting highlighted many of them. Unfortunately, there don't seem to be too many answers yet!


  1. Hi Jonathan

    Although no lesson from history can be dismissed out of hand, I think it's misleading of BBC people to present their former flaky institutional preservation practices as though in a vacuum. One reason the decision was taken to overwrite those 'priceless' Hancocks and Dr Whos, etc., was that the tape was expensive (or at least had been adopted expressly in order to be reused, saving costs on film - there is an interesting discussion of the issues elsewhere).

    No preservation activity is immune to cost implications, and that's one reason selection is essential, even if sometimes they get it wrong. Storage may seem cheap, but managing it effectively isn't. There's an interesting comparison of the relative merits of selective versus whole-domain crawling in a paper by Pymm and Wallis.

    One problem is that doing anything at all (and doing it 'right') can seem so complicated and expensive that smaller institutions with limited resources may struggle to maintain their 'institutional record' in an ever more complex web environment. Part of our idea with ArchivePress is exactly to make it easy for institutions (we have universities in mind) to do /something/ about this, at least with potentially valuable blog content, and PMR has been supportive - institutional archives represent, after all, another 'copy' in the dissemination stream!

  2. Hello Jonathan

    Ed Pinsent of ULCC calling...

    >Probably the single biggest issue facing web archivists is permissions. Obtaining permission to crawl and archive is time-consuming and fraught with legal complications.

    I would certainly agree with the second half of that. I've been working with UKWAC since its inception, collecting project websites for the JISC. We too found permissions to be something of an overhead, until I revised the permissions form to try and make it clearer what we were asking. The way I see it, through UKWAC we're asking for permission to:

    1) Make a copy of website content
    2) Continue to make copies on a regular basis
    3) Republish the harvested content on the UKWAC website

    It's the third item - republication - that in my opinion causes the most contention, particularly when there's third-party content on the site (e.g. submitted images, or comments on a blog) which the website owner can't really sign for.

    On the other hand, if UKWAC simply copied the material with no immediate intention to republish the copies (currently, that republication takes place immediately upon submission of the gather to the archive), then perhaps we'd be in a position to start managing the 'dark archive' of which NLA were speaking. We'd still have DRM issues, for sure, but at least we could be undertaking more comprehensive collection sweeps; the republication problems could be revisited later. In the meantime we would have gathered content otherwise in danger of vanishing.

    I am always mindful of what one website owner said to me in 2006; not only did he refuse to sign a permissions form, but he also thought it was absurd for me to even ask him:

    "Neither Google, any search engine nor the Internet Archive have ever asked for permission to download, store and represent material from my web site. They rely on the permission accorded by the ROBOTS.TXT file on my web site."

    In that light, what Heritrix and other web crawlers do may not be substantially different to Google.
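    The robots.txt convention the website owner is relying on can in fact be honoured programmatically; Python's standard library ships a parser for it. A minimal sketch (the robots.txt content and crawler name below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that admits all crawlers except under /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# An archival crawler can apply exactly the same test a search engine does.
print(rp.can_fetch("heritrix", "http://example.org/index.html"))  # True
print(rp.can_fetch("heritrix", "http://example.org/private/x"))   # False
```

    On this view, a harvester that checks robots.txt before crawling is claiming the same implied permission that Google and the Internet Archive already rely on.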

    There's another question underlying permissions, and it's a bigger problem: who really is the 'owner' of a website, and can that person grant 'permission' to harvest? My personal view is that much web-archiving activity is predicated on the notion of the website as a "book": an entity with a single title, a single author, and a single publisher. But there are many websites and other web resources which simply do not match that profile. Yet the library model prevails, and continues to govern collection policy, archiving procedure, metadata, and even the nature of software development to some extent.

    Disclaimer: these are just my own personal views, not those of UKWAC or the JISC.