Sunday, October 11, 2009

Notes & Thoughts from iPres 2009

This was the Sixth International Conference on Preservation of Digital Objects (iPres). These conferences "bring together researchers and practitioners from around the world to explore the latest trends, innovations, and practices in preserving our scientific and cultural digital heritage". This one (iPres 2009) was hosted by the California Digital Library (CDL) and took place in the conference centre of UCSF Mission Bay in sunny San Francisco. CDL did a really superb job in organising the event.

The iPres community is made up of people from academic libraries, national libraries, national archives, service providers (like ExLibris, Tessella, Sun, Ithaka, LOCKSS etc.), web archivists, preservation researchers, and me.

The following are some notes I made during the conference and some random thoughts that occurred to me while listening to the presentations and tuning in on the #ipres09 Twitter channel. Other blogs covering this event were: Digital Curation Blog, Daves thoughts on stuff, FigoBlog (in French) and Duurzame Toegang (in Dutch). Photos from the event are here and here.

Day 1

David Kirsch (University of Maryland) gave a thought-provoking keynote address on the need to preserve corporate records: "Do corporations have the right to be forgotten?". There are many reasons why corporate records are lost or destroyed. Often it is simply that record keeping has a low priority in the corporate world - few companies have formal policies for digital preservation. Furthermore, lawyers tend to advise companies to destroy records to avoid possible future liabilities. David argued that there is a public interest in preserving the records of corporations for research purposes. He had some great ideas on how this might be achieved. One of the ideas was to give archivists the option to claim company records in bankruptcy courts. Brilliant.

Panel discussion on sustainable digital preservation

The Blue Ribbon Task Force has been studying the issues of economic sustainability for digital preservation. Their final report is due in Jan 2010 (the interim report is available here).

Many digital preservation activities are funded through one-off grants or discretionary funding. This is obviously not a sustainable source that would guarantee long-term preservation - someone has to pay to keep the servers humming and the bits from rotting. There is also the "blank cheque" problem: few funding bodies are comfortable agreeing to support preservation of an unknown amount of digital data for an indeterminate length of time.

A few groups are beginning to provide paid-for archiving services. One of the most interesting is CDL’s easy-to-use Web Archiving Service.

Abby Smith noted that a key discussion area for the task force had been the handover points when stewardship of information passes from one party to another. This was one of the most insightful moments of the conference for me: the importance of designing preservation systems in such a way that they can be passed on to someone else to continue the stewardship of the data. This seems to me a much more manageable problem to work on than how to preserve an infinite amount of data for an infinite length of time. Handover of stewardship was one of the main drivers behind the development of the DOI system: how to make sure that digital objects remain findable when ownership of the rights changes. I wonder if the registration agency model that the International DOI Foundation (IDF) uses might be helpful here (Disclosure: I am Chair/Director of IDF).

Henry Lowood (Stanford) spoke about preserving virtual gaming worlds. My first thought was: why would anyone bother to preserve virtual worlds? I realized, however, that the world is full of collectors of stuff; preservation of anything starts with someone who cares enough about it to spend time and money on building an archive.

One of the big challenges in preserving multi-user games is that the game environment itself has been built by many gamers. Figuring out who built what element and asking them for permission to archive is a major headache. Another problem is that it is not enough simply to take screenshots since the way the game has been played is part of the essence of what needs to be preserved. In other words, the game content and game engine are indistinguishable from one another. It seems to me that this may be the future for all content. There will be a time when the content alone is almost meaningless without the context in which it was created. If we want to preserve the content, how do we capture the context along with it? And in a world of multi-user-generated content, how will we ever find out who created what piece of content? Maybe we need a digital data donor card, where you can formally donate the data you have created in your life to the public domain, and record this fact permanently so that future archivists can mine your data with your posthumous permission.

Reinhard Altenhoner, German National Library, argued that most digital preservation projects have been single initiatives, creating safe places for information. Few have looked at the wider e-infrastructure implications. In what environment do the islands of activity operate? Where is the common approach? He proposes a service layer architecture for digital preservation: decoupling system components and working on interoperability, open interfaces etc.

And that is exactly what CDL are doing. Stephen Abrams, in what was for me one of the best presentations of iPres, told us how. CDL believes that curation stewardship is a relay race: we should concentrate on doing the best job now, then hand over to someone else to continue the stewardship. With this in mind, their approach favours the small and simple over the large and complex, and the re-use of existing technologies rather than creating new ones.

They have come up with “interoperable curation micro-services” - a definition of the granularity of services for a full preservation system. They group these into 4 service layers, characterized as:

1) Lots of copies keep stuff safe (providing safety through redundancy)

2) Lots of description keeps stuff meaningful (maintaining meaning through description)

3) Lots of services keep stuff useful (facilitating utility through service)

4) Lots of uses keep stuff valuable (adding value through use)

There is much more detail in their conference paper, which I should imagine will be required reading for all iPres delegates.

Pam Armstrong and Johanna Smith from Library and Archives Canada had an interesting story to tell. Some time ago, their national auditor wrote a withering report on record keeping within the Canadian government. They used this as a big stick to drive compliance on better record keeping and archiving. Very cleverly though, they also developed a useful plug-in that made it easy for the government staff to comply.

Lesson learned: if you decide to use a stick instead of a carrot, then make sure you provide some protection so that it doesn’t hurt too much!

Robert Sharpe, Tessella, talked about the results of a survey they had done for Planets (which seemed to me to be very similar to the PARSE/Insight one earlier this year). An interesting correlation appeared to be that if an organization has a formal policy for digital preservation, it is much more likely to have funding for DP. I wondered whether the answers to this question on the survey were skewed. I mean, if you had funding for digital preservation, how could you ever admit on a survey that you didn’t have a policy?

Their final conclusion was that more work needs to be done to fully understand the landscape. In my experience, that usually means the original questionnaire was not especially well thought out nor pilot tested before sending out. Perhaps the survey was more of a plea for awareness for the issues around digital preservation?

Actually I think questionnaires are a really poor way of gaining insights into complex issues - and let's face it, what important issues are not complex nowadays? The interesting insights are rarely contained in yes/no answers but in the discussion that goes on in someone’s mind or within an organization in answering the question. I find that understanding why an answer was "yes" or "no" is usually much, much more interesting than the answer itself. Questionnaires also ignore the political angle - I mean, what National Library tasked with digital preservation could ever answer “no” to the question “do you have a documented preservation policy?”, even if they do not?

Ulla Kejser, Danish National Library, presented a model for predicting costs. They came to the conclusion that there is a strong dependency on subjective cost assessment, either in deciding how to map a framework like OAIS or simply in the prediction of the cost elements.

Interesting that they took care not to assume potential cost savings upon system deployment - they stress that to do that you need an organization that is capable of learning and re-applying that learning if you are to realize cost savings from re-use.

It seems to me that all the costing models suffer from the same drawback: when you scale the number of preserved records to a large number, even a tiny estimation error will be magnified hugely. The underlying problem is that it is impossible to predict accurately something that has not been done before. This is something that is very familiar to anyone involved in agile projects. I wonder if the estimation and planning techniques used there might be useful here.
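To make the scaling point concrete, here is a rough back-of-the-envelope sketch. The per-record cost (0.02) and the 5% estimation error are numbers I have made up purely for illustration; they do not come from the Danish model or any other costing model presented at iPres.

```python
def total_cost_range(records, unit_cost, relative_error):
    """Return (low, expected, high) total cost when the per-record
    estimate may be off by +/- relative_error (e.g. 0.05 for 5%)."""
    expected = records * unit_cost
    spread = expected * relative_error
    return expected - spread, expected, expected + spread


if __name__ == "__main__":
    # Hypothetical figures: 0.02 currency units per record, 5% error.
    for n in (1_000, 1_000_000, 1_000_000_000):
        low, expected, high = total_cost_range(n, unit_cost=0.02, relative_error=0.05)
        print(f"{n:>13,} records: {low:,.0f} to {high:,.0f} "
              f"(uncertainty of {high - low:,.0f})")
```

The relative error is the same on every line, but the absolute sum at stake grows in step with the number of records, and it is the absolute sum that a funder has to commit to.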

Day 2

Micah Altman, a Social Scientist from Harvard, spoke on open data. He maintains that journal articles are summaries of research and not the actual research results (I think he is missing the fact that articles also contain hypotheses and conclusions, not only summaries of work undertaken). His main point though is that researchers need access to the underlying data, which I wholeheartedly agree with. How we do this is trickier. I think the TIB in Germany have made a great start by defining persistent identifiers for scientific data - see here - but there is lots more to do. It has always been very hard to peer-review scientific data since the reviewer does not usually have access to the software needed to view the data or e.g. to run a simulation.

I think Altman also forgets that publication and dissemination is an annoying necessity for many scientists; it is not something they enjoy nor wish to spend a lot of time on, let alone preserving it. Most researchers just want to do research. Making it easy for them to cite, to archive and preserve is the key, I think.

Martha Anderson of NDIIPP relayed an interesting observation from Clay Shirky that each element of digital preservation has a different time dynamic and a different lifecycle. His advice was that “the longer the time frame of an element, the more social the problem”. Thus the social infrastructure around digital preservation is more important than the technical aspects. That is certainly our experience with the DOI System.

She also re-iterated the Danish point that learning organizations are key for collaboration.

There was a panel discussion on Private LOCKSS Networks (PLNs). PLNs are small, closed groups of institutes that use LOCKSS technology and architecture to harvest and store data in their domains. It looks like quite an interesting model, providing an open source architecture for institutes to roll their own digital preservation. I do have a concern about the LOCKSS architecture that is probably down to the fact that I haven't studied it well enough yet. I worry about the chaos of multiple copies of multiple resources that may or may not be correctly tagged or uniquely identifiable. If I find something, how do I know if it is a copy or the original, or whether there is a difference? LOCKSS solves the problem of keeping stuff safe through redundancy, but it seems to me that in doing so it creates some new problems.
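Part of that question can at least be approached mechanically. The sketch below is my own illustration and not a description of how LOCKSS works: it assumes you can get at the bytes of each copy, and the holder names are invented. It groups copies by a SHA-256 digest so you can see which ones are bit-for-bit identical.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """SHA-256 digest of the content, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def group_copies(copies: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each distinct digest to the holders whose copy has that digest."""
    groups: Dict[str, List[str]] = {}
    for holder, data in copies.items():
        groups.setdefault(digest(data), []).append(holder)
    return groups


if __name__ == "__main__":
    # Hypothetical holders; the third copy differs by one character.
    copies = {
        "library-a": b"annual report 2009",
        "library-b": b"annual report 2009",
        "library-c": b"annual report 2008",
    }
    for d, holders in group_copies(copies).items():
        print(d[:12], holders)
```

This tells you whether two copies differ, but not which one is the original. That still needs agreed identifiers and provenance metadata, which is the part I worry about.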

Ardys Kozbial showed that Chronopolis do a good job for their clients, with a straightforward, uncomplicated solution for multiple file types. Their dashboard for partners / data providers showing where their data is in the process queue was very impressive.

Christopher Lee showcased ContextMiner - a multi-channel agent that crawls multiple sources for multiple keyword searches. Shows just how easy web crawling has become.

Jens Ludwig, University of Goettingen, reported that Ingest was the biggest cost factor in digital preservation. This was hotly disputed during the discussion and on Twitter. He and his team have produced a guide on how digital information objects can be ingested into a digital repository in a manner that facilitates their secure storage, management and preservation. The draft is available here.

Emmanuelle Bermès, French National Library, gave a terrific presentation on the transition from Library to Digital Library and the impact on the people. She described two phases in the transition. The initial phase was characterized by “digital is different”: digital was driven by experts/early adopters, there was a separate organization, the culture was learning by doing. Now that the Library is fully digital, they have integrated the digital skills and tasks into all areas of the Library, running them as production teams; there are training programmes throughout the library open to all.

Interestingly, she described the transition as a dissemination, a spreading out of the skills learnt in the initial phase throughout the rest of the organization. The key insight for me was that since it is the people who carry the expertise, you must spread those people around the organization if you hope to spread their skills.

They provide multi-day training (7 days in total: one day of introduction and three 2-day courses) with a dedicated training curriculum on digital information management, metadata, digital libraries, digitization and digital preservation. The really clever thing they did was to open these courses up to everyone, not only those people who needed the training for their day-to-day activities, but also those who wanted to be there out of curiosity. Genius.

They have started a project to look at the impact of digital on people, processes and organization, and figure out innovative ways of doing it better.

I think there is much that other (National) Libraries could learn from the BnF, not to mention Publishers struggling with the transition from print to digital. It was a great presentation to end the conference with.

I collected the following memorable quotes during the Conference:

Henry Lowood “The world will end not with a bang but with the message: network error; server has shut down”

Rick Prelinger “Developing a 4 dimensional map of the world - how the world looked in space and time”

Martha Anderson “Collaboration is what you do when you cannot solve a problem on your own”

Adam Farquhar “Do you worry about having too many of your nodes in the same cloud?”

Jens Ludwig “You should define ingest as a transfer of responsibility and not as a technical transfer”

David Kirsch “There are more entrepreneurial ventures started in a year than there are marriages”

Pam Armstrong "Alone we can go faster, together we can go further"

And finally, some random thoughts that occurred to me:

Perhaps the biggest steps forwards have been when public and private interests match. Maybe that is the key to digital preservation - finding areas where the interests meet. Tricky since the timeliness / time dynamics are so different. Is there a market for preservation? What would that be?

Struggling with the size of the problem - infinite data stored for infinity is just too big and hairy to cope with. For funding agencies it must seem like a blank cheque, very scary.

I loved the fantastic film footage from Rick Prelinger. He has a dream to recreate the world in space and time: recovering footage of the same place taken over time, to see how things have developed.

Pure genius: Prelinger would like an iPhone app that knows where you are and in what direction you are looking, then shows you videos or stills of what it used to look like in the past.

Scientific data is meaningless without the environment / system in which the data was created - just like the virtual world game content is meaningless without the engine with which it was created.

Persistence comes from a persistent, acknowledged need matched by a persistent funding model: a sustainable business model, sustainable government funding, or some other investor. Either way there has to be something in it for all the actors, otherwise there is no balance. So maybe the first step is to identify the actors, their needs, and what is being offered to whom for what. Also, one model may not be enough; it rarely is in the rest of the world.

One of the attributes of web service architecture is that services describe themselves and what they do. Maybe that’s a model for the distributed service model: agree the metadata for self-describing the modules, what they preserve, why and how. You could then design wrappers around non-compliant (legacy) modules to bring them into the same architecture. This is what CDL is doing for themselves, but what if the APIs they are defining become standards across the community?
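To show roughly what I have in mind, here is a toy sketch. Everything in it is hypothetical: the field names, the class names and the `describe()` method are my own invention and have nothing to do with the actual CDL micro-service specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Agreed metadata every module publishes about itself (hypothetical fields)."""
    name: str
    preserves: str
    why: str
    how: str


class LegacyTapeArchive:
    """Stand-in for an existing module that cannot describe itself."""

    def __init__(self):
        self._items = {}

    def write(self, key, data):
        self._items[key] = data


class DescribedService:
    """Wrapper that gives a legacy module the agreed describe() method."""

    def __init__(self, module, description):
        self.module = module
        self._description = description

    def describe(self):
        return self._description


def catalogue(services):
    """Build a registry of who preserves what from the self-descriptions."""
    return {s.describe().name: s.describe() for s in services}


if __name__ == "__main__":
    wrapped = DescribedService(
        LegacyTapeArchive(),
        ServiceDescription(
            name="tape-archive",
            preserves="digitised newspapers",
            why="legal deposit",
            how="two tape copies, checked yearly",
        ),
    )
    for name, desc in catalogue([wrapped]).items():
        print(name, "->", desc.preserves, "/", desc.how)
```

The legacy module itself is untouched; only the wrapper knows about the shared description format, which is what would make it cheap to bring existing systems into a common architecture.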

Collaboration only works when the parties involved are willing and able to learn from each other. Before you plan to collaborate, take the test “are you a learning organization?”. If you cannot learn from your own people, how can you expect to learn from collaboration with others?