Notes on digital preservation and Jason Scott's JCDL keynote

June 18, 2012
Tags: archives, curation, DC, library, preservation, tech

I was fortunate to be able to spend some time in Washington, D.C., for the Joint Conference on Digital Libraries (JCDL 2012) last week. I went as part of a group from the Center for Informatics Research in Science and Scholarship at GSLIS, where I currently work on the DH Curation Guide. The attendees were a diverse bunch: many from libraries, many from computer science, and many international.

The three keynote speakers were my favorite part of the proceedings. Jason Scott, the first, is a historian of the internet and an advocate for large-scale web preservation on behalf of web users. He coordinates the Archive Team and runs a site I'd just stumbled across a few months ago. His talk, entitled "All You Cared about Is Gone and All Your Friends Are Dead: The Fun Frolic of Preservation Activism," opened the conference. Two points stuck with me:

  • Mass deletion of people's web content should be a bigger deal. Millions of web users have made websites using services like Geocities or AOL Hometown that were later shut down entirely for business reasons. But the millions of users' websites are "data that are not just business [or disposable kilobytes to a corporation], but data that human beings have made, not understanding" that companies are very good at deleting files, and quick to do it. When a site like Geocities shuts down and deletes those files after a short period of time, Scott equates that to eviction, except that there is no legislation to protect users who may have spent hours on their sites and whose only mistake may have been not checking their email or not knowing how to save their stuff. When you use "convenient" no- or low-cost services, your valuable stuff can be deleted without your consent. As Scott says, "We have no contingency when it dies." (Unless you are lucky enough to have your site preserved by the Internet Archive and the Archive Team.) Increasing value is being placed on our data as our lives become more digital (think of family photos that only exist as .jpgs), yet this data is still subject to the winds of the economy or the whims of a CEO.
  • Simple web services can be black boxes. The current trend in tech is simplicity, a sea change in design spurred on by Apple's design ethos and the small screens of our mobile devices. Simple means visually uncluttered and easy for everyone to understand. I can't explain how Dropbox works so seamlessly, but I'm not about to delve into what happens to my data on their end or what their long-term plan is, as long as I can use the service now at my convenience. Data services like Facebook, Flickr, Twitter, iCloud, and thousands of others are easy to use because they are black boxes. Your data may or may not be safe; it may or may not be exportable. Scott: "We have done so much to remove users' understanding of the underpinnings" of these companies to whom personal data is entrusted. I enjoyed his analogy: when airplanes were first invented, every person on the aircraft knew how to fly it. Now only two out of hundreds do.

Motivated by the desire to save our most recent and prolific cultural record, Scott and other web archivists like him have taken to downloading at-risk websites en masse. For example, when the website service Tabblo was unceremoniously shut down, giving its 1.2 million users 15 days to download their stuff with no export function, 49 archivists downloaded all of Tabblo in 36 hours. Here's a wiki page that details how. (Scott: "We're not hackers!… We do crash servers sometimes, but that's just because they don't plan well.")


Some of the other CIRSS students and I had a late-night discussion about the value of these massive archival efforts. Websites like those hosted by Tabblo, Geocities, Delicious, Ning, or any of the sites on Archive Team's Deathwatch list: how valuable are those, really? Most were made by people who expected to forget about them after a few years. Do they really need to be preserved in perpetuity? Here are a few arguments we touched on that night, which I have since expanded on:

Argument A: Large-scale, long-term preservation is too much effort for mostly worthless content.
Digital data are not invisible, lightweight, or non-physical. It takes massive resources (financial, natural, human) to keep large digital archives preserved and accessible. Is it worth keeping thousands of servers and storage devices running to hold data hardly anyone will access? Is it realistic to think that we can ensure that data will be properly migrated or emulated when a format's lifetime ends? What is actually sustainable? Certainly not preserving the whole of the web forever. Moreover, a lot of web content is junk. Pure junk with little real significance. Spam, ads, broken links, emo blogs, vast quantities of offensive material — is there really an argument for saving the dross of the popular web?

Argument B: Some web content should be selected for long-term preservation using appraisal methods.
In traditional archives, materials are appraised prior to being accessioned. Appraisal involves determining the value of the acquisition: whether it should be kept in the first place, and if so, for how long. If there is too much of a generally homogeneous collection of materials, archivists may choose to sample (select enough representative documents to stand in for the whole, and toss the rest). Maybe these giant collections of web content should be sampled instead of saved in their entirety, so future historians can get a glimpse of the past web without having to curate yottabytes of data. Assigning value may be one of the most difficult and fuzziest parts of data curation, but it is necessary. Already, web preservation organizations like Archive Team use their resources collectively to rescue dying sites or, like users of Archive-It, to focus collections of web content.

Argument C: Preserve all the content!
The content of the web forms part of our cultural record, and for the next maybe 80 years some of us will have a direct personal connection to the content (like my first web site). Diving into individual websites, future historians and other just curious folk will see up close what early web content was like. The animated gifs, the pre-CSS sites, the forums. More than that, what people cared about. These ephemeral websites are our commonplace books. From a view further up, having these many TBs of data is a gold mine for emerging data mining technologies. Think of all that text! How incredibly interesting would future text mining projects be? All that data is historical data. More compassionately, all those websites were made by people, and no one but they should decide whether to keep or delete their creations. It's not sufficient to assume people will back up the data they hand to these services, and it's not nice to revel in a web user's despair when her site is totally gone with no other copy.

As for sustainability, while it does take a lot of energy and physical resources to run a non-selective archive, the Internet Archive seems to be attempting to do the impossible as their spiders capture zillions of sites every day. And at a high enough level, the act of appraisal itself can be unsustainable. A shockingly high percentage of traditional archives are crippled by a backlog of 50% or more because the acts of appraisal, arrangement, and description are so incredibly time-consuming and expensive. We may not have the time or money to determine whether something is valuable, particularly when sites that are shutting down give users two weeks to rescue their data.

I personally fall on the side of yes, it is worth it to preserve these millions of pieces of web content. The internet tells such a rich story of human activity. I am optimistic that technology and preservation practices are advancing quickly enough that we can store and preserve these records sustainably.

Readers and passersby, what are your thoughts?

