A PDF of this presentation is also available for download on CUNY Academic Works (John Jay College’s institutional repository).
Abstract: The web is fragile and littered with broken links. This poses a problem for the scholarly record (bibliographies) and one’s own academic history. I review the stats on link rot and reference rot, and I give a brief overview of web archiving and its challenges. I review some web archiving tools: the Internet Archive, Perma.cc, WebRecorder, and GitHub. I advise creators of web projects to design their websites to be accessible and archivable, and to think about preservation (afterlife) of their projects from the start of the planning stages. I advise librarians and archivists to become familiar with web archiving tools and archivability/accessibility practices so that they, in turn, can advise web project creators.
First, I’d like to thank Kathryn Frederick for inviting me to present at this wonderful conference, and the ENY/ACRL organizers for putting it together. I have had such a great day meeting many of you and hearing about the exciting work you are doing at your institutions. I was glad to hear about preservation plans in this morning’s keynote from Marisa Parham, and to hear from Amber Billey about how linked open data can make our metadata stronger, better, and more resilient.
So I’m honored this afternoon to be presenting to you all a talk I originally titled Die Hard: the impossible, absolutely essential task of saving the web. I had visions of illustrating my points with images of Bruce Willis saving the day. But then it occurred to me later on that that would mean using in-copyright images from a major Hollywood movie without a license, and that might not be the wisest thing to do in front of a bunch of copyright buffs, so if you can, in your programs, cross out the D and write a P, because now this talk is called Pie Hard and we’re going to get through as many “pie” puns as possible in the next 40 minutes.
Edit, May 25, 2016: another belated title change: I’m appending “for scholars” to the title of this talk.
Here’s what we’re going to go over today, in a nutshell.
- I’ll be going over the basics of web preservation. I know that many of you already work with web archiving in your day to day work, so consider this a review of the basics.
- I’ll talk about recent developments in web preservation.
- And I’ll bring it around to archiving web-based scholarly projects.
I’m directing this talk to several overlapping populations in here, by the way: I’m speaking to people who use and cite information from the web in their written work, people who create web-based projects, people who teach students who create web based projects, and people who are in charge of archiving those web-based projects.
The bad news
I’ll start with a dire reminder, something my digital preservation professor in library school repeated emphatically: Everything digital dies (Jerome McDonough, class lecture, 2011).
A book or a manuscript or an artwork can sit on your shelf for years with only very minor changes — the ink may change color, or you might see a coating of dust. If you leave it alone, no problem. You can still open that book fifty or a hundred or five hundred years later and read what’s inside. But it’s not the same deal for digital items. If you leave a hard drive on your shelf, it will die all by itself. The bits will degrade over time and render the contents unreadable, if it even works when you plug it in. I’m betting many of us have experienced a hard drive failure sometime in our lives. Moreover, even if our hard drives are just fine, the software we use, the file formats — these change over time, too. I’m betting many of us have also experienced a corrupted file that just won’t open.
In the words of Neil Beagrie, just to drive my point home…
“In the right conditions papyrus or paper can survive by accident or through benign neglect for centuries, or in the case of the Dead Sea Scrolls, for thousands of years … In contrast, digital information will not survive and remain accessible by accident: it requires ongoing active management from as early in the life-cycle as possible.”
Beagrie, N. “Digital Curation for Science, Digital Libraries, and Individuals.” International Journal of Digital Curation 1.1 (2006): 3-16. Archived in the Internet Archive.
This is a fundamentally different way of preserving information than what libraries have traditionally done. To preserve something digital, regular human intervention is necessary, or we risk losing it. At home, this means that when we get a new computer, we transfer the files over from our old computer to our new one. Or we transfer home videos from old cassettes to MP4s on our computers. In a library or archive, this means that we ensure file formats can still be read. If you’ve got an old document that was saved as a Clarisworks file (remember that?), you might consider migrating it to a new file format, like to a PDF, or running emulation software, so you can view the document in its original context. And let’s be clear, digital preservation is really hard. It’s expensive, and it requires a lot of technical knowledge and elbow grease.
As hard as general digital preservation is, web preservation is even harder. Everything on the web dies faster. Links break, websites go down, web projects are abandoned, web pages change and information is deleted or lost. Encountering a broken link is so, so common that we hardly think anything of it. Imagine if that happened in your house, and things just kept disappearing. One of your shoes vanishes from your closet. An hour later your fridge has been deleted. The steps of your staircase disappear one by one. It would be chaotic.
One of my favorite quotes about web preservation:
“It is far easier to find an example of a film from 1924 than a website from 1994.”
Ankerson, M. S. “Writing Web Histories with an Eye on the Analog past.” New Media & Society 14.3(2012): 384-400. Abstract archived in the Internet Archive.
You can try this now, if you like: search for “film 1924” in Google or your local library, and you’ll find plenty. Search “website 1994” and you’ll probably only get Microsoft’s old website (archival link), which is actually just a faithful recreation (archival link).
Preserving the web is essential. I’ll touch on two big reasons and spend more time on the third.
It’s part of the historical record. Consider, for example, how officials began to realize that Malaysian Flight 17 was shot down by Ukrainian separatists who believed it was a Russian military plane rather than a passenger plane. Part of their case was a public social media post that the defense minister of the separatists wrote about having shot down a transport plane. When he realized it was a civilian plane, he deleted it, but the Internet Archive had already archived it.
(See: Carroll, K., “Kerry: Ukrainian separatist ‘bragged’ on social media about shooting down Malaysia Flight 17.” PolitiFact, July 20th, 2014. Archived at perma.cc/9CD3-SS2G.)
It’s part of our personal record. And sometimes what’s on the web is much smaller and more personal. This web page may not be important to anybody else in the world but me. This is my very first web page, created for an 8th grade history project almost 15 years ago — this is the web project that sparked my interest in web development in the first place. Thank goodness for the Internet Archive, which somehow crawled and saved my little page. The web is a place for the big and the small, the global and the intimate. Our cultural heritage may include these otherwise ephemeral materials.
It’s part of the scholarly record. And then, of course, there’s what we might talk about most today, which is the scholarly record. In writing, we support claims we make with evidence so that our readers can decide for themselves if what we’re saying is true. But what happens when that evidence that we cite disappears? What if you can’t follow the paper trail to back up a claim?
The web & the scholarly record
There have been many studies of the brief, wondrous life of links cited in academic articles. These links can be references to other academic works or to sites on the open web; for instance, an article may cite a news story or social media post.
Law: “70% of the URLs in Harvard Law Review (HLR), the Harvard Journal of Law & Technology (JOLT) and the Harvard Human Rights Journal (HRJ) … [and] 50% of the URLs within U.S. Supreme Court opinions suffer reference rot…”
(Jonathan Zittrain, Kendra Albert & Lawrence Lessig. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations, Harvard Law Review 127.4 (2014). Archived at perma.cc/S5HC-9RAS.)
STM: “The vast majority of STM articles that contain references to web at large resources do suffer from reference rot. The infection rate between 2005 and 2012 oscillates between 70% and 80%.”
(Klein, Martin, et al. “Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.” PLoS ONE 9.12 (2014). Archived in the Internet Archive.)
LIS: “49.53% of URL citations were not accessible [in LIS publications 2008-12]…”
(Kumar, D. Vinay, Kumar, B.T. Sampath, and Parameshwarappa, D.R.. “URLs Link Rot: Implications For Electronic Publishing.” World Digital Libraries 8.1 (2015): 59-66. Print. Abstract archived in the Internet Archive.)
And by the way, these numbers are probably all higher by now, as at least a full year has passed since these findings were published.
What does it mean for web content to change so quickly, like sand beneath our feet? This is the blessing and the curse of the web. Infinitely changeable, to keep up with history as it’s made. And infinitely unreliable. You cannot step in the same web twice.
So before I go on, some terms to keep in mind. These were defined in the Klein et al. paper I cited above. We hear about link rot often: these are broken links that lead to your browser saying something like “404 error.” But there’s also content drift, in which the link technically works and takes you to a page, but the page is different from what it was when first cited. This might be as simple as someone adding an update to the bottom of a news article, or it might be a custom 404 page that is branded like the rest of the site but doesn’t show the content. Lastly, there is reference rot, a kind of specific content drift. In this, the part that is cited by the article or book you’re reading does not match what the author originally cited.
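To make the distinction between link rot and content drift concrete, here is a minimal Python sketch. It assumes you recorded a fingerprint of the page when you cited it; the function names and the idea of comparing hashes are my own illustration, not something from the Klein et al. paper. Exact hashing is far too strict for real pages, where ads and timestamps change constantly, so treat this as a thought experiment rather than a working detector.

```python
import hashlib
from typing import Optional


def fingerprint(page_text: str) -> str:
    """Return a SHA-256 digest of the page text, ignoring whitespace differences."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_link(status_code: int, cited_fingerprint: str,
                  current_text: Optional[str]) -> str:
    """Classify a cited link (hypothetical helper for illustration).

    status_code       -- HTTP status returned when the link is fetched today
    cited_fingerprint -- fingerprint() of the page as it was when cited
    current_text      -- text of the page today, or None if nothing came back
    """
    # A 4xx/5xx response, or no response at all, is a broken link.
    if status_code >= 400 or current_text is None:
        return "link rot"
    # The link "works", but the page is no longer what the author saw.
    if fingerprint(current_text) != cited_fingerprint:
        return "content drift"
    return "intact"
```

Note that a branded “not found” page served with a 200 status would be reported here as content drift, which matches how I described it above.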
This isn’t purely in the hypothetical, by the way. I had trouble accessing several web resources that were listed in the bibliographies of articles about web preservation. Which strikes me as a terrible kind of irony.
Web archiving & information literacy
When I think about web archiving, I think also about our students. We impress upon them the importance of citation, citation, citation. We teach them citation formats by style! We teach them power tools like RefWorks and Zotero! But I don’t think I’ve ever told them that citing web sources by URL might be unreliable, that the sources they rest their arguments on might disappear or suffer reference rot, and that this is a problem for their research. Or, as the incomparable Jill Lepore has it…
“The footnote, a landmark in the history of civilization, took centuries to invent and to spread. It has taken mere years nearly to destroy.”
What’s more, at my institution, like many of yours, I’m sure, we’re training our students in becoming public scholars by encouraging them to conduct web-based scholarship. They curate their e-portfolios, they collaborate on group projects like a class blog or Omeka site. Yet we don’t always tell them that their own web project may be doomed to disappear, that the record of their online scholarship might be riddled with holes by the time they’re on the job market.
Web archiving now
Of course, as I have alluded to before, the Internet Archive has been hard at work for 20 years trying to save the web. They are the trailblazers of web preservation, and their coverage is commendably large. Everything from my 8th grade history project to all of our institutional webpages, I am certain. I really can’t say enough about how much they have done to preserve the history of the web — it’s truly incredible.
Buuut as with just about any far-reaching project, there are a few quirks, some things their tech can’t quite handle. In some saved pages, current content will sometimes make its way into the page, like advertisements. Sometimes this live content makes it hard to see what on the page has actually been captured. Conversely, it’s clear that some content has not been captured. Sometimes images just don’t show up for some reason, and video isn’t collected either. Lastly, it’s unpredictable. It’s hard to tell what will and won’t be captured. And part of the reason why is…
Complex websites are not archivable
And just as fast as the Internet Archive is updating its technologies to save every part of the website, more complicated bells and whistles are being thought up in web developer wonderland. A small website used to consist of a handful of files, like you see on the left. Whatever files you had, that’s all that was ever on your site. Easy enough for a crawler to archive.
But today, not only are websites much fatter in terms of how many files it takes to run them, but they’re also calling out to many other places on the web: your institution’s homepage, for instance, might display a video from YouTube or tweets from Twitter. Your library tool might use an API from the Library of Congress. None of these things is bundled into your site, and some of them might not be archivable with the Internet Archive’s current technology. Plus, even if you keep things local, your fancy website still might not be fully archivable, because there might be a whole other part of the site that isn’t visible unless the user does something. This could be as simple as a search bar. That will break in an archived website. So you see, as websites rely on more complex technologies, they become harder to save.
In other words, the plainer a website is, the easier it is to save. Which is not something your marketing department or eager DH grad student wants to hear.
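If you want a rough sense of how much a page depends on other places on the web, a short Python sketch like the one below can list the outside hosts a page's HTML pulls from. The class and function names are hypothetical, and this only sees resources written directly in the HTML; anything loaded later by JavaScript, which is exactly the harder case for crawlers, will not show up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tags whose src attribute points at a file the page needs.
SRC_TAGS = ("script", "img", "iframe", "video", "audio", "source", "embed")


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of resources referenced in a page's HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SRC_TAGS and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # <link> covers stylesheets, icons, and similar files.
            self.resources.append(urljoin(self.base_url, attrs["href"]))


def external_hosts(html, base_url):
    """Return a sorted list of hosts, other than the page's own, that it loads from."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    hosts = {urlparse(url).netloc for url in collector.resources}
    return sorted(host for host in hosts if host and host != own_host)
```

The longer the list this returns for your project's pages, the more pieces a crawler has to capture from servers you do not control.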
As it turns out, plain websites have another thing going for them. They are accessible. That is, they’re easy to read on a number of devices, plus people who use assistive technologies, like text-to-speech readers, can use your website easily. Archivability and accessibility go hand in hand. So it follows that…
“Lack of accessibility makes content more difficult for crawlers to capture. Websites that are trend leaders, unfortunately, set a bad precedent for facilitating archivability.”
Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson. “On the Change in Archivability of Websites Over Time.” Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta. Retrieved from arXiv. Abstract also archived at perma.cc/7TAJ-6GT3.
There is a tension, then, for scholars who want to create preservable web projects that are also cutting-edge.
The web is sensitive to human behavior
Another bit of bad news. What we said before, about how everything on the web dies faster? That is partly because websites are fragile! They cannot survive when someone is hired at another university and their old web space is cut off. Or if the grant runs out and the website disappears. Or someone decides their webpage should be deleted, even though other people are depending on it. Or someone forgets about the site, or it gets hacked, or a major part of the site breaks and no one has time to fix it…
And these very human reasons for the death of web projects, sometimes they are unavoidable. Sometimes you just cannot keep your website online, sometimes you don’t have the staff or time to keep updating the software.
A minimal preservation plan for websites that may go offline
The preservation plan is a great way to consider the afterlife of a web project. I’m not really touching on what a preservation plan that guards against these common human reasons for a web project’s death looks like. But the preservation plan should at least include making the web project archivable, or making an archivable version. Web project creators should have an idea of what happens after the project is “done” and they’re no longer presenting it at conferences, etc. Maybe it’s okay that it goes offline, as long as it’s archived somewhere, like the Internet Archive or an institutional repository.
Constant vigilance for websites that will remain online
But if you’ve got a project that does want to remain online, you’ve got to have constant vigilance! Here’s an example of a small human error with large consequences for scholarship. The DOI resolver failed last January for one day because someone forgot to renew the domain name (archival link). That is, any time you tried to go to a URL that began with dx.doi.org, as you see in many academic papers, you couldn’t get anywhere. The whole point of DOIs was that you had one reliable permalink for an academic publication! The DOI people at Crossref and the scholars who depend on them seem to have handled it well, although they might have eaten some humble pie. But it is yet another reminder that nothing on the web is reliable, even the things that purport to be.
Preservation of digital humanities projects
Because we’ve talked about digital humanities today, and it’s a field where a lot of digital scholarship is being done, and the envelope is being pushed, let’s touch on how web projects in DH are faring.
In my research about link rot, there wasn’t much at all about links in the citations of humanities projects, oddly; most studies seem to focus on the law. So I myself conducted a small study (archival link), a survey of online projects presented at the 2005 Digital Humanities conference. As you can see in this pie chart, in the ten years between the projects’ presentations and the time I went to hunt them down, just under half of the projects had become inaccessible.
45% of the web-based DH projects had disappeared between 2005 and 2015. And this is a different kind of study than the ones on the previous slide, because I hunted these projects down. Many of the projects that I found that were still online had changed URLs, and I could only find the actual web site through a Google search or combing through CVs. (I will say, and this graph doesn’t show it, there were some projects that were accessible on archive.org. Certainly not all 45% of that red slice there. But some. But because it’s hard to predict when your site will be archived in the Internet Archive, it wasn’t like there was a “Goodbye cruel world” note on the site, no real way to know if it was finished or why it was going offline.)
I know it’s kind of uncouth to cite your own work, by the way, but whenever I show this to DH people, they start feeling a little panicky, which is my main goal today.
I’ll note, two of the 60 projects presented at DH 2005 were hosted on library servers, and both are still there.
So, to sum up this section I entitled “the bad news,” I’ll end this part with this quote…
“We alone, and not our technologies, are responsible for our losses, and we alone are to blame when we deliberately choose oblivion over recollection.”
Manguel, Alberto. The Library at Night. 2006. New Haven: Yale University Press. 2008. Print.
We’ve led ourselves, collectively, to the point where we’ve already created a Dark Ages of the early web, totally lost to history. And we continue to lose our work hour by hour as the web changes and morphs and escapes our grasp.
Are we feeling low yet? Are you feeling like maybe saving the web is, in fact, impossible? As though preserving the web might be some kind of… pie in the sky?
The good news
Let’s get happy and think about the berry pies we’re all going to eat this summer and start talking about the good news!
Web archiving software is getting better!
The first bit of good news is that web archiving software is getting better! I said before that the Internet Archive’s page captures were really spotty. They seem to be less spotty now, because more items are being captured. Plus, the things that were previously invisible to crawlers, like things that only appeared when a user did something, are often archivable now, to a limited extent. The paid service Archive-It gives subscribers the power to archive videos, too, although I’m not sure that the Internet Archive has that option.
And also, at the end of last year, the Internet Archive announced (archival link) it had received a $1.9m grant to develop a search engine, so you no longer have to use the exact URL. That will be a huge boon!
There are more web archiving options!
Covered here: the Internet Archive’s “Save Page Now” option, Perma.cc, Archive-It, Heritrix + Open Wayback, WebRecorder, and GitHub.
While the Internet Archive is a vast trove of content, the page you want to cite may not be archived at all, or may not have a recent archival snapshot (“memento”). On the Internet Archive’s Wayback Machine, you can submit a URL to archive on demand with the “Save Page Now” option.
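For anyone who would rather script this than use the form, here is a small Python sketch. It assumes that requesting `https://web.archive.org/save/` followed by the page's address triggers a capture, which is how the public Save Page Now address has worked for anonymous users as far as I know; the Internet Archive may rate-limit or change this, and the function names and User-Agent string are my own placeholders.

```python
import urllib.request

# Assumed public Save Page Now prefix; behavior may change or be rate-limited.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


def request_capture(page_url: str, timeout: float = 60.0) -> int:
    """Ask for a capture and return the HTTP status code of the reply.

    This makes a real network request, so call it sparingly.
    """
    request = urllib.request.Request(
        save_page_now_url(page_url),
        headers={"User-Agent": "citation-archiver-sketch/0.1"},  # placeholder name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A polite workflow would be to call `request_capture` once per cited page when you finish a bibliography, not in a tight loop.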
But there are more options aside from the Internet Archive. You may want to have more control over what you archive. You might even want to save archived files locally, like in your institutional repository.
This screenshot is of a New Yorker article that I saved using Perma.cc. This online app was developed at the Harvard Law School Library, and it takes a snapshot of a single webpage of your choosing. You get not only a high-fidelity snapshot of the page that you can use in your browser, but you also get a high-resolution screenshot of the page so you can see what it looks like even if the archived version is missing things. AND you also get a permalink that you can then use in, say, your bibliography. So when your readers encounter your article years later, they can see exactly what you were looking at when you quoted from it. This guards against link rot, reference rot, and content drift, all at once. Right now, you can get 10 Perma links a month for free. Libraries can also sign up as ambassadors, essentially, to create organization accounts for departments and faculty members (who each get 10 links/month).
Of course, having seen what happens to shiny new web toys, we can’t be totally certain that Perma will be around for good, or that the links it gives us will stick around forever. But it will probably stick around for longer than many cited web pages in law journals, which, as you might recall, see 70% of cited URLs suffering reference rot.
For more hardcore web archiving options, you’ve got Archive-It, which is a paid service that hosts your chosen web archives. It uses the same software as IA, but you can tell it to archive this list of URLs with a given frequency, and it will do it. Many universities use Archive-It as part of their online collections. Often this is just for institutional records, like preserving snapshots of your college’s home page every week or month. But this can also be used to curate collections of web snapshots, for instance, social media posts and news articles covering the Japan earthquake in 2011. Archive-It is powerful, and it can be difficult to set up, or so I hear. You might accidentally set loose a crawler that collects too much material.
(This is the service I wish my institution could afford, by the way. I’ll talk more about bootstrapping later.)
For the extra-hardcore web archivists: you can set up your own crawler. IA and others use Heritrix (archival link) to save page snapshots as WARC files and view them using Open Wayback (archival link).
WebRecorder was launched recently by Rhizome, a collective dedicated to preserving digital art. This is like a person-powered mini-Internet Archive: it captures everything you click. So instead of setting a crawler loose on your site to get through all 50 pages, you would click through all 50 pages yourself. You’re the crawler. It’s much more controllable, but more laborious. And to be clear, it shares some of the limitations that Internet Archive has, like pulling in information from an API. To be extra-clear, because this confused me at first, it’s not a screen-capture tool; its “recording” output isn’t a video, it’s a WARC file. Once you’re done, you can download the WARC file (web archive file), which you can deposit in your institutional repository or view yourself if you have the viewing software Open Wayback. (Here’s an example of a WebRecorder collection for John Jay College of Criminal Justice’s main website.)
At the Graduate Center of CUNY, the library asks students (private archival link, unavailable due to robots.txt exclusion) who have a web component to their dissertation or thesis to use WebRecorder to create an archived capsule of their project. This gets ingested into Academic Works, CUNY’s institutional repository, along with the dissertation PDF, as a supplemental file. (See Klein, Stephen, “Digital Preservation” (2015), presented at the CUNY IT Conference. Click Download button for the .ppt file.)
The last tool I’ll mention is not exactly a web archiving tool, but one that adds a ton of documentation to archived web projects. In the past few years, we’ve seen a major uptake of GitHub, an online code repository that helps you keep track of your code using version control software called Git. I use GitHub for the code that runs our library website at John Jay, as well as some personal projects. Many, or even most, web developers use Git to keep track of their code.
And for a digital archivist, Git. Is. Amazing. Git keeps track of every change to every line of code. It’s just like Track Changes in Microsoft Word, or the edit history of a Wikipedia page. If I make a change to one of my scripts, Git notes who made this change, when exactly it was made, the parts of the code that changed, and the message that the developer wants to attach to this record. The logs that keep track of these changes are bundled in with the code. So if, say, your dissertation policies say that digital projects must include a bundle of code with their submission, you could get the code along with its very granular documentation. (Here’s an example of a thesis submitted as a Git repository.) This gets me so excited!!
(I wrote about Git, GitHub, and libraries in the “Internet Connection” column for Behavioral and Social Sciences Librarian 34.3 (2015). Preprint on Academic Works.)
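To show what that documentation looks like in practice, here is a minimal Python sketch that pulls the who, when, and why of each change out of a repository's history. It relies on `git log` with a custom `--pretty=format:` string; the function names and the dictionary layout are my own choices for illustration, and it assumes Git is installed on the machine running it.

```python
import subprocess

FIELD_SEP = "\x1f"  # unit separator, unlikely to appear in a commit message
# %H = commit hash, %an = author name, %aI = author date (ISO 8601), %s = subject line
LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


def parse_git_log(raw: str):
    """Turn the raw output of `git log` (in LOG_FORMAT) into a list of records."""
    records = []
    for line in raw.split("\n"):
        if not line.strip():
            continue
        commit, author, date, message = line.split(FIELD_SEP, 3)
        records.append(
            {"commit": commit, "author": author, "date": date, "message": message}
        )
    return records


def read_git_log(repo_path: str):
    """Run `git log` inside repo_path and return its parsed history."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_git_log(result.stdout)
```

An archivist could run `read_git_log` on a deposited repository and save the result alongside the code as a plain-language change record.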
This concludes our whistlestop tools overview.
Web archiving is making the scholarly record stronger!
And even more good news — resources cited in scholarly papers are being archived.
This graph is from the 2014 study that I mentioned earlier about reference rot in STM articles. Within 14 days of an article’s publication, we can see that 32% of links are rotten or broken (yellow bar). But look at the brown bar: almost 25% of links were archived through a web archiving service like Internet Archive or Perma. That little grey sliver at the bottom represents content that would have been inaccessible if it hadn’t been archived. And if this graph were to stretch out beyond two weeks, I’m betting that that sliver will get bigger and bigger, and the brown bar will, too. One day, maybe the orange bar will disappear altogether.
So now we’ve got a grounding in why web archiving is important and difficult, and we’ve sped through some of the things that make it possible. Now what? How can we, as library professionals, better support the preservation of web-based scholarly projects? And how can we, as scholars, better preserve our own web-based projects?
Start overthinking your citation practices
I started this talk with alarming statistics about link rot and reference rot in citations in academic literature. It’s probably fair to say that all of us in the room have written a paper — an essay, a blog post, a scholarly article — that cited a web resource that is now gone. But we’re librarians! We believe in citing your sources!
So it might be time for us to think about a better way to cite web sources. Should you include a link to an archival copy from Perma, like in the example above and throughout this page? And/or links to the Internet Archive’s archival copies?
Should you keep an archived copy of every resource locally as .WARC files for your own records? Or just hope that when you add “Accessed on (this date)” to your citation, a future reader will be able to find a snapshot of the page on the Internet Archive around the time you were viewing it? And will you let your students in on the fact that web materials can disappear after they cite them in their papers?
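On the “Accessed on (this date)” question: a future reader does not have to browse the Wayback Machine by hand. The sketch below, in Python, asks the Internet Archive's availability API for the snapshot closest to a given date. The endpoint and the shape of the JSON reply follow the Wayback Machine's published availability API as I understand it; the function names are hypothetical, and the API may change.

```python
import json
import urllib.parse
import urllib.request
from datetime import date
from typing import Optional

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(page_url: str, accessed: date) -> str:
    """Build the query asking for the snapshot closest to the access date."""
    params = urllib.parse.urlencode(
        {"url": page_url, "timestamp": accessed.strftime("%Y%m%d")}
    )
    return f"{AVAILABILITY_API}?{params}"


def closest_snapshot(api_response: dict) -> Optional[str]:
    """Pull the snapshot address out of the API's JSON reply, if there is one."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url: str, accessed: date, timeout: float = 30.0) -> Optional[str]:
    """Make the real network request and return a snapshot address or None."""
    with urllib.request.urlopen(
        availability_query(page_url, accessed), timeout=timeout
    ) as response:
        return closest_snapshot(json.load(response))
```

The “closest” snapshot may still be months away from the access date, so it is worth checking the timestamp in the returned address before trusting it as what the author saw.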
Design your web projects to be archivable
“Recognizing techniques to make the archiving process easier by those that want their content preserved is a first step in guiding web development practices into producing web sites that are easier to preserve.”
Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson. “On the Change in Archivability of Websites Over Time.” Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta. Retrieved from arXiv. Abstract archived at perma.cc/7TAJ-6GT3.
For those that create web-based projects, I hope you are sufficiently worried about your future scholarly record to start thinking about how you can plan for your project to be archivable.
I noted before that archivability correlates strongly with accessibility, and both are inversely correlated with using cutting-edge technologies. You will have to find the balance in your project. And for those that advise folks who build web-based projects, you should know the principles that guide archivability and accessibility as well! Ideally, you’ll be part of the planning stage so your input may shape how well the project can be preserved for the future.
Assemble a web archiving toolbox
For those of us in the room who might be taking in web archives as archival objects, or for those among us who dread the day we will have to do this, we should know how we’re going to save web content. That might mean we invest in a subscription to Archive-It, or do the laborious work of WebRecorder, or tell students and faculty to do this for their own work, or make sure that we know the Internet Archive is archiving our web work and that it is lossy. If the code behind the web project itself is something that might enter your library or archive, you should get to know Git.
On my part, I’m still bootstrapping it. The library where I work is overburdened and understaffed, so we can only accept web content in CUNY’s institutional repository, Academic Works. In the meantime, I’m seeking out connections on campus to share the good word about archivability and accessibility. I’m also using GitHub as a social network and peeking at projects other folks at John Jay and CUNY at large are doing. Maybe your MO could look as bootstrappy as mine; maybe you can get started with an Archive-It subscription; maybe you already have one!
Everything on the web dies faster.
But we’re getting better at saving the web.
We can save our own web undertakings.
And we can make the scholarly record stronger, starting today.
Cite this presentation: Davis, Robin Camille. “Die Hard: The Impossible, Absolutely Essential Task of Saving the Web for Scholars.” Eastern New York Academic and College Research Libraries Conference. Skidmore College, Saratoga Springs, NY. 23 May 2016. Closing remarks.