Using The Web Infrastructure To Preserve Web Pages
International Journal on Digital Libraries. July, 2007.
M.L. Nelson, F. McCown, J.A. Smith, and M. Klein.
To date, most of the focus regarding digital preservation has been on
replicating copies of the resources to be preserved from the “living web” and
placing them in an archive for controlled curation. Once inside an archive, the
resources are subject to careful processes of refreshing (making additional
copies to new media) and migrating (conversion to new formats and
applications). For small numbers of resources of known value, this is a
practical and worthwhile approach to digital preservation. However, due to the
infrastructure costs (storage, networks, machines) and, more importantly, the
human management costs, this approach is unsuitable for web-scale preservation.
The result is that difficult decisions need to be made as to what is saved and
what is not saved. We provide an overview of our ongoing research projects that
focus on using the “web infrastructure” to provide preservation capabilities
for web pages and examine the overlap these approaches have with the field of
information retrieval. The common characteristic of the projects is that they
creatively employ the web infrastructure to provide shallow but broad
preservation capability for all web pages. These approaches are not intended to
replace conventional archiving approaches, but rather they focus on providing
at least some form of archival capability for the mass of web pages that may
prove to have value in the future. We characterize the preservation approaches
by the level of effort required by the web administrator: web sites are
reconstructed from the caches of search engines (“lazy preservation”); lexical
signatures are used to find the same or similar pages elsewhere on the web
(“just-in-time preservation”); resources are pushed to other sites using NNTP
newsgroups and SMTP email attachments (“shared infrastructure preservation”);
and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL
representations of web pages (“web server enhanced preservation”).
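
As an illustration of the lexical-signature idea behind "just-in-time preservation", the following is a minimal sketch, not the authors' implementation. It assumes a lexical signature is the handful of terms on a page with the highest TF-IDF score against some reference corpus; the function names, the tokenizer, the smoothed IDF formula, and the default of five terms are all assumptions made for this example. In practice the resulting terms would be submitted as a search engine query to look for the same or a similar page elsewhere on the web.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score.

    `corpus` is a list of document strings used to estimate document
    frequency. The smoothed IDF below is one common variant, chosen here
    only for illustration.
    """
    term_counts = Counter(tokenize(page))
    doc_term_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)

    scores = {}
    for term, count in term_counts.items():
        # Number of corpus documents that contain the term.
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        idf = math.log((n_docs + 1) / (doc_freq + 1))
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the web archive preserves the web pages",
    ]
    page = "the web archive preserves the web pages"
    # The signature could be joined into a search engine query string.
    print(" ".join(lexical_signature(page, corpus, k=3)))
```

Terms that appear in every document of the corpus (such as "the" in the toy corpus above) score zero and fall to the bottom of the ranking, which is why the signature tends to consist of words that distinguish the page from the rest of the collection.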
@ARTICLE{mln:ijdl07,
author = {Michael L. Nelson and Frank McCown and Joan A. Smith and Martin Klein},
title = {Using The Web Infrastructure To Preserve Web Pages},
journal = {International Journal on Digital Libraries},
year = {2007},
volume = {6},
number = {4},
pages = {327--349},
note = {doi:10.1007/s00799-007-0012-y}
}