Integrating Preservation Functions into the Web Server
Dissertation (Ph. D., Computer Science). Old Dominion University. June, 2008.
J.A. Smith
Download: Dissertation.pdf
Digital preservation of the World Wide Web poses unique challenges, different from the preservation
issues facing professional Digital Libraries. The complete list of a website’s resources cannot
be cited with confidence, and the HyperText Transfer Protocol (HTTP) provides a bare minimum
of metadata with each resource transfer – HTTP is optimized for access today rather than tomorrow.
In short, the Web suffers from a counting problem and a representation problem. Refreshing
the bits, migrating from an obsolete file format to a newer format, and other classic digital preservation
problems also affect the Web. As digital collections devise solutions to these problems, the
Web will also benefit. But the core World Wide Web problems of Counting and Representation
need a targeted solution.
As the host of web content, the web server is uniquely positioned to assist in the preservation of
the resources it serves. It recognizes the resources it has, and knows what kind of resources they
are. This dissertation presents research in which preservation functions have been integrated into
the web server itself to produce archive-ready versions of the website’s resources. The proposed
approach addresses the Counting Problem through the use of Sitemaps, created from a combination
of crawling, Sitemap tools, and log analysis. The Representation Problem is addressed by a
preservation-preparation module installed on the web server. The module enables each resource
to be packaged together with the output from a variety of relevant metadata utilities, creating the
aforementioned archive-ready version of the resource. The CRATE Model defines a simple XML
structure for the creation and delivery of such resources.
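
As a rough illustration of the log-analysis route to a Sitemap mentioned above, the sketch below extracts successfully served paths from access-log lines and writes them in the sitemaps.org XML format. It is not the dissertation's tooling: the function names (urls_from_log, build_sitemap), the example.com host, and the sample log lines are all assumptions made for the example, and it presumes Common Log Format input.

```python
import re
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Matches the request line and status code of a Common Log Format entry.
LOG_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


def urls_from_log(lines, base_url):
    """Return the distinct URLs that were served with status 200 (hypothetical helper)."""
    seen = set()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group(2) == "200":
            seen.add(base_url.rstrip("/") + match.group(1))
    return sorted(seen)


def build_sitemap(urls):
    """Serialize a list of URLs as a minimal Sitemap document (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Invented sample data, for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Jun/2008:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 512',
    '192.0.2.2 - - [10/Jun/2008:10:00:05 -0400] "GET /missing.html HTTP/1.1" 404 0',
    '192.0.2.3 - - [10/Jun/2008:10:00:09 -0400] "GET /papers/a.pdf HTTP/1.1" 200 90210',
    '192.0.2.1 - - [10/Jun/2008:10:01:00 -0400] "GET /index.html HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    print(build_sitemap(urls_from_log(sample_log, "http://www.example.com/")))
```

Log analysis only sees resources that someone has requested, which is presumably why the approach combines it with crawling and Sitemap tools rather than relying on any single source.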
A series of experiments that evaluated CRATE, Sitemaps, and extemporaneous metadata analysis
of resources is presented, along with a technical review of the MODOAI web server module,
which acts as the preservation agent. The feasibility of this approach is demonstrated by a
quantitative analysis of its use in a commercial web testing environment.
@PHDTHESIS{jas:phd,
author = {Joan A. Smith},
title = {Integrating Preservation Functions into the Web Server},
school = {Old Dominion University, Department of Computer Science},
month = {June},
year = {2008},
address = {Norfolk, VA, USA}
}