Integrating Preservation Functions into the Web Server

Dissertation (Ph. D., Computer Science). Old Dominion University. June, 2008.
J.A. Smith
Download: Dissertation.pdf
Digital preservation of the World Wide Web poses unique challenges, different from the preservation issues facing professional Digital Libraries. The complete list of a website’s resources cannot be cited with confidence, and the HyperText Transfer Protocol (HTTP) provides a bare minimum of metadata with each resource transfer; HTTP is optimized for access today rather than tomorrow. In short, the Web suffers from a counting problem and a representation problem. Refreshing the bits, migrating from an obsolete file format to a newer format, and other classic digital preservation problems also affect the Web. As digital collections devise solutions to these problems, the Web will also benefit. But the core World Wide Web problems of Counting and Representation need a targeted solution. As the host of web content, the web server is uniquely positioned to assist in the preservation of the resources it serves. It recognizes the resources it has, and knows what kind of resources they are. This dissertation presents research in which preservation functions have been integrated into the web server itself to produce archive-ready versions of the website’s resources. The proposed approach addresses the Counting Problem through the use of Sitemaps, created from a combination of crawling, Sitemap tools, and log analysis. The Representation Problem is addressed by a preservation-preparation module installed on the web server. The module enables each resource to be packaged together with the output from a variety of relevant metadata utilities, creating the aforementioned archive-ready version of the resource. The CRATE Model defines a simple XML structure for the creation and delivery of such resources. A series of experiments which evaluated CRATE, Sitemaps, and extemporaneous metadata analysis of resources are presented, along with a technical review of the MODOAI web server module which acts as the preservation agent. The feasibility of this approach is demonstrated by a quantitative analysis of its use in a commercial web testing environment.
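As an illustration of the Counting Problem approach described above, the following is a minimal sketch of combining crawled paths with paths recovered from access-log analysis into a single Sitemap. It is not the dissertation's implementation: the function names, the Common Log Format assumption, and the status-200 filter are all hypothetical choices made for this example. Only the `urlset`/`url`/`loc` structure and namespace come from the public Sitemap protocol.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Assumption: the server writes Common Log Format; this pulls out the
# requested path and the response status from each line.
LOG_REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


def paths_from_log(lines):
    """Return the set of paths the log shows were served with status 200."""
    found = set()
    for line in lines:
        match = LOG_REQUEST.search(line)
        if match and match.group(2) == "200":
            found.add(match.group(1))
    return found


def build_sitemap(base_url, crawled_paths, log_lines):
    """Merge crawled and log-derived paths into one Sitemap XML string."""
    paths = set(crawled_paths) | paths_from_log(log_lines)
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in sorted(paths):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + path
    return ET.tostring(urlset, encoding="unicode")
```

The point of merging the two sources is that a crawler only sees linked resources, while the log can reveal resources that were requested directly and are linked from nowhere on the site.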
@PHDTHESIS{jas:phd,
  author = {Joan A. Smith},
  title = {Integrating Preservation Functions into the Web Server},
  school = {Old Dominion University, Department of Computer Science},
  month = {June},
  year = {2008},
  address = {Norfolk, VA, USA}
}