Efficient, Automatic Web Resource Harvesting
Proceedings of ACM WIDM 2006
M.L. Nelson, J.A. Smith, I. Garcia del Campo, H. Van de Sompel and X. Liu.
Download: widm140-smith.pdf
There are two problems associated with conventional web
crawling techniques: a crawler cannot know if all resources
at a non-trivial web site have been discovered and crawled
(“the counting problem”), and the human-readable format of
the resources is not always suitable for machine processing
(“the representation problem”). We introduce an approach
that solves these two problems by implementing support for
both the Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL)
in the web server itself. We present the Apache module “mod_oai”, which
can be used to address the counting problem by listing all valid URIs at a
web server and efficiently discovering updates and additions
on subsequent crawls. Our experiments indicated comparable performance for
initial crawls, and dramatic increases in update speed. mod_oai can also be
used to address the representation problem by providing “preservation ready”
versions of web resources aggregated with their respective
forensic metadata in MPEG-21 DIDL format.
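As a rough illustration of the harvesting pattern the abstract describes, the sketch below builds OAI-PMH ListIdentifiers requests for an initial crawl and for a later incremental crawl, then reads identifiers from a response. The verb and the `metadataPrefix`, `from`, and `resumptionToken` arguments come from the OAI-PMH 2.0 specification. The base URL, the helper names, and the sample response are hypothetical and chosen only for illustration; this is not the authors' implementation, and it makes no claim about how mod_oai is configured.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical base URL; a real server's OAI-PMH endpoint may differ.
BASE_URL = "http://www.example.org/modoai"


def list_identifiers_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build a ListIdentifiers request URL (illustrative helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers",
                  "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Incremental crawl: only items changed on or after this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token_or_None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in
           root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the end of the list.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Made-up response, shaped like an OAI-PMH ListIdentifiers reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-10-01T00:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/papers/a.pdf</identifier>
      <datestamp>2006-10-02T00:00:00Z</datestamp>
    </header>
    <resumptionToken>token-1</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""

initial_crawl = list_identifiers_url(BASE_URL)
update_crawl = list_identifiers_url(BASE_URL, from_date="2006-10-01")
identifiers, next_token = parse_identifiers(SAMPLE_RESPONSE)
```

A harvester would keep requesting with the returned resumption token until none is returned, which is how a complete list of URIs can be obtained; a later crawl would pass the date of the previous harvest as `from_date` to pick up only updates and additions.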
@article{jas:widm06,
author = {Michael L. Nelson and Joan A. Smith and Herbert {Van de Sompel} and
Xiaoming Liu and Ignacio {Garcia del Campo}},
title = {Efficient, Automatic Web Resource Harvesting},
journal={Proceedings of the eighth ACM international
workshop on web information and data management},
year = {2006},
month = {November}
}