Site Design Impact on Robots: An Examination of Search Engine Crawler Behavior at Deep and Wide Websites

D-Lib Magazine. March/April 2008.
J.A. Smith and M.L. Nelson.
Conventional wisdom holds that search engines "prefer" sites that are wide rather than deep, and that having a site index will result in more thorough crawling by the Big Three crawlers – Google, Yahoo, and MSN. We created a series of live websites, two dot-com sites and two dot-edu sites, that were very wide and very deep. We analyzed the logs of these sites for a full year to see if the conventional wisdom holds true. We noted some interesting site access patterns by the Google, Yahoo, and MSN crawlers, which we include in this article as GIF animations. We found that each spider exhibited different behavior and crawl persistence. In general, width does appear to be crawled more thoroughly than depth, and providing links on one or two "index" pages improves crawler penetration. Google was quick to reach and explore the new sites, whereas MSN and Yahoo were slow to arrive, and the percentage of site coverage varied by site structure and by top-level domain. Google is clearly king of the crawl: its lowest site coverage was 99%, whereas MSN's worst coverage was 2.5% and Yahoo's worst coverage of a site was 3%.
@article{jas:dlibJan08,
  author  = {Joan A. Smith and Michael L. Nelson},
  title   = {Site Design Impact on Robots: An Examination of Search Engine Crawler Behavior at Deep and Wide Websites},
  journal = {{D-Lib Magazine}},
  volume  = {14},
  number  = {3/4},
  month   = {March/April},
  year    = {2008},
  doi     = {doi:10.1045/march2008-smith},
  note    = {\url{http://dlib.org/dlib/march08/smith/03smith.html}}
}