The robots.txt file is a widely accepted convention that, in essence, tells search engine spiders which pages they can and cannot visit. Most reputable engines look for this file before any other upon entering a new domain or IP for spidering or indexing purposes. Though its presence is not necessary for search engine spiders to locate and fully index a site, there are substantial benefits to creating and making available a properly formatted robots.txt file.
How to create it.
The purpose of this article is not to provide a tutorial on robots.txt coding, but rather to demonstrate the value of adding (or not adding) this file to a web site. There are a number of excellent online resources for robots.txt construction and even a few automated scripts that do the job quite well. One of my favorites is located here.
Why use it?
There are several reasons for using a robots.txt file on a domain. Although the generally accepted purpose of the robots.txt file is to exclude search engine spiders from areas of a website that would gain little or no value from being part of search results, there are other valuable uses:
Reduce Bandwidth
By preventing a robot from fully indexing a site, a webmaster can help reduce bandwidth usage. There are substantial costs involved when spiders open and close thousands (if not tens or hundreds of thousands) of individual pages and database connections on a daily basis. Such savings are most easily recognized by excluding image folders or image servers from the robots' path.
Protect Content
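Because one can exclude site access on a spider-by-spider basis, it is possible to reduce opportunities for content theft by denying access to image archivers or other programs that "strip" content by using spider-like technology. This only deters programs that choose to honor the file, since robots.txt is a request rather than an enforcement mechanism.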
Keep Data Private
Most websites have pages that are meant to be private. These could include a batch of server log files, lead generation scripts, client reports, or test pages. Excluding spiders from these folders helps keep such private content out of search engine indexes and, consequently, away from unwanted visitors.
Reduce 404 Errors
Every time a search engine spider attempts to access a robots.txt file that does not exist, a server 404 error is generated. Given the high frequency with which search engines might visit a site, the number of 404 errors recorded in the server logs could grow quite large, making it difficult to track actual broken links and server errors.
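The uses described above can be illustrated with a short sketch. It relies on Python's standard urllib.robotparser module to show how a well-behaved spider would interpret a robots.txt file. The file contents, the folder names (/images/, /logs/), and the agent names "ImageGrabber" and "SomeSpider" are hypothetical and chosen only for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one unwanted agent is shut out entirely,
# while every other spider is kept out of two folders.
ROBOTS_TXT = """\
User-agent: ImageGrabber
Disallow: /

User-agent: *
Disallow: /images/
Disallow: /logs/
"""


def build_parser(text):
    """Parse robots.txt text the way a compliant spider would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named agent is denied everywhere on the site.
grabber_allowed = parser.can_fetch("ImageGrabber", "/index.html")

# Any other spider may read ordinary pages but not the excluded folders.
spider_home = parser.can_fetch("SomeSpider", "/index.html")
spider_image = parser.can_fetch("SomeSpider", "/images/logo.gif")
spider_logs = parser.can_fetch("SomeSpider", "/logs/access.log")

print(grabber_allowed, spider_home, spider_image, spider_logs)
```

A check like this only models spiders that read and obey the file. It says nothing about programs that ignore it.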
Are there drawbacks?
The number one disadvantage of creating an itemized robots.txt file for a web site is that the file itself could act as a roadmap for those with less-than-honorable intentions who want to hack into private information. Anyone with a web browser can access and view the robots.txt of any website they choose.
Because many server folders are not linked from a web site, a hacker might not know that members-only pages are located in the /abc/ folder. If you tell the robots to ignore this folder in particular, a hacker would be able to read the robots.txt file and learn of its existence. One initial solution would be to password-protect the folders or the files within them, although this is not always practical or possible.
Are there alternatives?
For certain applications, there are alternatives to implementing a robots.txt file. It is possible to direct some spiders through a site by adding a robots meta tag to each individual document. Such tags, however, are not universally followed, and even when they are followed they will not provide all of the benefits of a robots.txt file. For example, a meta robots tag embedded in each page will not help eliminate the 404 errors associated with spiders seeking the robots.txt file. Nor will it keep an agent away from a page before the agent reaches it, because the agent must first request the page in order to read the tag.
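As a rough sketch of how a spider might read a robots meta tag, the following uses Python's standard html.parser module. The class name RobotsMetaParser, the helper robots_directives, and the sample page are hypothetical. Note that the function needs the page's HTML as input, which means the page has already been downloaded by the time the directives are known.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks spiders not to index it or follow its links.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>Client report</body></html>"
)

directives = robots_directives(PAGE)
print(sorted(directives))
```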
Is a robots.txt file 100% necessary?
No, but when properly applied, the robots.txt file can be the answer to a number of webmaster-related problems, including reducing costs, keeping information private, and protecting your data from theft.
By Jason Mills
www.topsitelistings.com