A search engine crawler, or spider, is a Web robot and, as such, normally obeys the robots.txt file if one is present. The robots.txt protocol was developed at the end of 1993 and remains the Web's standard for controlling how search engine robots access a particular Web site. Most major search engines claim to support it, but compliance is voluntary: no robot, including a search engine spider, is required to honor it.
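The check a well-behaved crawler performs before requesting a page can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration: the robots.txt content, the robot name "ExampleBot" and the www.example.com URLs are invented for the example and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "ExampleBot" is a made-up robot name; it falls under the "*" rules.
can_fetch_public = parser.can_fetch(
    "ExampleBot", "http://www.example.com/index.html")
can_fetch_private = parser.can_fetch(
    "ExampleBot", "http://www.example.com/private/report.html")

print(can_fetch_public)   # the public page is not disallowed
print(can_fetch_private)  # anything under /private/ is disallowed
```

Nothing forces a robot to run a check like this; the file only works for robots that choose to consult it.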
The purpose of the robots.txt protocol is to give web servers a way to tell search engine crawlers which parts of the server should not be accessed, in other words, to keep robots from reading parts of the server which could contain sensitive or confidential information. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is "It doesn't".
If the robots.txt file can be used to prevent access to certain parts of a web site, it can also prevent access to the whole site! In my practice, I have found on more than one occasion that the robots.txt file was the main reason a site wasn't listed in certain search engines. Once I corrected it, all was well and the site was listed. If the robots.txt file isn't written correctly, it can cause all kinds of problems and, worst of all, you will probably never find out about it just by looking at your actual HTML code. When a client asks us to analyse a web site that has been online for about a year and is not listed in certain engines, the first place we look is the robots.txt file. Once we have corrected that and optimized the site for its most important keywords and key phrases, the rankings usually go way up within the next thirty to sixty days.
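One common way a robots.txt file ends up blocking a whole site is a single stray slash: "Disallow: /" shuts robots out of everything, while an empty "Disallow:" shuts them out of nothing. A quick check along these lines can be sketched in Python. The helper name blocks_entire_site is hypothetical and the sample files are invented; this is a rough diagnostic, not a full audit.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if the given rules keep `agent` out of the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Three hypothetical robots.txt files that differ only in the Disallow line.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ONE_DIR = "User-agent: *\nDisallow: /cgi-bin/\n"

for name, text in [("block all", BLOCK_ALL),
                   ("allow all", ALLOW_ALL),
                   ("block one directory", BLOCK_ONE_DIR)]:
    print(name, "->", blocks_entire_site(text))
```

The check only tests the site root, so it will not catch a file that blocks every important directory one by one.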
More on the robots.txt file

The Disallow line in a robots.txt file means "disallow reading", but that does not mean "disallow indexing". In other words, a disallowed resource may be listed in a search engine's index even if the search engine follows the protocol. The most obvious demonstration of this is the Google search engine. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or retrieving its robots.txt file. In doing so it is not breaking the robots.txt protocol, because it is not reading any disallowed resources; it is simply reading other web sites' links to those resources, which Google constantly uses in its PageRank algorithm.
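The difference between reading and indexing can be made concrete with a toy model. The sketch below is not how Google or any real engine works; the function build_link_index, the robot name "ToyBot", the URLs and the anchor texts are all invented. It only shows that an index entry can be created from other sites' links while the disallowed page itself is never fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the target site, www.example.com.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /secret/"])

# Hypothetical links found on OTHER sites: (target URL, anchor text).
external_links = [
    ("http://www.example.com/secret/plan.html", "Example Corp launch plan"),
    ("http://www.example.com/index.html", "Example Corp home page"),
]


def build_link_index(links, parser, agent="ToyBot"):
    """Index every linked URL, but mark a page as fetched only if allowed."""
    index = {}
    for url, anchor in links:
        entry = index.setdefault(url, {"anchors": [], "fetched": False})
        # The anchor text comes from the linking site, not the target page.
        entry["anchors"].append(anchor)
        # The page itself is read only when robots.txt permits it.
        entry["fetched"] = parser.can_fetch(agent, url)
    return index


index = build_link_index(external_links, robots)
for url, entry in index.items():
    print(url, entry)
```

In this model the disallowed URL still appears in the index, described only by the words other sites used when linking to it.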
A web site does not necessarily need to be read in order to be indexed. How, then, can the robots.txt file be used to prevent a search engine from listing a particular resource in its index? In practice, most search engines have placed their own interpretation on the robots.txt file, one which allows it to be used to keep resources out of their index. Most search engines treat a resource that is disallowed by the robots.txt file as one they should not add to their index, and if it is already in their index (placed there by previous crawling activity), they remove it. This last point is important, and the following example illustrates it.
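As a rough sketch of that interpretation, assume a toy engine whose index is a plain dictionary of URLs. When the site's robots.txt changes, the engine drops every entry the new rules disallow. The function prune_index, the robot name "ToyBot" and all URLs here are hypothetical, and real engines do not document their removal behaviour in these terms.

```python
from urllib.robotparser import RobotFileParser


def prune_index(index, robots_txt, agent="ToyBot"):
    """Return a copy of `index` without the URLs the rules now disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: data for url, data in index.items()
            if parser.can_fetch(agent, url)}


# Hypothetical index built by earlier crawling, before the rules changed.
old_index = {
    "http://www.example.com/index.html": "Home page",
    "http://www.example.com/drafts/pricing.html": "Draft pricing",
}

# Hypothetical robots.txt published later by the site owner.
new_robots_txt = "User-agent: *\nDisallow: /drafts/\n"

pruned = prune_index(old_index, new_robots_txt)
print(sorted(pruned))
```

Under this interpretation the draft page disappears from the index the next time the engine processes the site's rules, even though it had been crawled and listed earlier.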
The anomalies and inadequacies of the robots.txt file and the robots meta tag are indicative of what can sometimes be a bigger problem. It is impossible to prevent any directly accessible resource on a site from being linked to by external sites, be they partner sites, competitors' sites or search engines. There is no legal or technical reason why the robots.txt rules must be honored, least of all by humans creating links, for whom the standards were not even written. This may not seem a bad thing, but there are many instances when a site owner would rather a particular page were never linked to from any other site on the Web. In such cases, the robots.txt file will help the site owner achieve his or her goals only to a certain degree.