Robots are programs that automatically crawl the Web and retrieve documents. Unlike web browsers such as Chrome or Firefox, which are operated by humans, robots retrieve referenced documents without any human intervention. Robots are most often referred to as crawlers, bots, or spiders. Their job is to visit sites and request pages from them, that is, to “crawl” them. Search engines index the pages that robots crawl and serve them up as search results for users.
Search engine robots find sites to crawl based on a historical list of URLs. Any site or page already indexed by a search engine is a candidate for robots to crawl, and if a page contains links to other pages, bots will try to follow them.
Most robots (the benevolent ones, anyway) routinely check for a special text file called “robots.txt”, which can be installed by the server administrator of any web site. There may be reasons you’d want to block, or “exclude”, a robot from visiting your site. One very common reason for exclusion is the large amount of bandwidth that unbridled robots can eat up. There may also be files that you don’t want crawled and indexed by search engines for the world to see, perhaps that image file from the wild night you and your buddies tried cat juggling.
Robots.txt Exclusion Examples
To prevent all robots from visiting your site, put these two lines into the /robots.txt file that lives in the root directory of the server:
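User-agent: *
Disallow: /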
But you don’t always want to exclude bots from visiting an entire site. You can write a structured text file instructing robots to stay away from certain areas of the server. You can even choose which robots to allow or disallow. Here is an example of how an exclusion may be written inside your robots.txt file:
# robots.txt file for my site
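User-agent: Googlebot
Disallow:

User-agent: hungrycrawler
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin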
The first line starts with ‘#’, marking it as a comment.
The next two lines specify that the robot called “Googlebot” is allowed to go anywhere: leaving the Disallow field empty (no path after the colon) means nothing is off-limits.
The next two lines tell a robot called “hungrycrawler” that the entire site is off-limits: “Disallow: /” excludes everything.
The last group tells all remaining robots to refrain from visiting URLs starting with /tmp or /cgi-bin. The “*” is a special token that matches any User-agent not named elsewhere in the file.
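To see how a well-behaved robot honors these rules, here is a minimal sketch in Python using the standard library’s urllib.robotparser module. (The example.com URLs and the “SomeOtherBot” name are placeholders; the expected results assume the site serves the example robots.txt file shown above.)

from urllib import robotparser

# Download and parse the site's robots.txt (example.com is a placeholder;
# assume it serves the example file shown above).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite bot asks permission before fetching each URL. Against the
# example file above, these checks would come back as noted:
print(rp.can_fetch("Googlebot", "https://example.com/anything.html"))   # True: nothing disallowed
print(rp.can_fetch("hungrycrawler", "https://example.com/index.html"))  # False: everything disallowed
print(rp.can_fetch("SomeOtherBot", "https://example.com/tmp/x.html"))   # False: /tmp is off-limits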
Check out the robots.txt Tester in Google Search Console to see how Googlebot interprets your site’s robots.txt rules.
Source: Robots.txt FAQ by Martijn Koster.