Web crawlers need instructions to understand how to treat your website. Robots.txt is the text file that provides those instructions.
Consider Google: its web crawlers categorise and archive websites. Google's robots, like those of many other search engines, look for a robots.txt file at the root of a site (for example, https://example.com/robots.txt) before they crawl any other pages. The reason is simple: robots.txt may contain special instructions from the owner of the website explaining how Google and other web crawlers should index it.
For example, a robots.txt file can ask a web crawler to ignore certain directories or individual files, excluding them from its crawl. A site owner may have several motivations for excluding content: some content might not be relevant for categorising the site, while other content might be private and should not be listed on a public search engine.
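As a minimal sketch (the directory and file names are purely illustrative), a robots.txt file that excludes one directory and one individual file might look like this:

```
User-agent: *
Disallow: /private/
Disallow: /drafts/old-page.html
```

The User-agent: * line applies the rules to every crawler; rules can also be aimed at a specific crawler by name, such as User-agent: Googlebot.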
It’s worth noting that where a website makes use of subdomains, each subdomain needs its own robots.txt file. Furthermore, not every crawler will follow the instructions given in robots.txt: some crawlers ignore it entirely, and others may even read it to discover exactly the areas the site owner wanted to keep out of view.
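To illustrate the subdomain point (example.com is a placeholder domain), the rules for a main site and for a blog hosted on a subdomain live in separate files, each covering only its own host:

```
https://example.com/robots.txt        applies to example.com only
https://blog.example.com/robots.txt   applies to blog.example.com only
```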
So there is no guarantee that an instruction in robots.txt to skip specific areas of a site will actually be honoured. Also, if the areas excluded in robots.txt are linked to from other pages, they can still end up in search results, because search crawlers can discover those pages through the links.
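To show what compliance looks like when a crawler does choose to honour robots.txt, here is a minimal sketch using Python's standard urllib.robotparser module; the domain, path, and crawler name are placeholders.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder URL).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler asks before fetching each page.
url = "https://example.com/private/report.html"
if rp.can_fetch("MyCrawler", url):
    print("Allowed to crawl:", url)
else:
    print("robots.txt disallows:", url)
```

Nothing forces a crawler to run a check like this; honouring robots.txt is voluntary, which is exactly the limitation described above.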