robots.txt

Web crawling bots need instructions on how they should treat your website: which parts they may crawl and which they should skip. robots.txt is the plain text file that provides those instructions.

Consider Google: it operates web crawlers that categorise and archive websites. Google's robots, like those of many other search engines, look for a robots.txt file on a website before they try to crawl its other pages. The reason is simple: robots.txt may contain instructions from the owner of the website that explain how Google or other web crawlers should crawl and index the site.
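
By convention, crawlers request the file from a fixed location at the root of the host. For a site at example.com (a placeholder domain used here purely for illustration), the first request would typically be:

    https://example.com/robots.txt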

For example, a robots.txt file can ask a web crawler to ignore certain directories or individual files on a website, excluding them from its crawl. The website owner may have several motivations for excluding content: some of it may not be relevant for categorising the site, or it may be private and should not be listed on a public search engine.
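
A minimal sketch of such a file, using the hypothetical paths /private/ and /drafts/old-page.html purely for illustration, might look like this:

    # The rules below apply to all crawlers
    User-agent: *
    # Ask crawlers to skip an entire directory
    Disallow: /private/
    # ...and a single file
    Disallow: /drafts/old-page.html

The User-agent line states which crawlers the rules are addressed to (here, all of them), and each Disallow line names a path that those crawlers are asked not to crawl.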

It is worth noting that where a website uses subdomains, each subdomain needs to serve its own robots.txt file. Furthermore, not every web crawler will follow the instructions given in robots.txt: some crawlers ignore the file, and others might even read it to locate the very areas the site owner wanted to keep out of view.
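
To illustrate the subdomain point, a main site and its subdomains (example.com and its subdomains are placeholders here) would each answer for their own file at their own root:

    https://www.example.com/robots.txt     (rules for www.example.com only)
    https://blog.example.com/robots.txt    (rules for blog.example.com only)

The rules in one file have no effect on the other host.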

Because compliance is voluntary, there is no guarantee that an instruction in a robots.txt file to skip specific site areas will actually be followed. In addition, if the areas excluded in robots.txt are linked to from other pages, they can still appear in search results, because search engines can discover those URLs through the links.

Please note that the technologies described on Wiki pages are not necessarily part of the Plesk control panel or its extensions.
