Robots.txt is a file associated with your website that tells web crawlers which portions of the site they may or may not crawl.
The robots.txt file is primarily used to specify which parts of your website should be crawled by spiders or web crawlers, and it can specify different rules for different spiders.
Googlebot is an example of a spider. It’s deployed by Google to crawl the Internet and record information about websites so it knows how high to rank different websites in search results.
Using a robots.txt file with your website is a web standard. Spiders look for the robots.txt file in the root directory (or main folder) of your website. This text file is always named “robots.txt”. You can find your robots.txt file by going to: domain.com/robots.txt
Most mainstream spiders comply with the directions specified in robots.txt files, but nefarious spiders may not. The content of robots.txt files is publicly available. You can attempt to ban unwanted spiders by editing the .htaccess file associated with your website.
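For example, on an Apache server, a sketch of an .htaccess rule that refuses requests from a crawler calling itself “BadBot” (a hypothetical user-agent name used here for illustration) might look like this:

```
# Return 403 Forbidden to any request whose User-Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, which is only a polite request, this blocks the crawler at the server level; however, a bot can still evade it by changing its user-agent string.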
It’s important that marketers check their robots.txt file to make sure search engines are invited to crawl important pages. If you ask search engines to not crawl your website, then your website won’t appear in search results.
You can also use the robots.txt file to show spiders where to find a sitemap of your website, which can make your content more discoverable.
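A sitemap reference is a single line containing the absolute URL of your sitemap file (example.com below is a placeholder domain):

```
Sitemap: https://www.example.com/sitemap.xml
```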
You can also specify a crawl-delay, or how many seconds robots should wait before collecting more information. Some websites may need to use this setting if bots are eating up bandwidth and causing your website to load slower for human visitors.
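For instance, the following asks all crawlers to wait ten seconds between requests. Note that some major crawlers, including Googlebot, ignore the Crawl-delay directive, while others such as Bing honor it:

```
User-agent: *
Crawl-delay: 10
```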
An Example Robots.txt File
Here is what might appear in a robots.txt file:
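The sample below combines the directives explained line by line in the rest of this section:

```
User-agent: *
Disallow: /ebooks/*.pdf
Disallow: /staging/

User-agent: Googlebot-Image
Disallow: /images/
```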
Here is what each line means in plain English.
User-agent: * — The first line specifies that the rules that follow apply to all web crawlers. The asterisk is a wildcard meaning all spiders in this context.
Disallow: /ebooks/*.pdf — In conjunction with the first line, this line means that no web crawler should crawl any PDF files in the ebooks folder within this website. This means search engines won’t include these direct PDF links in search results.
Disallow: /staging/ — In conjunction with the first line, this line asks all crawlers not to crawl anything in the staging folder of the website. This can be helpful if you’re running a test and don’t want the staged content to appear in search results.
User-agent: Googlebot-Image — This explains that the rules that follow should be followed by only one specific crawler, the Google Image crawler. Each spider uses a different “user-agent” name.
Disallow: /images/ — In conjunction with the line immediately above this one, this asks the Google Images crawler not to crawl any images in the images folder.
Indexed, though blocked by robots.txt
This means that some of the content blocked by robots.txt is still indexed in Google.
Once again, if you’re trying to exclude this content from Google’s search results, robots.txt isn’t the correct solution. Remove the crawl block and instead use a meta robots tag or X-Robots-Tag HTTP header to prevent indexing.
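For an HTML page, the meta robots tag goes in the page’s head section:

```
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the server can send the equivalent HTTP response header instead:

```
X-Robots-Tag: noindex
```

In both cases the crawler must be allowed to fetch the page, since it has to read the tag or header to obey it.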
If you blocked this content by accident and want to keep it in Google’s index, remove the crawl block in robots.txt. This may help to improve the visibility of the content in Google search.
What’s the maximum size of a robots.txt file?
500 kilobytes (roughly).
Where is robots.txt in WordPress?
Same place: domain.com/robots.txt.
How do I edit robots.txt in WordPress?
Either manually, or using one of the many WordPress SEO plugins like Yoast that let you edit robots.txt from the WordPress backend.
What happens if I disallow access to noindexed content in robots.txt?
Google will never see the noindex directive because it can’t crawl the page.