Why Google indexes blocked web pages
Google’s John Mueller answered a question about why Google indexes pages that robots.txt says should not be crawled and why the associated Search Console reports on these crawls can be safely ignored.
Bot traffic to query parameter URLs
The person asking the question documented that bots were generating links to non-existent query parameter URLs (?q=xyz) pointing at pages that carry a noindex meta tag and are also blocked in robots.txt. The concern behind the question is that Google crawls the links to those pages, gets blocked by robots.txt (and therefore never sees the noindex robots meta tag), and the URLs then get reported in Google Search Console as “Indexed, though blocked by robots.txt”.
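To illustrate the setup (a minimal sketch using Python’s standard-library robots.txt parser; the example.com domain, the /search path, and the disallow rule are placeholder assumptions, not details from the question), the snippet below shows why the noindex tag is never seen: the fetch is refused before the HTML containing the tag can be read. Note that Python’s parser only does simple prefix matching, so the rule is written as a path prefix rather than the wildcard patterns Googlebot also supports.

    # Minimal sketch: once a URL is disallowed in robots.txt, a compliant
    # crawler never fetches the HTML, so it never sees a
    # <meta name="robots" content="noindex"> tag on that page.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /search",   # placeholder rule covering /search?q=... URLs
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    url = "https://example.com/search?q=xyz"   # hypothetical bot-generated URL
    print(rp.can_fetch("Googlebot", url))      # False: blocked, noindex never seen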
The person asked the following question:
“But here’s the big question: Why would Google index pages if they can’t even see the content? What’s the benefit of doing so?”
John Mueller of Google confirmed that Google cannot see the noindex meta tag if the page cannot be crawled. Interestingly, he also mentioned the site: search operator and advised ignoring its results because the “average” user will not see them.
He wrote:
“Yes, you’re right: if we can’t crawl the page, we can’t see the noindex. That is, if we can’t crawl the pages, there’s not much for us to index. So you might see some of those pages with a targeted site: query, but the average user won’t see them, so I wouldn’t worry about it. Noindex is fine too (without robots.txt disallow), it just means the URLs will eventually get crawled (and end up in the Search Console crawled/not indexed report – neither of those statuses causes problems for the rest of the site). The important part is that you don’t make them crawlable + indexable.”
Findings:
1. Mueller’s answer confirms the limitations of using the site: advanced search operator for diagnostic purposes. One of those reasons is that it is not connected to the regular search index; it is something separate altogether.
Google’s John Mueller commented on the site: search operator in 2021:
“The short answer is that a site: query is not intended to be exhaustive, nor should it be used for diagnostic purposes.
A site query is a special type of search that limits the results to a specific website. It basically consists of just the word site, a colon, and then the website’s domain.
This query limits the results to a specific website. It is not a comprehensive collection of all pages on this website.”
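For illustration, such a query is typed straight into Google Search, for example site:example.com (example.com being a placeholder domain), or a more targeted variant combining it with the inurl: operator, such as site:example.com inurl:q=, to narrow the results to the bot-generated parameter URLs. As Mueller notes, whatever these queries return should not be read as an exhaustive or diagnostic view of the index.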
2. Using a noindex tag without a robots.txt disallow is fine for situations like this, where a bot links to non-existent pages that then get discovered by Googlebot, because Google can crawl the URL and see the noindex (see the sketch after this list).
3. URLs with the noindex tag create a “crawled/not indexed” entry in Search Console and have no negative impact on the rest of the website.
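As a rough illustration of the second finding (a minimal sketch, not Google’s actual pipeline; the URL, the user-agent string, and the has_noindex helper are hypothetical), the check below only works because the URL is not disallowed in robots.txt: a crawler has to be able to fetch the page before it can see a noindex directive in the X-Robots-Tag header or the robots meta tag.

    # Minimal sketch: a noindex directive can only be honored if the page
    # itself can be fetched, i.e. it is not disallowed in robots.txt.
    import re
    from urllib.request import Request, urlopen

    def has_noindex(url):
        req = Request(url, headers={"User-Agent": "noindex-checker-sketch"})
        with urlopen(req) as resp:
            header = resp.headers.get("X-Robots-Tag", "") or ""
            body = resp.read().decode("utf-8", errors="replace")
        # Simplified match; assumes name="robots" appears before content="..."
        meta = re.search(
            r'<meta[^>]+name=["\']robots["\'][^>]*content=["\']([^"\']*)["\']',
            body,
            re.IGNORECASE,
        )
        directives = (header + "," + (meta.group(1) if meta else "")).lower()
        return "noindex" in directives

    # Example with a placeholder URL:
    # print(has_noindex("https://example.com/some-page?q=xyz"))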
Read the question and answer on LinkedIn:
Why would Google index pages if they can’t even see the content?
Featured image from Shutterstock/Krakenimages.com