Google’s John Mueller answered whether removing pages from a large website helps solve the problem of pages being discovered by Google but not crawled. John offered general insights to solve this problem.
Discovered – Not currently indexed
Search Console is a service provided by Google that communicates search-related issues and feedback.
Indexing status is an important part of Search Console because it tells a publisher how much of a website is indexed and eligible for ranking.
The indexing status of webpages is shown in the Search Console’s Page Indexing Report.
A message that a page was discovered by Google but not indexed is often a sign that a problem needs to be fixed.
There are several reasons why Google might discover a page but refuse to index it, although only one reason is listed in Google’s official documentation.
“Discovered – not currently indexed
The page was found by Google but not yet crawled.
Usually, Google wanted to crawl the URL, but it was expected that this would overload the site. Therefore, Google postponed the crawl.
Because of this, the last crawl date in the report is blank.”
Google’s John Mueller offers other reasons why a page is discovered but not indexed.
Deindex unindexed pages to improve indexing across the site?
The idea is that removing certain pages will help Google crawl the rest of the site by leaving fewer pages to crawl.
Many SEOs believe that Google allocates a limited crawl capacity (a crawl budget) to each website.
Google employees have repeatedly pointed out that there is no crawl budget the way SEOs perceive it.
Google weighs a number of considerations when deciding how many pages to crawl, including the capacity of the website’s server to handle heavy crawling.
One reason Google is picky about crawling is that it doesn’t have enough capacity to store every single webpage on the internet.
Because of this, Google tends to index pages that have some value (if the server can handle it) and not index other pages.
For more information on crawl budget, see: Google Shares Insights into Crawl Budget
This is the question that was asked:
“Would de-indexing and aggregating 8 million used products into 2 million uniquely indexable product pages help improve crawlability and indexability (discovered – currently unindexed issue)?”
Google’s John Mueller first admitted that it was not possible to address the person’s specific problem, and then offered general recommendations.
“It’s impossible to say.
I would recommend reading the big site crawl budget guide in our documentation.
For large websites, crawling is sometimes limited by how your website can handle more crawling.
However, in most cases it is more about the overall quality of the website.
Are you significantly improving the overall quality of your website by going from 8 million pages to 2 million pages?
If you don’t focus on improving the actual quality, it’s easy to spend a lot of time reducing the number of indexable pages without really improving the site, and that wouldn’t improve search.”
Mueller cites two reasons for the “Discovered – not currently indexed” issue
Google’s John Mueller gave two reasons why Google might discover a page but refuse to index it.
- server capacity
- overall quality of the site
1. Server Capacity
Mueller said that Google’s ability to crawl and index webpages “may be limited by how your site can handle more crawling.”
The bigger a website gets, the more bot traffic it takes to crawl it. To make matters worse, Google isn’t the only bot crawling a large website.
There are other legitimate bots, such as those from Microsoft and Apple, that also attempt to crawl the site. In addition, there are many other bots, some legitimate and others related to hacking and data scraping.
This means that, especially in the evening hours, there can be thousands of bots using a large website’s server resources to crawl it.
Because of this, one of the first questions I ask a publisher with indexing issues is the status of their server.
In general, a website with millions of pages or even hundreds of thousands of pages needs a dedicated server or a cloud host (since cloud servers offer scalable resources like bandwidth, CPU, and RAM).
Sometimes a hosting environment may need more memory allocated to a process, such as the PHP memory limit, to help the server handle high traffic and prevent 500 error responses.
Troubleshooting servers involves analyzing a server error log.
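As a starting point for that kind of log review, here is a minimal Python sketch that tallies requests and 5xx responses per crawler. It reads access log lines rather than the error log itself, and it assumes the common Apache/Nginx "combined" log format. The bot labels, IP addresses, and sample lines are illustrative and not taken from any real server. User-agent strings can be spoofed, so treat the output as rough triage rather than proof of genuine crawler traffic.

```python
import re
from collections import Counter

# Matches the Apache/Nginx "combined" log format (an assumption; adjust the
# pattern to the server's actual log format).
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative substrings for labelling crawlers; not an exhaustive list.
BOT_MARKERS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
    "Applebot": "applebot",
}


def label_agent(user_agent):
    """Return a crawler label for a user-agent string, or 'other'."""
    lowered = user_agent.lower()
    for label, marker in BOT_MARKERS.items():
        if marker in lowered:
            return label
    return "other"


def summarize(lines):
    """Count total requests and 5xx responses per crawler label."""
    totals = Counter()
    server_errors = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        label = label_agent(match.group("agent"))
        totals[label] += 1
        if match.group("status").startswith("5"):
            server_errors[label] += 1
    return totals, server_errors


# Made-up sample lines (documentation-range IPs) for demonstration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [10/Oct/2023:21:15:02 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2023:21:15:03 +0000] "GET /p/2 HTTP/1.1" 500 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2023:21:15:04 +0000] "GET /p/3 HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.9 - - [10/Oct/2023:21:15:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    totals, server_errors = summarize(SAMPLE_LINES)
    for label, count in totals.most_common():
        print(f"{label}: {count} requests, {server_errors[label]} with 5xx status")
```

A high share of 5xx responses for a crawler during busy hours would be one hint that server capacity, rather than content, is holding crawling back.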
2. Overall quality of the site
This is an interesting reason for not enough pages being indexed. The overall quality of a website is like a score or rating that Google assigns to a website.
Parts of a website can affect the overall quality of the website
John Mueller said that a section of a website can influence the determination of the overall quality of the website.
“…on some things we look at the quality of the site as a whole.
And when we look at the overall quality of the site, if significant parts are of lower quality, it doesn’t matter to us why they are of lower quality.
…when we find that significant parts are of lower quality, we may think that overall this site isn’t as fantastic as we thought it would be.”
Definition of site quality
Google’s John Mueller provided a definition of website quality in another Office Hours video:
“When it comes to content quality, we don’t just mean the body of your articles.
It really comes down to the quality of your overall website.
And that includes everything from layout to design.
How you present things on your pages, how you integrate images, how you work with speed, all these factors play a role there.”
How long it takes to determine the overall quality of the site
Another fact about site quality is how long Google takes to determine it. It can take months.
“It takes us a lot of time to understand how a website fits into the rest of the internet.
… And that can easily take, I don’t know, a few months, half a year, sometimes even longer than half a year…”
Optimizing a site for crawling and indexing
Optimizing an entire website or a section of it is a high-level way of looking at the problem. In practice, it often comes down to optimizing individual pages at scale.
Especially for ecommerce sites with thousands or millions of products, optimization can take different forms.
What to look out for:
Main menu
Make sure the main menu is optimized to take users to the important sections of the site they are most interested in. The main menu may also contain links to the most popular pages.
Link to popular sections and pages
The most popular pages and sections can also be linked from a prominent section of the homepage.
This helps users get to the pages and sections that are most important to them, but also signals to Google that these are important pages that should be indexed.
Improve thin content pages
Thin content is generally understood to be pages with little useful content or pages that are mostly duplicates of other pages (templated content).
It is not enough to just fill the pages with words. The words and phrases must have meaning and relevance to website visitors.
Products may include dimensions, weight, colors available, suggestions for other products to pair with, brands the products work best with, links to manuals, FAQs, reviews, and other information that is valuable to users.
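One way to apply this at scale is to audit product records for the kinds of details listed above. The following Python sketch is a rough illustration only: the field names, the sample records, and the threshold of three missing fields are hypothetical choices, not anything Google has published. A completeness check like this is a first pass, since filled-in fields still need to be meaningful to visitors.

```python
# Hypothetical field names modelled on the product details listed above.
USEFUL_FIELDS = (
    "dimensions",
    "weight",
    "colors",
    "pairs_with",
    "manual_url",
    "faqs",
    "reviews",
)


def missing_fields(product):
    """Return the useful fields that are absent or empty for a product."""
    return [name for name in USEFUL_FIELDS if not product.get(name)]


def flag_thin_pages(products, max_missing=3):
    """Map product URL to its missing fields when more than max_missing are absent."""
    flagged = {}
    for product in products:
        missing = missing_fields(product)
        if len(missing) > max_missing:
            flagged[product["url"]] = missing
    return flagged


# Made-up catalog records for demonstration only.
SAMPLE_PRODUCTS = [
    {
        "url": "/products/desk-lamp",
        "dimensions": "15 x 15 x 40 cm",
        "weight": "1.2 kg",
        "colors": ["black", "white"],
        "manual_url": "/manuals/desk-lamp.pdf",
        "reviews": ["Bright and sturdy."],
    },
    {
        "url": "/products/used-cable",
        "weight": "0.1 kg",
        "colors": [],
    },
]

if __name__ == "__main__":
    for url, missing in flag_thin_pages(SAMPLE_PRODUCTS).items():
        print(f"{url} is missing: {', '.join(missing)}")
```

Pages flagged this way are candidates for enrichment or consolidation, which is closer to the quality improvement Mueller describes than simply reducing the page count.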
Solving crawled-but-not-indexed issues can also mean more online sales
In a physical store, simply putting the products on the shelves seems to be enough.
But the reality is that it often takes knowledgeable salespeople to make these products fly off the shelves.
A website can play the role of a knowledgeable salesperson that can tell Google why the page should be indexed and help customers choose those products.
Watch the Google SEO Office Hours video, starting at the 13:41 mark: