Optimizing a website's SEO involves deindexing unnecessary or low-quality pages. According to an Ahrefs study, 90.63% of the pages indexed in Google receive no organic traffic at all. In other words, they are never found through search engines.
On September 1, 2019, Google ended support for the noindex directive in the robots.txt file. Likewise, if you are using the crawl-delay or nofollow commands within the robots.txt file, you may want to use them in their proper place instead, as explained below.
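For reference, a rule like the first one below is the kind that stopped working; blocking the crawl is still supported, but note that it does not deindex a page by itself (the path here is purely illustrative):

```
# No longer supported by Google since September 1, 2019:
User-agent: *
Noindex: /ebook-download/

# Still supported — blocks crawling, but does NOT deindex on its own:
User-agent: *
Disallow: /ebook-download/
```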
How to check if a page is indexed?
A site: search (e.g. site:yourdomain.com) is a good place to start, though it is not always accurate. The cached version of a page corresponds to the page as it was when the crawler last visited. So even if a page has been deindexed, the cached version may still show up, along with any canonicals or redirects.
For example, a search for site:adwords.com currently shows a result that 301-redirects to Google Ads. This means you cannot always rely on this operator when checking a page's index status.
To check the index status of a page, try the URL Inspection tool and the Index Coverage report. These Search Console features give you an accurate index status, the last crawl date, the detected meta robots tag, referencing sitemaps, etc.
Scenarios for deindexing
Are there pages on your site that you would like to block so that they do not appear in Google? Review the cases below; one of them may apply to you.
1- Are there any confidential pages? For example, the download page of a free ebook that you don't want people to reach without registering on your site.
2- Are there any pages under maintenance? For example, during a site redesign, pages that aren't finalized and that you don't want showing up in search engines yet.
3- Are there any duplicate pages that could be flagged as "duplicate content" and penalized by Google? In that case, choose which version of the page you want indexed in Google.
4- Are there any pages that don't really matter to your users? For example, the "legal notice" page. When you create your site, it is often one of the first pages Google indexes. But it would be a shame if a search user entered your site through the back door, wouldn't it?
Robots meta tag
The noindex robots meta tag in a page's HTML code lets you deindex that page with ease. Place it in the <head> section of the HTML, as below, and the page will no longer be indexed by search engines.
<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex" />
</head>
<body>(…)</body>
</html>
This process is simple to set up: the meta tag can be managed by your CMS or by one of its plugins. It is the ideal solution when you only have a few pages to deindex. However, for large sites where many documents are affected by the end of the noindex directive in robots.txt, the method below is more suitable.
X-Robots-Tag HTTP header
The X-Robots-Tag can be used as an element of the HTTP response header. It is less well known than the meta noindex tag, and yet, sent in the page's headers, it produces the same result. To implement it on an Apache server, add the X-Robots-Tag directive to the .htaccess or httpd.conf file.
The advantage of this alternative lies in its ability to process a whole set of documents at once. You can, for example, deindex all of a site's PDF files with the example code below.
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
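If your site runs on nginx rather than Apache, a sketch of the equivalent rule (assuming a standard nginx setup with the add_header directive) would go in the server block:

```
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow";
}
```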
Note that indexing must be dissociated from crawling: for the noindex directive to be seen at all, make sure the page in question is not blocked from crawling in the robots.txt file.
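To verify that a page actually carries one of these directives, a quick check can be scripted. The sketch below (plain Python; the function name and inputs are my own) inspects a response's headers and HTML for noindex signals:

```python
import re

def noindex_signals(headers, html):
    """Return which noindex signals are present in an HTTP response.

    headers: dict mapping response header names to values
    html: the page's HTML source as a string
    """
    signals = []
    # X-Robots-Tag HTTP header (header names are case-insensitive)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            signals.append("x-robots-tag")
    # <meta name="robots" content="... noindex ..."> in the HTML
    meta = re.search(
        r'<meta[^>]+name\s*=\s*["\']robots["\'][^>]*content\s*=\s*["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if meta and "noindex" in meta.group(1).lower():
        signals.append("meta-robots")
    return signals

# Example: a page carrying both signals
headers = {"Content-Type": "text/html",
           "X-Robots-Tag": "noindex, nofollow"}
html = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(noindex_signals(headers, html))  # ['x-robots-tag', 'meta-robots']
```

In a real audit you would fetch the page first (e.g. with urllib or requests) and pass the live headers and body to the function.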
Use HTTP 404 and 410 status codes
Another solution recommended by Google is to serve HTTP status code 404 ("Not Found") or 410 ("Gone") on the pages to be deindexed. A page returning one of these codes is considered dead in the eyes of search engines and drops out of the index once recrawled.
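On Apache, for example, a 410 can be returned for a retired page with the mod_alias Redirect directive (the path here is illustrative):

```
# .htaccess — respond 410 Gone for a page that should drop out of the index
Redirect gone /old-free-ebook.html
```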
Protect a page with a login and a password
The third alternative offered by Google is to require a login and password to access a page. Without the key, search engine crawlers are blocked from entering and therefore cannot index it.
But be careful: this solution is mainly viable for development sites, subscription-based content, confidential pages, and the like.
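On Apache, for instance, a directory can be protected with HTTP Basic Auth; the sketch below assumes a password file already created with the htpasswd utility (the file path is an assumption):

```
# .htaccess — require a login for everything in this directory
AuthType Basic
AuthName "Members only"
AuthUserFile /var/www/.htpasswd
Require valid-user
```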
Disallow in robots.txt
The disallow directive tells search engines that a page or set of pages is not meant to be crawled. When Googlebot visits, it is blocked from crawling those pages. That said, URLs that are already indexed may remain in Google search longer than intended, since Google can no longer crawl them to see a noindex.
For this reason, this method is best reserved for resources such as images, PDFs, and other documents that you never want crawled in the first place.
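A robots.txt rule blocking the crawl of all PDF files might look like this (Google supports the * and $ wildcards; some other crawlers may not):

```
User-agent: *
Disallow: /*.pdf$
```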
Delete via Search Console
Submit a URL removal request through Search Console. The request is subject to validation, which usually takes about a day, and can sometimes be refused.
This solution is useful in an emergency. It suspends the display of a page in Google search results for up to 90 days, giving you time to fix the underlying issue. After this period, if nothing else has changed, your content will reappear.
As you have just seen, the alternatives proposed by Google are enough to deindex a web page, whichever one fits your goal.
Before choosing a method, ask yourself the following: do you have pages that are already deindexed? Do you need to deindex pages but still keep them accessible to users? What type of file are you looking to remove from the search results? Based on your answers, you will naturally choose the right way to deindex a page from Google.
Tell us in the comments why you need to deindex pages from your website!