Search crawlers try to index non-existent pagination pages

Started by sandomatyas, February 10, 2022, 19:29:40 PM

sandomatyas

Hi

I have a site with ~15,000 products and several hundred categories and manufacturers.
I checked the access logs and found a huge number of records like this:
66.249.66.158 - - [06/Feb/2022:03:29:24 +0100] "GET /manufacturer/yelowstone/discs/new-discs/by,price/dirAsc/results,4841-4940?keyword= HTTP/1.1" 200 211370 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" (-) 3929289

As far as I can see, this is the product list for one manufacturer in a specific category, which is fine. But it is also sorted by price and asks for products 4841-4940. The problem is that /manufacturer/yelowstone/discs/new-discs contains only two products, so these pagination values are not needed.
But I think I know where it comes from:
When you open a category you get an order-by field, a manufacturer filter and a pagination. If the category has a lot of products, you get lots of pages. Just open a page deep in the list and you'll get something like results,4841-4940
But if you apply a manufacturer filter after this, it keeps the results parameter even when there are no products in that range.

The list is created by the orderbymanu sublayout. I'm not sure what the proper way to handle this would be:
- remove limitstart, limit parameters?
- add rel="nofollow"?
- something else?

sirius

Hi

you should first check if you have a canonical link for those pages.

For example, my page is: /papiers/by,product_sku?language=fr-FR&keyword=
My canonical for this page is: /papiers

If not, the bots will index those pages, although Googlebot usually does not index a page that has parameters (?keyword=).
In any case, it knows how to recognize and manage them.

But to stop the bots from indexing those types of pages, you can simply add this to your .htaccess if you want:
RewriteCond %{REQUEST_URI} (.*)/by,(.*)$ [NC]
RewriteRule ^.*$ - [ENV=NOINDFO:true]
Header set X-Robots-Tag "noindex, follow" env=NOINDFO


So the bot will still follow the links, which is fine, but will not index those pages anymore.

You can add several conditions in the same block if you need to, for example for your link:
RewriteCond %{REQUEST_URI} (.*)/by,(.*)$ [NC,OR]
RewriteCond %{REQUEST_URI} (.*)/results,(.*)$ [NC]
RewriteRule ^.*$ - [ENV=NOINDFO:true]
Header set X-Robots-Tag "noindex, follow" env=NOINDFO
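
Before deploying rules like these, it can help to check which paths the two conditions would actually flag. The following is only a rough sketch: Python's re module stands in for Apache's regex engine (an approximation that should hold for patterns this simple), the function name is made up for illustration, and the paths are the ones quoted earlier in this thread. Note that %{REQUEST_URI} does not include the query string, so ?keyword= plays no part in the match.

```python
import re

# Patterns copied from the RewriteCond lines above; [NC] maps to re.IGNORECASE.
NOINDEX_PATTERNS = [
    re.compile(r"(.*)/by,(.*)$", re.IGNORECASE),
    re.compile(r"(.*)/results,(.*)$", re.IGNORECASE),
]

def would_be_noindexed(path: str) -> bool:
    """Return True if any pattern matches the URL path (query string excluded)."""
    # RewriteCond patterns are unanchored, so re.search is the closest equivalent.
    return any(p.search(path) for p in NOINDEX_PATTERNS)

# Path from the access log in the first post: matches both conditions.
print(would_be_noindexed(
    "/manufacturer/yelowstone/discs/new-discs/by,price/dirAsc/results,4841-4940"
))
# Sorted listing from the canonical example: matches the /by, condition.
print(would_be_noindexed("/papiers/by,product_sku"))
# Plain category page: matches neither, so no noindex header would be sent.
print(would_be_noindexed("/papiers"))
```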


And do the same for other pages that you do want the bots to index and follow, for example:
RewriteCond %{REQUEST_URI} (.*)/new-discs,(.*)$ [NC]
RewriteRule ^.*$ - [ENV=INDFO:true]
Header set X-Robots-Tag "index, follow" env=INDFO


Regards

J3.10.12 | PHP 7.4.33 + APC + memcached + Opcode
VM Prod : 3.8.6 | VM Test : 4.0.12.10777

pinochico

Or you can go to the view override and add:

//SEO Analyse
$document = JFactory::getDocument();
$document->setMetaData('robots', "noindex, nofollow");
//END

And for the right canonical links you can use the great app Jmap.
www.minijoomla.org  - new portal for Joomla!, Virtuemart and other extensions
XML Easy Feeder - feeds for FB, GMC,.. from products, categories, orders, users, articles, acymailing subscribers and database table
Virtuemart Email Manager - customs email templates
Import products for Virtuemart - from CSV and XML
Rich Snippets - Google Structured Data
VirtueMart Products Extended - Slider with products, show Others bought, Products by CF ID and others filtering products

sandomatyas

Hi, thanks for the reply. Maybe the post's subject is misleading. Basically the problem is not about the crawlers, and I also have canonical URLs. My question is: why does VirtueMart generate unnecessary pagination URLs when there is no need for them?
Again, the steps:

  • open a category page with lots of products. The pagination value is e.g. 50 products/page
  • open the 10th page, you get results,451-500
  • select a manufacturer from the top filter.
It will filter the current category by the selected manufacturer. If the manufacturer has only, say, 60 products in the category, you only need page 1 and page 2, but you land on page 10, which is empty and unnecessary. What is the point of that? Why do we link to a page far beyond the product count?
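
The page arithmetic behind this example can be written out as a small Python sketch. The function names are made up for illustration and are not part of VirtueMart; the numbers are the ones from the steps above (60 products, 50 per page, landing on results,451-500).

```python
import math

def page_count(total: int, per_page: int) -> int:
    """Number of pages needed to list `total` products."""
    return max(1, math.ceil(total / per_page))

def is_beyond_last_item(total: int, first_item: int) -> bool:
    """True if a results,<first>-<last> range starts past the last product.

    `first_item` is 1-based, as in results,451-500.
    """
    return first_item > total

# 60 products at 50 per page need only 2 pages.
print(page_count(60, 50))
# The 10th page starts at item 451, which is past the 60th product.
print(is_beyond_last_item(60, 451))
```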

So in my opinion the manufacturer filter in the category list should open the first page of the list, which means the links in the manufacturer filter list should not contain the limit and limitstart parameters.

pinochico

Quote
- open the 10th page, you get results,451-500
- select a manufacturer from the top filter.

Select manufacturer on 10th page?

The error is in the manufacturer select: before filtering, it must delete the info about the earlier filtering and pagination from the cookies or session, and set the pagination to the first page for the manufacturer.
I think this is work for the VM dev team.
But I haven't tested it on a clean VM installation :(

sandomatyas

yup, this is for the dev team. Do they read the forum or should I open a new topic in the dev part of the forum?

BTW I tested with the latest VM version and the demo data, and the issue exists there too.
Steps to reproduce with the core:

  • open the shop, it shows 86 products ( https://snipboard.io/1qyAbD.jpg )
  • filter the products with the manufacturer filter for 'producer'. Result: 1-5 of 5, so there are 5 products there, url: /index.php/shop/manufacturer/producer/dirAsc?keyword=
  • go back to the full list
  • open the 4th page, the url is: /index.php/shop?start=72
  • filter for the producer manufacturer again. It shows there are 5 products like in the second step but the url is /index.php/shop/manufacturer/producer/dirAsc/results,73-96?keyword=
  • it seems VM wants to show the 5 products starting from 73

In VM 3.8.9 there is a getManufacturerOrderByList method in the product model and it has a $fieldLink parameter which is $fieldLink = vmURI::getCurrentUrlBy('request', false, true, array('orderby','dir'));
I think the last array should also contain 'limitstart' (and maybe 'limit' as well), because even if I start filtering from the nth page, the result should open on the first page.
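
To illustrate the idea, here is a rough Python sketch of a link builder that drops a list of request keys. It is not VirtueMart code: build_filter_link and the sample parameters are hypothetical, and it assumes that the last argument of vmURI::getCurrentUrlBy is a list of keys to leave out of the generated URL, which is how the quoted call reads.

```python
from urllib.parse import urlencode

def build_filter_link(base: str, request_params: dict, exclude: tuple) -> str:
    """Rebuild a link from the current request, leaving out the excluded keys."""
    kept = {k: v for k, v in request_params.items() if k not in exclude}
    query = urlencode(kept)
    return f"{base}?{query}" if query else base

# Hypothetical request state while viewing the 4th page of the shop.
current = {"orderby": "price", "dir": "Asc", "limitstart": "72", "limit": "24", "keyword": ""}

# Excluding only orderby and dir keeps the pagination in the filter link.
print(build_filter_link("/shop", current, ("orderby", "dir")))
# Also excluding limitstart and limit makes the filter link open the first page.
print(build_filter_link("/shop", current, ("orderby", "dir", "limitstart", "limit")))
```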

In previous versions there is an if($key=='dir' or $key=='orderby') continue; line in getOrderByList
I added limitstart and limit there and it had a huge impact on site performance on big sites.
The most extreme scenario:
The category has 6000 products with 25 products/page. It has subcategories, and it shows the products from the subcategories as well. There are ~1000 manufacturers in it, most of which have only 1 or 2 products in the category.
Without manufacturer filtering it has 6000/25=240 pages for the pagination. Let's assume Google indexes these 240 pages and the 1000 filter pages as well: that's 240 + 1000 = 1240
But if the manufacturer filter URLs contain the limitstart parameter, each filter link is a different URL on every pagination page, so it's 240*1000=240 000 pages to index.
We constantly had server crashes and huge load no matter what kind of cache we used (because there is no cache for the non-existent pages). I removed the $limitstart from the manufacturer filter URL and we have had zero crashes since then.
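
The URL counts in this scenario can be checked with a few lines of Python. This is just the arithmetic from the post; the variable names are illustrative and the figures are the approximate ones quoted above.

```python
import math

products = 6000        # products in the category, subcategories included
per_page = 25          # products per page
manufacturers = 1000   # approximate number of manufacturer filter links

pages = math.ceil(products / per_page)

# Filter links always open page 1: pagination and filters add up.
urls_without_limitstart = pages + manufacturers

# Filter links carry limitstart: every filter link differs on every page.
urls_with_limitstart = pages * manufacturers

print(pages, urls_without_limitstart, urls_with_limitstart)
```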

Also, the filter should be a <select> instead of an HTML list, because if you have a lot of categories the list runs way below the bottom of the browser window.