I can't briefly summarize the essence of the problem because, honestly, I don't understand what exactly it is. I can only describe my observations and hope to get at least some clues here.

There is a brand-cosmetics site, http://iledebeaute.ru . I have working code that traverses the catalog and pulls all the information from it. The problem is that in some fraction of cases the XPath selectors return empty results, and only the selectors: the other half of the information is parsed from the regular JSON embedded in the page, and that part is always fine. Moreover, the problem products are different on every run; no pattern is visible.

Here is what I have tried:

- I collected cookies from the page with Postman and passed them along manually on repeat requests.
- I tried changing my local IP and hooked up Crawlera in the Scrapinghub cloud. (In principle that would be relevant if products stopped being collected at some point because of an IP ban, but here it fails selectively and randomly.)
- I wrote the HTML to a file. This is where the mysticism began: the file looks perfectly fine, so the markup really is all there. Then I made a second file and wrote into it only what a selector like `//*` or `//html` returns: about 900 lines of the needed markup magically vanish into thin air. How? I don't get it.

Of course, I checked that the selectors themselves work. And here is more mysticism: a page can suddenly start parsing normally out of the blue. I just re-run the same code locally, and at some point all the selectors "turn on" and everything abruptly becomes fine, and then I have to hunt for another failing page. Any ideas what the problem is?

Minimal spider code:

```python
from scrapy import Spider, Request  # original imported Request, Selector but used Spider

PRODUCTS_SELECTOR = "//div[@class='b-showcase__item']//p[@class='b-showcase__item__link']/a/@href"
TITLE_SELECTOR = "//span[@class='b-product-detail__description']/text()"
MAC_TITLE_SELECTOR = "//h1[@class='product__name']/text()"
DIOR_TITLE_FIRST_SELECTOR = "//div[@class='b-product__item__item__title']/h1/text()"
DIOR_TITLE_SECOND_SELECTOR = "//div[@class='b-product__item__promo__title']/h1/text()"


class IleDeBeauteSpider(Spider):
    name = 'iledebeaute'  # a spider needs a name; missing in the original snippet
    start_urls = ["http://iledebeaute.ru/shop/care/face/lips/"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=f'{url}?perpage=72', callback=self.parse,
                          meta={'start_url': url})

    def parse(self, response):
        start_url = response.meta.get('start_url', None)
        page = response.meta.get('page', 1)
        # Keep paginating while the listing page still yields product links.
        if response.xpath(PRODUCTS_SELECTOR):
            yield Request(url=self.get_url(f'{start_url}page{page + 1}/?perpage=72'),
                          callback=self.parse,
                          meta={'page': page + 1, 'start_url': start_url})
        for url in response.xpath(PRODUCTS_SELECTOR).extract():
            yield Request(url=self.get_url(url), callback=self.parse_product)

    def parse_product(self, response):
        # Try brand-specific title selectors, falling back to the generic one.
        title = response.xpath(MAC_TITLE_SELECTOR).extract_first()
        if not title:
            title = response.xpath(DIOR_TITLE_FIRST_SELECTOR).extract_first()
        if not title:
            title = response.xpath(DIOR_TITLE_SECOND_SELECTOR).extract_first()
        if not title:
            title = response.xpath(TITLE_SELECTOR).extract_first()
        if not title:
            print('This product has no title: ', response.url)
            yield {'url': response.url, 'error': 'no title'}
```
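To narrow down why the selectors come up empty only sometimes, it helps to save the raw bytes of every "bad" response so they can be diffed against a "good" one offline. A minimal, hypothetical helper (the function name and output directory are my own, not from the original post), stdlib only:

```python
import hashlib
import os


def dump_failed_response(url, status, headers, body, out_dir="failed_pages"):
    """Write the raw response body to disk and return a summary dict,
    so pages with empty selectors can be compared byte-for-byte with
    pages that parsed correctly."""
    os.makedirs(out_dir, exist_ok=True)
    # Hash the URL to get a stable, filesystem-safe file name.
    fname = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(out_dir, fname)
    with open(path, "wb") as f:
        f.write(body)
    return {
        "url": url,
        "status": status,
        "content_type": headers.get("Content-Type"),
        "body_length": len(body),
        "saved_to": path,
    }
```

In `parse_product` this could be called in the `if not title:` branch, passing `response.url`, `response.status`, the decoded headers, and `response.body`.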
  • Without looking at the code, I'll offer this recommendation: when you find a page you couldn't extract anything from, log the whole response: status code, headers, body. UPD: ideally, in the parse method you should check the response status from the server. Maybe you are trying to parse pages that came back as 500 or 403. - rusnasonov
  • It's always 200. Always. And there are no obvious errors in the responses either. - Andrei Tupic
  • Would you swear to that?) Still, it's better to log everything anyway, and then dig through the logs when the error shows up again. - rusnasonov
  • A little late, but I'll tell you what it was. It turned out the server was returning malformed HTML, and all the normal parsers choked on it. I had to use the clunky bs4 for parsing. - Andrei Tupic
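For readers who hit the same wall: the fix described in the last comment can be sketched with BeautifulSoup's lenient `html.parser` backend, which recovers from markup that stricter lxml-based selectors may trip over. The sample markup below is illustrative only (the stray `</p>` stands in for whatever the server was actually mangling):

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: a stray </p> closes a tag that was never opened.
broken = ("<html><body>"
          "<span class='b-product-detail__description'>Lipstick</p></span>"
          "</body></html>")

# html.parser is forgiving: it drops the bogus </p> instead of losing the subtree.
soup = BeautifulSoup(broken, "html.parser")
node = soup.find("span", class_="b-product-detail__description")
title = node.get_text(strip=True) if node else None
```

Inside a Scrapy callback the same idea applies by building the soup from `response.text` when the XPath selectors return nothing.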
