I'm trying to write a simple spider to pull data from a site. The task is to collect, for each registered user, their name and the address of their personal website.

Site structure: a page of clickable member tiles that show only the name; to get the website address I have to open each member's profile and go back. Paging through the member list, i.e. flipping from user to user, is implemented and works fine. There are no problems with the XPath and CSS selectors either, all my checks pass, but the spider refuses to work. I get the URL of the next user's profile page and make a request to it, which is handled by a separate callback that should parse both the name and the link. But nothing is output; instead Scrapy crashes with an error. Which way should I look?

    import scrapy
    from scrapy.http import Request, Response, HtmlResponse
    from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class VuSpider(CrawlSpider):
        name = "vu2"
        allowed_domains = ["voiceoveruniverse.com"]
        start_urls = ['http://www.voiceoveruniverse.com/profiles/members/']

        def parse(self, response):
            base_url = 'http://voiceoveruniverse.com'
            SET_SELECTOR = 'div.member_item'
            for actor in response.css(SET_SELECTOR):
                NAME_SELECTOR = 'h5 a ::text'
                URL_SELECTOR = 'h5 a ::attr(href)'
                PROFILE_URL = base_url + actor.css(URL_SELECTOR)[0].extract()
                yield scrapy.Request(
                    PROFILE_URL,
                    callback=self.parse_item
                )
                # yield {
                #     'name': actor.css(NAME_SELECTOR).extract_first()
                # }

            NEXT_PAGE_SELECTOR = '//a[contains(text(), "Next")]/@href'
            next_page = response.xpath(NEXT_PAGE_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )

        def parse_item(self, response):
            yield {
                'name': response.xpath('//dl/dt/span/text()')[0].extract(),
                'URL': response.xpath('//dd/a/@href')[-1].extract()
            }

The error itself:

    2017-01-09 17:37:50 [scrapy] ERROR: Spider error processing <GET http://voiceoveruniverse.com/profile/NathanCole?xg_source=profiles_memberList> (referer: http://www.voiceoveruniverse.com/profiles/friend/list?page=3&xg_source=profiles_memberList_top_next)
    Traceback (most recent call last):
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
        yield next(it)
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
        for x in result:
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/igor/Загрузки/UpWork/vu2/vu2/spiders/vu2.py", line 46, in parse_item
        'name':response.xpath('//dl/dt/span/text()')[0].extract(),
      File "/home/igor/anaconda3/lib/python3.5/site-packages/parsel/selector.py", line 58, in __getitem__
        o = super(SelectorList, self).__getitem__(pos)
    IndexError: list index out of range
  • I checked the selectors in the Scrapy shell. Everything works as expected. - sky
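For reference, the same check can be reproduced in the Scrapy shell against the exact URL from the error message, rather than a hand-picked profile page; the selectors below are copied from parse_item and otherwise unverified:

    $ scrapy shell 'http://voiceoveruniverse.com/profile/NathanCole?xg_source=profiles_memberList'
    >>> response.xpath('//dl/dt/span/text()').extract()
    >>> response.xpath('//dd/a/@href').extract()

If either of these comes back as an empty list, indexing it with [0] or [-1] is exactly what raises the IndexError above.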

2 answers

It turned out the problem wasn't wrong XPath selectors at all. All that was needed was to change base_url = 'http://voiceoveruniverse.com' to base_url = 'http://www.voiceoveruniverse.com'. Apparently the profile pages served without the www prefix don't contain the expected markup, so the selectors in parse_item returned empty lists and indexing them raised the IndexError.
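If you want the spider to fail soft while debugging problems like this, a defensive variant of parse_item (a sketch using the same selectors as in the question) yields an inspectable item with None fields instead of crashing:

    def parse_item(self, response):
        name = response.xpath('//dl/dt/span/text()').extract_first()
        links = response.xpath('//dd/a/@href').extract()
        yield {
            'name': name,                         # None when the element is missing
            'URL': links[-1] if links else None,  # original code took the last match
        }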

The problem is that your spider inherits from CrawlSpider, and in that case you must not override the parse method, which CrawlSpider itself uses to implement its crawling logic: https://doc.scrapy.org/en/latest/topics/spiders.html. Either inherit from the plain Spider class or rename your parse method (see the sketch at the end of this answer). You can also drop the base_url variable entirely and use response.urljoin() to build a correct absolute link. For example,

    URL_SELECTOR = 'h5 a ::attr(href)'
    PROFILE_URL = base_url + actor.css(URL_SELECTOR)[0].extract()

can be rewritten as

    LINK = actor.css('h5 a ::attr(href)').extract()[0]
    PROFILE_URL = response.urljoin(LINK)
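The advantage of response.urljoin() is that it resolves the href against the URL of the response actually received, so relative links work and the scheme and host (www included) stay consistent with the page being parsed, which is exactly what the hard-coded base_url got wrong.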

You should also read the basic tutorial on the official site, https://doc.scrapy.org/en/latest/intro/tutorial.html; it has good examples of building simple spiders.
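Putting both suggestions together, here is a minimal sketch of the spider: it inherits from plain Spider (no Rules are used, so CrawlSpider isn't needed), the profile callback gets the non-reserved name parse_profile, and all URLs are built with response.urljoin(). The selectors are copied from the question and otherwise unverified:

    import scrapy


    class VuSpider(scrapy.Spider):
        name = "vu2"
        allowed_domains = ["voiceoveruniverse.com"]
        start_urls = ['http://www.voiceoveruniverse.com/profiles/members/']

        def parse(self, response):
            for actor in response.css('div.member_item'):
                link = actor.css('h5 a ::attr(href)').extract_first()
                if link:
                    # urljoin resolves the href against the page actually
                    # fetched, so the www host is preserved automatically
                    yield scrapy.Request(response.urljoin(link),
                                         callback=self.parse_profile)

            next_page = response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

        def parse_profile(self, response):
            links = response.xpath('//dd/a/@href').extract()
            yield {
                'name': response.xpath('//dl/dt/span/text()').extract_first(),
                'URL': links[-1] if links else None,  # original code took the last match
            }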