I'm trying to write a simple spider to pull data from the site. The task is to collect, for each registered user, their name and the address of their personal website.
Site structure: members are shown as clickable tiles that contain only the name, so to get the website address I have to open each member's profile and then go back. Paging through the member list on the original site is implemented and works. The XPath and CSS selectors also seem fine: every check I run on them passes, yet the spider refuses to work. I take the URL of each user's page and make a request to it; the response is handled by a separate function that should parse both the name and the link. However, nothing is output; instead Scrapy fails with an error. Which way should I look?
    import scrapy  # needed for the scrapy.Request calls below
    from scrapy.http import Request, Response, HtmlResponse
    from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class VuSpider(CrawlSpider):
        name = "vu2"
        allowed_domains = ["voiceoveruniverse.com"]
        start_urls = ['http://www.voiceoveruniverse.com/profiles/members/']

        def parse(self, response):
            base_url = 'http://voiceoveruniverse.com'
            SET_SELECTOR = 'div.member_item'
            # Each tile on the list page links to one member profile
            for actor in response.css(SET_SELECTOR):
                NAME_SELECTOR = 'h5 a ::text'
                URL_SELECTOR = 'h5 a ::attr(href)'
                PROFILE_URL = base_url+actor.css(URL_SELECTOR)[0].extract()
                yield scrapy.Request(
                    PROFILE_URL,
                    callback=self.parse_item
                )
                # yield {
                #     'name': actor.css(NAME_SELECTOR).extract_first()
                # }

            # Follow the "Next" link to the following page of the member list
            NEXT_PAGE_SELECTOR = '//a[contains(text(), "Next")]/@href'
            next_page = response.xpath(NEXT_PAGE_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )

        def parse_item(self, response):
            # Profile page: take the name and the last link in the details list
            yield {
                'name':response.xpath('//dl/dt/span/text()')[0].extract(),
                'URL':response.xpath('//dd/a/@href')[-1].extract()
            }

The error itself:
    IndexError: list index out of range
    2017-01-09 17:37:50 [scrapy] ERROR: Spider error processing <GET http://voiceoveruniverse.com/profile/NathanCole?xg_source=profiles_memberList> (referer: http://www.voiceoveruniverse.com/profiles/friend/list?page=3&xg_source=profiles_memberList_top_next)
    Traceback (most recent call last):
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
        yield next(it)
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
        for x in result:
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/igor/anaconda3/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/igor/Загрузки/UpWork/vu2/vu2/spiders/vu2.py", line 46, in parse_item
        'name':response.xpath('//dl/dt/span/text()')[0].extract(),
      File "/home/igor/anaconda3/lib/python3.5/site-packages/parsel/selector.py", line 58, in __getitem__
        o = super(SelectorList, self).__getitem__(pos)
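
The traceback points at the `[0]` index in `parse_item`, so the XPath for the name apparently matched nothing on that particular profile page, and indexing an empty result raises `IndexError`. Below is a minimal sketch of a guard against that. It assumes the two lists are what `response.xpath(...).extract()` returns. The helper names `first_or_none`, `last_or_none` and `build_item` are hypothetical and not part of Scrapy, and the sample values are made up.

```python
def first_or_none(values):
    """Return the first element of a list, or None if the list is empty."""
    return values[0] if values else None


def last_or_none(values):
    """Return the last element of a list, or None if the list is empty."""
    return values[-1] if values else None


def build_item(names, links):
    """Build the item dict without raising IndexError on empty matches.

    `names` and `links` stand in for the lists of strings that
    response.xpath(...).extract() would return inside parse_item.
    """
    return {
        'name': first_or_none(names),
        'URL': last_or_none(links),
    }


# A profile page where both expressions matched (made-up sample values).
print(build_item(['Jane Doe'], ['http://a.example', 'http://b.example']))

# A profile page where neither expression matched: None fields, no exception.
print(build_item([], []))
```

In the spider this would be called as `build_item(response.xpath('//dl/dt/span/text()').extract(), response.xpath('//dd/a/@href').extract())`. Items that come back with `None` fields then show which profile pages have a layout the XPath does not cover. For the first match only, `extract_first()`, already used for `next_page` above, behaves similarly and returns `None` when nothing matches.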