Good day! I need to write a spider in Python 2.7 using the Scrapy framework to parse cinemas. The page to start parsing from: http://kino-kassa.ru/vse-kinoteatry-rossii/


The spider should follow each city link and, on the resulting page, parse the cinema's name, address, number of seats, and number of halls. The spider must parse every city.

I have figured out how to extract data with XPath from the page the spider is currently on.
But I have not understood how to set up the rules (Rules) that control which links the spider follows from page to page.

Here is my spider code:

    # -*- coding: utf-8 -*-
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.loader.processors import TakeFirst, Identity
    from scrapy.loader import ItemLoader
    from scrapy.selector import HtmlXPathSelector, Selector

    from crawl.items import cinemaItem


    class cinemaLoader(ItemLoader):
        default_output_processor = Identity()


    class cinemaSpider(CrawlSpider):
        name = "abiturlist"
        allowed_domains = ["kino-kassa.ru"]
        start_urls = ["http://kino-kassa.ru/vse-kinoteatry-rossii/"]

        rules = (
            Rule(LinkExtractor(allow=('kinoteatry-*')), callback='parse_item'),
        )

        def parse_item(self, response):
            hxs = Selector(response)
            # all = hxs.xpath(".//*[@id='content']/div/div[1]/div[3]/h2/a/text()")
            # all = hxs.xpath(".//*[@class='post']")
            all_name_cinema = hxs.xpath("//span[text()='Адрес кинотеатра: ']/../../..//div[@class='post-title']/*/a/text()").extract()
            all_address_cinema = hxs.xpath("//span[text()='Адрес кинотеатра: ']/../text()[1]").extract()
            all_count_of_seats_cinema = hxs.xpath("//span[text()='Количество залов: ']/../text()[9]").extract()
            all_count_of_halls_cinema = hxs.xpath("//span[text()='Количество залов: ']/../text()[7]").extract()

            # for fld in all:
            #     Item = cinemaItem()
            #     FIO = fld.xpath("./td[2]/p/text()").extract()[0].split()
            #     Item['family'] = FIO[0]
            #     Item['name'] = FIO[1]
            #     Item['surname'] = FIO[2]
            #     Item['spec'] = fld.xpath("./td[last()]/p/text()").extract()[0]
            #     ball = fld.xpath("string(./td[3]/p)").extract()[0]
            #     Item['ball'] = ball
            #     Item['url'] = response.url
            #     Item['pagespec'] = pg_spec
            #     yield Item

            i = 0
            while i < len(all_name_cinema):
                Item = cinemaItem()
                Item['name'] = all_name_cinema[i].split()
                Item['address'] = all_address_cinema[i].split()
                Item['count_of_seats'] = all_count_of_seats_cinema[i].split()
                Item['count_of_halls'] = all_count_of_halls_cinema[i].split()
                yield Item
                i += 1
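The crawl/items.py module is not shown in the question; judging by the fields the spider assigns, cinemaItem presumably looks something like this minimal sketch (an assumption, inferred from the field names above):

    # crawl/items.py -- assumed layout, inferred from the fields used in the spider
    # -*- coding: utf-8 -*-
    import scrapy


    class cinemaItem(scrapy.Item):
        name = scrapy.Field()
        address = scrapy.Field()
        count_of_seats = scrapy.Field()
        count_of_halls = scrapy.Field()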


I apologize for the poor formatting of the posted code; I have not yet learned how to format code properly on Stack Overflow. Thanks in advance, I hope you can help me.

1 answer

Problem solved.

    rules = (
        Rule(LinkExtractor(allow=('kino-kassa.ru/category/kinoteatr*')), callback='parse_item'),
    )

    Provides a "normal" running on the links that are on the page start_urls = ["http://kino-kassa.ru/vse-kinoteatry-rossii/"] without crawling nested and "unnecessary" links