Greetings.

I am just starting to learn Python.

I loaded the page I want to scrape into BeautifulSoup.

The HTML I am interested in:

 <li class="cinema-city">
   <div class="city-caption"> Кривой Рог</div>
   <ul class="cinemas-list">
     <li class="cinema" data-alias="kinoodessa-kr-rog"> <a href="/cinema/kinoodessa-kr-rog">Одессакино СРК Union</a></li>
     <li class="cinema" data-alias="mx-krivrog-victory"> <a href="/cinema/mx-krivrog-victory">Мультиплекс в ТРЦ «Виктори Плаза»</a></li>
     <li class="cinema" data-alias="olymp"> <a href="/cinema/olymp">Олимп</a></li>
   </ul>
 </li>
 <li class="cinema-city">
   <div class="city-caption"> Луцк</div>
   <ul class="cinemas-list">
     <li class="cinema" data-alias="premiercity"> <a href="/cinema/premiercity">PREMIER CITY</a></li>
   </ul>
 </li>
 <li class="cinema-city">
   <div class="city-caption"> Львов</div>
   <ul class="cinemas-list">
     <li class="cinema" data-alias="kp-dovjenko"> <a href="/cinema/kp-dovjenko">Кинопалац им. Довженко</a></li>
     <li class="cinema" data-alias="kp-kopernik"> <a href="/cinema/kp-kopernik">Кинопалац Коперник</a></li>
     <li class="cinema" data-alias="kp-lvov"> <a href="/cinema/kp-lvov">Кинопалац</a></li>
   </ul>
 </li>

I do not know how to correctly match each "cinema" with its "city" in a loop so that I can extract all the values I need from the tags. I then want to store everything in a database.

Am I right that I need to create a dictionary like this for each cinema?

 cinema1 = {"city":"<city>", "cinema_name":"<cinema_name>", "href":"<href>"}
 cinema2 = {"city":"<city>", "cinema_name":"<cinema_name>", "href":"<href>"}

I can extract the cities and the cinemas separately, but then I cannot build the correct dictionaries from them, because the link between city and cinema is lost.

Thank you.

Edit: Initially, I tried something like this:

 city = soup.find_all(class_="city-caption")
 cinema = soup.find_all(class_="cinema")
 for keys in city:
     for values in cinema:
         print(keys, "=>", values)

but I realized this is wrong: it pairs every city with every cinema. I do not know how to make a "cinema" aware that, in the HTML, it is nested inside a particular "city".

  • Please attach the code you wrote. - dizballanze

2 answers

BeautifulSoup (BS) is a poor choice for complex tasks. Use lxml. The code would look something like this:

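 # etree is the parsed document, e.g. the result of lxml.html.fromstring(...)
 for city in etree.xpath(".//div[@class='city-caption']"):
     print(city.xpath("text()")[0])  # city
     # the cinema list is the first sibling element after the caption div
     for cinema in city.xpath("following-sibling::*[1]/li/a"):
         print(cinema.xpath("text()")[0])  # name
         print(cinema.xpath("@href")[0])  # url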
  • I considered lxml as an option, but the documentation seemed very abstruse compared to BS. - TitanFighter
  • The lxml-specific part is exactly 3 lines:
      from lxml.html import fromstring
      etree = fromstring(response.text)
      etree.make_links_absolute(response.url)
    Everything else is XPath (how to use it: see here). It is no harder than the ad hoc API used in BS, but it is a standard for querying XML and lets you do anything that comes to mind. P.S. Bonus - El Ruso
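
A minimal sketch of the lxml approach from this answer, collecting the dictionaries described in the question instead of printing. The `HTML` sample is a shortened copy of the question's markup wrapped in a `<ul>` (an assumption about the parent element), and `extract_cinemas` is a hypothetical helper name, not part of lxml:

```python
from lxml.html import fromstring

# Shortened sample of the markup from the question. The wrapping <ul> is an
# assumption; the real page may use a different parent element.
HTML = """
<ul>
  <li class="cinema-city">
    <div class="city-caption"> Луцк</div>
    <ul class="cinemas-list">
      <li class="cinema" data-alias="premiercity">
        <a href="/cinema/premiercity">PREMIER CITY</a></li>
    </ul>
  </li>
  <li class="cinema-city">
    <div class="city-caption"> Львов</div>
    <ul class="cinemas-list">
      <li class="cinema" data-alias="kp-kopernik">
        <a href="/cinema/kp-kopernik">Кинопалац Коперник</a></li>
      <li class="cinema" data-alias="kp-lvov">
        <a href="/cinema/kp-lvov">Кинопалац</a></li>
    </ul>
  </li>
</ul>
"""


def extract_cinemas(html):
    """Return one dict per cinema, keeping the city it is nested under."""
    tree = fromstring(html)
    rows = []
    for caption in tree.xpath(".//div[@class='city-caption']"):
        city = caption.text_content().strip()
        # Only the <ul> that directly follows this caption is searched,
        # so cinemas from other cities are never mixed in.
        for link in caption.xpath("following-sibling::ul[1]/li/a"):
            rows.append({
                "city": city,
                "cinema_name": link.text_content().strip(),
                "href": link.get("href"),
            })
    return rows


cinemas = extract_cinemas(HTML)
for row in cinemas:
    print(row)
```

The inner XPath is evaluated relative to each caption, which is what keeps the city and its cinemas tied together.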

To solve this problem, break it down into simpler ones. First, get the element that holds all the information about each city. This can be done as follows:

 soup.select('li.cinema-city') 

Each city has a name that can be pulled out of the first div tag with the city-caption class:

 city.find('div', class_='city-caption').text.strip() 

Next, each city has a list of cinemas:

 city.select('a') 

From each cinema link you can get the URL:

 cinema['href'] 

And the name:

 cinema.text.strip() 

You just need to combine these steps. I also recommend splitting the whole thing into separate functions.
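
One way the steps above might be combined, as a sketch rather than a definitive implementation. The function names `parse_cinema`, `parse_city` and `parse_page` are hypothetical, the `HTML` sample is a shortened copy of the question's markup, and the built-in `html.parser` backend is assumed:

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question.
HTML = """
<li class="cinema-city">
  <div class="city-caption"> Луцк</div>
  <ul class="cinemas-list">
    <li class="cinema" data-alias="premiercity">
      <a href="/cinema/premiercity">PREMIER CITY</a></li>
  </ul>
</li>
<li class="cinema-city">
  <div class="city-caption"> Львов</div>
  <ul class="cinemas-list">
    <li class="cinema" data-alias="kp-kopernik">
      <a href="/cinema/kp-kopernik">Кинопалац Коперник</a></li>
    <li class="cinema" data-alias="kp-lvov">
      <a href="/cinema/kp-lvov">Кинопалац</a></li>
  </ul>
</li>
"""


def parse_cinema(link, city_name):
    """Build the dictionary for one cinema link inside a known city."""
    return {
        "city": city_name,
        "cinema_name": link.text.strip(),
        "href": link["href"],
    }


def parse_city(city):
    """Return the dictionaries for every cinema nested in one city element."""
    name = city.find("div", class_="city-caption").text.strip()
    return [parse_cinema(link, name) for link in city.select("a")]


def parse_page(html):
    """Walk the cities one by one so each cinema keeps its city."""
    soup = BeautifulSoup(html, "html.parser")
    cinemas = []
    for city in soup.select("li.cinema-city"):
        cinemas.extend(parse_city(city))
    return cinemas


cinemas = parse_page(HTML)
for row in cinemas:
    print(row)
```

Because the cinemas are searched for inside each city element rather than in the whole document, the city name is always available when a cinema's dictionary is built.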

  • Thank you very much for your attention to my problem. If I extract everything separately, in separate functions as you suggested, the city-cinema link is lost. That is my main problem: I do not know how to extract the data while keeping the link between city and cinema. I have been searching the internet for a day and cannot find a solution. As I understand it, I need a loop that, when a cinema is extracted, checks which city it belongs to and eventually produces city1-cinema1-link1, city1-cinema2-link2, city1-cinema3-link3, city2-cinema1-link1, city2-cinema2-link2 and so on. - TitanFighter
  • The city-cinema link is not lost. Nothing stops you from extracting all the cities and, for each city, calling a function that extracts the cinemas of that city. - awesoon
  • I did not think of that approach. I'll try it. One more newbie question, please. I started trying what you suggested above. In city.find('div', class_='city-caption').text.strip(), what should city be assigned to? I tried city = soup.select('li.cinema-city'), but I end up with AttributeError: 'list' object has no attribute 'find' - TitanFighter
  • select returns a list of cities, so it should be cities = soup.select('li.cinema-city'). Then either iterate over this list with a loop or use the map function. - awesoon
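
A short sketch of the two options from the last comment, assuming a shortened copy of the question's markup and the built-in `html.parser` backend. The helper name `city_name` is hypothetical:

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question.
HTML = """
<li class="cinema-city">
  <div class="city-caption"> Луцк</div>
  <ul class="cinemas-list">
    <li class="cinema"><a href="/cinema/premiercity">PREMIER CITY</a></li>
  </ul>
</li>
<li class="cinema-city">
  <div class="city-caption"> Львов</div>
  <ul class="cinemas-list">
    <li class="cinema"><a href="/cinema/kp-lvov">Кинопалац</a></li>
  </ul>
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns a list-like collection of tags, not a single tag,
# so .find() has to be called on each item rather than on the result itself.
cities = soup.select("li.cinema-city")


def city_name(city):
    """Extract the caption text of one city element."""
    return city.find("div", class_="city-caption").text.strip()


# Option 1: an explicit loop over the list.
names_loop = []
for city in cities:
    names_loop.append(city_name(city))

# Option 2: map applies the same function to every item.
names_map = list(map(city_name, cities))

print(names_loop)
print(names_map)
```

Both options visit the same city elements in document order, so they produce the same list of names.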