How do I add parsed data to a dictionary?

Question

Yesterday I asked a question about the site dns-shop.ru, and a kind person suggested using Selenium. Now I have another problem. I extracted everything I wanted: the product names, the prices, and the links to the products. What I cannot work out is how to put it all into a dictionary so that the name is the key and the price and link are the value. I created a dictionary d = {}, and as I run through the for loops the values are appended one after another. The problem is that all the product names are recorded first, then all the prices, then all the links, whereas I need name, price, link, and so on. Maybe I need nested loops, but then I get garbage: first one name, then all the prices and links, then the next name followed again by all the prices and links from the page, and so on.

    from selenium import webdriver
    from lxml import html

    page_num = 1
    url = 'https://www.dns-shop.ru/catalog/17a8a01d16404e77/smartfony/?p=%s&i=1&mode=list&brand=brand-apple' % page_num
    driver = webdriver.Firefox()
    driver.get(url)
    content = driver.page_source
    tree = html.fromstring(content)
    last_page = tree.xpath('//span[@class=" item edge"]')[0].attrib.get('data-page-number')
    last_page = int(last_page)
    d = {}
    while page_num <= last_page:
        url = 'https://www.dns-shop.ru/catalog/17a8a01d16404e77/smartfony/?p=%s&i=1&mode=list&brand=brand-apple' % page_num
        driver.get(url)
        name = driver.find_elements_by_tag_name('h3')
        price = driver.find_elements_by_class_name('price_g')
        link = driver.find_elements_by_xpath("//div[@class='title']/a")
        print('Страница: ', page_num)
        for i in name:
            i = i.text
            print(i)
            d.append(i)
        for i in price:
            i = i.text
            print(i)
            d.append(i)
        for i in link:
            i = i.get_attribute("href")
            print(i)
            d.append(i)
        page_num += 1
    print(d)
    driver.close()

I have now tried this:

    for i in name:
        i = i.text
        for j in price:
            j = j.text
            for k in link:
                k = k.get_attribute("href")
                d[i] = [j, k]

The structure looks like what I wanted, but the price and the link do not correspond to the product.

I then changed d from a dictionary to a list. It worked only for the first triple (name, price, link); after that the links and prices came out in a chaotic order. I do not understand why.

    for i in name:
        i = i.text
        d.append(i)
        for j in price:
            j = j.text
            d.append(j)
            for k in link:
                k = k.get_attribute("href")
                d.append(k)

Accepted Answer · 2018-12-03T06:30:52

"But the problem is that all the product names are recorded first, then all the prices, then all the links. I need name, price, link, and so on."

I suggest a fix in place. (Another, better option is to process each product individually, extracting its properties from its own element, rather than querying each property for all products at once and then gluing the results together.)

Instead of this:

    name = driver.find_elements_by_tag_name('h3')
    price = driver.find_elements_by_class_name('price_g')
    link = driver.find_elements_by_xpath("//div[@class='title']/a")
    print('Страница: ', page_num)
    for i in name:
        i = i.text
        print(i)
        d.append(i)
    for i in price:
        i = i.text
        print(i)
        d.append(i)
    for i in link:
        i = i.get_attribute("href")
        print(i)
        d.append(i)

Do this:

    names = driver.find_elements_by_tag_name('h3')
    prices = driver.find_elements_by_class_name('price_g')
    links = driver.find_elements_by_xpath("//div[@class='title']/a")
    print('Страница: ', page_num)
    for name, price, link in zip(names, prices, links):
        name = name.text
        price = price.text
        link = link.get_attribute("href")
        print(name, price, link)
        # d.append((name, price, link))
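To show how zip() lines the three lists up, and how the result can go into a dictionary keyed by name as the question asks, here is a minimal sketch. The product names, prices, and example.com links are made-up stand-ins for the values read from the page, not real data from the site.

```python
# Hypothetical sample data standing in for the .text / href values
# that the Selenium elements would provide.
names = ["Phone A", "Phone B", "Phone C"]
prices = ["10 000", "20 000", "30 000"]
links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

# zip() yields the i-th name together with the i-th price and the i-th link,
# so each product keeps its own price and link.
d = {}
for name, price, link in zip(names, prices, links):
    d[name] = (price, link)

print(d["Phone B"])
```

Two caveats apply: zip() stops at the shortest list, so an element missing from one list silently shortens or shifts the result, and two products with the same name would overwrite each other in the dictionary.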

P.S.

The call d.append(...) will not work if d is a dictionary (d = {}), because dictionaries have no append method. Either d must be a list or that line raises an error.

So instead of d = {}, use for example items = [] and then items.append((name, price, link)).
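The other option mentioned earlier in this answer, processing each product individually, might look roughly like the sketch below. It is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree on a made-up snippet instead of Selenium, and the product container class and the sample markup are hypothetical, not taken from dns-shop.ru. Only the h3 tag, the price_g class, and the title/a path come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the "product" container and its layout are assumptions.
SAMPLE = """
<div class="catalog">
  <div class="product">
    <div class="title"><a href="/product/a/"><h3>Phone A</h3></a></div>
    <span class="price_g">10 000</span>
  </div>
  <div class="product">
    <div class="title"><a href="/product/b/"><h3>Phone B</h3></a></div>
  </div>
</div>
"""


def parse_products(markup):
    """Return a list of (name, price, link) tuples, one per product block."""
    root = ET.fromstring(markup)
    items = []
    for product in root.findall(".//div[@class='product']"):
        # Every lookup is relative to this one product element.
        name = product.findtext(".//h3")
        price = product.findtext(".//span[@class='price_g']")  # None if absent
        link_el = product.find(".//div[@class='title']/a")
        link = link_el.get("href") if link_el is not None else None
        items.append((name, price, link))
    return items


items = parse_products(SAMPLE)
print(items)
```

Because each property is looked up inside its own product element, a product without a price simply gets None and does not shift the prices of the products after it, which is the weakness of zipping three separate lists. With Selenium the same idea would mean finding the product elements first and then calling the find methods on each of them.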

AtachiShadow · Answer 2 · 2018-12-03T15:20:24

I will just add a few tips on your code. They would not fit well in a comment.

1.

    content = driver.page_source
    tree = html.fromstring(content)

You can write this directly:

 tree = html.fromstring(driver.page_source)

That way you do not keep an extra content variable holding the page source as dead weight for as long as the script runs. It is a small optimization of your script's memory usage.

2.

 last_page = tree.xpath('//span[@class=" item edge"]')[0].attrib.get('data-page-number')

You first find the element by XPath and then read the attribute with attrib.get().

However, lxml can select the attribute directly in the XPath expression, so this is better:

 last_page = tree.xpath('//span[@class=" item edge"]/@data-page-number')

This returns a list containing the attribute value, for example ['6'] when the last page is 6, so take the first element and convert it with int() to get last_page.

BUT: selecting an attribute inside the XPath does not work if you search with Selenium's driver.find_elements_by_xpath() instead of lxml.

If you write driver.find_elements_by_xpath('//span[@class=" item edge"]/@data-page-number'), it throws an exception saying that the XPath is not valid, because the result is not an element. You need this instead:

    last_page = driver.find_element_by_xpath('//span[@class=" item edge"]').get_attribute("data-page-number")

You can also shorten the XPath a little by replacing span with * so that the search is by class value only: '//*[@class=" item edge"]/@data-page-number'. The three variants, from the original to the shortest:

    last_page = tree.xpath('//span[@class=" item edge"]')[0].attrib.get('data-page-number')
    last_page = tree.xpath('//span[@class=" item edge"]/@data-page-number')
    last_page = tree.xpath('//*[@class=" item edge"]/@data-page-number')
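The lxml variants above can be compared on a small sample. This is a sketch that assumes lxml is installed; the HTML snippet is made up to mimic the pagination element from the question and is not the real page source.

```python
from lxml import html

# Hypothetical stand-in for driver.page_source.
SAMPLE = (
    '<html><body>'
    '<span class=" item edge" data-page-number="6">6</span>'
    '</body></html>'
)

tree = html.fromstring(SAMPLE)

# Variant 1: find the element, then read the attribute from it.
via_element = tree.xpath('//span[@class=" item edge"]')[0].attrib.get('data-page-number')

# Variants 2 and 3: select the attribute in the XPath itself.
# xpath() returns a list of matching attribute values, not a single number.
via_span = tree.xpath('//span[@class=" item edge"]/@data-page-number')
via_any = tree.xpath('//*[@class=" item edge"]/@data-page-number')

# Take the first match and convert it to an integer for the page loop.
last_page = int(via_span[0])
print(via_element, via_span, via_any, last_page)
```

Note that @class=" item edge" is an exact string comparison, including the leading space, so it only matches while the page keeps that exact class value.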
