Parsing a site with XPath in Python

Question

Hello!

I need to extract the links to the sections of this site:

    <li id="menu-item-28" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-28"><a href="http://worldagnetwork.com/category/community/">Community</a></li>
    <li id="menu-item-25" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-25"><a href="http://worldagnetwork.com/category/crops/">Crops</a></li>
    <li id="menu-item-27" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-27"><a href="http://worldagnetwork.com/category/livestock/">Livestock</a></li>
    <li id="menu-item-24" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-24"><a href="http://worldagnetwork.com/category/technology/">Technology</a></li>
    <li id="menu-item-26" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-26"><a href="http://worldagnetwork.com/category/business/">Business</a></li>
    <li id="menu-item-29" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-29"><a href="http://worldagnetwork.com/category/policy/">Policy</a></li>
    <li id="menu-item-53" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-53"><a href="http://worldagnetwork.com/category/environment/">Environment</a></li>
    <li id="menu-item-82" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-82"><a href="http://worldagnetwork.com/category/rd/">R&#038;D</a></li>

I don't quite understand how to use XPath.

This is what I do:

    from lxml import etree
    import requests
    from io import StringIO, BytesIO
    import lxml.html as LH

    url = 'http://worldagnetwork.com/'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'}
    result = requests.get(url, headers=headers)
    tree = LH.document_fromstring(result.content)
    print(tree.xpath('//div/ul/li')[0].get('href'))

Even print(tree.xpath('//div/ul')) already prints an empty list.

Please help me understand.

Accepted Answer · 2016-07-21T00:11:57

Your selector stops one step short.

'//div/ul/li' selects every li whose parent is a ul whose parent is, in turn, a div.

You need not the li elements but the a elements inside them: href is an attribute of a, so calling .get('href') on an li returns None. Just append a to the end of the selector:

//div/ul/li/a

    links = tree.xpath("//div/ul/li/a")
    for l in links:
        print(l.get('href'))

This prints all the links you need.
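To make the difference between li and a concrete, here is a minimal sketch that runs on a shortened copy of the menu fragment from the question instead of the live site. The wrapping div and ul are an assumption, since the question only shows the li items, and the class attributes are trimmed for brevity.

```python
import lxml.html as LH

# Shortened menu fragment from the question. The wrapping <div><ul> is an
# assumption: the question only shows the <li> items.
HTML = """
<div>
  <ul>
    <li id="menu-item-28" class="menu-item"><a href="http://worldagnetwork.com/category/community/">Community</a></li>
    <li id="menu-item-25" class="menu-item"><a href="http://worldagnetwork.com/category/crops/">Crops</a></li>
  </ul>
</div>
"""

tree = LH.document_fromstring(HTML)

# <li> carries no href attribute, so .get() returns None here.
first_li = tree.xpath('//div/ul/li')[0]
li_href = first_li.get('href')

# Selecting the <a> children yields elements that do carry href.
hrefs = [a.get('href') for a in tree.xpath('//div/ul/li/a')]

# XPath can also return the attribute values directly.
hrefs_direct = tree.xpath('//div/ul/li/a/@href')

print(li_href)
print(hrefs)
print(hrefs_direct == hrefs)
```

Ending the expression with /@href is a matter of taste: it skips the loop over elements, at the cost of no longer having the a element at hand if you also want its text.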

If you only want the addresses listed in the top menu (I assume that is where the HTML fragment came from), it is better to use a more specific selector, because //div/ul/li/a also returns some extra links. To get only the menu links, use this selector:

 //div[contains(@class, "nav-collapse")]/ul/li[contains(@class, "menu-item")]/a[@href]

It selects every a that has an href attribute and whose direct parent is an li with the menu-item class, where that li belongs to a ul directly inside a div with the nav-collapse class.
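As a rough illustration of what the narrower selector filters out, the sketch below compares both selectors on a made-up page. Only the nav-collapse and menu-item class names come from the selector above; the footer block and the example.com URL are hypothetical stand-ins for the "extra links".

```python
import lxml.html as LH

# Hypothetical page layout. Only the "nav-collapse" and "menu-item" class
# names come from the selector in the answer; everything else is made up.
HTML = """
<html><body>
  <div class="nav-collapse collapse">
    <ul>
      <li class="menu-item menu-item-28"><a href="http://worldagnetwork.com/category/community/">Community</a></li>
      <li class="menu-item menu-item-25"><a href="http://worldagnetwork.com/category/crops/">Crops</a></li>
    </ul>
  </div>
  <div class="footer">
    <ul>
      <li><a href="http://example.com/about/">About</a></li>
    </ul>
  </div>
</body></html>
"""

tree = LH.document_fromstring(HTML)

BROAD = '//div/ul/li/a'
MENU_ONLY = ('//div[contains(@class, "nav-collapse")]'
             '/ul/li[contains(@class, "menu-item")]/a[@href]')

# The broad selector also picks up the footer link.
broad_links = [a.get('href') for a in tree.xpath(BROAD)]

# The narrow selector keeps only links inside the menu container.
menu_links = [a.get('href') for a in tree.xpath(MENU_ONLY)]

print(len(broad_links))
print(menu_links)
```

Note that contains(@class, "menu-item") is a plain substring test on the attribute value, so it would also match a class such as "submenu-item". For this menu that is harmless, but it is worth keeping in mind on pages with similar class names.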

For me, tree.xpath('//div/ul') already returns an empty list.
Do you mean that tree.xpath('//div/ul')[0].get('href') came back empty, or the call without [0].get('href')?
If the latter, then most likely the data could not be fetched from the site (result.content contains something unexpected).
I sorted out tree.xpath('//div/ul'), and tree.xpath('//div/ul/li/a') returns what I really need.
But tree.xpath('//div[contains(@class, "nav-collapse")]/li[contains(@class, "menu-item")]/a[@href]') returns an empty list.
I corrected the second selector in the answer, in case it is still useful.
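The empty list from the selector without the ul step comes down to how / works in XPath: it matches direct children only, and the li elements sit under ul, not directly under the div. A minimal sketch on assumed markup (not the live page) that compares the three ways of writing that step:

```python
import lxml.html as LH

# Hypothetical markup mirroring the structure the selectors imply:
# div.nav-collapse > ul > li.menu-item > a
HTML = """
<div class="nav-collapse">
  <ul>
    <li class="menu-item"><a href="http://worldagnetwork.com/category/policy/">Policy</a></li>
  </ul>
</div>
"""

tree = LH.document_fromstring(HTML)

div = '//div[contains(@class, "nav-collapse")]'
li_a = 'li[contains(@class, "menu-item")]/a[@href]'

# li as a direct child of div: nothing matches, because ul sits in between.
without_ul = tree.xpath(div + '/' + li_a)

# li reached through the ul step: matches.
with_ul = tree.xpath(div + '/ul/' + li_a)

# li at any depth below div: also matches, and tolerates extra wrappers.
descendant = tree.xpath(div + '//' + li_a)

print(len(without_ul), len(with_ul), len(descendant))
```

The // form is the more forgiving choice when you are not sure how many wrapper elements the theme puts between the container and the items.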
