Python lxml html.fromstring not working

Question

I am parsing this HTML page: http://cogcc.state.co.us/COGIS/DrillingPermitsList.cfm. First I download it using requests:

    import requests
    from xmlWorker import HtmlParser

    url = "http://cogcc.state.co.us/COGIS/DrillingPermitsList.cfm"
    postDataForPending = {"listtype": "Pending", "country": "All", "B1": "Go!"}
    postDataForApproved = {"listtype": "Approved", "country": "All", "B1": "Go!"}

    response = requests.post(url, data=postDataForPending)
    htmlText = response.text
    print(htmlText)

    if __name__ == '__main__':
        htmlParser = HtmlParser(htmlText)
        print(htmlParser.get_received())

And then I parse it with lxml:

    import lxml.html as lxmlHtmlParser  # alias, as explained in the comments

    class HtmlParser:
        xpathRoot = '/tr[position()>1 and position()<{0}+2]/'
        xpathToReceivedfirst = xpathRoot + 'td[1]/font/text()'

        def __init__(self, htmlText):
            logf = open("download.log", "w")
            try:
                self.document = lxmlHtmlParser.fromstring(htmlText)
            except Exception as e:  # most generic exception you can catch
                logf.write(str(e))
            finally:
                # optional clean up code
                pass

        def get_received(self):
            # maxRecordsToParse is defined elsewhere (not shown in the question)
            xp = self.xpathToReceivedfirst.format(maxRecordsToParse)
            receivedElements = self.document.xpath(xp)
            return receivedElements

No errors are displayed. The problem is that, in the debugger, all attributes of self.document are either unset or equal to something like '\n', and consequently all XPath queries return empty lists. At the same time, BeautifulSoup parses the elements normally. The HTML file is valid. I still do not understand what the problem is.
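A note on the symptom described above: in lxml, an element's text attribute holds only the text that precedes its first child, so a root element whose text is empty or whitespace does not by itself mean the parse failed. The following is a minimal sketch of how the parsed tree could be inspected. The inline HTML is a hypothetical stand-in for the real page, not its actual markup.

```python
import lxml.html

# Hypothetical stand-in for the downloaded page (not the real markup).
html_text = (
    "<html>\n<body>\n<table>\n"
    "<tr><td>Received</td></tr>\n"
    "</table>\n</body>\n</html>"
)

root = lxml.html.fromstring(html_text)

# The root's own text is only whatever precedes its first child,
# so it is typically None or pure whitespace for a full document.
print(repr(root.tag), repr(root.text))

# The actual content lives in descendant elements.
child_tags = [child.tag for child in root]
cell_texts = [td.text for td in root.iter("td")]
print(child_tags)
print(cell_texts)
```

If the descendants are present here, the document was parsed and the empty results more likely come from the XPath expression than from the parser.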

UPDATED

I extracted a single table with BeautifulSoup and removed all line breaks and whitespace between tags. Still nothing works, although the browser correctly detects and renders this table when it is saved to a file.

Of course, I'm a bad Pythonista (and the code is only partially shown), but what is lxmlHtmlParser, and why is there a single slash at the beginning of xpathRoot?
lxmlHtmlParser is an alias for lxml.html. I tested my XPath on a third-party service, and everything worked as I expected.
Create a minimal input example that shows the problem (for example, discard half of the HTML, check whether the problem remains, then discard half of the remainder, and so on until nothing more can be discarded). Give the expected output (what BeautifulSoup prints, and what lxml prints instead), i.e. a minimal reproducible example.
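To illustrate the point about the single leading slash, here is a minimal sketch of the kind of reduced example the comments ask for. It assumes a small hypothetical table in place of the real page, and a hypothetical value of 2 for maxRecordsToParse. An XPath that starts with a single slash is evaluated from the document root, so /tr only matches if the root element itself is tr, whereas //tr matches tr elements anywhere in the tree. Whether this is the actual cause on the real page is not confirmed by the thread.

```python
import lxml.html

# Hypothetical stand-in for the real page: a header row plus two data rows.
html_text = (
    "<html><body><table>"
    "<tr><td><font>Received</font></td></tr>"
    "<tr><td><font>2018-01-02</font></td></tr>"
    "<tr><td><font>2018-01-03</font></td></tr>"
    "</table></body></html>"
)

document = lxml.html.fromstring(html_text)

max_records = 2  # hypothetical value for maxRecordsToParse

# Same expression as in the question: absolute path from the document root.
absolute = "/tr[position()>1 and position()<{0}+2]/td[1]/font/text()".format(max_records)
# Variant with a double slash: match tr elements at any depth.
anywhere = "//tr[position()>1 and position()<{0}+2]/td[1]/font/text()".format(max_records)

absolute_result = document.xpath(absolute)
anywhere_result = document.xpath(anywhere)

print(absolute_result)  # []
print(anywhere_result)  # ['2018-01-02', '2018-01-03']
```

The root of the parsed document is html, not tr, which is why the absolute form returns an empty list here. Online XPath testers may evaluate the expression against a different context node, which could explain why the same expression appeared to work there.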
