I am parsing this HTML page: http://cogcc.state.co.us/COGIS/DrillingPermitsList.cfm. First I download it using requests:

    import requests
    from xmlWorker import HtmlParser

    url = "http://cogcc.state.co.us/COGIS/DrillingPermitsList.cfm"
    postDataForPending = {"listtype": "Pending", "country": "All", "B1": "Go!"}
    postDataForApproved = {"listtype": "Approved", "country": "All", "B1": "Go!"}

    response = requests.post(url, data=postDataForPending)
    htmlText = response.text
    print(htmlText)

    if __name__ == '__main__':
        htmlParser = HtmlParser(htmlText)
        print(htmlParser.get_received())

And then I parse it with lxml:

    import lxml.html as lxmlHtmlParser  # lxmlHtmlParser is an alias for lxml.html

    class HtmlParser:
        xpathRoot = '/tr[position()>1 and position()<{0}+2]/'
        xpathToReceivedfirst = xpathRoot + 'td[1]/font/text()'

        def __init__(self, htmlText):
            logf = open("download.log", "w")
            try:
                self.document = lxmlHtmlParser.fromstring(htmlText)
            except Exception as e:  # most generic exception you can catch
                logf.write(str(e))
            finally:
                # optional clean up code
                logf.close()

        def get_received(self):
            # maxRecordsToParse is not defined in this excerpt
            xp = self.xpathToReceivedfirst.format(maxRecordsToParse)
            receivedElements = self.document.xpath(xp)
            return receivedElements

No errors are displayed. The problem is that, when I debug, all attributes of self.document are either unset or equal to something like '\n', and accordingly every XPath query returns an empty list. At the same time, BeautifulSoup parses the elements without trouble. The HTML file is valid. I still do not understand what the problem is.
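One thing that can be checked in isolation is how lxml evaluates an expression that starts with a single slash. The sketch below uses invented table markup (not the real page) and a simplified version of the expression from the question. It assumes lxml is installed. An expression starting with `/tr` is matched against the root element of the document, which is `html` here, while `//tr` matches rows anywhere in the tree.

```python
import lxml.html

# Invented sample markup, only loosely modelled on a permits table.
html_text = """
<html><body>
<table>
  <tr><td><font>Received</font></td></tr>
  <tr><td><font>01/02/2015</font></td></tr>
  <tr><td><font>01/05/2015</font></td></tr>
</table>
</body></html>
"""

document = lxml.html.fromstring(html_text)

# For a full document, fromstring returns the root <html> element.
root_tag = document.tag

# Absolute path: asks for a root element named "tr", which does not exist.
absolute = document.xpath('/tr[position()>1]/td[1]/font/text()')

# Descendant path: finds <tr> elements at any depth, skipping the header row.
anywhere = document.xpath('//tr[position()>1]/td[1]/font/text()')

print(root_tag)
print(absolute)
print(anywhere)
```

If the absolute form comes back empty while the `//` form returns the cell texts, the expression rather than the parser is the likely cause of the empty results.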

UPDATED

I extracted a single table with BeautifulSoup and removed all line breaks and whitespace between tags. It still does not work, although the browser recognizes and renders this table correctly when it is saved to a file.

  • Admittedly I am a poor Pythonista (and the code is only partially shown), but what is lxmlHtmlParser, and why is there a single slash at the beginning of xpathRoot? - vitidev
  • lxmlHtmlParser is an alias for lxml.html. I tested my XPath on a third-party service and everything worked as I expected. - Yaktens Teed
  • Create a minimal input example that shows the problem (for example, discard half of the HTML and see if the problem remains, then discard half of the remainder, and so on until nothing more can be discarded). Give the expected output (what BeautifulSoup prints, and what you get with lxml instead). Minimal reproducible example - jfs
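The halving procedure suggested in the last comment can be sketched as a small helper. This is only an illustration using the standard library: `shrink`, `still_fails` and the sample lines are hypothetical names and data, and in practice the predicate would run the real lxml query on the candidate HTML. Cutting HTML by lines usually leaves broken markup, which lenient HTML parsers generally tolerate.

```python
from typing import Callable, List


def shrink(lines: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Keep dropping half of the input while the problem still reproduces."""
    current = list(lines)
    while len(current) > 1:
        middle = len(current) // 2
        first_half, second_half = current[:middle], current[middle:]
        if still_fails(first_half):
            current = first_half
        elif still_fails(second_half):
            current = second_half
        else:
            # Neither half reproduces the problem on its own; stop here.
            break
    return current


# Hypothetical predicate: pretend the problem is triggered by one marker line.
def still_fails(candidate: List[str]) -> bool:
    return any("<tr bad>" in line for line in candidate)


sample = ["<table>", "<tr>", "<tr bad>", "</table>"]
minimal = shrink(sample, still_fails)
print(minimal)
```

The helper stops as soon as neither half reproduces the problem alone, so the result is small but not guaranteed to be the smallest possible input.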
