How do I parse an HTML page with JavaScript in Python 3, and what is needed for this?
1 answer
To extract static data from HTML or from JavaScript source text, you can use an appropriate parser, such as BeautifulSoup (for HTML) or slimit (for JavaScript). Example: "How, using Beautiful Soup, can I search by keyword if the word is inside a script tag?"
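A minimal sketch of the static case: the data sits inside a script tag, so it is enough to collect the script texts and search them for a keyword. To stay runnable without extra installs, this sketch uses the standard library's `html.parser` in place of BeautifulSoup. `SAMPLE_HTML`, the `pageData` variable, `ScriptCollector` and `scripts_containing` are hypothetical names made up for the illustration.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def scripts_containing(html, keyword):
    """Return the text of each script element that mentions keyword."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return [text for text in parser.scripts if keyword in text]


# Hypothetical page: the data lives in a JS variable, not in the markup.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>var pageData = {"title": "Example", "views": 42};</script>
</body></html>
"""

matches = scripts_containing(SAMPLE_HTML, "pageData")
# Assumption: the object literal is flat and happens to be valid JSON.
# Arbitrary JavaScript would need a real JS parser such as slimit.
found = re.search(r"pageData\s*=\s*(\{.*?\});", matches[0], re.S)
page_data = json.loads(found.group(1))
print(page_data["views"])
```

The regular expression only works because the sample literal is flat JSON; for nested objects or non-JSON JavaScript, a proper parser is the safer choice.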
To get information from a web page whose elements are generated dynamically by JavaScript, you can use a web browser. Selenium WebDriver lets you control different browsers from Python (there is an example that shows the browser GUI). There are other libraries as well, for example marionette (Firefox) and pyppeteer (Chrome, a puppeteer API for Python); there is an example of taking a screenshot of a web page with these libraries. To get the HTML of a page without showing a GUI, you can run headless Google Chrome and drive it with selenium:
```python
from selenium import webdriver  # $ pip install selenium

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
# get chromedriver from
# https://sites.google.com/a/chromium.org/chromedriver/downloads
# Pass the options via `options=`; recent Selenium releases
# no longer accept the older `chrome_options=` keyword.
browser = webdriver.Chrome(options=options)
browser.get('https://ru.stackoverflow.com/q/749943')
# ... other actions
generated_html = browser.page_source  # HTML after JavaScript has run
browser.quit()
```

This interface lets you automate user actions (keystrokes, button clicks, searching for elements on the page by various criteria, and so on). It is useful to split the analysis into two parts: first download the dynamically generated information from the network with a browser and save it (possibly including redundant data), and then analyze the static content in detail to extract only the parts you need (possibly offline, in another process, using the same BeautifulSoup). For example, to find links to related questions on a saved page:
```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

soup = BeautifulSoup(generated_html, 'html.parser')
h = soup.find(id='h-related')
# a string as the second argument matches on the CSS class
related = [a['href'] for a in h.find_all('a', 'question-hyperlink')]
```

If the site provides an API (official, or discovered by inspecting the network requests that the page's JavaScript makes: an example for fifa.com), then using it may be preferable to extracting information from the UI elements of the web page: an example of using the Stack Exchange API.
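As a rough illustration of the API route, the related questions can be requested from the Stack Exchange API instead of being scraped from the page. This is a sketch under assumptions: it targets API version 2.3 and its `/questions/{ids}/related` route, it uses the standard library's `urllib` rather than a third-party HTTP client, and the helper names (`related_url`, `extract_links`, `fetch_related`) and the `sample` payload are hypothetical. The network call is defined but not executed here.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.stackexchange.com/2.3"  # assumed API version


def related_url(question_id, site="ru.stackoverflow"):
    """Build the URL that lists questions related to question_id."""
    query = urlencode({"site": site})
    return f"{API_ROOT}/questions/{question_id}/related?{query}"


def extract_links(payload):
    """Pull the question links out of a decoded API response."""
    return [item["link"] for item in payload.get("items", [])]


def fetch_related(question_id):
    """Perform the request (needs network access; not called below)."""
    with urlopen(related_url(question_id)) as response:
        raw = response.read()
        # The API compresses its responses; urllib does not undo that.
        if response.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return extract_links(json.loads(raw))


# Hypothetical payload in the shape of the API's response wrapper.
sample = {
    "items": [{"title": "t", "link": "https://ru.stackoverflow.com/q/1"}],
    "has_more": False,
}
print(related_url(749943))
print(extract_links(sample))
```

Separating URL construction and payload handling from the request itself keeps the parsing logic testable offline, which matches the advice above to download first and analyze later.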
You can often find a REST API or a GraphQL API, which is convenient to use via requests or specialized libraries (the link has code examples for the GitHub API).
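For GraphQL in particular, a call is usually a single POST whose JSON body carries the query text and its variables. The sketch below only builds such a request for the GitHub GraphQL endpoint and does not send it. It uses the standard library's `urllib` in place of requests; `build_graphql_request`, the query itself, and the placeholder token are assumptions made for illustration, and a real call needs a valid personal access token.

```python
import json
from urllib.request import Request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query: ask for the star count of one repository.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    stargazerCount
  }
}
"""


def build_graphql_request(owner, name, token):
    """Prepare (but do not send) a GraphQL POST request."""
    body = json.dumps(
        {"query": QUERY, "variables": {"owner": owner, "name": name}}
    )
    return Request(
        GITHUB_GRAPHQL_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "<YOUR_TOKEN>" is a placeholder, not a working credential.
request = build_graphql_request("python", "cpython", token="<YOUR_TOKEN>")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

With requests, the same call would be a `requests.post` to the same URL with the same JSON body and headers.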
- +1 to BeautifulSoup - AseN