Using the following line I get the ResultSet:

abc = soup.findAll('script', text = re.compile('Data')) 

The resulting ResultSet itself:

 [<script type="text/javascript"> data = {"url":"haha.com", "id":"12345", "name":"haha",}; ... function() {abc.devg....})' ... 

The goal is to extract the parameters in data, namely the values of url and id. I have no idea how to do this. I tried various ways of parsing with BeautifulSoup, and the line above is the closest I got to the desired result.

  • beautifulsoup4 has nothing to do with this (the library understands html and xml, but does not understand javascript). To get url and id from a javascript object, you need a library that understands javascript. See the related question on parsing JSON in a javascript block using Python. - jfs

1 answer

Actually, BeautifulSoup is relevant here: you can use it to find the required script element in the HTML tree. Once the element is found and you have its text, you need to decide how to parse the JS code and pull out the value of the variable you want.

One fairly practical and simple option is a regular expression. You can even use the same compiled regular expression both to find the element and to capture the data object as a string, which can then be passed to json.loads() to get a Python data structure (a dictionary in the example below).

Working example:

    import json
    import re

    from bs4 import BeautifulSoup

    data = """
    <html>
        <head>
            <script type="text/javascript">
                data = {"url":" haha.com", "id": "12345", "name": "haha"};
                function() {
                    // something here
                });
            </script>
        </head>
    </html>"""

    soup = BeautifulSoup(data, "html.parser")

    # The same pattern finds the script element and captures the object literal.
    pattern = re.compile(r"data = (\{.*?\});$", re.MULTILINE | re.DOTALL)

    # Newer bs4 releases also accept string= in place of text=.
    script = soup.find("script", text=pattern)
    if script:
        obj = pattern.search(script.text).group(1)  # the "{...}" part as a string
        obj = json.loads(obj)                       # JSON text -> Python dict
        print(obj)

The output will be:

 {'url': ' haha.com', 'id': '12345', 'name': 'haha'} 
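Two caveats about the pattern above: the non-greedy `\{.*?\}` stops at the first `}` that is followed by `;` at the end of a line, so it relies on that layout, and json.loads() rejects a trailing comma like the one in the question's `"name":"haha",}`. Below is a rough sketch of one way around both, using only the standard library. The names `SCRIPT_TEXT` and `extract_object` and the nested `meta` key are made up for illustration, and the script text is assumed to have already been obtained from the script element.

```python
import json
import re

# Hypothetical script text modelled on the question: it has a nested object
# and trailing commas, which are valid JavaScript but not valid JSON.
SCRIPT_TEXT = """
data = {"url": "haha.com", "id": "12345", "meta": {"name": "haha"},};
function() { /* something here */ }
"""


def extract_object(js_text, var_name="data"):
    """Return the object literal assigned to var_name as a dict, or None."""
    # Match "<var_name> = " only when an opening brace follows immediately.
    match = re.search(r"\b%s\s*=\s*(?=\{)" % re.escape(var_name), js_text)
    if match is None:
        return None
    # Drop trailing commas such as ",}" or ", ]".
    # Caveat: this would also alter a string value that contains ",}".
    cleaned = re.sub(r",\s*([}\]])", r"\1", js_text[match.end():])
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (here, the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(cleaned)
    return obj


obj = extract_object(SCRIPT_TEXT)
print(obj["url"], obj["id"])
```

This still assumes the object literal is otherwise JSON-compatible (double-quoted keys and strings, no comments or expressions inside it). For anything looser, a real JavaScript parser is the safer choice.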

See also this StackOverflow post, where a similar task is discussed: besides regular expressions, it includes an example that uses the slimit JavaScript parser.