I am learning web scraping and parsing with Python and the BeautifulSoup library. I need to extract data from a set of pages (sample page).

In particular, I need to extract the email address. Here is the part of the HTML code where the email is located:

<tr><td>Email:</td><td width="10"></td><td><script>var ylhrfq = "&#121;&#112;&#114;";var bdnd = "&#97;&#105;&#108;";var byil = "&#115;&#116;&#46;&#99;";var bwdbdf = "&#97;&#103;&#101;&#64;";var dqiex = "&#46;&#99;";var pner = "&#111;&#109;";var qkfow = "&#103;&#109;";var azzl = "&#105;&#101;";var hgcr = "&#110;&#46;&#112;&#108;";var link = byil + ylhrfq + azzl + hgcr + bwdbdf + qkfow + bdnd + dqiex + pner;var text = link;document.write('<a href="mailto:'+link+'" />'+text+'</a>');</script></td></tr> 

Is this possible with BeautifulSoup? If so, how can it be done?

  • Building a database for spam, I suppose? - VladD
  • Perhaps I am not collecting it for myself. - GiveItAwayNow

2 answers

Once you have such a block of code, BeautifulSoup can easily extract the contents of the script block, but it cannot help beyond that: it does not parse JavaScript.

Notice that the structure of the script is simple: there is a variable link, which is the concatenation of several other string variables, each holding a piece of the email written as HTML numeric character references.

Since the code is quite simple, you can parse it with regular expressions.

For example, the expression

 re.findall(r'var (\w+) = "(.+?)"', script_code) 

will find all variables of the form var name = "string"; these hold all the pieces of the final email.
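
To make the shape of the result concrete, here is a minimal sketch of that findall call. The script body, the variable names aa and bb, and the encoded address are made up for illustration; they are not taken from the page in the question.

```python
import re

# Hypothetical script body in the same style as the question.
script_code = (
    'var bb = "&#64;&#101;&#120;&#97;";'
    'var aa = "&#117;&#115;&#101;&#114;";'
    'var link = aa + bb;'
    'var text = link;'
)

# Each match is a (name, value) tuple; the values are still HTML-encoded.
# "var link" and "var text" are skipped because no quote follows the "=".
pairs = re.findall(r'var (\w+) = "(.+?)"', script_code)
print(pairs)
```

The raw-string prefix keeps `\w` from being treated as a string escape by Python itself.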

You can determine the order in which they are joined by parsing the assignment to the link variable:

 re.search('var link = (.+?);', script_code).group(1).split(' + ') 

Here we find the assignment to the variable link, take the assigned expression and split it on the plus sign.
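
A small sketch of this step on a made-up script body (the names aa, bb, cc are hypothetical). It also shows a slightly more tolerant split, in case the spacing around the plus signs varies between pages; that variant is an assumption on my part, not something the sample page requires.

```python
import re

# Hypothetical script body; only the "var link" statement matters here.
script_code = 'var aa = "x";var link = aa + bb + cc;var text = link;'

# Capture everything between "var link = " and the next semicolon.
expression = re.search(r'var link = (.+?);', script_code).group(1)

# Exact split, as in the answer: assumes the separator is exactly " + ".
order = expression.split(' + ')

# Tolerant split: accepts any amount of whitespace around "+".
order_tolerant = re.split(r'\s*\+\s*', expression)

print(order)
print(order_tolerant)
```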

After that, all that remains is to join the variable values in that order and decode the text. The standard html library helps here, specifically the html.unescape function.
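
As an illustration of the decoding step, the sketch below uses html.unescape on hand-written values. The dictionary, the order list and the resulting address user@example.com are all invented for the example.

```python
import html

# Hypothetical variables as they would come out of the first regex.
variables = {
    'bb': '&#64;&#101;&#120;&#97;',
    'aa': '&#117;&#115;&#101;&#114;',
    'dd': '&#46;&#99;&#111;&#109;',
    'cc': '&#109;&#112;&#108;&#101;',
}
# Hypothetical order as it would come out of parsing "var link".
order = ['aa', 'bb', 'cc', 'dd']

# html.unescape turns numeric character references into characters.
decoded = {name: html.unescape(value) for name, value in variables.items()}
email = ''.join(decoded[name] for name in order)
print(email)
```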

As a result, I got the following code:

    import html
    import re

    import bs4


    def extract_email(code):
        soup = bs4.BeautifulSoup(code, 'lxml')
        script_code = soup.script.string
        # Map each variable name to its decoded string value.
        variables = {name: html.unescape(value)
                     for name, value in re.findall(r'var (\w+) = "(.+?)"', script_code)}
        # Variable names in the order they are concatenated into link.
        order = re.search(r'var link = (.+?);', script_code).group(1).split(' + ')
        return ''.join(variables[name] for name in order)
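
A usage sketch of the function above, repeated here so the block runs on its own. Two things differ from the answer and are my assumptions: it uses the built-in 'html.parser' backend so that lxml is not required, and it returns None when no script or no link assignment is found. The table row and the address in it are made up.

```python
import html
import re

import bs4


def extract_email(code):
    """Return the decoded email from the first <script>, or None if absent."""
    soup = bs4.BeautifulSoup(code, 'html.parser')
    if soup.script is None or soup.script.string is None:
        return None
    script_code = soup.script.string
    variables = {
        name: html.unescape(value)
        for name, value in re.findall(r'var (\w+) = "(.+?)"', script_code)
    }
    match = re.search(r'var link = (.+?);', script_code)
    if match is None:
        return None
    order = match.group(1).split(' + ')
    return ''.join(variables[name] for name in order)


# Hypothetical row in the same style as the question.
sample_row = (
    '<table><tr><td>Email:</td><td width="10"></td><td><script>'
    'var bb = "&#64;&#101;&#120;&#97;";'
    'var aa = "&#117;&#115;&#101;&#114;";'
    'var dd = "&#46;&#99;&#111;&#109;";'
    'var cc = "&#109;&#112;&#108;&#101;";'
    'var link = aa + bb + cc + dd;var text = link;'
    "document.write('<a href=\"mailto:'+link+'\">'+text+'</a>');"
    '</script></td></tr></table>'
)

print(extract_email(sample_row))
print(extract_email('<p>no script here</p>'))
```

If a page contains several script tags, soup.script only returns the first one, so on a real page you would probably first narrow the search to the row labelled "Email:".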
  • Timofey, thank you very much for such a detailed explanation. Everything works as it should. - GiveItAwayNow

Yes, you can. Use XPath to get the data inside the relevant td tag, then use a regular expression to collect all the quoted strings, join them in the order given by the link variable, and decode the result. Nothing complicated overall.
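
A rough sketch of this XPath variant, assuming lxml is installed and that the email cell follows a cell whose text is "Email:", as in the question. The function name extract_email_xpath, the sample row and the address are hypothetical.

```python
import html
import re

from lxml import html as lxml_html  # assumption: lxml is available


def extract_email_xpath(page_source):
    """Locate the script next to the "Email:" cell and decode the address."""
    tree = lxml_html.fromstring(page_source)
    scripts = tree.xpath(
        '//td[normalize-space()="Email:"]/following-sibling::td/script/text()'
    )
    if not scripts:
        return None
    script_code = scripts[0]
    # Quoted pieces, keyed by variable name (still HTML-encoded).
    parts = dict(re.findall(r'var (\w+) = "(.+?)"', script_code))
    # The pieces are declared in shuffled order, so follow "var link".
    order = re.search(r'var link = (.+?);', script_code).group(1).split(' + ')
    return html.unescape(''.join(parts[name] for name in order))


# Hypothetical row in the same style as the question.
sample_row = (
    '<table><tr><td>Email:</td><td width="10"></td><td><script>'
    'var bb = "&#64;&#101;&#120;&#97;";'
    'var aa = "&#117;&#115;&#101;&#114;";'
    'var link = aa + bb;var text = link;'
    '</script></td></tr></table>'
)

print(extract_email_xpath(sample_row))
```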