Please explain. I have this line:

<td><span class=small>1.</span><br><a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>

I need to extract /roder/re3rr70 and 0r0t3 (1et).jpg from it and concatenate them to get the following link: http://roder/re3rr70/0r0t3 (1et).jpg, but I am stuck at this point:

    >>> import re
    >>> text = "<td><span class=small>1.</span><br><a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>"
    >>> d = re.findall(r"<a href='(.+?)'>(.+?)</a>", text)
    >>> print(d)
    [("/roder/re3rr70' title='0r0t3 (1et).jpg", 'picture')]

That is, I get everything inside the opening <a> tag plus the link text, but I don't know how to write the pattern so that the href value and the title value are captured separately and the rest is discarded. How should the pattern be written?

    2 answers

    If I understand your question correctly:

        import re

        text = "<td><span class=small>1.</span><br><a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>"
        m = re.search(r"<a href='([^>]+?)'[^>]*title='([^>]+?)'[^>]*>", text)
        if m:
            print('http:/%s/%s' % (m.group(1), m.group(2)))

    displays

    http://roder/re3rr70/0r0t3 (1et).jpg

    UPD: First of all, I recommend reading about regular expressions in general, for example on Wikipedia. Their syntax varies slightly between programming languages, but the basics are the same everywhere. Secondly, read the documentation for the re library (the first Google result for "re").

    About this example:

    1. [^>] means any character except '>'. It is needed so that the match does not consume too much. For example, the regex 'z.*z' applied to the string "adszazzbzasd" matches "zazzbz", not "zaz" as we would like. There are also lazy quantifiers for this, but I do not remember the syntax.
    2. The part of the regex in parentheses is called a group. It is used to pull that part out of the string. m is a match object. if m checks that the string contains a substring matching the expression (re.search returns None otherwise). m.group(i) returns the part of the string matched by group i.
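
    The two points above can be tried out directly. The following is a minimal sketch, not part of the original answer: it reuses the string "adszazzbzasd" and the pattern from the answer, and adds the lazy form `.*?` for comparison (a `?` after a quantifier is what makes it lazy in Python's re module). The variable names are only illustrative.

```python
import re

# Greedy vs. lazy matching on the string from point 1.
s = "adszazzbzasd"
greedy = re.search(r"z.*z", s).group(0)   # .* takes as much as possible
lazy = re.search(r"z.*?z", s).group(0)    # .*? takes as little as possible
print(greedy)  # zazzbz
print(lazy)    # zaz

# Groups and the "if m" check from point 2, on the line from the question.
text = ("<td><span class=small>1.</span><br>"
        "<a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>")
pattern = r"<a href='([^>]+?)'[^>]*title='([^>]+?)'[^>]*>"

m = re.search(pattern, text)
if m:  # re.search returns None when nothing matches
    href, title = m.group(1), m.group(2)
    print(href)   # /roder/re3rr70
    print(title)  # 0r0t3 (1et).jpg

# With no <a> tag in the input there is no match object at all.
no_match = re.search(pattern, "<p>no link here</p>")
print(no_match)  # None
```

    Calling m.group(1) without the `if m` check would raise an AttributeError on input that does not match, which is why the check is there.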
    • This part, in more detail, please! I'm interested in the patterns themselves, ([^>]+?)'[^>]* and ([^>]+?)'[^>]*>, as well as the logic of the if m check and of m.group(1) and m.group(2). I need to understand it rather than just copy and paste; I can't keep solving problems by asking for help every time. - tukan
    • Maybe instead of r"<a href='(.+?)'>(.+?)</a>" use r"<a href='(.+?)' title='(.+?)'>.*</a>"? You need the title attribute anyway. - panda
    • The patterns href='([^>]+?)' and title='([^>]+)' here are a bit strange. Something like href='([^']+)' is probably more correct, or better yet href='((?:[^']|(?<!\\).)+)', and similarly for title. @tukan, what exactly do you not understand? Just in case: Regular expression operations, docs.python.org/library/re.html - Ilya Pirogov
    • What panda suggested looks neat. I still don't fully understand the special characters, but it works nicely: >>> import re >>> text = "<td><span class=small>1.</span><br><a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>" >>> d = re.search(r"<a href='(.+?)' title='(.+?)'>.*</a>", text) >>> print('http:/%s/%s' % (d.group(1), d.group(2))) http://roder/re3rr70/0r0t3 (1et).jpg - tukan
    • I didn't understand how to add title to the pattern; now I see that I had to write title='(.+?)'>.*</a>" - tukan
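
    The suggestions from the comments can be combined into a small helper. This is only a sketch under assumptions: it uses the stricter character class [^']+ proposed by Ilya Pirogov, it expects single-quoted attributes with href immediately followed by title (as in the question's line), and the names LINK_RE and build_links are hypothetical.

```python
import re

# A single-quoted attribute value cannot contain the quote itself, so [^']+
# stops exactly at the closing quote and no lazy quantifier is needed.
# Assumption: href is directly followed by title, separated by one space.
LINK_RE = re.compile(r"<a href='([^']+)' title='([^']+)'>")


def build_links(html):
    """Return one link per matching <a> tag (hypothetical helper)."""
    # href already starts with "/", so "http:/" + href gives "http://...".
    return ["http:/%s/%s" % (href, title)
            for href, title in LINK_RE.findall(html)]


text = ("<td><span class=small>1.</span><br>"
        "<a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>")
links = build_links(text)
print(links)  # ['http://roder/re3rr70/0r0t3 (1et).jpg']
```

    re.findall returns a list of tuples when the pattern has several groups, so the helper also works when the input contains more than one such link, and returns an empty list when there is none.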

    It is better not to parse HTML with regexps. To begin with, HTML is not a regular language (although modern regexps are, heh, not strictly regular either), and besides, lxml and BeautifulSoup were invented for exactly this.

    Here is an example with lxml:

        import lxml.html

        # text is the HTML line from the question
        [(e.get("href"), e.get("title"))
         for e in lxml.html.fromstring(text)
         .xpath("//td/span[@class='small']/following-sibling::a")]
    • Could you explain it using the code above? - tukan
    • Well: we parse the HTML fragment by calling lxml.html.fromstring() and run the XPath query //td/span[@class='small']/following-sibling::a on it (a td anywhere in the document, "//td"; inside it a span with the attribute class="small", "/span[@class='small']"; and from that span the following sibling a elements, "/following-sibling::a"). Then we iterate over the result and for each element return a tuple of its "href" and "title" attributes. - drdaeman
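
    If installing lxml is not an option, the same idea (let a parser read the tags, then pick out attributes) can be sketched with Python 3's standard library. This swaps lxml and XPath for html.parser, so it is not the method from the answer: it collects every a tag rather than only those that follow a span with class="small", and the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser  # Python 3 standard library


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs from every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            attributes = dict(attrs)
            self.links.append((attributes.get("href"), attributes.get("title")))


text = ("<td><span class=small>1.</span><br>"
        "<a href='/roder/re3rr70' title='0r0t3 (1et).jpg'>picture</a></td>")

collector = LinkCollector()
collector.feed(text)
collector.close()
print(collector.links)  # [('/roder/re3rr70', '0r0t3 (1et).jpg')]
```

    Unlike the regex versions, this does not depend on the order of the attributes or on which kind of quotes they use.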