The simplest regular expression successfully handles links in plain text:

r'(https?://[\S]+)' 

This works fine for plain text, but sometimes the input is HTML, where the link has to be extracted from an a tag. Given something like some text <a href="http://ya.ru">some text, it returns: http://ya.ru">some

This expression:

 r'(https?://[\S]+[>$])' 

returns an acceptable result (the link with a trailing > that can simply be stripped afterwards), but it no longer matches links in plain text.
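A small sketch that reproduces both behaviors, using made-up sample strings (`plain` and `html` are assumptions, not data from the question). It also suggests why the second pattern misses plain-text links: inside a character class, `$` is a literal dollar sign rather than an end-of-string anchor, so the match must end in `>` or `$`.

```python
import re

first = re.compile(r'(https?://[\S]+)')
second = re.compile(r'(https?://[\S]+[>$])')

# Hypothetical sample inputs for illustration only.
plain = "see http://ya.ru for details"
html = 'some text <a href="http://ya.ru">some text'

first_plain = first.findall(plain)    # link found as expected
first_html = first.findall(html)      # runs on past the closing quote and '>'
second_plain = second.findall(plain)  # nothing: no '>' or literal '$' to end on
second_html = second.findall(html)    # link plus the trailing '">'

print(first_plain, first_html, second_plain, second_html)
```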

How can these two expressions be combined into one in Python, as an OR, so that all matches are returned in order?

I tried ()|(), but it does not work that way. Third-party libraries handle this task very well, but I need to get the result using only simple regular expressions.

    2 answers

    The easiest:

     (https?://[\w.-]+) 

    But it will also match invalid links, for example: https://.-ya_.5. If you are confident that the links in your text are valid, this is a working option.

    • This option cuts https://ya.ru/da.html down to https://ya.ru, and I need the whole link. It will most likely also not accept the % of percent-encoded Russian letters. - federk
    • Then this option: (https?://[^\"\s]+) - Roman Vladimirov
    • This option works like the first one in the question: it does not cut the match off at >, and it needs to be cut there. - federk
    • (https?://[^\"\s>]+) - Roman Vladimirov
    • Yes, thank you, it works just right. - federk
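As a rough check of the pattern the comments settled on, here is a minimal sketch that runs it over both kinds of input. The sample strings are invented for illustration. The idea is that the match stops at whitespace, a double quote, or `>`, so one expression covers plain text and the `href` attribute alike.

```python
import re

# Pattern from the comment thread: stop at whitespace, a double quote, or '>'.
URL_RE = re.compile(r'https?://[^"\s>]+')

# Hypothetical sample inputs for illustration only.
plain = "See https://ya.ru/da.html and http://example.com/a?b=1 for details"
html = 'some text <a href="http://ya.ru">some text</a>'

plain_links = URL_RE.findall(plain)
html_links = URL_RE.findall(html)

print(plain_links)
print(html_links)
```

Note that this still keeps any punctuation that directly follows a link in plain text (for example a trailing period), and single-quoted attributes would need `'` added to the excluded characters.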

    If you want to extract links from HTML, you should use an HTML parser, for example Beautiful Soup:

     #!/usr/bin/env python3
     import bs4  # $ pip install beautifulsoup4

     soup = bs4.BeautifulSoup(html_text, 'html.parser')
     all_links = soup.find_all('a', href=True)

    In general, regular expressions are not suitable for parsing HTML, and even in cases where a regex can be used, it may not be the best option.
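If installing a third-party package is not an option, a similar result can probably be had with the standard library's `html.parser` module. This is a different technique from the Beautiful Soup snippet in the answer, not a rewrite of it. The class name `LinkCollector` and the sample `html_text` are hypothetical, chosen only for this sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Hypothetical sample input for illustration only.
html_text = 'some text <a href="http://ya.ru">some text</a> and <a name="x">anchor</a>'

collector = LinkCollector()
collector.feed(html_text)
collector.close()
links = collector.links

print(links)
```

Unlike the regex, this only sees links inside `href` attributes, so bare URLs in plain text would still need a separate pass.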

    • Read the question carefully. The task is to parse text that can sometimes turn out to be HTML. Pay particular attention to the last paragraph. - federk
    • @federk: Questions on Stack Overflow are not just for you personally. People often try to use regex to extract information from HTML without good reason. If you specifically want a regex to pull links out of text that "sometimes may turn out to be html", then follow the last link in my answer, which shows how to parse HTML with regex (and how much effort that takes in general). If you insist on regex, I recommend explicitly limiting the kinds of text input and URI you accept, and explicitly stating which errors are tolerable (not every URL being found, or false positives). - jfs
    • @federk Meta Discussion Topic - jfs