Python, a regular expression for parsing URLs in text

Question

The simplest regular expression successfully handles links in plain text:

r'(https?://[\S]+)'

This works for me, but sometimes HTML comes in, and then the link has to be extracted from an a tag. For input like some text <a href="http://ya.ru">some text, the result is: http://ya.ru>some

Here is an expression:

 r'(https?://[\S]+[>$])'

returns an acceptable result (a link with the > symbol at the end, which can then simply be truncated), but no longer handles the links in plain text.
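The behaviour of both patterns can be reproduced with a short sketch. The sample strings are assumptions made for illustration; the HTML sample uses an unquoted href so that the output matches the one quoted above. Note that inside a character class `$` is a literal dollar sign rather than an end-of-string anchor, which would explain why the second pattern finds nothing in plain text that contains neither `>` nor `$`.

```python
import re

# Hypothetical sample inputs, not taken from real data
plain = 'see http://ya.ru for details'
html = 'some text <a href=http://ya.ru>some text'

first = re.compile(r'(https?://[\S]+)')        # first pattern from the question
second = re.compile(r'(https?://[\S]+[>$])')   # second pattern from the question

print(first.findall(plain))   # ['http://ya.ru']
print(first.findall(html))    # ['http://ya.ru>some']
print(second.findall(html))   # ['http://ya.ru>']
print(second.findall(plain))  # []
```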

How can I combine these two expressions into one in Python, as an OR, so that I get all the matches in order?

I tried ()|(), but it does not work that way. Third-party libraries handle this task very well, but I need to get the result with plain regular expressions.

Roman Vladimirov 44 3 bronze marks · Accepted Answer · 2017-02-26T17:59:27

The simplest option:

 (https?://[\w.-]+)

But it will also match invalid links, for example: https://.-ya_.5 If you are confident that the links in your text are valid, this is a working option.
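A quick check of this pattern, as a sketch with made-up inputs: the character class `[\w.-]` accepts any run of word characters, dots and hyphens, so it matches the invalid example, and it stops at the first `/`, so any path after the host is dropped.

```python
import re

pattern = re.compile(r'(https?://[\w.-]+)')

# Matches a string that is not a valid link
invalid = pattern.findall('https://.-ya_.5')
# Stops at the first "/" after the host, so the path is lost
truncated = pattern.findall('https://ya.ru/da.html')

print(invalid)    # ['https://.-ya_.5']
print(truncated)  # ['https://ya.ru']
```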

This option truncates https://ya.ru/da.html to https://ya.ru, and I need the whole link. It will also most likely fail on the % signs in percent-encoded Russian letters. - federk
Then this option: (https?://[^\"\s]+) - Roman Vladimirov
This option behaves like the first one in the question: it does not stop at > , and it needs to. - federk
(https?://[^\"\s>]+) - Roman Vladimirov
Yes, thank you, it works just right. - federk
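As a sketch, the final pattern from the comments can be tried on plain text and on both quoted and unquoted href values. The sample strings and the names `url_re`, `samples` and `results` are assumptions made for illustration.

```python
import re

# Final pattern from the comments: stop at a quote, whitespace or ">"
url_re = re.compile(r'(https?://[^\"\s>]+)')

samples = [
    'see https://ya.ru/da.html for details',       # plain text
    'some text <a href="http://ya.ru">some text',  # quoted href
    'some text <a href=http://ya.ru>some text',    # unquoted href
]
results = [url_re.findall(s) for s in samples]

for sample, found in zip(samples, results):
    print(found)
# ['https://ya.ru/da.html']
# ['http://ya.ru']
# ['http://ya.ru']
```

Since `<` is not excluded, a URL used as link text and followed directly by a closing tag (for example `http://ya.ru</a>`) would still be matched together with `</a`.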

Community spirit ♦ 1 · Answer 2 · 2017-02-27T05:07:23

If you want to get links out of html, you should use an html parser. For example, Beautiful Soup:

 #!/usr/bin/env python3
 import bs4  # $ pip install beautifulsoup4

 soup = bs4.BeautifulSoup(html_text, 'html.parser')
 all_links = soup.find_all('a', href=True)

In general, regular expressions are not suitable for parsing html, and even in cases where a regex can be used, it may not be the best option.
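For readers who cannot install a third-party package, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the Beautiful Soup approach from the answer; the class name `LinkCollector` and the sample `html_text` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


html_text = 'some text <a href="http://ya.ru">some text</a>'  # sample input
collector = LinkCollector()
collector.feed(html_text)
print(collector.links)  # ['http://ya.ru']
```

Unlike the regex, this only sees links inside a tags, so URLs in plain text would still need a separate pass.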

@federk: Questions on Stack Overflow are not only for you personally.
People often try to use regex to extract information from html without good reason.
If you specifically want a regex to pull links out of text that "sometimes may turn out to be html", follow the last link in my answer, which shows how to parse html with regex (and how much effort that takes in general).
If you insist on regex, I recommend explicitly limiting the kinds of valid input text / URI and stating which errors are acceptable (not every URL being found, or false positives).
