How to remove all HTML tags with a regular expression?

Question

There is a regular expression, (\<(/?[^>]+)>), that matches HTML tags. How can I use it to delete all tags, leaving only the text?

andreymal · Accepted Answer · 2015-07-02T19:07:40

In fact, the same expression can be used to strip tags: just pass it to a substitution function such as sub.

In Python:

 >>> import re
 >>> re.sub(r'(\<(/?[^>]+)>)', '', '<b>Текст с <br/>тегами</b>')
 'Текст с тегами'

In JavaScript:

 >>> console.log('<b>Текст с <br/>тегами</b>'.replace(/(\<(\/?[^>]+)>)/g, ''))
 "Текст с тегами"

Just remember that no regular expression can handle broken HTML correctly:

 >>> line = '<div> >>>2 + 3 < 6<br/>True <!-- коммен > тарий --></div><b'
 >>> re.sub(r'(\<(/?[^>]+)>)', '', line)
 ' >>>2 + 3 True  тарий --><b'

For such cases it is better to use a full-fledged HTML parser and to keep regular expressions away from HTML code altogether.
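The parser route mentioned above could look roughly like the following sketch, which uses `html.parser` from the Python standard library. The class name `TextExtractor` and the function `strip_tags` are hypothetical names chosen for this illustration. Note that this parser also reports the contents of `<script>` and `<style>` elements as text, so a real filter would probably need to skip them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; in text nodes.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup, as seen by html.parser."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.chunks)


print(strip_tags('<b>Текст с <br/>тегами</b>'))  # Текст с тегами

# Broken markup does not raise an error, but the result depends on the
# recovery rules of the parser and may differ from what a browser shows.
print(repr(strip_tags('<div> >>>2 + 3 < 6<br/>True <!-- коммен > тарий --></div><b')))
```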

I strongly support the idea of using an HTML parser rather than regular expressions.
@SergeySnegirev, I did not raise any objections to the parser, and I do not have any.
On the other hand, for many specific problems a simple regular expression is quite suitable.
@Qwertiy, removing comments beforehand does not help with invalid HTML.
For specific tasks, yes: if the structure of the document is known precisely, if its (in)validity is known precisely, and if a specific piece of text needs to be pulled out, regular expressions may be appropriate.
But if, for example, the goal is to filter user input (and I suspect this is what the author wants), then doing it this way is dangerous.
Invalid HTML is helped by a stricter definition of a tag (for example, my regular expression requires a letter after < ).
Removing comments helps with something else: it improves accuracy and rules out tricks with the syntax.

Answer 2 · 2015-07-03T09:23:49

At the moment, this is the version that comes closest to the browser's behavior:

 function textByBrowser(html) {
   var div = document.createElement("div");
   div.innerHTML = html;
   return div.textContent;
 }
 function textByRegex(html) {
   // Remove comments first, then anything that looks like a tag.
   return html.replace(/<!--[\s\S]*?--!?>/g, "").replace(/<\/?[a-z][^>]*(>|$)/gi, "");
 }
 var tests = [
   '2+3<6',
   '2+3<',
   '<<a>script>alert("XSS!")<<a>/script>',
   '<div> >>>2 + 3 < 6<br/>True <!-- коммен > тарий --></div><b',
   '<script<b>>alert(1)</script</b>>',
   '<a\n>123\n</a>'
 ];
 tests.map(textByBrowser) + "" == tests.map(textByRegex) // true

Angle brackets inside attribute values are not handled correctly:

 textByBrowser('1<div data-smth=">">2</div>3') // 123
 textByRegex('1<div data-smth=">">2</div>3') // 1">23
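One possible way to get past this limitation is to let the pattern skip over quoted attribute values as a whole. The Python sketch below is not the answer's code: the names `TAG`, `COMMENT` and `text_by_regex_quoted` are hypothetical, and it assumes that every quote opened inside a tag is also closed. A tag with an unterminated quote is simply left in place.

```python
import re

COMMENT = re.compile(r"<!--[\s\S]*?--!?>")

# Assumption: quoted attribute values are always properly closed.
TAG = re.compile(
    r"""</?[a-z]          # opening bracket, optional slash, then a letter
        (?: "[^"]*"       # a double-quoted value, may contain >
          | '[^']*'       # a single-quoted value, may contain >
          | [^'">]        # any other character except quotes and >
        )*
        (?: > | \Z )      # closing bracket or end of input
    """,
    re.IGNORECASE | re.VERBOSE,
)


def text_by_regex_quoted(markup):
    """Strip comments, then tags, treating quoted values as opaque."""
    return TAG.sub("", COMMENT.sub("", markup))


print(text_by_regex_quoted('1<div data-smth=">">2</div>3'))  # 123
```

`\Z` is used instead of `$` because in Python `$` also matches just before a trailing newline, which is not what the JavaScript original does.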

And character entities (such as &lt;) have to be handled separately, as you see fit:

 textByBrowser("&lt;") // "<"
 textByRegex("&lt;") // "&lt;"
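If the entities should be decoded the way the browser does, a Python version could apply `html.unescape` from the standard library after the tags are gone. This is a minimal sketch: `TAG` is a Python rendering of the tag pattern from this answer, and `text_with_entities_decoded` is a hypothetical name. Decoding before stripping would be a mistake, because `&lt;b&gt;` would turn into `<b>` and then be removed as a tag.

```python
import html
import re

# Python rendering of the tag pattern above (\Z stands in for the JS $).
TAG = re.compile(r"</?[a-z][^>]*(?:>|\Z)", re.IGNORECASE)


def text_with_entities_decoded(markup):
    """Strip tags first, then decode character entities."""
    without_tags = TAG.sub("", markup)
    return html.unescape(without_tags)


print(text_with_entities_decoded("&lt;"))                       # <
print(text_with_entities_decoded("<i>&lt;b&gt; &amp; co</i>"))  # <b> & co
```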

Note that neither of these ways of extracting text is a defense against XSS attacks. When displaying user-supplied text on a page, you should always escape it.

 console.log(textByBrowser('<<a>script>alert("XSS!")<<a>/script>')); // <script>alert("XSS!")</script>
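On the server side, the escaping step might look like this in Python, using `html.escape` from the standard library. The function `render_comment` is a hypothetical helper for this illustration only; a template engine with auto-escaping would normally do the same job.

```python
import html


def render_comment(user_text):
    """Hypothetical helper: escape user text before putting it into markup."""
    # html.escape replaces &, <, > and (by default) both kinds of quotes.
    return "<p>" + html.escape(user_text) + "</p>"


print(render_comment('<script>alert("XSS!")</script>'))
# <p>&lt;script&gt;alert(&quot;XSS!&quot;)&lt;/script&gt;</p>
```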

P.S. An earlier version of this answer, with different code, is available in the edit history.

Something else needs to be changed there: the contents of attributes should be processed more flexibly.

alexander barakin · Answer 3 · 2016-02-10T12:29:32

(?:<).*?(?:>) - strips all tags
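A quick Python check of this pattern, as a sketch with hypothetical names (`NAIVE`, `PLAIN`, `strip_naive`): on these samples the non-capturing groups change nothing compared with plain `<.*?>`, ordinary comparisons are removed as if they were tags, and a tag that spans several lines is missed unless `.` is allowed to match newlines.

```python
import re

NAIVE = r"(?:<).*?(?:>)"  # the pattern from the answer
PLAIN = r"<.*?>"          # the same pattern without the groups


def strip_naive(text, pattern=NAIVE, flags=0):
    """Remove everything the given pattern considers a tag."""
    return re.sub(pattern, "", text, flags=flags)


samples = ['<b>Текст с <br/>тегами</b>', '2 < 3 and 4 > 1', '<a\n>123\n</a>']

for sample in samples:
    # Both spellings of the pattern give the same result on these samples.
    assert strip_naive(sample) == strip_naive(sample, PLAIN)

print(repr(strip_naive(samples[0])))                   # 'Текст с тегами'
print(repr(strip_naive(samples[1])))                   # '2  1' (text lost)
print(repr(strip_naive(samples[2])))                   # '<a\n>123\n'
print(repr(strip_naive(samples[2], flags=re.DOTALL)))  # '123\n'
```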

Totamort

Why the groups? - Qwertiy ♦

ezm52099 · Answer 4 · 2018-06-11T13:40:24

PHP has a strip_tags function, which removes HTML and PHP tags from a string.

stomaks · Answer 5 · 2018-12-30T15:28:44

In my opinion, a more precise definition of a tag would be the following.

In JavaScript:

 <\/?[A-Za-z]+[^>]*>

That differs very little from mine... - Qwertiy ♦
Generally speaking, the differences are significant: standard tag names cannot begin with a digit or an underscore, and the tag name follows the opening bracket without a space. Your regular expression allows all of this. - stomaks
Where? o_O Are you sure you are comparing with my answer? Mine is: /<\/?[a-z][^>]*(>|$)/gi - Qwertiy ♦ 6:12 pm
Yes, apparently I mixed things up and compared with someone else's, not yours. Yours is better, but still. In yours, "[a-z]" matches only one character from the range a-z, followed by any character other than ">" ("[^>]") zero or more times, until it meets a closing bracket or the end of the line. First, the end of the line does not really close a tag such as "<f". Second, input like "<f5454>" also matches; there are no standard tags like that with digits, though it can of course be useful if there are custom tags. regexper.com/#%2F%3C%5C%2F%3F%5Ba-z%5D%5B%5E%3E%5D*%28%3E%7C%24%29%2Fgi - stomaks
Although mine can be improved even further: <\/?[A-Za-z]+?[^>]*?> regexper.com/#%3C%5C%2F%3F%5BA-Za-z%5D%2B%3F%5B%5E%3E%5D*%3F%3E - stomaks
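The patterns discussed in these comments can be compared side by side with a small Python sketch. The names `GREEDY`, `LAZY`, `WITH_END` and `strip` are hypothetical, and `\Z` replaces the JavaScript `$` so that only the true end of input counts. On these samples the lazy quantifiers make no difference, since `[^>]` can never run past the closing bracket anyway; the visible difference between the two answers is the unterminated tag at the end of input.

```python
import re

GREEDY = re.compile(r"<\/?[A-Za-z]+[^>]*>")                      # stomaks
LAZY = re.compile(r"<\/?[A-Za-z]+?[^>]*?>")                      # stomaks, lazy
WITH_END = re.compile(r"<\/?[a-z][^>]*(>|\Z)", re.IGNORECASE)    # Qwertiy


def strip(pattern, text):
    """Remove every match of the compiled pattern from text."""
    return pattern.sub("", text)


samples = ['<b>bold</b>', '<f5454>x', 'text<b', '2+3<6', '<1a>']

for sample in samples:
    # Lazy and greedy variants agree on all of these inputs.
    assert strip(GREEDY, sample) == strip(LAZY, sample)
    print(repr(sample), repr(strip(GREEDY, sample)), repr(strip(WITH_END, sample)))
```

Only `'text<b'` produces different results here: the first two patterns leave it untouched, while the third removes the dangling `<b`.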
