There is a regular expression, (\<(/?[^>]+)>), that matches HTML tags. How can I delete all tags, leaving only the text?
5 answers
That same expression can be used to strip the tags: just pass it to sub.
In Python:
>>> import re
>>> re.sub(r'(\<(/?[^>]+)>)', '', '<b>Текст с <br/>тегами</b>')
'Текст с тегами'
In JavaScript:
>>> console.log('<b>Текст с <br/>тегами</b>'.replace(/(\<(\/?[^>]+)>)/g, ''))
"Текст с тегами"
Just remember that no regular expression can correctly handle broken HTML:
>>> line = '<div> >>>2 + 3 < 6<br/>True <!-- коммен > тарий --></div><b'
>>> re.sub(r'(\<(/?[^>]+)>)', '', line)
' >>>2 + 3 True  тарий --><b'
For cases like this it is better to use a full-fledged HTML parser and to keep regular expressions away from HTML altogether.
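As an illustration of the parser route, here is a minimal sketch built on Python's standard html.parser module. The names TextExtractor and text_from_html are made up for this example and are not part of any library. It only collects text nodes; note that the contents of script and style elements also arrive as text, so filter them separately if that matters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are simply not recorded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments are routed to
        # handle_comment, which this class leaves as a no-op.
        self.parts.append(data)


def text_from_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(text_from_html('<b>Text with <br/>tags</b>'))
print(text_from_html('a<!-- com > ment -->b'))
```

Unlike the regular expression above, the parser treats the whole comment as one unit, so a ">" inside it does not leak into the result.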
- Comments have to be removed first, and then the tags. - Qwertiy ♦
- I actively support the idea of using an HTML parser rather than regular expressions. - Sergey Snegirev
- @SergeySnegirev, I did not raise any objection to a parser, and I do not have any. On the other hand, for many specific tasks a simple regular expression is quite suitable. - Qwertiy ♦
- @Qwertiy, removing comments first does not help with invalid HTML. As for specific tasks: if the structure of the document is known exactly, if its (in)validity is known exactly, and if a specific piece of text needs to be pulled out, regular expressions may be appropriate. But if, for example, the goal is to filter user input (and I suspect that is what the author wants), then doing it this way is already dangerous. - andreymal
- Invalid HTML is helped by a stricter definition of a tag (for example, in my regular expression alphanumeric characters are required after <). Removing comments helps with something else: it improves accuracy and rules out tricks with syntax. - Qwertiy ♦
At the moment, this is the version that comes closest to what the browser does:
function textByBrowser(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div.textContent;
}

function textByRegex(html) {
  return html
    .replace(/<!--[\s\S]*?--!?>/g, "")
    .replace(/<\/?[a-z][^>]*(>|$)/gi, "");
}

var tests = [
  '2+3<6',
  '2+3<',
  '<<a>script>alert("XSS!")<<a>/script>',
  '<div> >>>2 + 3 < 6<br/>True <!-- коммен > тарий --></div><b',
  '<script<b>>alert(1)</script</b>>',
  '<a\n>123\n</a>'
];

tests.map(textByBrowser) + "" == tests.map(textByRegex) // true
Angle brackets inside attribute values are not handled correctly:
textByBrowser('1<div data-smth=">">2</div>3') // 123
textByRegex('1<div data-smth=">">2</div>3') // 1">23
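One possible way around the attribute problem, shown here in Python as a sketch rather than a complete fix, is to let the pattern consume quoted attribute values as a unit. The TAG pattern below is an assumption made for this example, not the expression from the answer. It still does not cope with unquoted values containing ">", and a tag with an unclosed quote is left in the text instead of being removed.

```python
import re

# Hypothetical variant of the tag pattern: a quoted attribute value is
# matched as a whole, so a ">" inside quotes does not end the tag.
TAG = re.compile(
    r"""</?[a-z](?:[^>"']|"[^"]*"|'[^']*')*(?:>|$)""",
    re.IGNORECASE,
)
COMMENT = re.compile(r"<!--[\s\S]*?--!?>")


def text_by_regex(markup):
    # Same order as before: comments first, then tags.
    return TAG.sub("", COMMENT.sub("", markup))


print(text_by_regex('1<div data-smth=">">2</div>3'))
print(text_by_regex("<a title='a>b'>x</a>"))
```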
Character entities have to be handled however you see fit:
textByBrowser("&lt;") // "<"
textByRegex("&lt;") // "&lt;"
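If decoded text is what you want from the regex route, the entities can be decoded in a separate step. The sketch below ports the two replacements from textByRegex to Python (the name text_by_regex is just for this example) and then applies html.unescape from the standard library. Keep in mind that the decoded result can contain "<" again, so it must not be inserted into a page unescaped.

```python
import html
import re


def text_by_regex(markup):
    # Python port of the two replacements used in textByRegex.
    no_comments = re.sub(r"<!--[\s\S]*?--!?>", "", markup)
    return re.sub(r"</?[a-z][^>]*(>|$)", "", no_comments, flags=re.IGNORECASE)


raw = text_by_regex("<b>2 + 3 &lt; 6 &amp;&amp; true</b>")
decoded = html.unescape(raw)

print(raw)      # entities are still encoded here
print(decoded)  # entities decoded to plain characters
```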
Note that neither of these ways of getting the text is a defense against XSS attacks. When displaying user-supplied text on a page, you should always escape it.
console.log(textByBrowser('<<a>script>alert("XSS!")<<a>/script>')); // <script>alert("XSS!")</script>
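For the escaping step itself, a minimal Python sketch using the standard html.escape function could look like this. The variable names are illustrative; in a real application the template engine usually does this for you.

```python
import html

# For example, the string that tag stripping returned in the case above.
user_text = '<script>alert("XSS!")</script>'

# Escape the markup characters before putting the text into a page.
safe = html.escape(user_text)
print(safe)  # &lt;script&gt;alert(&quot;XSS!&quot;)&lt;/script&gt;
```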
P.S. An earlier version of this answer, with different code, is available in the edit history.
- Could you balance < and > recursively? Then the broken markup would be removed as well. - ReinRaus
- @ReinRaus, recursion in regular expressions? Definitely not in JS. - Qwertiy ♦
- @ReinRaus, by the way, where would balancing be needed? What has to be fixed there is something else: the contents of attributes should be handled more flexibly. - Qwertiy ♦
- I don't remember; it was a long time ago. - ReinRaus
(?:<).*?(?:>)
- cuts all tags
- And why the groups? - Qwertiy ♦
PHP has the strip_tags function, which removes HTML and PHP tags from a string.
In my opinion, a more precise definition of a tag would be the following.
In JavaScript:
<\/?[A-Za-z]+[^>]*>
- That differs very little from mine... - Qwertiy ♦
- Generally speaking, the differences are significant: standard tags cannot begin with a digit or an underscore, and the tag name follows < without a space. Your regular expression allows all of that. - stomaks
- Where? o_O Are you sure you are comparing with my answer? I have this:
/<\/?[a-z][^>]*(>|$)/gi
- Qwertiy ♦
- Yes, apparently I mixed things up; it was not yours. Yours is better, but: "[a-z]" is only one character from the range a-z, followed by any character other than ">" ("[^>]") zero or more times, until it meets a closing bracket or the end of the line. First, the end of the line does not close a tag such as "<f". Second, input like "<f5454>" also passes; there are no standard tags with numbers like that, though it can be useful if there are custom tags, of course. regexper.com/#%2F%3C%5C%2F%3F%5Ba-z%5D%5B%5E%3E%5D*%28%3E%7C%24%29%2Fgi - stomaks
- Although mine can be improved even further: <\/?[A-Za-z]+?[^>]*?> regexper.com/#%3C%5C%2F%3F%5BA-Za-z%5D%2B%3F%5B%5E%3E%5D*%3F%3E - stomaks
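To see where the two patterns discussed in these comments actually differ, they can be run side by side on the inputs that were mentioned. This is a quick Python check, with the dictionary keys chosen just for this example. On these samples the only difference is the unclosed "<f" at the end of the string, which the (>|$) alternative removes and the stricter pattern leaves in place.

```python
import re

PATTERNS = {
    "qwertiy": re.compile(r"</?[a-z][^>]*(>|$)", re.IGNORECASE),
    "stomaks": re.compile(r"</?[A-Za-z]+[^>]*>"),
}

# Digits after a letter, a leading underscore, a leading space,
# and a tag that is cut off at the end of the string.
SAMPLES = ["<f5454>x", "<_a>x", "< b>x", "x<f"]

results = {
    sample: {name: pattern.sub("", sample) for name, pattern in PATTERNS.items()}
    for sample in SAMPLES
}

for sample, by_pattern in results.items():
    print(repr(sample), by_pattern)
```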