There is a regular expression, (\<(/?[^>]+)>), that matches HTML tags. How can I delete all tags, leaving only the text?

    5 answers

    That same expression can be used to strip the tags: just feed it to sub.

    In Python:

     >>> import re
     >>> re.sub(r'(\<(/?[^>]+)>)', '', '<b>Text with <br/>tags</b>')
     'Text with tags'

    In JavaScript:

     >>> console.log('<b>Text with <br/>tags</b>'.replace(/(\<(\/?[^>]+)>)/g, ''))
     "Text with tags"

    Just keep in mind that no regular expression can correctly handle broken HTML:

     >>> line = '<div> >>>2 + 3 < 6<br/>True <!-- comm > ent --></div><b'
     >>> re.sub(r'(\<(/?[^>]+)>)', '', line)
     ' >>>2 + 3 True  ent --><b'

    For cases like this it is better to use a full-fledged HTML parser and keep regular expressions away from HTML altogether.
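    As an illustration of the parser route, here is a minimal sketch that uses Python's standard html.parser module. The names TextExtractor and text_by_parser are made up for this example, and the sketch only collects text nodes; it is not a sanitizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and comments (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; inside text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments go to handle_comment instead.
        self.chunks.append(data)


def text_by_parser(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


print(text_by_parser("<b>Text with <br/>tags</b>"))  # Text with tags
print(text_by_parser("<p>a<!-- x > y -->b</p>"))     # ab
```

    Note that the contents of script and style elements are also delivered through handle_data, so they would need separate filtering if they must not appear in the result.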

    • Comments must be deleted first, then the tags. - Qwertiy ♦
    • I strongly support the idea of using an HTML parser rather than regular expressions. - Sergey Snegirev
    • @SergeySnegirev, I raised no objections to a parser and I have none. On the other hand, for many specific tasks a simple regular expression is quite sufficient. - Qwertiy ♦
    • @Qwertiy, removing comments first does not help with invalid HTML. As for specific tasks: if the structure of the document is known precisely, if its (in)validity is known precisely, and if a specific piece of text needs to be pulled out, regular expressions may be appropriate. But if, for example, the goal is to filter user input (and I suspect this is what the author wants), then doing it this way is dangerous. - andreymal
    • Invalid HTML is helped by a stricter definition of a tag (for example, my regular expression requires an alphanumeric character after <). Deleting comments helps with something else: it improves accuracy and rules out tricks with the syntax. - Qwertiy ♦
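    The point about deleting comments first can be shown with a small Python sketch. The pattern names and the sample string are assumptions made for this example; the tag pattern is the simple one from the question.

```python
import re

# Assumed patterns for this sketch: a non-greedy comment matcher and the
# naive "anything between < and >" tag matcher from the question.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def strip_tags_only(markup):
    return TAG.sub("", markup)


def strip_comments_then_tags(markup):
    # Remove whole comments first so a ">" inside one cannot end a "tag" early.
    return TAG.sub("", COMMENT.sub("", markup))


sample = "<p>kept<!-- a > b -->text</p>"
print(strip_tags_only(sample))           # kept b -->text
print(strip_comments_then_tags(sample))  # kepttext
```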

    At the moment, this is the version that comes closest to what the browser does:

     function textByBrowser(html) {
       var div = document.createElement("div");
       div.innerHTML = html;
       return div.textContent;
     }

     function textByRegex(html) {
       return html
         .replace(/<!--[\s\S]*?--!?>/g, "")      // remove comments first
         .replace(/<\/?[a-z][^>]*(>|$)/gi, "");  // then anything that starts like a tag
     }

     var tests = [
       '2+3<6',
       '2+3<',
       '<<a>script>alert("XSS!")<<a>/script>',
       '<div> >>>2 + 3 < 6<br/>True <!-- comm > ent --></div><b',
       '<script<b>>alert(1)</script</b>>',
       '<a\n>123\n</a>'
     ];

     tests.map(textByBrowser) + "" == tests.map(textByRegex) // true

    Angle brackets inside attribute values are not handled correctly:

     textByBrowser('1<div data-smth=">">2</div>3') // 123
     textByRegex('1<div data-smth=">">2</div>3')   // 1">23
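    One possible way to cope with this case, sketched in Python, is to let the tag pattern skip over quoted attribute values. This is an assumption of this example, not part of the answer above, and it still does not reproduce browser behaviour for every kind of broken markup (an unclosed quote, for instance, simply prevents a match).

```python
import re

# Pattern from the answer above, ported to Python.
SIMPLE_TAG = re.compile(r"</?[a-z][^>]*(?:>|$)", re.IGNORECASE)

# Sketch: inside a tag, consume "..." and '...' as whole units so that
# a ">" inside a quoted attribute value does not terminate the tag.
QUOTE_AWARE_TAG = re.compile(
    r'''</?[a-z](?:"[^"]*"|'[^']*'|[^>"'])*(?:>|$)''',
    re.IGNORECASE,
)

sample = '1<div data-smth=">">2</div>3'
print(SIMPLE_TAG.sub("", sample))       # 1">23
print(QUOTE_AWARE_TAG.sub("", sample))  # 123
```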

    And character entities are left for you to handle as you see fit:

     textByBrowser("&lt;") // "<"
     textByRegex("&lt;")   // "&lt;"
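    If the entities should be decoded as the browser does, one option in Python is html.unescape from the standard library. The sketch below assumes decoding is done after the tags are stripped; doing it before would turn &lt;b&gt; into a real-looking tag that then gets removed. The function name text_by_regex is made up for the example.

```python
import html
import re

# Python port of the two patterns from the answer above.
COMMENT = re.compile(r"<!--[\s\S]*?--!?>")
TAG = re.compile(r"</?[a-z][^>]*(?:>|$)", re.IGNORECASE)


def text_by_regex(markup, decode_entities=True):
    text = TAG.sub("", COMMENT.sub("", markup))
    # Decode entities last, so that escaped markup is not mistaken for tags.
    return html.unescape(text) if decode_entities else text


sample = "<p>2 &lt; 3 &amp;&amp; &lt;b&gt;</p>"
print(text_by_regex(sample, decode_entities=False))  # 2 &lt; 3 &amp;&amp; &lt;b&gt;
print(text_by_regex(sample))                         # 2 < 3 && <b>
```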

    Note that none of these ways of extracting text is a defense against XSS attacks. When displaying user-supplied text on a page, you should always escape it.

     console.log(textByBrowser('<<a>script>alert("XSS!")<<a>/script>')); // <script>alert("XSS!")</script>
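    A minimal Python sketch of that escaping step, assuming the text is going to be inserted into HTML element content: the tag pattern is a port of the one above, and html.escape is the standard-library escaping function.

```python
import html
import re

TAG = re.compile(r"</?[a-z][^>]*(?:>|$)", re.IGNORECASE)

payload = '<<a>script>alert("XSS!")<<a>/script>'

# Stripping tags reassembles a script tag out of the leftovers...
stripped = TAG.sub("", payload)
print(stripped)  # <script>alert("XSS!")</script>

# ...so the result still has to be escaped before it is written into a page.
safe = html.escape(stripped)
print(safe)  # &lt;script&gt;alert(&quot;XSS!&quot;)&lt;/script&gt;
```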

    PS: An earlier version of this answer, with different code, is available in the edit history.

    • Could < and > be balanced recursively? Then broken markup would be removed as well. - ReinRaus
    • @ReinRaus, recursion in regular expressions? Certainly not in JS. - Qwertiy ♦
    • @ReinRaus, by the way, where would balancing be needed? Something else needs fixing there: the contents of attributes should be handled more flexibly. - Qwertiy ♦
    • I don't remember. It was a long time ago. - ReinRaus

    (?:<).*?(?:>) - cuts all tags

    • Why the groups? - Qwertiy ♦
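    To illustrate the question about the groups: non-capturing groups around a single literal character do not change what the pattern matches. A separate caveat, shown in the Python sketch below, is that . does not match a newline by default, so a tag split across lines is missed. The variable names are made up for the example.

```python
import re

with_groups = re.compile(r"(?:<).*?(?:>)")
without_groups = re.compile(r"<.*?>")
# re.DOTALL lets "." match newlines as well.
dotall = re.compile(r"<.*?>", re.DOTALL)

sample = "<a\n>123\n</a>"

print(repr(with_groups.sub("", sample)))     # '<a\n>123\n'
print(repr(without_groups.sub("", sample)))  # '<a\n>123\n'
print(repr(dotall.sub("", sample)))          # '123\n'
```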

    PHP has the strip_tags function, which removes HTML and PHP tags from a string.

      In my opinion, a more precise definition of a tag would be the following.

      In JavaScript:

       <\/?[A-Za-z]+[^>]*> 
      • That differs very little from mine ... - Qwertiy ♦
      • Generally speaking, the differences are significant: standard tags cannot begin with a digit or an underscore, and the tag name follows < without a space. Your regular expression allows all of that. - stomaks
      • Where? o_O Are you sure you are comparing with my answer? I have this: /<\/?[a-z][^>]*(>|$)/gi - Qwertiy ♦
      • Yes, apparently I mixed things up and compared with someone else's. Yours is better, but: "[a-z]" is only one character from the range a-z, followed by any character except ">" ("[^>]") zero or more times, until it meets a closing bracket or the end of the line. First, the end of the line does not actually close a tag such as "<f". Second, input like "<f5454>" also passes; there are no standard tags with numbers like that, although it can of course be useful if there are custom tags. regexper.com/#%2F%3C%5C%2F%3F%5Ba-z%5D%5B%5E%3E%5D*%28%3E%7C%24%29%2Fgi - stomaks
      • Although mine can be improved even further: <\/?[A-Za-z]+?[^>]*?> regexper.com/#%3C%5C%2F%3F%5BA-Za-z%5D%2B%3F%5B%5E%3E%5D*%3F%3E - stomaks
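      To see where the patterns discussed in these comments actually differ, here is a small Python comparison. The variable names and the sample inputs are made up for the example, and Python's re is used in place of JavaScript.

```python
import re

# The three patterns from the discussion, ported to Python.
qwertiy = re.compile(r"</?[a-z][^>]*(>|$)", re.IGNORECASE)
stomaks = re.compile(r"</?[A-Za-z]+[^>]*>")
stomaks_lazy = re.compile(r"</?[A-Za-z]+?[^>]*?>")

samples = ["<f5454>", "<f", "< b>", "<_x>", "<5>"]

for s in samples:
    results = [p.sub("", s) for p in (qwertiy, stomaks, stomaks_lazy)]
    print(repr(s), results)

# On these inputs only "<f" is treated differently: the pattern with (>|$)
# removes an unterminated tag at the end of the string, the other two keep it.
# The lazy quantifiers do not change the outcome for any of the samples.
```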