Good afternoon. I have pages with markup like this:

<td class=x11111111111111111 width=140 style='>1111111111111111111111111111<td>11111111111111border-top:none;border-left:none; width:107pt'> 

The task is to turn all this clutter into a plain <td>, i.e. to remove kilobytes of unnecessary junk from the attributes inside the tags.

I am wary of doing this with strpos, because I could get bogged down in substring parsing (greediness): the tag may contain angle brackets, such as " <td class> </td> ", and other junk. That is, a tag can contain text attributes whose values include substrings that match the tag name.

  • Do you need to remove the attributes from one specific tag on the page? - VenZell
  • You have two td tags there; should one remain, or two? - Jean-Claude

4 answers

    $str = "<td class=x11111111111111111 width=140 style='>1111111111111111111111111111<td>11111111111111border-top:none;border-left:none; width:107pt'> ";
    print preg_replace('/<td(?:([\'"]).*?\1|.)*?>/', "<td>", $str);

This strips the attributes from <td> tags, including attribute values wrapped in quotes.

To clear any tag, you can slightly modify the expression:

    preg_replace('/<(\w+)(?:([\'"]).*?\2|.)*?>/', "<$1>", $str);
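To see why the quoted-value branch matters, here is the same any-tag expression written with the x modifier so that each part can carry a comment. This is only a sketch: the function name strip_tag_attributes and the sample string are made up for illustration, and the ~ delimiter is used just so the comments need no escaping.

```php
<?php
// Hypothetical helper (not from the answer): same pattern, annotated via the x flag.
function strip_tag_attributes($html)
{
    $pattern = '~
        <(\w+)              # "<" plus the tag name, captured as group 1
        (?:
            ([\'"]) .*? \2  # a quoted value: a quote, anything, then the same quote
          | .               # otherwise any single character
        )*?                 # as few repetitions as possible
        >                   # the ">" that closes the tag
    ~x';

    return preg_replace($pattern, '<$1>', $html);
}

echo strip_tag_attributes("<td width=140 style='>111<td>222;width:107pt'>cell</td>");
// <td>cell</td>
```

Because the quoted branch is tried before the single-character branch, the ">" and "<td>" inside the style value are consumed as part of the attribute rather than treated as the end of the tag. Note that "." does not match a newline here, so a tag whose attributes span several lines would presumably also need the s modifier.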

  • Upvoted for a working regular expression - VenZell

I don't think regular expressions are fully suited to this problem.
It is better to use DOMDocument. It handles even invalid markup correctly.

A working example:

    $string = "<td class=x11111111111111111 width=140 style='>1111111111111111111111111111<td>11111111111111border-top:none;border-left:none; width:107pt'>";

    $doc = new DOMDocument();
    $doc->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Change the selector to the one you need
    $elements = $doc->getElementsByTagName('td');

    // Iterate over all selected elements
    foreach ($elements as $element) {
        // The element's attribute list
        $attributes = $element->attributes;

        // The attribute list is re-indexed after each removal, so keep removing
        // the first item. Once the last attribute is gone, the condition
        // becomes false and the loop exits.
        while ($attributes->length) {
            $element->removeAttributeNode($attributes->item(0));
        }
    }

    echo $doc->saveHTML(); // <td></td>

Note the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags.

Without them, the output would be:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><td></td></body></html> 

Starting with PHP 5.4 and Libxml 2.6, the loadHTML method accepts a second parameter, $options, which tells Libxml how to parse the HTML.

LIBXML_HTML_NOIMPLIED (integer)
Sets the HTML_PARSE_NOIMPLIED flag, which disables the automatic addition of missing html / body ... elements.

LIBXML_HTML_NODEFDTD (integer)
Sets the HTML_PARSE_NODEFDTD flag, which prevents the addition of a standard doctype if it was not found.

All predefined constants can be viewed in the documentation.

Attention

Although the documentation states that Libxml version 2.6 is required, LIBXML_HTML_NODEFDTD is only available from version 2.7.8, and LIBXML_HTML_NOIMPLIED from version 2.7.7.
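Given those version caveats, one cautious approach is to enable each flag only when the running build actually defines it. The sketch below does that and also widens the selector to every element; the function name strip_all_attributes and the sample markup are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper (not from the answer): strip attributes from every element.
function strip_all_attributes($html)
{
    // Use the fragment flags only where this libxml build defines them.
    // Without them, saveHTML() will include the doctype and html/body wrappers.
    $options = 0;
    foreach (array('LIBXML_HTML_NOIMPLIED', 'LIBXML_HTML_NODEFDTD') as $flag) {
        if (defined($flag)) {
            $options |= constant($flag);
        }
    }

    $doc = new DOMDocument();

    // Collect parser warnings about messy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, $options);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // "*" selects every element in the document.
    foreach ($doc->getElementsByTagName('*') as $element) {
        // The list is re-indexed after each removal, so always take item 0.
        while ($element->attributes->length) {
            $element->removeAttributeNode($element->attributes->item(0));
        }
    }

    return trim($doc->saveHTML());
}

echo strip_all_attributes('<p style="padding:0"><strong class="a" data-x="1">hello</strong></p>');
```

The LIBXML_DOTTED_VERSION constant can be echoed to check which libxml version a given PHP build is linked against.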

Based on the answers to the questions:

    Remove all the junk inside the tags:

     $str = "<td class=x11111111111111111 width=140 style='>1111111111111111111111111111<td>11111111111111border-top:none;border-left:none; width:107pt'>";
     $str = preg_replace('/<td.+?>/', '<td>', $str);
     echo $str; // result: <td>1111111111111111111111111111<td>
    • And never mind that those 1111 were inside the style attribute of the <td> tag, which actually spans the whole line? - Mike
    • @Mike That is genuinely unknown; maybe it happened when this HTML mess was generated. - Jean-Claude
    • The OP clearly noted this in the question: "there may be text attributes inside the tag, in which there may be substrings corresponding to the name of the tag" - Mike

    Universal cleanup of all attributes

     $text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
     echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2>', $text); // <p><strong>hello</strong></p>
    • A bold statement, given that it stumbles on the author's own example: ideone.com/H77Iq9 - VenZell
    • In addition, attributes may contain more than letters and digits. There may be, for example, a hyphen. Take the data-* attributes, for instance. And not only a hyphen... - VenZell
    • data-* and the rest are removed normally. I don't see where it stumbles: the attributes are removed correctly, and what comes after the tag is already content; there is no closing tag, and no nested tag either. - Redr01d
    • This is NOT the content of the tag, it is the content of the attribute, so you cannot say the attribute was removed correctly. The beginning of the attribute is removed, but the end of the attribute and its contents are not. You are right about data-* and the rest, though; I misread the regex, looking at it late at night. - VenZell
    • "This is the content of the attribute": yes, exactly, I hadn't noticed that. It only removes valid values. - Redr01d
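The disagreement in these comments can be reproduced with a short comparison. This is a sketch under assumptions: the function names strip_simple and strip_quote_aware and both sample strings are invented for illustration; the two patterns are the ones quoted in this thread.

```php
<?php
// Both function names are hypothetical; the patterns are the ones from this thread.
function strip_simple($html)
{
    // Stops at the first ">", even one inside a quoted attribute value.
    return preg_replace('/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i', '<$1$2>', $html);
}

function strip_quote_aware($html)
{
    // Skips over quoted values before looking for the closing ">".
    return preg_replace('/<(\w+)(?:([\'"]).*?\2|.)*?>/', '<$1>', $html);
}

$plain  = '<td data-id="5" class="x">cell</td>';
$tricky = '<td title="a > b">cell</td>';

echo strip_simple($plain), "\n";       // <td>cell</td>
echo strip_simple($tricky), "\n";      // <td> b">cell</td>
echo strip_quote_aware($tricky), "\n"; // <td>cell</td>
```

So both commenters have a point: data-* attributes are handled by the simple pattern, but a ">" inside an attribute value cuts the match short and leaves the tail of the attribute in the output.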