Hello. I have a string containing arbitrary HTML tags. The task is to split the string into an array of tags. Example:

$aaa = '<h1>Hello</h1> <p> zzzzz zz z zz <a href="#">VVV</a> zzzzz </p>'; 

print_r($aaa);

    [0] => NAME => h1, TEXT => Hello,
    [1] => NAME => p, TEXT => zzzzz zz z zz <a href="#">VVV</a> zzzzz,

Update

Improved code:

    foreach ($iterator as $node) {
        if ($node->nodeType === XML_ELEMENT_NODE
            && ($node->nodeName !== 'html' && $node->nodeName !== 'body')) {
            $name = iconv("ISO-8859-1", "UTF-8", $node->nodeName);
            $text = iconv("ISO-8859-1", "UTF-8", trim(getInnerHTML($node), PHP_EOL));
            array_push($output, array(
                'NAME' => $name,
                'TEXT' => $text));
        }
    }

But the output is still garbled.

  • DOMDocument will help you here. - splash58
  • If you have received a complete answer, mark it as accepted (the check mark next to the chosen answer). - Nicolas Chabanovsky

1 answer

Here is one solution:

Initial data:

 $content = '<h1>Hello</h1> <p> zzzzz zz z zz <a href="#">VVV</a> zzzzz </p>'; 

Helper function and class:

    function getInnerHTML(\DOMNode $element)
    {
        $innerHTML = '';
        $children = $element->childNodes;
        // Serialize each child node and concatenate the results.
        foreach ($children as $child) {
            $innerHTML .= $element->ownerDocument->saveHTML($child);
        }
        return $innerHTML;
    }

RecursiveDOMIterator (a separate class, not built into PHP; it has to be defined or included before use)
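Since RecursiveDOMIterator does not ship with PHP, here is a minimal sketch of what such a class might look like, assuming PHP 8.0 or later. It is an illustrative stand-in written against the built-in RecursiveIterator interface, not necessarily the exact class the answer refers to.

```php
<?php
// Hypothetical minimal stand-in for the RecursiveDOMIterator class used in
// this answer. It iterates over the child nodes of a DOMNode and lets
// RecursiveIteratorIterator descend into each child.
class RecursiveDOMIterator implements RecursiveIterator
{
    private int $position = 0;
    private DOMNodeList $nodeList;

    public function __construct(DOMNode $domNode)
    {
        $this->nodeList = $domNode->childNodes;
    }

    public function current(): mixed
    {
        return $this->nodeList->item($this->position);
    }

    public function key(): mixed
    {
        return $this->position;
    }

    public function next(): void
    {
        $this->position++;
    }

    public function rewind(): void
    {
        $this->position = 0;
    }

    public function valid(): bool
    {
        return $this->position < $this->nodeList->length;
    }

    public function hasChildren(): bool
    {
        return $this->current()->hasChildNodes();
    }

    public function getChildren(): ?RecursiveIterator
    {
        return new self($this->current());
    }
}

// Usage: collect the names of all element nodes, parents before children.
$dom = new DOMDocument;
$dom->loadHTML('<h1>Hello</h1> <p> zzzzz <a href="#">VVV</a> zzzzz </p>');

$names = array();
$iterator = new RecursiveIteratorIterator(
    new RecursiveDOMIterator($dom),
    RecursiveIteratorIterator::SELF_FIRST);
foreach ($iterator as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
}
```

The html and body entries appear because loadHTML wraps a fragment in those elements, which is why the loop in this answer filters them out.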

Usage:

    $dom = new DOMDocument;
    $dom->loadHTML($content);

    $output = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDOMIterator($dom),
        RecursiveIteratorIterator::SELF_FIRST);

    foreach ($iterator as $node) {
        // Keep element nodes only, skipping the html/body wrappers added by loadHTML.
        if ($node->nodeType === XML_ELEMENT_NODE
            && ($node->nodeName !== 'html' && $node->nodeName !== 'body')) {
            array_push($output, array(
                'NAME' => $node->nodeName,
                'TEXT' => trim(getInnerHTML($node), PHP_EOL)));
        }
    }

Result:

    Array
    (
        [0] => Array
            (
                [NAME] => h1
                [TEXT] => Hello
            )
        [1] => Array
            (
                [NAME] => p
                [TEXT] => zzzzz zz z zz <a href="#">VVV</a> zzzzz
            )
        [2] => Array
            (
                [NAME] => a
                [TEXT] => VVV
            )
    )

UPD (for non-ASCII text such as Russian, convert the input before loading it):

 $dom->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8')); 
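One caveat: as of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated. A possible replacement, assuming the mbstring extension is available and the source string is UTF-8, is mb_encode_numericentity. The sketch below uses a made-up sample string and only checks the text read back from the DOM.

```php
<?php
// Sample input is an assumption for illustration; any UTF-8 HTML string works.
$content = '<h1>Привет</h1> <p>мир</p>';

// Convert every code point from U+0080 to U+10FFFF into a numeric entity
// (&#NNNN;) and leave the ASCII markup untouched.
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$encoded = mb_encode_numericentity($content, $map, 'UTF-8');

$dom = new DOMDocument;
$dom->loadHTML($encoded);

// The parser decodes the entities, so the DOM holds the original text.
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;
```

Because the string handed to loadHTML is pure ASCII, the parser does not have to guess the input encoding at all.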

UPD2 (collecting attributes):

    function getAttributes(\DOMNode $element)
    {
        $attributes = array();
        // Build a map of attribute name => attribute value.
        foreach ($element->attributes as $attribute) {
            $attributes[$attribute->nodeName] = $attribute->nodeValue;
        }
        return $attributes;
    }

Then the loop changes slightly:

    foreach ($iterator as $node) {
        if ($node->nodeType === XML_ELEMENT_NODE
            && ($node->nodeName !== 'html' && $node->nodeName !== 'body')) {
            array_push($output, array(
                'NAME' => $node->nodeName,
                'TEXT' => trim(getInnerHTML($node), PHP_EOL),
                'ATTR' => getAttributes($node)));
        }
    }

The new ATTR element holds an array of the form attribute name => attribute value.

Recursion is not needed in this case, because the iterator already visits nested tags.
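For the request in the comments to put href (for a) and src (for img) under a URL key, something along these lines might work. This is only a sketch: the sample markup and the 'URL' key name are assumptions, and it uses getElementsByTagName('*') instead of the iterator purely to keep the example self-contained.

```php
<?php
// Sample markup is an assumption for illustration.
$content = '<p>text <a href="#top">VVV</a> <img src="pic.png" alt="x"></p>';

$dom = new DOMDocument;
$dom->loadHTML($content);

$output = array();
// getElementsByTagName('*') returns every element in document order,
// nested ones included, so no explicit recursion is required.
foreach ($dom->getElementsByTagName('*') as $node) {
    if ($node->nodeName === 'html' || $node->nodeName === 'body') {
        continue; // wrappers added by loadHTML
    }
    $item = array('NAME' => $node->nodeName);
    // 'URL' is a hypothetical key name taken from the comment thread.
    if ($node->nodeName === 'a' && $node->hasAttribute('href')) {
        $item['URL'] = $node->getAttribute('href');
    } elseif ($node->nodeName === 'img' && $node->hasAttribute('src')) {
        $item['URL'] = $node->getAttribute('src');
    }
    $output[] = $item;
}
```

Elements without a link attribute simply get no URL key, so the caller can test for it with isset.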

UPD3 (recursive variant):

    function walk($output, \DOMNode $node, $depth = 0)
    {
        if ($node->hasChildNodes()) {
            $children = $node->childNodes;
            foreach ($children as $child) {
                // Skip text nodes; only tags are collected.
                if ($child->nodeType === XML_TEXT_NODE) {
                    continue;
                }
                $output[] = $child->nodeName;
                // Nested tags go into a sub-array right after their parent's name.
                $item = walk([], $child, $depth + 1);
                if (!empty($item)) {
                    $output[] = $item;
                }
            }
        }
        return $output;
    }

    $root = $dom->getElementsByTagName('body')[0];
    $output = walk([], $root, 0);

This is how you can walk the tree recursively.

Result:

    array(3) {
      [0]=>
      string(2) "h1"
      [1]=>
      string(1) "p"
      [2]=>
      array(1) {
        [0]=>
        string(1) "a"
      }
    }
  • How do I handle Russian text? - tsx
  • @tsx, see the update. - sleeper
  • Thanks. If I want to split nested tags as well, how can I do that? - tsx
  • Recursively, and also write the src and href attributes (of img and a) to a URL key. - tsx
  • @tsx, see UPD2 ... - sleeper