I'm writing a parser for HTML pages. I load the page's DOM and read all the links. To find out whether a link points to an article, I first need to remove all the tags nested inside the a tag, along with their contents, and then get the remaining text.
For this I use regular expressions. The tags I come across most often are div, span, b, i, p and strong, and I use 6 regular expressions to strip them.

$clean_title = preg_replace("'<span[^>]*?>.*?</span>'si", "", $title);
$clean_title = preg_replace("'<p[^>]*?>.*?</p>'si", "", $clean_title);
$clean_title = preg_replace("'<div[^>]*?>.*?</div>'si", "", $clean_title);
$clean_title = preg_replace("'<strong[^>]*?>.*?</strong>'si", "", $clean_title);
$clean_title = preg_replace("'<i[^>]*?>.*?</i>'si", "", $clean_title);
$clean_title = preg_replace("'<b[^>]*?>.*?</b>'si", "", $clean_title);

How can I combine them into a single regular expression instead of six?
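
For reference, one way the six patterns could be folded into one is an alternation with a backreference. This is only a sketch: the sample value of $title is made up, and the behavior is not identical to the original patterns, because the added \b keeps <p from also matching <pre> and <b from matching <br>. Like the originals, it does not handle a tag nested inside another tag of the same name.

```php
<?php
// Hypothetical sample input standing in for the inner HTML of a link.
$title = '<span class="tag">News</span>Article title<b>5</b>';

// (span|p|div|strong|i|b) captures the tag name, \1 requires the closing tag
// to carry the same name, and \b stops a short name such as "b" from
// matching the start of a longer one such as "br".
$pattern = '~<(span|p|div|strong|i|b)\b[^>]*>.*?</\1>~si';

$clean_title = preg_replace($pattern, '', $title);
echo $clean_title, PHP_EOL;
```

The ~ delimiter is used only so that the quotes and slashes inside the pattern need no escaping.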

  • Why torture the DOM and yourself :) Load the page into DOMDocument and find all the a elements. - splash58

1 answer

Parsing HTML with regular expressions is a thankless job. Let's do it more or less properly. This is how you get a list of all the links in the document, from which you can then select the ones you need ...

$doc = new DOMDocument();
$doc->loadHTML($pageHtml);
$a = $doc->getElementsByTagName("a");
foreach ($a as $item) {
    $href = $item->getAttribute("href"); // link target
    $text = $item->nodeValue;            // text of the link, including text of nested elements
}
  • That's right, but I need exactly the text inside the link, and its length, to decide whether it is a title or not. So after getting the links the way you described, I still need to remove all the extra tags from them, along with the extra text. - Valentine Murnik
  • Are you saying that nodeValue gives you innerHTML? - splash58
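
To make the point in the comments concrete: for an element, nodeValue returns the concatenated text of all its descendants with the tags stripped, so it is not innerHTML, but it does still include the text of nested span, b and similar elements. If only the link's own text is wanted, one possible approach is to collect just the direct text-node children. The sketch below assumes that reading of the requirement; ownText() and the sample $pageHtml are hypothetical names introduced for illustration, and mb_strlen() assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: returns only the text that sits directly inside
// the node, skipping nested elements together with their contents.
function ownText(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text);
}

// Made-up sample page fragment.
$pageHtml = '<a href="/post/1"><span>New</span> Article title <b>12</b></a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about real-world markup quiet
$doc->loadHTML($pageHtml);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $item) {
    $href   = $item->getAttribute('href');
    $text   = ownText($item);     // text without nested tags and their contents
    $length = mb_strlen($text);   // length used to decide "title or not"
    echo $href, ' | ', $text, ' | ', $length, PHP_EOL;
}
```

With this approach no regular expressions are needed at all, and the list of tags to strip does not have to be maintained by hand.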