Parsing HTML: extracting all the words

Question

I need to parse HTML and extract all the words from it. Here is a call that only partly works:

preg_match_all("/<.+[^\/]>(.+[^<>])<\/.+>*/ix", $content, $var);

However, it does not account for the space before the next tag (the following <.+>). It also cannot handle HTML structured like this:

 <div>First Text <span>Last text</span></div>

Please help me put together the right pattern.

Vitalina eleven one 2 eight · Answer 1 · 2015-02-15T06:16:55

 $words = preg_split('/\W+/', strip_tags($str), -1, PREG_SPLIT_NO_EMPTY);

The funny thing is that this solution takes two minutes to find on Google.
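
A minimal sketch of how the two functions named in this answer could be combined. The helper name extract_words is hypothetical, and two steps are my own assumptions rather than part of the answer: a space is inserted before each tag so that words in adjacent elements do not fuse, and entities are decoded so that "&amp;" does not turn into the word "amp".

```php
<?php
// Hypothetical helper; the answer itself only names strip_tags() and preg_split().
function extract_words(string $html): array
{
    // Assumption: put a space before every tag so that adjacent elements
    // such as "<p>one</p><p>two</p>" do not collapse into "onetwo".
    $spaced = str_replace('<', ' <', $html);

    // Remove the markup, then decode entities (&amp;, &nbsp;, ...) to characters.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on runs of characters that are neither letters nor digits.
    // The /u modifier treats the input as UTF-8; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example, invalid UTF-8 input).
    return $words === false ? [] : $words;
}

print_r(extract_words('<div>First Text <span>Last text</span></div>'));
```

For the sample markup from the question this yields the four words First, Text, Last, text.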

fermeg fermeg 87 eleven · Answer 2 · 2016-10-11T17:39:37

Using your example:

 <div>First Text <span>Last text</span></div>

Solution:

 PATH = 'div'
 div = g.cssselect(PATH)[0].text_content()

Output

First Text Last text
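
Since the question is about PHP, a rough PHP counterpart of the same idea might look like the sketch below. It assumes the DOM extension is enabled and uses DOMDocument rather than the library from this answer, so treat it as an illustration, not as a translation of the answer's code.

```php
<?php
$html = '<div>First Text <span>Last text</span></div>';

$doc = new DOMDocument();
// The libxml flags silence warnings about fragments and sloppy markup.
// Note: loadHTML() may misread non-ASCII input unless the encoding is declared.
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Rough equivalent of cssselect('div')[0]: the first <div> in the document.
$div = $doc->getElementsByTagName('div')->item(0);

// textContent concatenates the text of the element and all of its descendants.
$text = $div->textContent;
echo $text, PHP_EOL;
```

The resulting string can then be split into words with preg_split, as in the first answer.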

Legionary 1.407 five sixteen · Answer 3 · 2016-10-11T18:10:40

 preg_match_all('/<\S+[^\/]>(.*?)<\/\S+>/ims', $html, $matches);

Then apply strip_tags to the elements of the resulting array.
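
A sketch of the whole sequence on the sample from the question. The final split into words is my own addition (the answer stops at strip_tags), and the variable names are illustrative.

```php
<?php
$html = '<div>First Text <span>Last text</span></div>';

// Pattern from this answer: capture everything between an opening tag
// and the first closing tag that follows it.
preg_match_all('/<\S+[^\/]>(.*?)<\/\S+>/ims', $html, $matches);

// $matches[1] holds the captured groups. They can still contain nested
// tags (here "<span>"), so strip_tags() is applied to each element.
$texts = array_map('strip_tags', $matches[1]);

// Assumption: split each cleaned string on non-word characters and merge.
$words = [];
foreach ($texts as $text) {
    $pieces = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_merge($words, $pieces);
}

print_r($words);
```

On this sample the result is First, Text, Last, text. Note that \S+ cannot cross whitespace, so an opening tag with attributes, such as <div class="a">, will not be matched by this pattern as written.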
