HTML Parsing

Question

Good evening. I have been racking my brain over a single parsing routine and cannot work out what is wrong with it. Here is the code:

  $alltext = file_get_contents($url);
  $alltext = trim($alltext);
  $alltext = strip_tags($alltext);
  $alltext = ereg_replace('/&\w;/', '', $alltext);
  preg_match_all("/(\b[\w+]+\b)/", $alltext, $words);

The problem is that it does not pick up all of the text, and in general it is not clear what it is doing. It does not pick up Russian letters either. Sometimes it only pulls out errors, and it cannot handle the main page. I tried a different approach, and that did not work either. What is the problem?

Take a page like this:

  <h1>TEXT</h1> Тут будет мой текст ололо push push <b>My frog<b><i>is die</i><br> http://site.ru/

The function should extract all the words from this page, without the tags. The function below returned only: b, de, my, ek, ololo, push, push, y, frogis, die, http, site, ru

  $alltext = file_get_contents($url);
  // strip whitespace from the start and end of the text
  $alltext = trim($alltext);
  // remove tags from the text
  $alltext = strip_tags($alltext);
  // remove sequences like &lt; &gt; &#1a22;, but only when the part
  // between & and ; is at most 9 characters long
  $alltext = preg_replace('/&[a-z0-9#]{1,9}?;/i', '', $alltext);
  // put the words into $words
  preg_match_all("/[а-яa-z0-9_]+/", $alltext, $words);

That is, it missed a lot of the results, for example the actual word TEXT. I cannot understand what is wrong.
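A side note on "frogis" in the output above: strip_tags() removes tags without leaving anything in their place, so words separated only by markup get joined. The following is a minimal sketch of one way around that, assuming a plain space is an acceptable separator; the variable names $html, $glued and $spaced are illustrative and not part of the original code.

```php
<?php
// Sample markup taken from the question.
$html = '<h1>TEXT</h1> Тут будет мой текст ололо push push <b>My frog<b><i>is die</i><br> http://site.ru/';

// Plain strip_tags() drops the tags and leaves nothing between "frog" and "is".
$glued = strip_tags($html);

// Assumption: a space may stand in for any tag boundary.
// Put a space in front of every "<", strip the tags, then collapse whitespace runs.
$spaced = strip_tags(str_replace('<', ' <', $html));
$spaced = trim(preg_replace('/\s+/u', ' ', $spaced));

echo $glued, PHP_EOL;
echo $spaced, PHP_EOL;
```

Note that this also puts a space before any literal "<" in the text itself, which is harmless for word extraction but may matter elsewhere.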

It is nice when, at the end of a busy day, someone gives you a laugh :) > "In general, it is not clear what he is doing."

Accepted Answer · 2012-07-29T22:44:59

Try this:

  // load the data
  $alltext = file_get_contents($url);
  // strip whitespace from the start and end of the text
  $alltext = trim($alltext);
  // remove tags from the text
  $alltext = strip_tags($alltext);
  // remove sequences like &lt; &gt; &#1a22;, but only when the part
  // between & and ; is at most 9 characters long
  $alltext = preg_replace('/&[a-z0-9#]{1,9}?;/i', '', $alltext);
  // put the words into $words
  preg_match_all("/[а-яa-z0-9_]+/i", $alltext, $words);

I forgot to add the case-insensitivity modifier (i) to the second regular expression.
Now it returns: text, b, de, my, ec, ololo, push, pushmy, frogis, die, http, site, ru. I cannot understand why it does not return all of the Russian text.
Hmm, both files are in UTF-8. Which one should I convert?
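If both the page and the script really are UTF-8, converting either one may not be necessary. Without the u modifier, PHP's preg_* functions treat the pattern and the subject as single bytes, so a class like [а-я] does not match whole Cyrillic letters. The sketch below adds u to the pattern from the answer; the hard-coded $alltext and the extra ё in the class (it lies outside the а-я range) are assumptions made for illustration.

```php
<?php
// Assumption: this source file and the input text are both UTF-8.
// The text is what the sample page looks like once the tags are removed.
$alltext = 'TEXT Тут будет мой текст ололо push push My frog is die http://site.ru/';

// "u" makes PCRE read the pattern and the subject as UTF-8 characters, not bytes.
// "i" keeps the match case-insensitive, as in the answer.
preg_match_all('/[а-яёa-z0-9_]+/iu', $alltext, $matches);

// Index 0 holds the full matches, i.e. the words.
$words = $matches[0];

print_r($words);
```

With this input the Russian words come back whole ("Тут", "будет", "мой", "текст", "ололо") alongside the Latin ones. A class such as [\p{L}\p{N}_] with the same u modifier would cover letters of any script, if that is closer to what is needed.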
