Good evening. I have been racking my brain over a single parsing algorithm, and I cannot figure out what is wrong with it. Here is the code:
$alltext = file_get_contents($url);
$alltext = trim($alltext);
$alltext = strip_tags($alltext);
$alltext = ereg_replace('/&\w;/', '', $alltext);
preg_match_all("/(\b[\w+]+\b)/", $alltext, $words);
The problem is that the code does not pick up all of the text, and in general it is not clear what it is doing. It does not pick up Russian letters either. Sometimes it pulls out only the errors, but it cannot handle the main page. I tried doing it differently, and that did not work either. Can you tell me what the problem is?
Suppose there is a page like this:
<h1>TEXT</h1> Тут будет мой текст ололо push push <b>My frog<b><i>is die</i><br> http://site.ru/
The function should extract all the words from this page, without the tags. The following function gave me only: b, de, my, ek, ololo, push, push, y, frogis, die, http, site, ru
$alltext = file_get_contents($url);
// strip whitespace from the beginning and end of the text
$alltext = trim($alltext);
// remove tags from the text
$alltext = strip_tags($alltext);
// remove sequences like &lt; &gt; &a22; but only if they are at most 9 characters long
$alltext = preg_replace('/&[a-z0-9#]{1,9}?;/i', '', $alltext);
// put the words into $words
preg_match_all("/[а-яa-z0-9_]+/", $alltext, $words);
That is, it missed many of the expected results, for example the word TEXT itself. I cannot understand what is wrong.
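For comparison, here is a minimal sketch of a Unicode-aware version, not a definitive fix. It assumes PHP 7 or later and that the page is encoded in UTF-8. The helper name extract_words is hypothetical and does not come from the code above. The sketch swaps in three things: a space is inserted before every tag so that neighbouring inline elements do not fuse into one word ("frogis"), entities are decoded rather than deleted, and the word pattern uses \p{L} and \p{N} with the /u modifier so that Cyrillic and uppercase letters both match.

```php
<?php
// Hypothetical helper (an illustration only): extract words from an HTML string.
// Assumes the input is valid UTF-8.
function extract_words(string $html): array
{
    // Insert a space before every "<" so that text on both sides of a tag
    // stays separated once the tags are stripped.
    $text = str_replace('<', ' <', $html);
    $text = strip_tags($text);

    // Decode entities such as &amp; or &nbsp; instead of deleting them.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // \p{L} matches any letter and \p{N} any digit; the /u modifier makes
    // PCRE treat both pattern and subject as UTF-8.
    preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);

    return $matches[0];
}

$page = '<h1>TEXT</h1> Тут будет мой текст ололо push push '
      . '<b>My frog<b><i>is die</i><br> http://site.ru/';

print_r(extract_words($page));
```

On the sample page this should yield TEXT, Тут, будет, мой, текст, ололо, push, push, My, frog, is, die, http, site, ru. If the page is served in another encoding, such as windows-1251, it would need to be converted to UTF-8 first, otherwise the /u modifier makes the match fail.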