$content = file_get_contents($Url);
preg_match_all('#<title>.+</title>#', $content, $matches);
$title = preg_replace('#(<title>|</title>)#', '', $matches[0][0]);

This is how I get the title of a page at a given URL. The problem is that not all sites use the same encoding: for most sites the title is extracted fine, but for the rest it either shows up as diamonds with question marks or is not written to the database at all. I tried converting the resulting string to UTF-8, but so far with no success.

    3 answers

    1. Determine the encoding with the mb_detect_encoding() function.

    2. Convert it, for example to UTF-8, with the mb_convert_encoding() function.

    3. Then parse with the u (UTF-8) modifier.

    This modifier turns on additional PCRE functionality that is incompatible with Perl: the pattern and the subject string are treated as UTF-8 strings.

    preg_match_all('#<title>.+?</title>#isu', $content, $matches);
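
    Putting the three steps together, a minimal sketch might look like the following; the candidate encoding list and the $url variable are assumptions, not part of the answer:

    $content = file_get_contents($url);

    // 1. Detect the source encoding against a list of likely candidates.
    // NB: detection is heuristic; the order of candidates matters.
    $charset = mb_detect_encoding($content, ['UTF-8', 'Windows-1251', 'ISO-8859-1'], true);

    // 2. Convert to UTF-8 when detection succeeded and it is not UTF-8 already.
    if ($charset !== false && $charset !== 'UTF-8') {
        $content = mb_convert_encoding($content, 'UTF-8', $charset);
    }

    // 3. Parse with the u modifier so PCRE treats the string as UTF-8.
    preg_match_all('#<title>(.+?)</title>#isu', $content, $matches);
    $title = $matches[1][0] ?? '';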

      First you need to decide in which encoding you will store the pages.

      Then you need to determine the encoding of the downloaded page.

      If the encodings differ, call the iconv() function.
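
      A minimal sketch of that flow, assuming the storage encoding is UTF-8 and that $charset was already detected by some means (both are assumptions):

      $stored = 'UTF-8';  // the encoding chosen for storage (assumption)

      // Convert only when the detected page encoding differs;
      // //IGNORE drops characters that cannot be represented.
      if ($charset !== false && strcasecmp($charset, $stored) !== 0) {
          $content = iconv($charset, $stored . '//IGNORE', $content);
      }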

        I solved the problem as follows. First I determined the page encoding from the downloaded content:

         preg_match_all('#charset=.+"#', $content, $array);
         $charset = preg_replace('#(charset=|")#', '', $array[0][0]);

        This code is not universal, though, because not all pages declare the encoding this way. Then I converted the string to the correct encoding:

         $newtitle = iconv($charset, "UTF-8", $title); 

        Of course, the code is not perfect, but the percentage of successfully extracted titles has increased significantly.
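
        A slightly more robust variant of the same idea (a sketch, not from the original answer) would also check the HTTP Content-Type header, which PHP exposes through $http_response_header after a file_get_contents() call over http(s); the regular expressions, fallback order, and the $url variable here are assumptions:

        $content = file_get_contents($url);
        $charset = null;

        // Prefer the Content-Type response header when it declares a charset.
        foreach ($http_response_header as $header) {
            if (preg_match('#^Content-Type:.*charset=([\w-]+)#i', $header, $m)) {
                $charset = $m[1];
                break;
            }
        }

        // Fall back to the in-page charset declaration used above.
        if ($charset === null && preg_match('#charset=["\']?([\w-]+)#i', $content, $m)) {
            $charset = $m[1];
        }

        $newtitle = ($charset !== null && strcasecmp($charset, 'UTF-8') !== 0)
            ? iconv($charset, 'UTF-8', $title)
            : $title;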