Encoding problems

Question

My site itself is in UTF-8. The default encoding is set in .htaccess and in the page header. The site loads a page from the Internet and reads the encoding from the response headers, for example 1251. Then, using nokogiri.php, it parses the title and h1 from the page and displays them on my site. It turned out that if the remote site is in win1251, the text displays fine, but if I parse a site in UTF-8, I get garbled characters (mojibake). Previously I used simple_html_dom.php, and the encoding seemed fine with it, but it is terribly slow and eats memory.

I can't understand why the encoding behaves this way.

Then I found that nokogiri uses:

$dom = new DOMDocument('1.0', 'UTF-8' );

Here I tried replacing UTF-8 with windows-1251 by hand. The most interesting thing is that if the remote site is in windows-1251, everything is fine, and if it is in UTF-8, everything breaks, like this:

[sample of garbled output; the original characters did not survive]

To clarify the problem once more: my site loads other people's pages from the Internet, extracts the headings, takes the encoding from the server's response, and converts the extracted text to UTF-8. So I can't hard-code any single conversion; I need a universal solution.

For some reason DOMDocument loads win1251 pages correctly, even though my site is in UTF-8 (both the site and the database it writes to). If the page was in UTF-8, garbled characters end up in the database (as a commenter already said, as if in ISO-8859-1 encoding).
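
The flow described here (read the charset from the response, convert the body to UTF-8, then parse) could be sketched roughly as follows. The function names `charsetFromContentType` and `toUtf8` are made up for illustration, and the sketch assumes the mbstring extension is available; it is not what nokogiri.php does internally.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// e.g. "text/html; charset=windows-1251". Returns null if none is declared.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert a fetched body to UTF-8 using the declared charset.
// Falls back to the untouched body when the charset is missing or unknown.
function toUtf8(string $body, ?string $charset): string
{
    if ($charset === null || strcasecmp($charset, 'UTF-8') === 0) {
        return $body; // nothing to convert
    }
    try {
        $converted = @mb_convert_encoding($body, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        return $body; // PHP 8+: mbstring does not know this encoding name
    }
    return is_string($converted) ? $converted : $body;
}

$header = 'text/html; charset=windows-1251';
$body   = "\xCF\xF0\xE8\xE2\xE5\xF2"; // the bytes of "Привет" in windows-1251
echo toUtf8($body, charsetFromContentType($header)), PHP_EOL;
```

A bare value such as 1251 may need to be mapped to a full name like windows-1251 before mbstring accepts it.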

If the problem affects only one string, you can simply convert that string directly. [The original code sample is garbled and could not be recovered.]
Check that default_charset in php.ini is not set to cp1251. More on the topic: devzone.zend.com/1538/php-dom-xml-extension-encoding-processing
@armenka, can you say exactly which OS is on the machine where this parser runs?
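
The ISO-8859-1 remark and the single-string fix mentioned in the comments can be illustrated with a small sketch. It assumes the garbling comes from UTF-8 bytes being read as ISO-8859-1 and re-encoded as UTF-8; the function names are hypothetical, and the repair should only be applied to strings known to be damaged in exactly this way.

```php
<?php
// Hypothetical helper: reproduce the damage by treating UTF-8 bytes
// as if they were ISO-8859-1 and encoding them to UTF-8 a second time.
function simulateMojibake(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: undo exactly that damage by reversing the conversion.
// Not a general detector: running it on healthy text will destroy it.
function repairMojibake(string $garbled): string
{
    return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}

$original = 'Привет';
$garbled  = simulateMojibake($original);

var_dump($garbled === $original);                 // the text no longer matches
var_dump(repairMojibake($garbled) === $original); // the round trip restores it
```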

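shelkot · Answer 1 · 2013-06-29T13:03:02

Try creating a .htaccess file and adding AddDefaultCharset UTF-8 to it. (AddCharset on its own is not enough: it also needs a file extension, for example AddCharset UTF-8 .html.)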

evgen_dev · Answer 2 · 2017-02-01T09:05:18

The problem may occur due to a missing meta tag with encoding. At one time, a strange but effective solution helped me:

 $response = str_replace('<head>', '<head><meta http-equiv="content-type" content="text/html; charset=utf-8">', $response);
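
A slightly more defensive variant of the same idea, as a sketch only. As far as I understand, `loadHTML` takes the encoding from the markup itself and falls back to ISO-8859-1 when nothing is declared, and the encoding passed to the `DOMDocument` constructor does not change how input is read. The sketch assumes the markup has already been converted to UTF-8, and it prepends the meta tag instead of searching for `<head>`, so it does not depend on how the page writes that tag. `loadUtf8Html` and `firstTagText` are hypothetical names.

```php
<?php
// Hypothetical helper: parse markup that has ALREADY been converted to UTF-8.
function loadUtf8Html(string $utf8Html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings about messy real-world HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Declare the charset before anything else in the input (assumption:
    // libxml keeps the first declaration it sees).
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($meta . $utf8Html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Hypothetical helper: text of the first element with the given tag name, or null.
function firstTagText(DOMDocument $dom, string $tag): ?string
{
    $node = $dom->getElementsByTagName($tag)->item(0);
    return $node === null ? null : trim($node->textContent);
}

$dom = loadUtf8Html('<html><head><title>Привет</title></head><body><h1>Мир</h1></body></html>');
echo firstTagText($dom, 'title'), ' / ', firstTagText($dom, 'h1'), PHP_EOL;
```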

handbat0 · Answer 3 · 2013-06-29T08:48:34

I ran into this: open the file in Notepad++, choose Encoding -> Convert to UTF-8, and the Russian text will then display correctly.

No, that doesn't work. The problem here is in new DOMDocument('1.0', 'UTF-8'); converting the file doesn't help. - armenka

Emil Sabitov · Answer 4 · 2013-06-29T09:19:30

If the server runs Linux, open the code in Notepad and convert the encoding to UTF-8 without BOM.

vano · Answer 5 · 2013-06-29T08:24:39

Try typing cp1251 instead of windows-1251; it should work. In general, UTF-8 almost always has problems, especially since UTF-8 has a lot of subspecies and only a few of them support Russian. Don't torment yourself: build the site in windows-1251 and everything will be fine. In the end, converting entire pages isn't worthwhile in terms of performance, and you would also have to worry with every newly connected document: "Will it work or not?"

Different pages are loaded every time. I only gave 1251 as an example; a page could just as well be in koi8 or anything else. My site is a kind of parser.
It didn't help for sites in UTF-8, while sites in windows-1251 load normally. It's a paradox: it is precisely UTF-8 pages that I can't load properly, even though my own site is in UTF-8 too.
"UTF-8 almost always has problems" is not true; "UTF-8 has a lot of subspecies" is not true, there is only one UTF-8; and "only a few support Russian" is nonsense.
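
Since there is only one UTF-8, a fetched body can be checked against it directly before trusting the declared charset. A minimal sketch, assuming mbstring is available; `looksLikeUtf8` is a hypothetical name.

```php
<?php
// Hypothetical helper: true when the byte string is well-formed UTF-8.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8   = 'Привет';                   // Russian text stored as UTF-8
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2"; // the same word in windows-1251

var_dump(looksLikeUtf8($utf8));   // well-formed UTF-8
var_dump(looksLikeUtf8($cp1251)); // these bytes are not valid UTF-8
```

A check like this cannot tell windows-1251 from koi8, so the charset from the server's response is still needed for everything that is not UTF-8.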
