My site is in UTF-8. The default encoding is set in .htaccess and in the page header. The site loads a page from the Internet and takes the encoding from the response headers, for example windows-1251. Then, using nokogiri.php, it parses the title and h1 from the page and displays them on my site's page. It turns out that if the remote site is in windows-1251, the text is displayed correctly, but if I parse a site in UTF-8, I get mojibake (garbled characters). Previously, with simple_html_dom.php, the encoding seemed to be fine, but it is terribly slow and eats memory.

I cannot understand why the encoding behaves this way.

So, I found out that nokogiri uses:

$dom = new DOMDocument('1.0', 'UTF-8' ); 

Here I tried changing UTF-8 to windows-1251 by hand. The most interesting thing is that if the remote site is in windows-1251, everything is fine, and if it is in UTF-8, everything fails, like this:

(a string of unreadable characters in place of the Cyrillic text)

Let me clarify the problem once more. My site loads other people's pages from the Internet, extracts the headings, takes the encoding from the server's response, and then converts the extracted data to UTF-8. So I cannot hard-code any particular conversion; I need a universal solution.

For some reason DOMDocument loads windows-1251 pages correctly, although my site is in UTF-8 (both the site and the database it writes to). If the page is in UTF-8, garbage ends up in the database (in iso8859-1 encoding, as a commenter already said).
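
For reference, the step "takes the encoding from the server's response" could look roughly like the sketch below. The function name `charsetFromContentType` and the fallback to UTF-8 when no charset is present are assumptions for illustration, not part of nokogiri.php.

```php
<?php
// Hypothetical helper: extracts the charset from a Content-Type header value,
// e.g. "text/html; charset=windows-1251".
function charsetFromContentType(string $contentType, string $default = 'UTF-8'): string
{
    // Accept an optional quote around the value and match case-insensitively.
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Assumption: fall back to the given default when the header has no charset.
    return $default;
}
```

The returned name can then be passed to iconv() as the source encoding.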

  • Your mojibake is in iso8859-1 encoding. If the problem is with only one string, then you can simply do: $str = '<garbled string>'; echo iconv('utf-8', 'iso8859-1', $str); // Protection against viruses, fighting viruses, life without computer viruses - Deonis
  • Is your parser running on Windows? Check that default_charset in php.ini is not set to cp1251. On this topic: devzone.zend.com/1538/php-dom-xml-extension-encoding-processing - zb '
  • This is data from the server, on sweb hosting - armenka
  • @armenka, can you not say clearly which OS is on the machine where this parser runs? I personally do not care what your hoster is called. - zb '
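
The effect described in the comments (UTF-8 bytes being read as iso8859-1) and the iconv() call that reverses it can be reproduced with a small sketch. The sample string is an assumption chosen for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Assumed sample text, stored as UTF-8 in this source file.
$original = 'Привет';

// Treat every UTF-8 byte as a separate ISO-8859-1 character and re-encode it
// as UTF-8. This is the kind of double encoding that produces the mojibake.
$garbled = iconv('ISO-8859-1', 'UTF-8', $original);

// The conversion suggested in the comment maps each character back to one byte.
$restored = iconv('UTF-8', 'ISO-8859-1', $garbled);

var_dump($restored === $original);
```

This only repairs text that has already been damaged; it does not fix the parsing step that causes the damage.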

5 answers

Try creating a .htaccess file and writing AddCharset UTF-8 in it.

  • Then it's better: AddDefaultCharset UTF-8 - Deonis

The problem may be caused by a missing meta tag with the encoding. At one point, a strange but effective solution helped me:

 $response = str_replace('<head>', '<head><meta http-equiv="content-type" content="text/html; charset=utf-8">', $response); 
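
A more general variant of the same idea, as a minimal sketch: convert the page to UTF-8 using the charset reported by the server, then put the meta tag in front of the markup so that it also works for pages without a `<head>` tag. The function name `extractTitleAndH1` is hypothetical. The sketch assumes the dom and iconv extensions are enabled and that the parser uses the first charset declaration it meets, so a later conflicting meta tag in the page is not applied.

```php
<?php
// Hypothetical helper (not part of nokogiri.php): returns the text of the first
// <title> and <h1> as UTF-8, given the raw page and the charset from the server.
function extractTitleAndH1(string $html, string $sourceCharset): array
{
    // Normalize everything to UTF-8 first, whatever the source charset is.
    if (strcasecmp($sourceCharset, 'UTF-8') !== 0) {
        $converted = iconv($sourceCharset, 'UTF-8', $html);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
        }
        $html = $converted;
    }

    // Real-world markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument('1.0', 'UTF-8');
    // Without a charset hint loadHTML() does not treat the input as UTF-8,
    // so the declaration is prepended to the markup.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach (['title', 'h1'] as $tag) {
        $node = $dom->getElementsByTagName($tag)->item(0);
        $result[$tag] = $node === null ? null : trim($node->textContent);
    }
    return $result;
}
```

Because the input is always converted to UTF-8 before parsing, the same call handles windows-1251, koi8-r, and UTF-8 pages, provided the charset name is one that iconv() recognizes.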

    I came across this: open Notepad++, choose Encoding -> Convert to UTF-8, and then the Russian text will be displayed.

    • No, that does not work. The problem here is in new DOMDocument('1.0', 'UTF-8'); converting the file does not help. - armenka

    If the server is on Linux, then open the code in Notepad and convert the encoding to UTF-8 without BOM.

      Try typing cp1251 instead of windows-1251; it should work. In general, UTF-8 almost always has problems, especially since UTF-8 has a lot of subspecies, and only a few of them support Russian. Do not torment yourself: create the website in windows-1251 and everything will be fine. After all, in terms of performance it is not sensible to convert entire pages and to worry with every newly included document: "Will it work or not?"

      • Meaning what? Different pages get loaded every time. I gave 1251 only as an example; a page may be in koi8 or anything else. My site is a kind of parser. Will it help? I will check right now. - armenka
      • Damn, I forgot to add: it did not help for sites in UTF-8, while sites in windows-1251 load normally. It is a paradox: the one thing I cannot load properly is a page in UTF-8, and my own site is in UTF-8 too. - armenka
      • @vano, a very stupid answer. “UTF-8 almost always has problems” is not true; “UTF-8 has a lot of subspecies” is not true, there is only one UTF-8; “only a few support Russian” is nonsense. - dzhioev
      • Thank you for the criticism, dzhioev, I really was wrong. And now, armenka, try using the utf8_encode() function. - vano
      • What is nonsense is not using UTF-8 in the 21st century - zippp