I wrote a parser for the planetestate site. When it pulls data from some tags on some pages, question marks appear instead of the expected text. Here is one of these pages. After parsing it, "???" is displayed instead of "uah", "?????" appears instead of the heading "To sell Zhitlovy of boks", and for some reason the data from the tables with detailed information, such as "KІMNAT 3 | ZAGALNA 87.4 m2 | ZhITLOVA 51.4 m2", is not extracted at all. The encoding breaks not only on Cyrillic characters but also on numbers (for example, in the price and credit fields). There are 221 such ads (with a broken encoding) out of 8873 ads in total. Below is an example of parsing the price of the object and the currency in which the price is given.

    function parseAd($html) {
        $dom = new DOMDocument();
        $dom->encoding = "UTF-8";
        @$dom->loadHTML($html);
        $objInfo = $dom->getElementById("objInfo");
        ...
        $strongObjs = $objBuy->getElementsByTagName("strong");
        $em = $strongObjs->item(0)->getElementsByTagName("em");
        if ($em->length > 0) {
            // currency
            $currency = str_replace(chr(0xC2).chr(0xA0), " ", trim(utf8_decode($em->item(0)->textContent)));
            $ads['currency'] = $currency.'.';
            $strongObjs->item(0)->removeChild($em->item(0));
        }
        if ($strongObjs->length > 0) {
            // price
            $price = str_replace(chr(0xC2).chr(0xA0), "", trim(utf8_decode($strongObjs->item(0)->textContent)));
            $ads['price'] = $price;
        }
        ...
        return $ads;
    }

I assume this may be related to the locale, language settings, or keyboard layout of the user who posted the ad ...

How can I solve this problem?

  • I forgot to write that I output the result using var_dump(...), print_r(...) and Excel (via phpexcel). - Sergey Sereda

1 answer

Try this:

 $dom->loadHTML(mb_convert_encoding($html,'HTML-ENTITIES','UTF-8')); 

After that, do not apply any further encoding conversions (such as utf8_decode) in the rest of the code.
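Some background on why this helps: DOMDocument::loadHTML() does not assume UTF-8 when the markup carries no charset declaration, and utf8_decode() converts to ISO-8859-1, which replaces every character it cannot represent (all Cyrillic) with "?". Below is a minimal sketch of the same idea applied to the price and currency parsing. It rests on several assumptions: the helper names loadUtf8Html and parsePrice and the sample markup are hypothetical, and mb_encode_numericentity() is swapped in for the 'HTML-ENTITIES' conversion because that mbstring encoding is deprecated as of PHP 8.2. Both approaches turn non-ASCII characters into numeric entities before the parser sees them.

```php
<?php
// Hypothetical helper: load UTF-8 HTML so that DOMDocument keeps non-ASCII text intact.
function loadUtf8Html(string $html): DOMDocument
{
    // Turn every character above U+007F into a numeric entity (&#1075; etc.),
    // so the parser only ever receives plain ASCII bytes.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of using @
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    return $dom;
}

// Hypothetical, trimmed-down version of the price/currency part of parseAd().
function parsePrice(string $html): array
{
    $ads = [];
    $dom = loadUtf8Html($html);

    $strong = $dom->getElementsByTagName('strong')->item(0);
    if ($strong === null) {
        return $ads;
    }

    $nbsp = "\u{00A0}";                 // non-breaking space, bytes C2 A0 in UTF-8
    $em = $strong->getElementsByTagName('em')->item(0);
    if ($em !== null) {
        // currency: textContent is already UTF-8, so no utf8_decode() here
        $ads['currency'] = trim(str_replace($nbsp, ' ', $em->textContent)) . '.';
        $strong->removeChild($em);
    }

    // price: what remains in <strong> once <em> has been removed
    $ads['price'] = trim(str_replace($nbsp, '', $strong->textContent));
    return $ads;
}

// Made-up sample markup, only for illustration.
$sample = '<div id="objInfo"><strong>85&nbsp;000 <em>грн</em></strong></div>';
$ads = parsePrice($sample);
print_r($ads);
```

The values read from textContent are UTF-8 strings, so they can be passed to var_dump(), print_r() or a spreadsheet writer without any further conversion.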

  • It helped. Thanks! - Sergey Sereda