Please help me solve this.

1) var_dump('%EF%BB%BF'); // string(3) ""
2) var_dump('%C2%A0'); // string(2) " "

The first looks like an empty string, so how can var_dump report string(3) ""?

The only explanation I can think of is Unicode: these appear to be invisible UTF-8 characters.

The simplest option:

 var_dump(explode('%', 'aaa%EF%BB%BF')[0]); 

But how should this be done properly?

If that is the case, we have to take into account that this character may be encoded differently in another encoding. How can such characters be stripped from the string?

3 answers

%EF%BB%BF - the BOM (byte order mark), U+FEFF encoded in UTF-8.
%C2%A0 - the non-breaking space, U+00A0 encoded in UTF-8.

This percent-encoded form is how you see them in a URL. In PHP they arrive as raw bytes:

  • %EF%BB%BF => pack("CCC", 0xef, 0xbb, 0xbf)
  • %C2%A0 => pack("CC", 0xc2, 0xa0)

Or, more simply:
urldecode('%C2%A0')

That is the form in which they need to be checked.
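
To see what actually arrives in PHP, it can help to decode the percent-encoded forms and inspect the raw bytes. A minimal sketch, using only `urldecode`, `strlen`, `bin2hex` and `pack`; the variable names are made up for illustration:

```php
<?php
// Decode the percent-encoded forms into the raw bytes PHP works with.
$bom  = urldecode('%EF%BB%BF'); // UTF-8 byte order mark
$nbsp = urldecode('%C2%A0');    // UTF-8 non-breaking space

// strlen() counts bytes, which is why var_dump() reports string(3) and
// string(2) even though nothing visible is printed.
var_dump(strlen($bom));   // int(3)
var_dump(bin2hex($bom));  // string(6) "efbbbf"
var_dump(strlen($nbsp));  // int(2)
var_dump(bin2hex($nbsp)); // string(4) "c2a0"

// The pack() form produces exactly the same bytes.
var_dump($bom === pack('CCC', 0xef, 0xbb, 0xbf)); // bool(true)
```

The percent-encoded literal itself, without `urldecode`, is just nine visible ASCII characters, so none of the patterns for invisible characters will match it.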

In your example, all the characters are written out as printable text in the source string, so there is nothing invisible in it:
var_dump(explode('%', 'aaa%EF%BB%BF')[0]);

You can remove the BOM like this:

    function removeBOM($str = "")
    {
        // Strip the three BOM bytes only if the string starts with them
        if (substr($str, 0, 3) === pack("CCC", 0xef, 0xbb, 0xbf)) {
            $str = substr($str, 3);
        }
        return $str;
    }

And the non-breaking space like this:

 $str = preg_replace('/\xA0/u', '', 'A'.pack("CC",0xc2,0xa0).'B'); 

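Delete all non-printable characters (without the /u modifier this works byte by byte and also removes every non-ASCII byte, so it is only safe for ASCII text):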

 $str = preg_replace('/[^[:print:]]/', '', $str); 
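
For UTF-8 text that contains non-ASCII letters, a Unicode-aware pattern may be a safer choice than the byte-wise POSIX class. The sketch below is one possible variant and not part of the original answer: it assumes valid UTF-8 input and uses the PCRE property classes `\p{Cc}` (control) and `\p{Cf}` (format, which includes U+FEFF) plus U+00A0. The function name `stripInvisible` is made up for illustration.

```php
<?php
// Hypothetical helper: remove control characters, format characters
// (including the BOM, U+FEFF) and the no-break space from a UTF-8 string.
function stripInvisible(string $str): string
{
    // With /u the pattern works on code points, not on single bytes.
    $clean = preg_replace('/[\p{Cc}\p{Cf}\x{A0}]/u', '', $str);

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input;
    // in that case this sketch returns the input unchanged.
    return $clean ?? $str;
}

// BOM + "Café" + no-break space + "Word", given in percent-encoded form.
$input = urldecode('%EF%BB%BFCaf%C3%A9%C2%A0Word');

// The accented letter survives; only the invisible characters are removed.
var_dump(bin2hex(stripInvisible($input))); // string(18) "436166c3a9576f7264"

// The byte-wise POSIX class also drops the two bytes of the accented letter.
var_dump(preg_replace('/[^[:print:]]/', '', $input));
```

Whether the no-break space should be deleted or replaced with an ordinary space depends on the use case; the sketch deletes it to match the examples in this answer.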

Demo

 var_dump(preg_replace('/\xA0/u', '', urldecode("Word%C2%A0Word"))); // WordWord
 var_dump(preg_replace('/[^[:print:]]/', '', urldecode("%EF%BB%BFWord%C2%A0Word"))); // WordWord
  • I agree about the BOM, but C2A0 looks like the Hangul syllable SYUK, U+C2A0. P.S. The syllable would be %00%C2%00%A0 in UTF-8, or %00%A0 for the Unicode non-breaking space. - Mike V.
  • That is in UTF-16BE. For No-Break Space, U+00A0, see the "Encodings" section, the UTF-8 row. - vp_arth
  • It does not remove anything: var_dump(preg_replace('/\xA0/u', '', '%EF%BB%BF')); = string(9) "%EF%BB%BF" - user216109
  • First of all, a nine-byte string is not at all what was in the question. Secondly, this regular expression is for removing the non-breaking space. - vp_arth
  • @Sergey, read the highlighted text in the answer - vp_arth

Translation of this answer

7-bit ASCII?

If you have suddenly found yourself in 1963 and only want 7-bit ASCII printable characters, all you need to do is delete every character in the code ranges 0-31 and 127-255:

 $string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string); 

8-bit extended ASCII?

You did not like 1963, moved on to the eighties and ran into 8-bit extended ASCII, in which characters 128-255 are ordinary displayable characters. Then you only need to adjust the pattern slightly and delete characters 0-31 and 127:

 $string = preg_replace('/[\x00-\x1F\x7F]/', '', $string); 

UTF-8?

Welcome to the 21st century! If your string is UTF-8, you will have to use the /u modifier:

 $string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string); 

This simply deletes characters 0-31 and 127. The construction works for both ASCII and UTF-8, since they share the same range of control characters. Strictly speaking, it would work without /u as well, but the modifier makes life easier if you need to delete any other characters...

If you are dealing with Unicode, there are many non-printable characters in it, but let's consider the one most frequently encountered: NO-BREAK SPACE (U+00A0).

In a UTF-8 string it is encoded as the two bytes 0xC2 0xA0. You could search for and delete that byte sequence, but with the /u modifier in place you can simply add \xA0 to the character class:

 $string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string); 
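
Why the /u modifier matters once a byte above 0x7F appears in the pattern can be seen on a small example. This is an illustrative sketch that assumes UTF-8 input; the sample string is made up. Without /u, `\xA0` matches the single byte 0xA0, which is also the second byte of some two-byte characters, such as the Cyrillic capital letter U+0420 (bytes 0xD0 0xA0).

```php
<?php
// Sample: Cyrillic capital ER (D0 A0), NO-BREAK SPACE (C2 A0), Cyrillic capital BE (D0 91).
$string = "\u{0420}\u{00A0}\u{0411}";

// With /u, \xA0 means the code point U+00A0, so only the space is removed.
$unicodeAware = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
var_dump(bin2hex($unicodeAware)); // string(8) "d0a0d091"

// Without /u, \xA0 means the byte 0xA0, so the first letter loses its second byte.
$byteWise = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $string);
var_dump(bin2hex($byteWise)); // string(8) "d0c2d091"

// An empty pattern with /u is a cheap validity check: it fails on broken UTF-8.
var_dump(preg_match('//u', $unicodeAware) === 1); // bool(true)
var_dump(preg_match('//u', $byteWise) === 1);     // bool(false)
```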

Bonus: what about str_replace?

preg_replace is quite efficient, but if you need to process a large amount of text, you might expect str_replace with an array of characters to be more efficient:

    // Define the array that will be reused in all replacement operations
    $badchar = array(
        // Control characters
        chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7),
        chr(8), chr(9), chr(10), chr(11), chr(12), chr(13), chr(14), chr(15),
        chr(16), chr(17), chr(18), chr(19), chr(20), chr(21), chr(22), chr(23),
        chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30), chr(31),
        // Non-printable character (DEL)
        chr(127)
    );

    // Remove the unwanted characters
    $str2 = str_replace($badchar, '', $str);

Intuitively, this approach seems like it should be much faster, but let's test it. Let's create a set of test strings of various lengths and contents and measure the speed (PHP 7.0.12 was used):

  Length (chars)   str_replace   preg_replace   Result
  2                5.3439ms      2.9919ms       preg_replace is 44.01% faster
  4                6.0701ms      1.4119ms       preg_replace is 76.74% faster
  8                5.8119ms      2.0721ms       preg_replace is 64.35% faster
  16               6.0401ms      2.1980ms       preg_replace is 63.61% faster
  32               6.0320ms      2.6770ms       preg_replace is 55.62% faster
  64               7.4198ms      4.4160ms       preg_replace is 40.48% faster
  128              12.7239ms     7.5412ms       preg_replace is 40.73% faster
  256              19.8820ms     17.1330ms      preg_replace is 13.83% faster
  512              34.3399ms     34.0221ms      preg_replace is 0.93% faster
  1024             57.1141ms     67.0300ms      str_replace is 14.79% faster
  2048             94.7111ms     123.3189ms     str_replace is 23.20% faster
  4096             227.7029ms    258.3771ms     str_replace is 11.87% faster
  8192             506.3410ms    555.6269ms     str_replace is 8.87% faster
  16384            1116.8811ms   1098.0589ms    preg_replace is 1.69% faster
  32768            2299.3128ms   2222.8632ms    preg_replace is 3.32% faster

The timings are for 10,000 iterations each. The relative differences are the interesting part. For strings of up to 512 characters, preg_replace always wins, by a wide margin on the shortest ones. In the 1-8 KB range str_replace has a modest edge, and above that the two are practically even.

An interesting result, isn't it? In any case, you should not rely entirely on my tests, since the outcome may be exactly the opposite on your specific data.
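
Since the numbers depend on the data and the environment, it may be worth repeating the measurement locally. The harness below is only a rough sketch and is not the code that produced the table above: the helper name `timeIt`, the sample data and the iteration count are assumptions chosen for illustration.

```php
<?php
// Characters to strip: control codes 0-31 plus DEL (127).
$badchar = array_map('chr', array_merge(range(0, 31), [127]));

// Hypothetical helper: run $fn $iterations times and return elapsed milliseconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000;
}

// Made-up sample: text with a tab and a line feed, repeated to 1024 bytes.
$sample = str_repeat("abc\tdef\n", 128);

// Sanity check first: both approaches must produce the same output.
$viaStr  = str_replace($badchar, '', $sample);
$viaPreg = preg_replace('/[\x00-\x1F\x7F]/', '', $sample);
var_dump($viaStr === $viaPreg); // bool(true)

$strMs = timeIt(function () use ($badchar, $sample) {
    str_replace($badchar, '', $sample);
}, 10000);
$pregMs = timeIt(function () use ($sample) {
    preg_replace('/[\x00-\x1F\x7F]/', '', $sample);
}, 10000);

printf("str_replace: %.4fms, preg_replace: %.4fms\n", $strMs, $pregMs);
```

Changing the length and contents of `$sample` is the quickest way to see how the balance shifts on data that resembles your own.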

  • Have you tried these examples like this: var_dump(preg_replace('/[\x00-\x1F\x7F]/', '', '%EF%BB%BF')); or any other? Nothing happens for me :( - user216109
  • The question there is, to put it mildly, a different one. Maybe it is worth translating that question as well and moving the answer there? - vp_arth
  • 1) preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', urldecode($uri)), 2) preg_replace('/[\x00-\x1F\x7F]/u', '', urldecode($uri)). The characters are not removed, but the question was how to remove them. Maybe I have misunderstood everything; I am trying to figure it out right now. - user216109
  • @vp_arth Well, the original version of the question was about removing these characters. And I simply liked the answer a lot, so I decided to translate it :) - rjhdby
  • The answer is good and useful. I am only saying that it may need a more relevant question, which is easy to create. - vp_arth

Actually, take a look at the intl extension.

It has, for example:

IntlChar::isprint - checks whether a character is printable
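
A minimal sketch of how this could be used, assuming the intl extension is loaded and the input is valid UTF-8; the function name `keepPrintable` is hypothetical. Note that ICU treats U+00A0 as printable (it is a space separator, not a control or format character), so this approach removes the BOM and control characters but keeps the non-breaking space.

```php
<?php
// Hypothetical helper: keep only the characters that ICU considers printable.
function keepPrintable(string $str): string
{
    // Split into UTF-8 characters; with /u the empty pattern splits per code point.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // invalid UTF-8: leave the input untouched in this sketch
    }

    $printable = array_filter($chars, function (string $char): bool {
        return IntlChar::isprint($char) === true;
    });

    return implode('', $printable);
}

if (extension_loaded('intl')) {
    // BOM + "Word" + NUL byte + "Word"
    $input = urldecode('%EF%BB%BFWord%00Word');
    var_dump(keepPrintable($input)); // string(8) "WordWord"
}
```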