I work with UTF-8. Reference books (Jeffrey Friedl, Regular Expressions, 2008; Koterov, PHP 7, 2016) mention that the u modifier is used to work with this encoding. Yet practice, and many regular expression examples here on SO, show that regular expressions in PHP work without this modifier.

So when should the u modifier be used, and when is it unnecessary?

  • The additional functionality is spelled out after the colon: the pattern and the subject string are treated as UTF-8 strings - Grundy
  • It would be nice to have links to questions that show regular expressions working with UTF-8 without the u modifier - Grundy
  • @Grundy apparently I was wrong, I really can't find any specific examples that mention UTF. Still, it is strange: the regular expressions in my project have no such modifier, and they work. - Jean-Claude

2 answers

Whether the u modifier is needed depends on what the regular expression is for and on your skill in writing it.

Here is an example in which I split a UTF-8 string into two parts, the first character and all the rest:

    <?php
    $str = 'абвгд';

    // with the modifier
    preg_match('%^(.)(.+)$%u', $str, $matches);
    var_dump($matches);

    // without the modifier
    preg_match('%^(.)(.+)$%', $str, $matches);
    var_dump($matches);

Result of work:

 array(3) { [0]=> string(10) "Π°Π±Π²Π³Π΄" [1]=> string(2) "Π°" [2]=> string(8) "Π±Π²Π³Π΄" } array(3) { [0]=> string(10) "Π°Π±Π²Π³Π΄" [1]=> string(1) " " [2]=> string(9) " Π±Π²Π³Π΄" } 

Here the regular expression without the modifier gave a wrong result: without u, the . metacharacter matches one byte (not one character), namely any byte except \x0A (newline).
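To make the byte-versus-character difference concrete, here is a minimal sketch that counts what . matches with and without the modifier. It assumes the source file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
$str = 'абвгд'; // 5 characters, 10 bytes in UTF-8

// Without u: . consumes one byte per match
$bytes = preg_match_all('%.%', $str, $m1);

// With u: . consumes one whole UTF-8 character per match
$chars = preg_match_all('%.%u', $str, $m2);

var_dump($bytes);            // int(10)
var_dump($chars);            // int(5)
var_dump(strlen($m1[0][0])); // int(1): a lone byte, not a valid character
var_dump(strlen($m2[0][0])); // int(2): the complete character "а"
```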


Now another example: extracting the substring between two brackets:

    <?php
    $str = 'прг[абвгд]ктм';

    // with the modifier
    preg_match('%\[([^\]]*)\]%u', $str, $matches);
    var_dump($matches);

    // without the modifier
    preg_match('%\[([^\]]*)\]%', $str, $matches);
    var_dump($matches);

Result:

    array(2) {
      [0]=>
      string(12) "[абвгд]"
      [1]=>
      string(10) "абвгд"
    }
    array(2) {
      [0]=>
      string(12) "[абвгд]"
      [1]=>
      string(10) "абвгд"
    }

Both variants work correctly, because the characters [ and ] are encoded unambiguously in UTF-8 and their byte values never occur inside a multibyte character.
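The claim about byte values can be checked directly. The sketch below (assuming a UTF-8 source file, with illustrative variable names) lists the distinct bytes of the test string and keeps only those below 0x80; every byte that belongs to a Cyrillic character is 0x80 or higher, so only the brackets remain.

```php
<?php
$str = 'прг[абвгд]ктм';

// Distinct byte values that occur in the string
$bytes = array_unique(array_map('ord', str_split($str)));

// Keep only ASCII bytes (below 0x80)
$ascii = array_filter($bytes, function ($b) {
    return $b < 0x80;
});

$asciiChars = array_map('chr', array_values($ascii));
var_dump($asciiChars); // only "[" and "]" are left
```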

UPD

Regarding "depends on your skill": could the result in the first case have been obtained without the modifier?

    <?php
    $str = 'abcde';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

    $str = 'абвгд';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

    $str = 'Ⴀабвгд';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

Result:

    array(3) {
      [0]=>
      string(5) "abcde"
      [1]=>
      string(1) "a"
      [2]=>
      string(4) "bcde"
    }
    array(3) {
      [0]=>
      string(10) "абвгд"
      [1]=>
      string(2) "а"
      [2]=>
      string(8) "бвгд"
    }
    array(3) {
      [0]=>
      string(13) "Ⴀабвгд"
      [1]=>
      string(3) "Ⴀ"
      [2]=>
      string(10) "абвгд"
    }

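As a rough cross-check, the byte-level pattern from the update can be compared with the u version on the same three strings. This is only a sketch, assuming a UTF-8 source file; $manual and $withU are illustrative names. It also shows one limit of the hand-written pattern: it lists only 1-, 2- and 3-byte sequences, so a 4-byte character (here U+1F600, written as raw bytes) is not matched by it.

```php
<?php
// Byte-level pattern from the update (1- to 3-byte sequences only)
$manual = '%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%';
$withU  = '%^(.)(.+)$%u';

$same = [];
foreach (['abcde', 'абвгд', 'Ⴀабвгд'] as $str) {
    preg_match($manual, $str, $a);
    preg_match($withU, $str, $b);
    $same[] = ($a === $b); // identical captures for each string
}
var_dump($same); // three times bool(true)

// A 4-byte UTF-8 character followed by "x"
$emoji = "\xF0\x9F\x98\x80x";
var_dump(preg_match($manual, $emoji)); // int(0): lead byte \xF0 is not covered
var_dump(preg_match($withU, $emoji));  // int(1)
```
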
  • "Depends on your skill": so could the result in the first case have been obtained without the modifier? - Jean-Claude
  • @Jean-Claude, see the update to the answer. - Visman
  • Got it, thank you for the detailed answer. Using character codes is an excellent perversion)) - Jean-Claude

Use the u modifier when the string contains Unicode, that is, non-ASCII characters.
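One possible way to test that criterion is to look for any byte outside the ASCII range. The helper name hasNonAscii below is hypothetical, not part of PHP, and the sketch assumes a UTF-8 source file.

```php
<?php
// Hypothetical helper: true when the string contains at least one non-ASCII byte
function hasNonAscii(string $s): bool
{
    // No u modifier needed here: the class works on raw bytes
    return preg_match('%[^\x00-\x7F]%', $s) === 1;
}

var_dump(hasNonAscii('abcde')); // bool(false)
var_dump(hasNonAscii('абвгд')); // bool(true)
```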