PCRE - when to use the 'u' modifier?

Question

I work with UTF-8 encoding. Reference books (Jeffrey Friedl, Regular Expressions, 2008; Koterov, PHP 7, 2016) say that the u modifier is used to work with this encoding. But as practice and many regular expression examples here on SO show, regular expressions in PHP work without this modifier.

So when should the u modifier be used, and when is it unnecessary?

The additional functionality is described after the colon: the pattern and the subject string are treated as UTF-8 strings.
It would be nice to add links to questions that show regular expressions working with UTF-8 without the u modifier.
@Grundy apparently I was wrong; I really can't find any specific examples that mention UTF.
But it is strange: the regular expressions in my project have no such modifier, and they work.

Accepted Answer · 2016-11-13T14:54:52

Whether you need the u modifier depends on what the regular expression is for and on your skill in writing it.

Here is an example in which I divide a string (UTF-8) into two parts: the first character and all the others:

    <?php
    $str = 'абвгд';

    // with the modifier
    preg_match('%^(.)(.+)$%u', $str, $matches);
    var_dump($matches);

    // without the modifier
    preg_match('%^(.)(.+)$%', $str, $matches);
    var_dump($matches);

Output:

    array(3) {
      [0]=>
      string(10) "абвгд"
      [1]=>
      string(2) "а"
      [2]=>
      string(8) "бвгд"
    }
    array(3) {
      [0]=>
      string(10) "абвгд"
      [1]=>
      string(1) "�"
      [2]=>
      string(9) "�бвгд"
    }

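Here the regular expression without the modifier gives the wrong result: without the u modifier, . matches one byte (not one character), except the newline byte, which by default is 0x0A. The letter а is encoded in UTF-8 as the two bytes D0 B0, so the first group captures only the byte D0 and the second group starts with the stray byte B0.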
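To make the byte versus character difference visible, here is a minimal sketch that counts what . matches in both modes. It assumes the source file is saved as UTF-8; the variable names are illustrative only.

```php
<?php
$str = 'абвгд'; // 5 characters, 10 bytes in UTF-8 (assumes a UTF-8 source file)

// With u: "." consumes one whole character per match.
$withU = preg_match_all('%.%u', $str, $m1);

// Without u: "." consumes one byte per match.
$withoutU = preg_match_all('%.%', $str, $m2);

var_dump($withU);              // int(5)
var_dump($withoutU);           // int(10)
var_dump(bin2hex($m2[0][0]));  // string(2) "d0", the first byte of "а"
```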

And now another example: get a substring between two brackets:

    <?php
    $str = 'прг[абвгд]ктм';

    // with the modifier
    preg_match('%\[([^\]]*)\]%u', $str, $matches);
    var_dump($matches);

    // without the modifier
    preg_match('%\[([^\]]*)\]%', $str, $matches);
    var_dump($matches);

Output:

    array(2) {
      [0]=>
      string(12) "[абвгд]"
      [1]=>
      string(10) "абвгд"
    }
    array(2) {
      [0]=>
      string(12) "[абвгд]"
      [1]=>
      string(10) "абвгд"
    }

Both variants work correctly, because [ and ] are single-byte ASCII characters in UTF-8, and their byte values never occur inside a multibyte character.
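The opposite case is a pattern that itself contains a non-ASCII character under a quantifier. The following sketch is not from the original answer; it assumes a UTF-8 source file and only illustrates that, without u, the quantifier applies to the last byte of the letter rather than to the whole letter.

```php
<?php
$str = 'ааа'; // three Cyrillic letters, bytes D0 B0 D0 B0 D0 B0

// With u: "+" repeats the whole letter "а".
$withU = preg_match('%^а+$%u', $str);

// Without u: the pattern is byte D0 followed by one or more B0 bytes,
// which cannot match the alternating D0 B0 D0 B0 D0 B0 sequence.
$withoutU = preg_match('%^а+$%', $str);

var_dump($withU);    // int(1)
var_dump($withoutU); // int(0)
```

The same applies to character classes such as [аб]: without u they are classes of individual bytes, not of letters.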

UPD

So, depending on skill in writing them, could the result in the first case also be obtained without the modifier?

    <?php
    // The first group matches one UTF-8 character by hand:
    // a 1-byte, 2-byte or 3-byte sequence.

    $str = 'abcde';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

    $str = 'абвгд';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

    $str = 'Ⴀабвгд';
    // without the modifier
    preg_match('%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%', $str, $matches);
    var_dump($matches);

Output:

    array(3) {
      [0]=>
      string(5) "abcde"
      [1]=>
      string(1) "a"
      [2]=>
      string(4) "bcde"
    }
    array(3) {
      [0]=>
      string(10) "абвгд"
      [1]=>
      string(2) "а"
      [2]=>
      string(8) "бвгд"
    }
    array(3) {
      [0]=>
      string(13) "Ⴀабвгд"
      [1]=>
      string(3) "Ⴀ"
      [2]=>
      string(10) "абвгд"
    }
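Note that the hand-written byte classes cover only 1-byte to 3-byte sequences. As a hedged extension (my addition, not part of the original answer), a fourth alternative for 4-byte sequences could look like the sketch below. The lead-byte range F0-F7 is kept deliberately loose, in the same spirit as the original classes, and the "\u{1F600}" escape assumes PHP 7.0 or later.

```php
<?php
$str = "\u{1F600}abc"; // U+1F600 is encoded as 4 bytes: F0 9F 98 80

// Pattern from the answer: 1-, 2- and 3-byte sequences only.
$three = '%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2})(.+)$%';

// Same pattern with an added alternative for 4-byte sequences (assumption).
$four = '%^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}'
      . '|[\xF0-\xF7][\x80-\xBF]{3})(.+)$%';

$matchedThree = preg_match($three, $str, $m3);
$matchedFour  = preg_match($four, $str, $m4);

var_dump($matchedThree);   // int(0): the lead byte F0 fits none of the classes
var_dump($matchedFour);    // int(1)
var_dump(strlen($m4[1]));  // int(4)
var_dump($m4[2]);          // string(3) "abc"
```

With the u modifier a plain . handles all of these cases, which is the practical argument for using it whenever a pattern has to consume whole characters.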


hardworm · Answer 2 · 2016-11-13T14:54:35

Use it when the string will contain Unicode, that is, non-ASCII characters.
