Problems replacing words with a pattern and preg_replace_callback()

Question

This PHP code translates text from Ukrainian into Russian and vice versa by replacing the words of one language with the corresponding words of the other:

    $oldstring = 'Мова, з якої здійснюється переклад';
    $words = array(
        array('ua'=>'з', 'ru'=>'с'),
        array('ua'=>'здійснюється', 'ru'=>'осуществляется'),
        array('ua'=>'мова', 'ru'=>'язык'),
        array('ua'=>'переклад', 'ru'=>'перевод'),
        array('ua'=>'якої', 'ru'=>'которой')
    );
    foreach ($words as $row) {
        $fndrep[$row['ua']] = $row['ru'];
    }
    $pattern = '~(?=([\x{0410}-\x{042F}]?)([\x{0430}-\x{044F}]?))\b(?i)(?:' . implode('|', array_keys($fndrep)) . ')\b~u';
    $newstring = preg_replace_callback($pattern, function ($m) use ($fndrep) {
        mb_internal_encoding('UTF-8');
        $lowm = $fndrep[mb_strtolower($m[0])];
        if ($m[1])
            return ($m[2]) ? mb_strtoupper(mb_substr($lowm, 0, 1)) . mb_substr(mb_convert_case($lowm, MB_CASE_LOWER), 1, mb_strlen($lowm)) : mb_strtoupper($lowm);
        else
            return $lowm;
    }, $oldstring);
    echo $newstring; // outputs "Язык, с которой осуществляется перевод"

The code works, but a number of problems remain:

1. Most importantly: although text can generally be translated word for word between Ukrainian and Russian, there are of course many cases where the context has to be taken into account, and the output needs a completely different word or at least a different grammatical form of the word.
   Take the example from the code above: the Ukrainian word мова is feminine, while the corresponding Russian word is masculine, so the output is "Язык, с которой осуществляется перевод" (it should be "Язык, с которого ...").
   Hence the question: what should be changed in the code so that the word base can hold both single words (for example, мова|язык) and whole expressions (for example, мова з якої|язык с которого)? If the code finds a whole expression rather than just a single word, it should use that expression when translating.
2. preg_replace_callback() in the code above removes the need to store the same word in the database twice, capitalized and lowercase: the translated word is output in the same letter case as in the source text. But there are glitches: sometimes a word is output entirely in CAPITAL letters, although in the source text only its first letter is a capital.
3. The code does not understand hyphenated words. For example, it splits the Ukrainian word будь-який into two parts and translates the Ukrainian word який into Russian as который, so we get будь-который, although the word should be translated as a whole.
4. There is a problem with the Ukrainian letter і (it corresponds to the Russian и): if a word in the source text begins with a capital І, the translated output for some reason starts with a lowercase и (this affects all words that start with this letter: internet, information, etc.).
5. The code does not understand abbreviations with a dot, for example т.д., т.е. or similar. (A solution has to take into account that ordinary words at the end of a sentence must still be translated as whole words without the dot.)
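For the hyphen and abbreviation problems in the list above, one possible direction is to stop relying on \b and state explicitly what may stand next to a match. The following is only a minimal sketch and not code from the thread: the helper name translate_words() and the dictionary entries are made up for illustration, and letter case is ignored to keep it short.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper, not part of the code in the question.
function translate_words($text, array $dict)
{
    // preg_quote() keeps the dot and the hyphen in dictionary keys literal.
    $alternatives = implode('|', array_map(function ($key) {
        return preg_quote($key, '~');
    }, array_keys($dict)));

    // Instead of \b: a match may not touch a letter, a digit or a hyphen
    // on either side, so 'який' is not matched inside 'будь-який'.
    $pattern = '~(?<![\p{L}\p{N}-])(?:' . $alternatives . ')(?![\p{L}\p{N}-])~iu';

    return preg_replace_callback($pattern, function ($m) use ($dict) {
        // Case handling is left out here; the result is always lower case.
        return $dict[mb_strtolower($m[0])];
    }, $text);
}

// Illustrative entries only.
$dict = array(
    'будь-який' => 'любой',
    'який'      => 'который',
    'переклад'  => 'перевод',
    'та'        => 'и',
    'ін.'       => 'др.',
);

echo translate_words('будь-який переклад та ін.', $dict), "\n"; // любой перевод и др.
echo translate_words('який переклад.', $dict), "\n";            // который перевод.
```

In the second call the final dot is not part of any dictionary key, so the ordinary word before it is still translated as a whole word and the dot is left in place.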

@cyadvert, thanks for the "commas", but this question is for programmers, not philologists :)
@php5engineer, the Google Translate API does not produce HTML pages that search engines can index (or at least I don't know how to achieve that), and besides, the Google Translate API is a paid service.
Correcting punctuation is not my whim but a requirement of this site: all questions must be properly written ... Besides, punctuation is not specific to philology.
But that's fine ... As for programming: you have raised a difficult question.
In fact, such things are not usually written in a web programming language ... At least tell me, is your project limited to some subject area?
To begin with, the "project" will be limited by my limited knowledge of PHP and by the words that are in the database.
But the subject area seems irrelevant to solving the problems above.

Visman · Accepted Answer · 2015-09-26T08:36:43

Here is a slightly corrected algorithm with an extended test.

    mb_internal_encoding('UTF-8');
    $oldstring = 'Мова, з якої здійснюється переклад. І привіт вам Іван ІВАНОВИЧ.';
    $words = array(
        array('ua'=>'мова, з якої', 'ru'=>'язык, с которого'),
        array('ua'=>'і', 'ru'=>'и'),
        array('ua'=>'з', 'ru'=>'с'),
        array('ua'=>'здійснюється', 'ru'=>'осуществляется'),
        array('ua'=>'мова', 'ru'=>'язык'),
        array('ua'=>'переклад', 'ru'=>'перевод'),
        array('ua'=>'якої', 'ru'=>'которой'),
        array('ua'=>'привіт', 'ru'=>'привет'),
        array('ua'=>'іван', 'ru'=>'иван'),
        array('ua'=>'іванович', 'ru'=>'иванович'),
    );
    foreach ($words as $row) {
        $fndrep[$row['ua']] = $row['ru'];
    }
    $pattern = '~\b(?=([\x{0400}-\x{042F}]?)([\x{0400}-\x{042F}]?))(?i)(?:' . implode('|', array_keys($fndrep)) . ')(?:\b|\s|$)~u';
    $newstring = preg_replace_callback($pattern, function ($m) use ($fndrep) {
        //echo "<pre>\n";
        //var_dump($m);
        //echo "</pre>\n";
        $lowm = $fndrep[mb_strtolower($m[0])];
        if ($m[1]) {
            return ($m[2]) ? mb_strtoupper($lowm) : mb_strtoupper(mb_substr($lowm, 0, 1)) . mb_strtolower(mb_substr($lowm, 1));
        }
        else
            return $lowm;
    }, $oldstring);
    echo $newstring; // outputs "Язык, с которого осуществляется перевод. И привет вам Иван ИВАНОВИЧ."

Just one question: the problem of a capital letter on a word inside a phrase is not solved, right?
For до Вашої уваги the Russian output is к вашему вниманию (вашему with a lowercase letter).
@stckvrw, yes. If the capital letter is not on the first word of the phrase, it becomes lowercase.
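One way the capital letter inside a phrase might be preserved is to copy letter case word by word whenever the source phrase and its translation have the same number of words. This is only a sketch under that assumption, not part of the accepted answer: copy_word_case() and copy_phrase_case() are hypothetical helpers, and phrases with differing word counts simply fall back to treating the phrase as one word.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: give $target the letter case of the single word $source.
function copy_word_case($source, $target)
{
    $target = mb_strtolower($target);
    // Two or more characters, all upper case: ІВАНОВИЧ -> ИВАНОВИЧ.
    if (mb_strlen($source) > 1
        && $source === mb_strtoupper($source)
        && $source !== mb_strtolower($source)) {
        return mb_strtoupper($target);
    }
    // \p{Lu} matches any upper-case letter, including І, Ї, Є and Ґ.
    if (preg_match('~^\p{Lu}~u', $source)) {
        return mb_strtoupper(mb_substr($target, 0, 1)) . mb_substr($target, 1);
    }
    return $target;
}

// Hypothetical helper: copy case word by word when the word counts agree.
function copy_phrase_case($source, $target)
{
    $src = preg_split('~\s+~u', trim($source));
    // Keep the separators, so words end up at the even indexes of $dst.
    $dst = preg_split('~(\s+)~u', trim($target), -1, PREG_SPLIT_DELIM_CAPTURE);
    $word_count = (int) ((count($dst) + 1) / 2);

    if (count($src) !== $word_count) {
        // Fallback: treat the whole phrase as a single word.
        return copy_word_case($source, $target);
    }
    foreach ($src as $i => $word) {
        $dst[2 * $i] = copy_word_case($word, $dst[2 * $i]);
    }
    return implode('', $dst);
}

echo copy_phrase_case('до Вашої уваги', 'к вашему вниманию'), "\n"; // к Вашему вниманию
echo copy_phrase_case('Іван ІВАНОВИЧ', 'иван иванович'), "\n";      // Иван ИВАНОВИЧ
```

Inside the callback this would presumably be called as copy_phrase_case($m[0], $fndrep[mb_strtolower($m[0])]), which would also make the two capture groups in the lookahead unnecessary.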

Answer 2 · 2015-09-26T07:14:30

The problem raised is very serious and hard to solve with web programming tools (PHP), unless you strictly limit the size of the dictionary and the subject matter.
In addition to the problems listed by the author, there is another, more complex one, related to the similarity of the languages and alphabets.

What if a word in Ukrainian is spelled the same as a different word in Russian?
For example, рожа in Ukrainian means роза.
You should understand that PHP will replace the words in the phrase sequentially.
So, let's translate "А роза упала на лапу Азора" from Russian into Ukrainian. When the code runs through the dictionary and replaces words and phrases, first "роза" becomes "рожа", and then the Russian "рожа" is replaced with the Ukrainian "пика". And, by the way, the Russian "пика" will in turn be replaced by its Ukrainian equivalent. In the end the phrase completely loses its meaning.
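Whether this chain reaction happens depends on how the replacement is performed. The sketch below is an illustration added here, with a two-entry dictionary taken from the example above: a loop of str_replace() calls feeds each result into the next replacement, while strtr() with an array goes over the input once and never rescans text it has already replaced.

```php
<?php
// Illustrative Russian -> Ukrainian entries from the example above.
$ru_to_ua = array(
    'роза' => 'рожа',
    'рожа' => 'пика',
);
$text = 'роза упала';

// Sequential replacement: each pass sees the output of the previous one.
$sequential = $text;
foreach ($ru_to_ua as $from => $to) {
    $sequential = str_replace($from, $to, $sequential);
}

// Single pass: replaced text is not searched again.
$single_pass = strtr($text, $ru_to_ua);

echo $sequential, "\n";  // пика упала
echo $single_pass, "\n"; // рожа упала
```

preg_replace_callback() with a single alternation pattern, as used in the question, likewise scans the input only once, so the text returned by the callback is not matched again.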

My point is that, as it seems to me, a correct solution to this problem requires other programming tools.

Yes, that's right: the context problem you mention is the first and most important of the problems I listed (although I don't quite agree about the rose example).
Therefore, the first thing that comes to mind is to add whole expressions in addition to individual words, at least the most common ones.
I understand that it is stupid, but I don't see any other way out.
Well then, what's stopping you from adding whole phrases to the $words array?
Something like $words = array(array('ua'=>'мова з якої', 'ru'=>'язык с которого'));
The problem is that it doesn't work that way: the code does not yet understand phrases (I checked), and I don't know what to fix in it.
@stckvrw, if you put the phrase earlier in the array than its component parts, then it should be processed first.
With phrases, the problem will be setting capital and lowercase letters.
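The advice to put a phrase ahead of its component parts could be automated by sorting the dictionary keys by length before building the pattern, so the order of entries in the array no longer matters. A minimal sketch, assuming a hypothetical helper build_alternation() and using lookarounds on \p{L} as a stand-in for \b:

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: build the alternation with the longest entries first.
function build_alternation(array $dict)
{
    $keys = array_keys($dict);
    usort($keys, function ($a, $b) {
        return mb_strlen($b) - mb_strlen($a);
    });
    return implode('|', array_map(function ($key) {
        return preg_quote($key, '~');
    }, $keys));
}

// The phrase is deliberately listed last.
$fndrep = array(
    'мова'         => 'язык',
    'якої'         => 'которой',
    'з'            => 'с',
    'мова, з якої' => 'язык, с которого',
);

// A match may not be directly preceded or followed by a letter.
$pattern = '~(?<!\p{L})(?:' . build_alternation($fndrep) . ')(?!\p{L})~iu';

$result = preg_replace_callback($pattern, function ($m) use ($fndrep) {
    return $fndrep[mb_strtolower($m[0])];
}, 'мова, з якої здійснюється переклад');

echo $result, "\n"; // язык, с которого здійснюється переклад
```

Without the sort, the alternation would try мова first at the start of the string and the phrase entry would never get a chance to match.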
