Hello. My task is to assign keywords to categories as they are uploaded. For example, given this list:

    картинки котиков
    картинки котять
    котики
    котики вперед
    котики картинки
    котиков
    котиков вперед
    котятя
    купить собаку
    купить щенка
    породы собак
    собака
    собаки фото
    щенки
    щенки месяц
    щенки немецкой
    щенки овчарки

There are categories "cats" and "dogs". Each category has a set of word fragments (stems) that identify it. For example, for "cats": cat, cats, kittens; for "dogs": dogs, puppies. In reality there are many more categories and stems, and the uploaded lists run to several hundred thousand lines. Currently the distribution works like this:

    $all_keys = explode("\r\n", $_POST['all_keys']); // split the input line by line
    foreach ($all_keys as $str_keys) {
        $name_category = 'unsort'; // default category: unsorted
        $cats_key = array("кот", "кошк", "котя"); // word stems that identify the category
        foreach ($cats_key as $cats) {
            $poisk = stripos($str_keys, $cats); // check whether the stem occurs in the line
            if ($poisk !== false) {
                $name_category = 'cats';
                goto endsort; // category assigned, so write the keyword to the database right away
            }
        }
        $dogs_key = array("собак", "щенк");
        foreach ($dogs_key as $dogs) {
            $poisk1 = stripos($str_keys, $dogs);
            if ($poisk1 !== false) {
                $name_category = 'dogs';
                goto endsort;
            }
        }
        endsort:
        mysqli_query($db, "INSERT INTO keywords (keyword, theme_key) VALUES ('$str_keys', '$name_category')");
        unset($name_category);
    }

I am sure there is a more sensible solution. As the number of categories and stems grows, the import will become very slow with this kind of check of every stem against every line. Please advise how to simplify the parsing of each line. The stems can be stored in the code, in a text file, or in the database. Thank you in advance. To add the purpose of all this: the keywords are collected from statistics as one continuous list. The system prepares pages for dogs and cats separately, so the operator working on dogs should get only the dog keywords in the form. That is what the distribution is for.

  • If you are generally happy with the current search by word stems, then put the stems in a database table and select the matching ones with a LIKE query. In general, this can be done in a single query. Although something tells me you should look in the direction of full-text search that takes word forms into account, and that means something like Sphinx ... - Mike
  • Do I understand correctly that you suggest loading the whole list without sorting, and then changing the category in the database with queries in a loop? That is, select WHERE theme_key = 'unsort' and UPDATE theme_key = 'cats' WHERE keyword LIKE '%cat%' OR LIKE '%cat%'? - Sergey Strelchenko
  • No, you can also insert the result of a select in a single query, matching against stems that are not written explicitly in the query but taken from a table, something like: insert into keywords(keyword, key) select '$str_keys', category from key_categories K where '$str_keys' like concat('%', K.key, '%'). You may only need to add a group by / having so that each category is selected once when several stems match, and the result can be filtered further. - Mike
  • I usually integrate Sphinx: it has word forms, morphology, and ranking, in general everything that is needed, and I do not reinvent the wheel. - Naumov
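
As a rough illustration of the table-driven idea from the comments above, here is a minimal PHP sketch. It assumes the stems are kept in an array keyed by category, which could equally be loaded from a text file or a database table. The array `$stemsByCategory` and the functions `buildPatterns` and `categorize` are hypothetical names, not part of the original code. One case-insensitive Unicode regex is compiled per category, so each keyword costs one `preg_match` per category instead of one `stripos` per stem. Categories are tried in array order, which keeps the "first category wins" behaviour of the code in the question.

```php
<?php
// Hypothetical stem table: category => list of stems (assumption for illustration).
$stemsByCategory = [
    'cats' => ['кот', 'кошк', 'котя'],
    'dogs' => ['собак', 'щенк'],
];

/**
 * Compile one case-insensitive, UTF-8 aware regex per category.
 */
function buildPatterns(array $stemsByCategory): array
{
    $patterns = [];
    foreach ($stemsByCategory as $category => $stems) {
        if ($stems === []) {
            continue; // an empty alternation would match every keyword
        }
        $quoted = array_map(function ($stem) {
            return preg_quote($stem, '/');
        }, $stems);
        $patterns[$category] = '/' . implode('|', $quoted) . '/iu';
    }
    return $patterns;
}

/**
 * Return the first category whose pattern matches, or 'unsort'.
 */
function categorize(string $keyword, array $patterns): string
{
    foreach ($patterns as $category => $pattern) {
        if (preg_match($pattern, $keyword) === 1) {
            return $category;
        }
    }
    return 'unsort';
}

// Build the patterns once, then reuse them for every uploaded line.
$patterns = buildPatterns($stemsByCategory);
foreach (['картинки котиков', 'купить щенка', 'погода'] as $keyword) {
    echo $keyword, ' => ', categorize($keyword, $patterns), "\n";
}
```

Like the original substring search, this matches a stem anywhere inside a word, so it does not handle word forms or morphology. That part is what the full-text engines mentioned in the comments are for.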

1 answer

The topic is interesting, and as always there are options.

In fact, it depends on how and for what the result will be used.

1) You just need the search to work. Take Sphinx, Solr, or Elasticsearch, set it up, and enjoy.

2) You want to tinker with it yourself.

I would look toward machine learning: https://github.com/php-ai/php-ml , http://php.net/manual/en/book.fann.php

I will say right away that I do not really know machine learning, but something like this should work.

train.data

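The first line holds three numbers: the number of training samples, the number of inputs, and the number of outputs. FANN itself reads numeric values, so in practice each phrase and label has to be encoded as numbers first; the text form below only shows the idea.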

    17 1 1
    картинки котиков
    cat
    картинки котять
    cat
    котики
    cat
    котики вперед
    cat
    котики картинки
    cat
    котиков
    cat
    котиков вперед
    cat
    котятя
    cat
    купить собаку
    dog
    купить щенка
    dog
    породы собак
    dog
    собака
    dog
    собаки фото
    dog
    щенки
    dog
    щенки месяц
    dog
    щенки немецкой
    dog
    щенки овчарки
    dog
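
FANN training files contain only numbers, so the phrases and labels have to be encoded before training. Below is a minimal sketch of one possible encoding, assuming one 0/1 input per stem and a single output of 1 for cat and -1 for dog. The stem list, the label values, and the functions `encodePhrase` and `buildTrainData` are illustrative assumptions, not something taken from this answer.

```php
<?php
// Assumed feature set: one input per stem, 1 if the phrase contains it, else 0.
$stems = ['кот', 'кошк', 'собак', 'щенк'];
// Assumed label encoding for a single output neuron with a symmetric sigmoid.
$labels = ['cat' => 1, 'dog' => -1];

/**
 * Turn a phrase into a 0/1 vector, one element per stem.
 */
function encodePhrase(string $phrase, array $stems): array
{
    $features = [];
    foreach ($stems as $stem) {
        $pattern = '/' . preg_quote($stem, '/') . '/iu';
        $features[] = preg_match($pattern, $phrase) === 1 ? 1 : 0;
    }
    return $features;
}

/**
 * Build the text of a FANN training file:
 * header "samples inputs outputs", then one input line and one output line per sample.
 */
function buildTrainData(array $samples, array $stems, array $labels): string
{
    $lines = [count($samples) . ' ' . count($stems) . ' 1'];
    foreach ($samples as [$phrase, $label]) {
        $lines[] = implode(' ', encodePhrase($phrase, $stems));
        $lines[] = (string) $labels[$label];
    }
    return implode("\n", $lines) . "\n";
}

// A few labeled phrases from the list in the question.
$samples = [
    ['картинки котиков', 'cat'],
    ['котятя', 'cat'],
    ['купить собаку', 'dog'],
    ['щенки овчарки', 'dog'],
];
echo buildTrainData($samples, $stems, $labels);
```

With this encoding, `$num_input` in train.php would have to equal the number of stems (4 here), and a phrase passed to the network for evaluation would need to go through the same `encodePhrase` step.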

train.php

    <?php
    $num_input = 1;
    $num_output = 1;
    $num_layers = 3; // input, hidden, and output layer: one size argument per layer below
    $num_neurons_hidden = 3;
    $desired_error = 0.001;
    $max_epochs = 500000;
    $epochs_between_reports = 1000;

    $ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

    if ($ann) {
        fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
        fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);

        // Train on the data file and save the network only if training succeeded.
        $filename = dirname(__FILE__) . "/train.data";
        if (fann_train_on_file($ann, $filename, $max_epochs, $epochs_between_reports, $desired_error))
            fann_save($ann, dirname(__FILE__) . "/evaluator.net");

        fann_destroy($ann);
    }

tester.php

    <?php
    $train_file = (dirname(__FILE__) . "/evaluator.net");
    if (!is_file($train_file))
        die("The file evaluator.net has not been created!");

    $ann = fann_create_from_file($train_file);
    if (!$ann)
        die("ANN could not be created");

    // Validation: run the trained network on a single input.
    // Note: fann_run() takes an array of numeric inputs, so $input
    // has to be encoded as numbers before this can work.
    $ev = function ($input) use ($ann) {
        return fann_run($ann, [$input]);
    };

    var_dump($ev("картинки котиков"));
    var_dump($ev("котятя"));
    var_dump($ev("породы собак"));

    fann_destroy($ann);