Searching for data in files is too slow; any idea how to speed it up? There are files in a folder, and each file has this structure:

url:man.report <title> man.report - это наилучший источник информации по теме . </title> <h1> man.report </h1> man.report это наилучший источник интересующей Вас информации. От общеизвестных тем, до того, что вы даже не ожидаете найти; на man.report есть все. Надеемся, что вы найдете то, что ищете. ||| url:soft-trans.ru <title> Софттранс | Программы для транспорта </title> <h1> Софттранс </h1> ||| etc.

Total file size is ~1 GB, search time is ~200 seconds. The search code:

    <?php
    declare(strict_types=1);
    error_reporting(-1); // report everything
    $start = microtime(true); // start the timer

    function my_search($search, $text) {
        $q = 1;
        foreach ($search as $v) {
            if (!mb_strstr($text, $v)) {
                $q = 0;
                break;
            }
        }
        return $q;
    }

    $search = trim(mb_strtolower('ssd цена')); // the query (the corpus is Russian)
    $s = $search;
    $search = explode(' ', $search);
    foreach ($search as &$v) {
        $v = ' ' . $v . ' ';
    }
    unset($v);

    $files = array_diff(scandir(__DIR__ . '/file/'), array('..', '.'));
    $u = [];
    $c = 0;
    foreach ($files as $value) { // loop over the files
        $e = $value;
        $value = file_get_contents(__DIR__ . '/file/' . $value);
        $value = mb_strtolower($value);
        $site_arr = explode(PHP_EOL . '|||' . PHP_EOL, $value);
        foreach ($site_arr as $value1) {
            $value2 = str_replace([PHP_EOL, '.', ',', '-'], ' ', $value1);
            $q = my_search($search, $value2);
            if ($q) {
                ++$c;
                $url_site = substr(strstr($value1, 'url:'), 4); // grab the page URL
                $url_site = strstr($url_site, PHP_EOL, true);
                $value1 = strip_tags($value1);
                foreach ($search as $v) {
                    $value1 = str_replace(trim($v), '<b>' . $v . '</b>', $value1);
                }
                unset($v);
                $u[] = '<a href="http://' . $url_site . '/" target="_blank">http://' . $url_site . '/</a> | ' . $e . '<br />' . $value1 . '<br />';
            }
            if ($c > 60) {
                break 2;
            }
        }
        unset($value1);
    }
    unset($value);

    if (!empty($u)) {
        $u = array_unique($u); // remove duplicates
        $u = implode('<hr>', $u);
    } else {
        $u = 'nothing found';
    }
    echo $u . '<br />';
    echo 'total: ' . $c . '<br />';
    echo '<br />Script runtime: ' . (microtime(true) - $start) . ' seconds<br /><br />';
    ?>
  • For example, system("grep -R 'search phrase' .") - Naumov
  • I'm looking for individual words, not a phrase - Vasily
  • How about indexing the documents in advance? Smart search usually works like this. - Lexx918
  • Full-text indexes in a DBMS, or search engines like Sphinx - teran
  • @Vasily, as a DBMS you can use SQLite (sqlite.org). As a search engine, PHP Lucene (github.com/pucene/pucene) - Yuriy Prokopets

2 answers

Please point out which way to dig; I don't quite understand how you could index a file.

In a loop, as you already do, open each file (by the way, look at glob for finding files), prepare it for searching, then extract from it every word that could be searched for (for example, with the regex \b\w{2,}\b, or by exploding on spaces after removing all punctuation, special characters and line breaks, or whatever else you prefer).

Of the words found, keep only the unique ones. The result is an array of unique, useful words for each particular file.

    $wordsFromFileFoo = ["ssd", "быстрее", "hdd"];
    $wordsFromFileBar = ["как", "настроить", "принтер"];
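A minimal sketch of how such a word list might be produced; the file name data.txt is a hypothetical example, and the (*UCP) verb makes \w and \b Unicode-aware so Cyrillic words match too:

    <?php
    // Sketch: extract the unique words of one file (file name is hypothetical).
    $text = mb_strtolower(file_get_contents(__DIR__ . '/file/data.txt'));
    // (*UCP) makes \w and \b match Unicode word characters, Cyrillic included.
    preg_match_all('/(*UCP)\b\w{2,}\b/u', $text, $matches);
    $wordsFromFile = array_values(array_unique($matches[0]));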

Create another array holding all words from all files. Each file's words are added to this array (gradually all the words accumulate in it and, as a rule, little remains to be added after a while, so it will not grow without bound). Each word in this shared array gets its own index.

    $wordsFromAllFiles = [
        "ssd",
        "быстрее",
        "hdd",
        "как",
        "настроить",
        "принтер",
    ];

Now turn each per-file word array into a set of indices into the shared array, one index per word. This reduces their size.

    $wordsFromFileFoo = [0, 1, 2];
    $wordsFromFileBar = [3, 4, 5];
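A sketch of building both structures in a single pass; $wordsPerFile, which maps each file name to its array of unique words, is a hypothetical variable, not from the question:

    <?php
    // Sketch: build the shared dictionary and the per-file index arrays.
    $wordsFromAllFiles = [];
    $dictionary = [];   // word => index, for O(1) lookup while building
    $indexPerFile = [];

    foreach ($wordsPerFile as $file => $words) {
        $indexes = [];
        foreach ($words as $word) {
            if (!isset($dictionary[$word])) {
                $dictionary[$word] = count($wordsFromAllFiles);
                $wordsFromAllFiles[] = $word;
            }
            $indexes[] = $dictionary[$word];
        }
        $indexPerFile[$file] = $indexes;
    }

The $dictionary hash is just a building aid: it avoids an array_search over the ever-growing word list on every insertion.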

The index is ready! Save all these arrays to files: the shared array of all words, and the array of word indices for each file (be sure to look at SplFixedArray). The intermediate per-file word arrays are no longer needed.
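One possible way to persist it, assuming a hypothetical index/ directory exists; serialize() is just one option, var_export or JSON would work as well:

    <?php
    // Sketch: write the index to disk (paths and file names are hypothetical).
    file_put_contents(__DIR__ . '/index/words.dat', serialize($wordsFromAllFiles));
    file_put_contents(__DIR__ . '/index/files.dat', serialize($indexPerFile));

    // SplFixedArray stores a plain list of integers more compactly:
    $compact = SplFixedArray::fromArray($indexPerFile['foo.txt']);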

How do you search? Split the query phrase into words, as you already do. Look up each word in the shared array of all words to get its index. At this point you can already make a decision: if a word is not in the index, it occurs in no file, and the phrase cannot be found.

If all the words are in the index, walk the index array of each file and look for the query indices in it with in_array or array_search. Found all of them? The file contains every search word. Only some? The file contains the words partially. None? The file contains none of them.
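Putting the two previous paragraphs together, a search over such an index could look roughly like this; the variable and file names continue the hypothetical ones above:

    <?php
    // Sketch: search the prebuilt index.
    $wordsFromAllFiles = unserialize(file_get_contents(__DIR__ . '/index/words.dat'));
    $indexPerFile      = unserialize(file_get_contents(__DIR__ . '/index/files.dat'));

    $queryWords = array_unique(explode(' ', trim(mb_strtolower('ssd цена'))));

    // Translate each query word into its dictionary index.
    $queryIndexes = [];
    foreach ($queryWords as $word) {
        $i = array_search($word, $wordsFromAllFiles, true);
        if ($i === false) {
            exit('nothing found'); // the word occurs in no file at all
        }
        $queryIndexes[] = $i;
    }

    // A file matches fully when it contains every query index.
    $matches = [];
    foreach ($indexPerFile as $file => $indexes) {
        if (count(array_intersect($queryIndexes, $indexes)) === count($queryIndexes)) {
            $matches[] = $file;
        }
    }

array_intersect does the in_array work in one call; for very large files, flipping $indexes into a hash set first and testing with isset would be faster still.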

Unfortunately, when files change (old ones are modified, new ones added, others deleted), the index has to be updated or rebuilt. Whether to patch only the part affected by the changed file on the fly, or to rebuild everything, is a subject for a separate question. The main thing is that it has to be done.

You can go further and bolt on search relevance, for example. Look at the order of the words in the query phrase and the order in which they occur in the index, and at how many words lie between them. If the gaps are small and the order matches, you are closer to the desired result than when the words occur in a different order, separated by several other words.
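A rough sketch of such a proximity score, under the assumption that the per-file index arrays preserve the original word order (i.e. the array offset acts as the word's position, which requires skipping the deduplication step):

    <?php
    // Sketch: lower score = better. $indexes is one file's index array,
    // $queryIndexes is the query translated to dictionary indexes.
    function proximityScore(array $queryIndexes, array $indexes): int {
        $positions = [];
        foreach ($queryIndexes as $qi) {
            $pos = array_search($qi, $indexes, true);
            if ($pos === false) {
                return PHP_INT_MAX; // a word is missing entirely
            }
            $positions[] = $pos;
        }
        $span = max($positions) - min($positions); // words far apart => worse
        $inversions = 0;                           // out-of-order pairs => worse
        for ($i = 1, $n = count($positions); $i < $n; $i++) {
            if ($positions[$i] < $positions[$i - 1]) {
                $inversions++;
            }
        }
        return $span + 10 * $inversions; // the weight 10 is arbitrary
    }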

It is also a good idea to reduce all word forms to a base form (lemma). The dictionary will be smaller and the search faster. Or leave them as they are for higher precision. On the other hand, without matching across word forms you will more often miss, even when the query roughly reflects the essence of a file's text but differs slightly in the prefixes and endings of the words.

If the phrase is very long, you can count how many of its words occur in a particular file and sort the results by that count. A higher count means a better match.
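Counting and sorting can then be as simple as this, again using the hypothetical $indexPerFile and $queryIndexes from above:

    <?php
    // Sketch: rank files by how many query words they contain.
    $scores = [];
    foreach ($indexPerFile as $file => $indexes) {
        $scores[$file] = count(array_intersect($queryIndexes, $indexes));
    }
    arsort($scores); // best matches first, keys preserved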

  • Thanks! Very detailed; I'll think about how to implement it - Vasily
  • @Vasily, how did it go? - Lexx918
  • For now I've implemented the search in C#, and so far that's fine for me: 8 seconds versus 180 for PHP. I'll look at your option a bit later, once the growing data becomes a problem I have to solve - Vasily

From simple to complex:

  • For each entry, the search is done with mb_strstr. Plain strpos will be a little faster.
  • In your example, some time is spent on preparation (splitting on |||, removing punctuation); that can be done in advance.
  • You can build an index in which each word maps to the list of documents containing it. A query then quickly fetches a few such lists, and their intersection is the search result (see the sketch after this list). Note, however, that the index has to be rebuilt when documents change (are deleted or added).
  • You can parallelize the search.
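A minimal sketch of the inverted index from the third point, assuming a hypothetical $documents variable that maps each file name to its prepared lowercase text:

    <?php
    // Sketch: build an inverted index (word => list of documents).
    $inverted = [];
    foreach ($documents as $file => $text) {
        preg_match_all('/(*UCP)\b\w{2,}\b/u', $text, $m);
        foreach (array_unique($m[0]) as $word) {
            $inverted[$word][] = $file;
        }
    }

    // Query: intersect the document lists of every query word.
    $result = null;
    foreach (['ssd', 'цена'] as $word) {
        $docs = $inverted[$word] ?? [];
        $result = $result === null ? $docs : array_intersect($result, $docs);
    }
    // $result now lists the files that contain every query word.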

In addition, you can apply a few tricks, for example using existing search tools. By first filtering (as advised in the comments) down to the files that contain the needed words, you can reduce the number of files to search. You can also use a database or a search engine.

  • 1. Yes, that's probably true, but in my tests mb_strpos is about 5 seconds slower. 2. The splitting cannot be done in advance; the search has to happen exactly inside the block between ||| markers, and removing characters has practically no effect on speed. 3. Yes, worth a try. 4. What options would you recommend? - Vasily
  • @Vasily, for example something like habr.com/post/148688 - Yegor Banin