Unfortunately I cannot share the full source code of the script, so below are only the fragments where, in my opinion, the leak may be.

The PHP script runs in the background and does its work in several processes: a main process that only spawns children, and child processes that download zip archives from FTP, extract all the xml files from each zip, process them and write the results to MySQL.

What makes the leak hard to find is that the script runs for one to three days. After some time it uses up all the memory on the server and the server stops responding.

Possible sources of the leak (a simplified version: it omits the check that the file was actually downloaded, the actual xml processing and writing to the database, and the search for new zip files):

    // Download the archive and pass the file path on for processing; about 430000 zip files in total
    function start() {
        $file_loc = fopen(local_dir . $ftp_zip, 'w');
        if (@ftp_fget($conn_id, $file_loc, $ftp_zip, FTP_BINARY)) {
            $this->open_zip(local_dir . $ftp_zip, $link_sql);
        }
        unlink(local_dir . $ftp_zip);
    }

    // Extract the xml files from the zip
    function open_zip($file, $link_sql) {
        $zip = new ZipArchive;
        if ($zip->open($file) == TRUE) {
            // Usually one archive contains between 10000 and 100000 xml files
            for ($i = 0; $i < $zip->numFiles; $i++) {
                $filename_full = $zip->getNameIndex($i);
                $filename = explode('_', $filename_full)[0];
                // Pass the xml string to the private function
                $this->xml_to_sql($zip->getFromName($filename_full), $link_sql);
            }
        }
    }

    // Build a SimpleXMLElement, where $xml_text is a string of xml
    function xml_to_sql($xml_text, $link_sql) {
        foreach (new SimpleXMLElement($xml_text) as $xml) {
            break;
        }
    }

The daemon code that manages the child processes:

    class daemon_regions {
        // List of regions to process
        private $regions = ['dir1', 'dir2', 'dir3'];
        // Maximum number of child processes
        public $maxProcesses = 10;
        // When set to TRUE, the daemon shuts down
        protected $stop_server = FALSE;
        // Running child processes are kept here
        protected $currentJobs = array();

        public function __construct() {
            // Listen for the SIGTERM and SIGCHLD signals
            pcntl_signal(SIGTERM, array($this, "childSignalHandler"));
            pcntl_signal(SIGCHLD, array($this, "childSignalHandler"));
        }

        // The parent spawns the children
        public function run() {
            // Loop over the regions until $stop_server is set to TRUE
            foreach ($this->regions as $name) {
                // If the maximum number of child processes is already running, wait for some to finish
                $flud_off = True;
                while (count($this->currentJobs) >= $this->maxProcesses) {
                    if ($flud_off) {
                        $flud_off = False;
                    }
                    sleep(10);
                }
                if (!$this->stop_server) {
                    $this->launchJob($name);
                }
            }
            while ($this->currentJobs != []) {
                sleep(1);
            }
        }

        // Creates a child process
        protected function launchJob($name) {
            // Create the child process: all code after pcntl_fork()
            // is executed by two processes, the parent and the child
            $pid = pcntl_fork();
            if ($pid == -1) {
                // Failed to create the child process
                return FALSE;
            } elseif ($pid) {
                // This branch runs in the parent process
                $this->currentJobs[$pid] = TRUE;
            } else {
                // And this branch runs in the child process
                $start = time();
                $dm = new daemon_region_parser();
                $dm->main($name);
                $today_summ = gmdate("H:i:s", time() - $start);
                exit();
            }
            return TRUE;
        }

        // Handling of incoming UNIX signals
        public function childSignalHandler($signo, $pid = null, $status = null) {
            switch ($signo) {
                case SIGTERM:
                    // On the shutdown signal, set the flag
                    $this->stop_server = true;
                    break;
                case SIGCHLD:
                    // On a signal from a child process
                    if (!$pid) {
                        $pid = pcntl_waitpid(-1, $status, WNOHANG);
                    }
                    // While there are finished child processes
                    while ($pid > 0) {
                        if ($pid && isset($this->currentJobs[$pid])) {
                            // Remove the child process from the list
                            unset($this->currentJobs[$pid]);
                        }
                        $pid = pcntl_waitpid(-1, $status, WNOHANG);
                    }
                    break;
                default:
                    // all other signals
                    break;
            }
        }
    }

I use PHP 5.6, MySQL and the mysqli extension.

About the servers: tested on a VPS running Debian (kernel 3.16.36 SMP) with 2 cores and 58 GB of RAM, and on another server running Debian (kernel 3.16.36 SMP) with 12 cores and 64 GB of RAM.

  • You realize that as posed this is a bit like the old joke: "I don't like Caruso at all, he sings terribly!" - "And where did you hear him?" - "Rabinovich sang him to me." To start with, simply monitor the server with top and see which processes are gobbling up memory. - rjhdby
  • @rjhdby That is exactly what we are talking about: it is the PHP script. I monitored it with htop, the child processes are the ones eating memory, and the code from the child process where the problem may be is what I quoted as an example. - users
  • How are the child processes started? Launched once at the beginning and kept in memory, or started on each iteration, doing their work and exiting? The processes that eat memory - when were they started, just now or 1-3 days ago? - rjhdby
  • @rjhdby I updated the question, but regardless of the number of child processes, a child process manages to eat up all the memory. - users
  • And still I repeat the question, just to rule out the obvious: the processes that eat memory - were they started just now, or 1-3 days ago? - rjhdby

2 Answers

To begin with, pin down where the memory goes: use a profiler that can track memory. An enabled profiler will most likely slow the script down drastically, and a report covering two days of work may be far too large, so it is best to limit the amount of work done under the profiler to something more compact. There are profiler recommendations on enSO:

xdebug, unfortunately, does not track memory.

It might seem that you could, in the old-fashioned way, sprinkle calls to memory_get_usage or memory_get_peak_usage around the code, but this is not only inconvenient - these functions cannot see memory allocated by native libraries, for example inside libxml2, which simplexml uses.
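Not from the answer, but as an illustration of that limitation: a small helper (the name log_memory and the error_log destination are my own choices, not anything from the question) that logs both what PHP's allocator reports and the process RSS from /proc/self/status. Memory held by libxml2 shows up only in the RSS figure.

    <?php
    // Hypothetical helper: compare memory_get_usage() with the process RSS.
    // Memory allocated inside libxml2 appears in VmRSS but not in memory_get_usage().
    function log_memory($label) {
        $rss = 'n/a';
        // /proc/self/status is Linux-specific (the question mentions Debian)
        if (is_readable('/proc/self/status')) {
            foreach (file('/proc/self/status') as $line) {
                if (strpos($line, 'VmRSS:') === 0) {
                    $rss = trim(substr($line, 6));
                    break;
                }
            }
        }
        error_log(sprintf(
            "[%s] pid=%d php_alloc=%.1f MB peak=%.1f MB rss=%s",
            $label,
            getmypid(),
            memory_get_usage(true) / 1048576,
            memory_get_peak_usage(true) / 1048576,
            $rss
        ));
    }

    // Example: call it around a suspicious step
    log_memory('before zip');
    // ... process one archive here ...
    log_memory('after zip');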

And simplexml is my main suspect: it may not be giving back all the memory it occupies.
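One way to check that suspicion, sketched here as an assumption rather than taken from the answer: parse a representative xml string in a loop in an otherwise empty script and watch whether the reported numbers keep growing. The file name sample.xml and the iteration counts are arbitrary.

    <?php
    // Minimal isolation test: does repeatedly building and discarding
    // SimpleXMLElement objects keep inflating the process memory?
    $xml_text = file_get_contents('sample.xml'); // any representative xml file
    if ($xml_text === false) {
        die("put a representative xml file named sample.xml next to the script\n");
    }

    for ($i = 1; $i <= 100000; $i++) {
        $xml = new SimpleXMLElement($xml_text);
        unset($xml);

        if ($i % 10000 === 0) {
            printf(
                "iteration %6d: php_alloc=%.1f MB, peak=%.1f MB\n",
                $i,
                memory_get_usage(true) / 1048576,
                memory_get_peak_usage(true) / 1048576
            );
        }
    }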


It makes sense to split the task into a pipeline. PHP is not architecturally designed for long-lived scripts.

  • One group of scripts walks the FTP and downloads the files. Network I/O is usually slow, and if you stream the download straight into a file descriptor, memory consumption stays quite small even for very long-running scripts. Information about each downloaded file is published to a task queue. If you go as far as using multi-curl, this can even be a single script that downloads many files in parallel.

  • The second group subscribes to the queue of new archives, unpacks each archive into separate xml files and puts tasks for processing them into another queue. You can restart such a worker, for example, after every 10 unpacked archives if it turns out that ZipArchive is leaking, or even call the system unzip instead of ZipArchive (see the sketch after this list).

  • The third group reads the queue of xml files ready for processing and does the actual parsing and writing to MySQL. It can also be restarted easily and without any serious consequences.
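A minimal sketch of what the second stage could look like, assuming a Redis list is used as the queue; the queue names zip_queue and xml_queue, the /tmp/xml directory and the 10-archive restart limit are all assumptions for illustration, not details from the question or the answer.

    <?php
    // Stage-2 worker sketch: take zip paths from a queue, unpack them with the
    // system unzip, publish the extracted xml paths to the next queue, and exit
    // after a fixed number of archives so the OS reclaims any leaked memory.
    // Assumes the phpredis extension is installed.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $processed = 0;
    $limit     = 10;          // restart the worker after 10 archives
    $out_dir   = '/tmp/xml';  // base directory for extracted xml files

    while ($processed < $limit) {
        $zip_path = $redis->rPop('zip_queue');
        if ($zip_path === false) {   // queue is empty, wait a bit
            sleep(5);
            continue;
        }

        // One subdirectory per archive, unpacked by the system unzip
        $target = $out_dir . '/' . basename($zip_path, '.zip');
        $cmd = sprintf('unzip -o %s -d %s', escapeshellarg($zip_path), escapeshellarg($target));
        $output = array();
        exec($cmd, $output, $code);

        if ($code === 0) {
            // Publish every extracted xml file as a task for stage 3
            foreach (glob($target . '/*.xml') as $xml_file) {
                $redis->lPush('xml_queue', $xml_file);
            }
            unlink($zip_path);
        }

        $processed++;
    }
    // exit; a supervisor (cron, systemd, supervisord) starts the next worker

The early exit is the point of the design: even if ZipArchive or libxml does leak, the memory goes back to the OS when the worker process ends, and the supervisor simply starts a fresh one.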

At first glance this architecture looks more complicated, but it makes it much easier to see at which stage the bottleneck is, to add extra workers for that stage, or to completely replace the handlers of that particular stage. A handler can safely die and be restarted without having to, say, download a large archive again and walk through all of it when 999 out of 1000 files have already been processed.

  • It is also worth noting that this approach, for all its "apparent" complexity, is much easier to reason about than juggling fork inside a single piece of code. - rjhdby
  • I think it is time to switch to the pipeline version, but at the moment it looks like a leak in SimpleXML. - users

How often do you process files? How often does GC work?

You can try this

    function xml_to_sql($xml_text, $link_sql) {
        $xmlTree = new SimpleXMLElement($xml_text);
        foreach ($xmlTree as $xml) {
            break;
        }
        // Drop the reference explicitly so the object can be freed
        unset($xmlTree);
    }

Also try adding this after the loop:

    $zip->close();
    gc_collect_cycles();
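Put together with the open_zip method from the question, that would look roughly like this (method and variable names are taken from the question; only the two calls from this answer are added):

    function open_zip($file, $link_sql) {
        $zip = new ZipArchive;
        if ($zip->open($file) == TRUE) {
            for ($i = 0; $i < $zip->numFiles; $i++) {
                $filename_full = $zip->getNameIndex($i);
                $this->xml_to_sql($zip->getFromName($filename_full), $link_sql);
            }
            // Release the archive handle and force a cycle collection
            // once the whole archive has been processed
            $zip->close();
            gc_collect_cycles();
        }
    }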
  • The gap between zips is anywhere from 0.3 seconds to 1 minute; at the moment GC effectively runs only when the child process finishes, perhaps that is the problem. - users