Reduce the load on the server(s) when running a loop with file_get_contents() and preg_match()

Question

There is a PHP price parser:

    foreach ($products as $product) {
        ...
        $pageContent = file_get_contents($sourcePageURL);
        preg_match('/<'.$openedTagWithClass.'>(.*?)<\\/'.$openedTag.'>/is', $pageContent, $priceString);
        $priceFromLink = $priceString[1];
        ...
    }

In the $products array, each element contains a link to the page that holds the price, plus the opening tag with its class. That is, file_get_contents() fetches the page content from the link, and preg_match() extracts only the price between the specified tags.

So far the input array contains only a few pages, but over time several hundred are planned. So the question is: how can I minimize the load on the source sites and their servers while this process runs?

Maybe the whole loop should be split into parts, or maybe not; I don't have a deep understanding of what actually happens under the hood. In general, the process needs to be optimized somehow for an input of hundreds of pages.

PinkTux · Answer 1 · 2017-03-24T12:42:15

file_get_contents() is just an ordinary HTTP request. If the source site sends a Last-Modified header, a good option is to make use of it.

Store a mapping of URL => Last-Modified somewhere.
Instead of file_get_contents(), make the HTTP request directly (there are many ways to do this, for example curl) and send the HTTP header If-Modified-Since: value_for_this_URL.
If you receive HTTP 304, do nothing: the information has not changed.
If you receive anything else, process the response and save the new modification time in the database.
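
The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming the curl extension is available; the function names conditionalHeaders() and fetchIfModified() and the shape of the returned array are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: builds the extra request headers for a conditional GET.
// $lastModified is the Last-Modified value saved for this URL, or null on the first visit.
function conditionalHeaders(?string $lastModified): array
{
    return $lastModified === null ? [] : ['If-Modified-Since: ' . $lastModified];
}

// Hypothetical helper: requests $url and reports the status, the body and the
// new Last-Modified value. On HTTP 304 the body is null (nothing has changed).
function fetchIfModified(string $url, ?string $lastModified): array
{
    $newLastModified = null;
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_HTTPHEADER     => conditionalHeaders($lastModified),
        // Called once per response header line; remember Last-Modified if present.
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$newLastModified) {
            if (stripos($line, 'Last-Modified:') === 0) {
                $newLastModified = trim(substr($line, strlen('Last-Modified:')));
            }
            return strlen($line);
        },
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        'status'       => $status,
        'body'         => ($body === false || $status === 304) ? null : $body,
        'lastModified' => $newLastModified,
    ];
}
```

A caller would look up the saved value for the URL, pass it in, skip the page when the status is 304, and otherwise parse the body and save the returned lastModified value.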

You can also insert pauses between requests.
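
A pause between requests might look like the sketch below. The helper name fetchWithPauses() and its parameters are hypothetical; the fetcher is passed in as a callable so that any request function can be plugged in.

```php
<?php
// Hypothetical helper: calls $fetch for each URL and sleeps between requests.
function fetchWithPauses(array $urls, callable $fetch, int $pauseMicroseconds): array
{
    $results = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            usleep($pauseMicroseconds); // pause only between requests, not before the first one
        }
        $first = false;
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

For example, fetchWithPauses($urls, 'file_get_contents', 500000) would wait half a second between pages; a suitable delay depends on the source sites and is an assumption here.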

Everything else happens on your side, so it cannot affect the source site in any way (and if the site starts blocking you, that is a completely different topic).

By the way, does curl use less memory than file_get_contents?
(regardless of whose memory is meant: my server's or the source site's)
@stckvrw, what happens on your side cannot consume anything on the source site.

Mcile · Answer 2 · 2017-03-24T19:30:09

If you continue in the same vein, you will run into this problem. Based on my own experience and on an article about optimization, I can see that your code consumes too much RAM, because you apply regular expressions to the entire page content. First, something else that can consume memory: if an array is assigned to inside the loop, declare it outside the loop and bind it by reference, like this:

    $save_memory = &$arr; // added to optimize memory use in the loop
    foreach ($products as &$product) {
        ...
        $arr[] = $some_sing;
        ...
        $pageContent = file_get_contents($sourcePageURL);
        preg_match('/<'.$openedTagWithClass.'>(.*?)<\\/'.$openedTag.'>/is', $pageContent, $priceString);
        $priceFromLink = $priceString[1];
        ...
        unset($pageContent, $product);
    }

Second, get rid of regular expressions

    $reader = new XMLReader();
    $save_memory = &$arr; // added to optimize memory use in the loop
    foreach ($products as &$product) {
        ...
        $arr[] = $some_sing;
        ...
        $pageContent = file_get_contents($sourcePageURL);
        $reader->XML($pageContent); // tell the reader which document to parse
        while ($reader->read()) {
            if ($reader->localName == $openedTag && $reader->getAttribute('class') == 'target class') {
                //$movies = new SimpleXMLElement($reader->readString);
                //$priceFromLink = $movies->{$openedTag};
                $priceFromLink = $reader->readString;
            }
        }
        ...
        unset($pageContent, $product);
    }

Correct me if this line is wrong:

  $priceFromLink = $reader->readString;
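
On that line: in XMLReader, readString() is a method, so it needs parentheses. Below is a minimal sketch of the reader approach under the assumption that the input is well-formed XML (real-world HTML often is not, and the reader will stop on the first error). The helper name firstTextByTagAndClass() is hypothetical.

```php
<?php
// Hypothetical helper: returns the text of the first <$tag class="$class"> element, or null.
// Assumes $xml is well-formed XML.
function firstTextByTagAndClass(string $xml, string $tag, string $class): ?string
{
    $reader = new XMLReader();
    if (!$reader->XML($xml)) { // parse from a string already in memory
        return null;
    }
    $found = null;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === $tag
            && $reader->getAttribute('class') === $class) {
            $found = $reader->readString(); // method call, note the parentheses
            break;                          // stop at the first match
        }
    }
    $reader->close();
    return $found;
}
```

Checking nodeType avoids matching the closing tag of the same element, which has the same localName.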

Another option is phpQuery.

And how does all this relate to reducing the load on source sites?
Reducing the load on other servers is possible only with a single command: usleep(10000);
I ask about something else: what does your answer have to do with the question?
Yes, that's right: I'm asking how to reduce the load on the servers of the source sites, not on my own server.
@Mcile, regarding the reader: the source sites are all different, and each keeps its price inside different tags with different attributes (class, id, data-*, etc.).
So $reader->getAttribute('class') == 'target class' is not suitable here.
A more flexible method is needed, one that lets you specify any attributes of the HTML tag, in any order, for example <div class="product-price" id="price">.
In addition, inside the tag but still BEFORE the price itself there may be some repeated text, for example <div class="product-price" id="price">Цена: 100 у.е.</div> (i.e. the word "Цена", meaning "Price", is repeated).
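
One way to get that flexibility, offered here only as a sketch and not as something proposed in the thread, is to swap the reader for DOMDocument with DOMXPath and store an XPath expression per source site instead of a tag and class. The helper names extractByXPath() and extractNumber() are hypothetical, and the number pattern assumes prices without thousands separators.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first node matching $xpath, or null.
function extractByXPath(string $html, string $xpath): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    // The XML declaration prefix hints that the markup is UTF-8 (an assumption here).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Hypothetical helper: pulls the first number out of a string such as "Цена: 100 у.е.".
function extractNumber(string $text): ?float
{
    if (preg_match('/\d+(?:[.,]\d+)?/', $text, $m) !== 1) {
        return null;
    }
    return (float) str_replace(',', '.', $m[0]);
}
```

An expression such as //div[@id="price"] matches regardless of the order in which the attributes appear in the markup, and extractNumber() skips any leading label before the price.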
