Good day! I have a script that parses many pages, and some of the pages have to be loaded through curl. It works very slowly. Is it possible to optimize it somehow?
- You can: rewrite everything in C. In general, black boxes are difficult to optimize. - ReinRaus
- Yeah, and in assembler it would really fly :) - woland
- The hint apparently wasn't understood :) It's a black box. How can you talk about optimization without seeing the code? I can also recommend replacing the script-based parser with regular expressions: one native regex call is better than a heap of script lines with a bunch of function calls. - ReinRaus
- The code is around 500 lines, I can't post the whole thing. As for regexes: I use simple_html_dom. - woland
- I highly recommend reading the article [Comparing libraries for parsing][1]; judging by its conclusions, simple_html_dom is the slowest, and by a wide margin. [1]: habrahabr.ru/post/114323 - thunder
1 answer
First, run your program through a profiler, for example xdebug. You will immediately see which calls take up the most time; those are the ones to optimize.
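For illustration, a single profiled run can be triggered from the command line (this assumes Xdebug 2.x is installed; the script name parser.php is a placeholder). The resulting cachegrind.out.* file can then be opened in KCachegrind or Webgrind to see where the time goes:

    # one profiled run; writes cachegrind.out.<pid> into /tmp
    php -d xdebug.profiler_enable=1 -d xdebug.profiler_output_dir=/tmp parser.php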
Secondly, if you only need to pull out a few values, I recommend using regular expressions. However, as the structure being parsed gets more complex, you will have to abandon them.
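A minimal sketch of that idea: extracting a single value from already-downloaded HTML with preg_match (the HTML here is a placeholder; in your case it would come from curl):

    <?php
    // Pull one value out of raw HTML with a regex instead of a full parser.
    $html = '<html><head><title>Example page</title></head><body>...</body></html>';

    if (preg_match('~<title>(.*?)</title>~si', $html, $m)) {
        echo trim($m[1]), "\n";   // prints: Example page
    }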
Thirdly, I advise you to look at specialized parsing libraries. For example, there are several libraries for PHP that parse HTML documents into a DOM and work fairly quickly.
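One option is PHP's built-in DOM extension (DOMDocument plus DOMXPath), which is generally much faster than simple_html_dom. A sketch, with a made-up XPath query and placeholder HTML:

    <?php
    // Parse HTML into a DOM and query it with XPath.
    $html = '<html><body><a class="item" href="/a">A</a><a class="item" href="/b">B</a></body></html>';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings on real-world, invalid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@class="item"]/@href') as $attr) {
        echo $attr->nodeValue, "\n";    // prints /a and /b
    }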
Fourth, you can use accelerators for PHP. They sometimes significantly speed up script execution. A list of them can be found, for example, here.
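As one example of such an accelerator, a sketch of php.ini settings for the OPcache bytecode cache (assuming the extension is available in your PHP build; note that for a single long-running command-line script the gain is mostly in compile time, since the script is compiled only once per run):

    ; example OPcache settings in php.ini
    opcache.enable = 1
    opcache.enable_cli = 1          ; needed if the parser runs from the command line
    opcache.memory_consumption = 128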
Lastly, if none of the above is an option, you will have to switch to a language that is faster and closer to the machine, like C.
Nothing more can be said without seeing the implementation and the profiler output.
- > you will have to switch to a language that is faster and closer to the machine, like C. @khaos, Perl is quite enough for me, even though "many" means hundreds of thousands of pages :) No need to run ahead of the locomotive: it may well turn out that most of the time is spent not on parsing but on downloading pages. For example, requesting (and handling) gzip via Accept-Encoding speeds things up nicely, and Last-Modified handling is also useful, etc. - user6550
- @klopp, quite possibly. This seems a bit strange to me too. But without a profiler (or manually measuring time intervals), nothing more definite can be said. - khaos
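To illustrate the download-side suggestions from the comment above (gzip transfer and Last-Modified handling), a minimal PHP/curl sketch; the URL and the saved Last-Modified value are placeholders:

    <?php
    // Ask the server for gzip and make the request conditional on Last-Modified.
    $url          = 'http://example.com/page.html';
    $lastModified = 'Wed, 01 Jan 2014 00:00:00 GMT'; // value saved from a previous fetch

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, '');   // send Accept-Encoding and decode gzip automatically
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('If-Modified-Since: ' . $lastModified));

    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code == 304) {
        // page unchanged since the last run - skip downloading and re-parsing it
    }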