I wrote a scraper in PHP that connects to a site via curl and makes a lot of requests. The site, understandably, protects itself against heavy traffic and temporarily bans my IP. I tried plugging in a list of proxies, but they are unreliable too: different proxies behave differently and stop working quickly. It looks like my only option is to emulate a real user and wait a certain time between page loads. What other ways are there to get around this?
Emulating a user means not only "waiting a certain time," but also requesting the related resources a browser would fetch (images, scripts). A flood of bare requests is effectively a DoS, so there is no magic recipe here, unless of course the site provides an API for getting the data you need. A simple throttled fetch loop is sketched below.
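A minimal sketch of such throttling with curl; the helper name fetchPage, the example URLs, and the 2–5 second delay range are my own assumptions, not something prescribed here:

```php
<?php
// Fetch pages one by one and pause a random interval between requests.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'MyParser/1.0 (contact: admin@example.com)', // identify your bot honestly
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$urls = ['https://example.com/page1', 'https://example.com/page2'];

foreach ($urls as $url) {
    $html = fetchPage($url);
    // ... parse $html here ...
    sleep(rand(2, 5)); // randomized pause so the traffic does not look like a bot burst
}
```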
For a start, check whether the resource publishes rules for spiders (robots.txt and the like) :) LJ, for example, has them, and if you follow them, nobody bans anyone.
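A hypothetical sketch of honoring a Crawl-delay directive from robots.txt; the fallback of 3 seconds and the regular expression are assumptions for illustration:

```php
<?php
// Read robots.txt and respect Crawl-delay if the site declares one.
$robots = @file_get_contents('https://example.com/robots.txt');
$delay  = 3; // fallback pause in seconds if no Crawl-delay is given

if ($robots !== false && preg_match('/^Crawl-delay:\s*(\d+)/mi', $robots, $m)) {
    $delay = (int) $m[1];
}

// Use $delay as the pause between requests in your crawl loop.
sleep($delay);
```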
Then, without fail, use If-Modified-Since. Many sites send Last-Modified correctly and are much more relaxed about requests they can answer with a 304.
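A minimal sketch of a conditional request with curl: send If-Modified-Since and skip re-downloading when the server answers 304 Not Modified. The $lastFetched value would come from your own storage; the URL and variable names are placeholders.

```php
<?php
$lastFetched = '2012-05-01 12:00:00'; // when this page was downloaded last time

$ch = curl_init('https://example.com/page.html');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMECONDITION  => CURL_TIMECOND_IFMODSINCE, // send If-Modified-Since
    CURLOPT_TIMEVALUE      => strtotime($lastFetched),
]);

$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code === 304) {
    // Nothing changed since the last visit: reuse the cached copy.
} elseif ($code === 200) {
    // Fresh content: parse $body and remember the new Last-Modified / fetch time.
}
```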
And, if possible, do all of this through RSS/Atom feeds when they are available. They are easier to process, and the attitude towards feed readers is usually friendlier than towards scrapers (although this, of course, depends on your task).
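A rough sketch of collecting items from an RSS feed instead of scraping HTML pages; the feed URL is a placeholder, and simplexml_load_file() needs allow_url_fopen enabled (otherwise download the feed with curl and pass it to simplexml_load_string()):

```php
<?php
$feed = simplexml_load_file('https://example.com/rss.xml');

if ($feed !== false) {
    // Standard RSS 2.0 layout: channel > item with title/link/pubDate fields.
    foreach ($feed->channel->item as $item) {
        echo (string) $item->title, ' => ', (string) $item->link, PHP_EOL;
    }
}
```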