I wrote a parser in PHP that connects to a site via cURL and makes a lot of requests. The site that serves the content naturally protects itself and bans my IP for a while. I tried plugging in a list of proxies, but they are unreliable too: different proxies perform very differently. Emulating a user and waiting a certain time between page loads does not seem to be an option for me. What are the ways around this situation?
- How long does the site's ban last? - IVsevolod
- What was wrong with a list of proxies and a separate delay for each one? Slow ones can be thrown out and replaced on the fly from any site that lists free proxies. You can end up with two or three fast, decent ones. - afiki
- In this case, it is easier to set up Tor and change the IP for each connection directly from the script; fortunately, that is easy to do. P.S. I once inflated a hit counter this way: everything works great, and the real IP is not exposed. - thunder
- I understand what user emulation is. In general, the situation is this: there is Avito, and it has a set of ads that interest me. I take the list of all pages, then from each page I select all links to the detail pages, then I check the IDs. If I already have an ID, I skip it; if not, I open the link and get the photo, the detailed description, the price, and the phone number, which I convert from figures into text. Everything works very well, but with waits between requests a full export would take about 2 days, which does not suit me. - binliz
- Is it the same picture on the mobile version? In any case, there is less to download there: by the simplest estimate, about 4 times less :) - user6550
2 answers
User emulation means not only "waiting a certain time" but also requesting the related resources (images, scripts). A flood of requests is, in effect, a DoS, so there is no specific recipe, unless of course the site provides an API for getting the data you need.
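To make that concrete, here is a rough sketch of the two pieces: a randomized pause between page loads, and pulling the image and script URLs out of a page so they can be requested too. The function names, the delay range, and the `MyParser/1.0` user agent are my own assumptions for illustration, not anything the site requires.

```php
<?php
// Hypothetical helper: pick a random pause (in microseconds) so that
// requests do not arrive at a fixed, machine-like rhythm.
function randomDelayMicros(float $minSeconds, float $maxSeconds): int
{
    $min = (int) round($minSeconds * 1000000);
    $max = (int) round($maxSeconds * 1000000);
    return random_int($min, $max);
}

// Hypothetical helper: collect the src attributes of <img> and <script>
// tags, i.e. the related resources a browser would also request.
function relatedResourceUrls(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $urls = [];
    foreach (['img', 'script'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $src = $node->getAttribute('src');
            if ($src !== '') {
                $urls[] = $src;
            }
        }
    }
    return $urls;
}

// Fetch one URL with cURL; returns null on failure.
function fetchPage(string $url, string $userAgent = 'MyParser/1.0'): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

In the main loop this would look roughly like `$html = fetchPage($url); usleep(randomDelayMicros(2.0, 6.0));`. The URLs returned by `relatedResourceUrls()` may be relative, so they still need to be resolved against the page URL before fetching.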
For a start, check whether the resource has rules for spiders :) LJ, for example, has them, and if you follow them, nobody gets banned.
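If those rules are published as a robots.txt file (an assumption; a site may state them elsewhere), a very reduced check could look like the sketch below. It only understands `User-agent: *` groups and plain `Disallow` prefixes; `Allow`, wildcards and `Crawl-delay` are deliberately ignored, and both function names are made up for the example.

```php
<?php
// Minimal sketch: collect the Disallow prefixes that apply to "User-agent: *".
function disallowedPrefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;    // does the current group apply to us?
    $inAgentRun = false; // are we inside consecutive User-agent lines?
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$inAgentRun) {
                $applies = false; // a new group starts here
            }
            if ($value === '*') {
                $applies = true;
            }
            $inAgentRun = true;
            continue;
        }
        $inAgentRun = false;
        if ($field === 'disallow' && $applies && $value !== '') {
            $prefixes[] = $value;
        }
    }
    return $prefixes;
}

// A path is allowed if it does not start with any disallowed prefix.
function isPathAllowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

The idea is to download robots.txt once per run, keep the prefix list in memory, and call `isPathAllowed()` before each request.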
Then, without fail, use If-Modified-Since. Many sites return Last-Modified correctly and do not mind serving a 304.
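A conditional request with cURL might be sketched like this. It assumes you store the `Last-Modified` value from the previous fetch of each URL yourself; `conditionalHeaders()` and `fetchIfModified()` are illustrative names, not library functions.

```php
<?php
// Build the extra request headers for a conditional GET.
function conditionalHeaders(?string $lastModified): array
{
    return $lastModified === null
        ? []
        : ['If-Modified-Since: ' . $lastModified];
}

// Returns ['status' => int, 'body' => ?string, 'lastModified' => ?string].
// On 304 the body is null and the stored Last-Modified value is kept.
function fetchIfModified(string $url, ?string $lastModified): array
{
    $newLastModified = null;
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_HTTPHEADER     => conditionalHeaders($lastModified),
        // Called once per response header line.
        CURLOPT_HEADERFUNCTION => function ($ch, string $header) use (&$newLastModified): int {
            if (stripos($header, 'Last-Modified:') === 0) {
                $newLastModified = trim(substr($header, strlen('Last-Modified:')));
            }
            return strlen($header); // cURL expects the number of bytes handled
        },
    ]);
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($status === 304) {
        // Not modified: nothing to re-parse.
        return ['status' => 304, 'body' => null, 'lastModified' => $lastModified];
    }
    return [
        'status'       => $status,
        'body'         => $body === false ? null : $body,
        'lastModified' => $newLastModified,
    ];
}
```

For pages that answer 304 you skip parsing altogether, which saves both your time and the site's traffic.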
And if possible, do all of this through RSS / Atom, if the site offers it. Feeds are easier to process, and sites usually treat feed collectors differently (although this, of course, depends on the task).
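If the site does offer a feed, reading it takes only a few lines with SimpleXML. This sketch handles plain RSS 2.0 only (Atom uses a different structure and is not covered), and `rssItems()` is just a name chosen for the example.

```php
<?php
// Extract title/link pairs from an RSS 2.0 document.
// Returns an empty array if the XML is invalid or has no <channel>.
function rssItems(string $xml): array
{
    libxml_use_internal_errors(true); // do not emit warnings on bad XML
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel)) {
        return [];
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }
    return $items;
}
```

The links from the feed can then go through the same "do I already have this ID" check before any detail page is opened.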
- Excellent recommendations, but unfortunately none of this is available to me in this situation. - binliz