I will give a complete example. Consider a bulletin board site. I want to be the first to learn about a new listing. Say I need an Acer Aspire 5742G: I pick the CPU, the amount of RAM, the video card, and a price range. As soon as a listing appears that matches my filters, I receive an SMS. In other words, listings posted earlier are of no interest to me. There may be 50 such filters per user. The algorithm is as follows:

  1. Take a filter and fetch the corresponding results page with curl.

  2. Collect the listing IDs from the page.

  3. Compare them with the IDs already in the database.

  4. If there is an ID that is not in the database, send an SMS and add it to the database.

  5. Otherwise, do nothing.
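The five steps above can be sketched roughly as follows. This is a minimal illustration, not a real scraper: the `data-item-id` attribute and the SQLite schema are assumptions, the page fetcher is passed in as a callable so the HTTP layer (curl, urllib, etc.) can be swapped in, and the function simply returns the IDs that should trigger an SMS.

```python
import re
import sqlite3

# Assumed markup: each listing carries a data-item-id="..." attribute.
ID_PATTERN = re.compile(r'data-item-id="(\d+)"')

def make_db():
    """In-memory table of listing IDs we have already seen."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE seen (id TEXT PRIMARY KEY)")
    return db

def check_filter(db, fetch, url):
    """Steps 1-5 for one filter: fetch the results page, collect listing
    IDs, and return only the IDs not yet in the database (those are the
    ones that should trigger an SMS notification)."""
    html = fetch(url)                          # step 1: fetch the page
    new_ids = []
    for item_id in ID_PATTERN.findall(html):   # step 2: collect IDs
        row = db.execute("SELECT 1 FROM seen WHERE id = ?",
                         (item_id,)).fetchone()
        if row is None:                        # steps 3-4: unseen -> notify
            new_ids.append(item_id)
            db.execute("INSERT INTO seen (id) VALUES (?)", (item_id,))
    db.commit()
    return new_ids                             # step 5: empty = do nothing
```

Run twice against the same page and the second call returns an empty list, which is exactly the "otherwise, do nothing" branch.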

The problem is that as soon as N users appear (say 100), each with 50 filters, this pile of cron jobs will send 5,000 requests every minute (and with 1,000 users it becomes a full 50k = _ =). You could reduce this by deduplicating identical filters across all users and making only one request per unique filter, but the savings would be crumbs.

Naturally, the bulletin board's server will not be happy with that kind of activity... Buy proxies? How many? Roughly one per user? Is that the only solution? I would be glad to hear your thoughts on this!

  • Your question is not really a fit for this site; at a stretch it belongs under "algorithms", and frankly I would slap hands for this kind of scraping, so the question will most likely get downvoted. But since I have done this myself... The number of proxies is calculated from the idea "what pause should there be between requests through one proxy / how many requests per period may go through one proxy", plus a reserve. In practice, put the proxies into a queue and cycle through them. And you should not parse by filter at all, but by section, of which there are far fewer. When a section's data is updated, check it against the stored filters and notify. - vitidev
  • It also makes sense to check the page's modification time with a HEAD request, or to work with the Last-Modified / If-Modified-Since header pair, to minimize traffic. You can also estimate how often a section is updated and predict the time of the next request, rather than hammering it every minute (though this depends on the target site). - vitidev
  • Good day! Thanks for the answer! So you suggest parsing all sections into my own database and then searching there? Not bad, the number of requests would no longer depend on the number of users! I still need to poll every minute, though, because I want to see new ads first. If the pages had no ads or other dynamic content, I would just compare the Content-Length and be done with it.) I had not thought about those headers, I will go read up on them. Thanks for taking the time to answer! - QxCoder
  • No. "Searching" would mean that even though you hammer the bulletin board less, your users would still hammer your own server. It is better to store the filters as subscriptions and, instead of "searching the database", have new items "check themselves against the filters as they appear and notify the subscriber on a match" - that way the load on the notification service is minimized. - vitidev
  • If you want to offer users not just a notifier "that follows another board" but a full mirror, then you either have to pull the whole site over to yourself (unwise) or proxy requests to the board while updating your data (awkward to implement, because the proxy adds latency and everything slows down). I never had that user-facing task in front of me: I simply selected the filter I needed on the target site and saved it as a subscription in my "notifier". - vitidev
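The Last-Modified / If-Modified-Since handshake vitidev mentions could look roughly like this. The sketch is transport-agnostic: `request(url, headers)` is a stand-in for whatever HTTP client is actually used (curl, urllib, requests) and must return a `(status, headers, body)` triple.

```python
def poll(url, request, cache):
    """Fetch `url` only if it changed since the last poll.

    `cache` maps url -> the Last-Modified value from the previous fetch.
    Returns the body on change, or None on a 304 Not Modified response
    (in which case no page body was transferred at all)."""
    headers = {}
    if url in cache:
        # Ask the server to skip the body if nothing changed since then.
        headers["If-Modified-Since"] = cache[url]
    status, resp_headers, body = request(url, headers)
    if status == 304:                  # unchanged: nothing to parse
        return None
    if "Last-Modified" in resp_headers:
        cache[url] = resp_headers["Last-Modified"]  # remember for next poll
    return body
```

This pairs naturally with vitidev's other suggestion: if a section updates roughly every N minutes, the next poll can be scheduled around that interval instead of every minute.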
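The "subscriptions, not search-by-database" idea can be sketched as follows: each saved filter is a predicate, and every freshly parsed listing checks itself against all subscriptions and yields the users to notify. The field names (`cpu`, `max_price`) are purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subscription:
    """One saved filter, stored per user instead of being re-queried."""
    user: str
    cpu: Optional[str] = None
    max_price: Optional[int] = None

    def matches(self, listing):
        # A None field means "any value is acceptable".
        if self.cpu is not None and listing["cpu"] != self.cpu:
            return False
        if self.max_price is not None and listing["price"] > self.max_price:
            return False
        return True

def dispatch(listing, subscriptions):
    """Called once per new listing: return the users to notify."""
    return [s.user for s in subscriptions if s.matches(listing)]
```

The cost of notification is now proportional to the number of new listings, not to the number of users polling a database.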
