What is the best way to implement this in my case, so as not to waste resources on an UPDATE query every single time?

Everything depends on the server's capacity, of course, but suppose, for example, that each query of the form:

UPDATE news SET count = count + 1 WHERE news_id = 9 

takes 0.002 seconds (the table holds ~9000 rows). The site is heavily loaded: say, 50-70 such queries arrive every second. So (assuming a linear dependence, which is of course a simplification) we spend about 0.1 seconds out of every second just on these counter updates! And the page itself still has to be generated (and it is not always served from the cache)! In practice it turns out to be even worse. In short, this approach does not work here.
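The back-of-envelope estimate above can be checked directly (a rough model that assumes the queries are processed serially; the numbers are the ones from the question):

```python
# Rough load model from the question: 0.002 s per UPDATE,
# 50-70 such queries arriving every second.
query_time = 0.002            # seconds per counter UPDATE
rate_low, rate_high = 50, 70  # queries per second

# Fraction of each wall-clock second spent only on counter updates,
# assuming linear scaling (a simplification, as noted above).
busy_low = query_time * rate_low
busy_high = query_time * rate_high

print(f"{busy_low:.2f}-{busy_high:.2f} s of every second goes to counter updates")
# → 0.10-0.14 s of every second goes to counter updates
```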

The only thing that remains is to write to some temporary store (RAM or disk) and flush to the database only, say, half an hour after the last write.

Everything would seem fine, if not for one fly in the ointment, as they say... On top of all this, I need to keep the ability to sort query results from the database by view count. The data can change within that half hour, so the ORDER BY result can shift significantly. It can be done, of course, but it will not be accurate.

I need all this in order to, for example, display the most popular entry, or the top 15 most popular. In general, I would like to hear expert opinions on the matter. Thanks in advance.

PS: One more question: is it worth doing the same with the "Like" button (I mean the site's own button, not the VKontakte one)? There won't be as many requests there, but they still have to be made. So if you have any advice, I'd be grateful.

UPD:

Well, guys... Thank you all for the feedback and answers. I will try everything. As soon as something becomes clear, I will update the question. I think this is interesting not only to me.

  • "High load" is a loose concept. In most cases a properly indexed database and optimally written queries cope with the load just fine. In other situations various tweaks are used: HEAP tables, clusters, memcache, etc. - KiTE
  • A server-side daemon, MemcacheDB, etc. - Zowie
  • I already have memcache, but with it I lose the ability to sort exactly. If I just swallow that small drawback and write to the database not every half hour but, say, every ten minutes, that is another matter... - Dobby007
  • Start with the wiki, then ask the great and mighty Google. PS: see here - Zowie
  • You described the essence of the problem extremely poorly, so by definition no one can answer. If you provided at least some numbers, some of the (in your view) important sections of code, etc., then perhaps you could be helped - Zowie

5 answers

Write the statistics to a daemon that does the intermediate aggregation and writes to the database. Try Python or Node.js for this task; personally, I prefer Node.js.

    One of the fastest and simplest ways to implement it:

    1. For each news item, add a counter to memcache with a value of 0 and a key like raiting-news-$newsId.
    2. Each time the news item is viewed, increment the value for the corresponding key in memcache by 1 (the incr/decr commands). Since the command is atomic, no synchronization is needed.
    3. Every 5/30/60 minutes, walk over all the news items, increase the corresponding value in the database by the value from memcache, and reset the memcache value to 0.
      1. Select all news IDs in a single query (a lookup by primary key, so it is very fast).
      2. Then, in one request (or in batches of 1000, say), pull all the values from memcache via multiget.
      3. For each value greater than some delta, issue an UPDATE to the database and reset the corresponding memcache value to 0.
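    The steps above can be sketched roughly as follows. This is a minimal illustration, not production code: a real site would use an actual memcached client (e.g. pymemcache) instead of the in-process stub, and the table and key names are the ones from the question.

```python
import sqlite3

class MemcacheStub:
    """Tiny stand-in for memcached's atomic incr/get/set commands."""
    def __init__(self):
        self.data = {}
    def incr(self, key, delta=1):          # step 2: atomic increment
        self.data[key] = self.data.get(key, 0) + delta
        return self.data[key]
    def get(self, key):
        return self.data.get(key, 0)
    def set(self, key, value):
        self.data[key] = value

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE news (news_id INTEGER PRIMARY KEY, count INTEGER DEFAULT 0)")
db.executemany("INSERT INTO news (news_id) VALUES (?)", [(1,), (2,), (9,)])

mc = MemcacheStub()

def on_view(news_id):
    # Steps 1-2: a page view is just one atomic increment, no DB write.
    mc.incr(f"raiting-news-{news_id}")

def flush(delta=0):
    # Step 3: run every 5/30/60 minutes from cron or a daemon.
    ids = [row[0] for row in db.execute("SELECT news_id FROM news")]  # 3.1
    for news_id in ids:                     # 3.2 would use multiget here
        key = f"raiting-news-{news_id}"
        hits = mc.get(key)
        if hits > delta:                    # 3.3: flush and reset
            db.execute("UPDATE news SET count = count + ? WHERE news_id = ?",
                       (hits, news_id))
            mc.set(key, 0)
    db.commit()

for _ in range(5):
    on_view(9)
on_view(1)
flush()
print(db.execute("SELECT count FROM news WHERE news_id = 9").fetchone()[0])  # → 5
```

    Between flushes, the database counts lag by whatever is sitting in memcache, which is exactly the sorting-accuracy trade-off the question worries about.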
    • Thank you very much. By the way, incr and decr are great; you give solid advice, as always. I have another idea: flush not every 5 minutes, but whenever a memcache counter accumulates a value of, say, 10. That way there is no need to sweep the entire database every 5 minutes, and besides, the top news would otherwise hold its position for a while before being overtaken on views (though that is true of your variant too). I'm thinking over how best to do it; right now other things keep me from spending much time on it... - Dobby007
    • > Flush not every 5 minutes, but whenever a memcache counter accumulates a value of, say, 10. Yes, that is also a good option. Though, to be fair, sweeping the whole database was never involved; I described it inaccurately and have updated the answer. The essential difference is that in my variant the load is spread evenly over time with a fixed update lag, while in yours the load depends on the top news but the update lag is more flexible. - Ilya Pirogov
    • But as I understand it, you still end up doing many UPDATEs at once. As far as I can tell, in SQL you cannot apply all the changes in a single query in this case; you have to send every UPDATE the script generates, separated by ";". - Dobby007
    • This idea is good exactly as long as memcached has enough memory; once it doesn't, your statistics will start mysteriously disappearing :) - razielsd
    • > But as I understand it, you still end up doing many UPDATEs at once. Yes, there is no getting away from that here. But I don't think it is worth worrying about (don't forget prepared statements); in the extreme case you can write a simple PL/SQL procedure. > This idea is good exactly as long as memcached has enough memory. These are just 9000 short-lived ints with a very high hit rate :) What is the probability that memcache will decide to evict them rather than some heavy, old, rarely used HTML? - Ilya Pirogov
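    The threshold variant discussed in these comments can be sketched as follows (a simplified illustration: a plain dict stands in for memcached, and the threshold of 10 is the value from the comment):

```python
THRESHOLD = 10   # value suggested in the comment above
counters = {}    # stand-in for memcached incr

def on_view(news_id, flush_to_db):
    """Increment the counter; flush to the DB only once it reaches THRESHOLD."""
    counters[news_id] = counters.get(news_id, 0) + 1
    if counters[news_id] >= THRESHOLD:
        flush_to_db(news_id, counters[news_id])  # one UPDATE per 10 views
        counters[news_id] = 0

flushed = []
for _ in range(25):
    on_view(9, lambda nid, n: flushed.append((nid, n)))

print(flushed)  # → [(9, 10), (9, 10)]  (25 views caused only two UPDATEs)
```

    As the reply notes, the write load now follows the traffic of the top news, but no periodic sweep of all keys is needed.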

    How about Google Analytics? Then, for the "Top 15" block, we make a JS request to their API (I'm not 100% sure the popular pages can be retrieved that way... but surely Google has thought of this?). If you need the data locally, you can pull it periodically through the same API.

    Bonuses: reduced load on your own servers, plus access to related analytics data.

    Disadvantages: it does not work without JS; (possibly) payment for using the API / some features.

    • Google Analytics is good, of course, but I have never liked it, and for one unpleasant reason: my ISP and other local sites are fond of Google Analytics, and when I load a page without an external connection it can take a couple of minutes to appear (I use Opera), since in that "incredible" case Google's servers are unreachable :))) Plus, it is always an extra request I'd rather not make. And if you use the server-side API, it is the same as the caching option above: you lose the ability to sort properly, my friend) - Dobby007
    • So most likely this is not my option. But thanks anyway. Do you have any other ideas? - Dobby007

    There is one good way; I used it on a very heavily loaded site. Create two tables: one MEMORY, the other a normal one. On the MEMORY table, put a trigger that writes into the normal one. It works very quickly. True, there is one pitfall: memory. But if you write the trigger correctly, everything will be fine. You know best.
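    The two-table-plus-trigger idea might look roughly like this. The answer is about MySQL's MEMORY engine; the sketch below uses an in-memory SQLite database only so the trigger logic is runnable, and the hits-table name is invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE news (news_id INTEGER PRIMARY KEY, count INTEGER DEFAULT 0);
    -- Fast "memory" table the application writes to: one row per view.
    CREATE TABLE news_hits (news_id INTEGER);
    -- The trigger forwards every hit into the normal table.
    CREATE TRIGGER forward_hit AFTER INSERT ON news_hits
    BEGIN
        UPDATE news SET count = count + 1 WHERE news_id = NEW.news_id;
    END;
    INSERT INTO news (news_id) VALUES (9);
""")

for _ in range(3):
    db.execute("INSERT INTO news_hits (news_id) VALUES (9)")
db.commit()

print(db.execute("SELECT count FROM news WHERE news_id = 9").fetchone()[0])  # → 3
```

    Note that, as written, each view still causes two writes (the INSERT plus the trigger's UPDATE); the comments under this answer raise exactly that question.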

    • Wow! Now that is useful info! A MEMORY table is always kept in memory, right? And where can I read about triggers, somewhere written competently and with explanations? The Zulevsky documentation usually gives only the bare facts... - Dobby007
    • Honestly, I don't quite understand... How can a trigger work faster if two queries are executed instead of one? That is, instead of a single UPDATE on the regular news table, the MEMORY table is updated and then the news table is updated as well. Maybe I'm being slow, but it seems to me this would take much longer than a plain UPDATE. - Dobby007
    • San, triggers and stored procedures are stored inside the database and are not called from outside. The lion's share of a query's processing time is parsing the query that arrives from outside. Triggers and procedures are always faster than "external" queries, provided they are written correctly, of course. And for further education, read about Hadoop; here is a link to get you started: insight-it.ru/masshtabiruemost/hadoop - Vitaly Kustov
    • @Vitaly Kustov, but how do you write the trigger correctly in this case? By the way, about procedures: they are faster only when writing or changing data; on reads they are not cached, unlike SELECT queries. habrahabr.ru/post/41166 - Dobby007
    • By the way, we tested it together. The difference between procedures and ordinary queries turned out not to be that big; most of the time is taken by the data write itself. And the query here is not complicated at all. You yourself said your queries ran to three pages; here it is one poor little line))) - Dobby007

    Replace the number of visits with the mathematical expectation of the number of visits, and measure the actual number once a week to adjust the expectation. Then there will be no load at all.

    • Mathematical expectation... That is not for me) Maybe it can be done that way... but not by me... I would need to consult a mathematician... - Dobby007
    • As far as I know, the counter of registered VKontakte users worked this way back when it was still shown on the main page. It did not show how many people had actually registered by a given day, but how many should have registered according to the administration's estimates. - igumnov