Hello, here is my question. Until now I have always written ordinary websites in CodeIgniter, with up to 100 unique users per day, so I never bothered with page speed, server configuration and the like, and I have very little competence in this area.

How does the approach to programming change if the site is expected to serve 50,000 unique users per day? Caching is clearly needed, but which kind is best? Which database is better: will MySQL cope with 30 tables? Do I need a dedicated server, and which OS and web server are best suited? Does the style of writing code change somehow? Should I use frameworks (I mean CodeIgniter), or is it better to write everything from scratch myself?

What else would you advise when writing high-load projects?

As I understand it, VKontakte is also written in PHP; how do they manage to cope with such a load?

Thank you very much in advance for your competent answers.

  • > Should I use frameworks, I mean CodeIgniter, or is it better to write from scratch myself? It is better not to write from scratch, and better not to use CodeIgniter. > What else would you advise when writing high-load projects? Familiarize yourself with Big O notation, install and use Athletic (a PHP benchmarking library), and dig up and read everything you can find about load testing. > Do I need a dedicated server? Yes. > Which OS and server are best suited? Any OS; nginx + php-fpm has long been the "gentleman's choice". - etki
  • Well, in general, highload is a very multifaceted topic, and a framework alone will not get you there. Although writing from scratch is also not an option if you are asking questions at this level :) Start with the very basics (for example, phphighload.com), and from there it will become clear where to dig further. Right now there are too many questions about everything at once. - user6550
  • > As I understand it, Vkontakte is also written in PHP, how do they manage to cope with such a load? Lots of servers + functional programming + half of it is written in plain C, which is among the ten fastest languages, + their own PHP-to-bytecode compiler (kphp). > It is clear that caching, which one is better? The specific engine depends on your tasks and preferences, but in any case it lives in RAM. > Which database is better, can MySQL handle 30 tables? It will cope. Again, it depends on many factors, but first of all you need to decide whether you need a SQL or NoSQL engine. After that, if the code is written correctly, swapping one engine for another is relatively painless. > Does the style of writing code change somehow? You can no longer afford crooked, undocumented, unformatted code; it will come back to bite you hard. Follow PSR, leave PHPDoc comments everywhere, isolate modules from each other as much as possible, follow SOLID. - etki

1 answer

Two years later, after gaining some (though I would not say vast) experience wrestling with high-load problems, I can give a somewhat more detailed answer.

The first thing to grasp when moving into the territory of services designed for heavy load is that the old rules no longer apply, and often were fundamentally wrong to begin with. For example, there is no longer such a thing as a JOIN (some storages discussed below do support them, but because of the physical impossibility of doing them correctly this is just another way to shoot yourself in the foot); nor is it true that data should be stored in normalized form, or that logs should be written to a text file (they still can be, but only secondarily), or that some data can be cached in the application's memory. All of this works perfectly within a single-server model but collapses uncontrollably when moving to a scalable system, and then it turns out that none of it was necessary in the first place. Finally, it turns out that there are no universally accepted solutions to these problems.

First of all, load comes down to the fact that one server physically cannot withstand everything that hits it. It may be the file system (say, you are building a photo hosting site: you simply cannot dedicate a single server to static files, because its disk will fill up quickly), or RAM (for example, there is not enough room for a fast cache), or the CPU (queries to the database are too complex, or too many requests are processed at once), or, if you wrote the perfect application, you simply run out of network port bandwidth. As soon as any of these problems appears, you need horizontal scaling: scaling by growing the server fleet. This factor causes most of the difficulties, because the old methods no longer work.

DB

The very first thing that comes up in any such discussion is the database. A database really is quite easy to overload with one complex query, which then slows down itself and its neighbors and eats CPU; the same happens under an influx of users. Therefore the database has to be spread across different nodes, queries have to be optimized, and, as follows from the previous paragraph, data storage has to be optimized too.

Database distribution deserves a paragraph of its own. Once a system is split across several nodes it becomes distributed, which raises its complexity by an order of magnitude. First, no single node can store all the data (otherwise the database would be limited by the disk space of one server); second, any node may die suddenly, and the system must keep working no matter what. The first problem is addressed by so-called sharding: a uniform distribution of data across nodes. As a rule it works like this: within the database cluster there is a set of virtual nodes, which are distributed among the cluster members; when new members join or old ones leave, the virtual nodes are redistributed. When a record is inserted, a hash is computed from its partition key (the partition key may be the primary key itself or a part of it, depending on the chosen database); then a mathematical transformation of this hash yields the virtual node where the record should live (it can be as simple as partition_key_hash % virtual_node_count); and finally, based on that virtual node, the record is written to the corresponding database server.
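
To make the routing concrete, here is a minimal sketch in PHP (all names and the crc32 hash are my own assumptions for illustration; real databases do this internally, typically with consistent hashing):

    <?php
    // A toy router from partition key to physical server via virtual
    // nodes. crc32 stands in for whatever hash the storage really uses.
    function virtualNodeFor($partitionKey, $virtualNodeCount)
    {
        $hash = crc32($partitionKey) & 0x7fffffff; // keep it non-negative
        return $hash % $virtualNodeCount;
    }

    // Hypothetical mapping: virtual node => physical server. In a real
    // cluster this map is rebuilt when members join or leave.
    $owners = [
        0 => 'db-1.internal',
        1 => 'db-2.internal',
        2 => 'db-3.internal',
        3 => 'db-1.internal',
    ];

    $vnode  = virtualNodeFor('user:42', count($owners));
    $server = $owners[$vnode]; // the write goes to this server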

The second problem, members suddenly leaving the cluster, is handled by plain data replication. A write usually goes not only to the selected virtual node but also to the N nodes following it, and if a member suddenly drops out of the cluster, the virtual nodes it owned are restored from the replicas.

In fact, this all means the following:

  • There is no single place in the system where the entire table is collected, and obtaining a complete list of data can be difficult.
  • The system has no such thing as a join: a join would have to take all the data of one table and match it against the data of another, and the tables, as noted above, are spread across a whole cluster (not to mention that they may physically exceed the disk and RAM of any single cluster node), so fetching them can take a very long time.
  • The system's capacity for reads by primary key equals the average throughput of one node multiplied by the number of nodes, so reading at gigabytes per second is an entirely realistic situation. Netflix at one point squeezed a million writes per second out of Cassandra on some three hundred servers (and that is far from the limit).
  • Because replication is asynchronous, reliable reads require that R (the number of replicas read from) + W (the number of replicas written to) be greater than or equal to N + 1, where N is the number of replicas; see the worked example after this list. One could devote an entire chapter to availability and data consistency, and you cannot work with NoSQL without it, but I cannot write it all here. Start with the CAP theorem.
  • There are no transactions in the system. They are theoretically possible, but very difficult and slow.
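
To make the R + W condition concrete, here is a tiny worked check (a sketch; the numbers are the classic Cassandra-style examples):

    <?php
    // Quorum-overlap condition: every read set must intersect every
    // write set, which holds exactly when R + W >= N + 1.
    function readsSeeLatestWrite($n, $r, $w)
    {
        return $r + $w >= $n + 1;
    }

    var_dump(readsSeeLatestWrite(3, 2, 2)); // true:  2 + 2 >= 4
    var_dump(readsSeeLatestWrite(3, 1, 1)); // false: 1 + 1 < 4, stale reads possible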

The optimized queries I mentioned have to be adapted to the specific storage. The absence of joins alone kills most of the headache; beyond that, you mostly have to make sure that queries always use indexes and do not abuse the database (no joins in storages that still allow them, no exotic features that force reading data from disk, and so on). This imposes certain restrictions, and the most far-sighted (or simply knowledgeable) readers have already realized that simply indexing every field would itself be violence against the cluster: because all data is evenly and randomly distributed across all nodes, a query like "give me all users aged 35" would have to go to every node, which is an excessive load on the database (when it could have been answered by a single node). So this approach usually calls for building so-called materialized views: separate tables in which these lookups become simple. In Cassandra, for instance, you can build a separate table whose partition key consists only of the user's age; such a search then touches only that table, and only one database node does the work. Note that this passage about materialized views applies specifically to Cassandra, and other storages may differ, but the principle stays the same: for a frequent lookup, you can maintain extra tables that store the data in a form convenient for that lookup.
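
A sketch of what this denormalization might look like (table and column names are hypothetical; the CQL lives inside a PHP script only to keep all examples in one language):

    <?php
    // Hypothetical schema for the "users aged 35" lookup: the same data
    // is stored twice, in one table per query pattern.
    $statements = [
        // Main table, partitioned by user id.
        'CREATE TABLE users (
            id   uuid,
            age  int,
            name text,
            PRIMARY KEY (id)
        )',
        // Query table, partitioned by age, so "WHERE age = 35" lands on
        // a single partition instead of fanning out to the whole cluster.
        'CREATE TABLE users_by_age (
            age  int,
            id   uuid,
            name text,
            PRIMARY KEY (age, id)
        )',
    ];
    // On every user update the application (or a built-in Cassandra
    // materialized view) must write to both tables.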

Search

That leaves search queries. Searches like to touch many columns, and pre-building data for them usually will not work. In this case one normally uses a dedicated search engine, for example Elasticsearch, which, when possible, is placed on a beefy machine with a lot of RAM so it can keep all its indexes in memory (sometimes the volume of data makes this impossible and the indexes have to live on disk after all). There it is already hard to guarantee that only one node of the cluster does the work (though possible), and in general the principles of operation differ slightly, but the approach to resources is usually more forgiving. In some (critical) cases you may have to resort to extreme measures and build and store such structures directly in your application's memory, but that is a vast topic in which it is easier than ever to shoot yourself in the foot; I will only say that such a structure must be immutable, and every sad scenario of distributing it across all the search nodes must be handled (yes, the application itself will not live on a single server either).
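
A minimal sketch of querying Elasticsearch over its REST API (the index, field, and host names are made up; the _search endpoint and the match query are standard):

    <?php
    // Full-text search against a hypothetical "users" index.
    $query = json_encode([
        'query' => ['match' => ['bio' => 'photographer from Moscow']],
        'size'  => 20,
    ]);

    $ch = curl_init('http://elastic.internal:9200/users/_search');
    curl_setopt_array($ch, [
        CURLOPT_POSTFIELDS     => $query,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);

    $hits = isset($response['hits']['hits']) ? $response['hits']['hits'] : [];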

File storage

About file storage I can only repeat after @cheops: you will need some kind of distributed system that will not nag you about running out of disk space. The easiest way is to use a third-party provider, and fortunately there are heaps of them: S3, Swift, and services compatible with them.
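
For illustration, uploading a file with the official AWS SDK for PHP might look roughly like this (bucket, region, and key are placeholders; check the SDK docs for the current API):

    <?php
    require 'vendor/autoload.php';

    use Aws\S3\S3Client;

    $s3 = new S3Client([
        'version' => 'latest',
        'region'  => 'eu-west-1', // placeholder region
    ]);

    // Store an uploaded photo in a shared bucket instead of local disk.
    $s3->putObject([
        'Bucket'     => 'my-photo-hosting', // placeholder bucket
        'Key'        => 'photos/42/avatar.jpg',
        'SourceFile' => $_FILES['avatar']['tmp_name'],
    ]);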

Caching

The issue of data caching is notoriously subtle, and you have to handle it very carefully. Still, a few clear instructions can be given:

  • Cache data only in a separate, out-of-process network service, unless you want to chase complex, elusive bugs.
  • The caching service must follow the same principles as the databases described above: it should shard easily and painlessly. Since a cache is just a key-value store, you could write the sharding yourself, but you should simply take a decent caching service instead. My personal favorite is Aerospike.
  • Be ready for the so-called dogpile effect, when a hundred (two hundred, three hundred, fourteen hundred) users come for the same resource while it is not in the cache. This can hammer the database and rebuild the cache a hundred (two hundred, three hundred, fourteen hundred) times instead of once per server. Lazy loading requires great care here; see the sketch after this list.
  • Remember the saying about cache invalidation. It is very easy to end up with a consistency problem (updated in the database, not in the cache) that is hard to trace.
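
A minimal sketch of dogpile protection using a lock inside the cache itself (Memcached is just an example here; key names, timeouts, and the rebuild function are made up):

    <?php
    $cache = new Memcached();
    $cache->addServer('cache.internal', 11211);

    function getReport(Memcached $cache)
    {
        $value = $cache->get('report');
        if ($value !== false) {
            return $value;
        }
        // add() is atomic: only one worker acquires the rebuild lock,
        // so the expensive query below runs once, not a hundred times.
        if ($cache->add('report:lock', 1, 30)) {
            $value = rebuildReportFromDatabase(); // hypothetical heavy query
            $cache->set('report', $value, 300);
            $cache->delete('report:lock');
            return $value;
        }
        // Everyone else waits briefly and retries (bounded in real code).
        usleep(100000);
        return getReport($cache);
    }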

Application

Load distribution

Here, at last, we can talk about the application itself. Since we assume that one server cannot take the whole load, we start spreading copies of the same application across servers and balancing the load between them. This is where bad architecture starts shooting you in the foot:

  • You cannot upload files to the server and then fetch them with an AJAX request: the AJAX request will land on a different server.
  • You cannot check for something via the presence of a file: the file will not be there, it is on another server.
  • You cannot cache anything at the level of a single server: that cache can never be invalidated afterwards, because it lives on another server.

Therefore you must write so-called stateless applications from the start. Such applications keep no state in themselves but push it out to the database or the cache; at worst, they record which server a piece of data belongs to (so that, for example, after a restart a server can still serve the files it received from a user).
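
A sketch of the difference (Redis via the phpredis extension, host and key names are my own placeholders):

    <?php
    // Stateful version (breaks behind a load balancer), because the
    // flag only exists on one machine:
    //   touch("/tmp/upload-finished-$userId");
    //   if (file_exists("/tmp/upload-finished-$userId")) { ... }

    // Stateless version: the flag lives in shared storage that every
    // application server can see.
    $userId = 42;
    $redis  = new Redis();
    $redis->connect('cache.internal', 6379);

    $redis->setex("upload-finished:$userId", 3600, '1');

    if ($redis->exists("upload-finished:$userId")) {
        // Any server in the fleet now sees that the upload completed.
    }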

Microservices

One way or another, the application gets split into separate domains (e.g. users, subscriptions, search, master records), which can potentially starve each other. For example, the subscription service routinely takes 100% of the CPU for five minutes, and the user service, whose users are trying to log in at that moment, does not enjoy it at all. In that case it is worth splitting the load not just across servers but across separate micro-applications that talk to each other. After that, subscriptions live on one server and the user manager with the search on another; they stop blocking each other and live a beautiful life.

Programming language and internal architecture

It is impossible not to touch this sore subject, which is usually avoided for political reasons. In PHP this whole colossus is nearly impossible to lift because of the lack of asynchrony: if you need to fetch data for twenty users, PHP will spend twenty times longer than it should. Since the number of microservices will only grow, and each of them can be a little slow, a synchronous model can barely be stretched over this at all. What is needed is support for asynchronous calls, which leads to multithreading, which ultimately leads to the runtime architecture (an event loop with real asynchrony utilizes the CPU better than a thread-per-request model with blocking database queries, though in practice that alone is not reason enough to abandon thread-per-request entirely). I cannot recommend a specific language (there are many; the mainstream ones are the JVM languages, C#, NodeJS), but my experience (however small) says that without this you cannot get anywhere at all.
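
To illustrate the twenty-users problem: even plain PHP can fetch in parallel with curl_multi (the URLs are made up), although in truly asynchronous runtimes this is the default rather than the exception:

    <?php
    // Fetch twenty user profiles concurrently instead of one by one.
    $multi   = curl_multi_init();
    $handles = [];

    foreach (range(1, 20) as $id) {
        $ch = curl_init("http://users.internal/api/users/$id");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$id] = $ch;
    }

    // Total wall time is roughly one request, not the sum of twenty.
    do {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi);
    } while ($running > 0);

    $users = [];
    foreach ($handles as $id => $ch) {
        $users[$id] = json_decode(curl_multi_getcontent($ch), true);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);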

Paradigms

Remember that all the paradigms that worked for you up to this point still work, but are not necessarily optimal. I recently discovered event sourcing, which is a much more involved way of interacting with data than CRUD, but in a couple of places it fits my tasks perfectly and even solves a problem where I was about to do my own sharding, which the next section is about.
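
The core idea of event sourcing in a few lines (a toy sketch, not how a real project is structured): instead of updating a record in place, you append immutable events and rebuild the current state by replaying them.

    <?php
    // Append-only log of what happened to an account.
    $events = [
        ['type' => 'Deposited', 'amount' => 100],
        ['type' => 'Withdrawn', 'amount' => 30],
        ['type' => 'Deposited', 'amount' => 5],
    ];

    // Current state is a pure fold over the event history.
    $balance = 0;
    foreach ($events as $event) {
        $balance += $event['type'] === 'Deposited'
            ? $event['amount']
            : -$event['amount'];
    }
    // $balance === 75; the history itself is never mutated.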

Sharding issues

Sooner or later you may even arrive at the idea that inside the application, for one reason or another, you need to do the sharding yourself. Do not do this. After sharding, sooner or later, comes the need for dynamic resharding, and that can turn into real hell fighting race conditions. Leave it to the database.

Logging

If the logger has not been your best friend until now, either it finally will be, or the project will not take off. Logging every sneeze is a programmer's de facto duty, and it matters even more here because of the complexity of a distributed system spread over a cluster. When something happens, you must have the full picture of the process, spelled out step by step, and a full stack trace attached to every exception; otherwise you simply will not be able to fix anything. And since, when a problem strikes, you do not even know which node it happened on, logs have to be collected centrally. Usually they are shipped to a log collector in whatever way the collector supports; in our projects we use AMQP + Graylog.
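
A rough sketch with Monolog (1.x-style; the handler choice and the request-id processor are illustrative, and the actual transport to Graylog depends on your setup, e.g. a GELF handler or an AMQP shipper):

    <?php
    require 'vendor/autoload.php';

    use Monolog\Logger;
    use Monolog\Handler\StreamHandler;

    $log = new Logger('subscriptions');
    $log->pushHandler(new StreamHandler('php://stdout', Logger::DEBUG));

    // Attach a request id to every record so a single request can be
    // traced across all services and nodes in the centralized log.
    $requestId = bin2hex(random_bytes(8));
    $log->pushProcessor(function (array $record) use ($requestId) {
        $record['extra']['request_id'] = $requestId;
        return $record;
    });

    $log->info('renew started', ['user_id' => 42]);
    $log->error('payment gateway timed out', ['user_id' => 42]);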

Metrics

And finally, metrics. As we know, what is not visible may be broken without you noticing. So be sure to collect every metric you can from the application: responsiveness, failed tasks, suspicious moments, anything measurable at all. Without this you cannot assess the application's performance, find bottlenecks, or even discover that a background task died. The metrics then have to be shipped to visualization tools (Graphite, InfluxDB), but here I cannot suggest anything concrete.
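
For example, the StatsD wire protocol, commonly used in front of Graphite or InfluxDB, is simple enough to speak from plain PHP (host and metric names are placeholders):

    <?php
    // Fire-and-forget metric over UDP in the StatsD text format.
    function metric($name, $value, $type)
    {
        $socket = fsockopen('udp://metrics.internal', 8125, $errno, $errstr, 1);
        if ($socket) {
            fwrite($socket, "$name:$value|$type");
            fclose($socket);
        }
    }

    metric('subscriptions.renewed', 1, 'c');          // counter
    metric('subscriptions.renew_time_ms', 137, 'ms'); // timer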

Useless things

Useless things include choosing a specific OS, tuning for a 25% performance gain, and simple micro-optimizations in code. All of this is completely useless until the company has grown to hundreds of people, each of them a specialist. Until then your time will be needed in a bunch of other places that have to be patched much sooner. When fighting load you are fighting O(n^x): deal first with the places where load grows superlinearly; every other spot is much cheaper to cover with a couple of extra servers.