Hello.

I'm about to start developing an API for a mobile application that is planned to have roughly ten to twenty million daily users.

What technologies are best for building a high-load API service? Is it possible to use PHP as the main language, or is NodeJS / Ruby definitely called for here? If so, why? What are the pros and cons of SQL and NoSQL DBMSs for the planned load? What caching tools are best for rarely updated data?

Thank you in advance for the answers.

Closed because the question needs to be reformulated so that participants can give an objectively correct answer. Closed by D-side, Mike, Dmitriy Simushev, Nicolas Chabanovsky, Mar 30 '16 at 6:53.

The question gives rise to endless debates and discussions based not on knowledge but on opinions. To get an answer, rephrase your question so that it admits an unambiguously correct answer, or delete the question altogether. If the question can be reformulated according to the rules set out in the help center, edit it.

  • 1
    If it's not too much trouble, add the approximate read/write ratio to the question. Is the mobile app your only client? Is the load predominantly writes, or both reads and writes, and in what proportion? This will help you get better-quality answers. - cheops
  • 1
    Will it handle money and accounts? Or is the information exchanged through the API not critical? - cheops
  • 1
    Well, there you have it: atomicity, integrity enforcement; SQL does have advantages, the price may just be too high for them, especially under such loads. - cheops
  • 1
    Atomicity and integrity enforcement exist in all storage systems. - etki
  • 1
    What is the estimated amount of data (in addressable entities and in terabytes), and what hardware (number of servers (say, 2-socket), network balancers, storage systems) are you planning on? - avp

2 answers

Is it possible to use PHP as the main language, or is NodeJS / Ruby definitely called for here?

These are all platforms of the same order, with the last two having a slight edge in features. The real choice is between an interpreted general-purpose language and Java/Scala, C#, and less mainstream options I won't recall here. The interpreted ones mentioned are (as is customary) more tolerant of errors, sometimes harder to debug, and easier to update and to write straight off; the compiled ones are intolerant of errors, but give you both the desired speed gain and often a more interesting API.

What technologies are best for building a high-load API service?

Those that meet your requirements. Writing an API that serves a thousand requests per second is not a problem; the problem is solving the actual business task. Until that task is known, one can only guess where the real trouble spots will be, and a plain PHP + MySQL + Redis bundle can easily withstand those notorious thousands of requests and scale without complaint. Once you need something more interesting (distributed locks, instant cache invalidation on all hosts, parallel processing of tasks across an unknown number of hosts, direct communication between them), you begin stepping on the rakes of specific solutions.
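
As a concrete illustration of that PHP + MySQL + Redis bundle, and of caching rarely updated data (the last question asked), here is a minimal cache-aside sketch. It assumes the phpredis extension and PDO are available; the table and key names are hypothetical.

```php
<?php
// Cache-aside for rarely updated data: try Redis first, fall back to MySQL,
// then populate the cache with a TTL. Names below are illustrative.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'password');

function getCountryList(Redis $redis, PDO $pdo): array
{
    $cacheKey = 'dict:countries';           // hypothetical key
    $cached = $redis->get($cacheKey);
    if ($cached !== false) {
        return json_decode($cached, true);  // cache hit, no DB round trip
    }

    // Cache miss: read from the primary store.
    $rows = $pdo->query('SELECT id, name FROM countries')->fetchAll(PDO::FETCH_ASSOC);

    // Rarely updated data can safely live in the cache for hours.
    $redis->setex($cacheKey, 3600, json_encode($rows));
    return $rows;
}
```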

The biggest problem facing the development of such a service is shared-nothing architecture, which forbids pinning any state to an individual node: state is stored in the database. At first glance this condition seems easy to fulfill; in practice everything starts to leak out. When a file is uploaded, its upload progress must be visible on all nodes at once; the caching layer must be shared so that all cache clients see the same thing; a simple resource lock turns into a whole adventure. It is hard to learn this without collecting your own bruises, so the sooner you get to writing in anything at all and deploying onto a few nodes, the sooner you will feel the problem.
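
For example, per-upload progress can live in shared storage rather than in node-local memory, so any node behind the balancer can report it. A rough sketch, again assuming phpredis; the key scheme and upload id are made up:

```php
<?php
// Keep upload progress in Redis (shared state) instead of node memory,
// so any node behind the balancer can answer a progress query.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Node A, while receiving chunks of upload 42 (hypothetical id):
$redis->setex('upload:42:progress', 600, '37');   // percent done, expires in 10 minutes

// Node B, handling a status request for the same upload:
$progress = $redis->get('upload:42:progress');    // "37", or false if unknown
echo $progress === false ? 'unknown' : $progress . '%';
```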

What are the pros and cons of SQL and NoSQL DBMSs for the planned load?

As I already wrote, SQL brings no pluses here. Compared to NoSQL solutions there is such a thing as a join, but NoSQL abandoned joins quite deliberately, and they hurt more often than they help. The fundamental difference for this task is scaling: almost any NoSQL solution offers horizontal scaling out of the box, while SQL does not know how.

Before rushing to choose a database, you need to understand what is required of it. Besides the standard SQL-compatible databases with their rigid schemas, there are four main types of NoSQL databases: key-value (roughly speaking, access goes solely by primary key, with no ability to run selections), wide-column (superficially similar to SQL but, of course, a completely different thing under the hood), document-oriented (that is, without a rigidly defined structure) and graph (intended for working with complex relationships).

In your case the choice is apparently between a document-oriented and a wide-column database; nothing more concrete can be said, and the choice is yours. I can only say that among wide-column stores Cassandra is usually preferred, and among document-oriented ones Mongo, though I have heard some rather sad reviews about it (and in my current project I chose RethinkDB for a number of reasons). On top of all this, it is worth remembering (and studying) the fact that the whole system now becomes distributed, and it acquires all the beloved problems of distributed systems, including network partitions and asynchronous data propagation between nodes.
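
To make the key-value constraint concrete: access goes strictly by primary key, and any secondary access path has to be modeled as additional keys. A sketch with phpredis, using made-up key names:

```php
<?php
// In a key-value store you fetch by primary key only; there is no
// "WHERE age > 30". Secondary lookups must be maintained by hand.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Store a user document under its primary key.
$redis->set('user:1001', json_encode(['name' => 'Alice', 'age' => 31]));

// Direct lookup by key: cheap and easy to shard horizontally.
$user = json_decode($redis->get('user:1001'), true);

// A "query" by another attribute needs a manually maintained index key.
$redis->sAdd('users:age:31', '1001');
$idsAged31 = $redis->sMembers('users:age:31');
```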

A lot could be said here about performance problems and possible pitfalls, but in the end everything rests simply on how well your platform scales and whether you can roll out a new server within a day, because even the most ideal backend has a throughput limit. The only other thing worth mentioning is that the move to distributed systems, as a rule, also requires swapping the pull-on-demand paradigm for push-on-change.
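
One way to get push-on-change is a publish/subscribe channel: the writer announces a change the moment it happens, and interested nodes react instead of polling. A sketch on Redis pub/sub, assuming phpredis; the channel and payload names are invented:

```php
<?php
// Writer side: after updating a record, publish an invalidation event
// (push-on-change) instead of letting every node poll for changes.
$publisher = new Redis();
$publisher->connect('127.0.0.1', 6379);
$publisher->publish('changes:countries', json_encode(['key' => 'dict:countries']));

// Each node runs one long-lived subscriber process that reacts immediately:
$subscriber = new Redis();
$subscriber->connect('127.0.0.1', 6379);
$subscriber->subscribe(['changes:countries'], function ($redis, $channel, $message) {
    // e.g. drop or refresh the local cache entry named in $message
    error_log("change on $channel: $message");
});
```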

You can develop such a service with any technology, including PHP. By putting several servers side by side, you can scale read operations almost infinitely through replication (both at the DBMS level and at the level of NoSQL solutions). Problems begin when you have many simultaneous connections, even if you can dedicate a separate thread to each request: there are simply so many of them that the processor starts losing too much time switching between them.
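
A sketch of the read/write split that makes replication pay off: writes go only to the primary, reads spread across replicas. Hostnames and credentials are placeholders; plain PDO is assumed:

```php
<?php
// Writes go to the primary; reads go to a replica. Read throughput then
// scales by adding replicas behind this selection logic.
$primary = new PDO('mysql:host=db-primary;dbname=app', 'user', 'password');
$replica = new PDO('mysql:host=db-replica-1;dbname=app', 'user', 'password');

// Write path: only the primary accepts writes.
$stmt = $primary->prepare('INSERT INTO events (payload) VALUES (?)');
$stmt->execute([json_encode(['type' => 'signup'])]);

// Read path: any replica will do. Mind replication lag if a client
// must immediately read its own write.
$rows = $replica->query('SELECT id, payload FROM events ORDER BY id DESC LIMIT 10')
                ->fetchAll(PDO::FETCH_ASSOC);
```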

Since you are serving mobile applications, the connection to the client may be quite poor, which means many slow clients. Why is a slow client terrible for a classic server? Suppose your server holds 1000 simultaneous requests, and 500 EDGE clients connect and spend ten minutes waiting to receive their responses. Your server is now serving not 1000 but 500 simultaneous requests, since 500 workers hang waiting on clients. The remaining connections stop keeping up and requests are accepted more slowly, so soon your server can serve only 250 requests, and after a while it crashes outright. On the web this matters less, since clients there tend to have decent connections, but in mobile development it is a frequent phenomenon: coverage differs everywhere, and connection quality depends heavily on it.

When people talk about NodeJS or Ruby servers, there is nothing magic there; the servers are simply newer and designed a little differently. Instead of a thread waiting on each connection, a single thread polls the connections in non-blocking mode: if an answer has arrived from a client, it processes it; if not, it moves on to the next connection. This kills two birds with one stone: you do not switch between threads/processes (saving time), and you solve the slow-client problem, since the polling thread never waits, it constantly works. The only requirement is that your server-side code must not be slow, since there is only one polling thread. This is event-driven architecture, and solutions for it exist in PHP as well; with NodeJS or Ruby it comes practically out of the box, while with PHP you have to look for one. Whichever language you choose, the important thing is to master it.
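
For instance, an event-driven HTTP server in PHP might look like this sketch built on ReactPHP, one of the PHP solutions alluded to above (it assumes `composer require react/http` has been run):

```php
<?php
// A single event loop polls all connections in non-blocking mode: a slow
// client keeps a socket open but does not occupy a worker thread.
require __DIR__ . '/vendor/autoload.php';

use Psr\Http\Message\ServerRequestInterface;
use React\Http\HttpServer;
use React\Http\Message\Response;
use React\Socket\SocketServer;

$http = new HttpServer(function (ServerRequestInterface $request) {
    // Handlers must not block: one slow handler stalls the whole loop.
    return new Response(200, ['Content-Type' => 'text/plain'], "ok\n");
});

$http->listen(new SocketServer('0.0.0.0:8080'));
// The event loop starts automatically at the end of the script.
```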

Event-driven architecture also lets you use WebSockets, i.e. persistent connections from clients. You can allow as many of them as you like, since a connection sitting in pending mode consumes almost no resources (neither memory nor CPU).

And here we come smoothly to how you will handle writes. Even if you do not use event-driven solutions but put a row of classic forking servers side by side and replicate the data at the database level, you still have a problem with writes, since writes do not scale through replication. You need to write very quickly, so that the client does not wait for a response from the server and hold the connection, but goes about its business: the faster you serve clients, the easier your life is. Writing straight to the database is not an option: classic DBMSs are slow and unhurried, they strive to do everything in transactions, and that is extra overhead. There are two ways. One is a fast NoSQL store deployed entirely in RAM (Redis, MongoDB): record quickly into memory, release the client, then sort things out and persist in the background. The other is queues: drop everything into a queue and process the events from it slowly in the background. Better still if some of the information is transformed and aggregated right in RAM and only then written into the slow bulk database.
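
A minimal version of the queue option on Redis lists, assuming phpredis; the queue name is invented. The request handler only enqueues and releases the client; a separate worker does the slow write:

```php
<?php
// Producer (inside the request handler): enqueue and return immediately.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->lPush('queue:events', json_encode(['user' => 1001, 'action' => 'like']));
// ...respond to the client right away; the heavy write happens later.

// Consumer (a separate long-running worker process):
while (true) {
    $item = $redis->brPop(['queue:events'], 5);   // block up to 5 seconds
    if (!$item) {
        continue;                                  // timeout, poll again
    }
    [$queue, $payload] = $item;
    $event = json_decode($payload, true);
    // ...slow, transactional write into the bulk database goes here.
}
```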

What is good about modern NoSQL solutions is that almost all of them are written for non-blocking connections and can hold thousands of simultaneous connections. What is bad is that they are not classic DBMSs: relations and joins are hard to organize, transactions are a weak spot, and in general you have to work somewhat differently. On the other hand, they live in RAM and cluster well. Getting data into them is a pleasure: no surrogate schema of a hundred tables merged together just to make writes manageable, and no painful ring replication architectures.

In practice, pretty much everything you mentioned gets used together: part can be written in PHP; for WebSockets it may be more convenient to work through a NodeJS server; queues and pub/sub solutions may be easier to organize through Redis or RabbitMQ; raw data goes into that same Redis or MongoDB; and a classic DBMS serves as the long-term durable storage (although you can do without one). If you ever bring your API to the stated loads, you will most likely get to try, if not all of this, then almost all of it :)

  • 3
    "Record quickly into memory, release the client, sort it out and persist in the background" - why do it yourself? There are plenty of storage systems that do exactly that on their own, not to mention that synchronous/asynchronous commits and the number of replicas that must confirm a write are controlled at the level of an individual request. - etki
  • 1
    You don't have to do it yourself; the main principle is that the faster a client is served, the better. If it's not too much trouble, give an example of such a storage system; I think NTaul will find it interesting. Me too. - cheops
  • 2
    HBase does this to a greater or lesser extent: it performs a lightweight write first and carries out the heavy work later (and, of course, I have recalled and checked far from all storage systems). One way or another, a write-ahead log is the obvious solution everyone uses, because it is wiser than holding the client until the indexes are rebuilt. - etki