The task is as follows. Large arrays of data arrive in JSON format; from each of them a few fields need to be extracted and saved to the database. The problem is that there are a lot of these arrays, and processing a single one takes a considerable amount of time.

I would like to organize the following system.

    http-parser -> | Queue | -> {Worker_1} -> {DB}
                   | Item1 |    {Worker_2}
                   | Item2 |    {Worker_n}
                   | ItemN |

A small Ruby script receives a response from the server, adds it to the queue, and goes back to waiting for the next response. The workers, in turn, watch the queue: as soon as something appears in it, they pick it up, process it, and save the result to the database. A rough in-process sketch of the flow I have in mind is below, just to illustrate it.
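
This uses Ruby's built-in Queue; all names and payloads here are made up for illustration:

    require 'json'

    queue = Queue.new

    # workers: pull items off the queue, process them, save to the database
    workers = 4.times.map do
      Thread.new do
        while (payload = queue.pop)   # returns nil once the queue is closed and empty
          data = JSON.parse(payload)
          # ... pick out the needed fields and save them to the database ...
        end
      end
    end

    # the HTTP parser just pushes each response and goes back to waiting
    queue.push('{"example": "payload"}')

    queue.close
    workers.each(&:join)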

At the moment the difficulty is that the parser is responsible for everything: parsing, extracting the fields, and saving to the database, so processing the responses from the server is very slow.

Idea zero was simply to move the saving to the database into a separate thread, but that is not a very flexible solution.

The first real idea was to dump the entire JSON object straight into the database and process it from there, but then the question is how to lock a record so that worker_n does not end up processing the same record as another worker.

In other words, I am interested in known methods/software that reduce the time spent writing to the database. In particular, some Amazon Web Services products look interesting: I skimmed the Amazon Simple Queue Service docs, but I am not sure it covers my case.

  • Does your database let you assign record IDs without gaps in the numbering? If so, you can do without locks: worker N processes only every M-th record (where M is the number of workers), with a constant offset of N. - Mike
  • By the way, that idea is sound! Thank you very much. Any thoughts on how to notify the workers, and what notification mechanisms exist in general? So far only Rails comes to mind. - Ascelhem

2 answers

There are, generally speaking, plenty of options.

You can use Amazon SQS, indeed. The usage examples look straightforward. But it is a cloud service, and for such a (seemingly) simple task many would consider it a cannon against sparrows, although you did express interest in it yourself.
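
A rough sketch of how that might look with the aws-sdk-sqs gem; the queue URL, region, and payload are made up for illustration:

    require 'aws-sdk-sqs'
    require 'json'

    sqs = Aws::SQS::Client.new(region: 'us-east-1')
    queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/parser-queue' # made-up queue

    # producer side: the HTTP parser just drops the raw JSON into the queue
    sqs.send_message(queue_url: queue_url, message_body: '{"example": "payload"}')

    # worker side: long-poll, process, delete the message only after saving
    loop do
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)
      resp.messages.each do |msg|
        data = JSON.parse(msg.body)
        # ... extract the needed fields and save them to the database ...
        sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
      end
    end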

You can run your own message broker with delivery acknowledgements; there are many of them, RabbitMQ for example.
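
With the bunny gem it might look roughly like this (queue name and payload are made up; the point is that the message is acknowledged only after it has been saved):

    require 'bunny'
    require 'json'

    conn = Bunny.new   # amqp://guest:guest@localhost by default
    conn.start
    ch    = conn.create_channel
    queue = ch.queue('parser_jobs', durable: true)

    # producer side: publish the raw JSON
    ch.default_exchange.publish('{"example": "payload"}',
                                routing_key: queue.name, persistent: true)

    # worker side: ack only after the record has been saved, so an unacknowledged
    # message is redelivered if the worker dies mid-processing
    queue.subscribe(manual_ack: true, block: true) do |delivery_info, _properties, body|
      data = JSON.parse(body)
      # ... extract the needed fields and save them to the database ...
      ch.ack(delivery_info.delivery_tag)
    end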

You can use PostgreSQL and its "advisory lock" feature: a lock whose meaning only the application knows; the DBMS itself pays no attention to it. This approach is used by the Que queue, where the SQL part is already implemented. The nice thing about it is that you can enqueue tasks inside ACID transactions together with changes to other data.
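
A rough sketch with the que gem; the class name and payload are illustrative, and it assumes Que is already configured on top of ActiveRecord and PostgreSQL:

    require 'que'
    require 'json'

    # a job class; the advisory-lock based job claiming is implemented inside the gem
    class SaveParsedFields < Que::Job
      def run(raw_json)
        data = JSON.parse(raw_json)
        # ... extract the needed fields and save them to the database ...
      end
    end

    # the job can be enqueued in the same ACID transaction as other data changes
    ActiveRecord::Base.transaction do
      SaveParsedFields.enqueue('{"example": "payload"}')
    end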


And now bad ideas.

You could use Concurrent::Edge::Channel from concurrent-ruby. But for tasks where most of the work is active Ruby code rather than I/O this is a bad idea, because MRI has a GIL, so only one task actually runs at a time.

You could also have each worker take every N-th task from the database, as suggested in the comments. But that works well only if all tasks take roughly the same time (otherwise some workers will lag behind the others) and no worker ever crashes (since each N-th task can be completed by exactly one specific worker). A sketch of that scheme follows.
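
A rough sketch, assuming a raw_responses table with gap-free ids and the pg gem; the table, column, and environment variable names are made up:

    require 'pg'
    require 'json'

    db           = PG.connect(dbname: 'parser')
    num_workers  = Integer(ENV.fetch('NUM_WORKERS', 4))
    worker_index = Integer(ENV.fetch('WORKER_INDEX', 0))   # 0 .. num_workers - 1

    loop do
      rows = db.exec_params(
        'SELECT id, payload FROM raw_responses
          WHERE processed = false AND id % $1::int = $2::int
          ORDER BY id LIMIT 100',
        [num_workers, worker_index]
      )
      rows.each do |row|
        data = JSON.parse(row['payload'])
        # ... extract the needed fields, save them, then mark the row as done ...
        db.exec_params('UPDATE raw_responses SET processed = true WHERE id = $1::int', [row['id']])
      end
      sleep 1 if rows.ntuples.zero?
    end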

    At the moment the difficulty is that the parser is responsible for everything: parsing, extracting the fields, and saving to the database, so processing the responses from the server is very slow.

    Congratulations, this is called a god object, and you should get rid of it in any case.

    As for the main task: the head-on option is to dump the data to a file and parse it in a separate process. There are more elegant solutions, though, for example the already-mentioned Sidekiq.

    It works like this: you create a class with a public perform method and include Sidekiq::Worker in it.

    That is, you end up with something like this:

        class DelayedParser
          include Sidekiq::Worker

          def perform(json)
            # here you do something with the data
          end
        end

        DelayedParser.perform_async(json) # json is the data that needs to be processed