Today we will talk about event logs, quantitative metrics, and monitoring of both, all with the goal of speeding up the team's response to incidents and reducing downtime of the target system.
Erlang/OTP, as a framework and an ideology for building distributed systems, gives us regulated approaches to development, tools, and implementations of standard components. Suppose we have applied the potential of OTP and have gone all the way from prototype to production. Our Erlang project feels great on production servers, the code base is constantly evolving, new requirements and functionality appear, new people join the team, and everything seems to be fine. But sometimes something goes wrong, and technical problems, multiplied by the human factor, can lead to an accident.
Since it is impossible, or not economically feasible, to prepare in advance for absolutely every possible failure or problem, downtime in case of failure must be reduced through organizational and software measures.
In information systems, there will always be some probability of failures of various kinds.
Let us delineate responsibility right away: the operation of computing hardware and data networks is the concern of infrastructure monitoring, organized, for example, with zabbix. Much has been written about installing and configuring such monitoring, so we will not repeat it.
From a developer's point of view, the problem of availability and quality comes down to early detection of errors and performance problems, and to early reaction to them. This requires approaches and means of assessment. So let us try to derive quantitative metrics whose analysis can significantly improve quality at different stages of a project's development and operation.
Let me remind you once again of the importance of an engineering approach and of testing in software development. Erlang/OTP offers two testing frameworks at once: eunit and common test.
The number of passed and failed tests, their running time, and the percentage of code covered by tests can serve as metrics for an initial assessment of the state of the code base and its dynamics. Both frameworks allow you to save test results in JUnit format.
For example, for rebar3 and ct, you need to add the following lines to rebar.config:
{cover_enabled, true}.
{cover_export_enabled, true}.
{ct_opts, [
    {ct_hooks, [{cth_surefire, [{path, "report.xml"}]}]}
]}.
The number of successful and unsuccessful tests will allow you to build a trend graph:
Looking at it, you can evaluate the team's dynamics and test regressions. For example, in Jenkins this graph can be obtained using the Test Results Analyzer Plugin.
If tests turn red or start taking too long, these metrics let you send a release back for rework already at the build and automated-testing stage.
In addition to operating system metrics, monitoring should include application metrics, such as the number of views per second, the number of payments, and other critical indicators.
In my projects, I use the ${application}.${metrics_type}.${name} template for naming metrics. Such naming lets you obtain lists of metrics like
messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
Perhaps the more metrics, the easier it is to understand what is happening in a complex system.
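The naming template above can be implemented by a tiny helper. A minimal sketch, assuming a counter value is already at hand; the function name is my own, not part of any library:

```erlang
%% Builds a metric name following the ${application}.${metrics_type}.${name}
%% template described above.
make_metric_name(App, Type, Name) ->
    lists:flatten(io_lib:format("~s.~s.~s", [App, Type, Name])).
```

For example, make_metric_name("messaging", "systime_subs", "messages.delivered") returns "messaging.systime_subs.messages.delivered".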
Special attention should be paid to monitoring the Erlang VM. The let-it-crash ideology is beautiful, and proper use of OTP will certainly help restart the fallen parts of an application inside the Erlang VM. But do not forget about the Erlang VM itself: it is hard to crash, but possible. All the ways come down to resource exhaustion. Let us list the main ones:
Atom table overflow.
Atoms are identifiers whose main task is to improve code readability. Once created, atoms remain in the memory of the Erlang VM instance forever, since they are not cleared by the garbage collector. Why does this happen? The garbage collector works separately in each process with that process's own data, whereas atoms can be spread over the data structures of many processes.
By default, 1,048,576 atoms can be created. In articles about how to kill an Erlang VM, you can usually find something like
[list_to_atom(integer_to_list(I)) || I <- lists:seq(erlang:system_info(atom_count), erlang:system_info(atom_limit))]
as an illustration of this effect. It would seem an artificial problem, unattainable in real systems, but such cases do occur... For example, in an external API handler, binary_to_atom/2 is used instead of binary_to_existing_atom/2, or list_to_atom/1 instead of list_to_existing_atom/1, when parsing requests.
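A defensive parsing sketch for such a handler; the function and the error atom are my own names, not from any library:

```erlang
%% Parses an external key without creating new atoms: unknown input
%% yields an error tuple instead of growing the atom table.
parse_key(Bin) when is_binary(Bin) ->
    try
        {ok, binary_to_existing_atom(Bin, utf8)}
    catch
        error:badarg -> {error, unknown_key}
    end.
```

binary_to_existing_atom/2 raises badarg for an atom that has never been created, so malicious or malformed input cannot exhaust the atom table.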
You can monitor the state of atoms using the following parameters:
- erlang:memory(atom_used) - the amount of memory used for atoms;
- erlang:system_info(atom_count) - the number of atoms created in the system. Together with erlang:system_info(atom_limit) you can calculate atom utilization.

Process leaks
It should be said right away that when process_limit (the +P argument of erl) is reached, the erlang vm does not crash but enters a degraded state: for example, even connecting to it will most likely be impossible. Ultimately, exhausting the memory available to the leaked processes will bring the erlang vm down.
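The utilization calculation is a one-liner; a minimal sketch, with the function name being my own:

```erlang
%% Process table utilization in percent; values approaching 100
%% mean process_limit is about to be hit.
process_utilization() ->
    Count = erlang:system_info(process_count),
    Limit = erlang:system_info(process_limit),
    Count * 100 / Limit.
```

The same pattern works for atoms with atom_count and atom_limit.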
- erlang:system_info(process_count) - the number of processes currently alive. Together with erlang:system_info(process_limit) you can calculate process utilization;
- erlang:memory(processes) - memory allocated for processes;
- erlang:memory(processes_used) - memory used by processes.

Process mailbox overflow
A typical example of this problem: a sender process sends messages to a receiver process without waiting for confirmation, while receive in the receiver process ignores all these messages because of a missing or incorrect pattern. As a result, the messages accumulate in the mailbox. Although erlang has a mechanism for slowing down a sender when the handler cannot keep up, the vm still crashes once the available memory is exhausted.
etop can help you understand whether mailboxes are overflowing:
$ erl -name etop@host -hidden -s etop -s erlang halt -output text -node dest@host -setcookie some_cookie -tracing off -sort msg_q -interval 1 -lines 25
As a metric for continuous monitoring, you can take the number of problematic processes. To identify them, you can use the following function:
top_msg_q() ->
    [{P, RN, L, IC, ST}
     || P <- processes(),
        {_, L} <- [process_info(P, message_queue_len)],
        L >= 1000,
        [{_, RN}, {_, IC}, {_, ST}] <-
            [process_info(P, [registered_name, initial_call, current_stacktrace])]].
This list can also be written to the log; then, when a notification arrives from monitoring, analyzing the problem becomes simpler.
Binary leaks
Memory for large (more than 64 bytes) binaries is allocated on a shared heap. An allocated block has a reference counter showing the number of processes that have access to it. When the counter reaches zero, the block is freed. The simplest scheme, but, as they say, there are nuances. In principle, a process can generate so much garbage on the heap that the system does not have enough memory to perform the cleanup.
The monitoring metric is erlang:memory(binary), which shows the memory allocated for binaries.
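When that metric keeps growing, it helps to know which processes hold the references. A sketch, with the helper name being my own; it relies on process_info(Pid, binary), which returns a list of {Id, Size, RefCount} tuples for the off-heap binaries a process references:

```erlang
%% The N processes referencing the most off-heap binary memory; a
%% starting point when erlang:memory(binary) keeps growing.
top_binary_holders(N) ->
    Sized = [{P, lists:sum([Size || {_Id, Size, _RefC} <- Bins])}
             || P <- processes(),
                {binary, Bins} <- [process_info(P, binary)]],
    lists:sublist(lists:reverse(lists:keysort(2, Sized)), N).
```

Note that the reported sizes overlap when several processes reference the same binary, so the numbers are a heuristic, not an exact accounting.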
So, we have covered the cases that can bring the vm down; besides them, it is useful to monitor other, no less important, parameters that directly or indirectly affect the correct functioning of your applications:
- erlang:memory(ets) - memory allocated for ets tables;
- erlang:memory(code) - memory allocated for loaded code;
- erlang:memory(system) - memory consumed by the erlang runtime;
- erlang:memory(total) - the sum of the memory consumed by processes and by the runtime;
- erlang:statistics(reductions);
- erlang:statistics(run_queue);
- erlang:statistics(runtime) - lets you understand, without analyzing the logs, whether a restart has occurred;
- erlang:statistics(io).

Let's create a file containing the application metrics and the erlang vm metrics and update it every N seconds. For each erlang node, the metrics file must contain the metrics of the applications running on it and the metrics of the erlang vm instance. The result should look something like this:
messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
….
erlang.io.input = 2205723664
erlang.io.output = 1665529234
erlang.memory.binary = 1911136
erlang.memory.ets = 1642416
erlang.memory.processes = 23596432
erlang.memory.processes_used = 23598864
erlang.memory.system = 50883752
erlang.memory.total = 74446048
erlang.processes.count = 402
erlang.processes.run_queue = 0
erlang.reductions = 148412771
....
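A sketch of such a dump function, using only the standard calls listed above; the function name, the metric keys, and the file layout are my own choices, and the application metrics would be appended the same way:

```erlang
%% Dumps erlang vm metrics in the "key = value" format shown above,
%% ready to be shipped by zabbix_sender. Call it from a timer every
%% N seconds.
dump_vm_metrics(Path) ->
    {{input, In}, {output, Out}} = erlang:statistics(io),
    {Reductions, _SinceLast} = erlang:statistics(reductions),
    Metrics =
        [{"erlang.io.input", In},
         {"erlang.io.output", Out},
         {"erlang.memory.binary", erlang:memory(binary)},
         {"erlang.memory.ets", erlang:memory(ets)},
         {"erlang.memory.processes", erlang:memory(processes)},
         {"erlang.memory.processes_used", erlang:memory(processes_used)},
         {"erlang.memory.system", erlang:memory(system)},
         {"erlang.memory.total", erlang:memory(total)},
         {"erlang.processes.count", erlang:system_info(process_count)},
         {"erlang.processes.run_queue", erlang:statistics(run_queue)},
         {"erlang.reductions", Reductions}],
    Lines = [io_lib:format("~s = ~p~n", [K, V]) || {K, V} <- Metrics],
    file:write_file(Path, Lines).
```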
With the help of zabbix_sender we will send this file to zabbix, where a graphical representation and the ability to create automation and notification triggers become available.
Now, having the metrics in the monitoring system, plus automation triggers and notification events created on top of them, we have a chance to avoid accidents by reacting in advance to dangerous deviations from the fully functional state.
While a project has one or two servers, you can probably still live without centralized log collection; but as soon as a distributed system with many servers, clusters, and environments appears, the problem of collecting and conveniently viewing logs has to be solved.
To write logs in my projects, I use lager. Often, on the way from prototype to production, projects go through the following stages of collecting logs:
We will talk about the pros, cons, and quantitative metrics of the last approach in light of a specific implementation, lager_clickhouse, which I use in most of the projects I develop. A few words about lager_clickhouse: it is a lager backend for saving events to clickhouse. At the moment it is an internal project, but there are plans to open-source it. While developing lager_clickhouse, I had to work around some peculiarities of clickhouse, for example buffering events so as not to make frequent requests to it. The effort paid off with stable operation and good performance.
The main disadvantage of this approach is the extra entity, clickhouse, and the need to develop code for saving events to it plus a user interface for analyzing and searching events. Also, for some projects the use of tcp to send logs may be critical.
But the pros, I think, outweigh all the possible disadvantages.
Easy and quick event search:
An exemplary view of the log viewing interface is shown in the screenshot:
Ability to automate.
With the introduction of a log repository, it became possible to obtain real-time information on the number of errors, the occurrence of critical failures, and system activity. By setting certain thresholds, we can generate emergency events about the system leaving its functional state, whose handlers perform automated actions to eliminate that state and notify the team members responsible for the functionality:
A further development of automating the handling of emergency events was the use of lua scripts. Any developer or administrator can write a script to process logs and metrics. Scripts bring flexibility and allow creating personal automation and notification scenarios.
To understand the processes occurring in a system and to investigate incidents, it is vital to have quantitative indicators and event logs, as well as convenient tools for analyzing them. The more information about the system is available to us, the easier it is to analyze its behavior and fix problems already at the stage of their appearance. And when our measures have not worked, we always have graphs and detailed logs of the incident.
How do you operate Erlang/Elixir solutions, and what interesting cases have you encountered in production?
Source: https://habr.com/ru/post/437720/