Today we will talk about event logs, quantitative metrics, and monitoring of both, all with the goal of speeding up the team's response to incidents and reducing downtime of the target system.
Erlang/OTP, as a framework and an ideology for building distributed systems, gives us regulated approaches to development, tools, and implementations of standard components. Suppose we have applied the potential of OTP and have gone all the way from prototype to production. Our Erlang project feels great on production servers, the code base is constantly evolving, new requirements and functionality appear, new people join the team, and everything seems to be fine. But sometimes something goes wrong, and technical problems, multiplied by the human factor, can lead to an accident.
Since it is impossible, or not economically feasible, to prepare in advance for absolutely every possible failure or problem, downtime in case of failure must be reduced through organizational and software measures.
In information systems, there will always be some probability of failures of various kinds.
Let us delineate responsibility right away: the operation of computing hardware and data networks is the concern of infrastructure monitoring, organized, for example, with zabbix. Much has been written about installing and configuring such monitoring, so we will not repeat it.
From a developer's point of view, the problem of availability and quality comes down to early detection of errors and performance problems, and to early reaction to them. This requires approaches and means of assessment. So let us try to derive quantitative metrics whose analysis can significantly improve quality at different stages of a project's development and operation.
Let me remind you once again of the importance of an engineering approach and of testing in software development. Erlang/OTP offers two testing frameworks at once: eunit and common test.
The number of passed and failed tests, their running time, and the percentage of code covered by tests can serve as metrics for an initial assessment of the state of the code base and its dynamics. Both frameworks allow you to save test results in JUnit format.
For example, for rebar3 and ct, you need to add the following lines to rebar.config:
{cover_enabled, true}.
{cover_export_enabled, true}.
{ct_opts, [
    {ct_hooks, [{cth_surefire, [{path, "report.xml"}]}]}
]}.
The number of successful and unsuccessful tests will allow you to build a trend graph:
Looking at it, you can evaluate the team's dynamics and test regressions. For example, in Jenkins this graph can be obtained using the Test Results Analyzer Plugin.
If tests turn red or start taking too long, these metrics let you send a release back for rework already at the build and automated-testing stage.
In addition to operating system metrics, monitoring should include application metrics, such as the number of views per second, the number of payments, and other critical indicators.
In my projects, I use the ${application}.${metrics_type}.${name} template for naming metrics. Such naming lets you obtain lists of metrics like
messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
Perhaps the more metrics, the easier it is to understand what is happening in a complex system.
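The naming template above can be implemented by a tiny helper. A minimal sketch, assuming a counter value is already at hand; the function name is my own, not part of any library:

```erlang
%% Builds a metric name following the ${application}.${metrics_type}.${name}
%% template described above.
make_metric_name(App, Type, Name) ->
    lists:flatten(io_lib:format("~s.~s.~s", [App, Type, Name])).
```

For example, make_metric_name("messaging", "systime_subs", "messages.delivered") returns "messaging.systime_subs.messages.delivered".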
Special attention should be paid to monitoring the Erlang VM. The let-it-crash ideology is beautiful, and proper use of OTP will certainly help restart the fallen parts of an application inside the Erlang VM. But do not forget about the Erlang VM itself: it is hard to crash, but possible. All the ways come down to resource exhaustion. Let us list the main ones:
Atom table overflow.
Atoms are identifiers whose main task is to improve code readability. Once created, atoms remain in the memory of the Erlang VM instance forever, since they are not cleared by the garbage collector. Why does this happen? The garbage collector works separately in each process with that process's own data, whereas atoms can be spread over the data structures of many processes.
By default, 1,048,576 atoms can be created. In articles about how to kill an Erlang VM, you can usually find something like
[list_to_atom(integer_to_list(I)) || I <- lists:seq(erlang:system_info(atom_count), erlang:system_info(atom_limit))]
as an illustration of this effect. It would seem an artificial problem, unattainable in real systems, but such cases do occur... For example, in an external API handler, binary_to_atom/2 is used instead of binary_to_existing_atom/2, or list_to_atom/1 instead of list_to_existing_atom/1, when parsing requests.
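A defensive parsing sketch for such a handler; the function and the error atom are my own names, not from any library:

```erlang
%% Parses an external key without creating new atoms: unknown input
%% yields an error tuple instead of growing the atom table.
parse_key(Bin) when is_binary(Bin) ->
    try
        {ok, binary_to_existing_atom(Bin, utf8)}
    catch
        error:badarg -> {error, unknown_key}
    end.
```

binary_to_existing_atom/2 raises badarg for an atom that has never been created, so malicious or malformed input cannot exhaust the atom table.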
You can monitor the state of atoms using the following parameters:
- erlang:memory(atom_used) - the amount of memory used for atoms;
- erlang:system_info(atom_count) - the number of atoms created in the system. Together with erlang:system_info(atom_limit) you can calculate atom utilization.

Process leaks
It should be said right away that when process_limit (the +P argument of erl) is reached, the erlang vm does not crash but enters a degraded state: for example, even connecting to it will most likely be impossible. Ultimately, exhausting the memory available to the leaked processes will bring the erlang vm down.
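The utilization calculation is a one-liner; a minimal sketch, with the function name being my own:

```erlang
%% Process table utilization in percent; values approaching 100
%% mean process_limit is about to be hit.
process_utilization() ->
    Count = erlang:system_info(process_count),
    Limit = erlang:system_info(process_limit),
    Count * 100 / Limit.
```

The same pattern works for atoms with atom_count and atom_limit.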
- erlang:system_info(process_count) - the number of processes currently alive. Together with erlang:system_info(process_limit) you can calculate process utilization;
- erlang:memory(processes) - memory allocated for processes;
- erlang:memory(processes_used) - memory used by processes.

Process mailbox overflow
A typical example of this problem: a sender process sends messages to a receiver process without waiting for confirmation, while receive in the receiver process ignores all these messages because of a missing or incorrect pattern. As a result, the messages accumulate in the mailbox. Although erlang has a mechanism for slowing down a sender when the handler cannot keep up, the vm still crashes once the available memory is exhausted.
etop can help you understand whether mailboxes are overflowing:
$ erl -name etop@host -hidden -s etop -s erlang halt -output text -node dest@host -setcookie some_cookie -tracing off -sort msg_q -interval 1 -lines 25
As a metric for continuous monitoring, you can take the number of problematic processes. To identify them, you can use the following function:
top_msg_q() ->
    [{P, RN, L, IC, ST}
     || P <- processes(),
        {_, L} <- [process_info(P, message_queue_len)],
        L >= 1000,
        [{_, RN}, {_, IC}, {_, ST}] <-
            [process_info(P, [registered_name, initial_call, current_stacktrace])]].
This list can also be written to the log; then, when a notification arrives from monitoring, analyzing the problem becomes simpler.
Binary leaks
Memory for large (more than 64 bytes) binaries is allocated on a shared heap. An allocated block has a reference counter showing the number of processes that have access to it. When the counter reaches zero, the block is freed. The simplest scheme, but, as they say, there are nuances. In principle, a process can generate so much garbage on the heap that the system does not have enough memory to perform the cleanup.
The monitoring metric is erlang:memory(binary), which shows the memory allocated for binaries.
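When that metric keeps growing, it helps to know which processes hold the references. A sketch, with the helper name being my own; it relies on process_info(Pid, binary), which returns a list of {Id, Size, RefCount} tuples for the off-heap binaries a process references:

```erlang
%% The N processes referencing the most off-heap binary memory; a
%% starting point when erlang:memory(binary) keeps growing.
top_binary_holders(N) ->
    Sized = [{P, lists:sum([Size || {_Id, Size, _RefC} <- Bins])}
             || P <- processes(),
                {binary, Bins} <- [process_info(P, binary)]],
    lists:sublist(lists:reverse(lists:keysort(2, Sized)), N).
```

Note that the reported sizes overlap when several processes reference the same binary, so the numbers are a heuristic, not an exact accounting.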
So, we have covered the cases that can bring the vm down; besides them, it is useful to monitor other, no less important, parameters that directly or indirectly affect the correct functioning of your applications:
- erlang:memory(ets) - memory allocated for ets tables;
- erlang:memory(code) - memory allocated for loaded code;
- erlang:memory(system) - memory consumed by the erlang runtime;
- erlang:memory(total) - the sum of the memory consumed by processes and by the runtime;
- erlang:statistics(reductions);
- erlang:statistics(run_queue);
- erlang:statistics(runtime) - lets you understand, without analyzing the logs, whether a restart has occurred;
- erlang:statistics(io).

Let's create a file containing the application metrics and the erlang vm metrics and update it every N seconds. For each erlang node, the metrics file must contain the metrics of the applications running on it and the metrics of the erlang vm instance. The result should look something like this:
messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
….
erlang.io.input = 2205723664
erlang.io.output = 1665529234
erlang.memory.binary = 1911136
erlang.memory.ets = 1642416
erlang.memory.processes = 23596432
erlang.memory.processes_used = 23598864
erlang.memory.system = 50883752
erlang.memory.total = 74446048
erlang.processes.count = 402
erlang.processes.run_queue = 0
erlang.reductions = 148412771
....
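A sketch of such a dump function, using only the standard calls listed above; the function name, the metric keys, and the file layout are my own choices, and the application metrics would be appended the same way:

```erlang
%% Dumps erlang vm metrics in the "key = value" format shown above,
%% ready to be shipped by zabbix_sender. Call it from a timer every
%% N seconds.
dump_vm_metrics(Path) ->
    {{input, In}, {output, Out}} = erlang:statistics(io),
    {Reductions, _SinceLast} = erlang:statistics(reductions),
    Metrics =
        [{"erlang.io.input", In},
         {"erlang.io.output", Out},
         {"erlang.memory.binary", erlang:memory(binary)},
         {"erlang.memory.ets", erlang:memory(ets)},
         {"erlang.memory.processes", erlang:memory(processes)},
         {"erlang.memory.processes_used", erlang:memory(processes_used)},
         {"erlang.memory.system", erlang:memory(system)},
         {"erlang.memory.total", erlang:memory(total)},
         {"erlang.processes.count", erlang:system_info(process_count)},
         {"erlang.processes.run_queue", erlang:statistics(run_queue)},
         {"erlang.reductions", Reductions}],
    Lines = [io_lib:format("~s = ~p~n", [K, V]) || {K, V} <- Metrics],
    file:write_file(Path, Lines).
```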
With the help of zabbix_sender we will send this file to zabbix, where a graphical representation and the ability to create automation and notification triggers become available.
Now, having the metrics in the monitoring system, plus automation triggers and notification events created on top of them, we have a chance to avoid accidents by reacting in advance to dangerous deviations from the fully functional state.
While a project has one or two servers, you can probably still live without centralized log collection; but as soon as a distributed system with many servers, clusters, and environments appears, the problem of collecting and conveniently viewing logs has to be solved.
To write logs in my projects, I use lager. Often, on the way from prototype to production, projects go through the following stages of collecting logs:
We will talk about the pros, cons, and quantitative metrics of the last approach in light of a specific implementation, lager_clickhouse, which I use in most of the projects I develop. A few words about lager_clickhouse: it is a lager backend for saving events to clickhouse. At the moment it is an internal project, but there are plans to open-source it. While developing lager_clickhouse, I had to work around some peculiarities of clickhouse, for example buffering events so as not to make frequent requests to it. The effort paid off with stable operation and good performance.
The main disadvantage of this approach is the extra entity, clickhouse, and the need to develop code for saving events to it plus a user interface for analyzing and searching events. Also, for some projects the use of tcp to send logs may be critical.
But the pros, I think, outweigh all the possible disadvantages.
Easy and quick event search:
An exemplary view of the log viewing interface is shown in the screenshot:
Ability to automate.
With the introduction of a log repository, it became possible to obtain real-time information on the number of errors, the occurrence of critical failures, and system activity. By setting certain thresholds, we can generate emergency events about the system leaving its functional state, whose handlers perform automated actions to eliminate that state and notify the team members responsible for the functionality:
A further development of automating the handling of emergency events was the use of lua scripts. Any developer or administrator can write a script to process logs and metrics. Scripts bring flexibility and allow creating personal automation and notification scenarios.
To understand the processes occurring in a system and to investigate incidents, it is vital to have quantitative indicators and event logs, as well as convenient tools for analyzing them. The more information about the system is available to us, the easier it is to analyze its behavior and fix problems already at the stage of their appearance. And when our measures have not worked, we always have graphs and detailed logs of the incident.
How do you operate Erlang/Elixir solutions, and what interesting cases have you encountered in production?
Source: https://habr.com/ru/post/437720/