
The fight for quality in Erlang / Elixir solutions


Today we will talk about event logs, quantitative metrics, and monitoring of both, with the goal of speeding up the team's response to incidents and reducing the target system's downtime.

Erlang / OTP, as a framework and an ideology for building distributed systems, gives us well-defined approaches to development, tooling, and ready-made implementations of standard components. Suppose we have made use of what OTP offers and have gone all the way from prototype to production. Our Erlang project is doing well on the production servers, the code base is constantly evolving, new requirements and functionality appear, new people join the team, and everything seems fine. But sometimes something goes wrong, and technical problems multiplied by the human factor can lead to an outage.

Since it is impossible, or not economically feasible, to insure against every possible failure and problem, downtime after a failure has to be reduced through management and software measures.

In information systems there will always be some probability of failures of different kinds:

Let us delineate responsibility right away: the operation of computing equipment and data networks is the concern of infrastructure monitoring, organized for example with Zabbix. Much has been written about installing and configuring such monitoring, so we will not repeat it here.

From the developer's point of view, availability and quality depend on detecting errors and performance problems early and responding to them early. This requires approaches and means of assessment. So we will try to derive quantitative metrics whose analysis can significantly improve quality at different stages of a project's development and operation.

Build systems

Let me remind you once again about the importance of an engineering approach and testing in software development. Erlang / OTP offers two testing frameworks out of the box: EUnit and Common Test.

The number of passed and failed tests, their run time, and the percentage of code covered by tests can be used as metrics for an initial assessment of the state of the code base and its dynamics. Both frameworks can save test results in JUnit format.
For example, for rebar3 and Common Test, you need to add the following lines to rebar.config:

{cover_enabled, true}.
{cover_export_enabled, true}.
{ct_opts, [
    {ct_hooks, [{cth_surefire, [{path, "report.xml"}]}]}
]}.

The number of passed and failed tests lets you build a trend graph, from which you can evaluate the team's dynamics and spot test regressions. In Jenkins, for example, this graph can be obtained with the Test Results Analyzer Plugin.

If tests turn red or start taking too long, the metrics make it possible to send the release back for rework while it is still at the build and automated testing stage.

Application Metrics

In addition to the operating system metrics, monitoring should include application metrics, such as the number of views per second, the number of payments, and other critical indicators.

In my projects, I use the ${application}.${metrics_type}.${name} template for naming metrics. This naming allows you to get lists of metrics like

messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
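A minimal sketch of how such names could be assembled from the three parts of the template. The module and function names are hypothetical, and I am assuming that in the sample above `messaging` is the application, `systime_subs` the metrics type, and `messages.delivered` the name:

```erlang
-module(metric_name).
-export([build/3, line/2]).

%% Build <<"application.metrics_type.name">> from its three parts.
%% Each part may be an atom, a string, or a binary.
-spec build(term(), term(), term()) -> binary().
build(Application, MetricsType, Name) ->
    Parts = [to_iodata(P) || P <- [Application, MetricsType, Name]],
    iolist_to_binary(lists:join($., Parts)).

%% Render one "name = value" line for an integer metric value.
-spec line(iodata(), integer()) -> binary().
line(MetricName, Value) when is_integer(Value) ->
    iolist_to_binary([MetricName, " = ", integer_to_binary(Value)]).

to_iodata(Part) when is_atom(Part)   -> atom_to_binary(Part, utf8);
to_iodata(Part) when is_binary(Part) -> Part;
to_iodata(Part) when is_list(Part)   -> Part.
```

Keeping the application name as the first component means that a simple prefix filter is enough to pull out all metrics of one application.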

Perhaps the more metrics there are, the easier it is to understand what is happening in a complex system.

Erlang VM metrics

Special attention should be paid to monitoring the Erlang VM. The "let it crash" ideology is beautiful, and proper use of OTP will certainly help bring crashed parts of the application back up inside the Erlang VM. But do not forget about the Erlang VM itself: it is hard to bring down, but it can be done. All the ways of doing so come down to resource exhaustion. The main ones are:

So, the cases that bring the VM down have been covered. Beyond them, it is worth monitoring other, no less important parameters that directly or indirectly affect the correct functioning of your applications:

Sending metrics to Zabbix

Let's create a file containing application metrics and Erlang VM metrics, updated every N seconds. For each Erlang node, the metrics file must contain the metrics of the applications running on that node and the metrics of its Erlang VM instance. The result should look something like this:

messaging.systime_subs.messages.delivered = 1654
messaging.systime_subs.messages.proxied = 0
messaging.systime_subs.messages.published = 1655
messaging.systime_subs.messages.skipped = 3
….
erlang.io.input = 2205723664
erlang.io.output = 1665529234
erlang.memory.binary = 1911136
erlang.memory.ets = 1642416
erlang.memory.processes = 23596432
erlang.memory.processes_used = 23598864
erlang.memory.system = 50883752
erlang.memory.total = 74446048
erlang.processes.count = 402
erlang.processes.run_queue = 0
erlang.reductions = 148412771
....
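The `erlang.*` part of such a file can be gathered with standard BIFs. Below is a sketch rather than the code used in my projects: the module name `vm_metrics` and its functions are hypothetical, and the mapping of each metric name to a BIF is my assumption based on the names in the sample.

```erlang
-module(vm_metrics).
-export([collect/0, format/1, dump/1]).

%% Collect the Erlang VM metrics named in the sample as {Name, Value} pairs.
collect() ->
    {{input, In}, {output, Out}} = erlang:statistics(io),
    {TotalReductions, _SinceLastCall} = erlang:statistics(reductions),
    Memory = erlang:memory(),
    [{"erlang.io.input", In},
     {"erlang.io.output", Out}] ++
    [{"erlang.memory." ++ atom_to_list(Type), proplists:get_value(Type, Memory)}
     || Type <- [binary, ets, processes, processes_used, system, total]] ++
    [{"erlang.processes.count", erlang:system_info(process_count)},
     {"erlang.processes.run_queue", erlang:statistics(run_queue)},
     {"erlang.reductions", TotalReductions}].

%% Render integer metrics as "name = value" lines.
format(Metrics) ->
    [io_lib:format("~s = ~B~n", [Name, Value]) || {Name, Value} <- Metrics].

%% Write the metrics to Path (a string). The data goes to a temporary file
%% first and is then renamed, so a reader never sees a half-written file.
dump(Path) ->
    Tmp = Path ++ ".tmp",
    ok = file:write_file(Tmp, format(collect())),
    file:rename(Tmp, Path).
```

To refresh the file every N seconds, something like `timer:apply_interval(N * 1000, vm_metrics, dump, [Path])` would do. Application metrics would be appended to the same list before formatting.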

Using zabbix_sender, we will send this file to Zabbix, where a graphical representation is available along with the ability to create automation and notification triggers.
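One detail worth keeping in mind: the `-i` option of zabbix_sender reads whitespace-separated `<hostname> <key> <value>` lines, so a file in the `name = value` form needs a conversion step. The sketch below shows one possible conversion. The module name, the host name, and the server address are hypothetical, and the exact pipeline used in my projects is not shown here.

```erlang
-module(zabbix_lines).
-export([to_sender_input/2, command/2]).

%% Convert "key = value" lines into "<host> <key> <value>" lines.
%% Lines that do not have the "key = value" shape (blank lines,
%% elision markers) are skipped.
to_sender_input(Host, MetricsText) ->
    Lines = string:split(MetricsText, "\n", all),
    [[Host, " ", Key, " ", Value, "\n"]
     || Line <- Lines,
        [Key, Value] <- [string:split(string:trim(Line), " = ")]].

%% Build the shell command that sends a prepared input file:
%% -z is the Zabbix server address, -i is the input file.
command(Server, InputFile) ->
    lists:flatten(io_lib:format("zabbix_sender -z ~s -i ~s", [Server, InputFile])).
```

The resulting string can be run with `os:cmd/1` or from cron. On the Zabbix side, each key has to exist as a trapper item on the corresponding host.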

Now, with the metrics in the monitoring system and with automation triggers and notification events built on top of them, we have a chance to avoid outages by responding in advance to every dangerous deviation from the fully functional state.

Centralized log collection

While a project has only one or two servers, you can probably still live without centralized log collection. But as soon as a distributed system appears, with multiple servers, clusters, and environments, collecting logs and viewing them conveniently becomes a problem that has to be solved.

To write logs in my projects, I use lager. Often, on the way from prototype to production, projects go through the following stages of collecting logs:

We will talk about the pros, cons, and quantitative metrics available with the last of these in light of a specific implementation, lager_clickhouse, which I use in most of the projects I develop. A few words about lager_clickhouse: it is a lager backend that saves events to ClickHouse. At the moment it is an internal project, but there are plans to open-source it. While developing lager_clickhouse, I had to work around some ClickHouse specifics, for example by buffering events so as not to send frequent requests to ClickHouse. The effort paid off in stable operation and good performance.
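Since lager_clickhouse is not public yet, here is only an illustration of the buffering idea, not its actual code. This is a hypothetical, purely functional buffer that accumulates events and hands back a whole batch once a size threshold is reached, so that the caller makes one insert request per batch:

```erlang
-module(event_buffer).
-export([new/1, add/2, flush/1]).

%% Buffer state: {Events (newest first), Count, MaxSize}.
new(MaxSize) when is_integer(MaxSize), MaxSize > 0 ->
    {[], 0, MaxSize}.

%% Add an event. When the buffer reaches MaxSize, return the whole batch
%% in arrival order together with an empty buffer.
add(Event, {Events, Count, Max}) when Count + 1 >= Max ->
    {flush, lists:reverse([Event | Events]), {[], 0, Max}};
add(Event, {Events, Count, Max}) ->
    {ok, {[Event | Events], Count + 1, Max}}.

%% Force a flush, for example from a timer, so that rare events
%% do not sit in the buffer indefinitely.
flush({Events, _Count, Max}) ->
    {lists:reverse(Events), {[], 0, Max}}.
```

In a real backend this state would presumably live inside the backend process, with `flush/1` called on a timer in addition to the size-based flush.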

The main disadvantage of saving logs to a storage is the additional entity, ClickHouse, together with the need to develop the code that saves events to it and a user interface for analyzing and searching events. For some projects, the use of TCP for sending logs may also be critical.

But the pros, I think, outweigh all the possible disadvantages.

A further development of the automation theme for handling emergency events was the use of Lua scripts. Any developer or administrator can write a script that processes logs and metrics. Scripts bring flexibility and make it possible to create personal automation and notification scenarios.


To understand the processes taking place in the system and to investigate incidents, it is vital to have quantitative indicators and event logs, as well as convenient tools for analyzing them. The more information about the system is available to us, the easier it is to analyze its behavior and correct problems as soon as they arise. And when our measures have not worked, we still have graphs and detailed logs of the incident.

How do you operate your Erlang / Elixir solutions, and what interesting cases have you encountered in production?

Source: https://habr.com/ru/post/437720/