
Checklist: what to do before running microservices in prod

This article is a brief distillation of my own experience and that of my colleagues, with whom I have spent days and nights cleaning up incidents. Many of those incidents would never have happened if everyone's favorite microservices had been written just a little more carefully.


Unfortunately, some under-qualified programmers seriously believe that a Dockerfile with any command at all inside is already a microservice and can be deployed right now. Containers are spinning, money is rolling in. This approach turns into problems ranging from degraded performance, undebuggable behavior, and service failures all the way up to a nightmare called Data Inconsistency.


If you feel that the time has come to launch yet another app into Kubernetes / ECS / whatever, I have a few objections.


A Russian version of this article is also available.


Over time I have formed a set of criteria for assessing whether an application is ready for production. Some items on this checklist apply only to certain kinds of applications, while others apply to everything. I am sure you can add your own items in the comments or dispute any of these points.


If your microservice fails even one of these criteria, I will not allow it into my ideal cluster, built in a bunker 2000 meters underground, with heated floors and a closed, self-sufficient Internet feed.


Let's go.


Note: the order of the items does not matter. At least not to me.


Short description in the readme


The service contains a short description of itself at the very beginning of the Readme.md in its repository.

God, it seems so simple. Yet I keep running into repositories that contain not the slightest explanation of why the service exists or what problems it solves. Anything more advanced, such as documented configuration options, is not even worth hoping for.


Integration with monitoring system


The service sends metrics to DataDog, NewRelic, Prometheus, or a similar system.

Resource consumption, memory leaks, stack traces, service interdependencies, error rates: without understanding all of this (and more), it is extremely difficult to keep track of what is happening in a large distributed application.
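
For example, with Prometheus the usual pattern is to expose a /metrics endpoint that the monitoring system scrapes. Below is a minimal sketch in Python using the prometheus_client library; the metric names and the port are illustrative assumptions, not a prescription.

```python
# A minimal sketch: expose request count and latency for Prometheus to scrape.
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

REQUESTS = Counter("myservice_requests_total", "Total requests processed")
LATENCY = Histogram("myservice_request_seconds", "Request processing time")

def handle_request():
    with LATENCY.time():               # records how long the work took
        time.sleep(random.random() / 10)
    REQUESTS.inc()                     # counts every processed request

if __name__ == "__main__":
    start_http_server(9090)            # serves /metrics on port 9090
    while True:
        handle_request()
```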


Alerts are configured


Alerts are set up for the service, covering all standard situations plus any known service-specific ones.

Metrics are good, but nobody is going to sit and watch them. Therefore, we should automatically receive calls / pushes / SMS whenever something crosses a threshold that requires human attention.



Runbooks created


A document has been created for the service describing known or expected abnormal situations.


The exact contents will differ from one organization to another, but the basics, at the very least, should be covered.


All logs are written to STDOUT / STDERR


In production mode, the service does not write log files, does not ship logs to any external services itself, and does not contain any redundant abstractions for log rotation, etc.

When an application writes its logs to files, those logs are useless. You are not going to shell into five containers running in parallel hoping to catch the error you need (you will end up in tears). Restarting a container means losing those logs completely.


If an application ships its logs directly to a third-party system, for example to Logstash, it creates pointless redundancy. The neighboring service cannot do the same because it is built on a different framework? You end up with a zoo.


The application writes some logs to files and some to stdout, because it is convenient for the developer to see INFO in the console and DEBUG in the files? That is the worst option of all. Nobody needs that complexity, or the entirely superfluous code and configuration that somebody has to know about and maintain.


Logs are JSON


Each log line is written in JSON format and contains a consistent set of fields.

To this day, almost everyone writes logs as plain text. This is a real disaster. I would have been happy never to learn about Grok patterns. Sometimes I dream about them and freeze, trying not to move so as not to attract their attention. Just try parsing Java exceptions out of plain-text logs once.


JSON is a blessing, a fire handed down from heaven. Add a consistent set of fields to every entry, ship the logs to any suitable system (a properly configured ElasticSearch, for example), and enjoy. Join the logs of many microservices together and feel once again how good monolithic applications were.


(You can also add Request-Id and get tracing ...)
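
As a rough illustration, here is a minimal JSON formatter for Python's standard logging module that writes to stdout; the field set (timestamp, level, service, message, request_id) and the service name are assumptions for the example, not a prescribed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "my-service",                 # hypothetical service name
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:                          # keep stack traces in one entry
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)          # stdout only, no files
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info("order created", extra={"request_id": "abc-123"})
```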


Logs with verbosity levels


The application must support an environment variable, for example LOG_LEVEL, with at least two modes of operation: ERRORS and DEBUG.

It is desirable for all services in the same ecosystem to support the same environment variable. Not a config file option, not a command-line flag (although those can wrap it, of course), but available straight from the environment by default. You should be able to get as many logs as possible when something goes wrong, and as few as possible when everything is fine.
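
A minimal sketch of how this might look in Python, assuming the ERROR and DEBUG modes map onto the standard logging levels:

```python
import logging
import os

# Read the verbosity from the environment; default to the quiet mode.
level_name = os.getenv("LOG_LEVEL", "ERROR").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.ERROR))
```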


Fixed versions of dependencies


Dependencies in the package manager are pinned to exact versions, including minor and patch (for example, cool_framework = 2.5.3).

Many people have written about this already, of course. Some pin only the major version, hoping that minor releases will contain nothing but bug fixes and security patches. That is not good enough.
Every change to every dependency should be reflected in a separate commit, so that it can be reverted if problems appear. Hard to keep track of by hand? There are useful bots that will track updates and open a Pull Request for each of them.


Dockerized


The repository contains a production-ready Dockerfile and docker-compose.yml.

Docker has long been a standard for many companies. There are exceptions, but even if you do not run Docker in production, any engineer should be able to simply run docker-compose up, without thinking about anything else, and get a dev build for local verification. And the system administrator should receive a build already verified by the developers, with the required versions of libraries, utilities, and so on, in which the application at least somehow works, so they can adapt it for production.


Configuration through the environment


All important configuration options are read from the environment, and the environment takes precedence over configuration files (but has lower priority than command-line arguments passed at startup).

No one will ever want to read your configuration files and study their format. Just accept it.


More details here: https://12factor.net/config
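
A minimal sketch of that precedence (command line over environment over config file) in Python; the option name (--db-url / DB_URL) and the config file layout are illustrative assumptions:

```python
import argparse
import json
import os

def load_db_url(config_path="config.json"):
    # 1. Lowest priority: a value from a config file, if one exists.
    file_value = None
    if os.path.exists(config_path):
        with open(config_path) as f:
            file_value = json.load(f).get("db_url")

    # 2. The environment overrides the file.
    env_value = os.getenv("DB_URL")

    # 3. A command-line argument overrides everything.
    parser = argparse.ArgumentParser()
    parser.add_argument("--db-url", default=None)
    cli_value = parser.parse_args().db_url

    return cli_value or env_value or file_value

if __name__ == "__main__":
    print(load_db_url())
```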


Readiness and Liveness probes


The service provides the appropriate endpoints or CLI commands for checking that it is alive and that it is ready to serve requests after startup.

If an application serves HTTP requests, it should by default have two interfaces:


  1. A liveness probe, to verify that the application is alive and has not gotten stuck. If it stops responding, orchestrators like Kubernetes can kill and restart it automatically, "but that is not certain." In fact, killing a hung application can trigger a domino effect and take your service down for good. But that is not the developer's problem; just provide this endpoint.


  2. A readiness probe, to verify that the application has not merely started but is ready to accept requests. If the application has established its connections to the database, the queue system, and so on, it must respond with a status of at least 200 and below 400 (for Kubernetes).
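
A minimal sketch of both endpoints using only the Python standard library; the paths (/healthz, /readyz) and the check_dependencies() helper are illustrative assumptions:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies():
    # Hypothetical check: return True once DB / queue connections are established.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)        # alive: the process responds at all
        elif self.path == "/readyz":
            self.send_response(200 if check_dependencies() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```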



Resource limits


Limits for memory, CPU, disk space, and any other available resources are declared in a consistent format.

The concrete implementation of this item will differ between organizations and orchestrators. However, the limits must be declared in the same format for all services, be allowed to differ between environments (prod, dev, test, ...), and live outside the repository that holds the application code.


Build and delivery are automated


The CI / CD system used in your organization or project is configured and can deliver the application to the desired environment according to the accepted workflow.

Nothing is ever delivered to production manually.


No matter how difficult it is to automate the build and delivery of your project, it must be done before the project reaches production. This includes building and running Ansible playbooks / Chef cookbooks / Salt / ..., building mobile applications, building operating system forks, building virtual machine images, whatever.
Can't automate it? Then you cannot release it into the world. Nobody will be able to build it after you.


Graceful shutdown: shutting down correctly


The application handles SIGTERM and other signals and shuts down in an orderly way once it has finished processing the current task.

This is an extremely important point. Docker processes become orphaned and keep running in the background for months where nobody sees them. Non-transactional operations get cut off in the middle of execution, creating data inconsistencies between services and databases. This leads to errors that cannot be foreseen and can be very, very expensive.


If you do not control some of your dependencies and cannot guarantee that your code will handle SIGTERM correctly, use something like dumb-init.
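
If you do control the code, the idea is simple: a signal handler flips a flag, and the worker loop checks it between tasks. A minimal sketch in Python, where fetch_task() and process() are placeholders standing in for your real work:

```python
import signal
import time

shutting_down = False

def request_shutdown(signum, frame):
    global shutting_down
    shutting_down = True               # finish the current task, then exit

signal.signal(signal.SIGTERM, request_shutdown)
signal.signal(signal.SIGINT, request_shutdown)

def fetch_task():
    time.sleep(1)                      # placeholder for reading from a queue
    return {"id": 42}

def process(task):
    time.sleep(2)                      # placeholder for real, possibly long, work

while not shutting_down:
    process(fetch_task())              # never interrupted mid-task by the signal
print("shutdown complete")
```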





The database connection is checked regularly


The application constantly pings the database, and on a "connection lost" error for any request it automatically tries to restore the connection on its own or else shuts down correctly.

I have seen many cases (and this is not just a figure of speech) where services built to process queues or events lost their connection on a timeout and then endlessly poured errors into the logs, returned messages to the queue, sent them to the Dead Letter Queue, or simply stopped doing their job.
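
The shape of the fix is usually the same regardless of the driver: ping, reconnect with backoff, and if that fails, exit cleanly instead of spinning. A minimal sketch; the connect() and ping() callables are assumptions standing in for whatever your database client actually provides (for example, opening a connection and running a SELECT 1):

```python
import sys
import time

def keep_connection(connect, ping, retries=5, base_delay=1.0):
    """Return a live connection, retrying with backoff; exit cleanly if impossible."""
    for attempt in range(retries):
        try:
            conn = connect()
            ping(conn)                             # cheap check, e.g. SELECT 1
            return conn
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Could not restore the connection: shut down correctly instead of spamming logs.
    sys.exit("database unreachable, terminating gracefully")
```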


Scaled horizontally


As the load grows, it is enough to run more instances of the application to ensure that all requests or tasks are processed.

Not all applications can scale horizontally. A prime example is Kafka consumers, which scale only up to the number of partitions. This is not necessarily bad, but if a particular application cannot be run in two copies, everyone involved needs to know about it in advance. This information should be impossible to miss: put it in the Readme and anywhere else you can. Some applications cannot be run in parallel under any circumstances at all, which makes supporting them seriously harder.


It is much better if the application itself handles these situations, or if a wrapper is written for it that watches for "competitors" and simply prevents the process from starting, or from beginning work, until another process has finished its own, or until some external configuration allows N processes to run simultaneously.
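
One simple form such a wrapper can take on a single machine is an advisory file lock: the second copy fails fast instead of silently competing. A minimal, Unix-only sketch; the lock path is an illustrative assumption, and across machines a distributed lock (in your database or coordination service) plays the same role:

```python
import fcntl
import sys

LOCK_PATH = "/tmp/my-service.lock"     # hypothetical path

lock_file = open(LOCK_PATH, "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)   # fail fast if already held
except BlockingIOError:
    sys.exit("another instance is already running, refusing to start")

# ... normal service work goes here; the lock is released when the process exits
```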


Dead letter queues and resilience to bad messages


If the service listens to queues or reacts to events, a change in the format or content of messages does not make it crash. Failed attempts to process a task are retried N times, after which the message is sent to a Dead Letter Queue.

Many times I have seen endlessly restarting consumers and queues swollen to the point where processing the backlog took many days. Any queue listener must be prepared for format changes, for random errors in the message itself (malformed data in the JSON, for example), and for failures in the code that processes it. I have even run into a situation where the standard RabbitMQ library for one extremely popular framework did not support retries, attempt counters, or anything of the kind.


Even worse, messages were simply destroyed on failure.
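
Whatever broker client you use, the handler usually boils down to the same skeleton: parse defensively, retry a bounded number of times, then park the message. A minimal sketch; MAX_ATTEMPTS, process() and publish_to_dlq() are illustrative assumptions, not any particular library's API:

```python
import json
import logging

MAX_ATTEMPTS = 5

def process(payload):
    ...  # hypothetical business logic

def publish_to_dlq(raw_message, reason):
    ...  # hypothetical: publish to the dead letter queue with a failure reason

def handle_message(raw_message, attempt):
    try:
        payload = json.loads(raw_message)      # bad JSON must not crash the consumer
        process(payload)
        return True                            # ack
    except Exception as exc:
        if attempt >= MAX_ATTEMPTS:
            publish_to_dlq(raw_message, reason=str(exc))
            return True                        # ack: the message is parked, not lost
        logging.warning("attempt %d failed: %s", attempt, exc)
        return False                           # nack / requeue for another attempt
```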


Limit on the number of messages and tasks processed by one process


The service supports an environment variable that, when set, limits the maximum number of tasks it will process, after which it shuts down correctly.

Everything flows, everything changes, especially memory. A continuously growing memory-consumption graph that ends in OOMKilled is a fact of life in today's Kubernetes world. Implementing a primitive check that spares you the need to investigate every one of these memory leaks would make life easier. I have often seen people spend a lot of time, effort, and money trying to stop this churn, with no guarantee that a colleague's next commit will not make everything worse again. If the application can survive a week, that is an excellent result. Let it then simply finish up on its own and restart. That is better than a SIGKILL (on SIGTERM, see above) or an "out of memory" exception. This stopgap will last you a couple of decades.
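
The check itself can be as primitive as a counter around the worker loop. A minimal sketch; the MAX_TASKS variable name, its default, and the fetch_task()/process() placeholders are assumptions, and the orchestrator is expected to restart the process afterwards:

```python
import os
import sys

MAX_TASKS = int(os.getenv("MAX_TASKS", "10000"))

def fetch_task():
    return {"id": 1}                   # placeholder for reading from a queue

def process(task):
    pass                               # placeholder for real work (and its slow leaks)

processed = 0
while processed < MAX_TASKS:
    process(fetch_task())
    processed += 1

sys.exit(0)                            # graceful restart instead of an eventual OOM kill
```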


Does not depend on third-party IP address filtering


If an application makes requests to a third-party service that only accepts calls from a limited set of IP addresses, it makes these calls indirectly through a reverse proxy.

This is a rare case, but an extremely unpleasant one. It is very inconvenient when one tiny service blocks the whole infrastructure from switching clusters or moving to another region. If you have to talk to a party that does not support OAuth or a VPN, set up a reverse proxy in advance. Do not implement dynamic addition / removal of such external integrations inside your program, because that nails you to the one runtime environment you happen to have. It is better to automate the management of the Nginx configs straight away and point your application at that proxy.


An explicit HTTP User-Agent


The service sets a custom User-Agent header on all requests to any API, and this header contains enough information about the service itself and its version.

When you have 100 different applications talking to each other, you can go crazy seeing nothing in the logs but "Go-http-client/1.1" and the dynamic IP address of a Kubernetes container. Always identify your application and its version explicitly.
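
Setting the header costs almost nothing. A minimal sketch with the Python standard library; the service name, version, and URL in the string are illustrative assumptions:

```python
import urllib.request

SERVICE_UA = "my-service/1.4.2 (+https://example.com/my-service)"   # hypothetical

def get(url):
    req = urllib.request.Request(url, headers={"User-Agent": SERVICE_UA})
    with urllib.request.urlopen(req) as resp:    # the header travels with every call
        return resp.read()
```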


Does not violate the license


Does not contain dependencies that unduly limit the application, is not a copy of someone else's code, and so on.

This seems self-evident, but I have managed to see things bad enough that the lawyer who wrote the NDA must be hiccupping to this day.


Does not use unsupported dependencies


When the service is first launched into production, it does not include dependencies that are already unmaintained.

If a library you have pulled into the project is no longer maintained by anyone, look for another way to achieve your goal or maintain the library yourself.


Conclusion


My own list contains a few more checks that are very specific to particular technologies or situations, and I have surely forgotten something. I am sure you will find items of your own to remember as well.



Source: https://habr.com/ru/post/438064/