Apache Kafka and RabbitMQ: Semantics and Message Delivery Guarantee

We have prepared a translation of the next part of a multi-part article, which compares the functionality of Apache Kafka and RabbitMQ. This publication deals with the semantics and guarantees of message delivery. Please note that the author took into account Kafka before version 0.10 inclusive, and in version 0.11 appeared exactly-once. Nevertheless, the article remains relevant and full of useful points from a practical point of view.
Previous parts: first , second .

Both RabbitMQ and Kafka offer robust message delivery guarantees. Both platforms offer guarantees on the principles of “at a maximum single delivery” and “at least a single delivery”, but with the principle of “strictly one-time delivery”, Kafka guarantees operate under a very limited scenario.

First, let's figure out what these guarantees mean:

At-most-once delivery (“at a maximum single delivery”). This means that the message cannot be delivered more than once. In this case, the message may be lost.
At-least-once delivery (“at least one-time delivery”). This means that the message will never be lost. In this case, the message can be delivered more than once.
Exactly-once delivery (“strictly one-time delivery”). Holy Grail message systems. All messages are delivered strictly once.

The word “delivery” here is likely to be an incomplete term. It would be more accurate to say “processing”. In any case, we are now interested in whether the consumer can process messages and on what basis this happens: “no more than one”, “no less than one”, or “strictly once”. But the word “processing” complicates perception, and the expression “strictly one-time delivery” in this case will not be a precise definition, because it may be necessary to deliver the message twice in order to properly process it once. If the recipient has disconnected during processing, it is required that the message be sent a second time to the new recipient.

The second. Discussing the issue of message processing, we approach the topic of partial failures, which is a headache for developers. There are several stages in the processing of a message. It consists of communication sessions between the application and the message system at the beginning and at the end and the operation of the application itself with the data in the middle. Partial application failure scenarios must be processed by the application itself. If the operations performed are completely transactional and the results are formulated on the “all or nothing” principle, partial failures in the application logic can be avoided. But often, many steps involve interfacing with other systems where transactionality is impossible. If we include interrelationships between messaging systems, applications, the cache, and the database, can we guarantee that the processing is “strictly once”? The answer is no.

The “strictly once” strategy is limited to the scenario in which the only recipient of the processed messages is the messaging platform itself, and the platform itself provides full-fledged transactions. In this limited scenario, you can process messages, write them, send signals that they are processed as part of a “all or nothing” transaction. This is provided by the Kafka Streams library.

But if message processing is always idempotent, you can avoid the need to implement the strategy “strictly once” through transactions. If the final processing of messages is idempotent, you can easily accept duplicates. But not all actions can be implemented idempotently.

End-to-end notification

What is not represented in any devices of all messaging systems with which I have worked is a thorough confirmation. If we consider that in RabbitMQ a message can be delivered in several queues, end-to-end notification does not make sense. At Kafka, similarly, several different groups of recipients can read information simultaneously from one topic. In my experience, pass-through notification is what people who are new to the concept of messaging ask most often. In such cases, it is better to immediately explain that this is impossible.

Chain of responsibility

By and large, the sources of messages can not know that their messages are delivered to recipients. They can only know that the messaging system took their messages and took responsibility for ensuring their safe storage and delivery. There is a chain of responsibility that starts with the source, goes through the messaging system and ends at the recipient. Everyone must correctly perform their duties and clearly convey the message to the next. This means that you, as a developer, should competently design your applications to prevent the loss or incorrect use of messages while they are under your control.

Message Transfer Order

This article focuses primarily on how each platform provides for sending at least one and no more than one strategies. But there is still a messaging procedure. In the previous parts of this series, I wrote about the order in which messages are transmitted and how they are processed, and I advise you to turn to these parts.

In short, both RabbitMQ and Kafka provide a guarantee of the order of a simple sequence (first in first out, FIFO). RabbitMQ maintains such an order at the queue level, and Kafka maintains this order at the segment allocation level. The implications of such design decisions have been considered in previous articles.

Delivery guarantees in RabbitMQ

Delivery guarantees are provided:

reliability of messages - they will not disappear while being stored on RabbitMQ;
message notifications - RabbitMQ exchanges signals with senders and receivers.

Reliability components

Queue mirroring

Queues can be mirrored (replicated) on many nodes (servers). For each queue, a leading queue is provided at one of the nodes. For example, there are three nodes, 10 queues and two replicas per queue. 10 control queues and 20 replicas will be distributed over three nodes. The distribution of control queues across nodes can be configured. In case of node hangup:

instead of each leading queue on a hung node, a replica of this queue is provided on another node;
on other nodes, new replicas are created to replace lost replicas on the outgoing node, thereby maintaining the replication factor.

We are talking about fault tolerance in the next part of the article.

Reliable queues

There are two types of queues on RabbitMQ: reliable and unreliable. Reliable queues are written to disk and saved when the node is rebooted. When the node starts, they are overridden.

Resistant Posts

If the queue is reliable, it does not mean that its messages are saved when the node is restarted. Only messages marked sender by the sender will be restored.

When working on RabbitMQ, the more reliable the message, the lower the performance possible. If there is a stream of real-time events and it is not critical to lose several of them or a small time period of the stream, it is better not to apply queue replication and transmit all messages as unstable. But if it is undesirable to lose messages due to node failure, it is better to use robust replication queues and robust messages.

Message Receive Notifications

Messaging

Messages can be lost or duplicated during transmission. It depends on the sender's behavior.

“Shot and forget”

The source may decide not to ask the recipient for confirmation (notification of receipt of the message to the sender) and simply send the message automatically. Messages will not be duplicated, but may be lost (which satisfies the “one-time delivery maximum” strategy).

Confirmations to the sender

When the sender opens a channel for the queue broker, he can use the same channel to send acknowledgments. Now, in response to the received message, the queue broker should provide one of two things:

basic.ack. Positive acknowledgment. Message received, responsibility for it now lies on RabbitMQ;
basic.nack. Negative confirmation. Something happened and the message was not processed. Responsibility for it remains at the source. If desired, he can send the message again.

In addition to the positive and negative notifications on message delivery, a basic.return message is provided. Sometimes the sender needs to know not only that the message arrived in RabbitMQ, but also that it actually got into one or several queues. It may happen that the source sends a message to the distribution system in queues (topic exchange), in which the message is not routed to any of the delivery queues. In such a situation, the broker simply discards the message. In some scenarios, this is normal; in others, the source must know whether the message has been cleared and act accordingly. You can set the “Mandatory” flag for individual messages, and if the message has not been defined in any delivery queue, the message of basic.return will be returned to the sender.

The source may wait for confirmation after sending each message, but this will greatly reduce the performance of its work. Instead, sources can send a steady stream of messages, setting a limit on the number of unacknowledged messages. When the interim message limit is reached, sending will be suspended until all confirmations are received.

Now that there are a lot of messages in transit from the sender to RabbitMQ, confirmations are grouped together to increase performance using the multiple flag. All messages sent through the channel are assigned a monotonically increasing integer value, the “sequence number” (Sequence Number). The notification of the receipt of a message includes the sequence number of the corresponding message. And if at the same time the value multiple = true, the sender must track the sequence numbers of their messages in order to know which messages were successfully delivered and which did not. I wrote a detailed article on this topic.

Thanks to confirmations, we avoid losing messages in the following ways:

re-sending messages in case of a negative notification;
continue to store messages somewhere in case of receiving a negative notification or basic.return.

Transactions

Transactions are rarely used in RabbitMQ for the following reasons:

Weak guarantee. If messages are sent to multiple queues or have a mandatory icon, the continuity of transactions will not be supported;
Poor performance

Honestly, I never applied them, they do not give any additional guarantees, except confirmations to the sender, and only increase the uncertainty about how to interpret acknowledgments of receipt of messages arising from the completion of transactions.

Communication / channel errors

In addition to notifications on receiving messages, the sender needs to keep in mind the failures of communication tools and brokers. Both of these factors lead to a loss of communication channel. With the loss of channels, there is no way to receive any not yet delivered notification of receipt of messages. Here, the sender must choose between the risk of losing messages and the risk of duplicating them.

Broker failure can occur when the message was in the buffer of the operating system or pre-processed, and then the message will be lost. Or maybe the message was queued, but the message broker died before sending the confirmation. In this case, the message will be successfully delivered.

Similarly affects the situation of failure of communication. Did a failure occur while sending a message? Or after the message was queued, but before receiving a positive notification?

The sender cannot determine this, so he must choose one of the following options:

do not re-send the message, creating the risk of losing it;
re-send the message and create a risk of duplication.

If many sender messages are in transit, the problem becomes more complex. The only thing the sender can do is give the recipients a hint by adding a special header to the message indicating that the message is being sent a second time. Recipients may decide to check the messages for the presence of similar headers and, if they are found, to additionally check the received messages for duplicates (if such a check has not been done before).

Recipients

Recipients have two options governing notification of receipt:

no notification mode;
manual notification mode.

No notification mode

It is the automatic notification mode. And he is dangerous. First of all, because when a message gets into your application, it is removed from the queue. This may result in the loss of a message if:

The connection was disconnected until the message was received;
the message is still in the internal buffer, and the application is disabled;
unable to process message.

In addition, we lose backpressure mechanisms as a means of controlling the quality of message delivery. By setting the mode of sending notifications manually, you can set a prefetch (or set the level of services provided, QoS) to limit the one-time number of messages that the system has not yet confirmed. Without this, RabbitMQ sends messages as fast as the connection allows, and this can be faster than the recipient is able to process them. As a result, buffers overflow and memory errors occur.

Manual notification mode

The recipient must manually send notification of receipt of each message. He can set a prefetch in case the number of messages is more than one, and process many messages at the same time. He may decide to send a notification for each message, or he can use the multiple flag and send several notifications at the same time. Notification grouping improves performance.

When the recipient opens a channel, the messages going through it contain the Delivery Tag parameter, whose values are an integer, monotonically increasing number. It is included in each notification of receipt and is used as the message identifier.

Notifications can be the following:

basic.ack. After it, RabbitMQ deletes the message from the queue. The multiple flag can be applied here.
basic.nack. The recipient must set the flag to tell RabbitMQ whether to reissue the message. When re-staging the message gets to the top of the queue. From there it is sent to the recipient again (even to the same recipient). The basic.nack notification supports the multiple flag.
basic.reject. Same as basic.nack, but does not support the multiple flag.

Thus, semantically, basic.ack and basic.nack are the same when requeue = false. Both operators mean removing the message from the queue.

The next question is when to send receipt alerts. If the message was processed quickly, you may want to send a notification immediately after the completion of this operation (successful or unsuccessful). But if the message was in the RabbitMQ queue and it takes a lot of minutes to process? Sending a notification after this will be problematic, because if the channel closes, all messages to which there were no notifications will be returned to the queue, and sending will be made again.

Connection / Message Broker Error

If the connection was terminated or an error occurred in the broker, after which the channel ceases to work, then all messages that have not been acknowledged have been received again, are queued and re-sent. This is good because it prevents data loss, but badly, because it causes unnecessary duplication.

The longer the recipient has a long time there are messages, the receipt of which he did not confirm, the higher the risk of re-sending. When a message is sent again, RabbitMQ for the re-send flag is set to “true”. Due to this, the recipient at least has an indication that the message may have already been processed.

Idempotency

If idempotency is required and guarantees that no message will be lost, you should embed some kind of duplicate checking or other idempotent schemes. If checking for duplicate messages is too expensive, you can apply a strategy in which the sender always adds a special header to the resubmitted messages, and the recipient checks the received messages for the presence of such a header and a resend flag.

Conclusion

RabbitMQ предоставляет надёжные, долговременные гарантии обмена сообщениями, но есть много ситуаций, когда они не помогут.

Вот список моментов, которые следует запомнить:

Следует применять зеркалирование очередей, надёжные очереди, устойчивые сообщения, подтверждения для отправителя, флаг подтверждения и принудительное уведомление от получателя, если требуются надёжные гарантии в стратегии “как минимум однократная доставка”.
Если отправка производится в рамках стратегии “как минимум однократная доставка”, может потребоваться добавить механизм дедубликации или идемпотентности при дублировании отправляемых данных.
Если вопрос потери сообщений не так важен, как вопрос скорости доставки и высокой масштабируемости, то подумайте о системах без резервирования, без устойчивых сообщений и без подтверждений на стороне источника. Я всё же предпочел бы оставить принудительные уведомления от получателя, чтобы контролировать поток принимаемых сообщений путем изменения ограничений предвыборки. При этом вам потребуется отправлять уведомления пакетами и использовать флаг “multiple”.

Гарантии доставки в Kafka

Гарантии доставки обеспечиваются:

долговечностью сообщений — сообщения, сохранённые в сегменте, не теряются;
Уведомлениями о сообщениях — обмен сигналами между Kafka (и, возможно, хранилищем Apache Zookeeper) с одной стороны и источником/получателем — с другой.

Два слова о пакетировании сообщений

Одно из отличий RabbitMQ от Kafka заключается в использовании пакетов при обмене сообщениями.

RabbitMQ обеспечивает что-то похожее на пакетирование благодаря:

Приостановке отправки каждые Х сообщений до тех пор, пока не будут получены все уведомления. RabbitMQ обычно группирует уведомления, используя флаг «multiple».
Установке получателями параметра «prefetch» и группировкой уведомлений с помощью «multiple».

Но всё же сообщения не отправляются пакетами. Это больше похоже на непрерывный поток сообщений и отправку групп уведомлений в одном сообщении при отмеченном значке “multiple”. Как это делает протокол TCP.

Kafka обеспечивает более явное пакетирование сообщений. Пакетирование делается ради производительности, но иногда возникает необходимость в компромиссе между производительностью и другими факторами. Аналогичная ситуация возникает в RabbitMQ, когда сдерживающим фактором становится количество сообщений, которые ещё в пути, уведомлений о поступлении которых ещё нет. Чем больше сообщений было в пути в момент сбоя, тем больше возникает дублей и происходит повторной обработки сообщений.

Kafka более эффективно работает с пакетами со стороны получателя, потому что работа распределяется по разделам, а не по конкурирующим получателям. Каждый раздел закреплён за одним получателем, поэтому даже применение больших пакетов не влияет на распределение работы. Но если вместе с RabbitMQ используется устаревший API для считывания больших пакетов, это может привести к крайне неравномерной нагрузке между конфликтующими между собой получателями и значительным задержкам в обработке данных. RabbitMQ по своему устройству не подходит для пакетной обработки сообщений.

Элементы, обеспечивающие устойчивость

Репликация журнала

Для защиты от сбоев у Kafka предусмотрена архитектура ведущий-ведомый на уровне раздела журнала, и в этой архитектуре ведущие называются лидерами, а ведомые еще могут называться репликами. Лидер каждого сегмента может иметь несколько ведомых. Если на сервере, где находится лидер, происходит сбой, предполагается, что реплика становится лидером и все сообщения сохраняются, только обслуживание на короткое время прерывается.

Kafka придерживается концепции синхронизации реплик (In Sync Replicas, ISR). Каждая реплика может быть или не быть в синхронизированном состоянии. В первом случае она получает те же сообщения, что и лидер, за короткий отрезок времени (обычно за последние 10 секунд). Она выпадает из синхронизации, если не успевает эти сообщения принять. Такое может произойти из-за сетевой задержки, проблем с виртуальной машиной узла и т.д. Потеря сообщений может произойти только в случае сбоя лидера и отсутствия участвующих в синхронизации реплик. Я расскажу об этом подробнее в следующей части.

Уведомления о получении сообщений и отслеживание смещения

Учитывая то, как Kafka хранит сообщения, и то, как они доставляются получателям, Kafka полагается на уведомления о получении сообщений для источников и отслеживание смещения чтения топика для получателей.

Уведомление о получении сообщения для источника

Когда источник посылает сообщение, он даёт знать брокеру Kafka, какого рода уведомление он хочет получить, задав одну из настроек:

Без уведомлений, автоматический режим. Acks=0.
Уведомление о получении сообщения лидером. Acks=1
Уведомление о получении сообщения лидером и всеми участвующими в синхронизации репликами. Acks=All

Сообщения могут быть дублированы при отправке по тем же причинам, что и в RabbitMQ. В случае сбоя брокера или сети во время отправки, отправителю придётся ещё раз выслать сообщения, уведомления о принятии которых он не получил (если он не хочет, чтобы они были потеряны). Но вполне возможно, что это сообщение или сообщения были уже получены и среплицированы.

Однако у Kafka предусмотрена хорошая опция против проблем с дублированием. Для её работы должны быть соблюдены следующие условия:

для enable.idempotence установлено значение “true”,
для max.in.flight.requests.per.connection установлено значение 5 или менее,
для retries установлено значение 1 или выше,
для acks установлено значение “all”.

Следовательно, при пакетной отправке шести или более сообщений или если acks=0/1 для повышения производительности, данная опция не может использоваться.

Отслеживание смещения получателем

Получатели должны сохранять смещение своего последнего полученного сообщения, чтобы в случае сбоя новый получатель мог продолжить с того места, где остановился предыдущий. Эти данные могут быть сохранены в ZooKeeper или другом топике Kafka.

Когда получатель считывает пакеты сообщений из раздела (топика), у него есть несколько вариантов относительно того, когда сохранять смещение своего последнего полученного сообщения:

Периодически. По мере обработки сообщений клиентская библиотека контролирует периодические фиксации смещения. Это делает работу со смещениями очень простой с точки зрения программиста и также положительно влияет на производительность. Но такой подход повышает риск повторной доставки при сбое получателя. Получатель может успеть обработать пакет сообщений и упасть прежде, чем он успеет зафиксировать соответствующее смещение.
Немедленно, перед началом обработки сообщений. Это соответствует стратегии отправки сообщений “как максимум однократная доставка”. При этом не важно, когда могли быть сбои у получателя; сообщение не будет обработано дважды, но может остаться необработанным. Например, если 10 сообщений обрабатывались и на пятом у получателя произошёл сбой, обработанными окажутся только 4 сообщения, остальные будут сброшены, и следующий получатель начнёт с сообщений, которые поступят после этого пакета;
В конце, когда все сообщения обработаны. Это соответствует стратегии отправки сообщений “как минимум однократная доставка”. Не важно, когда могли быть сбои у получателя, ни одно сообщение не останется необработанным, но одно и то же сообщение может быть обработано несколько раз. Например, если 10 сообщений обрабатывались и на пятом у получателя произошёл сбой, все десять сообщений будут считаны следующим получателем, и 4 сообщения окажутся обработаны дважды;
Поочередно. Это сократит возможность дублирования, но сильно снизит производительность работы.

Стратегия “строго однократная доставка” ограничена Kafka Streams, клиентской библиотекой от Java. При использовании Java настоятельно рекомендую обратить внимание на неё. При использовании стратегии “строго однократная доставка”, главная сложность будет заключаться в том, что и обработка сообщения, и сохранение смещения последнего полученного сообщения должно производиться в одной транзакции. Например, если обработка сообщения предполагает отправку сообщения по электронной почте, сделать это в рамках стратегии “строго однократная доставка” не получится. Если после отправки электронного письма у получателя произошёл сбой до того, как он сохранил смещение последнего полученного сообщения, новый получатель (сообщения) должен будет снова отправить это письмо.

Приложения, использующие Kafka Streams, у которой последнее действие по обработке сообщения заключается в том, чтобы записать новое сообщение в другой топик, могут действовать в рамках стратегии “строго однократная доставка”. Это обеспечивается с помощью транзакционной функциональности Kafka: отправить сообщение в другой топик и записать смещение можно в рамках одной транзакции. Обе операции будут успешны, или обе будут неудачны. Вне зависимости от того, когда произойдет сбой получателя, и запись смещения, и запись в топик либо будут выполнены (и только один раз), либо нет одновременно.

О транзакциях и уровнях изоляции

Главным сценарием применения транзакций в Kafka является упомянутый выше сценарий “чтение-обработка-написание”. В транзакции могут участвовать сразу несколько топиков и разделов. Отправитель начинает транзакцию, создаёт пакет сообщений, завершает транзакцию.

Если получатели используют по умолчанию изоляционный уровень “читать незафиксированное”, они видят все сообщения, независимо от их транзакционного статуса (завершена, не завершена, отменена). Если получатели используют изоляционный уровень “чтение зафиксированного”, они не видят сообщения, транзакции которых не завершены или отменены. Они могут принимать сообщения только завершенных транзакций.

Может возникнуть вопрос: как изоляционный уровень “чтение завершенных транзакций” влияет на гарантии порядка отправки сообщений? Он не влияет никак. Получатели будут считывать все сообщения в нужном порядке, это прекратится на первом сообщении, транзакция которого не завершена. Незавершенные транзакции будут блокировать чтение. Смещение последней завершенной транзакции (Last Stable Offset, LSO) — это смещение до первой незавершенной транзакции; получатели с уровнем изоляции “чтение завершенных транзакций” могут читать только до данного смещения.

findings

Обе технологии предлагают устойчивые и надёжные механизмы обмена сообщениями. Если надёжность важна, вы можете быть уверены, что оба решения предлагают сравнимые гарантии. Но я думаю, что на настоящий момент Kafka имеет преимущество в идемпотентности отправки сообщений, а ошибки в контроле смещения последнего полученного сообщения не всегда приводит к потере такого сообщения навсегда.

Подведём итоги

Обе платформы способны реализовать стратегии “как максимум однократная доставка” и “как минимум однократная доставка”.
Обе платформы обеспечивают репликацию сообщений.
На обеих платформах действуют одинаковые факторы, в силу которых следует искать компромисс между пропускной способностью системы и риском дублирования сообщений. Kafka обеспечивает идемпотентную отправку сообщений, но только для ограниченного объёма трафика.
Обе платформы задают ограничения количества передаваемых сообщений, находящихся в пути, подтверждение о получении которых ещё не получено отправителем.
Обе платформы предоставляют гарантии относительно порядка отправки сообщений.
Kafka обеспечивает поддержку транзакций, в первую очередь в сценарии “чтение-обработка-написание”. При этом нужно принять меры против снижения пропускной способности системы.
В Kafka, если получатель не обрабатывает часть сообщений из-за сбоев техники и некорректного отслеживания смещения последнего полученного сообщения, все равно можно восстановить смещение этого сообщения (если такой случай обнаружен). В RabbitMQ соответствующие сообщения будут утеряны.
Kafka может повысить пользу от пакетирования благодаря своим возможностям по распределению пакетов, а в RabbitMQ пакетирование отсутствует в силу пассивной модели приёма, не препятствующей конфликтам получателей.

Source: https://habr.com/ru/post/437446/