Hello. Please help me understand. I have a ClickHouse cluster of 4 machines (2x2). I created a ReplicatedMergeTree table with a Distributed table on top of it. Inserts go into the ReplicatedMergeTree table, and selects are run against the Distributed table. The problem is that the number of rows written does not match what SELECT returns. I looked in the ClickHouse logs for the message "Wrote block with ID .... N Rows", and the row count there matches what I expected. If I replace ReplicatedMergeTree with MergeTree, the problem goes away. What could be the cause? Where should I look? Thanks.
1 answer
There are two possible reasons why this may occur.
- Data deduplication on insert.
Data blocks are deduplicated. If you write the same data block more than once (blocks of the same size, containing the same rows in the same order), the block is written only once. This exists so that after a network failure, when the client application cannot tell whether the data reached the database, it can simply repeat the INSERT query. It does not matter which replica the INSERTs with identical data are sent to. In other words, INSERTs are idempotent. This only works for the last 100 blocks inserted into the table.
In this case the log will contain the message:
Block with ID ... already exists; ignoring it
- Incorrect replication setup or cluster configuration. If the replicas cannot download data from each other, a corresponding message appears in the log. Attempts to download data can be found in the log by searching for "Fetching part ...".
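To make the first reason concrete, here is a minimal Python sketch of deduplication over a window of recent blocks. It is a toy model and not ClickHouse's actual code: the class name DedupWindow, the SHA-256 hashing of rows, and the eviction of the oldest ID are all assumptions made for illustration. Only the behaviour comes from the description above: the same rows in the same order are stored once, and only the last 100 blocks are remembered.

```python
import hashlib
from collections import OrderedDict


class DedupWindow:
    """Toy model of insert-block deduplication (hypothetical, for illustration)."""

    def __init__(self, window=100):
        self.window = window         # how many recent block IDs are remembered
        self.recent = OrderedDict()  # block ID -> None, oldest first
        self.rows_stored = 0         # rows that were actually written

    @staticmethod
    def block_id(rows):
        # Assumption: the ID is a hash over the rows in their given order,
        # so the same rows in a different order produce a different ID.
        h = hashlib.sha256()
        for row in rows:
            h.update(repr(row).encode("utf-8"))
            h.update(b"\n")
        return h.hexdigest()

    def insert(self, rows):
        """Return True if the block was written, False if it was ignored."""
        bid = self.block_id(rows)
        if bid in self.recent:
            return False  # duplicate of a recent block: nothing is written
        self.recent[bid] = None
        if len(self.recent) > self.window:
            self.recent.popitem(last=False)  # forget the oldest block ID
        self.rows_stored += len(rows)
        return True


table = DedupWindow()
block = [(1, "a"), (2, "b")]
first = table.insert(block)                 # True: new block
second = table.insert(block)                # False: identical block is ignored
third = table.insert([(2, "b"), (1, "a")])  # True: same rows, different order
```

After these three inserts the client has sent 6 rows, but table.rows_stored is 4. That is the kind of mismatch described in the question: repeating an identical block adds nothing to the table.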
- Yes, the problem is solved. The data really was being collapsed by deduplication. - hacker13ua