There is an Android application that uses an SQLite database, and a Java server that works with MySQL. I need to find out whether the data in the two databases is the same, and I decided to use checksums for this. How can this be implemented in Java?

    1 answer

    The answer is very high-level, because the task is quite extensive.

    This is a bad idea, but if you do ...

    There are no built-in mechanisms in the databases themselves.

    Since checksums are usually defined over sequences of bytes, your task is to obtain the same sequence of bytes from the same data in the two different databases. Each such sequence must uniquely identify the current state of the data.

    Hashing (computing the checksum) is, in this context, just a kind of lossy compression of that sequence. Because it is lossy, it cannot prove that the data is identical: the probability of a collision is very small, but not zero.

    Java has a MessageDigest class. You can create an instance and feed small byte arrays to its update method. These are pieces of the sequence described above, fetched from the database piece by piece (through a cursor or otherwise), so that the whole sequence never has to be held in memory.
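
    A minimal sketch of that incremental feeding, assuming SHA-256 as the algorithm and a hypothetical helper class StreamingDigest (neither is prescribed by the answer). The chunks here are plain byte arrays; in a real setup they would be produced row by row from a JDBC ResultSet or an Android Cursor.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class StreamingDigest {

    /** Hashes the chunks one at a time, so the full byte sequence is never held in memory. */
    static String sha256Hex(Iterable<byte[]> chunks) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
        for (byte[] chunk : chunks) {
            digest.update(chunk); // only the current chunk is in memory
        }
        byte[] hash = digest.digest();

        // Render the hash as lowercase hex so it is easy to send and compare.
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

    The result depends only on the concatenated bytes, not on how they are split into chunks, so the client and the server may read their data in batches of different sizes.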

    The algorithm for producing such a unique sequence leaves a lot of room for creativity, so I will not give finished code. But there are some pitfalls here that I think are worth mentioning:

    • Sort the data by primary key. The database will probably return it in that order anyway, but it is better to ask for it explicitly and protect yourself from surprises.
    • Check that the data encodings match.
    • The byte array must mark the boundaries of individual fields, records and tables unambiguously, so that similar but different data sets do not turn into identical arrays.
      • This can easily happen if, for example, you simply write the values of a row one after another (in effect concatenating them). This may work fine in practice, but it can break at the most inopportune moment.
    • Do not use "elastic" formats like JSON, for which different serializer implementations can produce different output for the same data. Or at least use the same, predictable implementation on both sides.
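
    One possible way to mark field boundaries is sketched below. It rests on several assumptions that are not part of the answer: the class RowHasher is hypothetical, every value is first rendered as a string in a form both sides agree on, strings are encoded as UTF-8, each field is prefixed with its byte length, and a length of -1 marks NULL. Rows are expected to be added in primary-key order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class RowHasher {
    private final MessageDigest digest;

    RowHasher() {
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Adds one row. Each field is length-prefixed, so ("ab", "c") and ("a", "bc") differ. */
    void addRow(String... fields) {
        digest.update(intBytes(fields.length));      // record boundary: number of fields
        for (String field : fields) {
            if (field == null) {
                digest.update(intBytes(-1));         // NULL marker, distinct from ""
            } else {
                byte[] utf8 = field.getBytes(StandardCharsets.UTF_8); // fixed encoding
                digest.update(intBytes(utf8.length)); // field boundary: byte length
                digest.update(utf8);
            }
        }
    }

    /** Returns the hash of everything added so far and resets the digest. */
    byte[] finish() {
        return digest.digest();
    }

    // Fixed-width, big-endian encoding of an int (ByteBuffer's default byte order).
    private static byte[] intBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }
}
```

    With plain concatenation the rows ("ab", "c") and ("a", "bc") would both yield the bytes of "abc". With the length prefixes they yield different byte sequences. Table boundaries could be marked in the same way, for example by adding the table name and row count as a row of their own before the data.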

    ... or it may be wiser to use a different data store, for example one designed from the start for separate databases that work independently and synchronize "sometime later." After all, you are most likely interested not only in checking that the data is identical, but also in making it identical.

    For example, Couchbase for Android, CouchDB for the server and PouchDB for the browser are largely compatible with one another. If you search, you may find others.