I am building a mirror to the cloud with indirect mirroring: the local file structure does not match the file structure on the server, so a "path -> path" correspondence is stored in a database.

I want to keep traffic to a minimum. To do that, I need to detect moved files and simply record the change in the database, without re-uploading anything to the server.

For each file we know its hash, size, and modification time. Based on this data, a scan of the directory can report:

  • file added
  • file changed (+ old data for this path)
  • file deleted (+ old data for this path)

From this, I need to work out where files were moved.
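For concreteness, here is a minimal sketch in Python of what such a scan diff might look like; the Meta fields and the diff function are my own illustration, not part of the actual tool:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Meta:
        hash: str    # content hash (e.g. SHA-1)
        size: int
        mtime: float

    def diff(prev, curr):
        """Compare the previous snapshot (from the database) with the current scan.

        Both arguments map local path -> Meta.  Returns three dicts:
          added   - path is new                      {path: new Meta}
          changed - same path, different content     {path: (old Meta, new Meta)}
          deleted - path disappeared                 {path: old Meta}
        """
        added   = {p: m for p, m in curr.items() if p not in prev}
        deleted = {p: m for p, m in prev.items() if p not in curr}
        changed = {p: (prev[p], curr[p])
                   for p in prev.keys() & curr.keys()
                   if prev[p] != curr[p]}
        return added, changed, deleted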

Let's take a hard case right away. Here is a list of files, all different unless noted otherwise:

 1
 2
 3
 4 (equal to file 1 in size, hash, and date, i.e. a duplicate under a different name)
 5

The following renames are applied:

 5 -> 6
 4 -> 5
 2 -> 3
 1 -> 2

The scan then reports these file statuses:

 1 [deleted] (1 -> 2)
 2 [changed] (1 -> 2)
 3 [changed] (2 -> 3)
 4 [deleted] (4 -> 5)
 5 [changed] (4 -> 5) (now a duplicate of file 2)
 6 [added]   (5 -> 6)
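Feeding this example through the diff sketch above (with made-up single-letter hashes; a rename keeps size and mtime) reproduces exactly those statuses:

    # toy hashes: file 4 is a byte-for-byte duplicate of file 1, so it shares hash "a"
    old = {"1": Meta("a", 10, 1.0), "2": Meta("b", 10, 1.0), "3": Meta("c", 10, 1.0),
           "4": Meta("a", 10, 1.0), "5": Meta("d", 10, 1.0)}
    # state after the renames 5 -> 6, 4 -> 5, 2 -> 3, 1 -> 2
    new = {"2": Meta("a", 10, 1.0), "3": Meta("b", 10, 1.0),
           "5": Meta("a", 10, 1.0), "6": Meta("d", 10, 1.0)}

    added, changed, deleted = diff(old, new)
    # added   == {"6": Meta("d", ...)}                      -> 6 [added]
    # changed == {"2": (b, a), "3": (c, b), "5": (d, a)}    -> 2, 3, 5 [changed]
    # deleted == {"1": Meta("a", ...), "4": Meta("a", ...)} -> 1, 4 [deleted]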

If you think about it, the general detection algorithm looks like this:

  1. Go through the deleted files and check whether the same content appears among the added or changed files. If it does, the file was moved, and the "delete + add" pair should be replaced by a single move (see the sketch after this list).
  2. Go through the changed files and check whether their old content has been given a new path anywhere. If it has not, the old content was simply overwritten, and it gets the status "deleted".
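A hedged sketch of these two steps, continuing the diff code above (the function name and the rule of taking the first candidate are mine; picking a candidate arbitrarily is precisely where the collision problem lives):

    def detect_moves(added, changed, deleted):
        """Steps 1-2: pair content that vanished with content that reappeared, by hash."""
        # candidate destinations, indexed by the hash of their *new* content
        targets = {}
        for path, meta in added.items():
            targets.setdefault(meta.hash, []).append(path)
        for path, (_old, new) in changed.items():
            targets.setdefault(new.hash, []).append(path)

        moves, gone = [], []

        def claim(source_path, content_hash):
            """Try to pair content that vanished from source_path with a destination."""
            candidates = targets.get(content_hash)
            if candidates:
                # with duplicates there can be several candidates; taking the first
                # is an arbitrary choice - exactly the collision the question is about
                moves.append((source_path, candidates.pop(0)))
            else:
                gone.append(source_path)   # the content reappears nowhere

        # step 1: deleted paths - their content disappeared from that location
        for old_path, old_meta in deleted.items():
            claim(old_path, old_meta.hash)

        # step 2: changed paths - their *old* content also disappeared; if it was
        # not given a new path anywhere, it was simply overwritten ("deleted")
        for path, (old, _new) in changed.items():
            claim(path, old.hash)

        # destinations that no source claimed carry genuinely new data: upload them
        uploads = [p for paths in targets.values() for p in paths]
        return moves, uploads, gone

On the example above this returns the moves (1 -> 2), (4 -> 5), (2 -> 3), (5 -> 6), reports the old content of path 3 as gone, and needs no uploads at all - but only because the arbitrary first-candidate choice happened to match the actual renames.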

But there are collisions where we cannot tell exactly where a file was moved to (the duplicates in the example), and in those cases something will inevitably have to be re-uploaded. I cannot come up with an algorithm for handling these collisions that always ends up with a correct result after all the manipulations.

    1 answer

    If you rely on the hash completely - that is, assume that no two different files can ever have the same hash - then the database only needs to maintain

     (hash, path in the cloud, (paths on disk (there may be several)))

    With each scan, a simple database update is enough. Does this hash already exist? Register the new paths and drop the old ones. It doesn't? Then it's a new file. Is there a hash that did not show up in the scan at all? Delete it from the database - perhaps into a recycle bin (in case the file suddenly reappears on disk).
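    A minimal sketch of that per-scan update, assuming the database is just an in-memory dict keyed by hash; upload() and recycle() are hypothetical placeholders for the real cloud calls:

        def upload(local_path):
            """Placeholder: push the file to the cloud, return the cloud path it got."""
            return "/cloud/" + local_path

        def recycle(cloud_path):
            """Placeholder: move the cloud object to a recycle bin instead of deleting."""
            pass

        def sync_scan(db, scan):
            """db:   {hash: {"cloud_path": str, "local_paths": set}}
               scan: {local_path: hash} - the current state of the disk."""
            by_hash = {}
            for path, h in scan.items():
                by_hash.setdefault(h, set()).add(path)

            for h, paths in by_hash.items():
                if h in db:
                    db[h]["local_paths"] = paths            # known content: no traffic at all
                else:
                    cloud_path = upload(next(iter(paths)))  # genuinely new content
                    db[h] = {"cloud_path": cloud_path, "local_paths": paths}

            for h in list(db):
                if h not in by_hash:                        # hash vanished from the disk
                    recycle(db[h]["cloud_path"])
                    del db[h]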

    All that remains is to use a hash that has no collisions - but isn't something like plain SHA-1 enough in practice?
    Yes, I know, it has been broken... but do you really fear that? :)
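    If SHA-1 still feels uncomfortable, switching to SHA-256 is a one-argument change; a small sketch of chunked file hashing:

        import hashlib

        def file_hash(path, algo="sha1"):
            """Hash a file in chunks so large files never have to fit in memory."""
            h = hashlib.new(algo)                 # e.g. "sha1" or "sha256"
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()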

  • The main focus is on "saving traffic". Duplicates are the norm, and 2+ duplicates can be moved at once, in which case the algorithm cannot build the correct source -> target chains. I keep having the persistent thought "what difference does 'correct' make if the database still ends up storing the final state (duplicates included) and the sources disappear", but my mind is so fixated on minimizing network operations that I cannot formulate the steps of a complete algorithm and convince myself that there will be no problems in every possible scenario - vitidev
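A toy check of that intuition (my own illustration, not from the thread): when several duplicates are moved, any pairing of old paths to new paths leads to the same final set of paths, so a "wrong" chain only differs in which rename gets recorded, not in the end state:

    # hash "a" used to live at paths {"1", "4"} and now lives at {"2", "5"}
    old_paths, new_paths = {"1", "4"}, {"2", "5"}

    # two different chains a move detector might pick for the same duplicates
    chain_x = {("1", "2"), ("4", "5")}
    chain_y = {("1", "5"), ("4", "2")}

    def apply_moves(paths, chain):
        sources = {src for src, _ in chain}
        destinations = {dst for _, dst in chain}
        return (paths - sources) | destinations

    assert apply_moves(old_paths, chain_x) == new_paths
    assert apply_moves(old_paths, chain_y) == new_paths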