I am mirroring files to the cloud with indirect mirroring: the local file structure does not match the structure on the server, so a "path -> path" mapping is stored in a database.
I want to keep traffic to a minimum. To do that, I need to detect moved files and simply record the change in the database, without re-uploading anything to the server.
For each file we know its hash, size, and modification time. Based on this data, we can scan the directory and get:
- file added
- file changed (+ old data for this path)
- file deleted (+ old data for this path)
From this, we need to figure out where a move happened.
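The scan step above can be sketched roughly like this (a minimal illustration, assuming each snapshot is a hypothetical dict mapping path -> (hash, size, mtime); the names are mine, not from any real sync tool):

```python
def diff_snapshots(old, new):
    """Classify each path as added, changed, or deleted.

    old, new: dicts mapping path -> (hash, size, mtime).
    """
    added, changed, deleted = {}, {}, {}
    for path, meta in new.items():
        if path not in old:
            added[path] = meta
        elif old[path] != meta:
            # Keep the old data for this path, as described above.
            changed[path] = (old[path], meta)
    for path, meta in old.items():
        if path not in new:
            # Keep the old data for this path as well.
            deleted[path] = meta
    return added, changed, deleted
```

This gives exactly the three lists described above; the move detection then works on top of them.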
Let's take a hard case right away. Here is a flat list of files, all distinct unless noted otherwise:

1, 2, 3, 4 (equal to file 1 in size, hash, and date, i.e. a duplicate under a different name), 5

Then the following renames happen:

- 5 -> 6
- 4 -> 5
- 2 -> 3
- 1 -> 2

The scan reports these statuses:

- 1 [deleted] (1 -> 2)
- 2 [changed] (1 -> 2)
- 3 [changed] (2 -> 3)
- 4 [deleted] (4 -> 5)
- 5 [changed] (4 -> 5) (it is now a duplicate of file 2)
- 6 [added] (5 -> 6)

Thinking it through, the general detection algorithm is:
- Go through the deleted files and check whether each one appears among the added or changed files. If yes, the file was moved, and the "delete + add" pair should be replaced by a single "move" in the database.
- Go through the changed files and check whether the old content of each path turned up somewhere else; if it did not, the file at that path was simply overwritten, and its old content gets the status "deleted".
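A sketch of those two passes, under the same assumptions as before (files identified by a hypothetical key `(hash, size, mtime)`; this is my illustration of the steps above, not a collision-safe solution):

```python
def detect_moves(added, changed, deleted):
    """Turn matching delete+add (or delete+change) pairs into moves.

    added:   path -> key            where key = (hash, size, mtime)
    changed: path -> (old_key, new_key)
    deleted: path -> key
    Returns (moves, remaining_deleted, remaining_added).
    """
    moves = {}
    remaining_deleted = dict(deleted)
    remaining_added = dict(added)
    # Pass 1: a deleted file that reappears among the added files is a move.
    for old_path, key in deleted.items():
        for new_path, new_key in list(remaining_added.items()):
            if new_key == key:
                moves[old_path] = new_path  # replace delete+add with a move
                del remaining_added[new_path]
                del remaining_deleted[old_path]
                break
    # Pass 2: a deleted file may also reappear as the *new* content of a
    # changed path (an overwriting rename, like 1 -> 2 in the example).
    for old_path, key in list(remaining_deleted.items()):
        for new_path, (_, new_key) in changed.items():
            if new_key == key and new_path not in moves.values():
                moves[old_path] = new_path
                del remaining_deleted[old_path]
                break
    return moves, remaining_deleted, remaining_added
```

With duplicates, pass 2 picks candidates greedily in dictionary order, which is exactly where the collision problem described below appears.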
But there are collisions where we cannot tell exactly where a file was moved to (the duplicates in the example above), and in such cases something will inevitably have to be re-uploaded. I cannot come up with an algorithm for handling these collisions so that, after all the manipulations, the result is always correct.
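One observation that may help: if several deleted and several added paths share the same content key, the files are byte-identical, so pairing them in any order still yields a byte-correct mirror; only surplus copies with no counterpart need uploading. A sketch of that grouping idea (my suggestion, not the asker's algorithm; `resolve_with_collisions` and the positional pairing are assumptions):

```python
from collections import defaultdict

def resolve_with_collisions(added, deleted):
    """Group added/deleted files by content key and pair them positionally.

    Because files sharing a key are identical, any pairing inside a group
    produces a correct mirror; only unmatched added copies need uploading.
    """
    by_key_deleted = defaultdict(list)
    by_key_added = defaultdict(list)
    for path, key in deleted.items():
        by_key_deleted[key].append(path)
    for path, key in added.items():
        by_key_added[key].append(path)
    moves, reupload = {}, []
    for key, new_paths in by_key_added.items():
        old_paths = sorted(by_key_deleted.get(key, []))
        new_paths = sorted(new_paths)
        # Identical content: pair positionally; any bijection is correct.
        for old_p, new_p in zip(old_paths, new_paths):
            moves[old_p] = new_p
        # Added copies with no deleted counterpart must be uploaded
        # (or server-side copied, if the backend supports it).
        reupload.extend(new_paths[len(old_paths):])
    return moves, reupload
```

Leftover deleted paths (more old copies than new) then simply become deletes on the server.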