We check whether a file is unique using this method. Is this the right approach?

UPDATE

All files are audio, photo and video of various formats.

What is the probability that the first 15 KB of different image, video and audio files will be the same? Or is that not enough, and should we take both the first 15 KB and the last 15 KB of data?

  • And what is the criterion for the "correct" approach? Obviously, all files whose first 15 KB are the same will be treated as the same file. But maybe that is exactly what you need. Or perhaps such files simply cannot occur in your context. - tum_
  • And how do you know that at the 16th kilobyte there will be no difference? - andreymal
  • That's exactly the question: what is the probability that the first 15 KB of different image, video and audio files are the same? Or do we need to take the first 15 KB and the last 15 KB of data? - Ivan_Aka
  • It is non-zero (just as there is a non-zero probability that the hashes of different data match). Hardly anyone can give a more precise figure. If what you need is a fast check of whether a file is already among those downloaded, a Bloom filter can also help: it is a very compact and fast probabilistic structure. - dSH
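
To make the "first 15 KB plus last 15 KB" idea from the comments concrete, here is a minimal sketch. The function name `partial_fingerprint`, the choice of SHA-256, and mixing the file size into the hash are all assumptions for illustration, not something stated in the question. The result is still only a probabilistic identifier: two files with the same size, head and tail but a different middle get the same fingerprint.

```python
import hashlib
import os

CHUNK = 15 * 1024  # 15 KB, the sample size discussed in the question


def partial_fingerprint(path, chunk=CHUNK):
    """Hash the file size plus the first and last `chunk` bytes.

    Hypothetical helper: equal fingerprints do NOT prove equal files.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    # Assumption: the size is mixed in so truncated or appended copies differ.
    h.update(str(size).encode("ascii") + b":")
    with open(path, "rb") as f:
        h.update(f.read(chunk))  # head
        if size > chunk:
            # Tail, skipping any bytes already hashed as part of the head.
            f.seek(max(chunk, size - chunk))
            h.update(f.read(chunk))
    return h.hexdigest()
```

A matching fingerprint is best treated as "probably a duplicate, compare the full content to be sure" rather than as proof.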

1 answer

In general, no. The files might turn out to be text logs that keep getting appended to, for example. Or a novel that its author saves periodically while writing. But only you can give a definite answer for your specific task: you may have some additional context that makes this approach acceptable.
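
The appended-log case can be shown in a few lines. This is only an illustration with made-up log data; the helper name `prefix_digest` and the use of SHA-256 are assumptions.

```python
import hashlib

PREFIX = 15 * 1024  # the 15 KB prefix from the question


def prefix_digest(data: bytes, n: int = PREFIX) -> str:
    """Hash only the first `n` bytes (hypothetical helper)."""
    return hashlib.sha256(data[:n]).hexdigest()


# Made-up log content: 27,000 bytes, so longer than the 15 KB prefix.
log_v1 = b"2024-01-01 service started\n" * 1000
# The same log after one more line has been appended.
log_v2 = log_v1 + b"2024-01-02 new entry\n"

same_prefix = prefix_digest(log_v1) == prefix_digest(log_v2)
same_content = hashlib.sha256(log_v1).digest() == hashlib.sha256(log_v2).digest()
print(same_prefix, same_content)  # prints: True False
```

The two versions are different files, yet a check based on the first 15 KB alone treats them as the same.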

  • Yes, there is: all the files are audio, photos and videos of various formats. - Ivan_Aka
  • Well, then you just need to go through all the possible formats and check that, for example, a complete video and a copy truncated at the end differ in their first 15 KB. Or, if resources allow, you can eliminate duplicates in the background after the download (if the goal is to reduce the space used). It depends on the task: when and for what the uniqueness check is used. - dSH
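
The background duplicate elimination mentioned in the last comment could look roughly like the sketch below. The names `full_digest` and `find_duplicates` are hypothetical, and SHA-256 over the whole file is an assumed choice; grouping by size first is just a cheap way to avoid hashing files that cannot have a duplicate.

```python
import hashlib
import os
from collections import defaultdict


def full_digest(path, block_size=1 << 20):
    """SHA-256 of the whole file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups (lists) of paths whose full-content digests match."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for p in same_size:
            by_digest[full_digest(p)].append(p)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Because this runs after the download, it does not slow down the upload path, at the cost of storing duplicates temporarily.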