Is it correct to determine the uniqueness of a file by a hash of the first 15 KB of its data?

Question

We check whether a file is unique by hashing the first 15 KB of its data. Is this the right approach?

UPDATE

All files are audio, photos, and video in various formats.

What is the probability that the first 15 KB of different images, videos, or audio files are the same? Or is that not enough, and you need to take both the first 15 KB and the last 15 KB of data?

Obviously, all files whose first 15 KB are the same will be considered the same.
And how do you know that at the 16th kilobyte there will be no difference?
That is exactly the question: what is the probability that the first 15 KB of different images, videos, or audio files are the same? Or do you need to take the first 15 KB and the last 15 KB of data?
It is non-zero (just as the probability that the hashes of different data match is non-zero).
If what you need is a fast check of whether a file is already among those downloaded, a Bloom filter can also help: it is a very compact and fast probabilistic structure.
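To illustrate the Bloom filter suggestion, here is a minimal sketch in Python. The class name, the bit-array size, the number of hash positions, and the use of SHA-256 with double hashing are all assumptions made for the example, not something the thread prescribes. The point it shows is the trade-off: a lookup can return a false positive ("probably seen"), but never a false negative, so a "not seen" answer is reliable.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        # Both parameters are illustrative defaults, not tuned values.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive all bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing) instead of running separate hash functions.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add(b"fingerprint-of-file-a")  # e.g. a hash of the file's data
print(b"fingerprint-of-file-a" in seen)  # an added item is always reported
```

Because a positive answer is only "probably", a hit in the filter would still need to be confirmed against the real store before a file is rejected as a duplicate.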

dSH · Answer 1 · 2019-03-19T12:16:53

In general, no. What if the files are, for example, text logs that keep being appended to? Or a novel that its author saves periodically as it grows? A definite answer is possible only for your specific task: perhaps it has some additional context that makes this approach acceptable.

Yes, there is: all files are audio, photos, and video in various formats.
Well, then you need to go through every format you accept and verify that, for example, a complete video and a copy of it truncated at the end differ within their first 15 KB.
Or, if resources allow, eliminate duplicates in a background job after the download (if the goal is to reduce the space used).
It depends on the task: when the uniqueness check runs and what it is used for.
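One way to combine the ideas from these comments is a two-stage check: use the file size plus a hash of the first 15 KB only as a cheap pre-filter, and confirm candidates with a hash of the whole file, for instance in a background job. The sketch below assumes SHA-256 and uses hypothetical function names (`prefix_key`, `full_hash`, `find_duplicates`); adding the file size to the key is also an assumption, chosen because it separates a complete file from a truncated copy even when their first 15 KB are identical.

```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 15 * 1024  # the 15 KB prefix discussed in the question


def prefix_key(path):
    """Cheap fingerprint: file size plus SHA-256 of the first 15 KB."""
    with open(path, "rb") as f:
        head = f.read(PREFIX_BYTES)
    return os.path.getsize(path), hashlib.sha256(head).hexdigest()


def full_hash(path, chunk_size=1 << 20):
    """SHA-256 of the whole file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of paths whose entire contents hash identically."""
    candidates = defaultdict(list)
    for p in paths:
        candidates[prefix_key(p)].append(p)

    duplicates = []
    for group in candidates.values():
        if len(group) < 2:
            continue  # unique prefix key: no full read needed
        by_full = defaultdict(list)
        for p in group:
            by_full[full_hash(p)].append(p)
        duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates
```

With this layout the expensive full read happens only for files that already collide on size and prefix, so the 15 KB hash speeds things up without being trusted as proof of equality.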
