I am writing in PHP, and I have a real task of comparing two huge arrays of data. The catch is that these arrays are split across files ...

File A1 contains 5000 numbers, file A2 contains 5000 numbers, and so on. There are a lot of such files. All together this is an array of more than 10,000,000 numbers.

There are also similar files b1, b2, b3 ... with the same structure of 5000 numbers each. In fact it is the same array as in the A files, except that some numbers were removed from it and some were added.

So I need to compare these two arrays and get, as output, what was added and what was deleted.

Why can't I just build two arrays of 10,000,000 records and compare them with array_diff? Because they will not fit into memory, and in theory there can be even more data.

So I am looking for a solution based on some kind of recursive, piece-by-piece comparison.
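For reference, the straightforward approach being ruled out would look roughly like this (the file names, the glob pattern, and the one-number-per-line format are assumptions); with 10,000,000+ values the two arrays simply will not fit into a typical PHP memory_limit:

```php
<?php
// Naive baseline (for reference only): load both sets entirely and diff
// them in memory. This is exactly what does not scale here.
// Assumptions: files a*.txt / b*.txt, one value per line.

function loadSet(string $pattern): array
{
    $values = [];
    foreach (glob($pattern) as $file) {
        foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            $values[] = trim($line);
        }
    }
    return $values;
}

$a = loadSet('a*.txt');
$b = loadSet('b*.txt');

$deleted = array_diff($a, $b); // present in A but not in B
$added   = array_diff($b, $a); // present in B but not in A
```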

  • Are the values in the files integer or fractional? What is the spread of values, the minimum and maximum? Are there repeated numbers within a single set of files? If so, should the number of repeats in each set be taken into account when comparing? Is the distribution of values in a set close to normal? Is there any ordering of values within the files / between files? - Mike
  • No repetitions. There are no restrictions on the spread. Whether the data is sorted is not known. - Valentin Kruglikov
  • I ask because it matters for comparing them efficiently. If the distribution were more or less close to normal, you could repartition the input data into files, each holding a specific range of values. Such files are then trivially sorted. - Mike
  • If the distribution can be very far from normal, then some of those files may end up with too many values to fit into memory. Then you have to approach it from the other side: read the set's files in chunks of roughly 2/5 of the available RAM, sort each chunk and dump it to disk, then read the chunks back and merge-sort them (a sketch follows this comment thread). Which is quite a chore. - Mike
  • The goal is to get both sets sorted into files on disk. After that, reading them in small pieces, you can safely compare them by walking through both in parallel, as in a merge sort (see the second sketch below). - Mike
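A minimal sketch of the external sort described in the comments, assuming one integer per line; the chunk size, temp-file handling, and function names are hypothetical:

```php
<?php
// External sort: read a set's files in chunks that fit into memory, sort
// each chunk, dump it to a temporary "run" file, then merge the sorted
// runs into one sorted output file with a min-heap.

function sortSetToFile(array $inputFiles, string $outFile, int $chunkSize = 1000000): void
{
    // Phase 1: produce sorted runs on disk.
    $runs = [];
    $buffer = [];
    $flush = function () use (&$buffer, &$runs): void {
        if ($buffer === []) {
            return;
        }
        sort($buffer);
        $run = tempnam(sys_get_temp_dir(), 'run');
        file_put_contents($run, implode("\n", $buffer) . "\n");
        $runs[] = $run;
        $buffer = [];
    };

    foreach ($inputFiles as $file) {
        $fh = fopen($file, 'r');
        while (($line = fgets($fh)) !== false) {
            if (trim($line) === '') {
                continue;
            }
            $buffer[] = (int) trim($line); // integer values assumed
            if (count($buffer) >= $chunkSize) {
                $flush();
            }
        }
        fclose($fh);
    }
    $flush();

    // Phase 2: k-way merge of the sorted runs; the heap holds [value, run index].
    $heap = new SplMinHeap();
    $handles = [];
    foreach ($runs as $i => $run) {
        $handles[$i] = fopen($run, 'r');
        if (($line = fgets($handles[$i])) !== false) {
            $heap->insert([(int) trim($line), $i]);
        }
    }

    $out = fopen($outFile, 'w');
    while (!$heap->isEmpty()) {
        [$value, $i] = $heap->extract();
        fwrite($out, $value . "\n");
        if (($line = fgets($handles[$i])) !== false) {
            $heap->insert([(int) trim($line), $i]);
        }
    }
    fclose($out);
    foreach ($handles as $h) {
        fclose($h);
    }
    foreach ($runs as $run) {
        unlink($run);
    }
}

// Usage (file pattern is an assumption):
// sortSetToFile(glob('a*.txt'), 'a_sorted.txt');
// sortSetToFile(glob('b*.txt'), 'b_sorted.txt');
```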
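And a sketch of the parallel walk over the two sorted result files (file names are assumptions, produced e.g. by the sort above). Memory use stays constant, since only the current line of each file is held:

```php
<?php
// Merge-style comparison of two sorted files, one integer per line.
// Writes values only present in A to deleted.txt and values only
// present in B to added.txt.

$a = fopen('a_sorted.txt', 'r');
$b = fopen('b_sorted.txt', 'r');
$deleted = fopen('deleted.txt', 'w'); // in A, missing from B
$added   = fopen('added.txt', 'w');   // in B, missing from A

$readNum = function ($fh) {
    $line = fgets($fh);
    return $line === false ? null : (int) trim($line);
};

$x = $readNum($a);
$y = $readNum($b);

while ($x !== null || $y !== null) {
    if ($y === null || ($x !== null && $x < $y)) {
        fwrite($deleted, $x . "\n"); // only in A
        $x = $readNum($a);
    } elseif ($x === null || $y < $x) {
        fwrite($added, $y . "\n");   // only in B
        $y = $readNum($b);
    } else {                         // equal: present in both, skip
        $x = $readNum($a);
        $y = $readNum($b);
    }
}

fclose($a); fclose($b); fclose($deleted); fclose($added);
```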

1 Answer

Variant 1: Take A1 and B1, remove the identical items. Then add A2 to the remainder of A1 and B2 to the remainder of B1, compare again, remove the identical items again, and so on.

Variant 2 (faster):

  1. Take A1 and B1. Remove the identical items.

  2. Take A2 and B2. Remove the identical items.

  3. Compare the residue of A with the residue of B2, and vice versa the residue of A2 with the residue of B. Remove the identical elements. (This step is only needed if records could move between files.)

  4. Add the residue of A2 to the residue of A, and the residue of B2 to the residue of B.

  5. Repeat steps 2-4 in a loop for the remaining files ...

As a result, the residue of array A will contain the values that were deleted, and the residue of array B the values that were added.
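A minimal sketch of this variant in PHP, assuming hypothetical file names a1.txt / b1.txt, one value per line, no repeated values (as stated in the question), and that each pair of chunks plus the accumulated residues fits into memory:

```php
<?php
// Variant 2: diff each pair of files, cross-check against the accumulated
// residues, and carry the unmatched values forward.

function readChunk(string $file): array
{
    return array_map('trim', file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
}

$n = 2000;       // assumption: number of file pairs
$restA = [];     // values seen in A but not yet matched in B
$restB = [];     // values seen in B but not yet matched in A

for ($i = 1; $i <= $n; $i++) {
    $a = readChunk("a{$i}.txt");
    $b = readChunk("b{$i}.txt");

    // Steps 1-2: drop elements common to the current pair.
    $onlyA = array_diff($a, $b);
    $onlyB = array_diff($b, $a);

    // Step 3: cross-check against the accumulated residues
    // (needed if records may move between files).
    $newRestA = array_diff($restA, $onlyB);
    $newRestB = array_diff($restB, $onlyA);
    $onlyA    = array_diff($onlyA, $restB);
    $onlyB    = array_diff($onlyB, $restA);

    // Step 4: merge what is still unmatched into the residues.
    $restA = array_merge($newRestA, $onlyA);
    $restB = array_merge($newRestB, $onlyB);
}

// $restA now holds the deleted values, $restB the added ones.
```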

If the files can differ greatly, you can use additional heuristics for selecting the next file to compare. For example, first scan all the files and find the median of the numbers in each file, sort the files by that median in ascending order, and compare them in that order. You can also take the minimum and maximum number in each file into account when selecting files.
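A possible sketch of that ordering heuristic, where only one file is loaded at a time; the glob pattern, file layout, and numeric type are assumptions:

```php
<?php
// Order a set's files by the median of their values before pairing them
// for comparison. Only one file is held in memory at a time.

function medianOfFile(string $file): float
{
    $values = array_map('floatval', file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    sort($values);
    $n = count($values);
    // Average of the two middle elements (equal for odd counts).
    return ($values[intdiv($n - 1, 2)] + $values[intdiv($n, 2)]) / 2;
}

function orderByMedian(string $pattern): array
{
    $medians = [];
    foreach (glob($pattern) as $file) {
        $medians[$file] = medianOfFile($file);
    }
    asort($medians);              // sort files by ascending median
    return array_keys($medians);  // file names in processing order
}

$orderA = orderByMedian('a*.txt');
$orderB = orderByMedian('b*.txt');
```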

  • And what if the resulting residues do not fit into memory? Or take the simple worst case: in the A files the values are all ascending, in the B files descending. With a sequential comparison the residues of A and B will fill up entirely, and only after half of the files have been processed will values gradually start to be removed from them. As a result, this is equivalent to loading one of the arrays entirely into memory. - Mike
  • As far as I understood from the question, there are no such drastic rearrangements. In general, you need to understand the goal here: if the task is about a universal algorithm, that is one thing; if there are old files and their new versions (with minor changes) that need comparing, that is another ... In any case, the required memory is reduced at least by half. - Alex