Find two identical items in IEnumerable

Question

There is a method that returns a large number of items.

static IEnumerable<double> MultiplicatedNumberSequence(long n, long a, long beta, long m) { for (long i = 1; i <= n; i++) { a = (beta * a) % m; yield return (double)a / m; } }

The number of elements n = 2 ^ 31. The method returns 2 ^ 31 degrees of elements. a equals the remainder of dividing by m, so a is bounded and integer. So if you take more than 2 ^ 31 elements among them there will be the same. There are too many elements to drive everything into an array or sort. How can you quickly (without expecting hours and days) find in the collection two identical elements, their indices and the difference between the indices of two identical elements? I can not think of an adequate method. To drive into the List <> does not go out, there is not enough memory, but to take the elements in order and compare them with the remaining suicide, because the number of iterations will be (2 ^ 31) ^ 2 = 2 ^ 62.

Even the standard implementation of Enumerable.Distinct uses Set to store unique elements, i.e.
In the order of delirium: drive the sequence into the database and make select distinct.
A simple method returns exactly m unique values: 0/m , 1/m , ..., (m-1)/m , since
So if n> m, then in the collection there will always be identical elements.
More precisely, the variable a is limited to M and the method returns a / m.

Pavel Mayorov Pavel Mayorov 48.9k five 50 110 · Answer 1 · 2016-04-29T14:47:41

Apparently, this will not happen quickly with the above sequence - getting IEnumerable<double> as input will do very little.

However, if the sequence parameters are known, this problem can be approached analytically, without generating the sequence itself!

Your sequence, if you do not divide it by 0, is the same a * beta^i % m .

If beta is mutually simple with m, then, as follows from Fermat’s small theorem,

 beta ^ (φ(m) - 1) % m = 1

In other words, the cycle is always there, it begins with the first element, and its length must be a divisor φ(m) - 1 , where φ is the Euler function.

Now that if beta is not mutually simple with m, then their common factor, once inside a, no longer “leaves” this variable. The loop has a prefix (it does not start with 1).

Thus, the criterion for the start of a cycle is the fact that gcd(a, m) = gcd(a*beta, m) , where gcd is the greatest common divisor:

 // Алгоритм Евклида int gcd(int a, int b) { while (b > 0) { var c = a % b; a = b; b = c; } return a; } int first = 0; while (gcd(a, m) != gcd(a*beta, m)) { a = (a * beta) % m; first++; }

In principle, knowledge of the first element of the cycle is enough to find its repetition in a reasonable time. But if you want to dig deeper, you can go further. Now you can make an equivalent transformation of the remaining sequence, reducing a and m by their GCD - this does not change the length of the cycle, but allows you to use the formula:

 int d = gcd(a, m); m /= d; a /= d;

Now it remains to go through the dividers φ(m) - 1 and find among them the length of the cycle (for example, using the fast exponentiation algorithm). Here, the easiest way is to start with the obviously suitable length φ(m) - 1 and try to divide it into all simple dividers smaller than square root, each time checking the cycle using quick exponentiation.

The total running time is O (M ^ 1 * log M), that is, a time on the order of 2 ^ 21. In the special case when M is a power of two, it is still much simpler and only O ((log M) ^ 2) remains (and knowing in advance that M is a power of two, you can write the algorithm quite simply, dispensing with even the Euclidean algorithm).

I am interested in at least one of these 2 answers will help a person who will get here from Google?)

Answer 2 · 2016-04-29T13:41:02

Update: The function will be periodic from the very beginning only for the case when beta and m not coprime. To study the period, it is better to return from the MultiplicatedNumberSequence function not a converted double , but the number a . Apparently, while the current member of the sequence increases the GCD with m . Therefore, to pass the "prefix" do the following:

 static long gcd(long a, long b) { while (a != 0 && b != 0) { var temp = b; b = a % b; a = temp; } return a + b; } long prevgcd = -1; var seq = MultiplicatedNumberSequence(...) .SkipWhile(v => prevgcd != (prevgcd = gcd(v, m)));

The remaining “tail” of the sequence is periodic, starting from the beginning, so you can simply remember the first element and scroll until it repeats. However, this essentially coincides with the @Pavel Mayorov method outlined in another answer.

The classic method is “hare and tortoise”. You run the sequence twice, jumping over one element for the first time, following all the elements in a row for the second, and comparing them.

 var seq = MultiplicatedNumberSequence(...); using (var tortoise = seq.GetEnumerator()) using (var hare = seq.GetEnumerator()) { while (true) { // продвигаем первый обход на шаг if (!tortoise.MoveNext()) break; // а второй на два if (!hare.MoveNext() || !hare.MoveNext()) break; if (hare.Current != tortoise.Current) continue; // если вы тут, вы нашли повторяющиеся значения // делайте с ними что хотите // например, вы можете завести счётчик и подсчитать индексы } }

Literature: https://en.wikipedia.org/wiki/Cycle_detection#Tortoise_and_hare (there are other methods as well).

Since you are comparing real numbers, it is possible that a comparison of the type a != b should be replaced with Math.Abs(b - a) > eps .

This works for the case of a sequence ( x _n ) calculated by the rule x _k = f ( x _{k - 1} ) for the invertible function f .

Why does this algorithm work? Let us have for some k , m , x _k = x _{k + m} . Applying f in the opposite direction, we obtain x _{k - 1} = x _{k + m - 1} , ..., x ₀ = x _m . Hence x ₀ = x _m = x _{2‌ m} . This means that there will be a minimal cycle when the hare is 2 будет m and the turtle is m .

Comments are not intended for extended discussion; conversation moved to chat . - Nick Volynkin ♦

Answer 3 · 2016-04-29T16:36:55

A special algorithm for the case when `m` is a prime power

This algorithm works with the generated sequence without knowing the values of its parameters.

If beta is divisible by a prime number, then the sequence will quickly end with a series of zeros. If beta not divisible by it, then the sequence is cyclic.

Well, plus since the sequence is finite - it can end earlier than it happens. No more options.

Knowing this, you can write an algorithm, where these options will be tested:

 var seq = MultiplicatedNumberSequence(...).GetEnumerator(); const double eps = 1e-10; double firstElement = 0; int first = -1, second = -1; using (seq as IDisposable) { for (int i=1; seq.MoveNext(); i++) { if (Math.Abs(seq.Current) < eps) { // Достигли нуля - дальше будут только они first = i; second = i+1; break; } if (i == 1) firstElement = seq.Current; else if (Math.Abs(seq.Current - firstElement) < eps) { // Цикл замкнулся first = 1; second = i; break; } } } // Выход алгоритма: индексы совпадающих элементов - first и second.

The same will be in theory when replacing 2 with any prime number.
If m = p^k , then (1) if beta is divisible by p , the sequence degenerates to zero, (2) if beta not divisible by p , the sequence is periodic.

Find two identical items in IEnumerable

3 answers 3

A special algorithm for the case when `m` is a prime power

More articles:

Find two identical items in IEnumerable

3 answers 3

A special algorithm for the case when m is a prime power

More articles:

A special algorithm for the case when `m` is a prime power