I have a task: I need to quickly determine, for several trillion integers not exceeding 100,000^2, which of them are perfect squares. Simply put:

100 = 10^2 - suitable

3 - not suitable

28 - not suitable

49 = 7^2 - suitable

...

etc.

As part of the task I use OpenMP directives to distribute the work evenly across the processor cores, but even so the program's speed is not satisfactory. At the moment the construction I use looks like this:

    double y = sqrt(x);
    if (y == int(y)) i++;

I tried a lookup in a std::set, knowing that my numbers will never exceed 100,000^2:

    set<long long> mySet;
    for (long long j = 1; j <= 100000; j++) mySet.insert(j*j);
    if (mySet.find(x) != mySet.end()) i++;

But this method turned out to be many times slower than simply taking the square root. Tell me, is it possible to use language facilities to somehow speed up a single iteration, or to replace the square-root computation with something faster? I will be glad of any tips!

  • Several trillion numbers means several thousand seconds just to loop over them. - vegorov
  • 10**10/8.0/1024.0/1024.0/1024.0 = 1.1641532182693481 that is, instead of a tree you can get a bitmap of 1.16 GB in which the i-th bit will be 1 if i has an integer root and 0 if it does not. Then you will not need to look for logN in the tree, you only need to check the bit (conditionally - for O (1), although I don’t know how quickly the call will go to such a bit in the RAM) - vegorov
  • Answer: there are 100,000 of them - AlexGlebe
  • @AlexGlebe you could be given several trillion threes. The problem never says that the several trillion numbers include every number from 1 to 10^10. They could also be several trillion nines. - vegorov
  • habr.com/company/infopulse/blog/336110 - and this (or something like that) is no good? - Qwertiy

5 answers

There is a very effective way to speed up the computation: discarding obvious non-squares. Recalling school mathematics, note that the last n digits of a number's square depend only on the last n digits of the number itself. Hence the conclusion: if a number ends with 0, 1, 4, 5, 6 or 9, it may be the square of some number; if with 2, 3, 7 or 8, it is definitely not a square.

But the last digit alone screens out only 40%. That is not enough. With the last two digits (00 01 04 09 16 21 24 25 29 36 41 44 49 56 61 64 69 76 81 84 89 96) the screening rate is already 78%, which is very good: in roughly 4 out of 5 cases no square root needs to be computed at all. Going further, the last three digits screen out 84.1%, and four digits 89.56%, which is very good indeed: on random data this gives close to a tenfold speedup.

The table itself can be stored in a small array computed at program start. The last three digits are obtained with % 1000 .

  • The same idea with bytes makes it possible to do without the modulo operation (in hexadecimal it can be shown that a square can only end in 0x01, 0x04, 0x09, 0x10, 0x19, 0x24, etc.) - MBo
  • I tried it. The result is strange: the last byte gives 82% screening, 2 bytes give 83%, and 3 bytes also 83%. - KoVadim
  • Yeah, funny. By nibbles (hex): 1: 0.25; 2, 3 and 4: 0.17; 5: 0.08; 6: 0.006; 7: 0.00037; 8: 0.0000233 - i.e. 3 bytes (a 0xFFFFFF mask) can be relatively profitable - MBo
  • Yeah, these fractions depend on the range: I gave them for squares up to 10^10; for 10^12 the three-byte fraction changes to 0.055, and by 10^14 it converges to the same value of about 1/6. For decimal digits, by the way, it is the same story: 4 and 5 digits give about the same result, 6 is better. - MBo
  • if you take the last three digits, it doesn't matter whether you feed in numbers in the range 0..999 or 0..99999. - KoVadim

I would try my own set implementation based on a bitmask. Allocate ~1.2 GB of memory in one piece and set the 100,000 required bits in it.

Since the set will be very sparse, first check that the whole byte is nonzero; only if it is, check the required bit.

Processing 4 or 8 bytes at a time may well turn out faster than 1 byte at a time.

    To avoid wasting a lot of memory, build a Bloom filter.

    This is a data structure that lets you test whether an element belongs to a set (in this case, the set of squares 1, 4, 9, ..., 10^10).

    It has one peculiarity: the algorithm can report that an element is present in the set when in fact it is not, i.e. a false positive.

    In that case you have to verify by hand, taking the root. However, for the majority of numbers (the false-positive rate depends on the allocated memory) no such check is needed, since there are no false negatives.

    A Delphi example:

    var
      b: TSynBloomFilter;
      i, cnt: Integer;
      ii: Int64;
    begin
      // choose 5x the dictionary size; a larger size no longer
      // noticeably affects the number of collisions
      b := TSynBloomFilter.Create(500000);
      try
        // insert the squares up to 10^10
        for i := 1 to 100000 do
        begin
          ii := int64(i) * i;
          b.Insert(@ii, sizeof(ii));
        end;
        // count how many detections occur over the range 10^8
        // (true matches: 10000)
        cnt := 0;
        for i := 1 to 100000000 do
        begin
          ii := i;
          if b.MayExist(@ii, sizeof(ii)) then Inc(cnt);
        end;
        // two seconds later we see that only 11359 numbers
        // needed the manual check
        Memo1.Lines.Add(cnt.ToString);
      finally
        b.Free;
      end;
    end;

      The idea with std::set is good, but instead of std::set you should use a sorted array of squares, generated with a script; you should get something like:

       const int64_t n2[] = { 1, 4, 9, 16, 25, /* ... */ 10000000000LL };

      Then, using binary search with std::lower_bound(n2, n2+size, testValue), you will quickly find the desired value (or find that it is absent). Such an array occupies only 800 KB and fits easily in the processor's cache.

      • The same set, only with less overhead. Lookup is still O(log N), because std::set is a balanced binary search tree. - vegorov
      • The problem is that the tree lookup or the binary search in the array will be called several trillion times. We need to turn O(log N) into O(1) to get a real gain in speed - vegorov
      • @vegorov O(log N) vs O(1) is clear, but the main point here is that this array never leaves the processor's cache; that can be faster than O(1) access into 1.2 GB - ffk
      • Possibly. On an array of 100,000 items a binary search performs up to 17 (or 16?) iterations - vegorov
      • @vegorov maybe so; it has to be measured. For example, a bubble sort that never leaves the L1 cache beats any other sort. - ffk

      Try an integer square root. Maybe it will be faster, though I doubt it.

      #include <stdbool.h>
      #include <stdio.h>

      long isqrt(long num)
      {
          long res = 0;
          long bit = 1L << (sizeof(long)*8 - 2); // 2^62 for a 64-bit long

          while (bit > num)
              bit >>= 2;
          while (bit) {
              if (num >= res + bit) {
                  num -= res + bit;
                  res = (res >> 1) + bit;
              } else
                  res >>= 1;
              bit >>= 2;
          }
          return res;
      }

      bool isSquare(long x)
      {
          long y = isqrt(x);
          return y*y == x;
      }

      int main()
      {
          long x;
          fprintf(stdout, "print number ?");
          fscanf(stdin, "%ld", &x);
          bool res = isSquare(x);
          fprintf(stdout, "result = %s\n", res ? "true" : "false");
          return 0;
      }