optimize calculations

Question

in the implementation of this task, faced with the problem of slow computing

First I decided to speed up log10 - it did not help, although it calculates faster than log10 from the standard library - (see the comparison in the questionnaire), but using the degree table 10 was not very good ideas, although bin search should have affected ...

In general, I can not understand what else and most importantly how to optimize in addition to calculating the decimal logarithm

 #include <iostream> #include <vector> #include <math.h> #include <stdint.h> const long double _10_POWERS[40] = { 1e+0, 1e+1, 1e+2, 1e+3, 1e+4, 1e+5, 1e+6, 1e+7, 1e+9, 1e+10, 1e+11, 1e+12, 1e+13, 1e+13, 1e+14, 1e+15, 1e+16, 1e+17, 1e+18, 1e+19, 1e+20, 1e+21, 1e+22, 1e+23, 1e+24, 1e+25, 1e+26, 1e+27, 1e+28, 1e+29, 1e+30, 1e+31, 1e+32, 1e+33, 1e+34, 1e+35, 1e+36, 1e+37, 1e+38, 1e+39 }; static inline uint32_t log10_fast(long double x) { //uint32_t res = 0; int l = 0, r = 40 - 1; while (l <= r) { int mid = l + ((r -l) >> 1); if ( x >= _10_POWERS[mid] && x < _10_POWERS[mid + 1] ) { return mid; } if (x >= _10_POWERS[mid]) l = mid; else r = mid; } return 0; }; uint32_t compute(int n, std::vector<std::vector<uint32_t>>& a) { long double x = 0.0f; uint32_t s = 0; const long double _2_96 = pow(2, 96); const long double _2_64 = pow(2, 64); const long double _2_32 = pow(2, 32); for (int i = 0; i < n; ++i) { for (int j = i + 1; j < n; ++j) { x = _2_96 * (a[i][0] ^ a[j][0]); x += _2_64 * (a[i][1] ^ a[j][1]); x += _2_32 * (a[i][2] ^ a[j][2]); x += (a[i][3] ^ a[j][3]); s += log10_fast(x); } } return 2 * s; } int main(int argc, char const *argv[]) { int n; std::cin >> n; std::vector<std::vector<uint32_t>> a(n, std::vector<uint32_t>(4)); for (int i = 0; i < n; ++i) { for (int j = 0; j < 4; ++j) { std::cin >> a[i][j]; } } std::cout << compute(n, a) << '\n'; return 0; }

What are some ideas about optimizations here?

Maybe there is another faster way to calculate log10 ? Or maybe it's not logarithm?

ps

comparison log10 and log10_fast

 uint32_t s = 0; high_resolution_clock::time_point t1 = high_resolution_clock::now(); for (unsigned i = 0; i < 1e+8; ++i) { s += log10( static_cast<long double>(rand()) ); } high_resolution_clock::time_point t2 = high_resolution_clock::now(); duration<double> dur = duration_cast<milliseconds>( t2 - t1 ); std::cout << dur.count() << '\n'; // 6.374 sec s = 0; t1 = high_resolution_clock::now(); for (unsigned i = 0; i < 1e+8; ++i) { s += log10_fast( static_cast<long double>(rand()) ); } t2 = high_resolution_clock::now(); dur = duration_cast<milliseconds>( t2 - t1 ); std::cout << dur.count() << '\n'; // 5.907 sec

I am plagued by vague doubts about the accuracy of calculations ... After all, there may be such an option that the number will be more / less border, and will be taken as the boundary ...
if we are talking about the number x in any case, accuracy, I think that is not critical here since the logarithm should give the same number for a whole set of numbers, for example, lying in the range 10^5 до 99^5 - the number 5, and as for speed operations with integers over floating-point numbers, then yes, it is understandable faster, but that is another question
Well, think for yourself - imagine what you have in bits - exactly to some extent.
Are you sure that the logarithm will not return a value 1 less, for example? ...
@Harry it depends in what direction to round, I think, I do not understand what is the relationship between the accuracy of calculations and performance?
@Harry, in any case, with my "not accurate calculations" of incorrect results, I did not see launching this solution on the site

Accepted Answer · 2017-05-21T23:10:57

It seems to me that the main problem is that you call the log10 function many times. Let's write the sum of logarithms as the logarithm of the product:

If the numbers are few, so that their product does not cause overflow, then you can count.

Another problem is the use of fractional numbers, they are not multiplied as quickly as integers. I want to offer an approximate solution in integers. Note that if, for example, a[i][1] xor a[j][1] non-zero, then a[i][3] xor a[j][3] and a[i][4] xor a[j][4] can not be considered, since their addition to xor will be very small. Consider the following algorithm:

For each pair (i, j) , we approximately estimate A_i xor A_j as x * 2^k , where x, k some integers, and 0 <= x < 2^32 .
Multiply the obtained values as follows: (x1 * 2^k1) * (x2 * 2^k2) = (x1 * x2) * 2^(k1+k2) = x * 2^(k1+k2) , with 0 <= x <= 2^64 . We represent x as x=y * 2^m' , with 0 <= y < 2^32 . So, (x1 * 2^k1) * (x2 * 2^k2) = y * 2^(k1+k2+m)
In fact, we calculated not the decimal logarithm, but the binary one, in order to get the decimal logarithm, we need to multiply by log_10 (2).

Actually, the code:

 #include <iostream> #include <vector> #include <cassert> #include <cmath> #include "/home/dima/C++/debug.h" using namespace std; double compute(const vector<uint64_t[2]> &a) { static const uint64_t two_power_32 = 1ull << 32; int n = a.size(); // текущий накопленный результат равен value * 2^power_index uint64_t value = 1; int power_index = 0; for (int i = 0; i < n; ++i) { for (int j = 0; j < i; ++j) { uint64_t xor1 = a[i][0] ^a[j][0]; uint64_t xor2 = a[i][1] ^a[j][1]; uint64_t value_current; if (xor1 == 0) { value_current = xor2; } else if (xor1 >= two_power_32) { value_current = xor1; power_index += 64; } else { assert(0 <= xor1 && xor1 < two_power_32); value_current = (xor1 << 32) + (xor2 >> 32); power_index += 32; } while (value_current >= two_power_32) { value_current /= 2; ++power_index; } assert(0 <= value_current && value_current < two_power_32); assert(0 <= value && value < two_power_32); value *= value_current; while (value >= two_power_32) { value /= 2; ++power_index; } } } // result = log10(value * 2^power_index) // result = log10(value) + log10(2^power_index) // result = log10(value) + power_index * log10(2) double result = log10(value) + power_index * log10(2); return result * 2; } int main() { freopen("input.txt", "r", stdin); int n; cin >> n; vector<uint64_t[2]> a(n); for (int i = 0; i < n; ++i) { for (int j = 0; j < 2; ++j) { uint32_t ai1, ai2; cin >> ai1 >> ai2; a[i][j] = (uint64_t(ai1) << 32) + ai2; } } cout << compute(a) << endl; return 0; }

Unfortunately, I didn’t compare performance, but I truly believe that it works faster than n^2 times to calculate log10 .

Update: I tested it here, with n=5000 my implementation is slightly slower than your original one. It's all about these cycles:

 while (value >= two_power_32) { value /= 2; ++power_index; }

They can be rewritten in different ways, here’s an option for GCC:

 static const uint64_t two_power_32 = 1ull << 32; inline void divide_until_less_then_two_power_32(uint64_t &value, int &power_index) { // Эта функция эквивалентна этим строчкам: // while (value >= two_power_32) { // value /= 2; // ++power_index; // } if (value < two_power_32) { return; } int power_index_delta = 32 - __builtin_clzll(value); power_index += power_index_delta; value >>= power_index_delta; assert(0 <= value && value < two_power_32); }

Full code:

 #include <iostream> #include <vector> #include <cassert> #include <cmath> using namespace std; #include "/home/dima/C++/debug.h" static const uint64_t two_power_32 = 1ull << 32; inline void divide_until_less_then_two_power_32(uint64_t &value, int &power_index) { // Эта функция эквивалентна этим строчкам: // while (value >= two_power_32) { // value /= 2; // ++power_index; // } if (value < two_power_32) { return; } int power_index_delta = 32 - __builtin_clzll(value); power_index += power_index_delta; value >>= power_index_delta; assert(0 <= value && value < two_power_32); } double compute(const vector<uint64_t[2]> &a) { int n = a.size(); // текущий накопленный результат равен value * 2^power_index uint64_t value = 1; int power_index = 0; for (int i = 0; i < n; ++i) { for (int j = 0; j < i; ++j) { uint64_t xor1 = a[i][0] ^a[j][0]; uint64_t xor2 = a[i][1] ^a[j][1]; uint64_t value_current; if (xor1 == 0) { value_current = xor2; } else if (xor1 >= two_power_32) { value_current = xor1; power_index += 64; } else { assert(0 <= xor1 && xor1 < two_power_32); value_current = (xor1 << 32) + (xor2 >> 32); power_index += 32; } divide_until_less_then_two_power_32(value_current, power_index); assert(0 <= value_current && value_current < two_power_32); assert(0 <= value && value < two_power_32); value *= value_current; divide_until_less_then_two_power_32(value, power_index); } } // result = log10(value * 2^power_index) // result = log10(value) + log10(2^power_index) // result = log10(value) + power_index * log10(2) double result = log10(value) + power_index * log10(2); return result * 2; } int main() { freopen("input.txt", "r", stdin); int n; cin >> n; vector<uint64_t[2]> a(n); for (int i = 0; i < n; ++i) { for (int j = 0; j < 2; ++j) { uint32_t ai1, ai2; cin >> ai1 >> ai2; a[i][j] = (uint64_t(ai1) << 32) + ai2; } } cout << compute(a) << endl; return 0; }

If I correctly counted everything, then this version works twice as fast.

Update 2: fixed the error (added power_index += 32; )

ideas, of course, are good, regarding optimizations, but launching your last version for some reason is 26.0014, when the correct answer is 22 for the test 2 0 0 0 2324 0 2332 0 0
I corrected, now on this test the result is 45.2673, which is very close to the correct answer , which, by the way, is not equal to 44.
44 would be if we take the integer value of the logarithm rounded down and multiplied by 2, so this is also the right answer, and there the real logarithm is considered to be

Answer 2 · 2017-05-22T04:24:05

I have not ready code, but ...

I would do so - at least to increase accuracy (because your double does not catch the exact value, for example, the same 10 ³⁹ ). Would make a 128-bit number - like

 unsigned long a[4];

Further, just such a plate of degrees 10 in such an exact representation -

 unsigned long p10[39][4] = { { 0x00000001, 0x00000000, 0x00000000, 0x00000000 }, // 1 { 0x0000000a, 0x00000000, 0x00000000, 0x00000000 }, // 10 { 0x00000064, 0x00000000, 0x00000000, 0x00000000 }, // 100 { 0x000003e8, 0x00000000, 0x00000000, 0x00000000 }, // 1000 { 0x00002710, 0x00000000, 0x00000000, 0x00000000 }, // 10000 { 0x000186a0, 0x00000000, 0x00000000, 0x00000000 }, // 100000 ... { 0x00000000, 0x098a2240, 0x5a86c47a, 0x4b3b4ca8 }, // 10000000000....0 { 0x00000000, 0x5f655680, 0x8943acc4, 0xf050fe93 }, // 10000000000....00

(fully - here: http://vpaste.net/RljZF ). Well, or if it is more convenient - then vice versa, from the older to the younger. Next, I would write a simple comparison function for such 128-bit numbers - very simple, starting with the older one - and go ...

And I would logarithm without any translations in double s - similar to your search when taking the logarithm. And you still have to measure whether binary search gives something with such a small amount or not. You can play, starting with the search for the senior element. If there is a mismatch, the logarithm is immediately determined, if it coincides, we proceed to the next one and so on ... I think you will have no problems writing to you.

Pros - accuracy , not used floating-point arithmetic.

Update accuracy ...

Consider the value 0x00000000 0x00000000 0x00038d7e 0xa4c67FFE , i.e. the number 999999999999998.

Obviously, the value of its logarithm, floor to a whole - 14.

Now we calculate your value -

 double x = pow(2,32)*0x00038d7e + 0xa4c67FFE;

VC ++ 2015 gives for

 double x = pow(2,32)*0x00038d7e + 0xa4c67FFE; printf("%.10lf\n",x); printf("%.10lf\n",log10(x)); int l = log10(x); printf("%d\n",l);

following results:

 999999999999998.0000000000 15.0000000000 15

You may argue that the logarithm you think is wrong ... but check for yourself that the number 0x00000000 0x00000000 0x0DE0B6B3 0xA763FFF8 - that is, 999999999999999992 - will give in the calculation by your method - with multiplication by pow(2,...) - number

 double x == 1000000000000000000.00000

So the values of logarithms you still for some numbers will be wrong.

in fact, I already solved the problem, it was in the log10_fast function where the cycle was looping when x == 0 from here and the brakes are infinite - so this function significantly increased performance, on these servers it took only 0.358 seconds, and with the standard log10 crashed in time, but something tells me that if you use your table, you can still win in time :)
@ampawd Once again - the main thing is that it will provide ACCURACY - see my updated answer: it contains specific examples of trouble.
The fact that you didn’t fall into them in your calculations is not my fault :)

optimize calculations

2 answers 2

More articles: