Not really a question, but rather a worked example for the topic "Optimization methods based on efficient use of the hardware", section "Working around data-access latency", paragraph "Grouping the needed data".

A fairly real-life example: a particle system. The first variant is a single array of structures (the two left columns in the results); the second splits the structure by purpose into two arrays (the main data in one structure, the auxiliary data in the other).

Naturally, the example does not contain the many different data-processing procedures a real system would have; it only demonstrates the change in performance on a small fragment of such a system. That is, suppose we have a large, complex software system (say, Folding@home) in which a large amount of data lives in an array of records, and one of the typical computational tasks over this data, the one whose speed became a problem, processes only a subset of the fields. This technique can speed that computation up. It will cause some problems (less elegant code, a slight slowdown of some other data-processing procedures). In general, the usual story with optimization.

 #include "stdafx.h"
 #include <tchar.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <Windows.h>

 const int size = 1 << 24;
 const int repeatCount = 10;

 typedef float TVector[3];

 // Variant 1: array of structures, everything in one record.
 struct TSparkle_Full
 {
     TVector coords, speed;
     COLORREF color;
     float startSize;
     float startLuminTime, fadingTime;
     float lifeTime;
 };

 TSparkle_Full g_sparkles_Full[size];

 // Variant 2: the frequently used data...
 struct TSparkle1
 {
     TVector coords, speed;
 };

 // ...and the auxiliary data, split into separate arrays.
 struct TSparkle2
 {
     COLORREF color;
     float startSize;
     float startLuminTime, fadingTime;
     float lifeTime;
 };

 TSparkle1 g_sparkles1[size];
 TSparkle2 g_sparkles2[size];

 void test1()
 {
     LARGE_INTEGER start, end, freq;
     QueryPerformanceFrequency(&freq);
     int i;

     QueryPerformanceCounter(&start);
     memset(g_sparkles_Full, 0, sizeof(g_sparkles_Full));
     for (i = 0; i < size; i++)
     {
         g_sparkles_Full[i].speed[0] = float(rand());
         g_sparkles_Full[i].speed[1] = float(rand());
         g_sparkles_Full[i].speed[2] = float(rand());
     }
     QueryPerformanceCounter(&end);
     printf("%16.6g", double(end.QuadPart - start.QuadPart) / freq.QuadPart);

     QueryPerformanceCounter(&start);
     for (int n = 0; n < repeatCount; n++)
         for (i = 0; i < size; i++)
         {
             g_sparkles_Full[i].coords[0] += g_sparkles_Full[i].speed[0];
             g_sparkles_Full[i].coords[1] += g_sparkles_Full[i].speed[1];
             g_sparkles_Full[i].coords[2] += g_sparkles_Full[i].speed[2];
         }
     QueryPerformanceCounter(&end);
     printf("%16.6g", double(end.QuadPart - start.QuadPart) / freq.QuadPart);
 }

 void test2()
 {
     LARGE_INTEGER start, end, freq;
     QueryPerformanceFrequency(&freq);
     int i;

     QueryPerformanceCounter(&start);
     memset(g_sparkles1, 0, sizeof(g_sparkles1));
     memset(g_sparkles2, 0, sizeof(g_sparkles2));
     for (i = 0; i < size; i++)
     {
         g_sparkles1[i].speed[0] = float(rand());
         g_sparkles1[i].speed[1] = float(rand());
         g_sparkles1[i].speed[2] = float(rand());
     }
     QueryPerformanceCounter(&end);
     printf("%16.6g", double(end.QuadPart - start.QuadPart) / freq.QuadPart);

     QueryPerformanceCounter(&start);
     for (int n = 0; n < repeatCount; n++)
         for (i = 0; i < size; i++)
         {
             g_sparkles1[i].coords[0] += g_sparkles1[i].speed[0];
             g_sparkles1[i].coords[1] += g_sparkles1[i].speed[1];
             g_sparkles1[i].coords[2] += g_sparkles1[i].speed[2];
         }
     QueryPerformanceCounter(&end);
     printf("%16.6g\n", double(end.QuadPart - start.QuadPart) / freq.QuadPart);
 }

 int _tmain(int argc, wchar_t* argv[])
 {
     printf("%16s%16s%16s%16s\n", "Fill", "Add speed", "Fill 2", "Add speed 2");
     for (int i = 0; i < 5; i++)
     {
         test1();
         test2();
     }
     return 0;
 }

MS Visual Studio 2010, Intel Core i5 at about 2.5 GHz, DDR3 memory. Each test is run several times; each run corresponds to one line in the results. I make no claims of measurement accuracy (background processes, hyperthreading, ...), but even so the pattern is clear: the coordinate-update time differs by about a factor of two between the 'Add speed' and 'Add speed 2' columns:

             Fill     Add speed        Fill 2   Add speed 2
          2.20742       2.67801       1.66939       1.44776
          1.63304       2.62465       1.66015       1.45007
          1.65731       2.61806       1.63312       1.43443
          1.6489        2.64562       1.70421       1.48133
          1.65934       2.67894       1.63182       1.47675

And if you add a std::string field to the structure, the difference is threefold :)

  • Well, about any optimization you can say it "just does less work" or "just does the same work more efficiently". This is one such technique. It is not always easy to apply, so a win, when there is one, cannot be considered free. - Mikhail M
  • @Mikhail M, IMHO you are measuring not latency here, but "effective" throughput. In effect you demonstrate that the RAM / L3 cache exchange happens in fixed-size chunks of data (64 bytes each, I believe), called cache lines. Naturally, the larger the fraction of the cache line occupied by data the computation actually needs, the better the performance. But don't worry, for practical conclusions this really doesn't matter. - avp
  • @Yura Ivanov, well, you are too strict in judging what counts as optimization. O(n^2) => O(n log n) is precisely a change of algorithm, whereas the OP is talking about optimizing the use of the hardware. To my mind, 20% in this area is already quite good. - avp
  • @avp, I didn't even have latency (in ns) in mind as the thing being tested. I just wanted to give an example that demonstrates how it manifests and how seemingly simple changes raise performance severalfold. @Yura Ivanov, give your definition of what optimization is. > "let's just create a separate array for each field" - first, that will sometimes be slower; second, I am not claiming that optimal code is always nice and convenient to maintain. And I am not urging anyone to optimize to the last bit. I urge collecting techniques and examples so that people at least think about architecture and optimization. - Mikhail M

1 answer

@superhackkiller1997, I tested (well, in the sense that I at least compiled and ran) your code from pastebin:

 avp@avp-ubu1:~/hashcode$ gcc hacker3.c -fopenmp -O3 -std=gnu99
 ... oh, so many warnings!!!
 avp@avp-ubu1:~/hashcode$ ./a.out
 MEMSET: 0.061sec - 61.18GB/s
 MEMSET: 0.064sec - 58.29GB/s
 MEMADD: 0.188sec - 39.84GB/s
 MEMRAND: 0.559sec - 6.71GB/s
 avp@avp-ubu1:~/hashcode$ grep CPU /proc/cpuinfo | tail -1
 model name : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
 avp@avp-ubu1:~/hashcode$

The result is impressive (quite seriously). Only you should have written which flags to build it with (not everyone will want to figure out which libs are missing :)). And the abundance of warnings does not exactly decorate it...

I also tried to build @Mikhail M's program, but because of all those M$ types (DWORD, LONGLONG, etc.) I gave up.

To be honest, I did not carefully work out how to compare your results with the OP's. At first glance, his 'Fill' column corresponds to the sum of your 'MEMSET' times divided by 2, and his 'Add speed' should be compared with 'MEMADD' / 2.

Why there is such a huge difference in execution time, I honestly cannot explain (or calculate).

@superhackkiller1997, try to analyze it and really tell everyone (only without silly jargon, please).