Here is an example of C++ code that behaves very strangely: for some reason, the code runs almost six times faster when the data is sorted.

 #include <algorithm>
 #include <ctime>
 #include <iostream>

 int main()
 {
     // Fill the array with data
     const unsigned arraySize = 32768;
     int data[arraySize];

     for (unsigned c = 0; c < arraySize; ++c)
         data[c] = std::rand() % 256;

     // !!! With this line, the next loop runs faster
     std::sort(data, data + arraySize);

     // Test
     clock_t start = clock();
     long long sum = 0;

     for (unsigned i = 0; i < 100000; ++i)
     {
         // Main loop
         for (unsigned c = 0; c < arraySize; ++c)
         {
             if (data[c] >= 128)
                 sum += data[c];
         }
     }

     double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

     std::cout << elapsedTime << std::endl;
     std::cout << "sum = " << sum << std::endl;
 }
  • Without std::sort(data, data + arraySize); the code runs in 11.54 seconds.
  • With sorted data it runs in 1.93 seconds.

At first I thought something was wrong with the language or the compiler, so I tried Java.

 import java.util.Arrays;
 import java.util.Random;

 public class Main
 {
     public static void main(String[] args)
     {
         // Fill the array with data
         int arraySize = 32768;
         int data[] = new int[arraySize];
         Random rnd = new Random(0);

         for (int c = 0; c < arraySize; ++c)
             data[c] = rnd.nextInt() % 256;

         // !!! With this line, the next loop runs faster
         Arrays.sort(data);

         // Test
         long start = System.nanoTime();
         long sum = 0;

         for (int i = 0; i < 100000; ++i)
         {
             // Main loop
             for (int c = 0; c < arraySize; ++c)
             {
                 if (data[c] >= 128)
                     sum += data[c];
             }
         }

         System.out.println((System.nanoTime() - start) / 1000000000.0);
         System.out.println("sum = " + sum);
     }
 }

The result was similar, though with a smaller gap.


My first thought was that sorting brings the data into the cache, but that makes no sense: the array has only just been created.

  • What's happening?
  • Why is a sorted array processed faster than an unsorted one?


4 answers

Translation of the answer by @Mysticial:

You have fallen victim to a branch prediction failure.


What is branch prediction?

Consider the railway junction:


Picture by Mecanismo, from Wikimedia Commons. Used under the CC-BY-SA 3.0 license.

Now imagine we are back in the 19th century - before radio was invented.

You are the operator of a railway junction and you hear a train approaching. You have no idea which way it is supposed to go. So you stop the train and ask the driver which direction he needs. Then you set the switch accordingly.

Trains are heavy and have a lot of inertia, so starting and stopping take a long time.

Is there a better way? You can guess where the train will go!

  • If you guessed correctly, the train continues without stopping.
  • If you guessed wrong, the driver will stop the train, back up, and yell at you to flip the switch. Then he can continue down the other path.

If you guess correctly every time, the train will never have to stop.
If you guess wrong too often, the train will lose a lot of time stopping, backing up, and restarting.


Now consider an if-statement: at the processor level, it is a branch instruction:


You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines, so "warming up" and "stopping" take a long time.

Is there a better way? You can guess which way the branch will go!

  • If you guessed correctly, execution continues.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the correct path.

If you guess correctly every time, execution will never have to stop.
If you guess wrong too often, you will spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it is not the best analogy, since the train could simply signal its direction with a flag. But in computers, the processor does not know which way a branch will go until the last moment.

So what strategy do you use for guessing, to minimize the number of times the train has to back up and take the other path? Look at the past history! If the train goes left 99% of the time, you guess left. If it alternates, your guesses alternate too. If one path is taken every third time, you guess the same way...

In other words, you try to identify a pattern of behavior and follow it. That is roughly how a branch predictor works.

Most applications have well-behaved branches, so modern branch predictors typically achieve hit rates above 90%. But when faced with unpredictable branches that have no recognizable pattern, branch predictors are practically useless.

Further reading: the "Branch predictor" article on Wikipedia.


As hinted above, the culprit is this if-statement:

 if (data[c] >= 128)
     sum += data[c];

Note that the data is uniformly distributed between 0 and 255.
When the data is sorted, roughly the first half of the iterations never enter the if-statement; after that, every iteration enters it.

This is very friendly to the branch predictor, since the branch consecutively goes the same way many times.
Even a simple saturating counter predicts the branch correctly, except for a few iterations right after it switches direction.

Fast visualization:

 T = branch taken
 N = branch not taken

 data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
 branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T

        = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

But when the data is random, the predictor is useless, because it cannot predict random data.
The misprediction rate will then be around 50% (no better than random guessing).

 data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, 133, ...
 branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,  N,   T,   T,   T,   N  ...

        = TTNTTTTNTNNTTTN ...  (completely random - hard to predict)
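The two visualizations above can be reproduced in code. Below is a minimal sketch (my own illustration, much simpler than real hardware) of a single 2-bit saturating counter, the most basic dynamic predictor; the function name and structure are assumptions for this example:

```cpp
#include <cstddef>

// Sketch: one 2-bit saturating counter predicting the branch
// "data[c] >= 128". States 0..3; state >= 2 predicts "taken".
// Returns the fraction of mispredictions over the array.
double mispredictRate(const int* data, std::size_t n)
{
    int state = 0;                  // 0 = strongly not-taken ... 3 = strongly taken
    std::size_t misses = 0;
    for (std::size_t c = 0; c < n; ++c)
    {
        bool taken     = data[c] >= 128;
        bool predicted = state >= 2;
        if (taken != predicted)
            ++misses;
        if (taken)  { if (state < 3) ++state; }   // saturate upward
        else        { if (state > 0) --state; }   // saturate downward
    }
    return n ? static_cast<double>(misses) / n : 0.0;
}
```

Feeding it a sorted 0..255 array, the counter mispredicts only around the single N-to-T switch, so the rate is near zero; on random data it hovers around 50%.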

So what can you do?

If the compiler cannot optimize the branch away, you can try a few hacks, provided you are willing to sacrifice readability for performance.

Replace:

 if (data[c] >= 128)
     sum += data[c];

with:

 int t = (data[c] - 128) >> 31;
 sum += ~t & data[c];

This eliminates the branch and replaces it with a few bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement, but it gives the correct result for all values that can appear in data[] here.)
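Why does the hack work? For data[c] in 0..255, data[c] - 128 is negative exactly when data[c] < 128, and an arithmetic right shift by 31 smears the sign bit into a mask of all ones (for negative) or all zeros. The sketch below (function name is my own) checks the equivalence for every possible input; note that right-shifting a negative signed int is implementation-defined before C++20, though virtually all mainstream compilers shift arithmetically:

```cpp
// Sketch verifying the bit hack against the plain branch for every
// value that can appear in data[] (0..255).
bool hackMatchesBranch()
{
    for (int x = 0; x < 256; ++x)
    {
        long long branchy    = (x >= 128) ? x : 0;  // original form
        int       t          = (x - 128) >> 31;     // all ones if x < 128, else 0
        long long branchless = ~t & x;              // x or 0, with no branch
        if (branchy != branchless)
            return false;
    }
    return true;
}
```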

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

 // Branching - random
 seconds = 11.777

 // Branching - sorted
 seconds = 2.352

 // Branchless - random
 seconds = 2.564

 // Branchless - sorted
 seconds = 2.587

Java - Netbeans 7.1.1 JDK 7 - x64

 // Branching - random
 seconds = 10.93293813

 // Branching - sorted
 seconds = 5.643797077

 // Branchless - random
 seconds = 3.113581453

 // Branchless - sorted
 seconds = 3.186068823

Observations:

  • With branching: a huge difference between sorted and unsorted data.
  • With the hack: no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually slightly slower than branching on sorted data.

The general rule is to avoid data-dependent branching in critical loops (as in this example).
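Besides the bit hack, there are gentler branch-free formulations of the same loop body. The sketch below (function names are my own, for illustration) shows two: multiplying by the 0/1 value of the comparison, and a ternary that compilers frequently lower to a conditional move (CMOV) instead of a branch:

```cpp
#include <cstddef>

// Branch-free variant 1: the comparison yields bool (0 or 1),
// so multiplication keeps or discards the element without a branch.
long long sumGE128_mul(const int* data, std::size_t n)
{
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c)
        sum += (data[c] >= 128) * static_cast<long long>(data[c]);
    return sum;
}

// Branch-free variant 2: a ternary with both arms cheap to compute
// is a classic candidate for a CMOV instruction.
long long sumGE128_ternary(const int* data, std::size_t n)
{
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c)
        sum += data[c] >= 128 ? data[c] : 0;
    return sum;
}
```

Whether the ternary actually compiles to CMOV depends on the compiler and flags, as the update below about GCC and VC++ illustrates.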


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 can generate a CMOV instruction. In that case there is no difference between sorted and unsorted data - both are fast.

  • VC++ 2010 does not generate CMOV for this branch even with /Ox.

  • Intel Compiler 11 does something amazing. It interchanges the two loops, thereby hoisting the unpredictable branch into the outer loop. So it is not only immune to mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC exploited the benchmark loop to defeat the benchmark...

  • If you give the Intel Compiler the branchless code, it simply vectorizes it... and that is just as fast as the branchy version (with loop interchange).
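To see why loop interchange defeats this benchmark, here is a sketch (my own reconstruction, not ICC's actual output) of the transformed code: with the repeat loop moved inside, the unpredictable branch executes once per element instead of 100000 times per element, and the inner accumulation folds into a multiplication:

```cpp
#include <cstddef>

// Reconstruction of the interchanged benchmark: the element loop is
// now outermost, so "data[c] >= 128" is evaluated only n times total,
// and the 100000-iteration repeat loop collapses to a multiply.
long long sumInterchanged(const int* data, std::size_t n, long long repeats)
{
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c)     // was the inner loop
    {
        if (data[c] >= 128)                 // branch taken n times, not n * repeats
            sum += repeats * data[c];       // repeat loop folded away
    }
    return sum;
}
```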

This shows that even mature compilers can differ wildly in their ability to optimize code...

  • It is a pity that, alongside the triumph of hyperthreading, speculative execution (when the processor starts executing the instructions of both branches in parallel and then discards the irrelevant results) is no longer mentioned. - avp
  • The explanation seems a bit "far-fetched". If there were 128 N outcomes before that, and T appeared only on the 129th, why should the processor pick T on iterations 130-200? All that time the share of N outcomes is higher, which means N should keep being predicted! I.e., prediction would only work on the first half of the sorted data, and in the second half everything would break down? - Isaev
  • @Isaev, the answer does not describe specific prediction algorithms. Examples of different variants can be found at the Wikipedia link. The exact implementation of the predictor is not really important for this question/answer. - Grundy

This effect most likely arises from the processor's branch prediction unit. When the data is sorted, the branch in

 if (data[c] >= 128)

changes direction only once (the condition first always fails, then always holds), so prediction works well. On an unsorted sequence it is much harder to predict whether the if will be taken. If you remove the if entirely, you will see that the presence or absence of sort no longer affects the running time.

P.S. The OP's code appears to be borrowed from a question on the English Stack Overflow, so for a more detailed explanation you can follow that link. My answer is not a translation, as I have already noted in a comment.

  • @jfs Yes. I already replied in the comments to the question. Some of the comments seem to have been deleted. - αλεχολυτ
  • Such a link should be included explicitly in your answer (considering that you describe the same approach as the source). - jfs
  • @jfs Actually, I had seen the original question earlier, but when I wrote the answer I had completely forgotten about it, so one cannot say the idea was borrowed, though I do not claim to be the source either. - αλεχολυτ

You are watching a branch predictor at work. This unit is part of pipelined processors; it predicts whether a conditional jump in the running program will be taken.

With a sorted array, the condition data[c] >= 128 changes its value once for a whole run of values and then holds for all subsequent ones. Such branches are easy for the processor to predict (branch prediction accuracy in modern processors exceeds 90%). With an unsorted array, you pay for the branching with program speed.

Without branch prediction, the pipeline would have to wait for the conditional jump instruction to complete before fetching the next instruction. The branch predictor avoids this waste of time by guessing the outcome of the branch.

    Most likely, the data array got cached by the processor during sorting. Given the small size of the array, it should fit entirely into 1 cache line.

    • The cache has nothing to do with it. - αλεχολυτ
    • The freshly generated, unsorted array will be in the cache too. - avp