Translation of the answer: @Mysticial
You have been the victim of a predictor of transition errors.
What is Conversion Prediction?
Consider the railway junction:

Picture Mecanismo, from Wikimedia Commons. Used under license CC-By-SA 3.0 .
Now imagine that we returned to the XIX century - before the invention of the radio.
You are the operator of the railway junction and hear the train arrives You have no idea which path he should take. you stop the train and ask the driver where he needs it. And then set the switch to the desired position.
The trains are heavy and with great inertia. Therefore, starting and stopping take a lot of time.
Is there a better way? You can guess where the train will go!
- If you guessed correctly, the train will continue without stopping.
- If you make a mistake, the train driver will stop the train, come back, and he will bawl at you so that you can cross the tracks. Then he can continue driving.
If you guess correctly every time , the train will never stop.
If you make mistakes very often , the train will lose a lot of time to stop, return, and accelerate.
Consider if-statement: at the processor level, this is a branch instruction:

You are a processor and you see branching. You have no idea which branch will be selected. What do you do? You stop the execution and wait for the completion of the previous instruction. Then you continue running along the right path.
Modern processors are complex and have long conveyors. Therefore, “warming up” and “stopping” take a lot of time.
Is there a better way? You can guess which branch will be executed!
- If you guessed correctly, execution will continue.
- If you make a mistake, you need to reset the pipeline and roll back to the branch. Then you can continue with the desired branch.
If you guess correctly every time , execution will never stop.
If you make mistakes very often , you will spend a lot of time stopping, rolling back and restarting.
This is a prediction of transitions. I admit this is not the best analogy, because the train can indicate the direction of giving a signal with a flag. But in computers, the processor does not know which direction will be chosen until the very last moment.
So what strategy to choose when guessing. to minimize the number of times. when should the train go back and go the other way? You can see the story! If the train goes left 99% of the time, then the guess will be: left. If alternate, then the conjectures also alternate. If one path is chosen every three times, you can assume the same ...
In other words, you are trying to define a pattern of behavior and follow it. This is how the predictor of transitions works.
Most applications have well-defined branches. Therefore, in modern predictors of transitions, the percentage of correct guesswork is usually> 90%. But when they encounter unpredictable branches with undefined patterns, transition predictors are almost useless.
Further reading: "Predictor of transitions" article on Wikipedia .
As mentioned above, the problem is in this if-statement:
if (data[c] >= 128) sum += data[c];
Note that the data is evenly distributed in the range from 0 to 255.
When the data is sorted, the first half will not go into the if-statement. Then, the remaining will go into the if-statement.
This is very good for the soothsayer, because the same branch is chosen consistently many times.
Even a simple counter with saturation correctly predicts a direction, except in the case after a change of direction.
Fast visualization:
T = ветка выбрана N = ветка не выбрана data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ... branch = NNNNN ... NNTTT ... TTT ... = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT (легко прогнозировать)
But when the data is random, the predictor is useless, because it cannot predict random data.
Thus, the probability of incorrect prediction may be about 50%. (no better than random guesses)
data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, 133, ... branch = T, T, N, T, T, T, T, N, T, N, N, T, T, T, N ... = TTNTTTTNTNNTTTN ... (полностью случайно - тяжело прогнозировать)
So what can you do?
If the compiler cannot optimize the choice of a branch, you can try using several hacks if you are ready to sacrifice readability for the sake of performance.
Replace:
if (data[c] >= 128) sum += data[c];
on:
int t = (data[c] - 128) >> 31; sum += ~t & data[c];
This eliminates branching and replaces it with some bitwise operations.
(Note that this hack is not exactly equivalent to the original condition. But in this case it gives the correct result for all incoming values from data[] .)
Benchmarks: Core i7 920 @ 3.5 GHz
C ++ - Visual Studio 2010 - x64 Release
// Ветвление - случайно seconds = 11.777 // Ветвление - сортировано seconds = 2.352 // Без ветвления - случайно seconds = 2.564 // Без ветвления - сортировано seconds = 2.587
Java - Netbeans 7.1.1 JDK 7 - x64
// Ветвление - случайно seconds = 10.93293813 // Ветвление - сортировано seconds = 5.643797077 // Без ветвления - случайно seconds = 3.113581453 // Без ветвления - сортировано seconds = 3.186068823
Observations:
- With branching: a huge difference between sorted and non-sorted data.
- With hack: There is no difference between sorted and non-sorted data.
- In the case of C ++, when using hack, the results are slightly slower than when using branching on sorted data.
The general rule is to avoid using data-dependent conditions in critical cycles. (as in this example)
Update:
GCC 4.6.1 with -O3 or -ftree-vectorize on x64 may generate CMOV instructions. In this case, there is no difference between the sorted or non-sorted data - both cases are fast.
VC ++ 2010 does not enable CMOV generation for branching even when using /Ox .
Intel Compiler 11 does something amazing. It interchanges two cycles , thereby raising unpredictable branching into an external cycle. So it becomes not only not susceptible to prediction errors, but also twice as fast as VC ++ and GCC can generate! In other words, the ICC used the test cycles to defeat the test ...
If you give Intel Compiler a code without branching, it vectorizes it ... and this is as fast as with branching (with changing cycles).
This shows that even mature compilers may differ from each other in their ability to optimize code ...