Wednesday, December 3, 2014

Branch Prediction Penalty

Consider the following piece of code that calculates the checksum of a selection from a large array of data:

UCHAR u8Data[40000];
for (USHORT i = 0; i < 40000; i++)
  u8Data[i] = std::rand() % 256;
UCHAR  u8Checksum = (UCHAR)0;
for (USHORT j = 0; j < 40000; j++){  
  if (u8Data[j] >= 128)
    u8Checksum += u8Data[j];

You can consider the if condition “if (u8Data[j] >= 128)” as a railroad junction just as in the following picture:

Suppose that the microprocessor is the operator of that junction and it hears a train coming and have no idea which way it will go. It will stop the train to ask the captain which direction he wants, and then sets the switch appropriately. Then it will starts up the heavy train that have a lot of momentum once again, which will consume a lot of time. This is why microprocessors don’t work in such way.

A more intelligent operator of that junction would ‘guess’ which direction the train will go. And if he suggested right, the train continues on. If he suggested wrong, he will stop the train and reallocate the junction. On the average, this technique will save him half of the time he was wasting without ‘guessing’. This is why modern microprocessors use what is called Branch Prediction techniques.

Modern microprocessors are complicated and have long pipelines. So they take a long time to "warm up" and "slow down". That if condition, at the processor level, is a branch instruction like the following:

Photo by [1]

When the microprocessor sees that branch instruction, it has no idea which way it will go until it evaluates the branch condition. Evaluating the condition of the branch is not decided in the moment of executing the branch instruction itself because of using pipelining. The microprocessor can act like the non-intelligent railroad junction operator and halts execution and waits until the previous instructions are completed. Then he continues down the correct path.
But is there a better way? Yes, there is, the microprocessor guesses which direction the branch will go before evaluating the branch condition.
  • If it guessed right, it continues executing, and no time is wasted.
  • If it guessed wrong, it needs to flush the pipeline and roll back to the branch. Then it can restart down the other path.
Going back to our original piece of code, you may be surprised to know that you can speed its execution by around 6 times just by adding the following line of code before the second for loop:

    std::sort(u8Data, u8Data + 40000);

With a sorted array, the condition u8Data[j] >= 128 is first false for a streak of values, then becomes true for all later values. That's easy to predict for the microprocessor. With an unsorted array, you pay all the branching costs.


No comments: