Consider the following piece of code that calculates the checksum of a selection from a large array of data:
#include <cstdlib>      // std::rand

typedef unsigned char  UCHAR;
typedef unsigned short USHORT;

UCHAR u8Data[40000];

// Fill the array with random byte values.
for (USHORT i = 0; i < 40000; i++) {
    u8Data[i] = std::rand() % 256;
}

// Sum only the values that are >= 128.
UCHAR u8Checksum = (UCHAR)0;
for (USHORT j = 0; j < 40000; j++) {
    if (u8Data[j] >= 128) {
        u8Checksum += u8Data[j];
    }
}
Suppose that the microprocessor is the operator of a railroad junction. It hears a train coming and has no idea which way it will go. It stops the train to ask the driver which direction he wants, and then sets the switch appropriately. Then it starts up the heavy train, which has a lot of momentum, all over again, and that consumes a lot of time. This is why microprocessors don't work that way.
A more intelligent operator of that junction would guess which direction the train will go. If he guesses right, the train continues on its way. If he guesses wrong, he stops the train and resets the junction. On average, this technique saves him about half of the time he was wasting without guessing. This is why modern microprocessors use what are called Branch Prediction techniques.
Modern microprocessors are complicated and have long pipelines, so, like the heavy train, they take a long time to "warm up" and "slow down". The if condition in the code above is, at the processor level, a branch instruction like the following:
[Image: the branch instruction generated for the if condition (photo from [1])]
When the microprocessor sees that branch instruction, it has no idea which way it will go until the branch condition is evaluated. Because of pipelining, the condition is not known at the moment the branch instruction itself is executed; the instructions that compute it may still be in flight. The microprocessor could act like the non-intelligent railroad junction operator: halt execution and wait until the previous instructions have completed, and only then continue down the correct path.
But is there a better way? Yes, there is: the microprocessor guesses which direction the branch will go before the condition has been evaluated (a toy model of this guessing is sketched after the list below).
- If it guessed right, it continues executing and no time is wasted.
- If it guessed wrong, it has to flush the pipeline and roll back to the branch, and only then can it restart down the other path.
Going back to our original piece of code, you may be surprised to know that you can speed up its execution by around six times just by adding the following line of code before the second for loop:
std::sort(u8Data, u8Data + 40000);
With a sorted array, the condition "u8Data[j] >= 128" is false for an initial streak of values and then true for all the later values. That is easy for the microprocessor to predict. With an unsorted array, you pay the full cost of the branching.
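To check the claim yourself, here is a minimal timing sketch. The helper name TimeChecksum and the repeat count of 1000 are assumptions made for illustration, and the roughly-six-times figure is the article's claim rather than a guarantee; the actual ratio depends on the CPU and compiler (some compilers at high optimization levels replace the branch with a conditional move, in which case the difference shrinks).

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Run the checksum loop 'repeats' times over 'data' and return the elapsed seconds.
static double TimeChecksum(const unsigned char* data, int n, int repeats,
                           unsigned char* out) {
    auto start = std::chrono::steady_clock::now();
    unsigned char checksum = 0;
    for (int r = 0; r < repeats; r++) {
        for (int j = 0; j < n; j++) {
            if (data[j] >= 128) {
                checksum += data[j];
            }
        }
    }
    auto stop = std::chrono::steady_clock::now();
    *out = checksum;                       // keep the result so the loop is not optimized away
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const int N = 40000;
    static unsigned char u8Data[N];
    for (int i = 0; i < N; i++) u8Data[i] = std::rand() % 256;

    unsigned char sink = 0;
    double unsortedTime = TimeChecksum(u8Data, N, 1000, &sink);

    std::sort(u8Data, u8Data + N);         // the one extra line discussed above
    double sortedTime = TimeChecksum(u8Data, N, 1000, &sink);

    std::printf("unsorted: %.3f s, sorted: %.3f s (checksum %u)\n",
                unsortedTime, sortedTime, (unsigned)sink);
    return 0;
}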
References:
1- http://stackoverflow.com/questions/11227809
2- http://en.wikipedia.org/wiki/Branch_predication