A CPU processes instructions in an assembly-line manner, with different instructions in different stages of completion as they move down the line. For instance, each instruction on the original Pentium passes through the following five-stage pipeline:
- Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned for decoding.
- Decode1: Instructions are decoded into the Pentium’s internal instruction format. Branch prediction also takes place at this stage.
- Decode2: Decoding into the internal instruction format continues. Address computations also take place at this stage.
- Execute: The integer hardware executes the instruction.
- Write-back: The results of the computation are written back to the register file.
An instruction enters the pipeline at stage 1, and leaves it at stage 5. Since the instruction stream that flows into the CPU’s front-end is an ordered sequence of instructions that are to be executed one after the other, it makes sense to feed them into the pipeline one after the other. When the pipeline is full, there is an instruction at each stage.
Each pipeline stage takes one clock cycle to complete, so the shorter the clock cycle, the more instructions per second the CPU can push through its pipeline. This is why, in general, a higher clockspeed means more instructions per second and therefore higher performance.
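To make the link between cycle time and throughput concrete, here is a minimal sketch of an idealized pipeline, assuming one instruction enters per cycle and nothing ever stalls or flushes. The function names and the sample numbers are illustrative assumptions, not measurements of any real CPU.

```python
def cycles_to_finish(num_instructions: int, num_stages: int) -> int:
    """Cycles an ideal pipeline needs to complete a run of instructions.

    The first instruction needs num_stages cycles to travel the whole
    pipeline; every later instruction finishes one cycle after the one
    ahead of it.
    """
    if num_instructions <= 0:
        return 0
    return num_stages + (num_instructions - 1)


def instructions_per_second(num_instructions: int, num_stages: int,
                            clock_hz: float) -> float:
    """Average completion rate over the run at a given clockspeed."""
    cycles = cycles_to_finish(num_instructions, num_stages)
    if cycles == 0:
        return 0.0
    return num_instructions * clock_hz / cycles


if __name__ == "__main__":
    # Hypothetical five-stage pipeline at two clockspeeds.
    for hz in (100e6, 200e6):
        rate = instructions_per_second(1_000_000, 5, hz)
        print(f"{hz / 1e6:.0f} MHz: about {rate / 1e6:.1f} million instructions/s")
```

For a long run, the fill time at the start becomes negligible and the rate approaches one instruction per cycle, which is why doubling the clockspeed roughly doubles throughput in this model.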
Most modern processors, however, divide their pipelines up into many more, smaller stages than the Pentium. The later iterations of the Pentium 4 had some 21 stages in their pipelines. This 21-stage pipeline accomplished the same basic steps (with some important additions for instruction reordering) as the Pentium pipeline above, but it sliced each stage into many small stages. Because each pipeline stage was smaller and took less time, the Pentium 4’s clock cycles were much shorter and its clockspeed much higher.
In a nutshell, the Pentium 4 took many more clock cycles than the original Pentium to do the same amount of work, so it needed a much higher clockspeed to get that work done in an equivalent amount of time. This is one core reason why there’s little point in comparing clockspeeds across different processor architectures and families: the amount of work done per clock cycle is different for each architecture, so the relationship between clockspeed and performance (measured in instructions per second) is different.
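The arithmetic behind this can be sketched as follows, under the simplifying assumptions that the total work per instruction takes a fixed amount of time, that it divides evenly among the stages, and that slicing adds no overhead of its own. The 10 ns figure and the stage counts are hypothetical values chosen only for illustration.

```python
def clock_hz(total_latency_s: float, num_stages: int) -> float:
    """Clockspeed when a fixed amount of work is split evenly into stages."""
    stage_time_s = total_latency_s / num_stages  # one stage = one clock cycle
    return 1.0 / stage_time_s


def work_per_cycle(num_stages: int) -> float:
    """Fraction of one instruction's total work done in a single cycle."""
    return 1.0 / num_stages


if __name__ == "__main__":
    total_latency_s = 10e-9  # hypothetical time to do all the work for one instruction
    for stages in (5, 20):
        ghz = clock_hz(total_latency_s, stages) / 1e9
        share = work_per_cycle(stages)
        print(f"{stages} stages: about {ghz:.2f} GHz, {share:.0%} of the work per cycle")
```

In this model the deeper design runs at four times the clockspeed but does a quarter of the work in each cycle, so the raw clockspeed numbers of the two designs are not directly comparable.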
Within the same family of processors, however, the relationship between clockspeed and performance is consistent, so a 3.4GHz Core i5 CPU will outperform a 3.1GHz Core i5 CPU, all else being equal.
A closer look: stalls, flushes, and pipeline length
Astute readers who have thought through the above might come to the conclusion that a 3.8GHz Pentium 4 should, on average, still perform as well as a 3.8GHz Sandy Bridge Core i5, even if the former’s pipeline is longer. Why? Because if both pipelines are full, the number of instructions per second that pop out the end of each pipeline is the same. Think about it: if one company is operating a 21-stage assembly line at one stage per second, and another is operating a 12-stage line at one stage per second, then they’ll both still produce one finished product every second when both pipelines are full.
But notice the last part of the previous sentence: “when both pipelines are full.” Every time the processor switches threads or mispredicts a branch, it has to flush its pipeline and refill it. And sometimes, instructions get stalled in the pipeline for multiple cycles, leaving the downstream stages idle, with nothing to do. These flushes and stalls are a key reason why, at a given clockspeed, it’s better to have a shorter pipeline than a longer one: the shorter pipeline takes less time to refill after a flush, and it’s quicker to begin completing instructions again after a stall.
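A rough model shows how quickly flushes erode the "one instruction per cycle" ideal. The sketch below assumes that every flush empties the whole pipeline and costs a full refill of (stages - 1) cycles; real processors resolve branches at different points in the pipeline, so the true penalty varies. The function names and the counts of instructions and flushes are made up for illustration.

```python
def cycles_with_flushes(num_instructions: int, num_stages: int,
                        num_flushes: int) -> int:
    """Total cycles when each flush forces a full pipeline refill.

    Assumption: a refill wastes (num_stages - 1) cycles before the next
    instruction completes, both at startup and after every flush.
    """
    if num_instructions <= 0:
        return 0
    refill = num_stages - 1
    return num_instructions + refill + num_flushes * refill


def instructions_per_cycle(num_instructions: int, num_stages: int,
                           num_flushes: int) -> float:
    """Average instructions completed per cycle over the whole run."""
    cycles = cycles_with_flushes(num_instructions, num_stages, num_flushes)
    return num_instructions / cycles if cycles else 0.0


if __name__ == "__main__":
    # Same clockspeed, same hypothetical workload: 1,000 instructions, 50 flushes.
    for stages in (12, 21):
        cycles = cycles_with_flushes(1000, stages, 50)
        ipc = instructions_per_cycle(1000, stages, 50)
        print(f"{stages}-stage pipeline: {cycles} cycles, {ipc:.2f} instructions/cycle")
```

With these made-up numbers the 12-stage pipeline finishes in 1,561 cycles and the 21-stage pipeline in 2,020, even though both complete exactly one instruction per cycle whenever they are full.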
At 3.8GHz, the Core i5’s shorter pipeline, which does more work per stage, can beat the pants off the Pentium 4’s longer pipeline, because anytime the pipeline goes even partially empty the Core i5 can refill it and recover much faster. The end result is that, on average, the Core i5’s pipeline stays full longer than the Pentium 4’s, which makes the Core i5 faster.
Of course, a shorter pipeline isn’t the only reason the Core i5’s pipeline finds itself full more often than the Pentium 4’s. Other reasons include a superior branch predictor, which keeps performance-killing mispredicts to a minimum, and larger caches, which keep the Core i5’s pipeline fed with readily accessible instructions.
Summary: When comparing CPU clockspeeds within the exact same processor family, clockspeed is a good guide to performance because a higher clockspeed means more instructions are completed per second.
But when comparing the CPU clockspeeds of different processor designs, it’s generally apples-to-oranges. For two CPUs with the same clockspeed, the one with the shorter pipeline has an advantage because its pipeline stays full more often. For two CPUs with the same pipeline depth but different clockspeeds, the one with the higher clockspeed has the advantage. For two CPUs with different clockspeeds and different pipeline depths, the outcome depends on the pipeline depth and other factors, such as branch prediction and cache size.