Thursday, March 18, 2010

ARCHITECTURE OF INTEL ATOM PROCESSORS


While Atom started as a single-issue, in-order microprocessor, the Austin team quickly widened it to a dual-issue core. The in-order decision stuck, however.
Modern day x86 processors can operate on instructions out of program order. It's as if you had to tie your shoe and turn on the TV: you might choose to tie your shoe first and then walk over to the remote control to turn on the TV, completing the quicker task before moving on to the one that takes more time since you don't have the remote on hand. Processors capable of OoOE (Out of Order Execution) work in the same way; when data isn't available in their caches, instead of idly waiting on it, they can execute other instructions that are ready while the required data is fetched from memory.
The problem with these out of order processors is that all of this instruction reordering takes up additional die space and increases power consumption. Performance goes up as well, but remember, Intel's goal here wasn't to be the fastest, but to be fast enough. Thus the Atom remained an in-order CPU, incapable of executing instructions out of program order, and Intel's first in-order x86 core since the original Pentium processor.
The decision to go in-order eliminated the need for a lot of complex, power hungry circuitry. While you get good performance from out-of-order execution, the corresponding increase in scheduling complexity was simply too great for Atom at 45nm. Keep in mind that just as out-of-order execution wasn't feasible on Intel CPUs until the Pentium Pro, there may come a time when transistors are small enough that it makes sense to implement an OoOE engine on Atom.
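To make the tradeoff concrete, here's a toy sketch of the difference, not a model of Atom's actual pipeline; the 100-cycle miss penalty and the work amounts are assumed, illustrative numbers. An in-order core waits out the whole cache miss before doing anything else, while an out-of-order core overlaps independent work with the miss.

    MISS_PENALTY = 100  # assumed cycles to fetch missing data from memory

    def in_order(independent_work):
        # In-order: stall for the full miss, then do the remaining work.
        return MISS_PENALTY + independent_work

    def out_of_order(independent_work):
        # Out-of-order: independent instructions execute "under" the miss,
        # so total time is whichever finishes last.
        return max(MISS_PENALTY, independent_work)

    for work in (10, 60, 150):
        print(f"{work} cycles of other work: in-order {in_order(work)}, OoO {out_of_order(work)}")

The gap between the two columns is exactly the idle time an in-order core burns power on without getting any work done.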
The Austin design team started with a single-issue, in-order core but quickly expanded it to a superscalar, 2-issue design; in other words, it is capable of sending up to two instructions down the pipeline at the same time. By comparison, most desktop x86 microprocessors are 3 or 4-issue designs. In order to feed the 2-issue machine, Intel equipped Atom with two decoders. These decoders take instructions fetched from the L1 instruction cache and sequence through the series of 1s and 0s to figure out what the instructions are telling the CPU to do. While the decoders are equal in their ability to decode instructions, there are two paths that an instruction may take: slow and fast.
Atom's slow decoding path does not include any speculative decoding. The instruction is decoded sequentially, meaning that each bit is looked at (which takes time), but the instruction is guaranteed to be decoded properly. The instruction is also tagged so that the next time it comes through it can be sent through the fast path.
The fast path obviously employs some speculative decoding and is aided by the tag bit that's set after an instruction goes through the slow path. The slow path yields 1 instruction every 3 clocks, while the fast path can produce 2 instructions every clock.
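A rough sketch of the throughput difference, using the figures quoted above (slow path: 1 instruction every 3 clocks; fast path: 2 per clock). This is an illustration only; instructions here are identified by mnemonic as a stand-in for the per-instruction tag, and the fast-path decodes are simply counted in bulk rather than interleaved.

    def decode_cycles(instruction_stream):
        # Untagged instructions cost 3 clocks each on the slow path;
        # tagged ones go down the fast path at 2 per clock.
        tagged = set()
        slow_clocks = 0
        fast_count = 0
        for inst in instruction_stream:
            if inst in tagged:
                fast_count += 1
            else:
                slow_clocks += 3
                tagged.add(inst)
        return slow_clocks + (fast_count + 1) // 2

    loop_body = ["mov", "add", "cmp", "jnz"]
    print(decode_cycles(loop_body))        # first pass: everything takes the slow path
    print(decode_cycles(loop_body * 10))   # later iterations are mostly fast-path decodes

The payoff shows up in loops: the slow, careful decode is paid once, and every subsequent iteration rides the fast path.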
As Intel learned with Banias (Pentium M), the power penalty for incorrect speculation is unacceptable in a device running on a battery. You'll see a number of tradeoffs where speculative performance tricks are sacrificed in order to maintain low power operation with the Atom processor.

5.1 Instructions Gone Wild: Safe Instruction Recognition

The biggest fear with conventional in-order architectures is what happens if you have a high latency instruction that needs a piece of data that isn't available in the caches.
Since in-order microprocessors have to execute the instructions in order, the execution units remain idle until the CPU is able to retrieve the data it needs from main memory - a process that could easily take over a hundred clock cycles. The problem is that during these clock cycles, power is expended but no work is getting done - which is the exact opposite of what we want in an ultra low power microprocessor.
Out of order processors get around this problem by scheduling around the dependent instruction: the scheduler simply selects the next instruction that is ready for execution, and work progresses while the data dependent instruction waits for data from main memory. We've already established that a full OoOE core would be too power hungry for Atom, but relying on a pure in-order design also has the potential to be inefficient. Intel's Austin team found a clever middle ground for Atom.
It's called the Safe Instruction Recognition (SIR) algorithm, and it works like this: if Atom is executing a long latency floating point operation followed by a short latency integer op, you would traditionally stall until the FP op is complete (as we described above). The SIR algorithm looks at the two instructions and determines whether or not there are any data dependencies between them (e.g. C = A + B followed by D = C + F); if there aren't, then Atom will allow the "younger", shorter latency operation to proceed ahead of the longer FP operation.
SIR addresses a very specific case, but it sprinkles a little bit of out-of-order goodness into Atom's otherwise very strict in-order design. I wouldn't be too surprised if future iterations of Atom expand the situations in which these sorts of out-of-order tricks are allowed.
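A minimal sketch of the kind of dependency check involved, using the article's C = A + B / D = C + F example. This is an illustration of the idea, not Intel's actual hardware algorithm; the register-name dictionaries are made up for the example.

    def can_proceed_early(older_fp_op, younger_int_op):
        # The younger op may only slip ahead if it neither reads (RAW hazard)
        # nor overwrites (WAW hazard) the result of the still-executing FP op.
        dest = older_fp_op["dest"]
        return dest not in younger_int_op["srcs"] and dest != younger_int_op["dest"]

    fp_op   = {"dest": "C", "srcs": ["A", "B"]}   # C = A + B  (long latency FP)
    int_op1 = {"dest": "D", "srcs": ["C", "F"]}   # D = C + F  -> depends on C, must wait
    int_op2 = {"dest": "E", "srcs": ["F", "G"]}   # E = F + G  -> independent, may go early

    print(can_proceed_early(fp_op, int_op1))   # False
    print(can_proceed_early(fp_op, int_op2))   # True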


5.2 Return of the CISC: Macro-op Execution

The Pentium Pro was Intel's first CPU that finally ended the RISC vs. CISC debates of the early 1990s. To the programmer it was still an x86 CISC machine like every previous Intel processor, but internally, once it received its x86 instructions, it decoded them into smaller micro-ops to run on a simpler, faster and more efficient RISC core.

By maintaining backwards compatibility with all previous x86 processors, Intel was able to leverage one of the major strengths of its CISC architecture (namely the installed x86 user base) while continuing to evolve by relying on a high performance RISC core.
It turns out that some x86 instructions shouldn't be broken up into smaller micro-ops because they tend to augment each other. With the Pentium M Intel began fusing certain micro-ops into single operations so that they would go down the processor's pipelines atomically, thus saving power and improving efficiency. Intel called this feature micro-op fusion. If two micro-ops were treated as one when going down the pipeline that effectively increased the "width" of the CPU, allowing more instructions to be operated on at once. The internal core was still very much a RISC machine, it was just able to do a little more in certain circumstances.
The Atom takes things one step further: most x86 instructions aren't even broken down into micro-ops internally. As Atom isn't an out-of-order core, it doesn't make much sense for it to have tons of micro-ops in flight, since it can't reorder them for optimal execution. Furthermore, by keeping most instructions as single operations Intel is able to effectively increase the "width" of Atom.
Instructions of the load-op-store or load-op-execute format are treated as a single micro-op by Atom's decoder. In other words, if you have an instruction that loads data, operates on it, and stores the result, that's now treated as a single micro-op instead of being broken up into three. The benefit is that only a single micro-op goes down the pipeline, leaving room for another one. Atom may only be a 2-issue architecture, but in certain situations it can behave like a much wider machine.
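To picture the difference, here's an illustrative comparison for a memory-form add such as add [counter], eax. The exact micro-op breakdown is simplified and the mnemonics are made up; it's the op count going down the pipeline that matters.

    def split_decode(op, mem, reg):
        # Classic approach: break a load-op-store instruction into three micro-ops.
        return [f"load   tmp, [{mem}]",
                f"{op}    tmp, {reg}",
                f"store  [{mem}], tmp"]

    def fused_decode(op, mem, reg):
        # Atom-style: the whole load-op-store travels the pipeline as one operation,
        # leaving the second issue slot free for something else.
        return [f"{op} [{mem}], {reg}   ; kept as a single op"]

    print(len(split_decode("add", "counter", "eax")), "ops down the pipe, vs",
          len(fused_decode("add", "counter", "eax")))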
Intel has spent much of the past decade perfecting its ability to break down x86 instructions into smaller, RISC-like operations and building very high performance cores to deal with these small atomic operations. What's most interesting is that we've now come full circle where in the quest for greater performance per watt Intel is now doing the opposite and not breaking down these x86 instructions in many cases.

5.3 It Does Multiple Threads Though: The Case for SMT

Despite Atom being 2-issue, it's not always easy to execute two instructions from a single thread in parallel due to data dependencies between them. Intel's solution to this problem was to enable SMT (Simultaneous Multi-Threading) on Atom (not all models, unfortunately) to allow the concurrent execution of up to two threads. Welcome the return of Hyper Threading.
Remember the rule of thumb for power/performance tradeoffs? Intel's decision to enable SMT on Atom was a perfect example of just that. SMT increased power consumption by less than 20% on Atom, but it also yielded a 30 - 50% increase in performance on the in-order core. The decision couldn't have been easier.
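A quick back-of-the-envelope check using the figures quoted above (treating "less than 20%" as a worst case of exactly 20%) shows performance per watt improves either way:

    power_increase = 1.20   # "less than 20%" more power, taken at its worst case
    for perf_increase in (1.30, 1.50):
        print(f"{perf_increase / power_increase:.2f}x performance per watt")
    # prints roughly 1.08x and 1.25x - better efficiency even at the low end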
The Atom has a 32-entry instruction scheduling queue, but when running with SMT enabled each thread gets its own 16-entry queue. The scheduler doesn't have to switch between threads each clock; it can do so intelligently. The only limitation is that it can dispatch at most 2 ops per clock (since it is a 2-wide machine). If one thread is waiting on data to complete an instruction, on the next clock tick the scheduler can choose to dispatch an op from the other thread that will hopefully be able to execute.
Making Atom multithreaded made perfect sense from a logical standpoint. The downside of an in-order core is that if an instruction is waiting on data to begin execution, the rest of the pipeline stalls while that dependency is resolved. It's far less likely that instructions from two independent threads will both be stalled on cache misses at the same time.
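A toy dispatch loop in the spirit of what's described above. The 2-op issue limit and the per-thread queues come from the article; the selection policy (prefer the non-stalled thread) is a simple illustration, not Intel's actual scheduler.

    from collections import deque

    ISSUE_WIDTH = 2   # the 2-wide dispatch limit mentioned above

    def dispatch_cycle(thread_queues, stalled):
        # Prefer threads that aren't waiting on data; issue at most ISSUE_WIDTH ops.
        issued = []
        for tid in sorted(thread_queues, key=lambda t: stalled[t]):
            queue = thread_queues[tid]
            while queue and not stalled[tid] and len(issued) < ISSUE_WIDTH:
                issued.append((tid, queue.popleft()))
        return issued

    # Each thread has its own (16-entry) queue; here they're just short lists.
    queues  = {0: deque(["ld", "add", "cmp"]), 1: deque(["mul", "sub"])}
    stalled = {0: True, 1: False}   # thread 0 missed in cache this cycle
    print(dispatch_cycle(queues, stalled))   # -> [(1, 'mul'), (1, 'sub')]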

5.4 Fighting Power Consumption...with a Longer Pipeline?

Atom's pipeline is a fairly deep 16 stages, with a 13-stage mispredict penalty. Note that this is longer than the Core 2 Duo's 14-stage pipeline, which is surprising given the low power focus the design team had for Atom.

Longer pipelines are generally associated with greater power consumption, a reputation cemented by the Pentium 4's tenure. Intel gave us three reasons for the long pipeline:
1) Caches
2) Decoder
3) SMT
When faced with a tradeoff between latency and power, the Austin design team always favored keeping power low, even if it meant increasing latency. Atom doesn't fire the large banks of its caches unless the cache controller knows there's a true hit in the cache; unfortunately, this increases the access latency of the cache. In order to keep clock speeds high, these cache accesses have to be further pipelined. The benefit is that power is kept low; Atom also keeps its caches physically tagged to avoid the power burden of a virtually tagged cache.
The same sort of latency tradeoff is made in the decoding stages. Remember the slow vs. fast paths through the decoders? The slow path is higher latency but is guaranteed to properly decode an instruction; the added latency forces Atom to have three decoding stages instead of two.
Finally, there are some algorithms in which SMT added a stage or two to the pipeline, the end result being a fairly lengthy pipeline for such a simple CPU. The reasoning, however, makes sense: there is no NetBurst nonsense here; all of these decisions were made to keep power consumption as low as possible while hitting the right frequency targets. As a fairly simple two-issue core, Atom needs clock speed in order to deliver the sort of performance we expect of it.
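To get a feel for what the 13-cycle flush costs, here's a rough effective-CPI calculation. Only the mispredict penalty comes from the article; the branch frequency, predictor accuracy and base CPI are assumed, illustrative values, not measured Atom figures.

    MISPREDICT_PENALTY = 13   # stages flushed on a mispredicted branch (from the article)

    def effective_cpi(base_cpi, branch_fraction, mispredict_rate):
        return base_cpi + branch_fraction * mispredict_rate * MISPREDICT_PENALTY

    # e.g. 20% branches, 5% of them mispredicted, ideal base CPI of 0.5 for a 2-issue core
    print(round(effective_cpi(0.5, 0.20, 0.05), 2))   # -> 0.63 cycles per instruction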
