Borislav Stanimirov / @stanimirovb
code::dive 2019
#include <iostream>
int main()
{
std::cout << "Hi, I'm Borislav!\n";
std::cout << "These slides are here: is.gd/lifeofi\n";
return 0;
}
It's not about C++
We won't deal with those.
If you're interested, "Inside the Machine" by Jon Stokes is a great entry-level book on the subject.
Many ways to classify
By instruction set:
By purpose:
Microcontrollers, virtual CPUs, exotics, MISC, and more
Very, very similar
Some differences:
Our focus is x86, but many things apply to ARM too
... well, in a perfect world at least
Slow instructions. Getting data. Wait state. Empty ticks
5 different parts of the CPU:
i 1: [F ][D1][D2][EX][WB]
i 2:     [F ][D1][D2][EX][WB]
i 3:         [F ][D1][D2][EX][WB]
i 4:             [F ][D1][D2][EX][WB]
i 5:                 [F ][D1][D2][EX][WB]
cpu: TickTickTickTickTickTickTickTickTick
Inherited from before: Cache miss → Wait state → Sad panda
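For instance (a made-up illustration), the usual way to earn those wait states is the data access pattern: both functions below do the same work, but the strided one touches a new cache line on almost every access, so on a big array it spends most of its ticks waiting on the data cache.

#include <cstddef>
#include <vector>

// Cache-friendly: consecutive ints share a cache line, so most accesses hit.
long long sum_sequential(const std::vector<int>& v) {
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// Cache-hostile: with a stride of 16 ints (~one 64-byte line per access),
// each access is likely a miss once the array outgrows the caches.
long long sum_strided(const std::vector<int>& v, std::size_t stride = 16) {
    long long sum = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride)
            sum += v[i];
    return sum;
}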
Swap a and b:
XOR a, b; XOR b, a; XOR a, b;
On the pipeline:
XOR a,b: [F ][D1][D2][EX][WB]
XOR b,a:     [F ][D1][PS][PS][D2][EX][WB]
XOR a,b:         [F ][D1][PS][PS][PS][PS][D2][EX][WB]
cpu:     TickTickTickTickTickTickTickTickTickTickTick
An instruction waiting means a stall
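The same chain as a rough C++ sketch: each statement needs the result of the previous one, which is exactly why the pipeline above keeps inserting pipeline stalls (PS); a swap through a temporary has no such chain.

// XOR swap: a serial dependency chain, each line waits for the previous one.
void xor_swap(int& a, int& b) {
    a ^= b;   // a1 = a0 ^ b0
    b ^= a;   // b1 = b0 ^ a1  -- needs a1
    a ^= b;   // a2 = a1 ^ b1  -- needs b1
}

// Swapping through a temporary has no chain between the two writes,
// so it pipelines without stalls (and is what you should write anyway).
void tmp_swap(int& a, int& b) {
    int t = a;
    a = b;
    b = t;
}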
OoO that's cool. Out-of-order execution. 486 on steroids
Fetch many instructions from instruction cache
Take several instructions from Fetch buffer
Split them into μ-ops (pieces of instructions)
Arrange several μ-ops at a time
Register renaming: virtual registers for μ-ops (see the sketch below)
Reorder μ-ops and send to the reservation station
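A rough C++-level sketch of what renaming buys (made-up example): the variable t is reused, but the reuse is only a false dependency; with renaming each write of t gets its own physical register, so both halves can be in flight at once.

int false_dependency_demo(int a, int b, int c, int d) {
    int t = a * b;       // first life of t  -> physical register P1
    int out1 = t + 1;    // reads P1
    t = c * d;           // second life of t -> physical register P2 (renamed)
    int out2 = t + 2;    // reads P2, independent of the first half
    return out1 ^ out2;
}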
OoO magic. Execute μ-ops in an ultra-mega-fast fashion
Make maximal use of the available ports (execution units)
Now there's a distinction between latency and throughput
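A made-up sketch to feel the difference: a dependent chain of multiplies pays the multiplier's latency on every step, while independent chains keep the ports busy and pay only its throughput.

// Latency-bound: every multiply waits for the previous result.
long long dependent_chain(long long x, int n) {
    long long acc = 1;
    for (int i = 0; i < n; ++i)
        acc *= x;                         // serial dependency on acc
    return acc;
}

// Throughput-bound: four independent accumulators can overlap in the ports.
long long independent_chains(long long x, int n) {
    long long a = 1, b = 1, c = 1, d = 1;
    for (int i = 0; i + 3 < n; i += 4) {
        a *= x; b *= x; c *= x; d *= x;   // no dependency between them
    }
    return a * b * c * d;
}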
Wait for all μ-ops of an instruction to complete
Wait for instructions and rearrange them in their original order
We solved the pipeline stall problem
To the programmer and the program, all this magic is indistinguishable from grandpa 8086
We created a new problem
... when we have branching
... we have invested too much work in instructions already in flight
We get a monstrous pipeline stall:
Not really acceptable if we have 100 (or more) instructions in the pipeline
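A classic made-up way to run into exactly this (echoing the well-known sorted-vs-unsorted array experiment): a data-dependent branch whose direction the CPU cannot guess, so in-flight work keeps getting thrown away.

#include <vector>

// On random data the condition holds about half the time with no pattern, so
// the CPU guesses the branch direction wrong a lot and the in-flight work is
// discarded. Sort the data first (or write the loop branchlessly) and the same
// code runs much faster.
long long sum_big_values(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v) {
        if (x >= 128)       // hard-to-predict branch on random input
            sum += x;
    }
    return sum;
}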
The answer to our waste problems
To our program the world looks exactly like it did on the 8086
But to the instruction...
We have an instruction whose program is running
Life is good
Suddenly the clever prefetcher adds it to the instruction cache
with some thousands of other instructions
In the instruction cache, the Instruction Pointer is coming near
The fetch stage loads our hero along with some tens of others
into the fetch buffer
It's our hero's turn to enter the decoder
The decoder splits it into several μ-ops
The decoder finds out that some μ-ops need extra data
On the other end of the world, loading of data into the data cache is initiated
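If our hero were, say, a read-modify-write like this hypothetical one, it would decode roughly into a load, an add and a store μ-op, and the load is exactly the "extra data" that has to come from the data cache:

// counts[i] += 1 typically compiles to a memory-destination add, roughly:
//     add dword ptr [rdi + rsi*4], 1
// which the decoder splits into something like:
//     load  tmp  <- [rdi + rsi*4]    ; waits on the data cache
//     add   tmp  <- tmp + 1
//     store [rdi + rsi*4] <- tmp
void bump(int* counts, long i) {
    counts[i] += 1;
}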
μ-ops enter the register alias table
They get assigned renamed "fake" registers
μ-ops enter the reorder buffer
They are ordered by dependencies
At the first possible opportunity they are sent to the reservation station
Some μ-ops get executed right away. No one here knows why.
Some μ-ops of our hero keep waiting, while μ-ops of
other instructions get executed
... and waiting
... and waiting
Finally their data is ready
They get executed
Now completely executed, our instruction gets merged on its
way out of the reservation station, and its result is ready
In the retirement phase our hero gets put back in line
with its original neighbors so it can leave the CPU
Borislav Stanimirov / ibob.github.io / @stanimirovb
These slides: ibob.github.io/slides/life-of-i/
Slides license Creative Commons By 4.0