Physical Structure

  • Simple:

    • [diagram not included]

  • Overall:

    • [diagram not included]

  • Skylake Block Diagram:

    • [diagram not included]

    • This is not a pipeline diagram; it doesn't tell us how the chip works. It's just a logical flow showing rough directions.

  • Casey - A (Very) Simplified CPU Diagram.

Front-end

  • "Figure out what to run".

  • Instruction stream:

    • Assembly code.

  • I$ (instruction cache):

  • micro ops / uops / μops:

    • In the past, instructions were executed right away, but now they are decoded into smaller instructions: micro ops.

    • uops.info.

    • Analysis with Compiler Explorer and uiCA.

      • {29:10} Shows how the sites above are used.

      • uiCA (uops.info Code Analyzer) is a tool that gives a complete analysis of how the port breakdown will go for your program.

        • There are also some graphs and execution traces.

        • "Predicted Throughput": you can use this to estimate how many cycles your loop is going to take.
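The "Predicted Throughput" estimate can be turned into a rough running time with simple arithmetic. The sketch below assumes the figure is reported in cycles per loop iteration; the iteration count, the 1.25 cycles/iteration value, and the 4 GHz clock are made-up numbers for illustration, not output from the tool.

```python
def estimate_loop_time(iterations, cycles_per_iteration, clock_hz):
    """Rough estimate: total cycles and seconds for a loop.

    Assumes the loop is limited by the predicted throughput and ignores
    cache misses, branch mispredictions, and startup costs.
    """
    total_cycles = iterations * cycles_per_iteration
    seconds = total_cycles / clock_hz
    return total_cycles, seconds


# Hypothetical inputs: 1 million iterations, 1.25 cycles each, 4 GHz clock.
cycles, seconds = estimate_loop_time(1_000_000, 1.25, 4.0e9)
print(cycles, seconds)
```

Treat the result as a lower bound to compare against a real measurement, not as a prediction of wall-clock time.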

  • Decoder:

    • Get the instructions from the instruction stream and decode them into micro ops.

    • "If we can shrink the size of our stream down, this might be a win for the act of getting the streams".

  • Loop cache / Trace cache:

    • Stores the result of decoding. It remembers the micro ops.

    • Sometimes decoding instructions can actually be the bottleneck in loops where the instructions are cheap to execute.
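A minimal sketch of why remembering decoded micro ops helps in a loop. It is a toy model, not how real hardware is organised: the `DECODE_TABLE` contents, the `FrontEnd` class, and keying the cache by a small integer address are all assumptions made for illustration.

```python
# Toy decode table: instruction text -> micro ops (illustrative only).
DECODE_TABLE = {
    "add rax, rbx": ["add"],
    "add rax, [rcx]": ["load", "add"],  # memory operand modelled as two micro ops
    "dec rdx": ["dec"],
    "jnz loop": ["jnz"],
}


class FrontEnd:
    def __init__(self):
        self.uop_cache = {}   # address -> micro ops already decoded
        self.decodes = 0      # times the decoder actually had to run
        self.cache_hits = 0   # times the cached micro ops were reused

    def fetch(self, address, instruction):
        if address in self.uop_cache:
            self.cache_hits += 1
            return self.uop_cache[address]
        self.decodes += 1
        uops = DECODE_TABLE[instruction]
        self.uop_cache[address] = uops
        return uops


def run_loop(front_end, body, iterations):
    """Feed the loop body through the front end and count issued micro ops."""
    issued = 0
    for _ in range(iterations):
        for address, instruction in enumerate(body):
            issued += len(front_end.fetch(address, instruction))
    return issued


loop_body = ["add rax, [rcx]", "dec rdx", "jnz loop"]
fe = FrontEnd()
total_uops = run_loop(fe, loop_body, iterations=100)
print(fe.decodes, fe.cache_hits, total_uops)  # 3 297 400
```

In this model the decoder runs only on the first iteration; the other 99 iterations are served from the cache, which is the case where decoding stops being the bottleneck.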

  • All a "jump" does is tell us where to get more instructions, so we don't want to send it to the backend. For that reason, the frontend also handles jmp, stack operations, etc.

  • Branch prediction exists so the frontend doesn't sit idle, starved of instructions, while it waits for the backend to finish and reveal which way a branch went. The backend has much higher latency, so the frontend predicts the branch and keeps fetching along the predicted path.
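To make the idea of prediction concrete, here is a sketch of a classic textbook scheme, the 2-bit saturating counter. The notes do not say which scheme real chips use (modern predictors are far more elaborate), so this is a stand-in; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Counter values 0 and 1 predict not taken; 2 and 3 predict taken."""

    def __init__(self):
        self.counter = 2  # start in the weakly-taken state

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then falls through; the loop runs 10 times.
loop_branch = ([True] * 9 + [False]) * 10
# A branch that flips every time.
alternating = [True, False] * 50

print(count_mispredictions(loop_branch))   # 10 (one miss per loop exit)
print(count_mispredictions(alternating))   # 50 (every not-taken is missed)
```

Each misprediction stands for work the frontend fetched down the wrong path and has to throw away, which is why regular branch patterns are much cheaper than irregular ones.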

Back-end

  • Ports / execution units:

    • "Port is kind of a weird hardware term that doesn't give more intuition of what they do".

    • They are things that can do work.

    • If you want to add things together, one or more of these ports has an "adder" circuit in it.

  • RAT (register alias/allocation table):

    • In the past, a register was a real register: it meant a real place in the chip. That is no longer the case. Now the register name goes into something called the RAT, which translates the names you gave (RAX, for example) into names that correspond to "slots" used by the micro ops.

    • There are 16 general-purpose registers in assembly (or 32 registers if you count vector registers), but there are roughly 192 entries in a modern RAT, or maybe more in newer chips. The reason is to extract parallelism from things that are independent of each other, to keep the pipeline full. This tries to fill in a window of things we could be doing. The RAT's job is to expand the instruction stream into independent dependency chains that use those slots, decomposing that 16-register stream into something much more verbose that involves many more registers.
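The renaming step can be sketched as a small table lookup. This is a simplified model under stated assumptions: slots are handed out from an unlimited counter and never freed, and the `rename` function and `pN` slot names are hypothetical, not the layout of a real RAT.

```python
from itertools import count


def rename(instructions):
    """Rename (dest, src1, src2) triples from register names to slots."""
    rat = {}          # architectural name -> slot currently holding its value
    fresh = count()   # hypothetical unlimited supply of slot numbers

    def lookup(reg):
        # A register read before any write gets a slot for its initial value.
        if reg not in rat:
            rat[reg] = f"p{next(fresh)}"
        return rat[reg]

    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up first, so they see the previous writer.
        s1, s2 = lookup(src1), lookup(src2)
        # Every write gets a brand-new slot, even if the name is reused.
        rat[dest] = f"p{next(fresh)}"
        renamed.append((rat[dest], s1, s2))
    return renamed


program = [
    ("rax", "rbx", "rcx"),  # rax = rbx op rcx
    ("rdx", "rax", "rax"),  # depends on the first instruction
    ("rax", "rsi", "rdi"),  # reuses the name rax, but is independent
    ("r8", "rax", "rax"),   # depends on the third instruction only
]
renamed = rename(program)
for row in renamed:
    print(row)
# ('p2', 'p0', 'p1')
# ('p3', 'p2', 'p2')
# ('p6', 'p4', 'p5')
# ('p7', 'p6', 'p6')
```

After renaming, the two uses of `rax` live in different slots (`p2` and `p6`), so the first two instructions and the last two form separate dependency chains that can proceed in parallel.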

  • Scheduler:

    • "Only wait if the things you need are still being worked on". To do that, something has to keep track of where things are; this is what the Scheduler does.
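The "only wait if the things you need are still being worked on" rule can be sketched as a cycle-by-cycle loop. This is a toy model: the two-port limit, the latencies, and the micro op names are assumptions, and real schedulers also care about which port can run which operation.

```python
def schedule(uops, ports=2):
    """uops maps name -> (latency, [dependencies]), listed in program order.

    Returns a dict of name -> cycle in which the micro op starts executing.
    """
    finish = {}            # name -> cycle at which its result is available
    start = {}
    pending = list(uops)   # micro ops not yet issued, in program order
    cycle = 0
    while pending:
        issued = 0
        for name in list(pending):
            if issued == ports:
                break      # every port is busy this cycle
            latency, deps = uops[name]
            # Ready only if every input has already been produced.
            if all(d in finish and finish[d] <= cycle for d in deps):
                start[name] = cycle
                finish[name] = cycle + latency
                pending.remove(name)
                issued += 1
        cycle += 1
    return start


# Hypothetical micro ops and latencies.
uops = {
    "load_a": (4, []),
    "load_b": (4, []),
    "add": (1, ["load_a", "load_b"]),
    "mul": (3, ["add"]),
    "inc": (1, []),        # independent of everything else
}
start = schedule(uops)
print(start)
# {'load_a': 0, 'load_b': 0, 'inc': 1, 'add': 4, 'mul': 5}
```

Note that `inc` comes last in program order but starts in cycle 1, ahead of `add` and `mul`, because nothing it needs is still being worked on.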

  • Retirement Buffer:

    • The micro ops are always retired in order. As each micro op finishes, its result is written into the slot where it needs to go, so the micro ops retire in the order they came in.
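In-order retirement on top of out-of-order completion can be sketched with a queue of slots. The `RetirementBuffer` class below is a hypothetical simplification: each slot only records a name and a done flag, whereas a real buffer also tracks results and exceptions.

```python
from collections import deque


class RetirementBuffer:
    def __init__(self):
        self.slots = deque()   # entries in program order: [name, done]

    def allocate(self, name):
        """Reserve a slot when the micro op enters the backend."""
        entry = [name, False]
        self.slots.append(entry)
        return entry

    def complete(self, entry):
        """Mark a micro op as finished; this may happen in any order."""
        entry[1] = True

    def retire(self):
        """Retire finished micro ops from the front only, oldest first."""
        retired = []
        while self.slots and self.slots[0][1]:
            retired.append(self.slots.popleft()[0])
        return retired


rob = RetirementBuffer()
load, add, store = (rob.allocate(n) for n in ("load", "add", "store"))

rob.complete(store)    # the youngest micro op finishes first
first = rob.retire()   # [] because 'load' is still at the front, unfinished
rob.complete(load)
second = rob.retire()  # ['load'], and 'add' now blocks 'store'
rob.complete(add)
third = rob.retire()   # ['add', 'store']
print(first, second, third)
```

Even though `store` finished first, it leaves the buffer last, so the visible order matches the order the micro ops came in.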

Manufacturing