retired instruction
https://easyperf.net/blog/2018/09/04/Performance-Analysis-Vocabulary
https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/custom-analysis/custom-analysis-options/hardware-event-list/instructions-retired-event.html
The Instructions Retired is an important hardware performance event that shows how many instructions were completely executed.
Modern processors execute many more instructions than the program flow needs. This is called speculative execution.
Instructions that were “proven” as indeed needed by the program execution flow are “retired”.
In the Core out-of-order pipeline, leaving the Retirement Unit means that the instructions have finally been executed and that their results are correct and visible in the architectural state, as if they had executed in order.
So an instruction processed by the CPU can be executed but not necessarily retired. A retired instruction has usually been executed, except when it does not require an execution unit. One example is mov elimination (see my post What optimizations you can expect from CPU?). Taking this into account, we can usually expect the number of executed instructions to be higher than the number of retired instructions.
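The relationship between executed and retired counts can be illustrated with a small toy model. This is only a sketch under stated assumptions: the function name, the `WRONG_PATH_DEPTH` constant, and the idea that each misprediction costs a fixed number of wrong-path instructions are all hypothetical, and the model ignores cases such as mov elimination where an instruction retires without using an execution unit.

```python
# Hypothetical number of wrong-path instructions executed per misprediction.
WRONG_PATH_DEPTH = 3

def count_executed_and_retired(program_length, mispredicted_branches):
    """Return (executed, retired) for a program of `program_length` instructions.

    `mispredicted_branches` is a set of instruction indices that are treated
    as mispredicted branches in this toy model.
    """
    executed = 0
    retired = 0
    for index in range(program_length):
        executed += 1   # the correct-path instruction executes...
        retired += 1    # ...and eventually retires
        if index in mispredicted_branches:
            # Wrong-path work: executed speculatively, then squashed, never retired.
            executed += WRONG_PATH_DEPTH
    return executed, retired

executed, retired = count_executed_and_retired(10, {3, 7})
print(executed, retired)  # 16 10
```

With no mispredicted branches the two counts are equal in this model; every misprediction widens the gap.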
branch misprediction
speculative execution
Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.
The objective is to provide more concurrency if extra resources are available. This approach is employed in a variety of areas, including branch prediction in pipelined processors, value prediction for exploiting value locality, prefetching memory and files, and optimistic concurrency control in database systems.
Modern pipelined microprocessors use speculative execution to reduce the cost of conditional branch instructions using schemes that predict the execution path of a program based on the history of branch executions. In order to improve performance and utilization of computer resources, instructions can be scheduled at a time when it has not yet been determined that the instructions will need to be executed, ahead of a branch.
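One classic textbook scheme for predicting a branch from its history is the two-bit saturating counter. The sketch below is an illustration of that scheme only, not a description of what any particular CPU implements; the class and function names are made up for this example.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial_state=2):
        self.state = initial_state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes):
    """Run one predictor over a sequence of branch outcomes (True = taken)."""
    predictor = TwoBitPredictor()
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            mispredictions += 1
        predictor.update(taken)
    return mispredictions

# A loop branch that is taken 9 times and then falls through once on exit.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # 1
```

The loop branch is mispredicted only on the final iteration, whereas a strictly alternating pattern defeats this simple predictor much more often.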
Execute and writeback decoupling allows program restart:
The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.
The ability to issue instructions past branches that are yet to resolve is known as speculative execution.
Instruction window
An instruction window in computer architecture refers to the set of instructions which can execute out-of-order in a speculative processor.
In particular, in a conventional design, the instruction window consists of all instructions which are in the re-order buffer (ROB).[1] In such a processor, any instruction within the instruction window can be executed when its operands are ready. Out-of-order processors derive their name because this may occur out-of-order (if operands to a younger instruction are ready before those of an older instruction).
The instruction window has a finite size, and new instructions can enter the window (usually called dispatch or allocate) only when other instructions leave the window (usually called retire or commit). Instructions enter and leave the instruction window in program order, and an instruction can only leave the window when it is the oldest instruction in the window and it has been completed. Hence, the instruction window can be seen as a sliding window in which the instructions can become out-of-order. All execution within the window is speculative (i.e., side effects are not applied outside the CPU) until it is committed, in order to support asynchronous exception handling such as interrupts.
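The sliding-window behaviour can be sketched with a small simulation. This is a simplified model under assumptions that are not in the text above: instructions have no data dependencies, each starts executing in the cycle it is allocated, and any number of instructions may be allocated or retired per cycle. The function name and the latency values are hypothetical.

```python
from collections import deque

def simulate_window(latencies, window_size):
    """Return the cycle in which each instruction retires.

    latencies[i] is the assumed execution latency of instruction i.
    """
    window = deque()  # entries: (index, cycle at which execution completes)
    retire_cycle = [None] * len(latencies)
    next_to_allocate = 0
    cycle = 0
    while next_to_allocate < len(latencies) or window:
        # Retire in program order: only the oldest entry may leave, and only once done.
        while window and window[0][1] <= cycle:
            index, _ = window.popleft()
            retire_cycle[index] = cycle
        # Allocate in program order while the window has free entries.
        while next_to_allocate < len(latencies) and len(window) < window_size:
            window.append((next_to_allocate, cycle + latencies[next_to_allocate]))
            next_to_allocate += 1
        cycle += 1
    return retire_cycle

# One slow instruction followed by three fast ones.
print(simulate_window([5, 1, 1, 1], window_size=2))  # [5, 5, 6, 6]
print(simulate_window([5, 1, 1, 1], window_size=4))  # [5, 5, 5, 5]
```

In both runs the second instruction finishes at cycle 1 but cannot retire before the older, slower instruction at cycle 5. With the smaller window, the last two instructions cannot even enter until entries are freed.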
In-order processors
In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of the following steps:
- Instruction fetch.
- If input operands are available (in processor registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
- The instruction is executed by the appropriate functional unit.
- The functional unit writes the results back to the register file.
Often, an in-order processor will have a straightforward “bit vector” that records which registers a pipeline will (eventually) write to. If any input operand has its corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices for hazard avoidance where in-order execution uses a 1D vector.
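A minimal sketch of such a bit vector, assuming registers are identified by small integers and using a plain Python integer as the vector. The function name and register numbers are chosen for illustration only.

```python
def must_stall(pending_writes, source_registers):
    """Return True if any source register still has a write in flight.

    pending_writes is an integer used as a bit vector: bit n is set while
    some instruction in the pipeline will (eventually) write register n.
    """
    needed = 0
    for register in source_registers:
        needed |= 1 << register
    return (pending_writes & needed) != 0

pending = 0
pending |= 1 << 3                    # a load that will write r3 enters the pipeline
print(must_stall(pending, [1, 2]))   # False: neither r1 nor r2 is pending
print(must_stall(pending, [3, 4]))   # True: r3 has not been written yet
pending &= ~(1 << 3)                 # the load writes back, so its bit is cleared
print(must_stall(pending, [3, 4]))   # False
```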
Out of order execution
This new paradigm breaks up the processing of instructions into these steps:
- Instruction fetch.
- Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
- The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions.
- The instruction is issued to the appropriate functional unit and executed by that unit.
- The results are queued.
- Only after all older instructions have written their results back to the register file is this result written back to the register file. This is called the graduation or retire stage.
The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.
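The difference can be made concrete with a toy issue model. This is only a sketch under several assumptions: at most one instruction issues per cycle, each register is written by at most one instruction (as if renaming had already happened), every source register is either never written or written by an older instruction, and in-order retirement is not modelled. The function name, register names, and latencies are hypothetical.

```python
def run(program, out_of_order):
    """Return the cycle at which the last result becomes available.

    program: list of (dest, sources, latency) tuples in program order.
    """
    # Registers produced by the program are unavailable until their producer issues.
    ready_at = {dest: float("inf") for dest, _, _ in program}
    waiting = list(range(len(program)))
    cycle = 0
    finish = 0
    while waiting:
        # In-order may only consider the oldest waiting instruction;
        # out-of-order may pick the oldest instruction whose operands are ready.
        candidates = waiting if out_of_order else waiting[:1]
        for index in candidates:
            dest, sources, latency = program[index]
            # Registers never written by the program are treated as ready at cycle 0.
            if all(ready_at.get(src, 0) <= cycle for src in sources):
                ready_at[dest] = cycle + latency
                finish = max(finish, cycle + latency)
                waiting.remove(index)
                break
        cycle += 1
    return finish

program = [
    ("r1", [], 4),       # slow load, e.g. a cache miss
    ("r2", ["r1"], 1),   # depends on the load
    ("r3", [], 1),       # independent of the load
    ("r4", ["r3"], 1),   # depends only on r3
]
print(run(program, out_of_order=False))  # 7
print(run(program, out_of_order=True))   # 5
```

In the in-order run the second instruction stalls on `r1` and blocks everything behind it. In the out-of-order run the independent instructions issue while the load is still outstanding.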