· ☕ 3 min
https://iximiuz.com/en/posts/prometheus-functions-agg-over-time/ Almost all the functions in the aggregation family accept just a single parameter: a range vector. This means that the “over time” part, i.e., the duration of the aggregation period, comes from the range vector definition itself. The only way to construct a range vector in PromQL is by appending a bracketed duration to a vector selector, e.g. http_requests_total[5m]. Therefore, an <agg>_over_time() function can be applied only to a vector selector, meaning the aggregation will always be done using raw scrapes.
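To make the "range vector in, single value out" idea concrete, here is a minimal Python sketch. It is not how Prometheus is implemented: the names select_range and agg_over_time, the sample data, and the left-open window boundary are all assumptions made for illustration.

```python
def select_range(samples, eval_time, window):
    """Mimic a range selector such as metric[30s] (hypothetical helper).

    `samples` is a list of (timestamp_seconds, value) raw scrapes.
    The window is treated as (eval_time - window, eval_time], which is a
    simplifying assumption about the boundaries.
    """
    return [(ts, value) for ts, value in samples
            if eval_time - window < ts <= eval_time]


def mean(values):
    return sum(values) / len(values)


def agg_over_time(agg, samples, eval_time, window):
    """Apply an aggregation to the raw samples inside one window."""
    values = [value for _, value in select_range(samples, eval_time, window)]
    return agg(values) if values else None


# Made-up scrapes taken every 15 seconds.
scrapes = [(0, 1.0), (15, 2.0), (30, 4.0), (45, 6.0)]

avg_last_30s = agg_over_time(mean, scrapes, eval_time=45, window=30)
max_last_30s = agg_over_time(max, scrapes, eval_time=45, window=30)
```

The aggregation period is not a separate argument of the aggregation itself; it comes entirely from how the window of raw samples was selected.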

· ☕ 1 min
https://en.wikipedia.org/wiki/Cache_coherence Theoretically, coherence can be performed at the load/store granularity. However, in practice it is generally performed at the granularity of cache blocks. https://www.geeksforgeeks.org/cache-coherence/ Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion. http://tutorials.jenkov.com/java-concurrency/cache-coherence-in-java-concurrency.html

· ☕ 1 min
https://events.static.linuxfound.org/sites/events/files/slides/Optimizing%20Application%20Performance%20in%20Large%20Multi-core%20Systems_0.pdf Ticket spinlock is the spinlock implementation used in the Linux kernel prior to 4.2. A lock waiter gets a ticket number and spins on the lock cacheline until it sees its ticket number. At that point, it becomes the lock owner and enters the critical section. Queued spinlock is the new spinlock implementation used in Linux kernel 4.2 and later. A lock waiter goes into a queue and spins in its own cacheline until it becomes the queue head.
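The ticket scheme described above can be sketched in a few lines of Python. This is a toy model, not the kernel's code: the TicketLock class and run_demo helper are hypothetical names, and a threading.Lock stands in for the atomic fetch-and-add that real hardware would provide.

```python
import itertools
import threading
import time


class TicketLock:
    """Toy ticket lock: waiters enter strictly in ticket order."""

    def __init__(self):
        self._tickets = itertools.count()       # next ticket to hand out
        self._ticket_guard = threading.Lock()   # assumption: emulates atomic fetch-and-add
        self.now_serving = 0                    # ticket currently allowed in

    def acquire(self):
        with self._ticket_guard:
            my_ticket = next(self._tickets)
        # Every waiter polls the same shared field, which is the analogue
        # of all waiters spinning on one lock cacheline.
        while self.now_serving != my_ticket:
            time.sleep(0)  # yield so the current owner can make progress
        return my_ticket

    def release(self):
        # Only the current owner writes this field.
        self.now_serving += 1


def run_demo(threads=4, iterations=50):
    lock = TicketLock()
    state = {"counter": 0}
    order = []  # tickets in the order they entered the critical section

    def worker():
        for _ in range(iterations):
            ticket = lock.acquire()
            order.append(ticket)
            state["counter"] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return state["counter"], order
```

A queued lock differs in that each waiter would poll its own private flag instead of the shared now_serving field, so a release disturbs only the next waiter's cacheline.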

· ☕ 2 min
reference cycle https://easyperf.net/blog/2018/09/04/Performance-Analysis-Vocabulary The majority of modern CPUs, including Intel’s and AMD’s, don’t operate at a fixed frequency. Instead, they use dynamic frequency scaling. In Intel’s CPUs this technology is called Turbo Boost; in AMD’s processors it’s called Turbo Core. There is a nice explanation of the term “reference cycles” in this stackoverflow thread: Having a snippet A to run in 100 core clocks and a snippet B in 200 core clocks means that B is slower in general (it takes double the work), but not necessarily that B took more time than A since the units are different.
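A small worked example may help show why core clocks and wall time can disagree. The frequencies below are made-up numbers chosen for easy arithmetic, and both helper functions are hypothetical.

```python
def elapsed_ns(core_cycles, core_ghz):
    """Wall time in nanoseconds for a cycle count at a given core frequency."""
    return core_cycles / core_ghz


def reference_cycles(core_cycles, core_ghz, ref_ghz):
    """Cycles a fixed-frequency reference clock would count in the same time."""
    return elapsed_ns(core_cycles, core_ghz) * ref_ghz


# Assumption: snippet A ran while the core was at 1.0 GHz,
# snippet B ran while it was boosted to 4.0 GHz.
time_a = elapsed_ns(100, core_ghz=1.0)   # 100 core clocks
time_b = elapsed_ns(200, core_ghz=4.0)   # 200 core clocks

# Assumption: the reference clock ticks at a constant 2.0 GHz.
ref_a = reference_cycles(100, core_ghz=1.0, ref_ghz=2.0)
ref_b = reference_cycles(200, core_ghz=4.0, ref_ghz=2.0)
```

Under these assumed frequencies B needs twice the core clocks of A yet finishes in half the time, and the reference cycle counts track the elapsed time rather than the amount of work.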

· ☕ 5 min
retired instruction https://easyperf.net/blog/2018/09/04/Performance-Analysis-Vocabulary https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/custom-analysis/custom-analysis-options/hardware-event-list/instructions-retired-event.html Instructions Retired is an important hardware performance event that shows how many instructions were completely executed. Modern processors execute many more instructions than the program flow needs. This is called speculative execution. Instructions that were “proven” as indeed needed by the program execution flow are “retired”. In the Core Out Of Order pipeline, leaving the Retirement Unit means that the instructions are finally executed and their results are correct and visible in the architectural state as if they had executed in order.

· ☕ 1 min
Ref https://rigtorp.se/low-latency-guide/

· ☕ 3 min
CPI/IPC CPI Let us assume a ‘classic RISC pipeline’, with the following five stages: Instruction fetch cycle (IF). Instruction decode/Register fetch cycle (ID). Execution/Effective address cycle (EX). Memory access (MEM). Write-back cycle (WB). Each stage requires one clock cycle and an instruction passes through the stages sequentially. Without pipelining, in a multi-cycle processor, a new instruction is fetched in stage 1 only after the previous instruction finishes at stage 5, therefore the number of clock cycles it takes to execute an instruction is five (CPI = 5 > 1).
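The cycle counting above can be written out as a short calculation. The multi-cycle case follows the text directly; the pipelined case is the standard textbook idealization (no stalls or hazards), which is an assumption added here for contrast. All function names are hypothetical.

```python
STAGES = 5  # IF, ID, EX, MEM, WB


def multicycle_cycles(instructions, stages=STAGES):
    """No pipelining: each instruction occupies all stages before the next starts."""
    return instructions * stages


def ideal_pipelined_cycles(instructions, stages=STAGES):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


def cpi(cycles, instructions):
    """Cycles per instruction; IPC is simply the reciprocal."""
    return cycles / instructions


n = 1000
cpi_multicycle = cpi(multicycle_cycles(n), n)
cpi_pipelined = cpi(ideal_pipelined_cycles(n), n)
```

For a long instruction stream the ideal pipelined CPI approaches 1, while the multi-cycle design stays at 5 regardless of the stream length.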

· ☕ 2 min
PMC: retired instruction http://web.eece.maine.edu/~vweaver/projects/perf_counters/retired_instructions.html x86 and x86_64 Retired instruction counts on x86 in general also include at least one extra instruction each time a hardware interrupt happens, even if only user space code is being monitored. The one exception to this is the Pentium 4 counter. Another special case is rep-prefixed string instructions. Even if the instruction repeats many times, it is only counted as one instruction. A page fault that brings a page into memory for the first time (on a load or store) also counts as an additional instruction.

· ☕ 3 min
https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics CPU mental model and simplest PMU counter In a really simplified view our processor looks like this: There is a clock generator that sends pulses to every piece of the system to make everything move to the next stage. This is called a cycle. If we add just a little bit of silicon and connect it to the pulse generator, we can count the number of cycles, yay!

· ☕ 1 min
If we don’t set the scaling governor policy to performance, the kernel can decide that it’s better to save power and throttle. Setting scaling_governor to ‘performance’ helps to avoid sub-nominal clocking. Here is the documentation about Linux CPU frequency governors. Here is how we can set it for all the cores:
    # Run as root: switch every core's governor to "performance".
    for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    do
      echo performance > "$i"
    done
Ref https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt

· ☕ 1 min
pool1-n104-vpod1-wpool1-n23:~/pmu-tools # x86info -c
x86info vVERSION
Found 80 identical CPUs
MP Configuration Table Header MISSING!
Extended Family: 0 Extended Model: 5 Family: 6 Model: 85 Stepping: 7
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Unknown model.
Processor name string (BIOS programmed): Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
Cache info
 L1 Data Cache: 32KB, 8-way associative, 64 byte line size
 L1 Instruction Cache: 32KB, 8-way associative, 64 byte line size
 L2 Unified Cache: 1024KB, 16-way associative, 64 byte line size
 L3 Unified Cache: 28160KB, 11-way associative, 64 byte line size
TLB info
 Instruction TLB: 2M/4M pages, fully associative, 8 entries
 Instruction TLB: 4K pages, 8-way associative, 64 entries
 Data TLB: 1GB pages, 4-way set associative, 4 entries
 Data TLB: 4KB pages, 4-way associative, 64 entries
 Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
 64 byte prefetching.

· ☕ 4 min
What is TLB https://en.wikipedia.org/wiki/Translation_lookaside_buffer A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip’s memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between the CPU cache and the main memory, or between the different levels of the multi-level cache.
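The "address-translation cache" idea can be modeled with a tiny Python sketch. Everything here is an illustrative assumption: the TinyTLB class, the 4 KB page size, the dictionary used as a page table, and the LRU replacement policy (real TLBs use various hardware policies).

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KB pages


class TinyTLB:
    """Toy TLB: caches recent virtual-page -> physical-frame translations."""

    def __init__(self, page_table, entries=4):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.entries = entries        # capacity of the cache
        self.cache = OrderedDict()    # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)              # mark as most recently used
        else:
            self.misses += 1
            self.cache[vpn] = self.page_table[vpn]   # stands in for the slow page walk
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)       # evict the least recently used entry
        return self.cache[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}          # made-up mappings
tlb = TinyTLB(page_table, entries=2)
first = tlb.translate(0x10)              # miss: page 0 is loaded into the TLB
second = tlb.translate(0x20)             # hit: same page as before
third = tlb.translate(PAGE_SIZE + 5)     # miss: page 1
```

Repeated accesses to the same page are served from the cache, which is the time saving the definition above refers to.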

· ☕ 1 min
turboboost Intel Turbo Boost is a feature that automatically raises CPU operating frequency when demanding tasks are running. It can be permanently disabled in BIOS. Check FAQ for more information. To disable turbo in Linux do:
    # Intel
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    # AMD
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
Also you might want to take a look at how it’s done in uarch-bench. Example (single-threaded workload running on Intel® Core™ i5-8259U):