Write something, you lazy person.
· ☕ 3 min
CPI/IPC
CPI
Let us assume a ‘classic RISC pipeline’, with the following five stages:
1. Instruction fetch cycle (IF).
2. Instruction decode/Register fetch cycle (ID).
3. Execution/Effective address cycle (EX).
4. Memory access (MEM).
5. Write-back cycle (WB).
Each stage requires one clock cycle and an instruction passes through the stages sequentially. Without pipelining, in a multi-cycle processor, a new instruction is fetched in stage 1 only after the previous instruction finishes at stage 5, therefore the number of clock cycles it takes to execute an instruction is five (CPI = 5 > 1).
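To make the CPI arithmetic concrete, here is a minimal sketch. The multi-cycle case follows the text directly; the pipelined case is an assumption that goes beyond the excerpt (an ideal pipeline with no stalls or hazards), and all function names are made up for illustration.

```python
STAGES = 5  # IF, ID, EX, MEM, WB, one clock cycle each


def multicycle_cycles(n_instructions: int) -> int:
    """Cycles when an instruction must leave stage 5 before the next is fetched."""
    return STAGES * n_instructions


def ideal_pipeline_cycles(n_instructions: int) -> int:
    """Cycles for an ideal pipeline (assumption: no stalls, no hazards).

    The first instruction needs STAGES cycles; every later instruction
    completes one cycle after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return STAGES + (n_instructions - 1)


def cpi(cycles: int, n_instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / n_instructions


n = 1000
print(cpi(multicycle_cycles(n), n))      # 5.0
print(cpi(ideal_pipeline_cycles(n), n))  # 1.004
```

Under these assumptions the pipelined CPI approaches 1 as the instruction count grows, while the multi-cycle CPI stays at 5.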
· ☕ 2 min
PMC: retired instruction
http://web.eece.maine.edu/~vweaver/projects/perf_counters/retired_instructions.html
x86 and x86_64
Retired instruction counts on x86 in general also include at least one extra instruction each time a hardware interrupt happens, even if only user space code is being monitored. The one exception to this is the Pentium 4 counter.
Another special case is rep-prefixed string instructions. Even if the instruction repeats many times, it is counted as only one instruction.
A page fault that brings a page into memory for the first time (on a load or store) also counts as an additional instruction.
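The three rules above can be folded into a rough lower bound on what the counter reports. This is only a sketch of the bookkeeping, not how any tool computes it; the function names and the trace format are hypothetical.

```python
def count_dynamic_instructions(trace):
    """Count instructions in a trace of (mnemonic, iterations) pairs.

    A rep-prefixed string instruction counts once, so the iteration
    count is deliberately ignored.
    """
    return len(trace)


def min_expected_retired(instructions, hw_interrupts=0, first_touch_page_faults=0):
    """Lower bound on the retired-instruction count on x86 (not Pentium 4).

    Each hardware interrupt adds at least one instruction, and each page
    fault that brings a page in for the first time adds one more.
    """
    return instructions + hw_interrupts + first_touch_page_faults


trace = [("mov", 1), ("rep movsb", 4096), ("add", 1)]
n = count_dynamic_instructions(trace)
print(n)                                                   # 3
print(min_expected_retired(n, hw_interrupts=2,
                           first_touch_page_faults=1))     # 6
```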
· ☕ 3 min
https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics
CPU mental model and simplest PMU counter
In a really simplified view our processor looks like this:
There is a clock generator that sends pulses to every piece of the system to make everything move to the next stage. This is called a cycle. If we add just a little bit of silicon and connect it to the pulse generator we can count the number of cycles, yay!
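The same mental model as a toy simulation: a clock generator that pulses every connected component, plus a counter hooked to it. The class names are invented for this sketch and do not correspond to any real PMU interface.

```python
class CycleCounter:
    """The 'little bit of silicon': increments once per clock pulse."""

    def __init__(self):
        self.cycles = 0

    def on_pulse(self):
        self.cycles += 1


class ClockGenerator:
    """Sends a pulse to every connected component."""

    def __init__(self):
        self.listeners = []

    def connect(self, callback):
        self.listeners.append(callback)

    def pulse(self, n=1):
        for _ in range(n):
            for callback in self.listeners:
                callback()


clock = ClockGenerator()
counter = CycleCounter()
clock.connect(counter.on_pulse)
clock.pulse(100)
print(counter.cycles)  # 100
```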
· ☕ 1 min
If we don’t set the scaling governor policy to performance, the kernel can decide that it’s better to save power and throttle. Setting scaling_governor to ‘performance’ helps to avoid sub-nominal clocking. Here is the documentation about Linux CPU frequency governors.
Here is how we can set it for all the cores:
    for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    do
      echo performance > $i
    done
Ref
https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
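Before measuring, it can help to check that the setting actually stuck on every core. A minimal sketch that reads the same sysfs files; the function names and the `cpu_root` parameter are made up here so the code can be pointed at a test directory.

```python
import glob
import os


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every core exposing scaling_governor."""
    pattern = os.path.join(cpu_root, "cpu*", "cpufreq", "scaling_governor")
    governors = {}
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpu0/cpufreq/scaling_governor -> cpu0
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def cores_not_in_performance(governors):
    """List the cores whose governor is anything other than 'performance'."""
    return [cpu for cpu, gov in governors.items() if gov != "performance"]


governors = read_governors()
print(cores_not_in_performance(governors))
```

On a machine without cpufreq support the glob matches nothing and the result is an empty list.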
· ☕ 1 min
pool1-n104-vpod1-wpool1-n23:~/pmu-tools # x86info -c
x86info vVERSION
Found 80 identical CPUs
MP Configuration Table Header MISSING!
Extended Family: 0 Extended Model: 5 Family: 6 Model: 85 Stepping: 7
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Unknown model.
Processor name string (BIOS programmed): Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
Cache info
 L1 Data Cache: 32KB, 8-way associative, 64 byte line size
 L1 Instruction Cache: 32KB, 8-way associative, 64 byte line size
 L2 Unified Cache: 1024KB, 16-way associative, 64 byte line size
 L3 Unified Cache: 28160KB, 11-way associative, 64 byte line size
TLB info
 Instruction TLB: 2M/4M pages, fully associative, 8 entries
 Instruction TLB: 4K pages, 8-way associative, 64 entries
 Data TLB: 1GB pages, 4-way set associative, 4 entries
 Data TLB: 4KB pages, 4-way associative, 64 entries
 Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
 64 byte prefetching.
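The cache lines in this output are enough to work out how many sets each cache has (size divided by ways times line size), and the TLB lines give the memory each TLB can cover. A small worked sketch using the numbers above; the helper names are made up.

```python
def cache_sets(size_kb, ways, line_bytes):
    """Number of sets = total bytes / (ways * line size)."""
    total = size_kb * 1024
    bytes_per_set = ways * line_bytes
    if total % bytes_per_set != 0:
        raise ValueError("size is not a whole number of sets")
    return total // bytes_per_set


def tlb_reach_kb(entries, page_kb):
    """Memory covered when every TLB entry maps one page of page_kb KB."""
    return entries * page_kb


# (size in KB, ways, line size in bytes), copied from the x86info output
caches = {
    "L1 Data": (32, 8, 64),
    "L1 Instruction": (32, 8, 64),
    "L2 Unified": (1024, 16, 64),
    "L3 Unified": (28160, 11, 64),
}
for name, (size_kb, ways, line) in caches.items():
    print(name, cache_sets(size_kb, ways, line))
# L1 Data 64
# L1 Instruction 64
# L2 Unified 1024
# L3 Unified 40960

print(tlb_reach_kb(64, 4))    # 256  (Data TLB, 4KB pages)
print(tlb_reach_kb(1536, 4))  # 6144 (Shared L2 TLB with 4KB pages)
```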
· ☕ 4 min
What is TLB
https://en.wikipedia.org/wiki/Translation_lookaside_buffer
A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip’s memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache.
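A toy model of the idea: a small cache of recent virtual-to-physical page translations sitting in front of a slower page-table lookup. Everything here is an assumption made for illustration, including the 4 KB page size, the LRU replacement policy (real hardware policies vary), and the dict that stands in for the page table.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KB pages


class TLB:
    """Address-translation cache with LRU replacement (assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # vpn -> pfn, stands in for a page walk
        self.entries = OrderedDict()  # recent translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # slow path
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = TLB(capacity=2, page_table={0: 7, 1: 3, 2: 9})
print(tlb.translate(0x0010))  # 28688: miss, page 0 maps to frame 7
tlb.translate(0x0020)         # hit on page 0
tlb.translate(0x1000)         # miss on page 1
tlb.translate(0x2000)         # miss on page 2, evicts page 0
tlb.translate(0x0000)         # miss again, page 0 was evicted
print(tlb.hits, tlb.misses)   # 1 4
```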
· ☕ 1 min
turboboost
Intel Turbo Boost is a feature that automatically raises CPU operating frequency when demanding tasks are running. It can be permanently disabled in BIOS. Check FAQ for more information. To disable turbo in Linux do:
    # Intel
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    # AMD
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
Also you might want to take a look at how it’s done in uarch-bench.
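Note that the two files have opposite polarity: `no_turbo=1` disables turbo on Intel, while `boost=0` disables it on AMD. A small sketch that reads whichever file exists and reports the current state; the function name and the `cpu_root` parameter are hypothetical.

```python
import os


def turbo_enabled(cpu_root="/sys/devices/system/cpu"):
    """Return True/False for the turbo state, or None if neither file exists."""
    intel = os.path.join(cpu_root, "intel_pstate", "no_turbo")
    amd = os.path.join(cpu_root, "cpufreq", "boost")
    if os.path.exists(intel):
        with open(intel) as f:
            return f.read().strip() == "0"  # no_turbo=0 means turbo is on
    if os.path.exists(amd):
        with open(amd) as f:
            return f.read().strip() == "1"  # boost=1 means turbo is on
    return None


print(turbo_enabled())
```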
Example (single-threaded workload running on Intel® Core™ i5-8259U):
· ☕ 2 min
Normal TCP Close Phases
https://accedian.com/blog/close-tcp-sessions-diagnose-disconnections/
Figure 1 – Simplified TCP closing with FIN.
The standard way to close TCP sessions is to send a FIN packet, then wait for a FIN response from the other party.
1. A sends a FIN packet and waits for a response; it can release some resources but awaits the response of the other party (Fin Wait).
2. B receives the FIN packet and must release resources; it waits for the application level to close (Close Wait).
3. B can now send a FIN to A and then await its acknowledgement (Last Ack wait).
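The close sequence can be written down as a tiny state machine. This is a simplified sketch: FIN_WAIT merges the two FIN-wait states of real TCP, and the TIME_WAIT transition for A comes from the standard TCP state diagram rather than from the steps above. The event names are invented for the example.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT",    # step 1, side A
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",  # step 2, side B
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",     # step 3, side B
    ("LAST_ACK", "recv ACK"): "CLOSED",         # B gets the final ACK
    ("FIN_WAIT", "recv FIN"): "TIME_WAIT",      # not in the text above
}


def step(state, event):
    """Apply one event; raise if the event is not valid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected {event!r} in state {state}")


a = b = "ESTABLISHED"
a = step(a, "send FIN")  # A starts the close
b = step(b, "recv FIN")  # B waits for its application to close
b = step(b, "send FIN")  # B sends its own FIN
a = step(a, "recv FIN")  # A acknowledges B's FIN
b = step(b, "recv ACK")  # B is done
print(a, b)  # TIME_WAIT CLOSED
```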
· ☕ 1 min
https://accedian.com/blog/diagnose-tcp-connection-setup-issues/
A TCP connection setup, also called the 3-way handshake, is achieved with SYN, SYN+ACK and ACK packets. From this handshake, we can extract a performance metric called Connection Time (CT), which summarizes how fast a session can be set up between a client and a server over a network. For more details, see this excellent article on Wikipedia.
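From the client side, Connection Time can be approximated by timing how long `connect()` takes, since the call returns once the SYN+ACK has arrived. This is a rough sketch, not the article's measurement method: it also includes local socket setup and any name resolution, and the function name is made up.

```python
import socket
import time


def connection_time(host, port, timeout=5.0):
    """Seconds taken to establish a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed
```

For example, `connection_time("127.0.0.1", 8080)` would time the handshake against a local server, assuming something is listening on that port.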