📅 0001-01-01 · ☕ 2 min

PMC: retired instruction

http://web.eece.maine.edu/~vweaver/projects/perf_counters/retired_instructions.html

x86 and x86_64

Retired instruction counts on x86 in general also include at least one extra instruction each time a hardware interrupt happens, even if only user space code is being monitored. The one exception to this is the Pentium 4 counter.

Another special case is rep-prefixed string instructions: even if the instruction repeats many times, it is counted as only one instruction.

📅 0001-01-01 · ☕ 3 min

https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics

CPU mental model and simplest PMU counter

In a really simplified view our processor looks like this:

There is a clock generator that sends pulses to every piece of the system to move everything to the next stage. Each pulse is called a cycle. If we add just a little bit of silicon and connect it to the pulse generator, we can count the number of cycles, yay!

📅 0001-01-01 · ☕ 1 min

If we don’t set the scaling governor policy to performance, the kernel can decide that it’s better to save power and throttle the CPU. Setting scaling_governor to ‘performance’ helps avoid sub-nominal clocking. Here is the documentation about Linux CPU frequency governors.

Here is how we can set it for all the cores:


# Run as root: writing to sysfs needs elevated privileges.
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
  echo performance > "$i"
done
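
To confirm the change took effect, the governor files can be read back. This is a minimal sketch; the function name `count_governors` and its optional root-directory argument are illustrative and not part of the original post.

```shell
# count_governors [CPU_ROOT]
# Prints "<count> <governor>" for each distinct governor found under
# CPU_ROOT (defaults to the real sysfs location).
count_governors() {
  cg_root="${1:-/sys/devices/system/cpu}"
  cat "$cg_root"/cpu*/cpufreq/scaling_governor 2>/dev/null \
    | sort | uniq -c | awk '{ print $1, $2 }'
}
```

Calling `count_governors` with no argument after the loop should report a single line for `performance` if every core accepted the setting.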

Ref

https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt

📅 0001-01-01 · ☕ 1 min

pool1-n104-vpod1-wpool1-n23:~/pmu-tools # x86info -c
x86info vVERSION
Found 80 identical CPUs
MP Configuration Table Header MISSING!

Extended Family: 0 Extended Model: 5 Family: 6 Model: 85 Stepping: 7
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Unknown model.
Processor name string (BIOS programmed): Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz

Cache info
 L1 Data Cache: 32KB, 8-way associative, 64 byte line size
 L1 Instruction Cache: 32KB, 8-way associative, 64 byte line size
 L2 Unified Cache: 1024KB, 16-way associative, 64 byte line size
 L3 Unified Cache: 28160KB, 11-way associative, 64 byte line size
TLB info
 Instruction TLB: 2M/4M pages, fully associative, 8 entries
 Instruction TLB: 4K pages, 8-way associative, 64 entries
 Data TLB: 1GB pages, 4-way set associative, 4 entries
 Data TLB: 4KB pages, 4-way associative, 64 entries
 Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
 64 byte prefetching.
Total processor threads: 80
This system has 2 20-core processors with hyper-threading (2 threads per core) running at an estimated 2.30GHz
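
One way to read the TLB lines in this output is to estimate how much memory each TLB can map at once (often called TLB reach), which is entries multiplied by page size. The sketch below uses the entry counts printed above; the helper name `tlb_reach_kb` is illustrative.

```shell
# tlb_reach_kb ENTRIES PAGE_SIZE_KB
# Prints the amount of memory (in KB) covered by a fully populated TLB.
tlb_reach_kb() {
  echo $(( $1 * $2 ))
}

tlb_reach_kb 64 4         # data TLB, 4KB pages:      256 KB
tlb_reach_kb 1536 4       # shared L2 TLB, 4KB pages: 6144 KB (6 MB)
tlb_reach_kb 1536 2048    # shared L2 TLB, 2MB pages: 3145728 KB (3 GB)
tlb_reach_kb 4 1048576    # data TLB, 1GB pages:      4194304 KB (4 GB)
```

With 4KB pages even the shared L2 TLB covers only 6 MB, far less than the 28160KB L3 cache, which is part of why huge pages can help memory-heavy workloads.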

Ref

Huge pages part 5: A deeper look at TLBs and costs
TLB and Java

📅 0001-01-01 · ☕ 4 min

What is TLB

https://en.wikipedia.org/wiki/Translation_lookaside_buffer

A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip’s memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache.
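
To make the translation step concrete: the TLB is looked up by virtual page number, and the offset inside the page passes through unchanged. The sketch below splits an address that way, assuming 4KB pages (12 offset bits) by default. The function name and the sample address are made up for illustration.

```shell
# split_vaddr ADDR [PAGE_SHIFT]
# Splits a virtual address into the virtual page number (the part a TLB
# would translate) and the in-page offset. PAGE_SHIFT defaults to 12,
# i.e. 4KB pages; use 21 for 2MB pages.
split_vaddr() {
  sv_shift="${2:-12}"
  printf 'vpn=0x%x offset=0x%x\n' \
    $(( $1 >> sv_shift )) \
    $(( $1 & ((1 << sv_shift) - 1) ))
}

split_vaddr 0x7f3a12345678   # prints vpn=0x7f3a12345 offset=0x678
```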

📅 0001-01-01 · ☕ 1 min

turboboost

Intel Turbo Boost is a feature that automatically raises CPU operating frequency when demanding tasks are running. It can be permanently disabled in BIOS. Check FAQ for more information. To disable turbo in Linux do:

# Intel
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD
echo 0 > /sys/devices/system/cpu/cpufreq/boost

Also you might want to take a look at how it’s done in uarch-bench.
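
Before benchmarking it can be handy to check which state turbo is in. The sketch below reads the same two sysfs files used in the commands above. The function name `turbo_state` and the optional root argument are illustrative, and it assumes only one of the two interfaces is present on a given machine.

```shell
# turbo_state [CPU_ROOT]
# Prints "enabled", "disabled", or "unknown" based on the sysfs knobs.
turbo_state() {
  ts_root="${1:-/sys/devices/system/cpu}"
  if [ -r "$ts_root/intel_pstate/no_turbo" ]; then
    # Intel: no_turbo=1 means turbo is disabled.
    if [ "$(cat "$ts_root/intel_pstate/no_turbo")" = "1" ]; then
      echo disabled
    else
      echo enabled
    fi
  elif [ -r "$ts_root/cpufreq/boost" ]; then
    # AMD: boost=0 means boost is disabled.
    if [ "$(cat "$ts_root/cpufreq/boost")" = "0" ]; then
      echo disabled
    else
      echo enabled
    fi
  else
    echo unknown
  fi
}
```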

Example (single-threaded workload running on Intel® Core™ i5-8259U):

# TurboBoost enabled
$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
$ perf stat -e task-clock,cycles -- ./a.out
      11984.691958      task-clock (msec)         #    1.000 CPUs utilized
    32,427,294,227      cycles                    #    2.706 GHz
      11.989164338 seconds time elapsed
# TurboBoost disabled
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
1
$ perf stat -e task-clock,cycles -- ./a.out
      13055.200832      task-clock (msec)         #    0.993 CPUs utilized
    29,946,969,255      cycles                    #    2.294 GHz
      13.142983989 seconds time elapsed

You can see the average frequency is noticeably higher when TurboBoost is on: 2.706 GHz versus 2.294 GHz with it off.
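
The GHz column that perf prints is simply cycles divided by task-clock time. A small sketch that recomputes it from the numbers above (the helper name `avg_ghz` is illustrative):

```shell
# avg_ghz CYCLES TASK_CLOCK_MSEC
# Average frequency in GHz = cycles / (milliseconds * 1e6).
avg_ghz() {
  awk -v cycles="$1" -v msec="$2" \
    'BEGIN { printf "%.3f\n", cycles / (msec * 1e6) }'
}

avg_ghz 32427294227 11984.691958   # TurboBoost enabled:  2.706
avg_ghz 29946969255 13055.200832   # TurboBoost disabled: 2.294
```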

📅 0001-01-01 · ☕ 0 min

📅 0001-01-01 · ☕ 2 min

Normal TCP Close Phases

https://accedian.com/blog/close-tcp-sessions-diagnose-disconnections/

Figure 1 – Simplified TCP closing with FIN.

The standard way to close TCP sessions is to send a FIN packet, then wait for a FIN response from the other party.

A sends a FIN packet and waits for a response; it can release some resources but awaits the response of the other party (Fin Wait).
B receives the FIN packet and must release resources; it waits for its own application to close the connection (Close Wait).
B can now send a FIN to A and then await its acknowledgement (Last Ack).
A can now finish closing, but it must first wait in case delayed segments are still in the network (Time Wait); it may have to send the final ACK again.
B eventually receives the final ACK and destroys (kills) the connection.

This works fine in a perfect world. However, what happens when one side of the conversation is broken? That’s why the Reset (RST) packet exists.
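
The orderly FIN-based close listed above (not the RST path) can be sketched as a small state table. This is a simplified illustration only: it uses the standard TCP state names, splits Fin Wait into FIN_WAIT_1 and FIN_WAIT_2, and leaves out simultaneous close. The function names and event labels are made up for this example.

```shell
# next_state STATE EVENT
# Prints the next TCP state for a simplified close sequence.
next_state() {
  case "$1:$2" in
    ESTABLISHED:send_fin) echo FIN_WAIT_1 ;;  # A starts the close
    ESTABLISHED:recv_fin) echo CLOSE_WAIT ;;  # B sees A's FIN
    FIN_WAIT_1:recv_ack)  echo FIN_WAIT_2 ;;  # A's FIN was acknowledged
    FIN_WAIT_2:recv_fin)  echo TIME_WAIT ;;   # A sees B's FIN, sends final ACK
    CLOSE_WAIT:send_fin)  echo LAST_ACK ;;    # B's application closed
    LAST_ACK:recv_ack)    echo CLOSED ;;      # B got the final ACK
    TIME_WAIT:timeout)    echo CLOSED ;;      # A's wait period expired
    *) echo "invalid: $2 in state $1" >&2; return 1 ;;
  esac
}

# walk START EVENT...
# Applies the events in order and prints the final state.
walk() {
  w_state=$1
  shift
  for w_event in "$@"; do
    w_state=$(next_state "$w_state" "$w_event") || return 1
  done
  echo "$w_state"
}

walk ESTABLISHED send_fin recv_ack recv_fin timeout   # side A, prints CLOSED
walk ESTABLISHED recv_fin send_fin recv_ack           # side B, prints CLOSED
```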

📅 0001-01-01 · ☕ 1 min

https://accedian.com/blog/diagnose-tcp-connection-setup-issues/

A TCP connection is established with a 3-way handshake of SYN, SYN+ACK and ACK packets. From this handshake, we can extract a performance metric called Connection Time (CT), which summarizes how fast a session can be set up between a client and a server over a network. For more details, see this excellent article on Wikipedia.

Figure 1 – How TCP handshake is analyzed
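
As a rough sketch of how CT could be computed from a packet capture, assume CT is the time between the initial SYN and the final ACK of the handshake as seen at the capture point. That definition, the helper name, and the timestamps below are assumptions for illustration and may differ from how a given monitoring tool measures it.

```shell
# connection_time_ms T_SYN T_ACK
# Timestamps are in seconds; prints the handshake duration in milliseconds.
connection_time_ms() {
  awk -v syn="$1" -v ack="$2" \
    'BEGIN { printf "%.1f\n", (ack - syn) * 1000 }'
}

# Hypothetical capture: SYN seen at 10.000 s, final ACK at 10.042 s.
connection_time_ms 10.000 10.042   # prints 42.0
```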