kernel
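The Programmer's Parallel Universe: A Simple Introduction to eBPF System-Level Tracing
· ☕ 5 min
Linus Torvalds in 1991 The Programmer's Parallel Universe Programmers live in two worlds. One is the coding world, where it is easy to believe that we have considered everything and finished all the code. The other is the running world, where we discover that no matter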

System-Level Tracing with eBPF Tools: Getting Started with bpftrace
· ☕ 1 min
bpftrace introduction. Basic bpftrace usage. List the kernel functions that can be traced, using open as the keyword:
    $ bpftrace -l '*open*'
    tracepoint:syscalls:sys_exit_open_tree
    tracepoint:syscalls:sys_enter_open
    ...
    kprobe:vfs_open
    kprobe:tcp_try_fastopen
    ...
Trace all sys_enter_open() system calls:
    $ bpftrace -e 'tracepoint:syscalls:sys_enter_open{ printf("%s %s\n", comm,str(args->filename)); }' | grep vi
Then in another

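Kernel - Page Frame Reclaiming
· ☕ 3 min
Page Frame Reclaiming Earlier we learned that Linux tends to use as much memory as possible for the page cache. This forces us to think about how to reclaim memory before it runs out. The problem is that the code that reclaims memory may itself also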

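Kernel - Memory Addressing
· ☕ 1 min
CPU Cache A cache has two write policies: write-through: every write goes to both the cache and main memory synchronously. write-back: main memory is not written synchronously; it is updated only when the CPU issues a flush instruction or receives a FLUSH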

Kernel - Memory Area
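· ☕ 1 min
Memory Area Management Using the buddy system algorithm to allocate large blocks of memory is reasonable, but using it for small blocks wastes space. Slab Allocator Building a memory allocation algorithm on top of the buddy system algorithm would be very low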

Kernel - Pagecache
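· ☕ 1 min
Introduction Types of data stored in the page cache: regular files; directory data; data read directly from a block device file; data from user process memory that has already been swapped out (the kernel can be forced, in the page cache, to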

Kernel - Pagecache - Core
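· ☕ 3 min
The address_space data structure The core data structure of the page cache is address_space. Generally, each inode (the in-memory data structure the kernel uses to hold file metadata, which can be regarded as a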

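Kernel - Process Memory Addresses
· ☕ 4 min
Process Memory Addresses From the earlier material, we learned that the kernel: uses __get_free_pages() or alloc_pages() to allocate memory from the zoned page frame allocator; uses kmem_cache_alloc() or kmalloc() to allocate space for small data structures; uses vmalloc() or vmalloc_32() to allocate noncontiguous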

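Kernel - Page Frame Management
· ☕ 2 min
Page Frame Management Page Descriptors Some address conversions: the macro virt_to_page(addr) takes a linear virtual address and returns the corresponding page descriptor; the macro pfn_to_page(pfn) takes a page frame number pfn and returns the corresponding page descriptor. The page descriptor data structure: Figure source: U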

· ☕ 1 min
Base cgroup files tasks: list of tasks (by PID) attached to that cgroup. This list is not guaranteed to be sorted. Writing a thread ID into this file moves the thread into this cgroup. cgroup.procs: list of thread group IDs in the cgroup. This list is not guaranteed to be sorted or free of duplicate TGIDs, and userspace should sort/uniquify the list if this property is required.
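
Because the excerpt says cgroup.procs is not guaranteed to be sorted or free of duplicate TGIDs, userspace may want to normalize it after reading. The following is a minimal Python sketch of that sort/uniquify step, not part of the original post; the helper names `unique_sorted_tgids` and `read_cgroup_procs` are hypothetical, and the cgroup directory is whatever path the caller supplies.

```python
from pathlib import Path
from typing import List


def unique_sorted_tgids(text: str) -> List[int]:
    """Return the TGIDs found in cgroup.procs-style text, de-duplicated and sorted."""
    # The file holds one decimal ID per line; a set drops the duplicates.
    return sorted({int(token) for token in text.split()})


def read_cgroup_procs(cgroup_dir: str) -> List[int]:
    """Read <cgroup_dir>/cgroup.procs (directory chosen by the caller) and normalize it."""
    return unique_sorted_tgids(Path(cgroup_dir, "cgroup.procs").read_text())


if __name__ == "__main__":
    # Made-up sample content with duplicates and no ordering.
    sample = "412\n97\n412\n1033\n97\n"
    print(unique_sorted_tgids(sample))  # [97, 412, 1033]
```

Keeping the parsing separate from the file read makes the normalization easy to check without a mounted cgroup hierarchy.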

· ☕ 4 min
What are cpusets ? Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. In this document “Memory Node” refers to an on-line node that contains memory. Cpusets constrain the CPU and Memory placement of tasks to only the resources within a task’s current cpuset. They form

· ☕ 1 min
Monitor https://www.kernel.org/doc/html/latest/admin-guide/numastat.html /sys/devices/system/node/node*/numastat In more detail: numa_hit A process wanted to allocate memory from this node, and succeeded. numa_miss A process wanted to allocate memory from another node, but ended up with memory from this node. numa_foreign A process wanted to allocate on this node, but ended up with memory from another node. local_node A process ran on this node’s CPU, and got memory from this node.
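
To make the counters above concrete, here is a small Python sketch that parses the `name value` lines of a per-node numastat file and derives one illustrative figure. It is an assumption-laden example rather than anything from the original post: the function names `parse_numastat` and `miss_ratio` are hypothetical, the sample numbers are made up, and the ratio is just one possible way to summarize numa_hit and numa_miss.

```python
import glob
from typing import Dict


def parse_numastat(text: str) -> Dict[str, int]:
    """Parse 'name value' lines as found in a per-node numastat file."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def miss_ratio(stats: Dict[str, int]) -> float:
    """Share of this node's allocations that were intended for another node.

    Illustrative metric only: numa_miss / (numa_hit + numa_miss).
    """
    hit = stats.get("numa_hit", 0)
    miss = stats.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


if __name__ == "__main__":
    # Prints nothing on systems that do not expose these files.
    for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
        with open(path) as f:
            stats = parse_numastat(f.read())
        print(path, stats, "miss_ratio=%.4f" % miss_ratio(stats))
```

A persistently high ratio on one node would suggest that allocations meant for other nodes are spilling onto it, which is the situation numa_miss describes.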

CPU Load Measurement Error
· ☕ 1 min
CPU Load Measurement Error CPU load Linux exports various bits of information via /proc/stat and /proc/uptime that userland tools, such as top(1), use to calculate the average time system spent in a particular state, for example: $ iostat Linux 2.6.18.3-exp (linmac) 02/20/2007 avg-cpu: %user %nice %system %iowait %steal %idle 10.01 0.00 2.92 5.44 0.00 81.63 ... Here the system thinks that over the default sampling period the