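Preface
I am not a networking expert. After years of troubleshooting network problems in production and test environments, I no longer wanted to muddle through, so I wrote down what I have learned. My understanding of the TCP stack implementation is limited, so treat this article as a reference only.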
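Why TCP connection health matters
TCP connection health covers at least:
- TCP retransmission statistics, a bellwether of network quality
- MTU/MSS size and congestion window size, key indicators of bandwidth and throughput
- Send/receive queue and buffer statistics at each layer
I covered this in "From locating performance problems, to performance models, to TCP: why still learn TCP in the age of microservices and cloud native, Part 1", so I will not repeat it here.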
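How to inspect TCP connection health
Linux exposes TCP connection health metrics at two levels:
- Host-wide statistics
  Health metrics aggregated over the whole host (strictly speaking, over the whole network namespace or the whole container). View them with nstat.
- Per-connection statistics
  The kernel keeps statistics for every TCP connection. View them with ss.
This article covers only per-connection statistics. For host-wide statistics, see this article.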
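The container era
Readers familiar with how containers work on Linux know that, at the kernel level, they come down to namespaces plus cgroups. The TCP connection health metrics described above are namespace aware: each network namespace keeps its own statistics. When working with containers, be clear about what is namespace aware and what is not.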
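ss, once a mystery
Most people have used netstat. But netstat performs poorly when there are many connections, so it is gradually being replaced by ss. If you are curious how ss is implemented, skip to the section on how it works near the end of this article.
Reference: https://www.net7.be/blog/article/network_activity_analysis_1_netstat.html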
The even more mysterious undocumented metrics
ss
Overview
ss is a tool for viewing detailed per-connection statistics.
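For details, see the manual: https://man7.org/linux/man-pages/man8/ss.8.html
Field reference
⚠️ I am not a networking expert. The explanations below are what I have learned recently and may contain mistakes. Use them with care.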
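Recv-Q and Send-Q
- When the socket is in the LISTEN state (e.g. ss -lnt)
  Recv-Q: the current length of the accept queue, i.e. the number of TCP connections that have completed the three-way handshake and are waiting for the server to accept() them
  Send-Q: the maximum length of the accept queue
- When the socket is not in the LISTEN state (e.g. ss -nt)
  Recv-Q: the number of bytes not yet read by the application
  Send-Q: the number of bytes sent but not yet acknowledged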
Recv-Q
Established: The count of bytes not copied by the user program connected to this socket.
Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q
Established: The count of bytes not acknowledged by the remote host.
Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
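The two meanings of Recv-Q and Send-Q are easy to mix up in scripts, so here is a minimal Python sketch that labels the two columns according to the socket state. The function name describe_queues and the dictionary keys are my own, hypothetical names; the interpretation simply follows the list above.

```python
def describe_queues(state: str, recv_q: int, send_q: int) -> dict:
    """Label Recv-Q/Send-Q according to the socket state shown by ss."""
    if state == "LISTEN":
        # Listening socket: the columns describe the accept queue.
        usage = recv_q / send_q if send_q else None
        return {
            "accept_queue_len": recv_q,
            "accept_queue_max": send_q,
            "accept_queue_usage": usage,
        }
    # Any other state: the columns are byte counts.
    return {"unread_bytes": recv_q, "unacked_bytes": send_q}


print(describe_queues("LISTEN", 3, 128))
print(describe_queues("ESTAB", 0, 0))
```

A usage ratio that stays close to 1 on a listening socket would suggest the application is not calling accept() fast enough.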
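Basic information
- ts
  Whether the connection uses timestamps. Manual: show string "ts" if the timestamp option is set.
- sack
  Whether SACK is enabled on the connection. Manual: show string "sack" if the sack option is set.
- cubic
  The name of the congestion control algorithm. Manual: congestion algorithm name.
- wscale:<snd_wscale>:<rcv_wscale>
  The scale factors of the send and receive windows. When TCP was designed decades ago, network and computing resources were scarce, so the protocol left only a small value range (a 16-bit field) for the window size. In today's high-bandwidth era, a scale factor is needed to make large windows possible. Manual: if window scale option is used, this field shows the send scale factor and receive scale factor.
- rto
  The dynamically computed TCP retransmission timeout, in milliseconds. Manual: tcp re-transmission timeout value, the unit is millisecond.
- rtt:<rtt>/<rttvar>
  RTT: the measured and estimated time for an IP packet to reach the peer and for the reply to come back. rtt is the average and rttvar is the mean deviation (not the median). Manual: rtt is the average round trip time, rttvar is the mean deviation of rtt, their units are millisecond.
- ato:<ato>
  The delayed ACK timeout. Manual: ack timeout, unit is millisecond, used for delay ack mode.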
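To make the wscale item concrete, here is a small sketch of how a scale factor turns the 16-bit window field into a larger window. It assumes the scale factor is applied as a left shift and is capped at 14 (as RFC 7323 specifies); the function name effective_window is hypothetical.

```python
MAX_RAW_WINDOW = 0xFFFF  # the window field in the TCP header is 16 bits wide


def effective_window(raw_window: int, wscale: int) -> int:
    """Window size in bytes after applying the window scale factor."""
    if not 0 <= raw_window <= MAX_RAW_WINDOW:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= wscale <= 14:  # RFC 7323 limits the shift count to 14
        raise ValueError("window scale must be between 0 and 14")
    return raw_window << wscale


print(effective_window(0xFFFF, 0))  # 65535: the largest window without scaling
print(effective_window(0xFFFF, 7))  # 8388480: the largest window with wscale 7
```

With wscale:7, as in the sample output in this article, the largest advertisable window grows from about 64 KiB to about 8 MiB.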
Others:
bytes_acked:<bytes_acked>
  bytes acked
bytes_received:<bytes_received>
  bytes received
segs_out:<segs_out>
  segments sent out
segs_in:<segs_in>
  segments received
send <send_bps>bps
  egress bps
lastsnd:<lastsnd>
  how long time since the last packet sent, the unit is millisecond
lastrcv:<lastrcv>
  how long time since the last packet received, the unit is millisecond
lastack:<lastack>
  how long time since the last ack received, the unit is millisecond
pacing_rate <pacing_rate>bps/<max_pacing_rate>bps
  the pacing rate and max pacing rate
Memory, TCP window and TCP buffer
ESTAB 0 0 192.168.1.14:43674 192.168.1.17:1080 users:(("chrome",pid=3387,fd=66)) timer:(keepalive,27sec,0)
skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:7,7 rto:204 rtt:3.482/6.013 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:2317 bytes_acked:2318 bytes_received:2960 segs_out:36 segs_in:34 data_segs_out:8 data_segs_in:9 send 33268237bps lastsnd:200048 lastrcv:199596 lastack:17596 pacing_rate 66522144bps delivery_rate 31911840bps delivered:9 app_limited busy:48ms rcv_space:14480 rcv_ssthresh:64088 minrtt:0.408
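For monitoring scripts it helps to turn a detail line like the one above into structured data. The following is a rough Python sketch, not a complete parser: it assumes tokens are separated by whitespace, that key:value tokens contain a colon, that send, pacing_rate and delivery_rate are followed by their value as a separate token, and that everything else is a bare flag. The names parse_ss_info and RATE_KEYS are hypothetical.

```python
SAMPLE = (
    "skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:7,7 "
    "rto:204 rtt:3.482/6.013 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 "
    "cwnd:10 bytes_sent:2317 bytes_acked:2318 bytes_received:2960 segs_out:36 "
    "segs_in:34 data_segs_out:8 data_segs_in:9 send 33268237bps lastsnd:200048 "
    "lastrcv:199596 lastack:17596 pacing_rate 66522144bps delivery_rate 31911840bps "
    "delivered:9 app_limited busy:48ms rcv_space:14480 rcv_ssthresh:64088 minrtt:0.408"
)

# Keys whose value is printed as the next whitespace-separated token.
RATE_KEYS = {"send", "pacing_rate", "delivery_rate"}


def parse_ss_info(line: str) -> dict:
    """Split one ss detail line into key/value fields and bare flags."""
    fields, flags = {}, []
    tokens = line.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in RATE_KEYS and i + 1 < len(tokens):
            fields[tok] = tokens[i + 1]  # e.g. "send 33268237bps"
            i += 2
            continue
        if ":" in tok:
            key, _, value = tok.partition(":")
            fields[key] = value  # values are kept as strings
        else:
            flags.append(tok)  # e.g. ts, sack, cubic, app_limited
        i += 1
    return {"fields": fields, "flags": flags}


info = parse_ss_info(SAMPLE)
print(info["fields"]["rtt"], info["fields"]["cwnd"], info["flags"])
```

Values are deliberately left as strings, because their formats differ per field (for example 3.482/6.013, 48ms or 33268237bps).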
skmem
skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,f<fwd_alloc>,w<wmem_queued>,o<opt_mem>,bl<back_log>,d<sock_drop>)
<rmem_alloc>
  the memory allocated for receiving packet
<rcv_buf>
  the total memory can be allocated for receiving packet
<wmem_alloc>
  the memory used for sending packet (which has been sent to layer 3)
<snd_buf>
  the total memory can be allocated for sending packet
<fwd_alloc>
  the memory allocated by the socket as cache, but not used for receiving/sending packet yet. If need memory to send/receive packet, the memory in this cache will be used before allocate additional memory.
<wmem_queued>
  The memory allocated for sending packet (which has not been sent to layer 3)
<opt_mem>
  The memory used for storing socket option, e.g., the key for TCP MD5 signature
<back_log>
  The memory used for the sk backlog queue. On a process context, if the process is receiving packet, and a new packet is received, it will be put into the sk backlog queue, so it can be received by the process immediately
<sock_drop>
  the number of packets dropped before they are de-multiplexed into the socket
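The one- and two-letter prefixes in skmem are hard to remember, so here is a minimal sketch that expands them into the field names from the manual excerpt above. The function name parse_skmem is hypothetical, and the sample string is the one from the example output in this article.

```python
import re

# Prefix -> field name, following the skmem format in the manual.
SKMEM_NAMES = {
    "r": "rmem_alloc", "rb": "rcv_buf", "t": "wmem_alloc", "tb": "snd_buf",
    "f": "fwd_alloc", "w": "wmem_queued", "o": "opt_mem",
    "bl": "back_log", "d": "sock_drop",
}


def parse_skmem(text: str) -> dict:
    """Parse 'skmem:(r0,rb131072,...)' into a dict of named integers."""
    inner = text[text.index("(") + 1 : text.rindex(")")]
    result = {}
    for item in inner.split(","):
        m = re.fullmatch(r"([a-z]+)(\d+)", item.strip())
        if m is None or m.group(1) not in SKMEM_NAMES:
            raise ValueError(f"unexpected skmem item: {item!r}")
        result[SKMEM_NAMES[m.group(1)]] = int(m.group(2))
    return result


print(parse_skmem("skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13)"))
```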
- skmem_r
  is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc. If the application consumes the data received by the kernel TCP layer promptly, this number is basically 0.
  Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated.
- skmem_rb
  is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.
rcv_space
rcv_space:<rcv_space>
  a helper variable for TCP internal auto tuning socket receive buffer
rcv_space is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.
http://darenmatthews.com/blog/?p=2106#:~:text=%E2%80%9D-,rcv_space,-is%20used%20in
rcv_space is used in TCP's internal auto-tuning to grow socket buffers based on how much data the kernel estimates the sender can send. It will change over the life of any connection. It's measured in bytes. You can see where the value is populated by reading the tcp_get_info() function in the kernel.
The value is not measuring the actual socket buffer size, which is what net.ipv4.tcp_rmem controls. You'd need to call getsockopt() within the application to check the buffer size. You can see current buffer usage with the Recv-Q and Send-Q fields of ss.
Note that if the buffer size is set with setsockopt(), the value returned with getsockopt() is always double the size requested to allow for overhead. This is described in man 7 socket.
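The doubling described above can be observed from an application. This is a minimal sketch using Python's standard socket module; the function name rcvbuf_after_set is hypothetical. I assume a Linux host here: other operating systems may return the requested value unchanged, and on Linux the request is also capped by net.core.rmem_max, so the printed number depends on the system.

```python
import socket


def rcvbuf_after_set(requested: int) -> int:
    """Set SO_RCVBUF on a fresh TCP socket and return what getsockopt() reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 16384
reported = rcvbuf_after_set(requested)
print(f"requested={requested} reported={reported}")
```

Note that setting SO_RCVBUF explicitly fixes the buffer size, so the autotuning described in this section no longer applies to that socket.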
rcv_ssthresh
rcv_ssthresh is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp->window_clamp, is the maximum window size allowable based on the amount of available memory. rcv_ssthresh is the receiver-side slow-start threshold value.
MTU/MSS
mss
The MSS currently in effect on the connection, which limits the size of outgoing segments. current effective sending MSS.
    s.mss = info->tcpi_snd_mss;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3258
    info->tcpi_snd_mss = tp->mss_cache;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1576
    /* tp->mss_cache is current effective sending mss, including
       all tcp options except for SACKs. It is evaluated,
       taking into account current pmtu, but never exceeds
       tp->rx_opt.mss_clamp.
       ...
     */
    unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
    {
        ...
        tp->mss_cache = mss_now;

        return mss_now;
    }
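advmss
The MSS option carried in the SYN segment that this host sent when the connection was established. Its purpose is to tell the peer, at connection setup, the largest segment this host can receive. Advertised MSS by the host when connection started (in SYN packet).
https://elixir.bootlin.com/linux/v5.4/source/include/linux/tcp.h#L217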
pmtu
The MTU of the path to the peer, as found by Path MTU Discovery. Path MTU value.
A few points to note:
- Linux caches the MTU measured for each peer IP in the route cache, which avoids repeating Path MTU Discovery for the same peer.
- Linux has two different implementations of Path MTU Discovery:
  - The traditional ICMP-based method, RFC 1191
    - However, many routers and NAT devices today do not handle ICMP correctly
  - Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821 and RFC 8899)
https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3075
    s.pmtu = info->tcpi_pmtu;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3272
    info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L96
    //@icsk_pmtu_cookie Last pmtu seen by socket
    struct inet_connection_sock {
        ...
        __u32 icsk_pmtu_cookie;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1573
    unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
    {
        /* And store cached results */
        icsk->icsk_pmtu_cookie = pmtu;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L2587
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_ipv4.c#L362
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_timer.c#L161
rcvmss
To be honest, I have not fully understood this one. Some references:
MSS used for delayed ACK decisions.
https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L122
    __u16 rcv_mss; /* MSS used for delayed ACK decisions */
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L502
    /* Initialize RCV_MSS value.
     * RCV_MSS is an our guess about MSS used by the peer.
     * We haven't any direct information about the MSS.
     * It's better to underestimate the RCV_MSS rather than overestimate.
     * Overestimations make us ACKing less frequently than needed.
     * Underestimations are more easy to detect and fix by tcp_measure_rcv_mss().
     */
    void tcp_initialize_rcv_mss(struct sock *sk)
    {
        const struct tcp_sock *tp = tcp_sk(sk);
        unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

        hint = min(hint, tp->rcv_wnd / 2);
        hint = min(hint, TCP_MSS_DEFAULT);
        hint = max(hint, TCP_MIN_MSS);

        inet_csk(sk)->icsk_ack.rcv_mss = hint;
    }
Flow and congestion control
cwnd
cwnd: the congestion window size, counted in segments. congestion window size
Congestion window size in bytes = cwnd * mss.
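As a worked example of cwnd * mss, the sketch below uses the numbers from the sample output in the memory section of this article (cwnd:10, mss:1448, rtt:3.482). It also computes cwnd * mss * 8 / rtt, which happens to reproduce the send 33268237bps figure in that sample. That relation is my inference from the numbers and should be treated as an assumption; the function names are hypothetical.

```python
def cwnd_bytes(cwnd: int, mss: int) -> int:
    """Congestion window in bytes: segments times segment size."""
    return cwnd * mss


def estimated_send_bps(cwnd: int, mss: int, rtt_ms: float) -> float:
    """Upper bound on the send rate: one full window per round trip, in bits/s."""
    return cwnd * mss * 8 / (rtt_ms / 1000.0)


print(cwnd_bytes(10, 1448))                       # 14480
print(round(estimated_send_bps(10, 1448, 3.482)))  # 33268237
```

In other words, a connection cannot send faster than one congestion window per RTT, which is why both cwnd and rtt matter for throughput.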
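ssthresh
After the local TCP layer detects network congestion, it shrinks the congestion window to its minimum and then tries to grow it quickly (slow start) back up to ssthresh * mss bytes.
ssthresh:<ssthresh>
  tcp congestion window slow start threshold
For how ssthresh is calculated, see:
https://witestlab.poly.edu/blog/tcp-congestion-control-basics/#:~:text=Overview%20of%20TCP%20phases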
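To illustrate the role of ssthresh, here is a deliberately simplified, textbook (Reno-style) model of window growth per round trip. It is not what the Linux default algorithm (cubic) does in detail; it only shows the switch from fast growth below ssthresh to slow growth above it. The function names are hypothetical.

```python
def next_cwnd(cwnd: int, ssthresh: int) -> int:
    """Advance the window (in segments) by one round trip in a simplified model."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)  # slow start: doubles every RTT
    return cwnd + 1  # congestion avoidance: one extra segment per RTT


def simulate(cwnd: int, ssthresh: int, rounds: int) -> list:
    """Return the window size at the start and after each round trip."""
    history = [cwnd]
    for _ in range(rounds):
        cwnd = next_cwnd(cwnd, ssthresh)
        history.append(cwnd)
    return history


print(simulate(1, 8, 5))  # [1, 2, 4, 8, 9, 10]
```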
Retransmission
retrans
TCP retransmission statistics. The format is:
number of retransmitted segments not yet acknowledged / total number of segment retransmissions over the whole connection.
https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command
(Retransmitted packets out) / (Total retransmits for entire connection)
retrans:X/Y
X: number of outstanding retransmit packets
Y: total number of retransmits for the session
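A small sketch of how the retrans:X/Y token could be split in a monitoring script. The token retrans:0/3 is a made-up sample value, and retrans_ratio is a hypothetical helper of mine that divides the total retransmits by segs_out as a rough quality indicator.

```python
def parse_retrans(token: str) -> tuple:
    """Split 'retrans:X/Y' into (outstanding, total) as integers."""
    prefix, _, value = token.partition(":")
    if prefix != "retrans" or "/" not in value:
        raise ValueError(f"not a retrans token: {token!r}")
    outstanding, total = value.split("/", 1)
    return int(outstanding), int(total)


def retrans_ratio(total_retrans: int, segs_out: int) -> float:
    """Rough share of sent segments that were retransmissions."""
    return total_retrans / segs_out if segs_out else 0.0


outstanding, total = parse_retrans("retrans:0/3")  # made-up sample value
print(outstanding, total, retrans_ratio(total, 36))
```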
- s.retrans_total
https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068
    s.retrans_total = info->tcpi_total_retrans;
https://elixir.bootlin.com/linux/v5.19/source/include/uapi/linux/tcp.h#L232
    struct tcp_info {
        __u32 tcpi_retrans;
        __u32 tcpi_total_retrans;
https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3791
    info->tcpi_total_retrans = tp->total_retrans;
https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L347
    struct tcp_sock {
        u32 total_retrans; /* Total retransmits for entire connection */
- s.retrans
https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068
    s.retrans = info->tcpi_retrans;
https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3774
    info->tcpi_retrans = tp->retrans_out;
https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L266
    struct tcp_sock {
        u32 retrans_out; /* Retransmitted packets out */
bytes_retrans
The total number of data bytes retransmitted. Total data bytes retransmitted
Timers
Newcomers to TCP implementations often find it hard to imagine that, besides being driven by input and output events, TCP is also driven by many timers. ss can show these timers.
Show timer information. For TCP protocol, the output format is:
timer:(<timer_name>,<expire_time>,<retrans>)
<timer_name>
  the name of the timer, there are five kind of timer names:
  on : means one of these timers: TCP retrans timer, TCP early retrans timer and tail loss probe timer
  keepalive: tcp keep alive timer
  timewait: timewait stage timer
  persist: zero window probe timer
  unknown: none of the above timers
<expire_time>
  how long time the timer will expire
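A minimal sketch for pulling the timer fields out of an ss line, using the timer:(keepalive,27sec,0) token from the sample output in this article. The name parse_timer is hypothetical, and the expire time is kept as a string because its unit varies.

```python
import re

# timer:(<timer_name>,<expire_time>,<retrans>)
TIMER_RE = re.compile(
    r"timer:\((?P<name>[a-z]+),(?P<expire>[^,]*),(?P<retrans>\d+)\)"
)


def parse_timer(text: str):
    """Return the timer fields found in an ss line, or None if there is no timer."""
    m = TIMER_RE.search(text)
    if m is None:
        return None
    return {
        "name": m["name"],
        "expire": m["expire"],
        "retrans": int(m["retrans"]),
    }


line = 'users:(("chrome",pid=3387,fd=66)) timer:(keepalive,27sec,0)'
print(parse_timer(line))
```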
Other
app_limited
https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command
limit TCP flows with application-limiting in request or responses. My understanding is that this is a boolean: if ss shows the app_limited flag, the application is not using all of the TCP sending bandwidth available, i.e. the connection still has headroom to send more.
tcpi_delivery_rate: The most recent goodput, as measured by tcp_rate_gen(). If the socket is limited by the sending application (e.g., no data to send), it reports the highest measurement instead of the most recent. The unit is bytes per second (like other rate fields in tcp_info).
tcpi_delivery_rate_app_limited: A boolean indicating if the goodput was measured when the socket's throughput was limited by the sending application.
https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3138
    s.app_limited = info->tcpi_delivery_rate_app_limited;
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_rate.c#L182
    /* If a gap is detected between sends, mark the socket application-limited. */
    void tcp_rate_check_app_limited(struct sock *sk)
    {
        struct tcp_sock *tp = tcp_sk(sk);

        if (/* We have less than one packet to send. */
            tp->write_seq - tp->snd_nxt < tp->mss_cache &&
            /* Nothing in sending host's qdisc queues or NIC tx queue. */
            sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
            /* We are not limited by CWND. */
            tcp_packets_in_flight(tp) < tp->snd_cwnd &&
            /* All lost packets have been retransmitted. */
            tp->lost_out <= tp->retrans_out)
            tp->app_limited =
                (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
    }
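To make the four conditions in tcp_rate_check_app_limited() easier to reason about, here is a Python restatement of the condition only. It is a sketch for reading purposes, not kernel behaviour: the kernel stores a marker value in tp->app_limited rather than a boolean, and SKB_TRUESIZE(1) depends on the build, so it is passed in as the parameter skb_truesize_1. All parameter names mirror the kernel fields; the function name is hypothetical.

```python
U32_MASK = 0xFFFFFFFF  # TCP sequence numbers are 32-bit and wrap around


def is_app_limited(write_seq, snd_nxt, mss_cache, wmem_alloc, skb_truesize_1,
                   packets_in_flight, snd_cwnd, lost_out, retrans_out) -> bool:
    """Evaluate the four conditions under which a gap between sends is detected."""
    unsent = (write_seq - snd_nxt) & U32_MASK  # mimic u32 subtraction
    return (
        unsent < mss_cache                  # less than one packet left to send
        and wmem_alloc < skb_truesize_1     # nothing queued in qdisc or NIC
        and packets_in_flight < snd_cwnd    # not limited by the congestion window
        and lost_out <= retrans_out         # all lost packets were retransmitted
    )


idle = dict(write_seq=1000, snd_nxt=1000, mss_cache=1448, wmem_alloc=0,
            skb_truesize_1=1024, packets_in_flight=2, snd_cwnd=10,
            lost_out=0, retrans_out=0)
print(is_app_limited(**idle))                               # True
print(is_app_limited(**{**idle, "packets_in_flight": 10}))  # False
```

The second call is cwnd-limited rather than application-limited, so the condition does not hold.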
Special operations
Specifying a network namespace
Point ss at a network namespace file, e.g. ss -N /proc/322/ns/net
-N NSNAME, --net=NSNAME
  Switch to the specified network namespace name.
Killing a socket
Forcibly close a TCP connection.
-K, --kill
Attempts to forcibly close sockets. This option displays
sockets that are successfully closed and silently skips
sockets that the kernel does not support closing. It
supports IPv4 and IPv6 sockets only.
Watching connection close events
ss -ta -E
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 10.0.2.15:40612 172.67.141.218:http
Filters
For example:
Monitoring examples
Non-containerized example:
Containerized example:
How it works
Netlink
- Fetch information about sockets
- Used by ss ("another utility to investigate sockets")
NETLINK_INET_DIAG
idiag_ext
Here we can look at the data source behind ss, which serves as documentation from another angle.
The fields of struct inet_diag_req_v2 are as follows:
idiag_ext
This is a set of flags defining what kind of extended
information to report. Each requested kind of information
is reported back as a netlink attribute as described
below:
INET_DIAG_TOS
The payload associated with this attribute is a
__u8 value which is the TOS of the socket.
INET_DIAG_TCLASS
The payload associated with this attribute is a
__u8 value which is the TClass of the socket. IPv6
sockets only. For LISTEN and CLOSE sockets, this
is followed by INET_DIAG_SKV6ONLY attribute with
associated __u8 payload value meaning whether the
socket is IPv6-only or not.
INET_DIAG_MEMINFO
The payload associated with this attribute is
represented in the following structure:
struct inet_diag_meminfo {
__u32 idiag_rmem;
__u32 idiag_wmem;
__u32 idiag_fmem;
__u32 idiag_tmem;
};
The fields of this structure are as follows:
idiag_rmem
The amount of data in the receive queue.
idiag_wmem
The amount of data that is queued by TCP but
not yet sent.
idiag_fmem
The amount of memory scheduled for future
use (TCP only).
idiag_tmem
The amount of data in send queue.
INET_DIAG_SKMEMINFO
The payload associated with this attribute is an
array of __u32 values described below in the
subsection "Socket memory information".
INET_DIAG_INFO
The payload associated with this attribute is
specific to the address family. For TCP sockets,
it is an object of type struct tcp_info.
INET_DIAG_CONG
The payload associated with this attribute is a
string that describes the congestion control
algorithm used. For TCP sockets only.
idiag_timer
For TCP sockets, this field describes the type of timer
that is currently active for the socket. It is set to one
of the following constants:
0 no timer is active
1 a retransmit timer
2 a keep-alive timer
3 a TIME_WAIT timer
4 a zero window probe timer
For non-TCP sockets, this field is set to 0.
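The idiag_timer codes above line up with the timer names that ss prints (on, keepalive, timewait, persist, unknown). The mapping below is my own pairing of the two manual excerpts quoted in this article, so treat it as an assumption, not as the ss source code; the names TIMER_NAMES and timer_name are hypothetical.

```python
# idiag_timer code -> timer name as printed by ss (assumed pairing).
TIMER_NAMES = {
    1: "on",         # a retransmit timer
    2: "keepalive",  # a keep-alive timer
    3: "timewait",   # a TIME_WAIT timer
    4: "persist",    # a zero window probe timer
}


def timer_name(idiag_timer: int):
    """Translate an idiag_timer code into a timer name, or None if no timer is active."""
    if idiag_timer == 0:
        return None
    return TIMER_NAMES.get(idiag_timer, "unknown")


for code in range(6):
    print(code, timer_name(code))
```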
idiag_retrans
For idiag_timer values 1, 2, and 4, this field contains
the number of retransmits. For other idiag_timer values,
this field is set to 0.
idiag_expires
For TCP sockets that have an active timer, this field
describes its expiration time in milliseconds. For other
sockets, this field is set to 0.
idiag_rqueue
For listening sockets: the number of pending connections.
For other sockets: the amount of data in the incoming
queue.
idiag_wqueue
For listening sockets: the backlog length.
For other sockets: the amount of memory available for
sending.
idiag_uid
This is the socket owner UID.
idiag_inode
This is the socket inode number.
Socket memory information
The payload associated with UNIX_DIAG_MEMINFO and
INET_DIAG_SKMEMINFO netlink attributes is an array of the
following __u32 values:
SK_MEMINFO_RMEM_ALLOC
The amount of data in receive queue.
SK_MEMINFO_RCVBUF
The receive socket buffer as set by SO_RCVBUF.
SK_MEMINFO_WMEM_ALLOC
The amount of data in send queue.
SK_MEMINFO_SNDBUF
The send socket buffer as set by SO_SNDBUF.
SK_MEMINFO_FWD_ALLOC
The amount of memory scheduled for future use (TCP only).
SK_MEMINFO_WMEM_QUEUED
The amount of data queued by TCP, but not yet sent.
SK_MEMINFO_OPTMEM
The amount of memory allocated for the socket's service
needs (e.g., socket filter).
SK_MEMINFO_BACKLOG
The amount of packets in the backlog (not yet processed).
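The INET_DIAG_SKMEMINFO payload is described above as an array of __u32 values. As a sketch of what decoding it could look like, the code below unpacks such an array with Python's struct module. It assumes host byte order and the field order listed above; the payload here is built by hand from the sample skmem values in this article, not read from a real netlink socket, and the names are hypothetical.

```python
import struct

# Field order as listed in the manual excerpt above.
SK_MEMINFO_FIELDS = [
    "RMEM_ALLOC", "RCVBUF", "WMEM_ALLOC", "SNDBUF",
    "FWD_ALLOC", "WMEM_QUEUED", "OPTMEM", "BACKLOG",
]


def unpack_skmeminfo(payload: bytes) -> dict:
    """Decode an array of native-endian __u32 values into named fields."""
    count = len(payload) // 4
    values = struct.unpack(f"={count}I", payload[: count * 4])
    # zip() stops at the shorter input, so extra trailing values are ignored.
    return dict(zip(SK_MEMINFO_FIELDS, values))


# Hand-built payload matching skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,...)
payload = struct.pack("=8I", 0, 131072, 0, 87040, 0, 0, 0, 0)
print(unpack_skmeminfo(payload))
```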
Note INET_DIAG_INFO above: for TCP sockets, it is an object of type struct tcp_info.
Netlink in depth
https://wiki.linuxfoundation.org/networking/generic_netlink_howto
https://medium.com/thg-tech-blog/on-linux-netlink-d7af1987f89d
References
https://djangocas.dev/blog/huge-improve-network-performance-by-change-tcp-congestion-control-to-bbr/