
A Complete Guide to 'ss' Output Metrics - TCP Connection Inspection Tool



Introduction

Understanding metrics from Linux ss command output.

I am not a network expert, but after years of troubleshooting network problems in production and test environments, I no longer want to muddle through, so I am recording what I have learned. My understanding of the TCP stack implementation is limited, so treat this content as a reference only.

ss is a powerful tool for inspecting TCP connection-level statistics, but it is not well documented. This article tries to explain the output of ss and how to read its metrics.

Importance of TCP connection health

TCP connection health includes at least:

  • Statistics of TCP retransmission, a key indicator of network quality.
  • MTU/MSS and congestion window sizes, important indicators of bandwidth and throughput.
  • Statistics of the send and receive queues and buffers at each layer.

This question was discussed in “From performance issue investigation to performance models, to TCP - why should we learn TCP even after all microservices are running on the cloud? Series Part 1”, so I won’t repeat it.

How to check TCP connection health

There are two types of TCP connection health metrics in Linux:

  • Statistics of the whole system

    Aggregated network health metrics for the entire operating system (strictly speaking, for the entire network namespace, i.e. the entire container). Can be viewed with nstat.

  • Statistics for each TCP connection

    Statistics are saved in the kernel for each TCP connection. Can be viewed with ss.

This article focuses only on the statistics of each TCP connection. For statistics of the entire operating system, please go to this article.

Containerization Era

Anyone who understands the principles of containerization on Linux knows that, at the kernel level, containers are namespaces + cgroups. The TCP connection health metrics mentioned above are namespace aware: each network namespace counts independently. When containerizing, you must clearly distinguish what is namespace aware and what is not.

Mysterious ss

I believe many people have used netstat. However, netstat has slowly been replaced by ss because of its poor performance when the number of connections is large. If you are curious about how ss is implemented, see the “Rationale” section of this article.

Reference: https://www.net7.be/blog/article/network_activity_analysis_1_netstat.html

More mysterious undocumented metrics

ss Introduction

ss is a tool for viewing detailed statistics of connections. e.g:

$ ss -taoipnm
State        Recv-Q   Send-Q      Local Address:Port         Peer Address:Port  Process                                                                         
ESTAB 0      0               159.164.167.179:55124           149.139.16.235:9042  users:(("envoy",pid=81281,fd=50))
	 skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:9,7 rto:204 rtt:0.689/0.065 ato:40 mss:1448 pmtu:9000 rcvmss:610 advmss:8948 cwnd:10 bytes_sent:3639 bytes_retrans:229974096 bytes_acked:3640 bytes_received:18364 segs_out:319 segs_in:163 data_segs_out:159 data_segs_in:159 send 168.1Mbps lastsnd:16960 lastrcv:16960 lastack:16960 pacing_rate 336.2Mbps delivery_rate 72.4Mbps delivered:160 app_limited busy:84ms retrans:0/25813 rcv_rtt:1 rcv_space:62720 rcv_ssthresh:56588 minrtt:0.16

See the manual for details: https://man7.org/linux/man-pages/man8/ss.8.html

Metrics description

⚠️ I am not a network expert. The notes below are the result of my recent learning and may contain errors. Please use with caution.

Recv-Q and Send-Q

  • When the socket is in the listen state (e.g. ss -lnt)
    Recv-Q: the current length of the accept queue, i.e. connections that have completed the three-way handshake and are waiting for the user-space process to call accept().
    Send-Q: the maximum length of the accept queue.
  • When the socket is in a non-listen state (e.g. ss -nt)
    Recv-Q: the number of bytes received by the kernel but not yet read by the user-space process;
    Send-Q: the number of bytes sent by the kernel TCP stack but not yet acknowledged by the peer;

Recv-Q

Established: The count of bytes not copied by the user program connected to this socket.

Listening: Since Kernel 2.6.18 this column contains the current syn backlog.

Send-Q

Established: The count of bytes not acknowledged by the remote host.

Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
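Putting the listen-state columns together, here is a minimal sketch that parses a hypothetical `ss -lnt` output line (the addresses and backlog are made up):

```shell
# For a LISTEN socket: Recv-Q is the current accept-queue length,
# Send-Q is the configured maximum (the listen() backlog).
line='LISTEN 0 128 0.0.0.0:22 0.0.0.0:*'
set -- $line   # word-split the line into positional parameters
echo "accept queue: $2 / $3"   # prints: accept queue: 0 / 128
```

If Recv-Q here approaches Send-Q, the application is not calling accept() fast enough and new connections risk being dropped.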

Basic Information

  • ts Whether the TCP connection contains TCP timestamp.

    show string “ts” if the timestamp option is set

  • sack Whether TCP SACK is enabled.

    show string “sack” if the sack option is set

  • cubic The name of the congestion control algorithm in use.

    congestion algorithm name

  • wscale:<snd_wscale>,<rcv_wscale> The scale factors of the send and receive windows. When TCP was designed in the 1980s, network and computer resources were limited, so the protocol reserved only a small (16-bit) field for the window size. In today’s high-bandwidth era, a scale factor is needed to allow a larger window.

    if window scale option is used, this field shows the send scale factor and receive scale factor.

  • rto Dynamically calculated timeout parameter for TCP retransmission, in milliseconds.

    tcp re-transmission timeout value, the unit is millisecond.

  • rtt:<rtt>/<rttvar> RTT measures and estimates the time it takes for an IP packet to reach the peer and for the ACK to come back. rtt is the smoothed average and rttvar is the mean deviation.

    rtt is the average round trip time, rttvar is the mean deviation of rtt, their units are millisecond.

  • ato:<ato> delay ack timeout.

    ack timeout, unit is millisecond, used for delay ack mode.
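The window scale factor works as a bit shift on the 16-bit window field in the TCP header. A quick sketch with hypothetical numbers:

```shell
# With wscale:9,7 the receive scale factor is 7: a raw window field
# of 490 advertised by this host really means 490 << 7 bytes.
echo $(( 490 << 7 ))   # prints 62720
```

Without scaling, the window could never exceed 65535 bytes, which is far too small for today's bandwidth-delay products.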

other:

              bytes_acked:<bytes_acked>
                     bytes acked

              bytes_received:<bytes_received>
                     bytes received

              segs_out:<segs_out>
                     segments sent out

              segs_in:<segs_in>
                     segments received

              send <send_bps>bps
                     egress bps

              lastsnd:<lastsnd>
                     how long time since the last packet sent, the unit
                     is millisecond

              lastrcv:<lastrcv>
                     how long time since the last packet received, the
                     unit is millisecond

              lastack:<lastack>
                     how long time since the last ack received, the unit
                     is millisecond

              pacing_rate <pacing_rate>bps/<max_pacing_rate>bps
                     the pacing rate and max pacing rate

ss output e.g:

ESTAB                         0                         0                                             192.168.1.14:43674                                           192.168.1.17:1080                     users:(("chrome",pid=3387,fd=66)) timer:(keepalive,27sec,0)
	 skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13) ts sack cubic wscale:7,7 rto:204 rtt:3.482/6.013 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:2317 bytes_acked:2318 bytes_received:2960 segs_out:36 segs_in:34 data_segs_out:8 data_segs_in:9 send 33268237bps lastsnd:200048 lastrcv:199596 lastack:17596 pacing_rate 66522144bps delivery_rate 31911840bps delivered:9 app_limited busy:48ms rcv_space:14480 rcv_ssthresh:64088 minrtt:0.408

skmem

https://man7.org/linux/man-pages/man8/ss.8.html

          skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,
                        f<fwd_alloc>,w<wmem_queued>,o<opt_mem>,
                        bl<back_log>,d<sock_drop>)

          <rmem_alloc>
                 the memory allocated for receiving packet

          <rcv_buf>
                 the total memory can be allocated for receiving
                 packet

          <wmem_alloc>
                 the memory used for sending packet (which has been
                 sent to layer 3)

          <snd_buf>
                 the total memory can be allocated for sending
                 packet

          <fwd_alloc>
                 the memory allocated by the socket as cache, but
                 not used for receiving/sending packet yet. If need
                 memory to send/receive packet, the memory in this
                 cache will be used before allocate additional
                 memory.

          <wmem_queued>
                 The memory allocated for sending packet (which has
                 not been sent to layer 3)

          <opt_mem>
                 The memory used for storing socket option, e.g.,
                 the key for TCP MD5 signature

          <back_log>
                 The memory used for the sk backlog queue. On a
                 process context, if the process is receiving
                 packet, and a new packet is received, it will be
                 put into the sk backlog queue, so it can be
                 received by the process immediately

          <sock_drop>
                 the number of packets dropped before they are de-
                 multiplexed into the socket
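When eyeballing many connections, it can help to pull individual skmem fields out of the blob with standard text tools. A sketch with a hypothetical skmem string:

```shell
# Extract the receive buffer limit (rb) and the drop counter (d).
sk='skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d13)'
rb=$(echo "$sk" | grep -o 'rb[0-9]*' | tr -d 'rb')      # rcv_buf
drops=$(echo "$sk" | grep -o ',d[0-9]*' | tr -d ',d')   # sock_drop
echo "rcv_buf=$rb drops=$drops"   # prints: rcv_buf=131072 drops=13
```

A non-zero d counter is often the first hint that the socket is dropping packets before the application sees them.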

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

  • skmem_r

    is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.

    If the user-space process can consume the data received by the TCP kernel stack in time, this number is basically 0.

    Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated.

  • skmem_rb

    is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.

rcv_space

          rcv_space:<rcv_space>
                 a helper variable for TCP internal auto tuning
                 socket receive buffer

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

rcv_space is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.

http://darenmatthews.com/blog/?p=2106#:~:text=%E2%80%9D-,rcv_space,-is%20used%20in

rcv_space is used in TCP’s internal auto-tuning to grow socket buffers based on how much data the kernel estimates the sender can send. It will change over the life of any connection. It’s measured in bytes. You can see where the value is populated by reading the tcp_get_info() function in the kernel.

The value is not measuring the actual socket buffer size, which is what net.ipv4.tcp_rmem controls. You’d need to call getsockopt() within the application to check the buffer size. You can see current buffer usage with the Recv-Q and Send-Q fields of ss.
Note that if the buffer size is set with setsockopt(), the value returned with getsockopt() is always double the size requested to allow for overhead. This is described in man 7 socket.

rcv_ssthresh

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/#:~:text=is%20Linux%20autotuning.-,Linux%20autotuning,-Linux%20autotuning%20is

rcv_ssthresh is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp->window_clamp, is the maximum window size allowable based on the amount of available memory. rcv_ssthresh is the receiver-side slow-start threshold value.

The following uses an example to illustrate the relationship between buffer size and configuration:

$ sudo sysctl -a | grep tcp
net.ipv4.tcp_base_mss = 1024
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_mem = 766944	1022593	1533888 (page)
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_rmem = 4096	131072	6291456
net.ipv4.tcp_adv_win_scale = 1 (½ memory in receive buffer as TCP window size)
net.ipv4.tcp_syn_retries = 6
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 4096	16384	4194304

net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992


$ ss -taoipnm 'dst 100.225.237.27'

ESTAB                                0                                     0                                                                     192.168.1.14:57174                                                                100.225.237.27:28101                                 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,119min,0)
	 skmem:(r0,rb131072,t0,tb87040,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:376 rtt:165.268/11.95 ato:40 mss:1440 pmtu:1500 rcvmss:1080 advmss:1448 cwnd:10 bytes_sent:5384 bytes_retrans:1440 bytes_acked:3945 bytes_received:3913 segs_out:24 segs_in:23 data_segs_out:12 data_segs_in:16 send 697050bps lastsnd:53864 lastrcv:53628 lastack:53704 pacing_rate 1394088bps delivery_rate 73144bps delivered:13 busy:1864ms retrans:0/1 dsack_dups:1 rcv_rtt:163 rcv_space:14480 rcv_ssthresh:64088 minrtt:157.486
#You can see: rb131072 = net.ipv4.tcp_rmem[1] = 131072

###############Stop the receiving application process and let the kernel receive buffer fill up####################
$ export PID=49183
$ kill -STOP $PID
$ ss -taoipnm 'dst 100.225.237.27'
State                                Recv-Q                                Send-Q                                                               Local Address:Port                                                                   Peer Address:Port                                 Process                                
ESTAB                                0                                     0                                                                     192.168.1.14:57174                                                                100.225.237.27:28101                                 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,115min,0)
	 skmem:(r24448,rb131072,t0,tb87040,f4224,w0,o0,bl0,d4) ts sack cubic wscale:7,7 rto:376 rtt:174.381/20.448 ato:40 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:6456 bytes_retrans:1440 bytes_acked:5017 bytes_received:971285 segs_out:1152 segs_in:2519 data_segs_out:38 data_segs_in:2496 send 660622bps lastsnd:1456 lastrcv:296 lastack:24 pacing_rate 1321240bps delivery_rate 111296bps delivered:39 app_limited busy:6092ms retrans:0/1 dsack_dups:1 rcv_rtt:171.255 rcv_space:14876 rcv_ssthresh:64088 minrtt:157.126
# Here appears: app_limited

###################################
$ ss -taoipnm 'dst 100.225.237.27'
State                                Recv-Q                                Send-Q                                                               Local Address:Port                                                                   Peer Address:Port                                 Process                                
ESTAB                                67788                                 0                                                                     192.168.1.14:57174                                                                100.225.237.27:28101                                 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,115min,0)
	 skmem:(r252544,rb250624,t0,tb87040,f1408,w0,o0,bl0,d6) ts sack cubic wscale:7,7 rto:376 rtt:173.666/18.175 ato:160 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:6600 bytes_retrans:1440 bytes_acked:5161 bytes_received:1292017 segs_out:1507 segs_in:3368 data_segs_out:42 data_segs_in:3340 send 663342bps lastsnd:9372 lastrcv:1636 lastack:1636 pacing_rate 1326680bps delivery_rate 111296bps delivered:43 app_limited busy:6784ms retrans:0/1 dsack_dups:1 rcv_rtt:169.162 rcv_space:14876 rcv_ssthresh:64088 minrtt:157.126
#Here: r252544 and rb250624 have increased. Recv-Q = 67788 means the TCP window is 67788 bytes. Because `net.ipv4.tcp_adv_win_scale = 1`, half of the receive buffer can be used as the TCP window, so receive buffer = 67788 * 2 = 135576 bytes

###################################
$ kill -CONT $PID
$ ss -taoipnm 'dst 100.225.237.27'
State                                Recv-Q                                Send-Q                                                               Local Address:Port                                                                   Peer Address:Port                                 Process                                
ESTAB                                0                                     0                                                                     192.168.1.14:57174                                                                100.225.237.27:28101                                 users:(("ssh",pid=49183,fd=3)) timer:(keepalive,105min,0)
	 skmem:(r14720,rb6291456,t0,tb87040,f1664,w0,o0,bl0,d15) ts sack cubic wscale:7,7 rto:368 rtt:165.199/7.636 ato:40 mss:1440 pmtu:1500 rcvmss:1440 advmss:1448 cwnd:10 bytes_sent:7356 bytes_retrans:1440 bytes_acked:5917 bytes_received:2981085 segs_out:2571 segs_in:5573 data_segs_out:62 data_segs_in:5524 send 697341bps lastsnd:2024 lastrcv:280 lastack:68 pacing_rate 1394672bps delivery_rate 175992bps delivered:63 app_limited busy:9372ms retrans:0/1 dsack_dups:1 rcv_rtt:164.449 rcv_space:531360 rcv_ssthresh:1663344 minrtt:157.464
#Here: rb6291456 = net.ipv4.tcp_rmem[2] = 6291456
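The buffer-to-window arithmetic from the experiment above can be sketched directly. The formula mirrors the kernel's tcp_win_from_space() for a positive tcp_adv_win_scale; the input number is the buffer size observed above:

```shell
# tcp_adv_win_scale = 1 means: window = space - space/2^1, i.e. half
# of the receive buffer is usable as the TCP window; the other half
# covers skb/metadata overhead.
space=135576; scale=1
echo $(( space - space / (1 << scale) ))   # prints 67788
```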

MTU/MSS

mss

The MSS currently used by the connection to limit the size of sent segments, i.e. the current effective sending MSS.
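A sketch of where the common mss:1448 value in the examples comes from, assuming an Ethernet path MTU of 1500 and TCP timestamps enabled (as the "ts" flag indicates):

```shell
# mss = pmtu - 20 (IPv4 header) - 20 (TCP header) - 12 (timestamp option)
echo $(( 1500 - 20 - 20 - 12 ))   # prints 1448
```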

https://github.com/CumulusNetworks/iproute2/blob/6335c5ff67202cf5b39eb929e2a0a5bb133627ba/misc/ss.c#L2206

s.mss		 = info->tcpi_snd_mss

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3258

  info->tcpi_snd_mss = tp->mss_cache;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1576

/*
tp->mss_cache is current effective sending mss, including
all tcp options except for SACKs. It is evaluated,
taking into account current pmtu, but never exceeds
tp->rx_opt.mss_clamp.
...
*/
unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
{
...
  tp->mss_cache = mss_now;

  return mss_now;
}

advmss

When the connection is established, the SYN packet sent by the local host carries the MSS option, whose goal is to tell the peer the maximum segment size this host can receive. In short: the MSS advertised by the host when the connection started (in the SYN packet).

https://elixir.bootlin.com/linux/v5.4/source/include/linux/tcp.h#L217

pmtu

The MTU of the path to the peer, which can be discovered by Path MTU Discovery. The Path MTU value.

There are a few things to notice:

  • Linux caches the measured MTU value for each peer IP in the route cache, which avoids repeating Path MTU Discovery for the same peer.
  • Path MTU Discovery has two different implementations in Linux:
    • The legacy ICMP-based method (RFC 1191)
      • However, many routers and NATs nowadays do not handle ICMP correctly
    • Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821 and RFC 8899)

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3075

		s.pmtu		 = info->tcpi_pmtu;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp.c#L3272

	info->tcpi_pmtu = icsk->icsk_pmtu_cookie;

https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L96

//@icsk_pmtu_cookie	   Last pmtu seen by socket
struct inet_connection_sock {
	...
	__u32			  icsk_pmtu_cookie;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_output.c#L1573

unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
{
 /* And store cached results */
	icsk->icsk_pmtu_cookie = pmtu;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L2587

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_ipv4.c#L362

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_timer.c#L161

rcvmss

To be honest, I don’t fully understand rcvmss. Some references:

MSS used for delayed ACK decisions.

https://elixir.bootlin.com/linux/v5.4/source/include/net/inet_connection_sock.h#L122

		__u16		  rcv_mss;	 /* MSS used for delayed ACK decisions	   */

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_input.c#L502

/* Initialize RCV_MSS value.
 * RCV_MSS is an our guess about MSS used by the peer.
 * We haven't any direct information about the MSS.
 * It's better to underestimate the RCV_MSS rather than overestimate.
 * Overestimations make us ACKing less frequently than needed.
 * Underestimations are more easy to detect and fix by tcp_measure_rcv_mss().
 */
void tcp_initialize_rcv_mss(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

	hint = min(hint, tp->rcv_wnd / 2);
	hint = min(hint, TCP_MSS_DEFAULT);
	hint = max(hint, TCP_MIN_MSS);

	inet_csk(sk)->icsk_ack.rcv_mss = hint;
}
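The min/max chain in tcp_initialize_rcv_mss() can be re-done in shell arithmetic. A sketch with hypothetical inputs (advmss, mss_cache and rcv_wnd values are made up; 536 and 88 are TCP_MSS_DEFAULT and TCP_MIN_MSS):

```shell
# hint = max(min(advmss, mss_cache, rcv_wnd/2, 536), 88)
advmss=8948; mss_cache=1448; rcv_wnd=29200
hint=$(( advmss < mss_cache ? advmss : mss_cache ))
if [ $(( rcv_wnd / 2 )) -lt $hint ]; then hint=$(( rcv_wnd / 2 )); fi
if [ $hint -gt 536 ]; then hint=536; fi   # TCP_MSS_DEFAULT
if [ $hint -lt 88 ]; then hint=88; fi     # TCP_MIN_MSS
echo $hint   # prints 536
```

This shows why the initial rcvmss guess is conservative: it is capped at 536 until tcp_measure_rcv_mss() observes real segments from the peer.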

Congestion control

cwnd

cwnd: congestion window size, counted in segments (multiples of MSS)

https://en.wikipedia.org/wiki/TCP_congestion_control#:~:text=set%20to%20a-,small%20multiple,-of%20the%20maximum

Congestion window size in bytes = cwnd * mss.
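Using the numbers from the first ss example (cwnd:10, mss:1448):

```shell
# Congestion window in bytes = cwnd * mss
echo $(( 10 * 1448 ))   # prints 14480: bytes allowed in flight
```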

ssthresh

After the local TCP stack detects network congestion, it shrinks the congestion window, then tries to grow it quickly (slow start) until it reaches ssthresh * mss bytes, after which it grows more slowly (congestion avoidance).

              ssthresh:<ssthresh>
                     tcp congestion window slow start threshold

For the calculation logic of ssthresh:

https://witestlab.poly.edu/blog/tcp-congestion-control-basics/#:~:text=Overview%20of%20TCP%20phases

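A toy model of the slow-start phase described above (real stacks grow cwnd per ACK, and cubic reshapes the curve; this sketch with made-up numbers only shows the exponential growth up to ssthresh):

```shell
# Slow start roughly doubles cwnd every RTT until it reaches ssthresh.
cwnd=10; ssthresh=80; rtts=0
while [ $cwnd -lt $ssthresh ]; do
    cwnd=$(( cwnd * 2 )); rtts=$(( rtts + 1 ))
done
echo "reached cwnd=$cwnd after $rtts RTTs"   # prints: reached cwnd=80 after 3 RTTs
```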

TCP retransmission

retrans

Statistics of TCP retransmission. Format:

The number of in-flight segments that were retransmitted and have not yet been acknowledged / the cumulative total of segments retransmitted over the life of the connection.

The first number changes dynamically; the second only ever grows.

https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command

(Retransmitted packets out) / (Total retransmits for entire connection)

add more TCP_INFO components

retrans:X/Y

  X: number of outstanding retransmit packets

  Y: total number of retransmits for the session

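For monitoring scripts, the retrans:X/Y pair can be pulled out of an ss line with grep. A sketch with a hypothetical input line:

```shell
# Extract the retrans:X/Y pair from one ss output line.
line='... busy:84ms retrans:0/25813 rcv_rtt:1 ...'
rt=$(echo "$line" | grep -o 'retrans:[0-9]*/[0-9]*')
echo "${rt#retrans:}"   # prints: 0/25813
```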
  • s.retrans_total

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068

		s.retrans_total  = info->tcpi_total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/include/uapi/linux/tcp.h#L232

struct tcp_info {
    	__u32	tcpi_retrans;
	__u32	tcpi_total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3791

	info->tcpi_total_retrans = tp->total_retrans;

https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L347

struct tcp_sock {
	u32	total_retrans;	/* Total retransmits for entire connection */
  • s.retrans

Number of segments that were retransmitted and did not receive ack

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3068

		s.retrans	 = info->tcpi_retrans;

https://elixir.bootlin.com/linux/v5.19/source/net/ipv4/tcp.c#L3774

	info->tcpi_retrans = tp->retrans_out;

https://elixir.bootlin.com/linux/v5.19/source/include/linux/tcp.h#L266

struct tcp_sock {
	u32	retrans_out;	/* Retransmitted packets out		*/

bytes_retrans

Description: total data bytes retransmitted.
Metric type: counter — a cumulative metric whose value can only increase monotonically over the life of the connection.
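Because both counters only grow, a useful derived health metric is the share of sent bytes that had to be retransmitted. Using the numbers from the ssh example above (bytes_sent:5384, bytes_retrans:1440):

```shell
# Retransmitted bytes as a percentage of bytes sent.
awk 'BEGIN { printf "%.1f%%\n", 100 * 1440 / 5384 }'   # prints 26.7%
```

In practice you would sample the two counters twice and compute the ratio of the deltas, so old retransmissions do not dominate the number.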

TCP timer

For people new to TCP implementation, it is hard to imagine that, in addition to being driven by application send and NIC receive events, TCP is also driven by many timers. ss can display these timers.

              Show timer information. For TCP protocol, the output
              format is:

              timer:(<timer_name>,<expire_time>,<retrans>)

              <timer_name>
                     the name of the timer, there are five kind of timer
                     names:

                     on : means one of these timers: TCP retrans timer,
                     TCP early retrans timer and tail loss probe timer

                     keepalive: tcp keep alive timer

                     timewait: timewait stage timer

                     persist: zero window probe timer

                     unknown: none of the above timers

              <expire_time>
                     how long time the timer will expire
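The timer blob can be split into its three fields with plain parameter expansion. A sketch with a hypothetical timer string:

```shell
# timer:(<timer_name>,<expire_time>,<retrans>)
t='timer:(keepalive,27sec,0)'
fields=$(echo "$t" | sed 's/^timer:(//; s/)$//')
name=${fields%%,*}
rest=${fields#*,}
expire=${rest%%,*}
retrans=${rest#*,}
echo "$name timer expires in $expire, $retrans retransmits"
```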

Other

app_limited

https://unix.stackexchange.com/questions/542712/detailed-output-of-ss-command

My understanding is that this is a boolean flag. If ss displays app_limited, the application is not fully using the available TCP send bandwidth; the connection has room to send more.

  tcpi_delivery_rate: The most recent goodput, as measured by
    tcp_rate_gen(). If the socket is limited by the sending
    application (e.g., no data to send), it reports the highest
    measurement instead of the most recent. The unit is bytes per
    second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
    was measured when the socket's throughput was limited by the
    sending application.

https://github.com/shemminger/iproute2/blob/f8decf82af07591833f89004e9b72cc39c1b5c52/misc/ss.c#L3138

		s.app_limited = info->tcpi_delivery_rate_app_limited;

https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/tcp_rate.c#L182

/* If a gap is detected between sends, mark the socket application-limited. */
void tcp_rate_check_app_limited(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (/* We have less than one packet to send. */
	    tp->write_seq - tp->snd_nxt < tp->mss_cache &&
	    /* Nothing in sending host's qdisc queues or NIC tx queue. */
	    sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
	    /* We are not limited by CWND. */
	    tcp_packets_in_flight(tp) < tp->snd_cwnd &&
	    /* All lost packets have been retransmitted. */
	    tp->lost_out <= tp->retrans_out)
		tp->app_limited =
			(tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
}

Special operations

specified network namespace

Specify the network namespace file used by ss, such as ss -N /proc/322/ns/net

       -N NSNAME, --net=NSNAME
              Switch to the specified network namespace name.

kill socket

Forcibly close a TCP connection.

       -K, --kill
              Attempts to forcibly close sockets. This option displays
              sockets that are successfully closed and silently skips
              sockets that the kernel does not support closing. It
              supports IPv4 and IPv6 sockets only.
sudo ss -K  'dport 22'

Monitor connection close events

ss -ta -E
State                  Recv-Q                 Send-Q                                   Local Address:Port                                     Peer Address:Port                     Process                 

UNCONN                0                     0                                        10.0.2.15:40612                                            172.67.141.218:http               

Filter

E.g:

ss -apu state unconnected 'sport = :1812'

Monitor use case

Non-containerized example:

#non-container version
export JMETER_PID=38991 # PLEASE UPDATE
export SS_FILTER="dst 1.1.1.1" # PLEASE UPDATE, e.g IP of the gateway to k8s

export CAPTURE_SECONDS=60000 #capture for 1000 minutes
sudo bash -c "
end=\$((SECONDS+$CAPTURE_SECONDS))
while [ \$SECONDS -lt \$end ]; do
    echo \$SECONDS/\$end
    ss -taoipnm \"${SS_FILTER}\" | grep -A1 $JMETER_PID
    sleep 2
    date
done
" | tee /tmp/tcp_conn_info_${JMETER_PID}

Containerized example:

export ENVOY_PID=$(sudo pgrep --ns $SVC_PID --nslist net envoy)

export SS_FILTER="dst 1.1.1.1 or dst 2.2.2.2" # PLEASE UPDATE, e.g IP of the Oracle/Cassandra/Kafka/Redis
export POD_NAME=$(sudo nsenter -t $ENVOY_PID -n -u -- hostname)

## capture connection info for 10 minutes
export CAPTURE_SECONDS=600 #capture for 10 min
sudo nsenter -t $ENVOY_PID -n -u -- bash -c "
end=\$((SECONDS+$CAPTURE_SECONDS))
while [ \$SECONDS -lt \$end ]; do
    echo \$SECONDS/\$end
    ss -taoipnm \"${SS_FILTER}\" | grep -A1 $ENVOY_PID
    sleep 1
    date
done
" | tee /tmp/tcp_conn_info_${POD_NAME}

Rationale - how ss works

https://events.static.linuxfound.org/sites/events/files/slides/Exploration%20of%20Linux%20Container%20Network%20Monitoring%20and%20Visualization.pdf

https://man7.org/linux/man-pages/man7/netlink.7.html

socket(AF_NETLINK, SOCK_RAW, NETLINK_INET_DIAG);
/**
       NETLINK_SOCK_DIAG (since Linux 3.3)
              Query information about sockets of various protocol
              families from the kernel (see sock_diag(7)).
**/
  • Fetch information about sockets - Used by ss (“another utility to investigate sockets”)

https://man7.org/linux/man-pages/man7/sock_diag.7.html

idiag_ext

Here you can take a look at the data source of ss. It is effectively another source of documentation for ss output.

https://man7.org/linux/man-pages/man7/sock_diag.7.html#:~:text=or%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20IPPROTO_UDPLITE.-,idiag_ext,-This%20is%20a

    The fields of struct inet_diag_req_v2 are as follows:

       idiag_ext
          This is a set of flags defining what kind of extended
          information to report.  Each requested kind of information
          is reported back as a netlink attribute as described
          below:

          INET_DIAG_TOS
                 The payload associated with this attribute is a
                 __u8 value which is the TOS of the socket.

          INET_DIAG_TCLASS
                 The payload associated with this attribute is a
                 __u8 value which is the TClass of the socket.  IPv6
                 sockets only.  For LISTEN and CLOSE sockets, this
                 is followed by INET_DIAG_SKV6ONLY attribute with
                 associated __u8 payload value meaning whether the
                 socket is IPv6-only or not.

          INET_DIAG_MEMINFO
                 The payload associated with this attribute is
                 represented in the following structure:

                     struct inet_diag_meminfo {
                         __u32 idiag_rmem;
                         __u32 idiag_wmem;
                         __u32 idiag_fmem;
                         __u32 idiag_tmem;
                     };

                 The fields of this structure are as follows:

                 idiag_rmem
                        The amount of data in the receive queue.

                 idiag_wmem
                        The amount of data that is queued by TCP but
                        not yet sent.

                 idiag_fmem
                        The amount of memory scheduled for future
                        use (TCP only).

                 idiag_tmem
                        The amount of data in send queue.

          INET_DIAG_SKMEMINFO
                 The payload associated with this attribute is an
                 array of __u32 values described below in the
                 subsection "Socket memory information".

          INET_DIAG_INFO
                 The payload associated with this attribute is
                 specific to the address family.  For TCP sockets,
                 it is an object of type struct tcp_info.

          INET_DIAG_CONG
                 The payload associated with this attribute is a
                 string that describes the congestion control
                 algorithm used.  For TCP sockets only.

       idiag_timer
              For TCP sockets, this field describes the type of timer
              that is currently active for the socket.  It is set to
              one of the following constants:

                   0  no timer is active
                   1  a retransmit timer
                   2  a keep-alive timer
                   3  a TIME_WAIT timer
                   4  a zero window probe timer

              For non-TCP sockets, this field is set to 0.

       idiag_retrans
              For idiag_timer values 1, 2, and 4, this field contains
              the number of retransmits.  For other idiag_timer
              values, this field is set to 0.

       idiag_expires
              For TCP sockets that have an active timer, this field
              describes its expiration time in milliseconds.  For other
              sockets, this field is set to 0.

       idiag_rqueue
              For listening sockets: the number of pending connections.

              For other sockets: the amount of data in the incoming
              queue.

       idiag_wqueue
              For listening sockets: the backlog length.

              For other sockets: the amount of memory available for
              sending.
       idiag_uid
              This is the socket owner UID.

       idiag_inode
              This is the socket inode number.
              
   Socket memory information
       The payload associated with UNIX_DIAG_MEMINFO and
       INET_DIAG_SKMEMINFO netlink attributes is an array of the
       following __u32 values:

       SK_MEMINFO_RMEM_ALLOC
              The amount of data in receive queue.

       SK_MEMINFO_RCVBUF
              The receive socket buffer as set by SO_RCVBUF.

       SK_MEMINFO_WMEM_ALLOC
              The amount of data in send queue.

       SK_MEMINFO_SNDBUF
              The send socket buffer as set by SO_SNDBUF.

       SK_MEMINFO_FWD_ALLOC
              The amount of memory scheduled for future use (TCP only).

       SK_MEMINFO_WMEM_QUEUED
              The amount of data queued by TCP, but not yet sent.

       SK_MEMINFO_OPTMEM
              The amount of memory allocated for the socket's service
              needs (e.g., socket filter).

       SK_MEMINFO_BACKLOG
              The amount of packets in the backlog (not yet processed).

For INET_DIAG_INFO:

For TCP sockets, it is an object of type struct tcp_info

https://wiki.linuxfoundation.org/networking/generic_netlink_howto

https://medium.com/thg-tech-blog/on-linux-netlink-d7af1987f89d

Ref.

https://djangocas.dev/blog/huge-improve-network-performance-by-change-tcp-congestion-control-to-bbr/

https://man7.org/linux/man-pages/man8/ss.8.html

WRITTEN BY
Mark Zhu
An old developer