
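Introduction
Four years ago, the Istio feature that impressed me most was its application-agnostic traffic interception and transparent proxying. This greatly lowered the development barrier to entering the Service Mesh era at low cost, and it is one of the main reasons many companies adopted Istio.
As the saying goes:
Your greatest strength is sometimes also the source of your greatest weakness.
By default, Istio uses the iptables REDIRECT rule to DNAT outbound traffic to the sidecar.
Readers who know NAT reasonably well will also know that it is an imperfect technology with fairly obvious flaws (which technology has none?), yet it is widely used in practice, including on your home router.
In the Linux kernel, NAT is generally implemented with conntrack plus the iptables NAT table. As one implementation of NAT, it shares the problems typical of NAT.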
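Problems
Two problems are described below. Both are triggered only under specific conditions:
- TCP Proxy half-closed connection leak for 1 hour in some scenarios
- App outbound connection timed out because the App selected an ephemeral port which collides with an existing socket on the 15001 (outbound) listener
Problem 1 is one of the triggers of problem 2.
The descriptions below come from my earlier discussions with the community, which were mainly in English.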
TCP Proxy half-closed connection leak for 1 hour in some scenarios
For details with diagrams, see "TCP Proxy half-closed connection leak for 1 hour in some scenarios" in my book "istio insider".
By default, the sidecar intercepts and TCP-proxies all outbound TCP connections:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service
- When upstream-tcp-service wants to disconnect, it sends a FIN.
- The sidecar receives the FIN and calls the shutdown(fd, ENVOY_SHUT_WR) syscall on the downstream socket to forward the FIN to the app, keeping the connection half-closed. The socket state is now FIN_WAIT2.
- The conntrack table starts a 60s timer (/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait). After the timeout, the DNAT entry is removed.
- The app receives the FIN.
- In normal scenarios, after receiving the FIN, the app calls close() quickly, which closes the socket and replies with a FIN; both sockets in the sidecar are then closed.
- BUT if the app calls close() more than 60s later, the FIN sent by the app is not delivered to the sidecar, because the conntrack table DNAT entry was already removed at 60s.
So 2 sockets are leaked on the sidecar.
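The half-close in the steps above can be reproduced in a few lines. This is only a minimal sketch of the socket semantics. It uses a local socketpair (Unix domain sockets on most systems) rather than a real TCP connection through conntrack, so no FIN_WAIT2 state or DNAT is involved. The names sidecar and app are illustrative, and treating ENVOY_SHUT_WR as equivalent to SHUT_WR is an assumption.

```python
import socket

# A connected pair of stream sockets stands in for the two ends of one
# connection. "sidecar" and "app" are illustrative names only.
sidecar, app = socket.socketpair()

# The sidecar half-closes: it stops sending but keeps its receive side open.
# (Assumption: ENVOY_SHUT_WR maps to the standard SHUT_WR.)
sidecar.shutdown(socket.SHUT_WR)

# The app observes end-of-stream, the equivalent of receiving the FIN ...
eof = app.recv(16)

# ... yet it can still send, and the sidecar can still read.
app.sendall(b"late data")
late = sidecar.recv(16)

# Only when the app finally closes does the sidecar see end-of-stream too.
app.close()
final = sidecar.recv(16)
sidecar.close()

print(eof, late, final)
```

The point is that after the shutdown the sidecar side depends entirely on the app's own close() arriving. If that final FIN is lost on the way, nothing else tells the sidecar that the connection is finished.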
We know that, generally speaking, a FIN_WAIT2 socket has a timer that closes it: /proc/sys/net/ipv4/tcp_fin_timeout
But for a half-closed FIN_WAIT2 socket (shutdown(fd,ENVOY_SHUT_WR)), no such timer exists, because the socket is still held open by the process rather than orphaned.
The good news is that the Envoy TCPProxy filter has an idle_timeout setting, which defaults to 1 hour. So the leaked sockets above persist for a 1-hour window before being garbage-collected.
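The timing argument can be written down as a small model. This is a simplified sketch, not a measurement. The function name and parameters are hypothetical, the defaults are the 60s and 1 hour values quoted above, and it assumes no other packets refresh either timer.

```python
def sidecar_release_time(app_close_delay_s,
                         conntrack_close_wait_s=60,
                         idle_timeout_s=3600):
    """Seconds after the sidecar forwarded the FIN until its sockets are freed.

    Simplified model: all times are measured from the moment the sidecar
    half-closed the downstream socket, and nothing refreshes the timers.
    """
    if app_close_delay_s < conntrack_close_wait_s:
        # The DNAT entry still exists, so the app's FIN reaches the sidecar.
        return app_close_delay_s
    # The app's FIN is lost; only Envoy's idle timeout reclaims the sockets.
    return idle_timeout_s


for delay in (5, 59, 61, 600):
    print(delay, "->", sidecar_release_time(delay))
```

In this model an app that closes after 61 seconds costs the sidecar the same hour as one that never closes at all.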
App outbound connection timed out because the App selected an ephemeral port which collides with an existing socket on the 15001 (outbound) listener
For details with diagrams, see "App outbound connecting timed out because App selected a ephemeral port which collisions with the existing socket on 15001(outbound) listener" in my book "istio insider".
By default, the sidecar intercepts and TCP-proxies all outbound TCP connections:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service
But in some scenarios, the App just gets a connect timed out error when connecting to the sidecar 15001 (outbound) listener.
Scenarios:
1. The sidecar has a half-closed connection to the App, e.g.:
   $ ss
   tcp FIN-WAIT-2 0 0 127.0.0.1:15001 172.29.73.7:44410
   (the peer address is POD_IP:ephemeral_port)
   This can happen, e.g. in: TCP Proxy half-closed connection leak for 1 hour in some scenarios #43297
   There is no track entry in the conntrack table, because nf_conntrack_tcp_timeout_close_wait timed out and the entry expired.
2. The App invokes the syscall connect(sockfd, peer_addr). The kernel allocates an ephemeral port (44410 in this case), binds the new socket to that ephemeral port, and sends a SYN packet to the peer.
3. The SYN packet reaches conntrack, which creates a track entry in the conntrack table:
   $ conntrack -L
   tcp 6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410
4. The SYN packet is DNATed to 127.0.0.1:15001.
5. The SYN packet reaches the already existing FIN-WAIT-2 127.0.0.1:15001 172.29.73.7:44410 socket, so the sidecar replies to the App with a TCP Challenge ACK packet (its TCP seq-no comes from the old FIN-WAIT-2 socket).
6. The App replies to the TCP Challenge ACK with a RST (its TCP seq-no comes from the TCP Challenge ACK).
7. Conntrack gets the RST packet and checks it. In some kernel versions, conntrack simply marks the RST packet as invalid, because its seq-no falls outside the window of the track entry created in step 3.
8. The App retransmits the SYN, but never receives the expected SYN/ACK reply. The connect call times out in the App's user space.
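The precondition in step 1 can be checked from ss output. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes lines shaped like the sample above (state, two queue columns, then the local and peer address), which may differ between ss versions and flags.

```python
def risky_ephemeral_ports(ss_output, listener_port=15001):
    """Return peer ports of FIN-WAIT-2 sockets on the outbound listener port.

    Assumes each line looks like:
        tcp FIN-WAIT-2 0 0 127.0.0.1:15001 172.29.73.7:44410
    """
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if "FIN-WAIT-2" not in fields:
            continue
        # After the state column come Recv-Q, Send-Q, local and peer address.
        rest = fields[fields.index("FIN-WAIT-2") + 1:]
        if len(rest) < 4:
            continue
        local, peer = rest[2], rest[3]
        if local.rsplit(":", 1)[-1] != str(listener_port):
            continue
        peer_port = peer.rsplit(":", 1)[-1]
        if peer_port.isdigit():
            ports.add(int(peer_port))
    return sorted(ports)


sample = """\
tcp FIN-WAIT-2 0 0 127.0.0.1:15001 172.29.73.7:44410
tcp ESTAB 0 0 127.0.0.1:15001 172.29.73.7:51234
tcp FIN-WAIT-2 0 0 172.29.73.7:39000 172.21.206.198:7777
"""
print(risky_ephemeral_ports(sample))
```

Any port this returns is one that the kernel may hand out again as an ephemeral port in step 2 while the stale sidecar socket still exists.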
Different kernel versions may apply different packet validation rules in step 7.
It seems related to the kernel patch "Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs", which was merged after kernel v5.14.
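To see which of these conntrack settings a given node exposes, the files under /proc/sys/net/netfilter can be read directly. This is a minimal sketch with a hypothetical helper name. The file name nf_conntrack_tcp_ignore_invalid_rst is my assumption of how that patch's sysctl appears, and on older kernels, or when the conntrack module is not loaded, the file may simply be absent.

```python
from pathlib import Path


def read_conntrack_sysctls(names, base="/proc/sys/net/netfilter"):
    """Return {name: value or None}; None means the file is absent or unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = Path(base, name).read_text().strip()
        except OSError:
            values[name] = None
    return values


settings = read_conntrack_sysctls([
    "nf_conntrack_tcp_timeout_close_wait",
    # Assumed file name for the sysctl added by the patch mentioned above.
    "nf_conntrack_tcp_ignore_invalid_rst",
])
for name, value in settings.items():
    print(name, "=", value if value is not None else "(not available)")
```

A missing file only says that this kernel does not offer the knob. It does not by itself tell you how step 7 behaves on that kernel.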
The good news is that a fix for this problem is in kernel v6.2-rc7: "netfilter: conntrack: handle tcp challenge acks during connection reuse":
When a connection is re-used, following can happen:
[ connection starts to close, fin sent in either direction ]
> syn # initator quickly reuses connection
< ack # peer sends a challenge ack
> rst # rst, sequence number == ack_seq of previous challenge ack
> syn # this syn is expected to pass
Problem is that the rst will fail window validation, so it gets
tagged as invalid.
If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.
But in some scenarios and on some kernel versions, it will still be an issue.