Introduction
Four years ago, the Istio feature that most impressed me was its application-agnostic traffic interception and transparent proxying. It greatly lowered the development barrier for entering the Service Mesh era at low cost, and it is one of the main reasons many companies adopted Istio.
As the saying goes:
your greatest strength is sometimes also the source of your greatest weakness.
By default, Istio uses an iptables REDIRECT rule to DNAT outbound traffic to the sidecar.
If you know NAT reasonably well, you also know, more or less, that NAT is an imperfect technology with fairly obvious flaws (which technology has none?), yet one that is widely used in the real world, including on your home router.
In the Linux kernel, NAT is generally implemented as conntrack plus NAT iptables rules. As one implementation of NAT, it shares the same kind of problems as NAT in general.
Problems
Two problems are described below. Both are triggered only under specific conditions:
- TCP Proxy half-closed connection leak for 1 hour in some scenarios
- App outbound connection timed out because the App selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener
Problem 1 is one of the triggers of Problem 2.
Since my earlier discussions with the community were mainly in English and I have not yet had time to translate them, the description below remains mostly in English.
TCP Proxy half-closed connection leak for 1 hour in some scenarios
For details with diagrams, see the section "TCP Proxy half-closed connection leak for 1 hour in some scenarios" in my book "Istio Insider".
The sidecar intercepts and TCP-proxies all outbound TCP connections by default:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service
- When `upstream-tcp-service` wants to disconnect, it sends a `FIN`. `sidecar` receives the `FIN` and calls the `shutdown(fd, ENVOY_SHUT_WR)` syscall on the downstream socket to forward the `FIN` to `app`, keeping that connection half-closed; the socket state is now `FIN_WAIT2` (see the sketch after this list). The conntrack table starts a 60s timer (`/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait`); after it times out, the DNAT entry is removed. `app` receives the `FIN`.
- In normal scenarios, after receiving the `FIN`, the `app` calls `close()` quickly; it closes the socket and replies with its own `FIN`, and then both sockets in the sidecar are closed.
- BUT, if the `app` calls `close()` after 60s, the `FIN` sent by `app` is never delivered to `sidecar`, because the conntrack DNAT entry was already removed at 60s. So 2 sockets are leaked on `sidecar`.
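To make the half-close step concrete, here is a minimal Go sketch of what a TCP proxy does when the upstream sends its `FIN`: it forwards the EOF to the app side with a half-close (Go's `CloseWrite()` maps to the `shutdown(fd, SHUT_WR)` syscall) while keeping the read side open. This is only an illustration of the mechanism under those assumptions, not Envoy's actual implementation; the listener and upstream addresses are stand-ins.

```go
// Minimal TCP proxy sketch: forwards bytes in both directions and propagates
// each side's FIN with a half-close (CloseWrite == shutdown(fd, SHUT_WR)),
// the same way the sidecar leaves its app-facing socket in FIN_WAIT2.
package main

import (
	"io"
	"log"
	"net"
)

func pipe(dst, src *net.TCPConn) {
	// io.Copy returns when src delivers EOF, i.e. its FIN arrived.
	if _, err := io.Copy(dst, src); err != nil {
		log.Printf("copy error: %v", err)
	}
	// Forward the FIN but keep reading: dst is now half-closed (FIN_WAIT2),
	// and no kernel timer will reap it while the process holds it open.
	dst.CloseWrite()
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:15001") // stand-in for the outbound listener
	if err != nil {
		log.Fatal(err)
	}
	for {
		downstream, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(d *net.TCPConn) {
			upstream, err := net.Dial("tcp", "172.21.206.198:7777") // hypothetical upstream
			if err != nil {
				d.Close()
				return
			}
			u := upstream.(*net.TCPConn)
			go pipe(u, d) // app -> upstream
			pipe(d, u)    // upstream -> app (the direction described above)
		}(downstream.(*net.TCPConn))
	}
}
```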
We know that, generally speaking, a `FIN_WAIT2` socket has a timer to close it: `/proc/sys/net/ipv4/tcp_fin_timeout`.
But for a half-closed `FIN_WAIT2` socket (one created via `shutdown(fd, ENVOY_SHUT_WR)`), no such timer exists.
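To see the resulting leak in practice, the following Go sketch plays the role of a slow app: it reads until it gets the forwarded `FIN`, then deliberately waits longer than `nf_conntrack_tcp_timeout_close_wait` before calling `close()`. It assumes the pod's sidecar intercepts outbound traffic and the conntrack timeout is at its 60s default; the upstream service name and port are hypothetical.

```go
package main

import (
	"io"
	"log"
	"net"
	"time"
)

func main() {
	// Outbound connection; inside the mesh it is DNATed to the sidecar's
	// 15001 listener. The service name and port are illustrative.
	conn, err := net.Dial("tcp", "upstream-tcp-service:7777")
	if err != nil {
		log.Fatal(err)
	}

	// Block until EOF: the upstream's FIN, forwarded to us by the sidecar,
	// which now holds a half-closed FIN_WAIT2 socket toward this app.
	if _, err := io.Copy(io.Discard, conn); err != nil {
		log.Fatal(err)
	}
	log.Println("got FIN, delaying close() past the conntrack close_wait timeout")

	// Sleep longer than nf_conntrack_tcp_timeout_close_wait (60s by default),
	// so the DNAT entry for this connection expires in the meantime.
	time.Sleep(90 * time.Second)

	// Our FIN is sent here, but with the DNAT entry gone it never reaches the
	// sidecar, whose two sockets stay leaked until the TCPProxy idle_timeout
	// (1 hour by default) cleans them up.
	conn.Close()
}
```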
The good news is that the Envoy `TCPProxy` filter has an `idle_timeout` setting, which defaults to 1 hour. So the problem above has a 1-hour leak window before the leaked connections are garbage-collected.
App outbound connection timed out because the App selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener
For details with diagrams, see the section "App outbound connecting timed out because App selected a ephemeral port which collisions with the existing socket on 15001(outbound) listener" in my book "Istio Insider".
The sidecar intercepts and TCP-proxies all outbound TCP connections by default:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service
But in some scenarios, the App just gets a connect timed out error when connecting to the sidecar's 15001 (outbound) listener.
Scenarios:
1. The sidecar has a half-open connection to the App, e.g.:

        $ ss
        tcp   FIN-WAIT-2   0   0   127.0.0.1:15001   172.29.73.7:44410(POD_IP:ephemeral_port)

   This can happen, e.g. because of "TCP Proxy half-closed connection leak for 1 hour in some scenarios" (#43297). There is no track entry for this connection in the conntrack table, because `nf_conntrack_tcp_timeout_close_wait` timed out and the entry expired.

2. The App invokes the `connect(sockfd, peer_addr)` syscall; the kernel allocates an ephemeral port (44410 in this case), binds the new socket to that ephemeral port and sends a `SYN` packet to the peer (see the sketch after these steps).

3. The `SYN` packet reaches conntrack, which creates a track entry in the conntrack table:

        $ conntrack -L
        tcp 6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410

4. The `SYN` packet is DNATed to `127.0.0.1:15001`.

5. The `SYN` packet reaches the already existing `FIN-WAIT-2 127.0.0.1:15001 172.29.73.7:44410` socket, so the sidecar replies to the App with a TCP Challenge ACK packet (its TCP seq-no comes from the old `FIN-WAIT-2` connection).

6. The App replies to the TCP Challenge ACK with a `RST` (its TCP seq-no comes from the TCP Challenge ACK).

7. Conntrack receives the `RST` packet and validates it. In some kernel versions, conntrack simply marks the `RST` packet as invalid, because its seq-no is outside the window of the track entry in the conntrack table created in step 3.

8. The App retransmits the `SYN` but never gets the expected `SYN/ACK` reply. A connect timed out error surfaces in App user space.
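Below is a rough Go sketch that imitates step 2 from the App side if you want to reproduce the collision deliberately: it forces the kernel to reuse the specific local port that still has a leftover `FIN-WAIT-2` peer socket on the sidecar's 15001 listener. The addresses and port 44410 are taken from the example above and are purely illustrative; in production the collision happens by chance when the kernel picks the same ephemeral port on its own.

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	d := net.Dialer{
		// Bind the outgoing socket to the "ephemeral" port that collides with
		// the sidecar's leftover FIN-WAIT-2 socket (normally the kernel picks
		// this port for us; here we force it to demonstrate the failure).
		LocalAddr: &net.TCPAddr{Port: 44410},
		Timeout:   10 * time.Second,
	}

	// The SYN is DNATed to 127.0.0.1:15001. If that listener still owns a
	// FIN-WAIT-2 socket for the same 4-tuple, it answers with a challenge ACK,
	// conntrack marks our RST invalid, the SYN is retransmitted without an
	// answer, and this Dial ends with a connect timeout (steps 5-8 above).
	conn, err := d.Dial("tcp", "172.21.206.198:7777")
	if err != nil {
		log.Fatalf("connect failed as described above: %v", err)
	}
	defer conn.Close()
	log.Println("connected (no collision this time)")
}
```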
Different kernel versions may apply different packet validation rules in step 7.
It seems related to the kernel patch "Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs", which was merged into the kernel after v5.14.
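If you want to check how a given node is configured, a small helper like the one below can read the relevant kernel knobs from `/proc`. The first two paths appear earlier in this post; the third path is my assumption of the sysctl name added by the v5.14 patch mentioned above, so verify that it exists on your kernel.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	knobs := []string{
		// conntrack timer that removes the DNAT entry of half-closed connections
		"/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait",
		// timer for orphaned (fully close()d) FIN_WAIT2 sockets
		"/proc/sys/net/ipv4/tcp_fin_timeout",
		// assumed name of the sysctl from the v5.14 patch; may not exist on older kernels
		"/proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst",
	}
	for _, path := range knobs {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: not available (%v)\n", path, err)
			continue
		}
		fmt.Printf("%s = %s\n", path, strings.TrimSpace(string(data)))
	}
}
```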
The good news is that this will be fixed in kernel v6.2-rc7 by the patch "netfilter: conntrack: handle tcp challenge acks during connection reuse":
    When a connection is re-used, following can happen:
    [ connection starts to close, fin sent in either direction ]
     > syn # initator quickly reuses connection
     < ack # peer sends a challenge ack
     > rst # rst, sequence number == ack_seq of previous challenge ack
     > syn # this syn is expected to pass
    Problem is that the rst will fail window validation, so it gets
    tagged as invalid.
    If ruleset drops such packets, we get repeated syn-retransmits until
    initator gives up or peer starts responding with syn/ack.
But in some scenarios and on some kernel versions, it will still be an issue anyway.