Half-close connection leaks and outbound connection failures in Istio under specific conditions

Introduction

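Four years ago, the Istio feature that impressed me most was its application-agnostic traffic interception and transparent proxying. It greatly lowered the development barrier to entering the Service Mesh era at low cost, and it is one of the main reasons many companies adopted Istio.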

As the saying goes:

Your greatest strength is sometimes also the source of your greatest weakness.

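By default, Istio uses iptables REDIRECT rules to DNAT outbound traffic to the sidecar.

If you are familiar with NAT, you probably also know that it is an imperfect technology with fairly obvious flaws (which technology has none?), yet it is widely used in practice, including on your home router.

In the Linux kernel, NAT is generally implemented with conntrack plus the iptables NAT table. Being one implementation of NAT, it shares the problems of NAT in general.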

Problems

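Two problems are described below. Both are triggered only under specific conditions:

1. TCP Proxy half-closed connection leak for 1 hour in some scenarios
2. App outbound connection times out because the app selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener

Problem 1 is one of the triggers of Problem 2.

The descriptions below are taken largely from my earlier discussion with the community, which was conducted in English.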

TCP Proxy half-closed connection leak for 1 hour in some scenarios

For details with diagrams, see "TCP Proxy half-closed connection leak for 1 hour in some scenarios" in my book 《istio insider》.

By default, the sidecar intercepts and TCP-proxies all outbound TCP connections:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service

1. When upstream-tcp-service wants to disconnect, it sends a FIN.
2. The sidecar receives the FIN and calls the shutdown(fd,ENVOY_SHUT_WR) syscall on the downstream socket to forward the FIN to the app while keeping the connection half-closed. The downstream socket is now in the FIN_WAIT2 state.
3. Conntrack starts a 60-second timer for the entry (/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait). When the timer expires, the DNAT entry is removed.
4. The app receives the FIN.
5. In the normal case, the app calls close() shortly after receiving the FIN. This closes its socket and sends a FIN in reply, after which both sockets in the sidecar are closed.
6. BUT if the app calls close() more than 60 seconds later, the FIN it sends is never delivered to the sidecar, because the conntrack DNAT entry was already removed at the 60-second mark.
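
The half-close at the heart of these steps can be reproduced on loopback. The following is a minimal sketch in Python, not Envoy code: it uses the standard `socket.SHUT_WR` rather than Envoy's `ENVOY_SHUT_WR` wrapper, the names `proxy_side` and `half_close_demo` are made up for illustration, and no conntrack or DNAT is involved. It only shows that after `shutdown(SHUT_WR)` the peer sees EOF while the opposite direction stays usable until the peer's own FIN arrives.

```python
import socket
import threading


def half_close_demo() -> dict:
    """Half-close one side of a loopback TCP connection and record what each side sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    result = {}

    def proxy_side() -> None:
        # Stands in for the sidecar's downstream socket (illustrative only).
        conn, _ = listener.accept()
        with conn:
            conn.shutdown(socket.SHUT_WR)  # send FIN, keep the read side open
            result["late_data"] = conn.recv(1024)  # the app can still write to us
            result["peer_eof"] = conn.recv(1024)   # b"" once the app's FIN arrives

    worker = threading.Thread(target=proxy_side)
    worker.start()

    app = socket.create_connection(listener.getsockname())
    result["app_saw_eof"] = app.recv(1024)  # b"": the app received the FIN
    app.sendall(b"still writable")          # app -> proxy direction is still open
    app.close()                             # the app's own FIN
    worker.join()
    listener.close()
    return result
```

In the leak scenario, it is the FIN produced by the final `app.close()` that never reaches the sidecar, so the second `recv` on the proxy side would never return.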

As a result, two sockets are leaked on the sidecar.

Generally speaking, a FIN_WAIT2 socket has a timer that eventually closes it: /proc/sys/net/ipv4/tcp_fin_timeout

But that timer only covers orphaned sockets, that is, sockets the process has already closed. For a half-closed FIN_WAIT2 socket (one produced by shutdown(fd,ENVOY_SHUT_WR)), no such timer exists.

The good news is that the Envoy TCPProxy filter has an idle_timeout setting, which defaults to 1 hour. So the problem above has a leak window of 1 hour before the sockets are garbage-collected.
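
The timing argument can be written down as a tiny worked model. This is a sketch under simplifying assumptions, not a measurement: the function name `sidecar_release_time_s` is hypothetical, the two constants are the values quoted in the text, and the model assumes the idle timer starts counting at the upstream FIN.

```python
CONNTRACK_CLOSE_WAIT_TIMEOUT_S = 60  # nf_conntrack_tcp_timeout_close_wait, as quoted in the text
TCP_PROXY_IDLE_TIMEOUT_S = 3600      # Envoy TCPProxy idle_timeout default, as quoted in the text


def sidecar_release_time_s(
    app_close_delay_s: float,
    conntrack_timeout_s: float = CONNTRACK_CLOSE_WAIT_TIMEOUT_S,
    idle_timeout_s: float = TCP_PROXY_IDLE_TIMEOUT_S,
) -> float:
    """Seconds after the upstream FIN at which the sidecar sockets are released (toy model)."""
    if app_close_delay_s < conntrack_timeout_s:
        # The DNAT entry still exists, so the app's FIN reaches the sidecar.
        return min(app_close_delay_s, idle_timeout_s)
    # The DNAT entry is gone and the app's FIN is lost.
    # Assumption: only the idle timeout cleans up the sockets.
    return idle_timeout_s
```

Under this model, an app that closes 5 seconds after the FIN frees the sidecar sockets at 5 seconds, while an app that closes after 90 seconds leaves them in place until the 3600-second idle timeout.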

App outbound connection times out because the app selected an ephemeral port that collides with an existing socket on the 15001 (outbound) listener

For details with diagrams, see "App outbound connecting timed out because App selected a ephemeral port which collisions with the existing socket on 15001(outbound) listener" in my book 《istio insider》.

By default, the sidecar intercepts and TCP-proxies all outbound TCP connections:
(app --[conntrack DNAT]--> sidecar) -----> upstream-tcp-service

But in some scenarios, the app simply gets a connect timed out error when connecting to the sidecar's 15001 (outbound) listener.

Scenario:

1. The sidecar has a half-closed connection to the app, e.g.:
```
$ ss
tcp FIN-WAIT-2 0 0 127.0.0.1:15001  172.29.73.7:44410(POD_IP:ephemeral_port)
```
This can happen, for example, as described in "TCP Proxy half-closed connection leak for 1 hour in some scenarios" #43297.

There is no matching entry in the conntrack table, because the nf_conntrack_tcp_timeout_close_wait timer has expired.
2. The app invokes the connect(sockfd, peer_addr) syscall. The kernel allocates an ephemeral port (44410 in this case), binds the new socket to it, and sends a SYN packet to the peer.
3. The SYN packet reaches conntrack, which creates an entry in the conntrack table:

$ conntrack -L
tcp  6 108 SYN_SENT src=172.29.73.7 dst=172.21.206.198 sport=44410 dport=7777 src=127.0.0.1 dst=172.29.73.7 sport=15001 dport=44410

4. The SYN packet is DNATed to 127.0.0.1:15001.
5. The SYN packet reaches the already existing FIN-WAIT-2 socket (127.0.0.1:15001 172.29.73.7:44410), and the sidecar replies to the app with a TCP challenge ACK whose sequence number comes from the old FIN-WAIT-2 connection.
6. The app answers the challenge ACK with a RST whose sequence number is taken from the challenge ACK.
7. Conntrack receives the RST packet and validates it. On some kernel versions, conntrack marks the RST as invalid because its sequence number falls outside the window of the conntrack entry created in step 3.
8. The app retransmits the SYN, but never receives the expected SYN/ACK. The application eventually sees a connect timeout in user space.
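
The precondition in step 1 can be looked for in `ss` output. The sketch below is only illustrative: the helper names `lingering_peer_ports` and `would_collide` are made up, it handles IPv4 addresses only, and it assumes the six-column layout of the sample line shown earlier (netid, state, recv-q, send-q, local, peer), which may differ depending on the `ss` options used.

```python
import re
from typing import Set

OUTBOUND_LISTENER_PORT = 15001  # sidecar outbound listener port named in the text


def lingering_peer_ports(ss_output: str, listener_port: int = OUTBOUND_LISTENER_PORT) -> Set[int]:
    """Collect peer ports of FIN-WAIT-2 sockets whose local side is the outbound listener."""
    ports: Set[int] = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed column layout: netid state recv-q send-q local peer
        if len(fields) < 6 or fields[1] != "FIN-WAIT-2":
            continue
        local, peer = fields[4], fields[5]
        if not local.endswith(f":{listener_port}"):
            continue
        match = re.match(r"[^:]+:(\d+)", peer)  # IPv4 "addr:port" only
        if match:
            ports.add(int(match.group(1)))
    return ports


def would_collide(ephemeral_port: int, ss_output: str) -> bool:
    """True if a new outbound connection from this source port would hit a lingering socket."""
    return ephemeral_port in lingering_peer_ports(ss_output)
```

Any port returned by `lingering_peer_ports` is one that, if the kernel hands it out again as an ephemeral port while the old socket still exists, would take the path described in steps 2 to 8.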

Different kernel versions may apply different packet validation rules in step 7:


RST packet mark as invalid:
  SUSE Linux Enterprise Server 15 SP4:
    5.14.21-150400.24.21-default

# cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
0

####################
    
RST packet passed check and NATed:
  Ubuntu 20.04.2:
    5.4.0-137-generic
    
# cat /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst
cat: /proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst: No such file or directory    

This seems to be related to the kernel patch "Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs", which was merged into the kernel after v5.14.
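
To tell which of the two cases a node falls into, the sysctl file can be probed directly. This is a small sketch: the function name `read_ignore_invalid_rst` is made up, and the path is the one shown in the `cat` commands above. It returns `None` when the file does not exist, which corresponds to the "No such file or directory" case.

```python
from pathlib import Path
from typing import Optional

SYSCTL_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_tcp_ignore_invalid_rst")


def read_ignore_invalid_rst(path: Path = SYSCTL_PATH) -> Optional[int]:
    """Return the sysctl value, or None if this kernel does not expose the knob."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None
```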

The good news is that a fix for the problem is included in kernel v6.2-rc7: "netfilter: conntrack: handle tcp challenge acks during connection reuse":

When a connection is re-used, following can happen:
[ connection starts to close, fin sent in either direction ]
 > syn   # initator quickly reuses connection
 < ack   # peer sends a challenge ack
 > rst   # rst, sequence number == ack_seq of previous challenge ack
 > syn   # this syn is expected to pass

Problem is that the rst will fail window validation, so it gets
tagged as invalid.

If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.

But in some scenarios and on some kernel versions, it will still be an issue.
