Revision 1b21da103fb27c94d3bcbb9a2dcafebd4c2c440d authored by John Fastabend on 14 January 2021, 06:25:29 UTC, committed by Paul Chaignon on 15 January 2021, 11:29:44 UTC
From: John Fastabend <john.fastabend@gmail.com>

If a pod in host networking and/or the node itself sends traffic to a pod with encryption+vxlan enabled and encryptNode disabled, we may see dropped packets. The flow is the following.

First, a packet is sent from host networking with srcIP=nodeIP dstIP=podIP. Next, a source IP is chosen for the packet so that the route src will not be used. Then the route table 'routes' the packet to cilium_host using the route rule matching the podIP's subnet,

    podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410

Then we drop into BPF space with srcIP=nodeIP,dstIP=podIP as above. Here tohost=true, and we do an ipcache lookup. The ipcache lookup will have a hit for the podIP and will have both a key and a tunnelIP. For example, something like this,

    10.128.5.129/32 1169 3 10.0.0.4

Using the above key identifier, '3' in the example, the bpf_host program will mark the packet for encryption. This will pass encryption parameters through the ctx, where the ingress program attached to cilium_net will in turn use those parameters to set skb->mark before passing the packet up to the stack for encryption. At this point we have an skb ingressing the stack with skb->mark=0xe00,srcIP=nodeIP,dstIP=podIP.

Here is where the trouble starts. This will miss the encryption policy rules because, unless encryptNode is enabled, we do not have a rule to match srcIP=nodeIP. So the unencrypted packet is sent back to cilium_host using a route in the encryption routing table. Then finally bpf_host will send the skb to the overlay, cilium_vxlan.

The packet is then received by the podIP node and sent to cilium_vxlan because it is a vxlan packet. The vxlan header is popped off and the inner (unencrypted) packet is sent to the pod; remember that srcIP=nodeIP still.

Assuming the above was a SYN packet, the pod will respond with a SYN/ACK with srcIP=podIP,dstIP=nodeIP. This will be handled by the bpf_lxc program.
First, we will do an ipcache lookup and get an ipcache entry, but this entry will not have a tunnelIP. It will have a key though. Because no tunnelIP is available, a tunnel map lookup will be done. This will fail because the dstIP is a nodeIP and not in a tunnel subnet. (The tunnel map key is computed by masking the dstIP with the subnet.) Because the program did not find a valid tunnel, the packet is sent to the stack. The stack will run through iptables, but because the packet flow is asymmetric (SYN over cilium_vxlan, SYN/ACK over the NIC) the MASQUERADE rules will be skipped. The end result is that we try to send a packet with srcIP=podIP.

Here a couple of different outcomes are possible. If the network infra is strict, the packet may be dropped at the sending node, which refuses to send a packet with an unknown srcIP. If the receiving node has rp_filter=1 on the receiving interface, the packet will be dropped with a martian IP message in dmesg. Or, if rp_filter=0 and the network knows how to route the srcIP somehow, it might work. For example, some of my test systems managed to work without any issues.

In order to fix this we can ensure reply packets are symmetric to the request. To do this we inject tunnel info into the nodeIP ipcache entry. Once this is done, the replying pod will do the following.

It sends a packet with srcIP=podIP,dstIP=nodeIP, which is picked up by the bpf_lxc program. The ipcache lookup now finds a complete entry with the tunnel info. Instead of being sent to the stack, the packet is now encapsulated in the vxlan header, sent over the overlay, and received on the original requesting node.

A quick observation: some of the passes through the xfrm stack are useless here. We know the miss is going to happen and can signal this to the bpf layer by clearing the encryption key in the ipcache. We will do this as a follow-up patch.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
1 parent 291c533
File | Mode | Size |
---|---|---|
cmd | ||
.gitignore | -rw-r--r-- | 38 bytes |
Makefile | -rw-r--r-- | 448 bytes |
main.go | -rw-r--r-- | 785 bytes |