Revision 1b21da103fb27c94d3bcbb9a2dcafebd4c2c440d authored by John Fastabend on 14 January 2021, 06:25:29 UTC, committed by Paul Chaignon on 15 January 2021, 11:29:44 UTC
From: John Fastabend <john.fastabend@gmail.com>

If a pod in host networking and/or the node itself sends traffic to a
pod with encryption+vxlan enabled and encryptNode disabled, we may see
dropped packets. The flow is the following:

First, a packet is sent from host networking with srcIP=nodeIP,
dstIP=podIP. Because the source IP is already set, the 'src' hint on
the route will not be used. The route table then routes the packet to
cilium_host using the route rule matching the podIP's subnet,

   podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410
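
To make the ordering concrete, a minimal, self-contained sketch of
that source selection step (not the kernel's actual fib code; names
and addresses are illustrative):

  /* Sketch only: the 'src' on the route above is a hint for source
   * address selection. A packet that already carries a source (here
   * the nodeIP) keeps it; the hint applies only when no source has
   * been chosen yet. */
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t select_src(uint32_t pkt_src, uint32_t route_src_hint)
  {
          return pkt_src ? pkt_src : route_src_hint;
  }

  int main(void)
  {
          uint32_t node_ip  = 0x0a000004; /* srcIP already set to nodeIP */
          uint32_t src_hint = 0x0a800501; /* route's 'src', ignored here */

          printf("chosen srcIP: 0x%08x\n",
                 (unsigned)select_src(node_ip, src_hint));
          return 0;
  }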

Then we drop into BPF space with srcIP=nodeIP, dstIP=podIP as above.
Here tohost=true, and we do an ipcache lookup. The ipcache lookup will
hit for the podIP and return both a key and a tunnelIP. For example,
something like this:

  10.128.5.129/32     1169 3 10.0.0.4
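
For reference, a rough sketch of the shape of that entry as the
datapath sees it (field names simplified and loosely modeled on
Cilium's remote_endpoint_info; the real struct differs in detail):

  /* Sketch only: simplified ipcache value for 10.128.5.129/32 from
   * the example above. */
  #include <stdint.h>
  #include <stdio.h>

  struct ipcache_entry {
          uint32_t sec_identity;    /* security identity, e.g. 1169 */
          uint32_t tunnel_endpoint; /* tunnelIP, e.g. 10.0.0.4 */
          uint8_t  encrypt_key;     /* encryption key id, e.g. 3 */
  };

  int main(void)
  {
          struct ipcache_entry e = {
                  .sec_identity    = 1169,
                  .tunnel_endpoint = 0x0a000004, /* 10.0.0.4 */
                  .encrypt_key     = 3,
          };

          if (e.tunnel_endpoint && e.encrypt_key)
                  printf("remote pod: encrypt (key %u), then encap\n",
                         (unsigned)e.encrypt_key);
          return 0;
  }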

Using the above key identifier, '3' in the example, the bpf_host
program will mark the packet for encryption. The encryption parameters
are passed through the ctx, and the ingress program attached to
cilium_net will in turn use them to set skb->mark before passing the
packet up to the stack for encryption. At this point we have an skb
ingressing the stack with skb->mark=0xe00, srcIP=nodeIP, dstIP=podIP.
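
For illustration, a minimal sketch of how such a mark could be built
from the key id (the 0xe00 magic matches the mark quoted above; the
exact bit layout is an assumption, not necessarily Cilium's encoding):

  /* Sketch only: fold the ipcache key id into skb->mark so the xfrm
   * stack can recognize the packet as to-be-encrypted. */
  #include <stdint.h>
  #include <stdio.h>

  #define MARK_MAGIC_ENCRYPT 0x0e00

  static uint32_t or_encrypt_key(uint8_t key)
  {
          /* Assumed layout: key id in bits 12-15, magic in bits 8-11. */
          return ((uint32_t)(key & 0x0f) << 12) | MARK_MAGIC_ENCRYPT;
  }

  int main(void)
  {
          printf("mark for key 3: 0x%x\n", (unsigned)or_encrypt_key(3));
          return 0;
  }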

Here is where the trouble starts. The packet will miss the encryption
policy rules, because unless encryptNode is enabled we have no rule
matching srcIP=nodeIP (see the sketch below). So the unencrypted
packet is sent back to cilium_host using a route in the encryption
routing table, and finally bpf_host sends the skb to the overlay,
cilium_vxlan.
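
To make the miss concrete, a rough sketch of the selector match with
illustrative CIDRs; with encryptNode disabled the policy selectors
only cover pod subnets:

  /* Sketch only: xfrm policies select on src/dst subnets. Without
   * encryptNode no selector's src covers the nodeIP, so the lookup
   * misses and the skb continues in plaintext. */
  #include <stdint.h>
  #include <stdio.h>

  static int in_subnet(uint32_t ip, uint32_t net, uint32_t mask)
  {
          return (ip & mask) == net;
  }

  int main(void)
  {
          uint32_t pod_net  = 0x0a800000, pod_mask = 0xffff0000; /* 10.128/16 */
          uint32_t node_ip  = 0x0a000004; /* 10.0.0.4 */
          uint32_t pod_ip   = 0x0a800581; /* 10.128.5.129 */

          if (in_subnet(node_ip, pod_net, pod_mask) &&
              in_subnet(pod_ip, pod_net, pod_mask))
                  printf("policy hit: encrypt\n");
          else
                  printf("policy miss: forwarded unencrypted\n");
          return 0;
  }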

The packet is then received by the podIP's node and handed to
cilium_vxlan because it's a vxlan packet. The vxlan header is popped
off and the inner (unencrypted) packet is sent to the pod; remember,
srcIP=nodeIP still.

Assuming the above was a SYN packet, the pod will respond with a
SYN/ACK with srcIP=podIP, dstIP=nodeIP. This will be handled by the
bpf_lxc program.

First, we will do an ipcache lookup and get an ipcache entry, but this
entry will not have a tunnelIP. It will have a key though. Because no
tunnelIP is available, a tunnel map lookup will be done. This will
fail because the dstIP is a nodeIP and not in a tunnel subnet. (The
tunnel map key is derived by masking the dstIP with the subnet.)
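
A small sketch of that key derivation, with illustrative /24 pod
subnets; the masking means only addresses inside a pod CIDR can ever
hit an entry:

  /* Sketch only: the tunnel map is keyed by the dstIP masked to the
   * tunnel (pod) subnet. A nodeIP masks to a base address for which
   * no entry was ever inserted, so the lookup fails. */
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t tunnel_map_key(uint32_t dst_ip, int prefix_len)
  {
          uint32_t mask = prefix_len ? ~0u << (32 - prefix_len) : 0;
          return dst_ip & mask;
  }

  int main(void)
  {
          uint32_t pod_ip  = 0x0a800581; /* 10.128.5.129 -> key 10.128.5.0 */
          uint32_t node_ip = 0x0a000004; /* 10.0.0.4     -> key 10.0.0.0  */

          printf("pod key:  0x%08x (entry present)\n",
                 (unsigned)tunnel_map_key(pod_ip, 24));
          printf("node key: 0x%08x (no entry, lookup fails)\n",
                 (unsigned)tunnel_map_key(node_ip, 24));
          return 0;
  }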

Because the program did not find a valid tunnel, the packet is sent to
the stack. The stack will run it through iptables, but because the
packet flow is asymmetric (SYN over cilium_vxlan, SYN/ACK over the
nic) the MASQUERADE rules will be skipped (see the sketch below). The
end result is that we try to send a packet with srcIP=podIP.
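
The skip follows from conntrack semantics: iptables NAT is evaluated
only for the first packet of a flow, and later packets reuse the
recorded verdict. A minimal sketch of that behavior:

  /* Sketch only: the SYN arrived over cilium_vxlan and created the
   * conntrack entry without NAT; the SYN/ACK reuses that verdict and
   * never hits the MASQUERADE rule on its way out the nic. */
  #include <stdio.h>

  struct ct_entry {
          int nat_applied; /* decided when the flow's first pkt was seen */
  };

  static const char *egress_src(const struct ct_entry *ct)
  {
          return ct->nat_applied ? "nodeIP (masqueraded)" : "podIP";
  }

  int main(void)
  {
          struct ct_entry ct = { .nat_applied = 0 };

          printf("SYN/ACK leaves with srcIP=%s\n", egress_src(&ct));
          return 0;
  }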

Here a couple of different outcomes are possible. If the network infra
is strict, the packet may be dropped at the sending node, which
refuses to send a packet with an unknown srcIP. If the receiving node
has rp_filter=1 on the receiving interface, the packet will be dropped
with a martian IP message in dmesg. Or, if rp_filter=0 and the network
knows how to route the srcIP somehow, it might work; for example, some
of my test systems managed to work without any issues.

In order to fix this we can ensure reply packets are symmetric to the
request. To do this we inject tunnel info into the nodeIP's ipcache
entry.
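
In the style of the ipcache dump above, the nodeIP entry then carries
a tunnelIP as well; the identity and key values here are illustrative:

  10.0.0.4/32         6 3 10.0.0.4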

Once this is done, a replying pod sends a packet with srcIP=podIP,
dstIP=nodeIP, which is picked up by the bpf_lxc program. The ipcache
lookup now finds a complete entry with the tunnel info. Instead of
sending the packet to the stack, it is encapsulated in a vxlan header
and sent/received on the original requesting node.
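
A sketch of the resulting reply-path decision (simplified; the real
bpf_lxc code differs in detail):

  /* Sketch only: with tunnel info present, the reply is encapsulated
   * back to the requesting node instead of falling to the stack. */
  #include <stdint.h>
  #include <stdio.h>

  struct ipcache_entry {
          uint32_t tunnel_endpoint; /* 0 when unknown */
          uint8_t  encrypt_key;
  };

  static const char *route_reply(const struct ipcache_entry *e)
  {
          if (e && e->tunnel_endpoint)
                  return "encap to tunnel_endpoint (symmetric reply)";
          return "pass to stack (asymmetric, may be dropped)";
  }

  int main(void)
  {
          struct ipcache_entry before = { 0, 0 };
          struct ipcache_entry after  = { 0x0a000004, 3 }; /* 10.0.0.4 */

          printf("before fix: %s\n", route_reply(&before));
          printf("after fix:  %s\n", route_reply(&after));
          return 0;
  }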

A quick observation: some of the passes through the xfrm stack are
useless here. We know the miss is going to happen and can signal this
to the BPF layer by clearing the encryption key in the ipcache. We
will do this in a follow-up patch.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
1 parent 291c533
vagrant_box_defaults.rb
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.require_version ">= 2.2.0"

$SERVER_BOX = "cilium/ubuntu-dev"
$SERVER_VERSION = "173"
$NETNEXT_SERVER_BOX = "cilium/ubuntu-next"
$NETNEXT_SERVER_VERSION = "59"
@v419_SERVER_BOX = "cilium/ubuntu-4-19"
@v419_SERVER_VERSION = "14"
@v49_SERVER_BOX = "cilium/ubuntu"
@v49_SERVER_VERSION = "173"