Revision - 34337f0 - bpf: lxc: simplify RevNAT path for loopback replies

Revision 34337f0257d5fa2d4eccca8e1c514918694fab1a authored by Julian Wiedmann on 10 May 2024, 21:05:55 UTC, committed by Julian Wiedmann on 14 June 2024, 11:45:10 UTC

bpf: lxc: simplify RevNAT path for loopback replies

The usual flow for handling service traffic to a local backend is as
follows:
* requests are load-balanced in from-container. This entails selecting
a backend (and caching the selection in a CT_SERVICE entry), DNATing the
packet, creating a CT_EGRESS entry for the resulting `client -> backend`
flow, applying egress network policy, and local delivery to the backend
pod. As part of the local delivery, we also create a CT_INGRESS entry and
apply ingress network policy.
* replies bypass the backend's egress network policy (because the CT
lookup returns CT_REPLY), and pass to the client via local delivery. In
the client's ingress path they bypass ingress network policy (the packets
match as reply against the CT_EGRESS entry), and we apply RevDNAT based on
the `rev_nat_index` in the CT_EGRESS entry.

For a loopback connection (where the client pod is selected as backend for
the connection) this looks slightly more complicated:
* As we can't establish a `client -> client` connection, the requests are
also SNATed with IPV4_LOOPBACK. Network policy in the forward direction is
explicitly skipped (as the matched CT entries have the `.loopback` flag
set).
* In the reply direction, we can't deliver to IPV4_LOOPBACK (as that's not a
valid IP for an endpoint lookup). So a reply already gets fully RevNATed
by from-container, using the CT_INGRESS entry's `rev_nat_index`. But this
means that when passing into the client pod (either via to-container, or
via the ingress policy tail call), the packet doesn't match as a reply to
the CT_EGRESS entry - and so we don't benefit from automatic network policy
bypass. We ended up with two workarounds for this aspect:
(1) when to-container is installed, it contains custom logic to match the
    packet as a loopback reply, and skip ingress policy
    (see https://github.com/cilium/cilium/pull/27798).
(2) otherwise we skip the ingress policy tail call, and forward the packet
    straight into the client pod.

The downside of these workarounds is that we bypass the *whole* ingress
program, not just the network policy part. So the CT_EGRESS entry doesn't
get updated (lifetime, statistics, observed packet flags, ...), and we
have the hidden risk that when we add more logic to the ingress program,
it doesn't get executed for loopback replies.
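
To illustrate why an early-RevNATed reply no longer matches the CT_EGRESS
entry, here is a minimal, self-contained sketch. It is a toy model and not
the Cilium datapath: every `sketch_*` / `SKETCH_*` name, the address values,
and the assumption that the CT_EGRESS entry records the forward flow as
`IPV4_LOOPBACK -> backend` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder addresses for illustration only (host byte order). */
#define SKETCH_LOOPBACK_IP 0xA9FE2A01u /* stands in for IPV4_LOOPBACK */
#define SKETCH_POD_IP      0x0A000005u /* the pod is client *and* backend */
#define SKETCH_SERVICE_IP  0x0A600001u /* the service address */

struct sketch_pkt {
	uint32_t saddr;
	uint32_t daddr;
};

/* Forward-direction addresses recorded when the connection was created.
 * Assumption of this model: for loopback, fwd_saddr is the loopback IP.
 */
struct sketch_ct_entry {
	uint32_t fwd_saddr;
	uint32_t fwd_daddr;
};

/* A packet matches as reply if its addresses mirror the forward tuple. */
static bool sketch_is_reply(const struct sketch_ct_entry *ct,
			    const struct sketch_pkt *p)
{
	return p->saddr == ct->fwd_daddr && p->daddr == ct->fwd_saddr;
}

/* Old behavior (modeled): from-container fully RevNATs the loopback reply
 * before it is handed to the client's ingress path.
 */
static struct sketch_pkt sketch_early_revnat(struct sketch_pkt reply)
{
	assert(reply.daddr == SKETCH_LOOPBACK_IP);
	reply.saddr = SKETCH_SERVICE_IP; /* undo the DNAT */
	reply.daddr = SKETCH_POD_IP;     /* undo the loopback SNAT */
	return reply;
}
```

In this model, a raw reply `backend -> IPV4_LOOPBACK` mirrors the entry and
matches, while the same packet after `sketch_early_revnat()` carries
`service -> client` and does not. That mismatch is what the two workarounds
compensate for.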

This patch aims to eliminate the need for such workarounds. At its core,
it detects loopback replies in from-container and overrides the packet's
destination IP. Instead of attempting an endpoint lookup for IPV4_LOOPBACK,
we can now look up the actual client endpoint - and deliver to the ingress
policy program, *without* needing to early-RevNAT the packet. Instead the
replies follow the usual packet flow, match the CT_EGRESS entry in the
ingress program, naturally bypass ingress network policy, and are *then*
RevNATed based on the CT_EGRESS entry's `rev_nat_index`.

Consequently we follow the standard datapath, without needing to skip over
policy programs. The CT_EGRESS entry is updated for every reply.
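
The new reply flow can be sketched roughly as follows. This is a simplified
model under stated assumptions, not the code from this patch: all
`sketch_*` / `SKETCH_*` names and address values are hypothetical, the
CT_EGRESS entry is assumed to record the forward flow as
`IPV4_LOOPBACK -> backend`, and the lookup override relies on the client and
the backend being the same pod (so the reply's source IP identifies the
client endpoint).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder addresses for illustration only (host byte order). */
#define SKETCH_LOOPBACK_IP 0xA9FE2A01u /* stands in for IPV4_LOOPBACK */
#define SKETCH_POD_IP      0x0A000005u /* the pod is client *and* backend */
#define SKETCH_SERVICE_IP  0x0A600001u /* the service address */

struct sketch_pkt {
	uint32_t saddr;
	uint32_t daddr;
};

struct sketch_ct_entry {
	uint32_t fwd_saddr;     /* forward tuple source (loopback IP here) */
	uint32_t fwd_daddr;     /* forward tuple destination (backend) */
	bool loopback;
	uint16_t rev_nat_index;
	uint32_t rx_packets;    /* stands in for lifetime/statistics updates */
};

struct sketch_rev_nat {
	uint32_t service_ip;
};

/* from-container (modeled): choose the address for the endpoint lookup.
 * For a loopback reply, look up the actual client instead of the
 * loopback IP. The packet itself is left untouched.
 */
static uint32_t sketch_lookup_addr(const struct sketch_pkt *reply,
				   bool ct_loopback)
{
	if (ct_loopback && reply->daddr == SKETCH_LOOPBACK_IP)
		return reply->saddr; /* client == backend for loopback */
	return reply->daddr;
}

/* Ingress program (modeled): match the reply against CT_EGRESS, update
 * the entry, and only then RevNAT. Returns false if the packet is not a
 * reply and would have to pass ingress network policy instead.
 */
static bool sketch_ingress_reply(struct sketch_pkt *p,
				 struct sketch_ct_entry *ct_egress,
				 const struct sketch_rev_nat *table,
				 uint16_t table_len)
{
	uint32_t old_saddr = p->saddr;

	if (p->saddr != ct_egress->fwd_daddr ||
	    p->daddr != ct_egress->fwd_saddr)
		return false;

	assert(ct_egress->rev_nat_index < table_len);
	ct_egress->rx_packets++;
	p->saddr = table[ct_egress->rev_nat_index].service_ip;
	if (ct_egress->loopback)
		p->daddr = old_saddr; /* restore the client address */
	return true;
}
```

The point of the sketch is the ordering: the endpoint lookup uses an
overridden address, but the packet reaches the ingress program still in its
`backend -> IPV4_LOOPBACK` form, so the CT match (and the entry update)
happens before any address rewrite.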

Thus we can also remove the manual policy bypass for loopback replies
when using per-EP routing. It's no longer needed, and in fact the
replies will no longer match the lookup logic, as they haven't
been RevNATed yet. This effectively reverts
e2829a061a53 ("bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes").

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

1 parent b66862a
