Revision 703e07a3540caf9e393b6a2ad8d0ce769bc67403 authored by Joe Stringer on 04 August 2023, 00:49:52 UTC, committed by Joe Stringer on 06 August 2023, 18:39:36 UTC
Boris, Jason and Kim-Eirik report that ten minutes after startup, Cilium releases CIDR identities that are allowing Cilium to implement network policy, which leads to connectivity outages. Boris traced this down to commit c96b9d81a8a6 ("ipcache: Remove superfluous if condition") which attempted to solve a reference leak issue in the core of ipcache. Upon deeper inspection, the agent's CIDR Identity restore logic was not playing well with network policy allocations of CIDR Identity in the ipcache, due to a misalignment of expectations of ownership over Identity and IPcache entry structures between the two agent modules. This commit resolves the misalignment by moving the CIDR Identity restore logic over to the newer IPCache APIs, thereby balancing the reference counting ownership between the different code paths. Previously the daemon on startup would: 1. Read previously-used CIDRs from the bpf ipcache map 2. Allocate the old numeric identities for the corresponding CIDRs 3. Directly inject entries into the new bpf ipcachemap for these CIDR to Identity mappings, to be used by an upcoming datapath reinitialization. 4. Wait for the identity-restore-grace-period (default 10m) 5. Release its references to the identities 6. If these are the final references on the identities, attempt to remove the corresponding prefix->identity mappings from the ipcache. The goal is to ensure that after Cilium starts up, it tries to retain the same identity numbers for the same prefixes in order to minimize the diff between policymaps as endpoints are progressively migrated from the datapath configured by the old Cilium agent towards the new one understood by this newly started agent. Step (3) above uses the old ipcache APIs that delegate responsibility to callers to properly reference count the Identities. However, nothing was stopping other parts of Cilium from configuring network policies for the same IP prefixes into the ipcache using the new ipcache APIs, where the ipcache package itself takes care of reference counting. If another subsystem like the policy engine requests the ipcache to allocate & inject an identity + IP mapping into the datapath during step (4) above, it would hit the inner loop of InjectLabels(). That logic attempts to (re-)allocate an identity for the CIDR, which would coincide with the identity allocated in Step (2) above. Subsequently, it checks whether the Identity is already pushed into the datapath. If yes, then it will release the (re-)allocated identity reference for the CIDR, thinking that it had previously grabbed a reference on the Identity. This is actually incorrect, the ipcache never acquired a reference on the Identity for itself! In the short term connectivity is kept alive because the CIDR retains a reference from the startup logic, but when the identity restore grace period runs out, this triggers persistent drops on the datapath until a user re-inserts their network policies or otherwise triggers re-evaluation of the policy paths that inject mappings into the ipcache. In this scenario, when the agent gets to step (5), it will release the final reference to each such CIDR. This releases the identity internally across the agent so that the policy engine no longer has any knowledge of it, which triggers policymap entry deletion on the datapath, leading to policy drops. As it turns out, due to the way that the ipcache entry is inserted into the datapath with a higher priority source, the logic in step (6) actually fails to clean up the datapath's ipcache entries since the entry is not really owned by the daemon restore logic, but by the ipcache at this stage. The result is that the datapath's ipcache maps certain CIDRs to identities that do not exist in the agent. In order to resolve this conflict, this commit moves steps (3) and (6) for the identity restore logic over to the new ipcache APIs, so that the ipcache can correctly allocate its own references for the identities and keep them alive. The daemon startup logic for identity restore still separately allocates and releases its own references to the underlying identities in order to ensure that the identities retain the prefix correspondence across upgrades, but this reference is no longer tied up with the old ipcache interfaces that caused the reference count ambiguity. As a matter of process to clearly delineate the codepaths for step (5), this commit also moves that code into the daemon so that there is no ambiguity over whether the daemon's identity references are tied to the ipcache or not. Fixes: c96b9d81a8a6 ("ipcache: Remove superfluous if condition") Reported-by: Boris Petrovic <carnerito.b@gmail.com> Reported-by: Kim-Eirik Karlsen <kim.eirik@gmail.com> Reported-by: Jason Witkowski <jason@witkow.ski> Signed-off-by: Joe Stringer <joe@cilium.io>
1 parent 841f6be
File | Mode | Size |
---|---|---|
.devcontainer | ||
.github | ||
.nvim | ||
.travis | ||
.vscode | ||
Documentation | ||
api | ||
bpf | ||
bugtool | ||
cilium | ||
cilium-health | ||
clustermesh-apiserver | ||
contrib | ||
daemon | ||
envoy | ||
examples | ||
hack | ||
hubble-relay | ||
images | ||
install | ||
jenkinsfiles | ||
kvstoremesh | ||
operator | ||
pkg | ||
plugins | ||
test | ||
tools | ||
vendor | ||
.authors.aux | -rw-r--r-- | 416 bytes |
.clang-format | -rw-r--r-- | 3.9 KB |
.clomonitor.yml | -rw-r--r-- | 984 bytes |
.gitattributes | -rw-r--r-- | 887 bytes |
.gitignore | -rw-r--r-- | 1.8 KB |
.golangci.yaml | -rw-r--r-- | 3.6 KB |
.mailmap | -rw-r--r-- | 6.5 KB |
.travis.yml | -rw-r--r-- | 506 bytes |
AUTHORS | -rw-r--r-- | 41.5 KB |
CHANGELOG.md | -rw-r--r-- | 112.0 KB |
CODEOWNERS | -rw-r--r-- | 11.2 KB |
CODE_OF_CONDUCT.md | -rw-r--r-- | 2.2 KB |
CONTRIBUTING.md | -rw-r--r-- | 691 bytes |
FURTHER_READINGS.rst | -rw-r--r-- | 6.4 KB |
LICENSE | -rw-r--r-- | 11.1 KB |
MAINTAINERS.md | -rw-r--r-- | 4.3 KB |
Makefile | -rw-r--r-- | 34.9 KB |
Makefile.defs | -rw-r--r-- | 6.9 KB |
Makefile.docker | -rw-r--r-- | 7.3 KB |
Makefile.quiet | -rw-r--r-- | 818 bytes |
README.rst | -rw-r--r-- | 19.4 KB |
SECURITY.md | -rw-r--r-- | 1.0 KB |
USERS.md | -rw-r--r-- | 28.5 KB |
VERSION | -rw-r--r-- | 7 bytes |
Vagrantfile | -rw-r--r-- | 14.9 KB |
go.mod | -rw-r--r-- | 12.2 KB |
go.sum | -rw-r--r-- | 152.7 KB |
netlify.toml | -rw-r--r-- | 92 bytes |
vagrant_box_defaults.rb | -rw-r--r-- | 334 bytes |
Computing file changes ...