Revision 703e07a3540caf9e393b6a2ad8d0ce769bc67403 authored by Joe Stringer on 04 August 2023, 00:49:52 UTC, committed by Joe Stringer on 06 August 2023, 18:39:36 UTC
Boris, Jason and Kim-Eirik report that ten minutes after startup, Cilium
releases CIDR identities that are allowing Cilium to implement network
policy, which leads to connectivity outages. Boris traced this down to
commit c96b9d81a8a6 ("ipcache: Remove superfluous if condition") which
attempted to solve a reference leak issue in the core of ipcache. Upon
deeper inspection, the agent's CIDR Identity restore logic was not
playing well with network policy allocations of CIDR Identity in the
ipcache, due to a misalignment of expectations of ownership over
Identity and IPcache entry structures between the two agent modules.
This commit resolves the misalignment by moving the CIDR Identity
restore logic over to the newer IPCache APIs, thereby balancing the
reference counting ownership between the different code paths.

Previously the daemon on startup would:
1. Read previously-used CIDRs from the bpf ipcache map
2. Allocate the old numeric identities for the corresponding CIDRs
3. Directly inject entries into the new bpf ipcachemap for these CIDR to
   Identity mappings, to be used by an upcoming datapath reinitialization.
4. Wait for the identity-restore-grace-period (default 10m)
5. Release its references to the identities
6. If these are the final references on the identities, attempt to remove
   the corresponding prefix->identity mappings from the ipcache.

The goal is to ensure that after Cilium starts up, it tries to retain
the same identity numbers for the same prefixes in order to minimize the
diff between policymaps as endpoints are progressively migrated from the
datapath configured by the old Cilium agent towards the new one
understood by this newly started agent.

Step (3) above uses the old ipcache APIs that delegate responsibility to
callers to properly reference count the Identities. However, nothing was
stopping other parts of Cilium from configuring network policies for the
same IP prefixes into the ipcache using the new ipcache APIs, where the
ipcache package itself takes care of reference counting.

If another subsystem like the policy engine requests the ipcache to
allocate & inject an identity + IP mapping into the datapath during step
(4) above, it would hit the inner loop of InjectLabels(). That logic
attempts to (re-)allocate an identity for the CIDR, which would coincide
with the identity allocated in Step (2) above. Subsequently, it checks
whether the Identity is already pushed into the datapath. If yes, then
it will release the (re-)allocated identity reference for the CIDR,
thinking that it had previously grabbed a reference on the Identity.
This is actually incorrect, the ipcache never acquired a reference on
the Identity for itself! In the short term connectivity is kept alive
because the CIDR retains a reference from the startup logic, but when
the identity restore grace period runs out, this triggers persistent
drops on the datapath until a user re-inserts their network policies or
otherwise triggers re-evaluation of the policy paths that inject
mappings into the ipcache.

In this scenario, when the agent gets to step (5), it will release the
final reference to each such CIDR. This releases the identity internally
across the agent so that the policy engine no longer has any knowledge
of it, which triggers policymap entry deletion on the datapath, leading
to policy drops.

As it turns out, due to the way that the ipcache entry is inserted into
the datapath with a higher priority source, the logic in step (6)
actually fails to clean up the datapath's ipcache entries since the
entry is not really owned by the daemon restore logic, but by the
ipcache at this stage. The result is that the datapath's ipcache maps
certain CIDRs to identities that do not exist in the agent.

In order to resolve this conflict, this commit moves steps (3) and (6)
for the identity restore logic over to the new ipcache APIs, so that the
ipcache can correctly allocate its own references for the identities and
keep them alive. The daemon startup logic for identity restore still
separately allocates and releases its own references to the underlying
identities in order to ensure that the identities retain the prefix
correspondence across upgrades, but this reference is no longer tied up
with the old ipcache interfaces that caused the reference count
ambiguity.

As a matter of process to clearly delineate the codepaths for step (5),
this commit also moves that code into the daemon so that there is no
ambiguity over whether the daemon's identity references are tied to the
ipcache or not.

Fixes: c96b9d81a8a6 ("ipcache: Remove superfluous if condition")
Reported-by: Boris Petrovic <carnerito.b@gmail.com>
Reported-by: Kim-Eirik Karlsen <kim.eirik@gmail.com>
Reported-by: Jason Witkowski <jason@witkow.ski>
Signed-off-by: Joe Stringer <joe@cilium.io>
1 parent 841f6be
History
File Mode Size
.devcontainer
.github
.nvim
.travis
.vscode
Documentation
api
bpf
bugtool
cilium
cilium-health
clustermesh-apiserver
contrib
daemon
envoy
examples
hack
hubble-relay
images
install
jenkinsfiles
kvstoremesh
operator
pkg
plugins
test
tools
vendor
.authors.aux -rw-r--r-- 416 bytes
.clang-format -rw-r--r-- 3.9 KB
.clomonitor.yml -rw-r--r-- 984 bytes
.gitattributes -rw-r--r-- 887 bytes
.gitignore -rw-r--r-- 1.8 KB
.golangci.yaml -rw-r--r-- 3.6 KB
.mailmap -rw-r--r-- 6.5 KB
.travis.yml -rw-r--r-- 506 bytes
AUTHORS -rw-r--r-- 41.5 KB
CHANGELOG.md -rw-r--r-- 112.0 KB
CODEOWNERS -rw-r--r-- 11.2 KB
CODE_OF_CONDUCT.md -rw-r--r-- 2.2 KB
CONTRIBUTING.md -rw-r--r-- 691 bytes
FURTHER_READINGS.rst -rw-r--r-- 6.4 KB
LICENSE -rw-r--r-- 11.1 KB
MAINTAINERS.md -rw-r--r-- 4.3 KB
Makefile -rw-r--r-- 34.9 KB
Makefile.defs -rw-r--r-- 6.9 KB
Makefile.docker -rw-r--r-- 7.3 KB
Makefile.quiet -rw-r--r-- 818 bytes
README.rst -rw-r--r-- 19.4 KB
SECURITY.md -rw-r--r-- 1.0 KB
USERS.md -rw-r--r-- 28.5 KB
VERSION -rw-r--r-- 7 bytes
Vagrantfile -rw-r--r-- 14.9 KB
go.mod -rw-r--r-- 12.2 KB
go.sum -rw-r--r-- 152.7 KB
netlify.toml -rw-r--r-- 92 bytes
vagrant_box_defaults.rb -rw-r--r-- 334 bytes

README.rst

back to top