d24de7d Revert "policy: Fix Deny Precedence Bug" This reverts commit b6970da61cf40b6dccc92e6ce60369c195c9a21e. Except for the logrus update. Signed-off-by: Marga Manterola <marga@isovalent.com> 15 September 2023, 16:57:56 UTC
f0f5639 Revert "bug: Fix Potential Nil Reference in GetLables Implementation" This reverts commit 7dc319e4e8e5c7e1d42d4e2f8c0fbc364bcd6b60. Signed-off-by: Marga Manterola <marga@isovalent.com> 15 September 2023, 16:55:15 UTC
99249ec Revert "policy: Export DenyPreferredInsertWithChanges, make revertible" This reverts commit 6dc3fb3d137af367a5a75245257f848e5a0c437e. Signed-off-by: Marga Manterola <marga@isovalent.com> 15 September 2023, 16:53:35 UTC
3305147 Revert "endpoint: Do not override deny entries with proxy redirects" This reverts commit fefa16b03ab2e1ce8b208be3252a4b7bcfb3d9c3. Signed-off-by: Marga Manterola <marga@isovalent.com> 15 September 2023, 16:53:07 UTC
878beba Switch GKE version to 1.24, 1.23 is deprecated Signed-off-by: Marga Manterola <marga@isovalent.com> 09 September 2023, 03:08:12 UTC
8209d85 cilium: Ensure xfrm state is initialized for route IP before publish [ upstream commit c9ea7a52bd59c167c6e7611d4976e3c041f4e7f0 ] When rolling cilium-agent or doing an upgrade while running a stress test with encryption, a small number of NoStateIn errors are seen. To capture the error state (a cilium_host IP without an xfrm state rule) you need to get into the pod near pod init and get somewhat lucky that init took a bit longer. For example, I ran `ip x s` in a pod about 15 seconds after launch and captured a case with new XfrmInNoErrors, a cilium_host IP assigned, but no xfrm state rule for it. The packets received are dropped. The conclusion is that remote nodes learn the new router IP before we have the xfrm state rule loaded. The remote nodes then start using that IP as the IPSec tunnel outer IP, resulting in the errors when they reach the local node without the xfrm rule yet. The errors eventually resolve, but some packets are lost in the meantime. The reason this happens is twofold. First, we configure the datapath after we push node object updates. This is wrong because we need to init the ipsec code path before we teach remote nodes about the new IP. Second, the configuration of the datapath does a lookup in the node object's IPAddresses{}, which is only populated from the k8s watcher in the tunnel case. So we only have the fully populated node object after we receive it through the k8s watcher. Again, it's possible other nodes have already seen the event and started pushing traffic with the new IPs. To resolve this, push the IPSec init code that creates the xfrm rules needed for the new IPs before we publish them to the k8s node object. And instead of pulling the IPs out of the node object, simply pull them directly from the node module. This resolves the XfrmInNoState and XfrmIn*Policy* errors I've seen. I can consistently reproduce the errors with about 30 nodes, with an httperf test running from a pod on all nodes, and then doing a 'rollout' of the cilium agent for a while. It seems 2-3 hours almost ensures errors pop up; usually the errors happen much sooner. Initially I saw these errors on upgrade tests, which is another way to reproduce. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Margarita Manterola <margamanterola@gmail.com> 09 September 2023, 03:08:12 UTC
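The fix above is essentially an ordering change: install the xfrm state for the new router IP before advertising that IP to other nodes. The following is a minimal Go sketch of that ordering only; the function names are illustrative stand-ins, not the agent's real API.

```go
package main

import "fmt"

// Stand-ins for the agent's node-update path. Only the ordering matters:
// the xfrm state for the new cilium_host (router) IP must exist before
// remote nodes can learn that IP and start using it as a tunnel endpoint.
func configureIPsecForRouterIP(ip string) { fmt.Println("xfrm state installed for", ip) }
func publishNodeObject(ip string)         { fmt.Println("node object published with router IP", ip) }

func main() {
	routerIP := "10.0.1.61" // illustrative router IP

	configureIPsecForRouterIP(routerIP) // 1. IPsec/datapath init first
	publishNodeObject(routerIP)         // 2. only then advertise the IP to remote nodes
}
```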
e4eded5 ci: remove unavailable K8s 1.22 from GKE config [ upstream commit 4440b3e1d188c84478ea77f20525cb142bbad236 ] As of August 08 2023, K8s version 1.22 is no longer available for GKE clusters. Therefore, this commit removes the version from the CI GKE matrix configuration. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 25 August 2023, 19:17:28 UTC
e71faf5 ipsec: discard trace events when monitor aggregation is enabled [ upstream commit 3e5282274d1a415e4496e5ce51956d59b277f7be ] This commit modifies calls to `send_trace_notify` in IPSec contexts where connection tracking information is not available to drop events when monitor aggregation is enabled. Benchmarks were performed on a two node, e2-custom-4-8192, cluster, using pod to pod netperf stream, RR and CRR tests. Cilium v1.14.0 was installed using unstripped images with IPSec, VXLAN tunnel mode, and Hubble enabled. Tests were performed in the order of stream, RR and CRR for each configuration. Between each test, the netserver was reset and each node's conntrack tables were cleared. Pod to pod netperf stream test on v1.14.0, one client per core:
```csv
Throughput,Throughput Units,Elapsed Time (sec)
464.06,10^6bits/s,180.04
461.25,10^6bits/s,180.02
441.73,10^6bits/s,180.00
482.26,10^6bits/s,180.04
```
Pod to pod netperf stream test on this commit, one client per core:
```csv
Throughput,Throughput Units,Elapsed Time (sec)
698.48,10^6bits/s,180.03
643.81,10^6bits/s,180.05
780.22,10^6bits/s,180.05
552.11,10^6bits/s,180.03
```
Taking the average of each client's throughput, this change leads to an increase in throughput by +45%. Pod to pod netperf RR test on v1.14.0, one client per core:
```csv
50th Percentile Latency Microseconds,90th Percentile Latency Microseconds,99th Percentile Latency Microseconds,Round Trip Latency usec/tran,Request Size Bytes,Response Size Bytes,Elapsed Time (sec)
302,486,1350,361.258,1,1,180.00
312,501,1304,372.272,1,1,180.00
320,530,1338,383.703,1,1,180.00
302,459,1286,355.595,1,1,180.00
```
Pod to pod netperf RR test on this commit, one client per core:
```csv
50th Percentile Latency Microseconds,90th Percentile Latency Microseconds,99th Percentile Latency Microseconds,Round Trip Latency usec/tran,Request Size Bytes,Response Size Bytes,Elapsed Time (sec)
199,256,370,213.484,1,1,180.00
198,254,366,212.594,1,1,180.00
202,261,371,216.930,1,1,180.00
199,257,366,213.430,1,1,180.00
```
Taking the worst 99th percentile latency from each test, this change leads to a reduction in p99 latency by 76.5%. Pod to pod netperf CRR test on v1.14.0, one client per core:
```csv
50th Percentile Latency Microseconds,90th Percentile Latency Microseconds,99th Percentile Latency Microseconds,Round Trip Latency usec/tran,Request Size Bytes,Response Size Bytes,Elapsed Time (sec)
935,3910,6920,3337.723,1,1,180.00
914,3308,6432,2502.398,1,1,180.00
933,3797,6843,3199.433,1,1,180.00
927,3761,6812,2855.196,1,1,180.00
```
Pod to pod netperf CRR test on this commit, one client per core:
```csv
50th Percentile Latency Microseconds,90th Percentile Latency Microseconds,99th Percentile Latency Microseconds,Round Trip Latency usec/tran,Request Size Bytes,Response Size Bytes,Elapsed Time (sec)
683,4680,5803,3155.128,1,1,180.00
678,4627,5782,2958.921,1,1,180.00
683,4658,5830,3095.390,1,1,180.00
680,4646,5787,3092.147,1,1,180.00
```
Taking the worst 99th percentile latency from each test, this change leads to a reduction in p99 latency by 15.8%. Fixes: #26648 Signed-off-by: Ryan Drew <ryan.drew@isovalent.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 25 August 2023, 19:17:28 UTC
aac3d44 docs: Document DROP_NO_NODE_ID for IPsec [ upstream commit b3171b99441a84ad307fdf588e1c954fc7bcfd88 ] This commit documents the new drop reason introduced for IPsec in 6109a38bc7 ("bpf/ipsec: Stop relying on ipcache node_id field"). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 25 August 2023, 19:17:28 UTC
d38104c operator: Adjust CiliumEndpoint GC to only run once in kvstore mode [ upstream commit 901b74938aa630b868f7274fd54d0b3932328e6c ] This commit adjusts the logic that the operator uses to control endpoint garbage collection and syncing in order to account for the case when CiliumEndpoint CRDs are disabled in kvstore mode. The operator will now only start the garbage collector if CiliumEndpoint CRD mode is enabled, and if it isn't enabled, the operator will check to ensure that the CiliumEndpoint CRD is installed in the cluster before starting a one-off gc sync. This will allow users to use kvstore mode without having to worry about setting the endpoint gc interval to zero, and it covers the case where a user transitions from CRD mode to kvstore mode. Fixes: cilium/cilium#24440 Signed-off-by: Ryan Drew <ryan.drew@isovalent.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 25 August 2023, 19:17:28 UTC
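The decision the operator makes here is a small branch: run periodic GC only in CRD mode, otherwise do a single GC pass if (and only if) the CiliumEndpoint CRD is present. A minimal Go sketch of that control flow follows; the function names and return values are illustrative assumptions, not the operator's real API.

```go
package main

import "fmt"

// Illustrative stand-ins for the operator's checks.
func ciliumEndpointCRDEnabled() bool   { return false } // kvstore mode in this example
func ciliumEndpointCRDInstalled() bool { return true }

func startPeriodicEndpointGC() { fmt.Println("starting periodic CiliumEndpoint GC") }
func runOneOffEndpointGC()     { fmt.Println("running one-off CiliumEndpoint GC sync") }

func main() {
	if ciliumEndpointCRDEnabled() {
		// CRD mode: the garbage collector runs on its usual interval.
		startPeriodicEndpointGC()
		return
	}
	// kvstore mode: only do a single GC pass, and only if the CRD is
	// actually installed (covers the CRD -> kvstore migration case).
	if ciliumEndpointCRDInstalled() {
		runOneOffEndpointGC()
	}
}
```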
95f8175 cilium/ipsec: only update cilium_encrypt_map after xfrm is configured [upstream commit 4f3bc9f1560c98350c810310127d9421e35cdb61] [backporter's note: Had to apply manually because of many conflicts.] The cilium_encrypt_map is used to determine what key the datapath should use. This is done by setting the mark value of the skb; the XFRM policy then matches on the mark value to designate what encryption policy/state to use. However, on key rotation we have an issue where the map entry with the key is updated before the xfrm policy is plumbed. The result is that it's possible to mark the skb with a value that has no matching xfrm policy, resulting in a policy block error and a dropped skb. To resolve this, ensure we do setup in the correct order and only set the min key in the cilium_encrypt_map after the policy has been updated. Signed-off-by: John Fastabend <john.fastabend@gmail.com> 21 August 2023, 21:47:56 UTC
2f712d1 cilium: trigger validation of all nodes when key updated [upstream commit bba3dfcec7b5626fde845fab4df094275d30c371] [backporter's notes: This was applied manually because it was mostly conflicts and even patch had some trouble.] On key update we only update the policy for our local node. But with the latest round of changes we need to update for all nodes in the node cache. Without this we would rely on the validation interval timer to sync the nodes' policy before we remove the old policy. This may or may not happen depending on how large the cluster is. Further, I've seen us miss it even on relatively small clusters, say around 30 nodes, so it seems it's not entirely reliable to count on. Rather than rely on some timer external to ipsec to fire and sync the policies, and to do it hopefully in our time window, let's just force the nodeUpdate() call on all nodes in the cache when we get the key rotate event. Signed-off-by: John Fastabend <john.fastabend@gmail.com> 21 August 2023, 21:47:56 UTC
6c208f1 install: Update image digests for v1.11.20 Generated from https://github.com/cilium/cilium/actions/runs/5830951284. `docker.io/cilium/cilium:v1.11.20@sha256:60df3cb7155886e0b62060c7a4a31e457933c6e35af79febad5fd6e43bab2a99` `quay.io/cilium/cilium:v1.11.20@sha256:60df3cb7155886e0b62060c7a4a31e457933c6e35af79febad5fd6e43bab2a99` `docker.io/cilium/clustermesh-apiserver:v1.11.20@sha256:46760182f8c98227cfac27627275616987b71509227775350573d834133a6d49` `quay.io/cilium/clustermesh-apiserver:v1.11.20@sha256:46760182f8c98227cfac27627275616987b71509227775350573d834133a6d49` `docker.io/cilium/docker-plugin:v1.11.20@sha256:9e036af06498d1a90d8eee3ce3c3dbeb10a6bbe2b2e6a55d04941c82624a2e3a` `quay.io/cilium/docker-plugin:v1.11.20@sha256:9e036af06498d1a90d8eee3ce3c3dbeb10a6bbe2b2e6a55d04941c82624a2e3a` `docker.io/cilium/hubble-relay:v1.11.20@sha256:e2f38b901fd8bd5adc9a765a5e68836364ebd1e7dfb85c2bcd8a5488b23c3470` `quay.io/cilium/hubble-relay:v1.11.20@sha256:e2f38b901fd8bd5adc9a765a5e68836364ebd1e7dfb85c2bcd8a5488b23c3470` `docker.io/cilium/operator-alibabacloud:v1.11.20@sha256:5d5b44f0a08802972323adb7ca2d5df7e0983736ab3b195090906d2fa97f9594` `quay.io/cilium/operator-alibabacloud:v1.11.20@sha256:5d5b44f0a08802972323adb7ca2d5df7e0983736ab3b195090906d2fa97f9594` `docker.io/cilium/operator-aws:v1.11.20@sha256:48b755858729f783a682d80693ef3a208ddb70fa912b119f82f99bb988b23586` `quay.io/cilium/operator-aws:v1.11.20@sha256:48b755858729f783a682d80693ef3a208ddb70fa912b119f82f99bb988b23586` `docker.io/cilium/operator-azure:v1.11.20@sha256:65b2d2b143830e5a5764416d000244ac447b3e1fca07fe9c138c84094fa42085` `quay.io/cilium/operator-azure:v1.11.20@sha256:65b2d2b143830e5a5764416d000244ac447b3e1fca07fe9c138c84094fa42085` `docker.io/cilium/operator-generic:v1.11.20@sha256:1439954acf620f048ef663524ae70b4a25693c58527a2f2cee51124496e29f90` `quay.io/cilium/operator-generic:v1.11.20@sha256:1439954acf620f048ef663524ae70b4a25693c58527a2f2cee51124496e29f90` `docker.io/cilium/operator:v1.11.20@sha256:998f7df39d12324a7d968a8c8725533b10b54c01f4aeab33d12b395af1f2edf8` `quay.io/cilium/operator:v1.11.20@sha256:998f7df39d12324a7d968a8c8725533b10b54c01f4aeab33d12b395af1f2edf8` Signed-off-by: Maciej Kwiek <maciej@isovalent.com> 11 August 2023, 14:23:21 UTC
5269aad Prepare for release v1.11.20 Signed-off-by: Maciej Kwiek <maciej@isovalent.com> 10 August 2023, 13:52:15 UTC
4abfd4c Update API code Signed-off-by: Feroz Salam <feroz.salam@isovalent.com> 09 August 2023, 14:47:00 UTC
1eb593a images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> 09 August 2023, 14:47:00 UTC
ad34c5a chore(deps): update docker.io/library/golang docker tag to v1.19.11 A very lightly edited cherry-pick of commit 93e613e to try and ensure that renovate doesn't attempt to autoclose it. Signed-off-by: Feroz Salam <feroz.salam@isovalent.com> 09 August 2023, 14:47:00 UTC
73cf841 node_ids: Check that two nodes don't share a node ID [ upstream commit e1b90630ab26370f026ade2091ea07a937d9620c ] [ backporter's notes: Small conflict at the top of deallocateIDForNode. ] The previous commit fixed a bug that could lead to two nodes sharing the same node ID. This commit adds a check, on node deletion, for such erroneous state. It completes the existing check that a single node doesn't have multiple node IDs. In both cases, an error is thrown such that users will notice and our CI will fail. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
b8a8eb0 node: Fix leak of node IP <> ID mappings [ upstream commit 4d46fa9c9def58396f252a1cfdefe0a820e233fd ] This commit fixes a leak of node IP to node ID mappings when the set of IP addresses of a node is updated. We have logic to allocate and map node IDs on node creation, logic to unmap and deallocate node IDs on node deletions, but we didn't fully handle the case of node updates. On node updates, we would map new IP addresses to the already-allocated node ID. We forgot to unmap the old IP addresses. This commit fixes that oversight. This leak means that some IP addresses that were previously assigned to the node would remain mapped to the node ID. Those IP addresses could then be assigned to a new node. The new node would then receive the same node ID as the old node (because we first check if any of the node IPs already have an ID). This could lead to encrypted traffic being sent to the wrong node. Before removing the node IDs from the ipcache (see previous commit), this leak could have another consequence. If after the node ID had been reassigned to the new node, the old node was deleted, the node ID would be deallocated, but it would remain in the ipcache. The XFRM policies and states of the node would also be identified based on the node ID and removed. Thus, both the old and new nodes' XFRM configs would be removed. As a consequence, outgoing packets for those nodes would be dropped with XfrmOutPolBlock (the default drop-all rule). Fixes: af88b42bd ("datapath: Introduce node IDs") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
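The essence of this fix is a diff on node updates: map the new IP addresses to the node ID and unmap any old IPs that are no longer present. The Go sketch below is a simplified model of that bookkeeping, not Cilium's actual node-ID manager; the data structure and function names are assumptions for illustration.

```go
package main

import "fmt"

// nodeIDs maps node IP -> node ID, a simplified stand-in for the agent's
// internal state plus its BPF node-ID map.
var nodeIDs = map[string]uint16{}

// updateNodeIPs shows the fix: IPs that disappeared from the node are
// unmapped instead of being left behind for a future node to inherit.
func updateNodeIPs(nodeID uint16, oldIPs, newIPs []string) {
	current := map[string]bool{}
	for _, ip := range newIPs {
		current[ip] = true
		nodeIDs[ip] = nodeID // map new (and unchanged) IPs
	}
	for _, ip := range oldIPs {
		if !current[ip] {
			delete(nodeIDs, ip) // the unmap step that was previously missing
		}
	}
}

func main() {
	updateNodeIPs(7, nil, []string{"192.168.1.10", "10.0.1.1"})
	updateNodeIPs(7, []string{"192.168.1.10", "10.0.1.1"}, []string{"192.168.1.11", "10.0.1.1"})
	fmt.Println(nodeIDs) // 192.168.1.10 no longer maps to node ID 7
}
```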
bc5aa92 datapath, ipcache: Remove unused node ID functions [ upstream commit efbba5916347640007bce74fe20afebedbbbe561 ] [ backporter's notes: Many small conflicts due to some node ID getters being added in v1.14 for the Mutual Auth feature. Since that feature is absent in previous versions, we can also remove the ipcache's NodeIDHandler. Some changes to integration tests that don't exist before v1.14 were also not backported. ] The AllocateNodeID function was used to allocate node IDs from the ipcache logic. Since we don't populate the ipcache with node IDs anymore, it's unused and can be removed. The DeallocateNodeID function was never used (a sibling function was used instead). This commit has no functional changes. Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
ff096fd ipcache: Stop populating node IDs in ipcache [ upstream commit 4856e3829193351712c1fc4958639bc34bbdf724 ] [ backporter's notes: Many small conflicts due to OnIPIdentityCacheChange's prototype changing in v1.14 and the ipcache map's structs slightly changing with the new loader. ] In previous commits, we removed all consumers (IPsec and Mutual Auth) of the ipcache's node_id field. This commit removes population of those node IDs in the ipcache. **Problem** Needing to populate the ipcache with node IDs causes all sorts of issues. Node IDs are allocated for and mapped with specific node IP addresses. In practice, it means node IDs are tied to the lifecycle of both node and pod objects: 1. We need to allocate node IDs when receiving node objects with new node IP addresses (either new node objects or existing node objects with changing IP addresses). 2. We need to allocate node IDs when receiving a node IP address via a pod (hostIP of the pod). Node objects have all node IP addresses. So why are we not relying solely on 1? We also need 2 because pod adds and pod updates (with new hostIP) can arrive before the related node add or node update. Basically, kube-apiserver doesn't provide any guarantee of ordering between node and pod events. Since we need to populate the ipcache with a node ID for each remote pod, we need to allocate a node ID when we receive the pod object and not later, when we receive the node object. Unfortunately, tying the node IDs to two object lifecycles in this way causes a lot of complications. For example, we can't simply deallocate node IDs on pod deletions because other pods may still be using the node ID. We also can't deallocate simply on node deletions because pods may still be using it. The typical solution to such issues is to maintain a reference count, but that leads to its own set of issues in this case [1]. **Considered Solution** An alternative to all this is to force events to always be managed in the same order. For example, NodeAdd -> PodAdd -> PodDelete -> NodeDelete. The agent unfortunately doesn't easily lend itself to that. Without a major refactoring, we would have to populate the ipcache as usual, but skip node IDs. Then, on NodeAdd, we would allocate a node ID and go through the ipcache again to populate the node IDs we skipped. In the datapath, we would have an explicit drop for such half-populated ipcache entries. To implement this without needing to walk through the entire ipcache for each NodeAdd or NodeUpdate, we need to keep track of the specific entries that need to be populated. We need the same sort of mechanism for deletions. This solution quickly turns into something quite complex to implement with good performance. **This Solution** It should be quite clear at this point that this problem's complexity stems from having node IDs tied to both the pods and nodes' lifecycles. To untie node IDs from pod lifecycles, and thus, stop populating the ipcache with node IDs, we will need the datapath to retrieve them from some other place. Fortunately, we already have a BPF map that maps node IP addresses to node IDs. It is currently only used as a restoration mechanism for agent restarts, but there's no reason we can't use it for more. Thus, with the node IP address retrieved from the ipcache (ep->tunnel_endpoint), we can lookup the node ID in the BPF map. Of course, since pod objects can arrive before their corresponding node object, the node ID may not always be ready in the BPF map when we need it. 
When that happens, we will drop the packet with a specific drop code (DROP_NO_NODE_ID). These drops simply mean we tried to send traffic to a node for which we didn't receive the node object yet (hence the encryption configuration isn't set up yet). These datapath changes have been implemented in the previous commit. This commit removes all population of node IDs in the ipcache. We can also remove node IDs from the tunnel map. It matters less since the tunnel map is tied to node objects, but let's keep the datapath consistent between the two maps (ipcache and tunnel map). 1 - https://github.com/cilium/cilium/pull/26725#discussion_r1266785189 Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
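The lookup-or-drop behaviour described above can be summarized as: resolve the node ID from the tunnel endpoint IP, and drop with DROP_NO_NODE_ID if no mapping exists yet. The following is a toy Go model of that logic under assumed names; the real implementation lives in the BPF datapath and uses the node-ID BPF map.

```go
package main

import "fmt"

// nodeIDByIP stands in for the BPF map keyed by node IP.
var nodeIDByIP = map[string]uint16{
	"192.168.56.12": 42, // node object already received for this node
}

const dropNoNodeID = "DROP_NO_NODE_ID"

func resolveNodeID(tunnelEndpoint string) (uint16, error) {
	id, ok := nodeIDByIP[tunnelEndpoint]
	if !ok {
		// Pod event arrived before the node event: no node ID yet, so the
		// packet is dropped until the node object (and its IPsec config)
		// shows up and the mapping is installed.
		return 0, fmt.Errorf("%s: no ID for node %s", dropNoNodeID, tunnelEndpoint)
	}
	return id, nil
}

func main() {
	fmt.Println(resolveNodeID("192.168.56.12")) // known node -> node ID
	fmt.Println(resolveNodeID("192.168.56.13")) // unknown node -> drop
}
```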
fc2317b bpf/ipsec: Stop relying on ipcache node_id field [ upstream commit 6109a38bc7a5f79fc7635a266a24eea8f2afba34 ] [ backporter's notes: Many conflicts over large sections of code (in particular in encap.h). I resolved by reapplying the changes. ] The IPsec feature relies on the node IDs encoded in the packet marks to match packets against encryption rules (XFRM OUT policies and states). It currently retrieves the ID of the remote node from the ipcache. In a subsequent commit, however, we will remove this node ID information from the ipcache (to simplify node ID management in the agent). So IPsec's datapath must retrieve the node ID from a different place. Fortunately, we already have a BPF map containing all node IP to node ID mappings. Thus, in this commit, we use the remote node IP obtained from the ipcache (ep->tunnel_endpoint) to lookup the remote node ID in this BPF map. This one additional hashmap lookup is expected to have a negligible cost compared to the cost of encryption. It should even be small compared to the cost of the ipcache LPM lookup. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
21f83b6 monitor: New NO_NODE_ID drop type [ upstream commit 11f779542d02ad63b80f66a0c28084f0c46a9881 ] [ backporter's notes: Trivial conflicts due to many additional drop reasons being added between this branch and v1.14. ] This new packet drop type/reason will be used by IPsec and Mutual Auth when a node ID lookup doesn't return anything. Both features rely on the presence of the node ID to function properly. A failure to find the node ID can occur if traffic is sent to a pod on a node for which we didn't receive the NodeAdd or NodeUpdate event yet. Such out-of-order events (PodAdd received before NodeAdd) are possible and normal behavior. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
a025ef0 node: Allocate node ID even if IPsec is disabled [ upstream commit d2523ff4cea6facb8374a0720fd34b398e2323c4 ] [ backporter's notes: The mocked node ID BPF map doesn't exist in v1.13 and changes to the integration tests (see last paragraph below) were not necessary, so I skipped them. Trivial conflict in the variable declarations of nodeUpdate. ] Before this commit, we would allocate a node ID for new nodes if IPsec was enabled. That wasn't consistent with (1) allocations triggered by pod creations, where we allocate even if IPsec is disabled, and (2) deallocations triggered by node deletions which also happen regardless of IPsec being enabled. This commit changes the nodeUpdate logic slightly such that the node ID is allocated regardless of IPsec being enabled. As a consequence, we need to update several integration tests to mock the node ID BPF map. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
00ab0f1 ipsec: Refactor IPsec logic to allocate node IDs in one place [ upstream commit f711329f9b12ff34c63bf14f17b8dd10f92904ed ] This commit has no functional changes. It simply moves the allocation of node IDs to a single place in the top calling function. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
c75f7e8 bpf: Move IPsec prep logic to encrypt.h [ upstream commit 6d848825b63d3fa2bb50f1a5baf1a444ae579c11 ] [ backporter's notes: Trivial conflict on the unusual include of encrypt.h and due to macro guards at the top. ] This commit has no functional changes. It simply moves the logic to prepare packets for IPsec encryption to the dedicated encrypt.h file. Function set_ipsec_encrypt_mark is also created to simplify a subsequent commit (no, it won't remain a function that simply calls another function :)). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
f4323e1 bpf/Makefile: Remove IPsec from compile options for bpf_{xdp,sock} [ upstream commit 93f70bb471ca02fd3644876cef904784647ae835 ] [ backporter's notes: All changes conflicted so were reapplied from scratch. ] IPsec isn't used or compatible with bpf_xdp and bpf_sock, so there is no reason to compile test it there. This will simplify subsequent commits where we will introduce some code in IPsec that isn't compatible with non-tc BPF hooks. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 01 August 2023, 12:04:01 UTC
8fdf05c install: Update image digests for v1.11.19 Generated from https://github.com/cilium/cilium/actions/runs/5672740170. `docker.io/cilium/cilium:v1.11.19@sha256:f71c973a9159158704012e1a065a3d484353ff4c2b4e05e10a03382f055adad4` `quay.io/cilium/cilium:v1.11.19@sha256:f71c973a9159158704012e1a065a3d484353ff4c2b4e05e10a03382f055adad4` `docker.io/cilium/clustermesh-apiserver:v1.11.19@sha256:9346b296322036d2df98bd0ebdc721f4fafd5449030c7fd5dc53b20103758eee` `quay.io/cilium/clustermesh-apiserver:v1.11.19@sha256:9346b296322036d2df98bd0ebdc721f4fafd5449030c7fd5dc53b20103758eee` `docker.io/cilium/docker-plugin:v1.11.19@sha256:dc5eb50a89ef4fc31596f922fb63149f1e2d68a563ae5844cd83b61d7da7c04e` `quay.io/cilium/docker-plugin:v1.11.19@sha256:dc5eb50a89ef4fc31596f922fb63149f1e2d68a563ae5844cd83b61d7da7c04e` `docker.io/cilium/hubble-relay:v1.11.19@sha256:8c1032dfb03359e0576061502196e06eefb8ef12743d602e075e7f97f56667e4` `quay.io/cilium/hubble-relay:v1.11.19@sha256:8c1032dfb03359e0576061502196e06eefb8ef12743d602e075e7f97f56667e4` `docker.io/cilium/operator-alibabacloud:v1.11.19@sha256:9cb60d9362a362b58bb33da6b7a4b73f7882d0bc580af74c91c50d3112a74e2e` `quay.io/cilium/operator-alibabacloud:v1.11.19@sha256:9cb60d9362a362b58bb33da6b7a4b73f7882d0bc580af74c91c50d3112a74e2e` `docker.io/cilium/operator-aws:v1.11.19@sha256:b121c72160abc99112bf155d05f3c09fca266a3ea026143d86da7376654f708b` `quay.io/cilium/operator-aws:v1.11.19@sha256:b121c72160abc99112bf155d05f3c09fca266a3ea026143d86da7376654f708b` `docker.io/cilium/operator-azure:v1.11.19@sha256:13c1030a90f38c483ae5b0696e0597c4129697f3af81e1eeb238d7d5a04e326e` `quay.io/cilium/operator-azure:v1.11.19@sha256:13c1030a90f38c483ae5b0696e0597c4129697f3af81e1eeb238d7d5a04e326e` `docker.io/cilium/operator-generic:v1.11.19@sha256:79b622067205037489dcfc3280a2b9a19b0ede9a1c83eb5b3064926fa6af6a23` `quay.io/cilium/operator-generic:v1.11.19@sha256:79b622067205037489dcfc3280a2b9a19b0ede9a1c83eb5b3064926fa6af6a23` `docker.io/cilium/operator:v1.11.19@sha256:26f479a21f3079eb0da4700b9ffd012dfce9b38d635486998bbe352b8f8df740` `quay.io/cilium/operator:v1.11.19@sha256:26f479a21f3079eb0da4700b9ffd012dfce9b38d635486998bbe352b8f8df740` Signed-off-by: Nate Sweet <nathanjsweet@pm.me> 27 July 2023, 21:07:08 UTC
06915ce Prepare for release v1.11.19 Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> 26 July 2023, 18:28:54 UTC
fefa16b endpoint: Do not override deny entries with proxy redirects [ upstream commit 8aa89ef7088108fe7c5dfdb482ee57fb4ee02d25 ] Use DenyPreferredInsert instead of directly manipulating policy map state to make sure deny entries are not overridden by new proxy redirect entries. Prior to this fix it was possible for a proxy redirect to be pushed onto the policy map when it should have been overridden by a deny at least in these cases: - L3-only deny with L3/L4 redirect: No redirect should be added as the L3 is denied - L3-only deny with L4-only redirect: L4-only redirect should be added and an L3/L4 deny should also be added, but the L3/L4 deny is only added by deny preferred insert, and is missed when the map is manipulated directly. A new test case verifies this. It is clear that in the latter case the addition of the redirect can not be completely blocked, so we can't fix this by making AllowsL4 more restrictive. But also in the former case it is possible that the deny rule only covers a subset of security identities, while the redirect rule covers some of the same security identities, but also some more that should not be blocked. Hence the correct fix here is to leave AllowsL4 to be L3-independent, and cover these cases with deny preferred insert instead of adding redirect entries to the map directly. This commit also contains a related change that allows a redirect entry to be updated, maybe with a changed proxy port. I've not seen evidence that this is currently fixing a bug, but it feels like a real possibility. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> 26 July 2023, 13:02:50 UTC
6dc3fb3 policy: Export DenyPreferredInsertWithChanges, make revertible [ upstream commit 9f52abbfdb6d5570b91fe4c1809e4ac02bc7cc0f ] Export DenyPreferredInsertWithChanges and make it revertible by taking a map of old values as a new optional argument. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> 26 July 2023, 13:02:50 UTC
6193eb6 envoy: Bump envoy version to v1.24.10 This is for the below CVEs from the upstream. CVEs: https://github.com/envoyproxy/envoy/security/advisories/GHSA-pvgm-7jpg-pw5g https://github.com/envoyproxy/envoy/security/advisories/GHSA-69vr-g55c-v2v4 https://github.com/envoyproxy/envoy/security/advisories/GHSA-mc6h-6j9x-v3gq https://github.com/envoyproxy/envoy/security/advisories/GHSA-7mhv-gr67-hq55 Build: The build is coming from https://github.com/cilium/proxy/actions/runs/5661705068/job/15340176601 Release: https://github.com/envoyproxy/envoy/releases/tag/v1.24.10 Signed-off-by: Tam Mach <tam.mach@cilium.io> 26 July 2023, 03:42:44 UTC
af3facf docs/ipsec: Document RSS limitation [ upstream commit c9983ef8c5c03eac868aa9fe48ce2d9771074255 ] [ Backporter's notes: the changes had to be manually backported to the appropriate files for v1.11, as the docs were restructured in fbc53d084ce34159a3fde3b19e26fc2fbbef9e52 and 69d07f79cb17dd0a543043152a32604bb4226ee3 since then. ] All IPsec traffic between two nodes is always sent on a single IPsec flow (defined by outer source and destination IP addresses). As a consequence, RSS on such traffic is ineffective and throughput will be limited to the decryption performance of a single core. Reported-by: Ryan Drew <ryan.drew@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 26 July 2023, 01:02:34 UTC
d559a71 docs: Specify Helm chart version in "cilium install" commands [ upstream commit 12fc68a11f5773d3292d543266e0f16bd0119a0f ] [ Backporter's notes: the changes had to be manually backported to the appropriate files for v1.11, as the docs were restructured in fbc53d084ce34159a3fde3b19e26fc2fbbef9e52 and 69d07f79cb17dd0a543043152a32604bb4226ee3 since then. ] - For the main branch latest docs, clone the Cilium GitHub repo and use "--chart-directory ./install/kubernetes/cilium" flag. - For stable branches, set "--version" flag to the version in the top-level VERSION file. Fixes: #26931 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 26 July 2023, 01:02:34 UTC
2da39b2 docs/ipsec: Extend troubleshooting section [ upstream commit 3ba76e5781050f3a3f2402f54fb4a6ad34944eb1 ] [ Backporter's notes: the changes had to be manually backported to the appropriate files for v1.11, as the docs were restructured in fbc53d084ce34159a3fde3b19e26fc2fbbef9e52 and 69d07f79cb17dd0a543043152a32604bb4226ee3 since then. ] Recent bugs with IPsec have highlighted a need to document several caveats of IPsec operations. This commit documents those caveats as well as common XFRM errors. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 26 July 2023, 01:02:34 UTC
3446ec2 test: Print error messages in case of failure [ upstream commit 67a3ab3533a7f77aa4241c0da6b04f5b31da9af9 ] [ Backporter's notes: the changes had to be manually backported to the appropriate files for v1.11, as they were renamed in ffd7e57b377f982fb57cf574564b7f1debef74a4 since then. (main > v1.11) test/k8s/datapath_configuration.go > test/k8sT/DatapathConfiguration.go ] If we check res.WasSuccessful() instead of res, then ginkgo won't print the error message in case the command wasn't successful. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 26 July 2023, 01:02:34 UTC
0470ee5 test: Avoid downloading conntrack package at runtime [ upstream commit a58cb6a25753a5ddb38b46709bc120c51cd0e56b ] [ Backporter's notes: the changes had to be manually backported to the appropriate files for v1.11, as they were renamed in ffd7e57b377f982fb57cf574564b7f1debef74a4 since then. (main > v1.11) test/k8s/datapath_configuration.go > test/k8sT/DatapathConfiguration.go test/k8s/manifests/log-gatherer.yaml > test/k8sT/manifests/log-gatherer.yaml ] The 'Skip conntrack for pod traffic' test currently downloads the conntrack package at runtime to be able to flush and list Linux's conntrack entries. This sometimes fails because of connectivity issues to the package repositories. Instead, we've now included the conntrack package in the log-gatherer image. We can use those pods to run conntrack commands instead of using the Cilium agent pods. Fixes: 496ce420958 ("iptables: add support for NOTRACK rules for pod-to-pod traffic") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 26 July 2023, 01:02:34 UTC
459f545 docs/ipsec: Clarify limitation on number of nodes [ upstream commit 39a9def6c24ff08fc2e7d66d6284586051a30146 ] The limitation on the number of nodes in the cluster when using IPsec applies to clustermeshes as well and refers to the total number of nodes. This limitation arises from the use of the node IDs, which are encoded on 16 bits. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
50a26e4 bpf, daemon: Have bpf_host support both values for skb->cb[4] [ upstream commit 420d7faea339d8d93852da305a818cdebd41730e ] Commits 4c7cce1bf0 ("bpf: Remove IP_POOLS IPsec code") and 19a62da4e ("bpf: Lookup tunnel endpoint for IPsec rewrite") changed the way we pass the tunnel endpoint. We used to pass it via skb->cb[4] and read it in bpf_host to build the encapsulation. We changed that in the above commits to pass the identity via skb->cb[4] instead. Therefore, on upgrades, for a short while, we may end up with bpf_lxc writing the identity into skb->cb[4] (new datapath version) and bpf_host interpreting it as the tunnel endpoint (old version). Reloading bpf_host before bpf_lxc is not enough to fix it because then, for a short while, bpf_lxc would write the tunnel endpoint in skb->cb[4] (old version) and bpf_host would interpret it as the security identity (new version). In addition to reloading bpf_host first, we also need to make sure that it can handle both cases (skb->cb[4] has tunnel endpoint or identity). To distinguish between those two cases and interpret skb->cb[4] correctly, bpf_host will rely on the first byte: in the case of the tunnel endpoint it can't be zero because that would mean the IP address is within the special purpose range 0.0.0.0/8; in the case of the identity, it has to be zero because identities are on 24 bits. This commit implements those changes. Commit ca9c056deb ("daemon: Reload bpf_host first in case of IPsec upgrade") had already made the agent reload bpf_host first for ENI and Azure IPAM modes, so we just need to extend it to all IPAM modes. Note that the above bug on upgrades doesn't cause an immediate packet drop at the sender. Instead, it seems the packet is encrypted twice. The (unverified) assumption here is that the encapsulation is skipped because the tunnel endpoint IP address is invalid (being a security identity on 24 bits). The encrypted packet is then sent again to cilium_host where the encryption bit is reapplied (given the destination IP address is a CiliumInternalIP). And it goes through the XFRM encryption again. On the receiver's side, the packet is decrypted once as expected. It is then recirculated to bpf_overlay which removes the packet mark. Given it is still an ESP (encrypted) packet, it goes back through the XFRM decryption layer. But since the packet mark is now zero, it fails to match any XFRM IN state. The packet is dropped with XfrmInNoStates. This can be seen in the following trace:
<- overlay encrypted flow 0x6fc46fc4 , identity unknown->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
-> stack encrypted flow 0x6fc46fc4 , identity 134400->44 state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
<- overlay encrypted flow 0x6fc46fc4 , identity unknown->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
-> host from flow 0x6fc46fc4 , identity unknown->43 state unknown ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
-> stack flow 0x6fc46fc4 , identity unknown->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.9.91 -> 10.0.8.32
The packet comes from the overlay encrypted, is sent to the stack to be decrypted, and comes back still encrypted. Fixes: 4c7cce1bf0 ("bpf: Remove IP_POOLS IPsec code") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
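The disambiguation rule described above boils down to inspecting the first byte of the cb[4] value: a 24-bit security identity always has a zero top byte, while a tunnel endpoint IPv4 address can never start with octet 0 (0.0.0.0/8 is a special-purpose range). Below is a small Go illustration of that check only; it assumes, for illustration, that the first octet of the IPv4 address sits in the most significant byte of the 32-bit value. The real check is done in bpf_host on skb->cb[4].

```go
package main

import "fmt"

// classifyCB4 mimics the disambiguation described above: identities fit in
// 24 bits, so their most significant byte is zero, while a tunnel endpoint
// IPv4 address cannot start with octet 0.
func classifyCB4(cb4 uint32) string {
	if cb4>>24 == 0 {
		return "security identity"
	}
	return "tunnel endpoint"
}

func main() {
	fmt.Println(classifyCB4(134400))     // a 24-bit identity (as seen in the trace above)
	fmt.Println(classifyCB4(0x0a000820)) // 10.0.8.32 interpreted as a tunnel endpoint
}
```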
fba7b12 ipsec: Remove workarounds for path asymmetry that was removed [ upstream commit 0a8f2c4ee43e55d8f82e67bba2af186435ff4a29 ] Commits 3b3e8d0b1 ("node: Don't encrypt traffic to CiliumInternalIP") and 5fe2b2d6d ("bpf: Don't encrypt on path hostns -> remote pod") removed a path asymmetry on the paths hostns <> remote pod. They however failed to remove workarounds that we have for this path asymmetry. In particular, we would encrypt packets on the path pod -> remote node (set SPI in the node manager) to try and avoid the path asymmetry, by also encrypting the replies. And, as a result of this first change, we would also need to handle a corner case in the datapath to apply the correct reverse DNAT for reply traffic. All of that is unnecessary now that we fixed the path asymmetry. Fixes: 3b3e8d0b1 ("node: Don't encrypt traffic to CiliumInternalIP") Fixes: 5fe2b2d6d ("bpf: Don't encrypt on path hostns -> remote pod") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
cf2563d ipsec: Don't match on source IP for XFRM OUT policies [ upstream commit ebd02f1f62c945dc1aabf9694cdfcc51d91df86b ] On IPAM modes with one pod CIDR per node, the XFRM OUT policies look like below:
src 10.0.1.0/24 dst 10.0.0.0/24 dir out priority 0 ptype main mark 0x66d11e00/0xffffff00 tmpl src 10.0.1.13 dst 10.0.0.214 proto esp spi 0x00000001 reqid 1 mode tunnel
When sending traffic from the hostns, however, it may not match the source CIDR above. Traffic from the hostns may indeed leave the node with the NodeInternalIP as the source IP (vs. CiliumInternalIP which would match). In such cases, we don't match the XFRM OUT policy and fall back to the catch-all default-drop rule, ending up with XfrmOutPolBlock packet drops. Why wasn't this an issue before? It was. Traffic would simply go in plain-text (which is okay given we never intended to encrypt hostns traffic in the first place). What changes is that we now have a catch-all default-drop XFRM OUT policy to avoid leaking plain-text traffic. So it now results in XfrmOutPolBlock errors. In commit 5fe2b2d6da ("bpf: Don't encrypt on path hostns -> remote pod") we removed encryption for the path hostns -> remote pod. Unfortunately, that doesn't mean the issue is completely gone. On a new Cilium install, we won't see this issue of XfrmOutPolBlock drops for hostns traffic anymore. But on existing clusters, we will still see those drops during the upgrade, after the default-drop rule is installed but before hostns traffic encryption is removed. None of this is an issue on AKS and ENI IPAM modes because there, the XFRM OUT policies look like:
src 0.0.0.0/0 dst 10.0.0.0/16 dir out priority 0 ptype main mark 0x66d11e00/0xffffff00 tmpl src 10.0.1.13 dst 10.0.0.214 proto esp spi 0x00000001 reqid 1 mode tunnel
Thus, hostns -> remote pod traffic is matched regardless of the source IP being selected and packets are not dropped by the default-drop rule. We can therefore avoid the upgrade drops by changing the XFRM OUT policies to never match on the source IPs, as on AKS and ENI IPAM modes. Fixes: 7d44f3750 ("ipsec: Catch-default default drop policy for encryption") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
748644d node: Don't encrypt traffic to CiliumInternalIP [ upstream commit 3b3e8d0b1194ebc1d8c8f0f1525045b521e1c7f9 ] For similar reasons as in the previous commit, we don't want to encrypt traffic going from a pod to the CiliumInternalIP. This is currently the only node IP address type that is associated with an encryption key. Since we don't encrypt traffic from the hostns to remote pods anymore (see previous commit), encrypting traffic going to a CiliumInternalIP (remote node) would result in a path asymmetry: traffic going to the CiliumInternalIP would be encrypted, whereas reply traffic coming from the CiliumInternalIP wouldn't. This commit removes that case and therefore ensures we never encrypt traffic going to a node IP address. Reported-by: Gray Lian <gray.liang@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
0551b5d bpf: Don't encrypt on path hostns -> remote pod [ upstream commit 5fe2b2d6da76d6b1d334f3ce9c0c59371b239892 ] In pod-to-pod encryption with IPsec and tunneling, Cilium currently encrypts traffic on the path hostns -> remote pod even though traffic is in plain-text on the path remote pod -> hostns. When using native routing, neither of those paths is encrypted because traffic from the hostns doesn't go through the bpf_host BPF program. Cilium's Transparent Encryption with IPsec aims at encrypting pod-to-pod traffic. It is therefore unclear why we are encrypting traffic from the hostns. The simple fact that only one direction of the connection is encrypted begs the question of its usefulness. It's possible that this traffic was encrypted by mistake: some of this logic is necessary for node-to-node encryption with IPsec (not supported anymore) and pod-to-pod encryption may have been somewhat simplified to encrypt *-to-pod traffic. Encrypting traffic from the hostns nevertheless creates several issues. First, this situation creates a path asymmetry between the forward and reply paths of hostns<>remote pod connections. Path asymmetry issues are well known to be a source of bugs, from '--ctstate INVALID -j DROP' iptables rules to NAT issues. Second, Gray recently uncovered a separate bug which, when combined with this encryption from hostns, can prevent Cilium from starting. That separate bug is still being investigated but it seems to cause the reload of bpf_host to depend on Cilium connecting to etcd in a clustermesh context. If this etcd is a remote pod, Cilium connects to it on the hostns -> remote pod path. The bpf_host program being unloaded[1], it fails. We end up in a cyclic dependency: bpf_host requires connectivity to etcd, connectivity to etcd requires bpf_host. This commit therefore removes encryption with IPsec for the path hostns -> remote pod when using tunneling (already unencrypted when using native routing). 1 - More specifically, in Gray's case, the bpf_host program is already loaded, but it needs to be reloaded because the IPsec XFRM config changed. Without this reload, encryption fails. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
6892a24 bpf: Remove IPsec dead code in bpf_host [ upstream commit fd6fa25103d5170f294edd283393fe222c5fef8b ] TL;DR. this commit removes a bit of dead code that seems to have been intended for IPsec in native routing mode but is never actually executed. These code paths are only executed if going through cilium_host and coming from the host (see !from_host check above). For remote destinations, we only go through cilium_host if the destination is part of a remote pod CIDR and we are running in tunneling mode. In native routing mode, we go straight to the native device. Example routing table for tunneling (10.0.0.0/24 is the remote pod CIDR):
10.0.0.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61 mtu 1373   <- we follow this
10.0.1.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61
10.0.1.61 dev cilium_host scope link
192.168.56.0/24 dev enp0s8 proto kernel scope link src 192.168.56.11
Example routing table for native routing:
10.0.0.0/24 via 192.168.56.12 dev enp0s8   <- we follow this
10.0.1.0/24 via 10.0.1.61 dev cilium_host src 10.0.1.61
10.0.1.61 dev cilium_host scope link
192.168.56.0/24 dev enp0s8 proto kernel scope link src 192.168.56.11
Thus, this code path is only used for tunneling with IPsec. However, IPsec in tunneling mode should already be handled by the encap_and_redirect_with_nodeid call above in the same functions (see info->key argument). So why was this added? It was added in commit b76e6eb59 ("cilium: ipsec, support direct routing modes") to support "direct routing modes". I found that very suspicious because, per the above, in native routing mode, traffic from the hostns shouldn't even go through cilium_host. I thus tested it out. I've checked IPsec with native routing mode, with and without endpoint routes. I can confirm that, in all those cases, traffic from the hostns is not encrypted when going to a remote pod. Therefore, this code is dead. I'm unsure when it died. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 21 July 2023, 22:13:45 UTC
303134b ci: increase ginkgo timeout Increase ginkgo kernel test timeout from 170m to 200m to avoid unnecessary timeouts during test execution. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> 20 July 2023, 17:58:56 UTC
d4a6ded docs: Pick up PyYAML 6.0.1 [ upstream commit e06e70e26fdde5205429b71fdc5263b0d8905adc ] Revert commit 04d48fe3, and pick up PyYAML 6.0.1. Fixes: #26873 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 19 July 2023, 14:13:57 UTC
180e564 Fix "make -C Documentation builder-image" [ upstream commit 04d48fe3706a83d5612da1195fac78dc69c1a7b4 ] Use this workaround until the issue gets fixed: https://github.com/yaml/pyyaml/issues/601#issuecomment-1638509577 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 19 July 2023, 14:13:57 UTC
a3d58a4 client, health/client: set dummy host header on unix:// local communication [ upstream commit b9ec2aaece578278733e473a72bb5594f621d495 ] Go 1.20.6 added a security fix [1] which leads to stricter sanitization of the HTTP host header in the net/http client. Cilium's pkg/client currently sets the Host header to the UDS path (e.g. /var/run/cilium/cilium.sock), however the slashes in that Host header now lead net/http to reject it. RFC 7230, Section 5.4 states [2]: "If the authority component is missing or undefined for the target URI, then a client MUST send a Host header field with an empty field-value." The authority component is undefined for the unix:// scheme. Thus, the correct value to use would be the empty string. However, this does not work due to OpenAPI runtime using the same value for the URL's host and the http client's host header. Thus, use a dummy value "localhost". [1] https://go.dev/issue/60374 [2] https://datatracker.ietf.org/doc/html/rfc7230#section-5.4 Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 19 July 2023, 14:13:57 UTC
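The pattern here is a standard-library HTTP client that always dials a unix socket while carrying an innocuous "localhost" authority. A minimal Go sketch follows; it is not Cilium's pkg/client (which goes through the OpenAPI runtime), and the request path is an illustrative assumption.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Dial the agent's unix socket regardless of the host in the URL.
	tr := &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/cilium/cilium.sock")
		},
	}
	client := &http.Client{Transport: tr}

	// "localhost" is a dummy authority: it is meaningless for unix://
	// transports, but unlike a socket path it contains no slashes, so
	// Go 1.20.6+ accepts it as a Host header.
	resp, err := client.Get("http://localhost/v1/healthz") // path is illustrative
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```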
4fa7df8 envoy: Bump envoy to v1.24.9 This is to include the fix for below CVE. CVE: https://github.com/envoyproxy/envoy/security/advisories/GHSA-jfxv-29pc-x22r GHA build: https://github.com/cilium/proxy/actions/runs/5544741749/jobs/10122649239 Signed-off-by: Tam Mach <tam.mach@cilium.io> 14 July 2023, 05:45:28 UTC
e4b0551 ariane: don't skip verifier and l4lb tests on vendor/ changes [ upstream commit 1f35bafb3d1f754a20374d177a65ed8076ee9486 ] Both of these workflows use binaries that are built in CI making use of various vendored dependencies, so run them as well on PRs only changing vendor/. backporting conflicts: * tests-datapath-verifier.yaml doesn't exist in the v1.11 branch Signed-off-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Gilberto Bertin <jibi@cilium.io> 13 July 2023, 09:23:29 UTC
57302bf chore(deps): update hubble cli to v0.12.0 Signed-off-by: renovate[bot] <bot@renovateapp.com> 13 July 2023, 04:00:58 UTC
5ce15f8 test/provision/compile.sh: Make usable from dev VM [ upstream commit 0112ddbb6960e0cbf6153e2fa3c229a32f358af8 ] Add missing 'sudo' commands so that this can be run from a shell in a dev VM to launch a local cilium agent in docker. Only install the bpf mount unit to systemd if not already mounted. This avoids error messages like this: Unit sys-fs-bpf.mount has a bad unit file setting With these changes the Cilium agent can be compiled and launched in docker, assuming the VM hostname does NOT include "k8s", like so: $ SKIP_TEST_IMAGE_DOWNLOAD=1 VMUSER=${USER} PROVISIONSRC=test/provision test/provision/compile.sh After this 'docker ps' should show a "cilium" container. This can be used, for example, to quickly run the Cilium agent locally and observe agent startup and exit logs via 'docker logs cilium -f' when stopping cilium with 'docker stop cilium'. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gilberto Bertin <jibi@cilium.io> 11 July 2023, 14:06:20 UTC
1ea0df0 test: Fix ACK and FIN+ACK policy drops in hostfw tests [ upstream commit 439a0a059fdcabe23a33b427b637494bc5a59eda ] First, see the code comments for the full explanation. This issue with the faulty conntrack entries when enforcing host policies is suspected to cause the flakes that have been polluting host firewall tests. We've seen this faulty conntrack issue happen mostly to health and kube-apiserver connections. And it turns out that the host firewall flakes look like they are caused by connectivity blips on kube-apiserver's side, with error messages such as: error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy) This commit therefore tries to work around the issue of faulty conntrack entries in host firewall tests. If the flakes are indeed caused by those faulty entries, we shouldn't see them happen anymore. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com> Signed-off-by: Gilberto Bertin <jibi@cilium.io> 11 July 2023, 14:06:20 UTC
3553673 chore(deps): update all github action dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> 07 July 2023, 10:40:27 UTC
d8a2814 images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> 07 July 2023, 09:43:42 UTC
065e7aa chore(deps): update docker.io/library/ubuntu:20.04 docker digest to c9820a4 Signed-off-by: renovate[bot] <bot@renovateapp.com> 07 July 2023, 09:43:42 UTC
5945b86 chore(deps): update all github action dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> 07 July 2023, 09:34:16 UTC
7679b69 chore(deps): update actions/setup-go action to v4 Signed-off-by: renovate[bot] <bot@renovateapp.com> 07 July 2023, 09:08:29 UTC
1963e2d chore(deps): update docker.io/library/alpine docker tag to v3.16.6 Signed-off-by: renovate[bot] <bot@renovateapp.com> 05 July 2023, 14:29:15 UTC
a706f92 chore(deps): update docker.io/library/alpine docker tag to v3.16.6 Signed-off-by: renovate[bot] <bot@renovateapp.com> 05 July 2023, 14:27:09 UTC
fe69ac2 ci: rework workflows to be triggered by Ariane on 1.11 [ upstream commit 9949c5a1891aff8982bfc19e7fc195e7ecc2abf1 ] This is a custom backport, please see upstream commit for full details. In this commit, we move stable workflows from the `main` branch back into the 1.11 stable branch now that workflows are triggered via `workflow_dispatch` events in the appropriate context. Since these new workflows were previously living in `main`, we also need to backport dependencies on `.github/actions` configuration files. We take the opportunity to adjust the configuration files as appropriate, notably in terms of K8s version coverage, to ensure that we only test K8s versions officially supported by the stable branch. In particular for 1.11, the AKS workflow was NOT backported because 1.11 only supports K8s versions up to 1.23, and 1.23 is not available on AKS anymore. The stable 1.11 AKS workflow from `main` was already disabled when this change happened, as per our testing policy, so we are not backporting it. Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 05 July 2023, 12:10:08 UTC
c32bff4 ci: add Ariane configuration file for 1.11 [ upstream commit 4a9ee81c6b6bdb5b63e61287a93ab67a77255c4c ] This is a custom backport, please see upstream commit for full details. Ariane is a new GitHub App intended to trigger Cilium CI workflows based on trigger phrases commented in pull requests, in order to replace the existing `issue_comment`-based workflows and simplify our CI stack. This commit adds a configuration setting up triggers such that existing 1.11 workflows can be triggered with the usual `/test-backport-1.11`, and based on the same PR changelist match / ignore rules. In particular for 1.11, the AKS workflow was NOT backported because 1.11 only supports K8s versions up to 1.23, and 1.23 is not available on AKS anymore. The stable 1.11 AKS workflow from `main` was already disabled when this change happened, as per our testing policy, so we are not backporting it. Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 05 July 2023, 12:10:08 UTC
7cf6b03 install: Don't install CNI binaries if cni.install=false [ upstream commit 390b4dc0d9ef63ae30f435e3ea2926069ef5a78b ] Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
ec12aef ipsec: Split removeStaleXFRMOnce to fix deprioritization issue [ upstream commit f4f3656b32492d31abc533d12692e6fe9b4d32f9 ] We expect deprioritizeOldOutPolicy() to be executed for IPv4 and IPv6, but removeStaleXFRMOnce prevents the second call. If both IPv4 and IPv6 are enabled, the v6 xfrm policy won't be deprioritized due to this issue. This commit fixes it by splitting removeStaleXFRMOnce into removeStaleIPv4XFRMOnce and removeStaleIPv6XFRMOnce. Fixes: https://github.com/cilium/cilium/commit/688dc9ac802b11f6c16a9cbc5d60baaf77bd6ed0 Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
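The bug pattern is a single sync.Once guarding work that must run once per address family, so the second (IPv6) call is silently skipped. A minimal Go sketch of the split described above, with simplified function bodies:

```go
package main

import (
	"fmt"
	"sync"
)

// Before the fix, one sync.Once covered both families, so only the first
// call ran. Splitting into two Onces lets each family run exactly once.
var (
	removeStaleIPv4XFRMOnce sync.Once
	removeStaleIPv6XFRMOnce sync.Once
)

// deprioritizeOldOutPolicy stands in for the real XFRM work.
func deprioritizeOldOutPolicy(family string) {
	fmt.Println("deprioritizing stale XFRM OUT policies for", family)
}

func main() {
	removeStaleIPv4XFRMOnce.Do(func() { deprioritizeOldOutPolicy("IPv4") })
	removeStaleIPv6XFRMOnce.Do(func() { deprioritizeOldOutPolicy("IPv6") }) // now actually runs
}
```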
63fa3d1 cli: Print NodeID in hex [ upstream commit e956bb1a29e131e730d74572a97152d547101143 ] [ backporter's notes: conflicts due to string format being different in v1.11, applied changes based on v1.11 format. ] The Node ID is used in the SKB mark used by XFRM policies. The latter print it in hex. So, let's reduce the mental strain a bit when debugging IPsec issues. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
43c277a bugtool: Add cilium bpf nodeid list [ upstream commit 18f85a014282dec7ddf2c2bf39d54564670352ec ] To help detect when the IPcache is out of sync with locally stored Node IDs. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
c9c2262 docs: clarify that L3 DNS policies require L7 proxy enabled [ upstream commit e0931df324592358d4645a9f9a31ca87aeddaf70 ] [ backporter's notes: conflicts due to docs structure change, manually applied changes to the corresponding file pre-structure change. ] Add a note to the L3 policy documentation clarifying that L3 DNS policies require the L7 proxy enabled and an L7 policy for DNS traffic so Cilium can intercept DNS responses. Previously, the documentation linked to other sections describing the DNS Proxy, but I know at least a few people who were surprised that a policy under "L3 Examples" would require an L7 proxy. Hopefully adding a note near the beginning of the section will make this requirement more obvious. Signed-off-by: Will Daly <widaly@microsoft.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
5650804 docs: reword incorrect L7 policy description [ upstream commit 68bff35b533fcd4224236a4bef27b5e711c87c69 ] [ backporter's notes: conflicts due to docs structure change, manually applied changes to the corresponding file pre-structure change. ] Fixing incorrect description of the GET /public policy in the L7 section. Signed-off-by: Peter Jausovec <peter.jausovec@solo.io> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
576580e docker: Detect default "desktop-linux" builder [ upstream commit 13f146eb117f54a20199bfecc1ab226eb9df6bfb ] New Docker desktop may have a default builder with name "desktop-linux" that is not buildx capable. Detect that name as well as the old "default" for the need to create a new buildx builder. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
ce13536 proxy: Increment non-DNS proxy ports on failure [ upstream commit 894aa4e062abda757e7cc952d4663206aa75b08d ] [ backporter's notes: conflicts due to ProxyType not existing on v1.11, used parserType as the v1.11 equivalent. ] Increment non-DNS proxy ports on failure even if DNS has been configured with a static port. Fixes: #20896 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
204a802 proxy: Only update redirects with configured proxy ports [ upstream commit ca6199827b9a68fd78227cc31afa712a7e7b51f1 ] [ backporter's notes: conflicts due to ProxyType not existing on v1.11, used parserType as the v1.11 equivalent. ] Only update an existing redirect if it is configured. This prevents Cilium agent panic when trying to update redirect with released proxy port. This has only been observed to happen with explicit Envoy listener redirects in CiliumNetworkPolicy when the listener has been removed. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
8d92f63 proxy: Do not panic on local error [ upstream commit 525007f69a87282cb4056820c722a1593402bf0d ] [ backporter's notes: conflicts due to proxy_test.go not existing on v1.11, these changes were skipped. ] CreateOrUpdateRedirect called a nil revertFunc when any local error was returned. This was done using the pattern `return 0, err, nil, nil`, which sets the revertFunc return variable to nil, but the revertFunc was then called in a deferred function to revert any changes on a local error. Fix this by calling RevertStack.Revert() directly in the deferred function, and setting the return variable only if there was no local error. This was hit any time a CiliumNetworkPolicy referred to a non-existing listener. Add a test case that reproduced the panic and works after the fix. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> 29 June 2023, 11:24:54 UTC
ed26f99 v1.11 docs: Use stable-v0.14.txt for cilium-cli version The next cilium-cli release is v0.15.0 with Helm mode as the default installation mode. Continue to use v0.14 cilium-cli for v1.11 docs since we haven't validated v1.11 docs using Helm mode. Also change the branch name from master to main. The default branch name recently changed from master to main in cilium-cli repo. Ref: https://github.com/cilium/cilium-cli/pull/1759 Ref: #26430 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> 23 June 2023, 22:24:45 UTC
a88d414 envoy: Bump minor version to v1.24.x This commit bumps the envoy version to v1.24.8, as envoy v1.23 will be EOL next month per the [release schedule](https://github.com/envoyproxy/envoy/blob/main/RELEASES.md#major-release-schedule). The image comes from the run below: https://github.com/cilium/proxy/actions/runs/5291782230/jobs/9585253849 Signed-off-by: Tam Mach <tam.mach@cilium.io> 19 June 2023, 22:31:56 UTC
9af9345 envoy: Bump envoy version to v1.23.10 This is for latest patch release from upstream https://github.com/envoyproxy/envoy/releases/tag/v1.23.10 https://www.envoyproxy.io/docs/envoy/latest/version_history/v1.23/v1.23.10 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> 15 June 2023, 16:12:00 UTC
a336dda images: introduce update script update-cilium-envoy-image This commit introduces the script `update-cilium-envoy-image.sh` (and corresponding make target) which fetches the latest cilium-envoy image by fetching the relevant data from its github repo. It updates the cilium Dockerfile. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> 15 June 2023, 16:12:00 UTC
ba9d077 install: Update image digests for v1.11.18 Generated from https://github.com/cilium/cilium/actions/runs/5279434084. ## Docker Manifests ### cilium `docker.io/cilium/cilium:v1.11.18@sha256:dda94072012c328fe0d00838f2f7d8ead071019d1d1950ecf44060640bf93cae` `quay.io/cilium/cilium:v1.11.18@sha256:dda94072012c328fe0d00838f2f7d8ead071019d1d1950ecf44060640bf93cae` ### clustermesh-apiserver `docker.io/cilium/clustermesh-apiserver:v1.11.18@sha256:b3e8de4e56c5e16ab8f4482cebf3a12bb12826ba3da3e5890de1ecdc2b34a3ed` `quay.io/cilium/clustermesh-apiserver:v1.11.18@sha256:b3e8de4e56c5e16ab8f4482cebf3a12bb12826ba3da3e5890de1ecdc2b34a3ed` ### docker-plugin `docker.io/cilium/docker-plugin:v1.11.18@sha256:b086fc1ec24b9b2b0bc5f7f525ef76ff608c26dc1bdd76d46729871cbbfb4b08` `quay.io/cilium/docker-plugin:v1.11.18@sha256:b086fc1ec24b9b2b0bc5f7f525ef76ff608c26dc1bdd76d46729871cbbfb4b08` ### hubble-relay `docker.io/cilium/hubble-relay:v1.11.18@sha256:4899d8a98c05ccb7bb3d0b54e18dc72147995b2e8a18db19805d15933ec6e45d` `quay.io/cilium/hubble-relay:v1.11.18@sha256:4899d8a98c05ccb7bb3d0b54e18dc72147995b2e8a18db19805d15933ec6e45d` ### operator-alibabacloud `docker.io/cilium/operator-alibabacloud:v1.11.18@sha256:590062c3797c0d0732d848b8fa09cd5aaf5ce2cbbbc5f5fc860bde79d27c743c` `quay.io/cilium/operator-alibabacloud:v1.11.18@sha256:590062c3797c0d0732d848b8fa09cd5aaf5ce2cbbbc5f5fc860bde79d27c743c` ### operator-aws `docker.io/cilium/operator-aws:v1.11.18@sha256:4b3aeeb5d0de096d68ab249845c4c53c7c595735d529a13a81540597a6b29bb5` `quay.io/cilium/operator-aws:v1.11.18@sha256:4b3aeeb5d0de096d68ab249845c4c53c7c595735d529a13a81540597a6b29bb5` ### operator-azure `docker.io/cilium/operator-azure:v1.11.18@sha256:c833cd215dafcb9a73dc1d435d984038fc46ebd9a0b3d50ceeb8f8c4c7e9ac3d` `quay.io/cilium/operator-azure:v1.11.18@sha256:c833cd215dafcb9a73dc1d435d984038fc46ebd9a0b3d50ceeb8f8c4c7e9ac3d` ### operator-generic `docker.io/cilium/operator-generic:v1.11.18@sha256:bccdcc3036b38581fd44bf7154255956a58d7d13006aae44f419378911dec986` `quay.io/cilium/operator-generic:v1.11.18@sha256:bccdcc3036b38581fd44bf7154255956a58d7d13006aae44f419378911dec986` ### operator `docker.io/cilium/operator:v1.11.18@sha256:0c09e5188d5d8899e7b037fafcc1928a68872f1e48e5f7a128799594c99f8282` `quay.io/cilium/operator:v1.11.18@sha256:0c09e5188d5d8899e7b037fafcc1928a68872f1e48e5f7a128799594c99f8282` Signed-off-by: Quentin Monnet <quentin@isovalent.com> 15 June 2023, 14:55:26 UTC
f5d7e2d Prepare for release v1.11.18 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> 14 June 2023, 21:00:38 UTC
9d72663 docs: Promote Deny Policies out of Beta Signed-off-by: Nate Sweet <nathanjsweet@pm.me> 13 June 2023, 22:55:09 UTC
1e8d21b docs: fix wording for the upgrade guide Rephrase a recent change to Documentation/operations/upgrade.rst. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> 13 June 2023, 19:22:22 UTC
c3ffd99 ipsec: Don't rely on output-marks to know if state exists On kernels before 4.19, the XFRM output mark is not fully supported. Thus, when comparing XFRM states, if we compare the output marks, the existing states will never match the new state. The new state will have an output mark, but the states installed in the kernel don't have it (because the kernel ignored it). As a result, our current logic will assume that the state we want to install doesn't already exist: it will try to install it, fail because it already exists, assume there's a conflicting state, throw an error, remove the conflicting state, and install the new (but identical) one. The end result is therefore the same: the new state is in place in the kernel. But on the way to installing it, we will emit an unnecessary error and temporarily remove the state (potentially causing packet drops). Instead, we can safely ignore the output-marks when comparing states. We don't expect any states with the same IPs, SPI, and marks, but different output-marks anyway. The only way this could happen is if someone manually added such a state. Even if they did, the only impact would be that we wouldn't overwrite the manually-added state with the different output-mark. This patch is only necessary on v1.12 and earlier versions of Cilium because v1.13 dropped support for Linux <4.19. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
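A minimal Go sketch of the comparison change described above, using simplified stand-in types rather than the real netlink structures; the field set and the IPs/marks are illustrative only:

```go
package main

import (
	"fmt"
	"net"
)

// Simplified stand-ins for the relevant XFRM state fields; the real agent
// works with the netlink library's richer types.
type mark struct{ Value, Mask uint32 }

type xfrmState struct {
	Src, Dst   net.IP
	SPI        int
	Mark       mark
	OutputMark mark
}

// statesMatch compares the fields that identify a state to the kernel while
// deliberately ignoring OutputMark: on kernels before 4.19 the output mark is
// dropped on insertion, so comparing it would make an already-installed state
// look different from the identical state we are about to install.
func statesMatch(a, b xfrmState) bool {
	return a.Src.Equal(b.Src) &&
		a.Dst.Equal(b.Dst) &&
		a.SPI == b.SPI &&
		a.Mark == b.Mark
	// OutputMark is intentionally not compared.
}

func main() {
	installed := xfrmState{
		Src:  net.ParseIP("10.0.1.1"),
		Dst:  net.ParseIP("10.0.2.1"),
		SPI:  3,
		Mark: mark{0x3e00, 0xff00},
	}
	wanted := installed
	wanted.OutputMark = mark{0xe00, 0xf00} // ignored by an old kernel

	fmt.Println(statesMatch(installed, wanted)) // true: no spurious re-install
}
```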
2c0a06d ipsec: Don't attempt per-node route deletion when unexistant [ upstream commit 1e1e2f7e410d24e4af2d6dbd2cb2ceb016fb76b7 ] Commit 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS") changed the XFRM config to have one state and policy per remote node in IPAM modes ENI and Azure. The IPsec cleanup logic was therefore also updated to call deleteIPsec() whenever a remote node is deleted. However, we missed that the cleanup logic also tries to remove the per-node IP route. In case of IPAM modes ENI and Azure, the IP route however stays as before: we have a single route for all remote nodes. We therefore don't have anything to cleanup. Because of this unnecessary IP route cleanup attempt, an error message was printed for every remote node deletion: Unable to delete the IPsec route OUT from the host routing table This commit fixes it to avoid attempting this unnecessary cleanup. Fixes: 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 13 June 2023, 19:22:04 UTC
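As a rough illustration of the guard this commit adds, here is a hedged Go sketch with hypothetical IPAM mode names and a placeholder cleanup routine; the real agent's cleanup path is more involved:

```go
package main

import "fmt"

// Hypothetical IPAM mode names for the sketch.
const (
	ipamENI         = "eni"
	ipamAzure       = "azure"
	ipamClusterPool = "cluster-pool"
)

// cleanupIPsecForNode mirrors the fix: per-node XFRM configs are always
// removed, but in ENI and Azure IPAM modes the IPsec OUT route is shared by
// all remote nodes, so there is no per-node route to delete there.
func cleanupIPsecForNode(ipamMode, nodeName string) {
	fmt.Println("removing per-node XFRM state/policy for", nodeName)
	if ipamMode == ipamENI || ipamMode == ipamAzure {
		return // shared route stays; deleting it would only log an error
	}
	fmt.Println("removing per-node IPsec route for", nodeName)
}

func main() {
	cleanupIPsecForNode(ipamENI, "node-1")         // no route deletion attempted
	cleanupIPsecForNode(ipamClusterPool, "node-2") // per-node route removed
}
```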
60f5a1b ipsec: Only match appropriate XFRM configs with node ID [ upstream commit 57eac9d8b42a19f5aeae412f38de3eaf8bfadc4a ] With commit 9cc8a89f9 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs") we rely on the node ID to find XFRM states and policies that belong to remote nodes, to clean them up when remote nodes are deleted. This commit makes sure that we only do this for XFRM states and policies that actually match on these node IDs. That is only the case if the mark mask matches on the node ID bits. Thus, it should look like 0xffffff00 (matches on node ID, SPI, and encryption bit) or 0xffff0f00 (matches on node ID and encryption bit). Fixes: 9cc8a89f9 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 13 June 2023, 19:22:04 UTC
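A small Go sketch of the mask check described above; the mask constants come straight from the commit message, while the helper name is made up for illustration:

```go
package main

import "fmt"

// Masks quoted in the commit message: either the full node-ID+SPI+encryption
// mask or the node-ID+encryption mask.
const (
	maskNodeIDSPIEnc = 0xffffff00 // node ID, SPI, and encryption bit
	maskNodeIDEnc    = 0xffff0f00 // node ID and encryption bit
)

// matchesOnNodeID reports whether an XFRM mark mask actually selects on the
// node-ID bits; only such states and policies are safe to clean up by node ID.
func matchesOnNodeID(maskValue uint32) bool {
	return maskValue == maskNodeIDSPIEnc || maskValue == maskNodeIDEnc
}

func main() {
	fmt.Println(matchesOnNodeID(0xffffff00)) // true: per-node OUT config
	fmt.Println(matchesOnNodeID(0x0f00))     // false: mask does not cover the node-ID bits
}
```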
319fa31 ipsec: Only delete ipsec endpoint when node ID is not 0 [ upstream commit 25064d1ec51895ab89e2f736fcf7c6c66dfb5551 ] After applying a backport of 9cc8a89f9 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs") to 1.11.16, I noticed that we were getting occasional spikes of "no inbound state" xfrm errors (XfrmInNoStates). These lead to packet loss and brief outages for applications sending traffic to the node on which the spikes occur. I noticed that the "No node ID found for node." logline would appear at the time of these spikes and from the code this is logged when the node ID cannot be resolved. Looking a bit further the call to `DeleteIPsecEndpoint` will end up deleting the xfrm state for any state that matches the node id as derived from the mark in the state. The problem seems to be that the inbound state for 0.0.0.0/0 -> node IP has a mark of `0xd00` which when shifted >> 16 in `getNodeIDFromXfrmMark` matches nodeID 0 and so the inbound state gets deleted and the kernel drops all the inbound traffic as it no longer matches a state. This commit updates that logic to skip the XFRM state and policy deletion when the node ID is zero. Fixes: 9cc8a89f9 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs") Signed-off-by: Steven Johnson <sjdot@protonmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 13 June 2023, 19:22:04 UTC
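The following Go sketch illustrates the node-ID extraction and the zero-ID guard described above; getNodeIDFromXfrmMark here is a simplified stand-in for the real implementation, and the example marks are taken from the commit messages:

```go
package main

import "fmt"

// getNodeIDFromXfrmMark mirrors the idea described above: the node ID is
// encoded in the upper 16 bits of the XFRM mark value.
func getNodeIDFromXfrmMark(markValue uint32) uint16 {
	return uint16(markValue >> 16)
}

// deleteIPsecEndpoint sketches the guard added by the commit: an inbound
// state such as mark 0xd00 yields node ID 0, and deleting everything that
// "matches" node ID 0 would tear down the shared IN state and drop traffic.
func deleteIPsecEndpoint(nodeID uint16) {
	if nodeID == 0 {
		fmt.Println("skipping XFRM cleanup: node ID 0 would match the shared IN state")
		return
	}
	fmt.Printf("deleting XFRM states/policies for node ID %#x\n", nodeID)
}

func main() {
	deleteIPsecEndpoint(getNodeIDFromXfrmMark(0x0d00))     // skipped
	deleteIPsecEndpoint(getNodeIDFromXfrmMark(0x713f3e00)) // node ID 0x713f
}
```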
00bb13b ipsec: Fix IPv6 wildcard CIDR used in some IPsec policies [ upstream commit d0ab559441311dbe0908834a86d633aa9eeb6a84 ] We use this wildcard IPv6 CIDR in the catch-all default-drop OUT policy as well as in the FWD policy. It was incorrectly set to ::/128 instead of ::/0 and would therefore not match anything instead of matching everything. This commit fixes it. Fixes: e802c2985 ("ipsec: Refactor wildcard IP variables") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 13 June 2023, 19:22:04 UTC
3a5aa36 ipsec: Change XFRM FWD policy to simplest wildcard [ upstream commit ac54f2965908c06ff53e5a63a0f47b2448204a18 ] We recently changed our XFRM configuration to have one XFRM OUT policy per remote node, regardless of the IPAM mode being used. In doing so, we also moved the XFRM FWD policy to be installed once per remote node. With ENI and Azure IPAM modes, this wouldn't cause any issue because the XFRM FWD policy is the same regardless of the remote node. On other IPAM modes, however, the XFRM FWD policy is for some reason different depending on the remote node that triggered the installation. As a result, for those IPAM modes, one FWD policy is installed per remote node. And the deletion logic triggered on node deletions wasn't updated to take that into account. We thus have a leak of XFRM FWD policies. In the end, our FWD policy just needs to allow everything through without encrypting it. It doesn't need to be specific to any remote node. We can simply completely wildcard the match, to look like: src 0.0.0.0/0 dst 0.0.0.0/0 dir fwd priority 2975 ptype main tmpl src 0.0.0.0 dst 192.168.134.181 proto esp reqid 1 mode tunnel level use So we match all packets regardless of source and destination IPs. We don't match on the packet mark. There's a small implementation hurdle here. Because we used to install FWD policies of the form "src 0.0.0.0/0 dst 10.0.1.0/24", the kernel was able to deduce which IP family we are matching against and would adapt the 0.0.0.0/0 source CIDR to ::/0 as needed. Now that we are matching on 0/0 for both CIDRs, it cannot deduce this anymore. So instead, we must detect the IP family ourselves and use the proper CIDRs. In addition to changing the XFRM FWD policy to the above, we can also stop installing it once per remote node. It's enough to install it when we receive the event for the local node, once. Fixes: 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> 13 June 2023, 19:22:04 UTC
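The IP-family detail called out above can be illustrated with a short standard-library Go sketch; the helper name is hypothetical, and the real code builds full XFRM policies rather than just printing selectors:

```go
package main

import (
	"fmt"
	"net"
)

// wildcardCIDR returns the all-zero selector for the given family. Once both
// src and dst are fully wildcarded, the kernel can no longer infer the family
// from the destination CIDR, so we have to pick 0.0.0.0/0 or ::/0 ourselves.
func wildcardCIDR(ipv6 bool) *net.IPNet {
	if ipv6 {
		return &net.IPNet{IP: net.IPv6zero, Mask: net.CIDRMask(0, 128)}
	}
	return &net.IPNet{IP: net.IPv4zero, Mask: net.CIDRMask(0, 32)}
}

func main() {
	for _, ipv6 := range []bool{false, true} {
		wild := wildcardCIDR(ipv6)
		// One FWD policy per family, installed once for the local node,
		// instead of one per remote node as before.
		fmt.Printf("fwd policy selector: src %s dst %s (no mark match)\n", wild, wild)
	}
}
```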
10a2851 loader: In IPsec reload ignore veth devices & fix settle wait [ upstream commit 592777da560bea7838b99223386e943c08d5d052 ] reloadIPSecOnLinkChanges() did not ignore veth device updates causing reload to be triggered when new endpoints were created. Ignore any updates with "veth" as device type. The draining of updates during settle wait was broken due to unintentional breaking out of the loop. Removed the break. Fixes: bf0940b4ff ("loader: Reinitialize IPsec on device changes on ENI") Signed-off-by: Jussi Maki <jussi@isovalent.com> 13 June 2023, 19:22:04 UTC
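A rough Go sketch of the subscribe-filter-settle pattern described above, assuming the github.com/vishvananda/netlink package that Cilium uses for link updates; the settle timeout and the reload hook are placeholders and the real logic lives in the loader:

```go
package main

import (
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

func main() {
	updates := make(chan netlink.LinkUpdate)
	done := make(chan struct{})
	defer close(done)

	if err := netlink.LinkSubscribe(updates, done); err != nil {
		fmt.Println("subscribe failed:", err)
		return
	}

	settle := time.After(1 * time.Second) // placeholder settle period
	for {
		select {
		case u := <-updates:
			// Endpoint creation churns veth devices constantly; reloading
			// IPsec for those would be pure noise, so veth updates are skipped.
			if u.Link.Type() == "veth" {
				continue
			}
			fmt.Println("non-veth link change:", u.Link.Attrs().Name)
			// ... trigger IPsec reinitialization here ...
		case <-settle:
			// Keep draining updates until the settle timer fires; exiting the
			// loop early (the unintentional break mentioned above) would leave
			// pending updates unprocessed.
			fmt.Println("settled")
			return
		}
	}
}
```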
f421409 loader: Do not fatal on IPsec reinitialization [ upstream commit 470465550bc446b920a62c5b7f7b521cd10b0a9b ] Now that the code is reloading the bpf_network program at runtime, we should not fatal if we fail to reload the program, since this may be caused by ongoing interface changes (e.g. the interface was being removed). Change the log.Fatal into log.Error and keep loading the program onto the other interfaces. Fixes: bf0940b4ff ("loader: Reinitialize IPsec on device changes on ENI") Signed-off-by: Jussi Maki <jussi@isovalent.com> 13 June 2023, 19:22:04 UTC
fdba480 ipsec: Allow old and new XFRM OUT states to coexist for upgrade [ upstream commit c0d9b8c9e791b8419c63e5e80b52bc2b39f80030 ] Commit 73c36d45e0 ("ipsec: Match OUT XFRM states & policies using node IDs") changed our XFRM states to match on packet marks of the form 0xXXXXYe00/0xffffff00 where XXXX is the node ID and Y is the SPI. The previous format for the packet mark in XFRM states was 0xYe00/0xff00. According to the Linux kernel these two states conflict (because 0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00). That means we can't add the new state while the old one is around. Thus, in commit ddd491bd8 ("ipsec: Custom check for XFRM state existence"), we removed any old conflicting XFRM state before adding the new ones. That however causes packet drops on upgrades because we may remove the old XFRM state before bpf_lxc has been updated to use the new 0xXXXXYe00/0xffffff00 mark. Instead, we would need both XFRM state formats to coexist for the duration of the upgrade. Impossible, you say! Don't despair. Things are actually a bit more complicated (it's IPsec and Linux after all). While Linux doesn't allow us to add 0xXXXXYe00/0xffffff00 when 0xYe00/0xff00 exists, it does allow adding in the reverse order. That seems to be because 0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00 but 0xYe00/0xff00 ∉ 0xXXXXYe00/0xffffff00 [1]. Therefore, to have both XFRM states coexist, we can remove the old state, add the new one, then re-add the old state. That is allowed because we never try to add the new state when the old is present. During the short period of time when we have removed the old XFRM state, we can have packet drops due to the missing state. These drops should be limited to the specific node pair this XFRM state is handling. This will also only happen on upgrades. Finally, this shouldn't happen with ENI and Azure IPAM modes because they don't have such old conflicting states. I tested this upgrade path on a 20-node GKE cluster running our drop-sensitive application, migrate-svc, scaled up to 50 clients and 30 backends. I didn't get a single packet drop despite the application consistently sending packets back and forth between nodes. Thus, I think the window for drops to happen is really small. Diff before/after the upgrade (v1.13.0 -> this patch, GKE): src 10.24.1.77 dst 10.24.2.207 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0x3e00/0xff00 output-mark 0xe00/0xf00 aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128 - anti-replay context: seq 0x0, oseq 0x2c, bitmap 0x00000000 + anti-replay context: seq 0x0, oseq 0x16, bitmap 0x00000000 sel src 0.0.0.0/0 dst 0.0.0.0/0 + src 10.24.1.77 dst 10.24.2.207 + proto esp spi 0x00000003 reqid 1 mode tunnel + replay-window 0 + mark 0x713f3e00/0xffffff00 output-mark 0xe00/0xf00 + aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128 + anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000 + sel src 0.0.0.0/0 dst 0.0.0.0/0 We can notice that the counters for the existing XFRM state also changed (decreased). That's expected since the state got recreated. 1 - I think this is because XFRM states don't have priorities. So when two XFRM states would match a given packet (in our case a packet with mark XXXXYe00), the oldest XFRM state is taken. Thus, by not allowing a more specific match to be added after a more generic one, the kernel ensures that the more specific match is always taken when both match a packet. That likely corresponds to user expectations. That is, if both 0xXXXXYe00/0xffffff00 and 0xYe00/0xff00 match a packet, we would probably expect 0xXXXXYe00/0xffffff00 to be used. Fixes: ddd491bd8 ("ipsec: Custom check for XFRM state existence") Fixes: 73c36d45e0 ("ipsec: Match OUT XFRM states & policies using node IDs") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
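The remove/add/re-add ordering at the heart of this commit can be sketched as follows; the types and stub functions are illustrative stand-ins, not the real netlink calls, and the marks are the examples from the commit message:

```go
package main

import "fmt"

// Simplified stand-in for an XFRM OUT state keyed by its packet-mark match.
type xfrmOutState struct {
	MarkValue, MarkMask uint32
}

// Stub kernel operations for the sketch.
func stateDel(s xfrmOutState) { fmt.Printf("del state mark %#x/%#x\n", s.MarkValue, s.MarkMask) }
func stateAdd(s xfrmOutState) { fmt.Printf("add state mark %#x/%#x\n", s.MarkValue, s.MarkMask) }

// upgradeOutState illustrates the ordering trick: the kernel refuses to add
// the narrow per-node state (0xXXXXYe00/0xffffff00) while the broad legacy
// state (0xYe00/0xff00) exists, but accepts the broad state after the narrow
// one. So: delete old, add new, re-add old.
func upgradeOutState(oldState, newState xfrmOutState) {
	stateDel(oldState) // short window where drops are possible for this node pair
	stateAdd(newState)
	stateAdd(oldState) // re-added so not-yet-reloaded bpf_lxc keeps matching a state
}

func main() {
	oldState := xfrmOutState{MarkValue: 0x3e00, MarkMask: 0xff00}
	newState := xfrmOutState{MarkValue: 0x713f3e00, MarkMask: 0xffffff00}
	upgradeOutState(oldState, newState)
}
```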
5fd9b0b daemon: Reload bpf_host first in case of IPsec upgrade [ upstream commit ca9c056deb31f6e0747c951be24b25d67ea99d6d ] As explained in the previous commit, we need to switch our IPsec logic from one implementation to another. This implementation requires some synchronized work between bpf_lxc and bpf_host. To enable this switch without causing drops, the previous commit made bpf_host support both implementations. This isn't quite enough though. For this to work, we need to ensure that bpf_host is always reloaded before any bpf_lxc is loaded. That is, we need to load the bpf_host program that supports both implementations before we actually start the switch from one implementation to the second. This commit makes that change in the order of BPF program reloads. Instead of regenerating the bpf_host program (i.e., the host endpoint's datapath) in a goroutine like other BPF programs, we will regenerate it first, as a blocking operation. Regenerating the host endpoint's datapath separately like this will delay the agent startup. This regeneration was measured to take around 1 second on an EKS cluster (though it can probably grow to a few seconds depending on the node type and current load). That should stay fairly small compared to the overall duration of the agent startup (around 30 seconds). Nevertheless, this separate regeneration is only performed when we actually need it: for IPsec with EKS or AKS IPAM modes. Fixes: 4c7cce1bf ("bpf: Remove IP_POOLS IPsec code") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
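A simplified Go sketch of the reload ordering described above; the regeneration helpers are hypothetical placeholders for the agent's endpoint regeneration machinery:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// regenerateHostDatapath stands in for the blocking bpf_host regeneration.
func regenerateHostDatapath() {
	time.Sleep(100 * time.Millisecond) // placeholder for the ~1s compile/load measured above
	fmt.Println("bpf_host reloaded (supports old and new IPsec logic)")
}

// regenerateEndpointDatapath stands in for a bpf_lxc regeneration.
func regenerateEndpointDatapath(id int) {
	fmt.Printf("bpf_lxc for endpoint %d reloaded\n", id)
}

func main() {
	// Ordering described above: the host datapath is regenerated first and
	// synchronously, so every bpf_lxc loaded afterwards can rely on it.
	regenerateHostDatapath()

	var wg sync.WaitGroup
	for id := 1; id <= 3; id++ {
		wg.Add(1)
		go func(id int) { // other endpoints keep regenerating concurrently
			defer wg.Done()
			regenerateEndpointDatapath(id)
		}(id)
	}
	wg.Wait()
}
```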
7afecc7 bpf: Support the old IP_POOLS logic in bpf_host [ upstream commit 0af2303f534bb155918e86f07f0f3f4686d2a927 ] This commit reverts the bpf_host changes of commit 4c7cce1bf ("bpf: Remove IP_POOLS IPsec code"). The IP_POOLS IPsec code was a hack to avoid having one XFRM OUT policy and state per remote node. Instead, we had a single XFRM OUT policy and state that would encrypt traffic as usual, but encapsulate it with placeholder IP addresses, such as 0.0.0.0 -> 192.168.0.0. Those outer IP addresses would then be rewritten to the proper IPs in bpf_host. To that end, bpf_lxc would pass the destination IP address, the tunnel endpoint, to bpf_host via a skb->cb slot. The source IP address was hardcoded in the object file. Commit 4c7cce1bf ("bpf: Remove IP_POOLS IPsec code") thus got rid of that hack to instead have per-node XFRM OUT policies and states. The kernel therefore directly writes the proper outer IP addresses. Unfortunately, the transition from one implementation to the other isn't so simple. If we simply remove the old IP_POOLS code as done in commit 4c7cce1bf, then we will have drops on upgrade. We have two cases, depending on which of bpf_lxc or bpf_host is reloaded first: 1. If bpf_host is reloaded before the new bpf_lxc is loaded, then it won't rewrite the outer IP addresses anymore. In that case, we end up with packets of the form 0.0.0.0 -> 192.168.0.0 leaving on the wire. Obviously, they don't go far and end up dropped. 2. If bpf_lxc is reloaded before the new bpf_host, then it will reuse skb->cb for something else and the XFRM layer will handle the outer IP addresses. But because bpf_host is still on the old implementation, it will try to use skb->cb to rewrite the outer IP addresses. We thus end up with gibberish outer destination IP addresses. One way to fix this is to have bpf_host support both implementations. This is what this commit does. The logic to rewrite the outer IP addresses is reintroduced in bpf_host, but it is only executed if the outer source IP address is 0.0.0.0. That way, we will only rewrite the outer IP addresses if bpf_lxc is on the old implementation and the XFRM layer didn't write the proper outer IPs. Fixes: 4c7cce1bf ("bpf: Remove IP_POOLS IPsec code") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
a8ce874 ipsec: Deprioritize old XFRM OUT policy for dropless upgrade [ upstream commit a11d088154b2d3fe50d0ce750aca87b3fabb19e5 ] This is a revert, or rather a reimplementation, of commit 688dc9ac8 ("ipsec: Remove stale XFRM states and policies"). In that commit, we would remove the old XFRM OUT policies and states because they conflict with the new ones and prevent the installation from proceeding. This removal however causes a short window of packet drops on upgrade, between the time the old XFRM configs are removed and the new ones are added. These drops would show up as XfrmOutPolBlock because packets then match the catch-all default-drop XFRM policy. Instead of removing the old XFRM configs, a better, less-disruptive approach is to deprioritize them and add the new ones in front. To that end, we "lower" the priority of the old XFRM OUT policy from 0 to 50 (0 is the highest-possible priority). By doing this, the XFRM OUT state is also indirectly deprioritized because it is selected by the XFRM OUT policy. As with the code from commit 688dc9ac8 ("ipsec: Remove stale XFRM states and policies"), this whole logic can be removed in v1.15, once we are sure that nobody is upgrading with the old XFRM configs in place. At that point, we will be able to completely clean up those old XFRM configs. The priority of 50 was chosen arbitrarily, to be between the priority of new XFRM OUT configs (0) and the priority of the catch-all default-drop policy (100), while leaving space if we need to add additional rules of different priorities. Diff before/after upgrade (v1.13.0 -> this patch, GKE): src 10.24.1.0/24 dst 10.24.2.0/24 - dir out priority 0 + dir out priority 50 mark 0x3e00/0xff00 tmpl src 10.24.1.77 dst 10.24.2.207 proto esp spi 0x00000003 reqid 1 mode tunnel Fixes: 688dc9ac8 ("ipsec: Remove stale XFRM states and policies") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
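A minimal Go sketch of the priority scheme described above; the priority constants are taken from the commit message, while the policy type and helper are illustrative only:

```go
package main

import "fmt"

// Priorities described above (0 is the highest priority).
const (
	priorityNewOut   = 0   // new per-node XFRM OUT policies
	priorityStaleOut = 50  // old OUT policies, kept but deprioritized
	priorityCatchAll = 100 // catch-all default-drop policy (cf. the next entry, e64caec)
)

// Simplified stand-in for an XFRM OUT policy.
type xfrmOutPolicy struct {
	Selector string
	Priority int
}

// deprioritizeOldOutPolicy shows the upgrade-friendly alternative to deleting
// the stale policy: lower its priority so the new policy wins, while packets
// still match something and never fall through to the catch-all drop.
func deprioritizeOldOutPolicy(p *xfrmOutPolicy) {
	p.Priority = priorityStaleOut
}

func main() {
	old := &xfrmOutPolicy{Selector: "src 10.24.1.0/24 dst 10.24.2.0/24", Priority: priorityNewOut}
	deprioritizeOldOutPolicy(old)
	fmt.Printf("%s dir out priority %d\n", old.Selector, old.Priority)
}
```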
e64caec ipsec: Lower priority of catch-all XFRM policies [ upstream commit 3e898f26063531b9bf3883c5c79e347f15112631 ] This commit lowers the priority of the catch-all default-drop XFRM OUT policies, from 1 to 100. For context, 0 is the highest possible priority. This change will allow us to introduce several levels of priorities for XFRM OUT policies in subsequent commits. Diff before/after this patch: src 0.0.0.0/0 dst 0.0.0.0/0 - dir out action block priority 1 + dir out action block priority 100 mark 0xe00/0xf00 Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
ac40896 ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs [ upstream commit 9cc8a89f914195d52a8b3df021215b4051348b45 ] Our logic to clean up old XFRM configs on node deletion currently relies on the destination IP to identify the configs to remove. That doesn't work with ENI and Azure IPAMs, but until recently, it didn't need to. On ENI and Azure IPAMs we didn't have per-node XFRM configs. That changed in commit 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS"). We now need to clean up per-node XFRM configs for ENI and Azure IPAM modes as well, and we can't rely on the destination IP for that because the XFRM policies don't match on that destination IP. Instead, since commit 73c36d45e0 ("ipsec: Match OUT XFRM states & policies using node IDs"), we match the per-node XFRM configs using node IDs encoded in the packet mark. The good news is that this is true for all IPAM modes (whether Azure, ENI, cluster-pool, or something else). So our cleanup logic can now rely on the node ID of the deleted node to clean up its XFRM states and policies. This commit implements that. Fixes: 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
df1effa node_ids: New helper function getNodeIDForNode [ upstream commit 3201a5ee689ba650df414d3417d9a9a0ad677bf7 ] This commit simply refactors some existing code into a new getNodeIDForNode function. This function will be called from elsewhere in a subsequent commit. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
2217b29 loader: Reinitialize IPsec on device changes on ENI [ upstream commit bf0940b4ff6fcc54227137c1322c2e632e7a1819 ] If IPsec is enabled along with the ENI IPAM mode we need to load the bpf_network program onto new ENI devices when they're added at runtime. To fix this, we subscribe to netlink link updates to detect when new (non-veth) devices are added and reinitialize IPsec to load the BPF program onto the devices. The compilation of the bpf_network program has been moved to Reinitialize() to avoid spurious recompilation on reinitialization. Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
692f395 loader: Allow reinitializeIPSec to run multiple times [ upstream commit e880002be665e96473daced96f809b3b04f81e27 ] reinitializeIPSec only runs the interface detection if EncryptInterface is empty. Since it sets it after detecting interfaces, it will only run the detection once. Let's change that to run the detection even if the EncryptInterface list isn't empty. That will allow us to rerun the detection when new ENI devices are added on EKS. One consequence of this change is that we will now attach to all interfaces even if the user configured --encrypt-interface. That is fine because --encrypt-interface shouldn't actually be used in ENI mode. In ENI mode, we want to attach to all interfaces as we don't have a guarantee on which interface the IPsec traffic will come in. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
f1a44e9 ipsec: Flag to switch between IP types used for IPsec encap [ upstream commit 963e45b1c9a0a0d6420cfed6b0aaabbe45cb630e ] On EKS and AKS, IPsec used NodeInternalIPs for the encapsulation. This commit introduces a new flag to allow switching from NodeInternalIPs to CiliumInternalIPs; it defaults to the former. This new flag allows for step 3 of the migration plan defined in the previous commit. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
a4ce174 ipsec: Accept both CiliumInternalIP and NodeInternalIP on decrypt [ upstream commit 6b3b50d2f568bb145b09e5947ebe55df46e5bc3b ] On EKS and AKS, we currently use NodeInternalIPs for the IPsec tunnels. A subsequent commit will allow us to change that to switch to using CiliumInternalIPs (as done on GKE). For that to be possible without breaking inter-node connectivity for the whole duration of the switch, we need an intermediate mode where both CiliumInternalIPs and NodeInternalIPs are accepted on ingress. The idea is that we will then have a two-step migration from NodeInternalIP to CiliumInternalIP: 1. All nodes are using NodeInternalIP. 2. Upgrade to the version of Cilium that supports both NodeInternalIP and CiliumInternalIP and encapsulates IPsec traffic with NodeInternalIP. 3. Via an agent flag, tell Cilium to switch to encapsulating IPsec traffic with CiliumInternalIP. 4. All nodes are using CiliumInternalIP. This commit implements the logic for step 2 above. To that end, we will duplicate the XFRM IN states such that we have both: src 0.0.0.0 dst [NodeInternalIP] # existing src 0.0.0.0 dst [CiliumInternalIP] # new thus matching and being able to receive IPsec packets with an outer destination IP of either NodeInternalIP or CiliumInternalIP. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC
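A short Go sketch of the duplicated IN states described above, with illustrative IPs and a simplified state type; the real states also carry SPIs, marks, and encryption parameters:

```go
package main

import (
	"fmt"
	"net"
)

// Simplified stand-in for an inbound XFRM state keyed by its outer dst IP.
type xfrmInState struct {
	OuterDst net.IP
}

// ingressStates returns one IN state per acceptable outer destination IP, so
// a node can decrypt traffic addressed to either its NodeInternalIP (old
// sender behaviour) or its CiliumInternalIP (new sender behaviour) while the
// cluster is mid-migration.
func ingressStates(nodeInternalIP, ciliumInternalIP net.IP) []xfrmInState {
	return []xfrmInState{
		{OuterDst: nodeInternalIP},   // existing
		{OuterDst: ciliumInternalIP}, // new
	}
}

func main() {
	// Example IPs, purely illustrative.
	for _, s := range ingressStates(net.ParseIP("192.168.1.10"), net.ParseIP("10.24.1.77")) {
		fmt.Printf("src 0.0.0.0 dst %s\n", s.OuterDst)
	}
}
```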
2acc610 ipsec: Reintroduce NodeInternalIPs for EKS & AKS IPsec tunnels [ upstream commit 66c45ace70f1355d44efb9c325694375751a943d ] This is a partial revert of commit 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS"). One change that commit 3e59b681f ("ipsec: Per-node XFRM states & policies for EKS & AKS") made on EKS and AKS was to switch from using NodeInternalIPs to using CiliumInternalIPs for the outer IPsec (ESP) IP addresses. That made the logic more consistent with the logic we use for other IPAM schemes (e.g., GKE). It however causes serious connectivity issues on upgrades and downgrades. This is mostly because typically not all nodes are updated to the new Cilium version at the same time. If we consider two pods on nodes A and B trying to communicate, then node A may be using the old NodeInternalIPs while node B is already on the new CiliumInternalIPs. When node B sends traffic to node A, node A doesn't have the XFRM state IN necessary to decrypt it. The same happens in the other direction. This commit reintroduces the NodeInternalIPs for EKS and AKS. Subsequent commits will introduce additional changes to enable a proper migration path from NodeInternalIPs to CiliumInternalIPs. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> 13 June 2023, 19:22:04 UTC