https://github.com/cilium/cilium

b5bc930 Prepare for release v1.7.8 Signed-off-by: Joe Stringer <joe@cilium.io> 28 August 2020, 20:34:55 UTC
8d9e532 daemon: Add hidden allow-remote-src flag This flag allows overriding the default setting for whether to accept traffic into a node if the source node reports the source identity as "host". The default is "auto": allow if enableRemoteNodeIdentity is disabled; deny if enableRemoteNodeIdentity is enabled. If specified directly, it takes on the configured behaviour (allow such traffic if true, otherwise drop it). Signed-off-by: Joe Stringer <joe@cilium.io> 28 August 2020, 20:10:47 UTC
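A minimal Go sketch of how such a tri-state flag could resolve at runtime, based solely on the behaviour described in the commit message above; the function and parameter names are illustrative, not the daemon's actual code:

```go
package main

import "fmt"

// allowFromRemoteHost is an illustrative helper (not Cilium's actual code)
// mirroring the flag semantics described above: "auto" follows the inverse
// of enable-remote-node-identity, while "true"/"false" force the behaviour.
func allowFromRemoteHost(allowRemoteSrc string, enableRemoteNodeIdentity bool) bool {
	switch allowRemoteSrc {
	case "true":
		return true
	case "false":
		return false
	default: // "auto"
		return !enableRemoteNodeIdentity
	}
}

func main() {
	// With remote-node-identity enabled, "auto" denies traffic claiming the "host" identity.
	fmt.Println(allowFromRemoteHost("auto", true))  // false
	fmt.Println(allowFromRemoteHost("auto", false)) // true
}
```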
830f9cd bpf: Allow from host with remote-node-identity=false According to the documentation: One can set enable-remote-node-identity=false in the ConfigMap to retain the Cilium 1.6.x behavior. The above is true when evaluating policy, but not true from the DROP_INVALID_IDENTITY perspective as during an upgrade from v1.6 to v1.7, v1.6 nodes may send traffic to v1.7 nodes with the "host" identity and the v1.7 nodes will drop such traffic with this error code. Mitigate this by also covering this datapath case with the `EnableRemoteNodeIdentity` flag check. Signed-off-by: Joe Stringer <joe@cilium.io> 28 August 2020, 20:10:47 UTC
f3e57e4 iptables, loader: add rules to ensure symmetric routing for AWS ENI traffic [ upstream commit 132088c996a59e64d8f848c88f3c0c93a654290c ] Multi-node NodePort traffic with AWS ENI needs a set of specific rules that are usually set by the AWS DaemonSet: # sysctl -w net.ipv4.conf.eth0.rp_filter=2 # iptables -t mangle -A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80 # iptables -t mangle -A PREROUTING -i eni+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80 # ip rule add fwmark 0x80/0x80 lookup main These rules mark packets coming from another node through eth0, and restore the mark on the return path to force a lookup into the main routing table. Without them, the "ip rules" set by the cilium-cni plugin tell the host to lookup into the table related to the VPC for which the CIDR used by the endpoint has been configured. We want to reproduce equivalent rules to ensure correct routing, or multi-node NodePort traffic will not be routed correctly. This could be observed with the pod-to-b-multi-node-nodeport pod from connectivity check never getting ready. This commit makes the loader and iptables module create the relevant rules when ENI is in use. The rules are nearly identical to those from the aws daemonset (different comments, different interface prefix for conntrack return path, explicit preference for ip rule): # sysctl -w net.ipv4.conf.<egressMasqueradeInterfaces>.rp_filter=2 # iptables -t mangle -A PREROUTING -i <egressMasqueradeInterfaces> -m comment --comment "cilium: primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80 # iptables -t mangle -A PREROUTING -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80 # ip rule add fwmark 0x80/0x80 lookup main pref 109 Fixes: #12098 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 20:09:06 UTC
1819921 daemon: properly maintain node lists on updates [ upstream commit 5550c0f3f2206d05f3ef3af569ab756cbba94fae ] NodeAdd and NodeUpdate update the node state for clients so that they can return the changes when a client requests them. If a node was added and then updated, its old and new version would be on the added list and its old version on the removed list. Instead, we can just update the node on the added list. Note that the setNodes() function in pkg/health/server/prober.go first deletes the removed nodes and then adds the new ones, which means that the old version of the node would be added and remain stale on the health server. This was found during investigation of issues with inconsistent health reports when nodes are added/removed from the cluster (e.g., #11532), and it seems to fix inconsistencies observed in a small-scale test I did to reproduce the issue. Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 28 August 2020, 18:26:15 UTC
93f4976 docs: limit copybutton to content area only [ upstream commit 6711a0ce13cceb217df187c492f11e7879cb3a09 ] Fixes the copy button so that it does not conflict with the search Signed-off-by: Sergey Generalov <sergey@genbit.ru> Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 28 August 2020, 18:26:15 UTC
ad25469 Upgrade Cilium docs theme version [ upstream commit eeec4d0e00549a886511069ebf6784042d93550c ] Signed-off-by: Nicolas Jacques <neela@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 28 August 2020, 18:26:15 UTC
035a85b vagrant: Don't use the NFS device's IP as node IP [ upstream commit 1c37921003a824568d8165de4625c4ce390df37c ] The K8s node IP is the IP address propagated to other nodes and mapped to the REMOTE_NODE_ID in the ipcache. We therefore don't want to use the IP address of the NFS interface (enp0s9) for that. When we use that IP address, any policy using the remote-node identity (or host in case the two aren't dissociated) will fail to resolve properly. In general, I don't think K8s even needs to know about the NFS interface or its IP addresses. Fixes: 0eafea4 ("examples/kubernetes-ingress: fixing scripts to run k8s 1.8.1") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 28 August 2020, 18:26:15 UTC
2cebf91 fix: node-init should use docker if /etc/crictl.yaml not found [ upstream commit 552c823f561149213807627cfbd724c39dbd8a10 ] This script has several tests for what the container runtime situation looks like to determine how best to restart the underlying containers (going around the kubelet) so that the new networking configuration can take effect. The first test looks to see if the crictl config file is configured to use docker, but if that file doesn't exist then it fails. I believe docker is the default if this hasn't been configured at all, so if that file doesn't exist, use docker. Fixes #12850 Signed-off-by: Nathan Bird <njbird@infiniteenergy.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 01:54:54 UTC
9fdee51 etcd: Make keepalive interval and timeout configurable [ upstream commit a4a1df0289a3067e3c9913c894b322b64cc3b0e1 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 01:54:54 UTC
a1326e5 pkg/kvstore: add gRPC keep alives for etcd connectivity [ upstream commit 268f4066e4f8d245f67d3cfc305a11d76ffffb1e ] If the client does not receive a keep alive from the server, that connection should be closed so that the etcd client library does proper round robin across the other available endpoints. This might be a little bit aggressive in a larger environment if all clients perform keep alive requests to the etcd servers. Some testing could be done to verify whether there is a large overhead from doing these keep alive requests. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 01:54:54 UTC
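For reference, a minimal, self-contained sketch of gRPC-level keepalives on an etcd client using the public clientv3 API (the import path varies by etcd release; the endpoint and intervals below are placeholders, not the values Cilium uses):

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// gRPC-level keepalives: if the server stops acknowledging pings, the
	// underlying connection is torn down so the client library can round-robin
	// to another available endpoint. Values below are illustrative only.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"https://127.0.0.1:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    15 * time.Second, // send a keepalive ping every 15s
		DialKeepAliveTimeout: 5 * time.Second,  // close the connection if no ack within 5s
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```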
8f5a278 datapath: Pull skb data in to-netdev path [ upstream commit 2960b5f56ad048fe04560e01349b36c2422c8afc ] It has been reported [1][2] that ICMP packets are being dropped by a receiving node due to DROP_INVALID when bpf_host was attached to the receiving iface. Further look into the issue revealed that the drops were happening because IP headers were not in the skb linear data (unsuccessful revalidate_data() caused the DROP_INVALID return). Fix this by making sure that the first invocation of revalidate_data() in the "to-netdev" path will always do skb_data_pull() before deciding that the packet is invalid. [1]: https://github.com/cilium/cilium/issues/11802 [2]: https://github.com/cilium/cilium/issues/12854 Reported-by: Andrei Kvapil <kvapss@gmail.com> Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 01:54:54 UTC
6d2c9d1 docs/metrics: Correct label typo `equal` in metrics.rst [ upstream commit 85600be8c0d73ca564661979f37c37e63340cd2d ] This PR is to correct simple typo equal in metrics.rst Signed-off-by: Tam Mach <sayboras@yahoo.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 28 August 2020, 01:54:54 UTC
2f6f93a doc: fix the AKS installation validation Before this patch, the AKS documentation tries to find cilium in the kube-system namespace although it is installed in the cilium namespace (similar to the GKE documentation). Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 27 August 2020, 17:03:47 UTC
35f1b10 doc: Specify CILIUM_NAMESPACE for Hubble installation instruction [ upstream commit 575bff841b3f796d255e009da74944e4b7166b3a ] This makes it easier to follow the instructions, especially for GKE which uses cilium namespace instead of kube-system. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch> 27 August 2020, 17:03:47 UTC
88ecd7f operator: make EC2 AWS API endpoint configurable Add a new --ec2-api-endpoint operator option which allows specifying a custom AWS API endpoint for the EC2 service. One possible use-case for this is the usage of FIPS endpoints, see https://aws.amazon.com/compliance/fips/. For example, to use API endpoint ec2-fips.us-west-1.amazonaws.com, the AWS operator can be called using: cilium-operator --ec2-api-endpoint=ec2-fips.us-west-1.amazonaws.com Updates #12620 Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 18 August 2020, 07:01:58 UTC
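As background, a hedged sketch of how a custom EC2 endpoint is commonly wired up with the AWS SDK for Go v1; this is not necessarily how the operator implements the flag internally, and the endpoint/region values simply reuse the example from the commit message:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Point the EC2 client at a custom (here: FIPS) endpoint instead of the
	// default regional one. Credentials come from the usual SDK chain.
	sess := session.Must(session.NewSession())
	cfg := aws.NewConfig().
		WithRegion("us-west-1").
		WithEndpoint("https://ec2-fips.us-west-1.amazonaws.com")

	svc := ec2.New(sess, cfg)
	out, err := svc.DescribeRegions(&ec2.DescribeRegionsInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("regions visible via custom endpoint:", len(out.Regions))
}
```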
bb0ee9c Istio: Update to release 1.5.9 [ upstream commit 8dca1c8e10138e99e664951a4d3540154bb25117 ] Signed-off-by: Jarno Rajahalme <jarno@covalent.io> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 14 August 2020, 12:46:19 UTC
34c7adf daemon: Add hidden --k8s-sync-timeout option [ upstream commit bd89e83a4245769dac42860cf928e2dd7c227ce1 ] This option governs how long Cilium agent will wait to synchronize local caches with global Kubernetes state before exiting. The default is 3 minutes. Don't expose it by default, this is for advanced tweaking. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 14 August 2020, 12:46:19 UTC
fbab570 k8s: update k8s versions to 1.17.11 Also update test versions to 1.17.11 and 1.16.14 Signed-off-by: André Martins <andre@cilium.io> 14 August 2020, 10:41:13 UTC
01b935e operator: Fix non-leader crashing with kvstore [ upstream commit 3d376fae42ab0fac43403dfec6f08fe7eecb3234 ] A non-leader operator will hang during its healthcheck report as it tries to check the status of the kvstore. The reason it hangs is that the leader operator is the only one that has access to the client. This hang causes an HTTP level timeout on the kubelet liveness check. The timeout then causes kubelet to roll the pod, eventually into CrashLoopBackOff. ``` Warning Unhealthy 8m17s (x19 over 17m) kubelet, ip-10-0-12-239.us-west-2.compute.internal Liveness probe failed: Get http://127.0.0.1:9234/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers) ``` Signed-off-by: Chris Tarazi <chris@isovalent.com> 11 August 2020, 03:46:40 UTC
de6a5ac Update Go to 1.13.15 Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 07 August 2020, 15:35:11 UTC
5aeb7ff docs: clarify Kubernetes compatibility with Cilium [ upstream commit 038877cc5a71cb546a27d53f97d4b4fa46f592b4 ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 07 August 2020, 05:41:47 UTC
492a26c docs: add current k8s network policy limitations [ upstream commit c767682be85bb96e7398aa492f538867591943b3 ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 07 August 2020, 05:41:47 UTC
0cee8c8 doc: update #ebpf Slack channel name [ upstream commit 0547ea4b5c86e81bf8967ed21dd87b0007fd3d67 ] The Slack channel dedicated to discussions on eBPF and datapath has been renamed from #bpf to #eBPF (on 2020-08-03). Report this change to Cilium's documentation, and also turn "BPF" into "eBPF" on the updated page. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 07 August 2020, 05:41:47 UTC
1c83817 Makes ExecInPods not wait for pod ready during log gathering upon test failure. [ upstream commit 7986df9d25ad345be97e71a5af3e3720367c9533 ] Signed-off-by: Weilong Cui <cuiwl@google.com> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 07 August 2020, 05:41:47 UTC
f2615d6 test: Disable K8sKubeProxyFreeMatrix [ upstream commit e09d991a1970ffd4a286382322091ffc64d40add ] The suite does not provide much value because of the following reasons: - It does not test the kube-proxy replacement from outside, so only bpf_sock is tested. - K8sServicesTest should provide the same coverage. - It takes 20min to run the suite. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 07 August 2020, 05:41:47 UTC
b67a6fe install: Default cilium-operator replicas to 1 Given that Cilium v1.7.x was not released with operator HA and we are otherwise disabling this functionality (including etcd heartbeats) by default in the v1.7 branch, we should also not surprise users during the v1.7.7->v1.7.8 point release update, who would upgrade only to find more copies of the cilium-operator running in their cluster. Users can still opt in by explicitly configuring the number of replicas. Signed-off-by: Joe Stringer <joe@cilium.io> 05 August 2020, 21:47:27 UTC
da02ac8 install/kubernetes: do not schedule cilium-operator pods in same node [ upstream commit bde2daf77fa8bac84a66c803ecc18f78df779082 ] Since Cilium Operator is running in host network, 2 or more pods can't run on the same node at the same time or they will clash on the ports they open for liveness and / or readiness health checks. Fixes: 930bde726974 ("install: update helm templates to add HA capabilities for operator") Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
233aeb1 test: generate cilium helm template validating against k8s cluster [ upstream commit 82cc7c3d076149e147cd120f7ee4424317a39ccd ] * Use --validate with `helm template` command to validate the generated manifest against the associated kubernetes cluster * For more information see - https://github.com/cilium/cilium/pull/12409#discussion_r453313631 Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
b88b96e install: update helm templates to add HA capabilities for operator [ upstream commit 930bde726974320196583cde03ccf1c57af55606 ] Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
a625703 operator: support HA mode for operator using k8s leaderelection library [ upstream commit df90c99905ad107710ce66d2dd36820f068db189 ] * Make leaderelection parameters configurable using command line flags * Update cmdref to include documentation for new flags. Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
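For context, a minimal sketch of leader election with the k8s.io/client-go leaderelection library referenced above; the lock name, namespace, identity, and durations are placeholders, not the operator's actual configuration:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Identity distinguishes the replicas competing for the same Lease lock.
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator-lock", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // placeholder durations
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader, starting work") // only the leader runs the real logic
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership, shutting down")
			},
		},
	})
}
```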
81c2d35 k8s: add coordinationv1 capability check to k8s version package [ upstream commit fb101dfc04ddb6277413207eb6b6580f4be82b82 ] * Introduces config option `K8sLeasesFallbackDiscoveryEnabled` to check if fallback discovery is enabled for Leases. * K8sLeasesFallbackDiscovery is enabled by default only in operator. Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
f7431a5 vendor: vendor kubernetes leaderelection library [ upstream commit 66c3d9c4d3ef85d57914ec1a595f0226fcde7e00 ] Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
469af4e rand: rename RandomRune* funcs to RandomString* [ upstream commit 3c9c9709cf6a82da9060fa824119c24463e18c0a ] These functions return strings, not a rune. Rename them accordingly. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: André Martins <andre@cilium.io> 05 August 2020, 21:47:27 UTC
ae9f83e ctmap: Add unit test for ICMP CT/NAT GC [ upstream commit ae3917b64dd04d35c68e5746b1b670555a9f0ffe ] Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 05 August 2020, 21:47:27 UTC
efeec6c datapath: Fix CT tuple ports for ICMP Echo [ upstream commit c633c2d02d59cc9429800bd1912187f246eaf0ac ] Previously, when an ICMP EchoRequest was sent from one node A to another node B with Echo ID > NAT_MIN_EGRESS, the ICMP EchoReply sent from B -> A created a CT entry and NAT entries which could not be related by GC. E.g. node A (192.168.34.12) pings node B (192.168.34.11): ICMP IN 192.168.34.12:0 -> 192.168.34.11:38193 XLATE_DST 192.168.34.11:38193 Created=6292sec HostLocal=1 ICMP OUT 192.168.34.11:38193 -> 192.168.34.12:0 XLATE_SRC 192.168.34.11:38193 Created=6292sec HostLocal=1 ICMP OUT 192.168.34.11:0 -> 192.168.34.12:38193 expires=16783063 RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=1 TxBytes=50 TxFlagsSeen=0x00 LastTxReport=16783005 Flags=0x0000 [ ] RevNAT=0 SourceSecurityID=0 IfIndex=0 This made the NAT entries escape the CT GC, meaning that the CT entry was removed while the NAT entries were kept, which made them stay forever until a user manually ran "cilium bpf nat flush". Fix this by setting the ICMP Echo ID in the port which belongs to the address of the local node, so that the CT GC can relate the NAT entries. In the previous example, the CT entry after the fix is the following: ICMP OUT 192.168.34.11:38193 -> 192.168.34.12:0 expires=16783063 RxPackets=0 RxBytes=0 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=1 TxBytes=50 TxFlagsSeen=0x00 LastTxReport=16783005 Flags=0x0000 [ ] RevNAT=0 SourceSecurityID=0 IfIndex=0 The fix does not change the ID placement in the port for the case when B -> A sends an ICMP EchoRequest. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net> 05 August 2020, 21:47:27 UTC
bd9d164 Prepare for release v1.7.7 Signed-off-by: Joe Stringer <joe@cilium.io> 05 August 2020, 01:26:41 UTC
c55a82d etcd: Fix firstSession error handling [ upstream commit 40026dbb211a43061ac8bbd9d534a3a1fa1e562f ] The commit bf8e4327448 ("etcd: Ensure that firstSession is closed") incorrectly assumed that only a single reader exists for firstSession. This is not the case: the error returned via the channel will only be read by one of the readers, and the other readers will assume success and continue in their code logic even though the etcd client is being shut down. Fixes: bf8e4327448 Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 05 August 2020, 01:00:07 UTC
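The multiple-readers pitfall and its usual remedy can be illustrated with a generic Go pattern (not the actual Cilium etcd code): store the error once and close a channel, so every waiter observes the same outcome instead of the error being consumed by whichever single reader happens to receive it:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// firstResult broadcasts a one-time result to any number of waiters: the error
// is stored once, then the channel is closed, so every reader sees the same
// outcome. Sending the error on a channel instead would deliver it to a single
// reader only -- the bug described above. Generic illustration only.
type firstResult struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newFirstResult() *firstResult { return &firstResult{done: make(chan struct{})} }

func (f *firstResult) complete(err error) {
	f.once.Do(func() {
		f.err = err
		close(f.done)
	})
}

func (f *firstResult) wait() error {
	<-f.done
	return f.err
}

func main() {
	r := newFirstResult()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Printf("reader %d saw: %v\n", i, r.wait()) // all three see the same error
		}(i)
	}
	r.complete(errors.New("etcd client shutting down"))
	wg.Wait()
}
```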
cb0befd test: Validate FQDN connectivity during restart This commit reworks how the FQDN test is run. It now validates that connectivity is still available during a Cilium restart, instead of waiting until Cilium is back up. This allows validating https://github.com/cilium/cilium/pull/12718 and https://github.com/cilium/cilium/pull/12731, which improve the time it takes the proxy to respond to DNS requests. Signed-off-by: Chris Tarazi <chris@isovalent.com> 04 August 2020, 10:10:47 UTC
0f78aa0 test: Remove temporary wait for endpoints in FQDN This can be removed now that https://github.com/cilium/cilium/pull/12718 has been merged. The aforementioned PR should improve the DNS connectivity downtime when Cilium is restarting. Related: https://github.com/cilium/cilium/pull/12731 Signed-off-by: Chris Tarazi <chris@isovalent.com> 04 August 2020, 10:10:47 UTC
3ffa3c5 test: Refactor FQDN test for readability This commit splits out the function to test connectivity into components to be reused. The call sites are consolidated. Signed-off-by: Chris Tarazi <chris@isovalent.com> 04 August 2020, 10:10:47 UTC
629efeb etcd: Disable heartbeat quorum check by default Disable the heartbeat check by default, as the sudden requirement for the cilium-operator to be always available can come as a surprise to existing 1.7 users. Require etcd.enableHeartbeat=true to be set in order to enable the requirement for the heartbeat. Signed-off-by: Thomas Graf <thomas@cilium.io> 04 August 2020, 08:05:33 UTC
5c12c1a endpoint: Demote proxy stats not found warnings to debug level "Proxy stats not found when updating" warnings are currently issued if stats updates are received for a proxy redirect that can not be found. There are two common scenarios where this can happen as part of normal operation: 1. A policy change removed a proxy redirect, and stats updates from requests that were redirected to the proxy before the datapath redirect entry was removed are received. 2. DNS proxy issues stats for requests that have been forwarded on the basis of a restored DNS policy while the Endpoint policy has not yet been computed. Demote this log message to debug level to avoid these false warnings. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 01 August 2020, 10:23:27 UTC
d7da834 dnsproxy: Use restored Endpoints before Endpoints are available Use restored Endpoints during Cilium restart when Endpoints are not yet available. Do not error out if the destination IP can not be found from ipcache, but default to WORLD destination security identity instead. This allows IP-based restored rules to be processed before ipcache is fully updated. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 01 August 2020, 10:23:27 UTC
adac95f ipcache: Fix unit test flake This list was unsorted which caused random ordering at test run time. Fixes: #12733 Fixes: c3e19d8ee0f6 ("dnsproxy: Use restored rules during restart") Signed-off-by: Joe Stringer <joe@cilium.io> 01 August 2020, 00:40:20 UTC
88174d7 fqdn/dnsproxy: set SO_REUSEPORT on listening socket Now that we start re-using the same port for the DNS proxy across restarts (see #12718), set the SO_REUSEPORT option on the listening socket. This gives the proxy a better chance to re-bind() upon restarts. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 31 July 2020, 14:25:32 UTC
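A self-contained sketch of setting SO_REUSEPORT before bind() on a UDP listener in Go, the general technique this commit applies; the address and port are placeholders and this is not the DNS proxy's actual listener code:

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	// Set SO_REUSEPORT on the socket before bind(), so a restarting process
	// can re-bind the same port while an old socket is still draining.
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	conn, err := lc.ListenPacket(context.Background(), "udp", "127.0.0.1:10001") // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```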
a1b7232 agent: Fix bootstrap metric for kvstore [ Backporter's notes: Resolved conflict with `k8sCachesSynced` channel, which was moved to another location in the upstream commit. ] [ upstream commit 87d68ea095158dbf347bdd5ea6aca17566dd05a2 ] Do not account kvstore initialization as k8s bootstrap time. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com> 31 July 2020, 10:18:57 UTC
389705e k8s: Register CRDs in parallel [ Backporter's notes: Ran `go mod tidy && go mod vendor` to retrieve errgroup external package. ] [ upstream commit c8fd3e9c5d10914576abd670971f81e4c7c60a3a ] Individual CRD registrations do not depend on each other, the registration can be done in parallel. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com> 31 July 2020, 10:18:57 UTC
2ba8ae9 fqdn: Limit the max processing in GetRules() Normally the number of DNS proxy rules should be very small. To guard against pathological cases, limit the number of IPs processed to 1000 per port. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 31 July 2020, 07:17:16 UTC
759af79 endpoint: Remove restored DNS rules Remove restored DNS rules after a successful regeneration, and also at endpoint delete to cover endpoints that were never regenerated. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 31 July 2020, 07:17:16 UTC
401f171 endpoint: Update DNSRules on header rewrite Update DNSRules, if any, before writing headers to capture potentially changed allowed destination IPs. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 31 July 2020, 07:17:16 UTC
c3e19d8 dnsproxy: Use restored rules during restart Store current DNS rules with the Endpoint and use them in the DNS proxy during initial regeneration of the restored endpoints. The DNS proxy starts with restored DNS rules based on allowed IP addresses. These rules are removed for each endpoint as soon as its first regeneration completes. Such restored rules should allow DNS requests to be served, but for new DNS resolutions to be added to the Endpoint's policy, the restored endpoints must still have their first regeneration completed. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 31 July 2020, 07:17:16 UTC
99196d5 daemon: Minimize DNS proxy downtime on restart Start proxy support earlier in the daemon bootstrap, notably before any k8s setup. Fetch old endpoints earlier so that the DNS history is available before k8s is set up, and move DNS proxy initialization earlier in the bootstrap. Reuse the DNS proxy port from the previous run on restart unless overridden by an explicit Cilium agent option. These changes allow the DNS proxy to start serving requests as soon as the toFQDN policy is received from k8s and avoid any service disruption previously possible due to endpoints being regenerated before the DNS proxy was started. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 31 July 2020, 07:17:16 UTC
4ab9da2 install/kubernetes: re-add removed permissions from clusterrole During a rolling upgrade across minor versions, Cilium's clusterrole might change. When doing this upgrade, the newer Cilium version cannot have permissions removed from its clusterrole, or the older Cilium version might fail as it might still require such permissions. Re-adding them, by having all permissions required by both versions, will make sure that Cilium can run successfully while the rolling upgrade happens. Signed-off-by: André Martins <andre@cilium.io> 31 July 2020, 00:10:02 UTC
5d0ab10 contrib: Tighten search for list of PRs [ upstream commit 1797d0ee97cf1bc69ac6d1d43493c4e91d13bb56 ] Previously, if "set-labels.py" was in the PR title, then the `grep` would pick up extraneous lines which throw off the parsing. See failed example below: ``` $ ./contrib/backporting/submit-backport v1.8 ... Updating labels for PRs * #12640 -- backporting: Report progress in set-labels.py (@pchaigno) 12640 12626 12632 12654 12651 12652 12659 12521 12683 Set labels for all PRs above? [y/N] y usage: set-labels.py [-h] pr_number {pending,done} [version] set-labels.py: error: argument pr_number: invalid int value: 'api' Signal ERR caught! ``` Fixes: 3c4d43af8f ("contrib: Fix submit-backport PR set-labels detection") Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
80fdcc7 contrib: Print PR number in set-labels.py [ upstream commit 3b58cf6227428c4e583644c188dac269786bfb8a ] Simple enough change and improves usability. Fixes: 9fdaf24555 ("backporting: Report progress in set-labels.py") Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
dff547f docs(identity): Correct discrepancy between label and descriptions [ upstream commit db31fd833cf1b94d179cc3b25012dcd1d540757c ] Some of the mentioned labels were not in sync with their descriptions. This PR corrects that discrepancy. The reference is the code in pkg/labelsfilter/filter.go Signed-off-by: Tam Mach <sayboras@yahoo.com> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
74f8c8f backporting: Report progress in set-labels.py [ upstream commit 9fdaf245550e68c60fc3176dd3f02810b475b05f ] When working on the backport PRs, there are two steps where we update PR labels: when we create the backport PR and when we merge it. On master, if submitting the backport PR with submit-backport, it reports the progress as it's updating the labels. However, running the command below, included in the PR, to update labels once merged doesn't report any progress. contrib/backporting/set-labels.py 12345 12346 done 1.8 This small commit fixes it to report progress in both cases. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
0366505 fqdn/dnsproxy/proxy_test: increase again timeout for DNS TCP exchanges [ upstream commit fff58ef30a26fcf07a26e455e565d60df07385a4 ] Follow-up to #12305, where we raised the timeout from 100ms to 500ms. This seemed to suppress most of the flakes reported in #12042, but we saw one again recently: Try restoring the timeout value to its original value of 1 second. Most of the time the RTT for the exchange is way below 100ms anyway, so we won't see a difference in test duration. In the worst and very unlikely case where all DNS TCP exchanges are super-slow, we only have 5 exchanges in the tests and cannot spend more than a total of 5 seconds on them (or one would time out and the test would fail). Fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
ff74093 contrib: Fix submit-backport PR set-labels detection [ upstream commit 3c4d43af8f98d32115cce833925387d37d23a49b ] Previously if you manually edited the command in the backport summary file and referenced PRs elsewhere in the summary file, then the script would still ask if you wanted to update the labels for *all* PRs mentioned in the summary. This commit changes it to only ask whether you want to set the labels for the set of PRs that are listed in the *command* that's provided in the summary file. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Quentin Monnet <quentin@isovalent.com> 30 July 2020, 23:54:00 UTC
2e59e84 k8s: update k8s dependencies to 1.17.9 Also update k8s testing versions to 1.17.9 Signed-off-by: André Martins <andre@cilium.io> 27 July 2020, 11:49:33 UTC
350c881 etcd: Fix incorrect context usage in session renewal [ upstream commit 02628547e06fde912a8fafb0a71d61cddc72dae3 ] The context passed into NewSession() was supposed to enforce a timeout on the NewSession() operation which is triggering a Grant() and KeepAlive() instruction. However, the context passed into NewSession() will also be associated with the resulting lease. As per etcd documentation, this will result in: > If the context is canceled before Close() completes, the session's lease will > be abandoned and left to expire instead of being revoked. Because of this, any session renewal triggering a new session would create a session that is immediately closed again due to the context passed into NewSession being cancelled when the controller run ends successfully. This resulted in any renewed session to have an effective lifetime of 10 milliseconds before requiring renewal again. Fixes: #12619 Fixes: 8524fca879b ("kvstore: Add session renew backoff") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
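A hedged sketch of the pitfall using the public etcd clientv3/concurrency API (the import path varies by etcd release); the helper name and TTL are illustrative, not Cilium's kvstore code:

```go
package main

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

// newSession shows the pitfall: the context passed to concurrency.NewSession()
// stays attached to the session's lease keepalive, so a short "operation
// timeout" context must not be used for it. Illustrative only.
func newSession(lifetimeCtx context.Context, cli *clientv3.Client) (*concurrency.Session, error) {
	// Wrong (the bug described above): creating a context.WithTimeout here and
	// cancelling it once NewSession() returns abandons the lease, so the
	// session silently expires ~TTL later and must immediately be renewed.

	// Right: bind the session to a long-lived context (e.g. the etcd client's
	// lifecycle context) and keep any creation timeout separate from it.
	return concurrency.NewSession(cli,
		concurrency.WithContext(lifetimeCtx),
		concurrency.WithTTL(15), // lease TTL in seconds; placeholder value
	)
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	s, err := newSession(context.Background(), cli)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Close()
}
```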
dc5d269 etcd: Lock e.session when renewing session [ upstream commit 02a18106d7c8dff351f7caa18aa2f71050487b41 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
9063810 etcd: Report reason when lock acquisition in status check fails [ upstream commit 589bdf3e995beddbcdbfd0a1d450cddcff705def ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
00ceace etcd: Print info message when connection is established [ upstream commit e959c9f1d59c6b6cbe22813fdb1431a1a8565fa0 ] Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
9f53e7f etcd: Ensure that firstSession is closed [ upstream commit bf8e4327448ebb4f98bd49d20e9facd21c2efa3b ] The firstSession channel has been used to allow blocking until the initial session has been established. However, because callers have not been handling errors, it was not possible to close the channel if the etcd client was shut down before the session was ever established and thus the channel was leaked. This was even more problematic for etcd operations performed without a context with a timeout as most operations currently block on firstSession or context. Consolidate the errChan and firstSession as they effectively served the same purpose. The only slight change is that the firstSession channel is closed *after* the etcd version has been verified which is the right behavior anyway. Having the error condition returned via the firstSession channel allows to return it to owning controllers as well when renewing sessions for better visibility. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
89c3bc1 k8s: Exit CNP status update controller if kvstore is unavailable [ upstream commit 794c34ba7977bbcb3b3c4dad206a40605855847c ] Commit 19ad311f5c incorrectly changed the return value to nil instead of returning an error. Fixes: 19ad311f5c1 ("kvstore: Fix Watch() to return when client is closed or context is cancelled") Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
5d7def2 etcd: Ensure that lock session renewal controller exit [ upstream commit f6a751fed6dd61a02aaf5952aa0d0d9ab9eb0a5c ] Same as ba9031bb45a ("etcd: Ensure that session renewal controller exit") but for the lock session. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
3d5b43a Changes based on comments in the pull request [ upstream commit a5fe80eb8bb017d372729a79cf529cfe0e4eefb4 ] backport note: concepts/scalability/index.rst does not exist in 1.7 so I left identity-relevant-labels in gettingstarted index. Signed-off-by: Sean Winn <sean@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
69df59f Adds documentation for limiting identity-relevant labels [ upstream commit 0945666ac60328d4ffd42b08ccae6e4509d65bde ] Fixes: #11540 Signed-off-by: Sean Winn <sean@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
17e5335 Fix small CRD issue with toGroups [ upstream commit 678e7f0205ee7f50566d3c2fdd0bd6a3a63bc54c ] Fixes: f0049da61f4f ("pkg/k8s: fix all structural issues with CNP validation") Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 23 July 2020, 22:39:41 UTC
a331fac etcd: Cancel isConnectedAndHasQuorum() on client close [ upstream commit f42c7d43e6c6b122209de5e6b77e68cf7771c785 ] Don't rely on timeout of user context while waiting for initial connect if the client is closing. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
e13a1b3 etcd: Cancel waitForInitLock() when client is closing [ upstream commit 26f8cd7fd2e5e917399623a2fe141347f9b3ccf7 ] Don't rely on timeout if the client is being cancelled. If the client is closing, a lock can never succeed again. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
39f86b3 etcd: Fix endless loop in Watch() [ upstream commit 148b68154345a19e49b1b767215cd17179e4de92 ] When Watch() is called on an etcd client that is closing or whose context has been cancelled, the Get() call will fail. The error handler will retry the loop with no chance of success, as the context is already cancelled or the client has been closed. The retry will occur forever. The lack of sleep in the retry path will further put massive stress on the etcd rate limiter and thus reduce the probability of other etcd operations succeeding in time. Fix Watch() to break out of the loop if the user or etcd client context has been cancelled. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
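A generic Go illustration of the fix described above (not the actual Watch() implementation): exit the retry loop once the caller's or client's context is cancelled, and sleep between attempts so a persistently failing Get() does not hammer the backend or its rate limiter:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// watchWithRetry retries a failing operation, but stops as soon as the
// context is cancelled and backs off between attempts. Generic sketch only.
func watchWithRetry(ctx context.Context, get func(context.Context) error) error {
	const retryDelay = 500 * time.Millisecond // placeholder backoff
	for {
		if err := get(ctx); err != nil {
			select {
			case <-ctx.Done():
				return ctx.Err() // client closed or caller cancelled: stop retrying
			case <-time.After(retryDelay):
				continue // transient failure: retry after a short sleep
			}
		}
		return nil
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := watchWithRetry(ctx, func(context.Context) error {
		return errors.New("transient failure") // always fails; the loop must still terminate
	})
	fmt.Println(err) // context deadline exceeded
}
```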
62bb7b9 kvstore: Fix Watch() to return when client is closed or context is cancelled [ upstream commit 19ad311f5c15cf6ef56362deaf665d21885887be ] Connected() is being used for two use cases: 1. To wait for a particular client to be connected and reach quorum. 2. To wait for the overall agent to connect to a kvstore and reach quorum. The first use case is tied to a particular kvstore client. The second use case is not. The current implementation will block Connected() forever until isConnectedAndHasQuorum() returns success. If the context is cancelled or if the client is closed, this never happens as an error will always be returned. Change etcd's Connected() implementation to check the context and the etcd client context and return an error on the channel. This allows Watch() to return if this condition is met to properly return the Watch() call which is made against a particular client. This fixes a Watch() call never returning in case the etcd client is closed before connectivity was ever achieved. To account for the second use case, introduce an overall Connected() function which will block closing the channel until kvstore connectivity has been achieved. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
0f22d91 etcd: Ensure that session renewal controller exit [ upstream commit ba9031bb45a0da79ee3a843a8c9b50807d0e40be ] The controller functions could potentially block forever while waiting on channels to close. Bind waiting on the channel and check of the etcd version to the controller and etcd client context so these operations get cancelled. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
7357af1 kvstore: Fix session renew controller timeouts [ upstream commit bfac77f20696b37ee073d42cc15a39db66771012 ] Due to creating a context with a timeout prior to blocking on the session channel to wait for it to end, the timeout of the context will likely already have expired by the time the kvstore operation is performed. It is thus cancelled immediately, requiring a subsequent controller call. This prolongs the time until the session is renewed and causes unnecessary controller failures which can be mistaken for etcd issues. Also pass the etcd client context to the controller to bind the lifecycle of a controller run to it. This fixes errors like these: ``` kvstore-etcd-lock-session-renew 20s ago 10s ago 1 unable to renew etcd lock session: context deadline exceeded ``` Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
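A generic Go illustration of the timing fix described above (not the actual controller code): create the operation's timeout context only after the blocking wait, so the wait does not consume the deadline:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// renewAfterSessionEnds waits for the current session to end, then performs
// the renewal with a fresh deadline. Names and durations are placeholders.
func renewAfterSessionEnds(parent context.Context, sessionDone <-chan struct{}, renew func(context.Context) error) error {
	// Wrong: ctx, cancel := context.WithTimeout(parent, 10*time.Second) here,
	// then block on <-sessionDone -- by the time renew() runs, the deadline
	// may already have expired.
	select {
	case <-sessionDone:
	case <-parent.Done():
		return parent.Err()
	}

	// Right: start the timeout only once there is actually work to do.
	ctx, cancel := context.WithTimeout(parent, 10*time.Second)
	defer cancel()
	return renew(ctx)
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(100 * time.Millisecond); close(done) }()
	err := renewAfterSessionEnds(context.Background(), done, func(ctx context.Context) error {
		if ctx.Err() != nil {
			return errors.New("context already expired")
		}
		return nil // renewal succeeded within its own fresh deadline
	})
	fmt.Println(err) // <nil>
}
```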
d16369c bpf: Fix monitor aggregation for 'from-network' [ upstream commit 037bef5260918fba7fac19971593c80ff020485c ] Previously, we did not take into account 'from-network' sources in the monitor aggregation logic check in `send_trace_notify()`, which was fine because we rarely ever sent such events (limited to ipsec for instance). However, since commit c470e28a82a9 we also use this in bpf_host which suddenly means that any and all traffic from the network will trigger monitor events, flooding the monitor output. Fixes: 7a4b0beccbfe ("bpf: Add MonitorAggregation option") Fixes: c470e28a82a9 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.") Fixes: https://github.com/cilium/cilium/issues/12555 Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
0b0b71c policy: Clarify egress policy rule members [ upstream commit ab38d73f0bb2e32bad8c45d3abed1b1ee7072b15 ] - ToCIDR + ToPorts has been supported since #3835 (Cilium v1.1) - Combining any of the L3 selectors together in a single rule doesn't make sense. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
ecb7861 endpoint: Write headerfile under full regeneration [ upstream commit 21783d5996019f6a3fb175592129f15339630177 ] Previously, regenerating an endpoint manually via the CLI resulted in compilation errors as seen below. These errors occurred because the endpoint headerfile (ep_config.h or lxc_config.h) was not written to the `<endpoint ID>_next` directory (referred to below as "next"), where the BPF compilation takes place. The reason the headerfile was not written to the "next" directory was that Cilium only wrote the headerfile if it had changed (via hash). However, a manual regeneration triggered through the API sets the regeneration level to "compile+load" (a full regeneration), where Cilium expects all relevant files, including the headerfile, to be present in the "next" directory (`<endpoint ID>_next`). Hence the compilation errors. This commit fixes the issue by checking whether there has been a request for a full regeneration, in which case we write the headerfile to the "next" directory. ``` root@k8s2:/var/run/cilium/state# clang -emit-llvm -O2 -target bpf -std=gnu89 -nostdinc -D__NR_CPUS__=2 -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -I/var/run/cilium/state/globals -I1285_next -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -c /var/lib/cilium/bpf/bpf_lxc.c -o /tmp/c In file included from /var/lib/cilium/bpf/bpf_lxc.c:22: /var/lib/cilium/bpf/lib/icmp6.h:50:29: error: use of undeclared identifier 'NODE_MAC' union macaddr smac, dmac = NODE_MAC; ^ /var/lib/cilium/bpf/lib/icmp6.h:359:30: error: use of undeclared identifier 'NODE_MAC' union macaddr router_mac = NODE_MAC; ^ In file included from /var/lib/cilium/bpf/bpf_lxc.c:26: /var/lib/cilium/bpf/lib/lxc.h:67:29: error: use of undeclared identifier 'NODE_MAC' union macaddr router_mac = NODE_MAC; ^ /var/lib/cilium/bpf/bpf_lxc.c:159:10: error: implicit declaration of function 'lookup_ip6_remote_endpoint' [-Werror,-Wimplicit-function-declaration] info = lookup_ip6_remote_endpoint(&orig_dip); ^ /var/lib/cilium/bpf/bpf_lxc.c:159:10: note: did you mean 'lookup_ip6_endpoint'? /var/lib/cilium/bpf/lib/eps.h:13:1: note: 'lookup_ip6_endpoint' declared here lookup_ip6_endpoint(struct ipv6hdr *ip6) ^ /var/lib/cilium/bpf/bpf_lxc.c:159:8: error: incompatible integer to pointer conversion assigning to 'struct remote_endpoint_info *' from 'int' [-Werror,-Wint-conversion] info = lookup_ip6_remote_endpoint(&orig_dip); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /var/lib/cilium/bpf/bpf_lxc.c:529:10: error: implicit declaration of function 'lookup_ip4_remote_endpoint' [-Werror,-Wimplicit-function-declaration] info = lookup_ip4_remote_endpoint(orig_dip); ^ /var/lib/cilium/bpf/bpf_lxc.c:529:10: note: did you mean 'lookup_ip4_endpoint'? /var/lib/cilium/bpf/lib/eps.h:35:1: note: 'lookup_ip4_endpoint' declared here lookup_ip4_endpoint(const struct iphdr *ip4) ^ /var/lib/cilium/bpf/bpf_lxc.c:529:8: error: incompatible integer to pointer conversion assigning to 'struct remote_endpoint_info *' from 'int' [-Werror,-Wint-conversion] info = lookup_ip4_remote_endpoint(orig_dip); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /var/lib/cilium/bpf/bpf_lxc.c:1027:10: error: implicit declaration of function 'lookup_ip6_remote_endpoint' [-Werror,-Wimplicit-function-declaration] info = lookup_ip6_remote_endpoint(src); ^ /var/lib/cilium/bpf/bpf_lxc.c:1027:8: error: incompatible integer to pointer conversion assigning to 'struct remote_endpoint_info *' from 'int' [-Werror,-Wint-conversion] info = lookup_ip6_remote_endpoint(src); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /var/lib/cilium/bpf/bpf_lxc.c:1256:10: error: implicit declaration of function 'lookup_ip4_remote_endpoint' [-Werror,-Wimplicit-function-declaration] info = lookup_ip4_remote_endpoint(ip4->saddr); ^ /var/lib/cilium/bpf/bpf_lxc.c:1256:8: error: incompatible integer to pointer conversion assigning to 'struct remote_endpoint_info *' from 'int' [-Werror,-Wint-conversion] info = lookup_ip4_remote_endpoint(ip4->saddr); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11 errors generated. ``` Fixes https://github.com/cilium/cilium/pull/10630 Fixes https://github.com/cilium/cilium/issues/12005 Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
5ba5b05 docs(troubleshooting): Remove --serve flag in bugtool [ upstream commit 23265dd883f87cf6a0973d7c30018e25320427dc ] The flag --serve is removed in bugtool in PR #6237, hence related docs should be removed as well. Signed-off-by: Tam Mach <sayboras@yahoo.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
7a07e58 bpf: explicitly set ttl in tunnel key [ upstream commit 38d9bf589e6be9db6694dada7c1a4f407098b40f ] Maciej reported that when using vxlan packets we can see a TTL of 64 in the packet, while for geneve packets the TTL is set to 0. This is a kernel issue: the vxlan driver, in its vxlan_xmit_one() routine, derives the TTL from the route if it is not otherwise explicitly set (such as by the BPF tunnel key): ttl = ttl ? : ip4_dst_hoplimit(&rt->dst); In the geneve driver however, geneve_xmit_skb() only does the above in non-collect_md mode, which means that, if not explicitly set, the TTL will remain 0 here. I'll post a kernel fix separately, but a simple workaround is to just set the TTL in the BPF tunnel key to a fixed value of 64. tunnel_ttl is part of the 4.9 bpf uapi header, so there are no issues with backwards compat. Reported-by: Maciej Skrocki <maciejskrocki@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
79396e9 .travis: fix up TestShuffle failure on Arm64 [ upstream commit b0610a4b9a9eac349da5e409a532850fdfe38c18 ] After shuffling, there is a small probability that the order of elements has not changed; the result of multiple attempts should prevail. Signed-off-by: Jianlin Lv <Jianlin.Lv@arm.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
42f5aaf datapath/linux: protect against concurrent access in NodeValidateImplementation [ upstream commit d88d5f95f3ff0c19fd926abf01ad3f4ee449bf3a ] The linuxNodeHandler.neighByNode map is also protected by linuxNodeHandler.mutex in all other code paths. Protect access to it in (*linuxNodeHandler).NodeValidateImplementation as well. Fixes #12460 Fixes: 6c06c51926bc ("node: Remove permanent ARP entry when remote node is deleted") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
8c2f5b0 pkg/k8s: use copy of objectmeta when fetching from local stores [ upstream commit 3a311ab0b0c5698e130e82ce4dcf75efb9d18fe8 ] This commit fixes a bug introduced by 9975bba1637c where we were accidentally writing into the object metadata of the local pod store, which should never happen. Doing that could cause k8s update pod events to give to the Update function handlers an old pod structure with such fields modified. Fixes: 9975bba1637c ("pkg/endpoint: fetch pod and namespace labels from local stores") Reported-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
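A minimal sketch of the rule this fix follows, using the public k8s.io/api types: objects read from an informer/local store are shared and must not be mutated in place; take a DeepCopy() first. The pod literal below stands in for an object fetched from the store:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	stored := &corev1.Pod{} // pretend this came from the local pod store
	stored.ObjectMeta.Labels = map[string]string{"app": "example"}

	// Wrong: stored.ObjectMeta.Labels["mutated"] = "true" would write into the
	// shared store copy and leak into later update events.

	// Right: work on a deep copy.
	pod := stored.DeepCopy()
	pod.ObjectMeta.Labels["mutated"] = "true"

	fmt.Println(stored.ObjectMeta.Labels) // map[app:example] -- store copy untouched
	fmt.Println(pod.ObjectMeta.Labels)    // map[app:example mutated:true]
}
```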
763a05f refactor stringSlice type CLI arguments [ upstream commit 9fe5b33a55decb598b78a5d3a8706c53c03782a0 ] Signed-off-by: JieJhih Jhang <jiejhihjhang@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
b658d94 linux/routing: Clarify debug logs in test [ upstream commit 9dc4ed67e854fab107fd315cb6ae22056f8e972a ] This should make it easier to know which messages to pay attention to or not. Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
720634f linux/routing: Centralize netns handling in test [ upstream commit c4f91e4996ac349a3257d6635611beca21b6a715 ] Previously, the code set the original root netns back when the entire test suite finishes. This commit changes that to switch back to the root netns when each individual test completes. This simplifies the handling of netns's with regard to the their creation and corresponding destruction. Given that we now are locking the goroutine which executes the test to a OS thread, this makes the execution flow easier to follow. The flow becomes: 1) Grab Golang runtime OS thread lock 2) Save reference to original / root netns 3) Create and switch to new netns 4) Execute test 5) Cleanup resources under new netns 6) Close new netns 7) Switch back to original / root netns 8) Unlock Golang runtime OS thread lock This is an effort to reduce the flakiness of this test suite. Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
aa76d1f linux/routing: Lock Golang runtime OS thread [ upstream commit fc7716622f5d1b11753e5815472fac15ee359d61 ] This locks the runtime from switching threads (read: goroutines) when handling network namespaces. This is required as network namespaces used with the vishvananda/netns library are thread-local variables. Due to this, we must pin and disallow any other goroutine from running on the OS thread by issuing a runtime.LockOSThread. This allows us to safely invoke the OS services for getting a new netns, executing the test under that netns, and cleanup the netns. This is an effort to reduce the flakiness of this test suite. Read here from more info: https://pkg.go.dev/github.com/vishvananda/netns?tab=doc https://tip.golang.org/src/runtime/proc.go?h=LockOSThread#L3762 Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
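A minimal sketch of the flow enumerated above, using runtime.LockOSThread and the vishvananda/netns library (creating a netns requires CAP_SYS_ADMIN); this is illustrative, not the test suite's code:

```go
package main

import (
	"log"
	"runtime"

	"github.com/vishvananda/netns"
)

func main() {
	// Network namespaces are per-OS-thread, so pin this goroutine to its
	// thread for the whole sequence (steps numbered as in the commit message).
	runtime.LockOSThread()         // 1) grab the OS thread lock
	defer runtime.UnlockOSThread() // 8) release it at the end

	origNS, err := netns.Get() // 2) save a handle to the original/root netns
	if err != nil {
		log.Fatal(err)
	}
	defer origNS.Close()

	newNS, err := netns.New() // 3) create a new netns and switch to it
	if err != nil {
		log.Fatal(err)
	}

	// 4) execute the test and 5) clean up resources inside newNS here.

	newNS.Close()                             // 6) close the new netns handle
	if err := netns.Set(origNS); err != nil { // 7) switch back to the original netns
		log.Fatal(err)
	}
}
```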
b9909c5 linux/routing: List devices when adding dummy dev [ upstream commit 582170470b85c6c16e309fbf84710d1bc6fe2870 ] This allows asserting whether the device already exists before running the tests, and that the device truly was created before running them. This is an effort to reduce the flakiness of this test suite. Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> 22 July 2020, 20:51:22 UTC
4aaf377 Update Go to 1.13.14 Signed-off-by: Tobias Klauser <tklauser@distanz.ch> 21 July 2020, 08:52:10 UTC
98fac06 clustermesh: Report number of failures in status [ upstream commit 3e951e15ec9dcc92c8b53982713f8af5fe631684 ] Good condition: ``` cluster2: ready, 4 nodes, 3 identities, 1 services, 0 failures (last: never) ``` Bad condition: ``` cluster2: not-ready, 0 nodes, 0 identities, 0 services, 1 failures (last: 9s ago) ``` Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
772e0a5 clustermesh: Release old connection asynchronously [ upstream commit 0b0dd558fe3df45ec78ddcb13467cad1b8c2d1cc ] When releasing the etcd connection, sessions are attempted to be revoked. In the event of an unhealthy etcd connection, the operation will fail and time out. This operation will take a long time though. Instead of blocking, release the resources in the background. Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
d24cd64 kvstore: Log errors while closing etcd client [ upstream commit 03b3c105bb427d6d062f609fe084021dc766f228 ] Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
f2ed81b kvstore: Improve initial status message [ upstream commit a7510398179c51d357232ce9eda14d8e8741d0db ] The initial status message of the etcd subsystem is: ``` KVStore: Ok No connection to etcd ``` This can be misleading as it does not indicate whether the etcd session was ever established or not. Clarify this: ``` KVStore: Ok Waiting for initial connection to be established ``` Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
2b92575 clustermesh: Fix comments on scope of rc.mutex [ upstream commit 8599664980a52b1a8e18dc782aa6f210e87c18a5 ] Reported-by: @sayboras Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
504e5aa clustermesh: Improve error message for initial connection attempt [ upstream commit cc162e5e7322855cb8deadc7f8fa1b5020757a10 ] "Backend not initialized" does not mean much to users. Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
dd7341d clustermesh: Restart etcd connection on quorum errors [ upstream commit e88a9093a58f5fbd89fbcbfc39ebd4fde45e8d14 ] Watch the status of the etcd connection and restart the connection if quorum loss is detected. Given that lock acquisition is disabled for clustermesh, the quorum check equals the ability to receive updates on the heartbeat key. Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
856b4a0 clustermesh: Disable initlock quorum check [ upstream commit 645355561bf3c8fa628a2053a7276ca03c73937a ] Clustermesh is never performing write operations so the lock-based quorum check is only adding contention to remote etcds. Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
19364b8 kvstore: Add heartbeat to detect stale etcd connections [ upstream commit d61fdcd537751eedf6f24c820ed8568d386492a9 ] Adds a heartbeat written to a key at a fixed interval (1min) by the operator. Each etcd client installs a watcher to watch the heartbeat key. When the heartbeat is not updated within 2*interval, the quorum check will start failing: ``` KVStore: Ok etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.33.11:2379 - 3.4.9 (Leader) ``` When enough consecutive errors have accumulated, the kvstore subsystem will start failing: ``` KVStore: Failure Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received ``` Signed-off-by: Thomas Graf <thomas@cilium.io> 15 July 2020, 18:57:39 UTC
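A minimal sketch of such a heartbeat writer using the public etcd clientv3 API; the key name, interval, and endpoint are placeholders, not Cilium's exact values:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A single writer (the operator) refreshes a well-known key on a fixed
	// interval; clients watching the key treat "no update within 2*interval"
	// as a sign of a stale connection or lost quorum.
	const heartbeatKey = "cilium/.heartbeat" // placeholder key path
	interval := time.Minute

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		_, err := cli.Put(ctx, heartbeatKey, time.Now().Format(time.RFC3339))
		cancel()
		if err != nil {
			log.Printf("heartbeat update failed: %v", err) // watchers will notice the gap
		}
	}
}
```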