https://github.com/cilium/cilium

9ba0504 Prepare for v1.5.5 Signed-off by: Ian Vernon <ian@cilium.io> 11 July 2019, 20:42:12 UTC
7eaaf08 lbmap: Get rid of bpfService cache lock [ upstream commit 48bef164ce991aab6c097079352d2de1bdf4c271 ] This commit: - Removes the bpfService cache lock, as any call to the cache happens via lbmap, and lbmap itself is protected by its own lock. - Makes sure that the lbmap lock is taken in the very beginning of each public method definition of lbmap. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Ian Vernon <ian@cilium.io> 10 July 2019, 16:59:40 UTC
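A minimal Go sketch of the locking pattern this backport describes, assuming a hypothetical lbmap-style wrapper (names and fields are illustrative, not the actual Cilium types): the map-level mutex is taken at the top of every public method, so the embedded service cache needs no lock of its own.

```go
package lbmap

import "sync"

// LBMap guards both the BPF map handle and its in-memory service cache
// with a single mutex, taken at the top of every exported method.
type LBMap struct {
	mu    sync.Mutex
	cache map[string]int // illustrative stand-in for the bpfService cache
}

func NewLBMap() *LBMap {
	return &LBMap{cache: make(map[string]int)}
}

// UpsertService acquires the lbmap lock before touching the cache,
// so the cache itself needs no separate lock.
func (m *LBMap) UpsertService(name string, backends int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cache[name] = backends
}

// DeleteService follows the same pattern: lock first, then mutate.
func (m *LBMap) DeleteService(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.cache, name)
}
```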
5c623f5 retry vm provisioning, increase timeout [ upstream commit 2a50e8fd5cdd1620b23026fbdd24fa1d95ae82d5 ] Signed-off-by: Maciej Kwiek <maciej@isovalent.com> Signed-off-by: Ian Vernon <ian@cilium.io> 10 July 2019, 16:59:40 UTC
10ca060 daemon: Remove svc-v2 maps when restore is disabled [ upstream commit 302c70faef4f29ecc9a696a1d15ade95e81ad0a1 ] Previously, in the case of `--restore=false`, the svc-v2 related BPF maps were not removed. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Ian Vernon <ian@cilium.io> 10 July 2019, 16:59:40 UTC
2953f7d daemon: Do not remove revNAT if removing svc fails [ upstream commit 23fd0d7cd4262ee1fc647f5a1078075b1f1f4fb0 ] It has been observed that the daemon can receive duplicate events for a k8s svc removal, e.g.: msg="Kubernetes service definition changed" action=service-deleted endpoints= k8sNamespace=default k8sSvcName=migrate-svc-6 service="frontend:10.104.17.176/ports=[]/selector=map[app:migrate-svc-server-6]" subsys=daemon <..> msg="Kubernetes service definition changed" action=service-deleted endpoints= k8sNamespace=default k8sSvcName=migrate-svc-6 service="frontend:10.104.17.176/ports=[]/selector=map[app:migrate-svc-server-6]" subsys=daemon msg="Error deleting service by frontend" error="Service frontend not found 10.104.17.176:8000" k8sNamespace=default k8sSvcName=migrate-svc-6 obj="10.104.17.176:8000" subsys=daemon msg="# cilium lb delete-rev-nat 32" k8sNamespace=default k8sSvcName=migrate-svc-6 subsys=daemon msg="deleting L3n4Addr by ID" l3n4AddrID=33 subsys=service In such situations, the daemon tries to remove the revNAT entry twice, which can lead to a removal of a valid revNAT entry if in between the events the daemon creates a new service with the same ID. Fix this by skipping the revNAT removal if the frontend removal fails. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Ian Vernon <ian@cilium.io> 10 July 2019, 16:59:40 UTC
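A hedged sketch of the ordering fix described above, using illustrative maps rather than the real daemon types: the revNAT entry is only deleted after the frontend removal succeeds, so a duplicate delete event cannot clobber a revNAT entry that a newly created service with the same ID is using.

```go
package daemon

import "fmt"

// deleteService only removes the revNAT entry once the frontend removal
// has succeeded. The maps and names here are illustrative.
func deleteService(frontends map[string]uint16, revNAT map[uint16]string, frontend string) error {
	id, ok := frontends[frontend]
	if !ok {
		// Duplicate event: the frontend is already gone, so do NOT touch
		// revNAT -- its ID may have been reused by a new service.
		return fmt.Errorf("service frontend not found %s", frontend)
	}
	delete(frontends, frontend)
	delete(revNAT, id) // only reached when the frontend removal succeeded
	return nil
}
```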
4354cd5 pkg/k8s: add conversion for DeleteFinalStateUnknown objects [ upstream commit 39b0190817642789f44f34f01fee811d330fdb3d ] As k8s watchers can return DeleteFinalStateUnknown objects, cilium needs to be able to convert those objects into DeleteFinalStateUnknown objects whose internal Obj is also converted into private cilium types. Fixes: 027ccdc31d11 ("pkg/k8s: add converter functions from k8s to cilium types") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 10 July 2019, 16:59:40 UTC
58c1314 cli: fix panic in cilium bpf sha get command [ upstream commit a9861d12a67254d2ff8ecea174c0c8a460dee4d4 ] * Fixes #8423 * Fixes panic issue in `cilium bpf sha get` command when no sha is provided Signed-off-by: Deepesh Pathak <deepshpathak@gmail.com> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
42232a3 Retry provisioning vagrant vms in CI [ upstream commit eed40849fe45afd6b477e4ad029b1cb8b7a793f3 ] Signed-off-by: Maciej Kwiek <maciej@covalent.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
3b22bae pkg/k8s: hold mutex while adding events to the queue [ upstream commit 7eb649efcfcc2376ecb87e3780a9aacdc55b8e48 ] As there are certain code regions where the mutex is held, the events might be put into the channel out of the original order in which they acquired the mutex. Fixes: 7efa98e03b9b ("k8s: Factor out service and endpoints correlation into a new ServiceCache") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
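A small Go sketch of the fix, with invented names: the mutex stays held while the event is pushed onto the channel, so events cannot be enqueued out of the order in which callers acquired the lock.

```go
package servicecache

import "sync"

// ServiceCache serializes event emission: the mutex is held both while
// the internal state is updated and while the event is pushed onto the
// channel, so those two steps cannot be reordered between callers.
type ServiceCache struct {
	mu     sync.Mutex
	state  map[string]string
	Events chan string
}

func New() *ServiceCache {
	return &ServiceCache{state: make(map[string]string), Events: make(chan string, 128)}
}

// UpdateService keeps the lock until the event is enqueued, preserving
// the order in which callers acquired the mutex.
func (s *ServiceCache) UpdateService(name, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state[name] = value
	s.Events <- name // enqueued under the same lock as the state change
}
```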
a3fae39 Change nightly CI job label from fixed to baremetal [ upstream commit 39c279614f8ce7f6efc0e84bbb3acc3d646d33d0 ] Signed-off-by: Maciej Kwiek <maciej@covalent.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
ad2f08b test: set 1.15 by default in CI Vagrantfile [ upstream commit 81dfcb918a8d41a5ad6a3f46c770a666a3e475cc ] Fixes: d7fb1c1eec1a ("test: run k8s 1.15.0 by default in all PRs") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
fac1699 daemon: Change loglevel of "ipcache entry owned by kvstore or agent" [ upstream commit ae1cf42dd079b409ec3bd898b190563753280121 ] This commit changes the loglevel of the following log entry: level=warning msg="Unable to update ipcache entry of Kubernetes node" error="ipcache entry owned by kvstore or agent" This message is not harmful, as after cilium-agent has established connectivity to a kvstore, it's expected that IPCache entries will be owned by kvstore. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
36437ec pkg/kvstore: add etcd lease information into cilium status [ upstream commit 8c70c3728c56d171aad0e1e5b5383fdc9fe9595c ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
77f8997 pkg/k8s: do not parse empty annotations [ upstream commit 185a86b8f203f936b22d7ad8f04c07f947fcdae2 ] It seems annotations for a particular field can exist but they may be empty which causes the parsing of that empty string into an IP to fail. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
efe0057 maps/lbmap: protect service cache refcount with concurrent access [ upstream commit 17da3aebd48a77f9d4cb37c32f24b4aedbe52043 ] As updating services can modify the lbmapCache, the mutex for accessing the same bpf map should be held before modifying the cache. Without it we risk having the bpf map out of sync with the lbmap cache. Fixes: 2766bc127368 ("daemon,lbmap: Do not update legacy svc if they are disabled") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
e6f53b2 operator: add warning message if status returns an error [ upstream commit d0db460d55f89f33f740b725c8f1a696c2347e51 ] It might be helpful to track down why kubernetes killed cilium-operator due to readiness probe failures. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:59:40 UTC
9fec2a5 pkg/kvstore: fix nil pointer in error while doing a transaction in etcd [ upstream commit 7e16b286b598980e1ee97dbf3d9798dd742d9953 ] txnresp can be nil if err != nil, so we should only check txnresp.Succeeded after first checking that err == nil. Fixes: 84291c30ece5 ("pkg/kvstore: implement new *IfLocked methods for etcd") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 09 July 2019, 22:54:47 UTC
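A generic Go sketch of the corrected check order, using a stand-in response type rather than the real etcd client: the error is inspected before the response is ever dereferenced.

```go
package kvstore

import "errors"

// txnResponse stands in for an etcd transaction response; only the
// Succeeded field matters for this sketch.
type txnResponse struct {
	Succeeded bool
}

// doTxn shows the ordering the fix enforces: the response may be nil
// whenever err is non-nil, so err must be checked before Succeeded.
func doTxn(commit func() (*txnResponse, error)) error {
	resp, err := commit()
	if err != nil {
		return err // resp may be nil here; never dereference it first
	}
	if !resp.Succeeded {
		return errors.New("transaction comparison failed")
	}
	return nil
}
```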
9d72bdf examples/kubernetes: bump cilium to v1.5.4 Signed-off-by: André Martins <andre@cilium.io> 02 July 2019, 14:32:25 UTC
73a052b bpf: Remove unneeded debug instructions to stay below instruction limit Signed-off-by: Thomas Graf <thomas@cilium.io> 02 July 2019, 01:52:05 UTC
2e29c31 bpf: Prohibit encapsulation traffic from pod when running in encapsulation mode An endpoint can emit encapsulation traffic to a worker node and given certain conditions are met, the worker node will accept the encapsulated traffic and route it. This can bypass egress security policies. If the endpoint is able to guess security identities correctly, the endpoint can also impersonate other security identities. Conditions that must be met: * Cilium must be running in encapsulation mode. * Endpoint must be aware of at least one worker node IP or be aware of an external LB that redirects to a worker node. * The egress policy of the endpoint must allow UDP on the configured encapsulation port. Alternatively, if the endpoint has access to an external LB such as a NodePort which redirects to the configured UDP encapsulation port on a worker, the policy must allow for this. Redirection to a UDP encapsulation port using a ClusterIP or headless service is not affected as the load-balancing decision happens before egress policy. * Masquerading of egress traffic to worker node IP must be enabled; if masquerading is disabled, the remote tunnel will reject the traffic. This means that the known or guessed worker node IP must be a node IP which is considered outside of the cluster so masquerading is performed. * The endpoint must guess a security identity or be aware of well-known security identities. * The targeted destination endpoint as defined in the inner IP header must allow the guessed or known source security identity as per ingress policy. In order to mitigate this, any UDP traffic to one of the supported encapsulation ports is now dropped with a distinct drop reason: ``` xx drop (Encapsulation traffic is prohibited) flow 0xcab0685a to endpoint 0, identity 3087->2: 10.16.209.20:39258 -> 192.168.122.124:8472 udp ``` Signed-off-by: Thomas Graf <thomas@cilium.io> 02 July 2019, 01:52:05 UTC
d4cbbfd pkg/endpointmanager: protecting endpoints against concurrent access [ upstream commit fbc9ba004c022f8abcdb040e2689cf6dfbe9566f ] Fixes: f71d87a71c99 ("endpointmanager: signal when work is done") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
72b7a3a test: set k8s 1.15 as default k8s version [ upstream commit 8c2a33d5fcc05d567d08cc5c8094ac474fd4717a ] Fixes: 4794445b0e02 ("test: test against 1.15.0") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
3041379 CI: Clean VMs and reclaim disk in nightly test [ upstream commit d7f5931634ad239505725fa5edd70c38d5adc66c ] We didn't clean up VM images in the nightly test. This isn't a problem overall, but nodes that see more nightly builds (e.g. jenkins-fixed-node-0) may run out of space if only these jobs execute. We now run the cleanup script in a similar way to our other jobs. fixes 0af448da73fb9d8ebb64de6a6e1052f52b9a2a43 Signed-off-by: Ray Bejjani <ray@covalent.io> Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
a55b83f allocator: fix race condition when allocating local identities upon bootstrap [ upstream commit d146a2efd28ca6073c10a74836abf1b93011de5f ] When `AllocateIdentity` attempts to allocate an identity, the labels are checked to determine whether the identity is local, *and* whether the `localIdentities` structure is non-nil: ``` ... if !identity.RequiresGlobalIdentity(lbls) && localIdentities != nil { return localIdentities.lookupOrCreate(lbls) } // else allocate global identity ... ``` Upon bootstrap, the creation of the localIdentities structure was done asynchronously along with the creation of the global identity allocator. This meant that if a local identity for a CIDR was attempted to be allocated upon bootstrap soon after `InitIdentityAllocator` was invoked, there was no guarantee as to whether `localIdentities` was initialized or not. Consequently, even if a set of labels did correspond to a local identity (e.g., for a CIDR), there would be no local identity allocated for the CIDR, and instead, a global identity would be allocated. This is incorrect behavior. Fix this incorrect behavior by doing the following: * Create `localIdentities` synchronously with calls to `InitIdentityAllocator`. The initialization of the global allocator is still done asynchronously. * Do not asynchronously call `InitIdentityAllocator` upon bootstrap any more, since `InitIdentityAllocator` synchronously creates the local identity allocator now, but now creates the global identity allocator asynchronously. * Wait to allocate local identities until the local identity allocator is initialized instead of having the aforementioned nil-check. If an identity does not require a global identity, it should always be allocated a local identity. Instead of checking whether `localIdentities` is nil, wait for it to be initialized so that we can allocate a local identity for all identities which are not global. Additionally, return the channel which is closed upon the allocator being initialized so that callers to `InitIdentityAllocator` can wait upon it being initialized to preserve existing behavior in unit tests. Also remove outdated comments about relying on the kvstore for allocating CIDR identities upon daemon bootstrap. Fixes: f3bbcd8e88 ("identity: Use local identities to represent CIDR") Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
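A simplified Go sketch of the described fix, with illustrative names: callers block on an initialization channel instead of nil-checking the local identity allocator, so a CIDR identity allocated early in bootstrap still becomes a local identity.

```go
package identity

import "sync"

// localIdentityCache is a sketch of the bootstrap fix: instead of a
// nil-check on the allocator, callers block on a channel that is closed
// once initialization has completed. All names are illustrative.
type localIdentityCache struct {
	initialized chan struct{}
	mu          sync.Mutex
	byLabels    map[string]int
	nextID      int
}

func newLocalIdentityCache() *localIdentityCache {
	return &localIdentityCache{
		initialized: make(chan struct{}),
		byLabels:    make(map[string]int),
		nextID:      1,
	}
}

// markInitialized is called synchronously from the init path; the
// global allocator can still be set up asynchronously elsewhere.
func (c *localIdentityCache) markInitialized() { close(c.initialized) }

// lookupOrCreate waits for initialization instead of silently falling
// back to a global identity when the cache is not ready yet.
func (c *localIdentityCache) lookupOrCreate(labels string) int {
	<-c.initialized
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.byLabels[labels]; ok {
		return id
	}
	id := c.nextID
	c.nextID++
	c.byLabels[labels] = id
	return id
}
```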
2cb7d70 identity: Initialize well-known identities before the policy repository. [ upstream commit cdf3b6ec11a3437b558cdf671ebbc4f2b8f3d117 ] NewPolicyRepository() now grabs the IdentityCache, so all well-known identities must be initialized before then. This commit removes InitWellKnownIdentities() from InitIdentityAllocator() so that the well-known identities can be initialized earlier. Unfortunately the well-known identities depend on the runtime configuration and thus can't be statically initialized. Before this commit there was a potential for race in initializing and using the well-known identities map. Now that the well-known identities map is initialized earlier, this race is less likely to be a problem. Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
b21e708 cilium: docker.go ineffectual assignment [ upstream commit c67655cf8eb198ed1b0c02dea76ad5842b467a3a ] Remove unnecessary runtime assignment. ineffassign . /home/john/go/src/github.com/cilium/cilium/pkg/workloads/docker.go:244:4: ineffectual assignment to runtimeRunning make: *** [Makefile:395: ineffassign] Error 1 Fixes: 910d8d7f6a16a ("pkg/workloads: add containerd integration") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jarno Rajahalme <jarno@covalent.io> 28 June 2019, 03:17:46 UTC
d47f399 Disable automatic direct node routes test This test is consistently failing on the v1.5 branch, disable it. Related: https://github.com/cilium/cilium/issues/8378 Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
bff1730 kubernetes-upstream: add separate stage to run tests [ upstream commit 46f9b503465c45a4e3f065424711bff31108821b ] Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
fa53405 docs: update documentation with k8s 1.15 support [ upstream commit f1aff1c681c27502169fde0a37dc415aec4daa9d ] Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
1816521 test: run k8s 1.15.0 by default in all PRs [ upstream commit d7fb1c1eec1a34ea9ead95d9e9c1d52283a08f32 ] Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
f654adf test: test against 1.15.0 [ upstream commit 4794445b0e022320924798a5c9a457fe57be7b66 ] Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
ce6ea6b vendor: update k8s to v1.15.0 [ upstream commit 5d8a18795370d94d3dfe9ebaf0f5878f0f19b6c6 ] As a consequence `k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2` is now `k8s.io/cri-api/pkg/apis/runtime/v1alpha2` Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
7ec5b44 bpf: Set random MAC addrs for cilium interfaces [ upstream commit acced7e5e0836fd6e79b6b17e2473876f1f75d2f ] To work around the systemd 242+ feature which tries to assign a persistent MAC address for any device by default (see commit message of the previous commit for more details). Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
02477fe endpoint: Set random MAC addrs for veth when creating it [ upstream commit 719c2e5bfbf7c931da8535d491aa15a19421c072 ] systemd 242+ tries to set a "persistent" MAC addr for any virtual device by default (controlled by MACAddressPolicy). As this setting happens asynchronously after a device has been created, ep.Mac and ep.HostMac can become stale, which has a serious consequence: the kernel will drop any packet sent to/from the endpoint. However, we can trick systemd by explicitly setting MAC addrs for both veth ends. This sets addr_assign_type to NET_ADDR_SET, which prevents systemd from changing the addrs. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
0e7feab vendor: Update vishvananda/netlink [ upstream commit a8bb290d45c06e80ed39299fdcbc99deecc9a065 ] To include the change which allows specifying the veth peer MAC address (https://github.com/vishvananda/netlink/pull/460). Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
629fa7a mac: Add function to generate a random MAC addr [ upstream commit 143d5d36227cec18a9d5476250438ef6f3d3221f ] A generated MAC addr is unicast and locally administered. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: André Martins <andre@cilium.io> 27 June 2019, 00:14:05 UTC
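A hedged sketch of such a generator (function name and error text are illustrative): six random bytes with the multicast bit cleared and the locally-administered bit set.

```go
package mac

import (
	"crypto/rand"
	"fmt"
	"net"
)

// generateRandMAC returns a random MAC address that is unicast
// (multicast bit cleared) and locally administered (local bit set), as
// described in the commit above.
func generateRandMAC() (net.HardwareAddr, error) {
	buf := make([]byte, 6)
	if _, err := rand.Read(buf); err != nil {
		return nil, fmt.Errorf("unable to generate random MAC address: %s", err)
	}
	// Clear the multicast bit (bit 0) and set the locally-administered
	// bit (bit 1) of the first octet.
	buf[0] = (buf[0] &^ 0x01) | 0x02
	return net.HardwareAddr(buf), nil
}
```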
6334912 test: remove unused function [ upstream commit 4257665ab0c83e21f8ee2d821eed1d740608beda ] Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
610b83e test: introduce `ExecShort` function [ upstream commit 47bdde0d447183e2f7f2ce5d4be4b852ffaaecdf ] This runs commands that do not require a lot of time to complete with a timeout of 10 seconds. Before, the default was up to 5 minutes. Now, we have separate timeouts depending on the nature of the command being run. Also use context more consistently across other various helper functions. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
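A sketch of what such a helper could look like, assuming a hypothetical Runner type with a context-aware exec hook; the 10-second constant mirrors the commit message.

```go
package helpers

import (
	"context"
	"time"
)

const shortCommandTimeout = 10 * time.Second

// Runner stands in for the test helper that knows how to execute a
// command under a context (e.g. over SSH); only the timeout wrapping
// shown here reflects the change described above.
type Runner struct {
	ExecContext func(ctx context.Context, cmd string) (string, error)
}

// ExecShort runs a command that is expected to finish quickly, bounding
// it with a 10-second timeout instead of the multi-minute default.
func (r *Runner) ExecShort(cmd string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), shortCommandTimeout)
	defer cancel()
	return r.ExecContext(ctx, cmd)
}
```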
2e22149 docs: Clarify about legacy services enabled by default [ upstream commit a214c6591068b3860ed33bbb56a6cde6e95e94c6 ] From the previous description of the `enable-legacy-services` ConfigMap option it was not clear that legacy services are enabled by default, i.e. when the option is not set. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
defd338 pkg/metrics: re-register newStatusCollector function [ upstream commit eab5540a388e282397bdd24bdb878b966c330aaf ] Fixes: 0fec218c33ff ("pkg/metrics: set all metrics as a no-op unless they are enabled") Reported-by: Christian <christian.huening@figo.io> Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
e87fd0d CI: Clean workspace when all stages complete [ upstream commit 3214f85e38d0f0ab795a42f2391ea2a743e6ef24 ] We previously cleaned the workspace when the test stage ends, and not when the pipeline exits. This is probably benign but some archival operations may be interrupted by a premature cleanup. Signed-off-by: Ray Bejjani <ray@covalent.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
5bb8204 CI: Clean VMs and reclaim disk after jobs complete [ upstream commit 0af448da73fb9d8ebb64de6a6e1052f52b9a2a43 ] We cleanup when starting a job, to ensure we have enough space. There is a race here, however, where if a job exits and leaves a full disk jenkins may no longer schedule on it. We now clean when a job completes, since the cleanup script is more conservative on when to prune vagrant images. Signed-off-by: Ray Bejjani <ray@covalent.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
55bd99a CI: Report last seen error in CiliumPreFlightCheck [ upstream commit fdb0eaa9ad31dc4923f18c4b510f7fa964fb5e5e ] We reported a shadowed error variable that was always nil. This obscured the reason why the preflight was failing. Signed-off-by: Ray Bejjani <ray@covalent.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
24baf2e fqdn: correctly populate Source IP and Port in `notifyOnDNSMsg` [ upstream commit 5c12962622c8e531be7b6cd95b9d56783b1fc12b ] [ Backporter's notes: Had to manually fix up the IPPort parameter ] Before, we were calling `endpoint.String()` for each DNS request going through the DNS proxy and providing that as the value for `SrcIPPort` for the `AddressingInfo` portion of the corresponding `LogRecord` for said DNS request. This in and of itself was a bug (the endpoint string is not a valid hostname / port tuple). On top of this, this was a performance bottleneck, because of the amount of CPU being used to marshal the endpoint into JSON and convert it into a string, per the output of running `gops pprof-cpu <cilium-agent PID>` and running `list notifyOnDNSMsg`: ``` ROUTINE ======================== main.(*Daemon).notifyOnDNSMsg in /home/vagrant/go/src/github.com/cilium/cilium/daemon/fqdn.go 10ms 2.75s (flat, cum) 47.33% of Total . . 399: err := errors.New("DNS request cannot be associated with an existing endpoint") . . 400: log.WithError(err).Error("cannot find matching endpoint") . . 401: endMetric() . . 402: return err . . 403: } . 10ms 404: qname, responseIPs, TTL, CNAMEs, rcode, recordTypes, qTypes, err := dnsproxy.ExtractMsgDetails(msg) . . 405: if err != nil { . . 406: // This error is ok because all these values are used for reporting, or filling in the cache. . . 407: log.WithError(err).Error("cannot extract DNS message details") . . 408: } . . 409: . 60ms 410: ep.UpdateProxyStatistics("dns", uint16(serverPort), false, !msg.Response, verdict) . 240ms 411: record := logger.NewLogRecord(proxy.DefaultEndpointInfoRegistry, ep, flowType, false, . . 412: func(lr *logger.LogRecord) { lr.LogRecord.TransportProtocol = accesslog.TransportProtocol(protoID) }, . . 413: logger.LogTags.Verdict(verdict, reason), . . 414: logger.LogTags.Addressing(logger.AddressingInfo{ . 1.81s 415: SrcIPPort: ep.String(), . . 416: DstIPPort: serverAddr, . . 417: SrcIdentity: ep.GetIdentity().Uint32(), . . 418: }), ``` Instead, provide the IP/Port tuple from the initial request into all `NotifyOnDNSMsg` invocations for the `DNSProxy`. This correctly populates the `SrcIPPort` field, and improves performance due to not having to marshal the endpoint. When running the aforementioned pprof commands again, the performance is clearly much better when running cilium against the same set of pods making DNS requests at the same cadence: ``` ROUTINE ======================== main.(*Daemon).notifyOnDNSMsg in /home/vagrant/go/src/github.com/cilium/cilium/daemon/fqdn.go 0 460ms (flat, cum) 17.83% of Total . . 405: if err != nil { . . 406: // This error is ok because all these values are used for reporting, or filling in the cache. . . 407: log.WithError(err).Error("cannot extract DNS message details") . . 408: } . . 409: . 30ms 410: ep.UpdateProxyStatistics("dns", uint16(serverPort), false, !msg.Response, verdict) . 70ms 411: record := logger.NewLogRecord(proxy.DefaultEndpointInfoRegistry, ep, flowType, false, . . 412: func(lr *logger.LogRecord) { lr.LogRecord.TransportProtocol = accesslog.TransportProtocol(protoID) }, . . 413: logger.LogTags.Verdict(verdict, reason), . . 414: logger.LogTags.Addressing(logger.AddressingInfo{ . . 415: SrcIPPort: epIPPort, . . 416: DstIPPort: serverAddr, . . 417: SrcIdentity: ep.GetIdentity().Uint32(), . . 418: }), ``` Fixes: 2935f589ab Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
015081c test: do not overwrite context in `GetPodNamesContext` [ upstream commit 2bc3c85422aa324fdd143d17c5b79448b36a22d3 ] Instead, create a child context which has a shorter timeout, since retrieving pod names from Kubernetes should not take a long time. This ensures that we respect the parent context's timeout, but also ensure that we do not wait an inordinate amount of time for getting said pod names. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
dccf6fb test: change `GetPodNames` to have a timeout [ upstream commit b01fb6b7a69430dfbe4ee337965209311dc8dcde ] Having a default timeout associated with the default case ensures that if the SSH command to get the pod names gets stuck, we will still timeout. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
bea8137 test: make sure that `GetPodNames` times out after 30 seconds [ upstream commit 5f62996a6431f8b3c9d6f2a2ece4eaa4bf7d2ee8 ] Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
6f3a0de CI: Ensure k8s execs cancel contexts [ upstream commit 6ecfff82c2314dd9d847645361b57e2646eed64b ] [ Backporter's notes: Previous backport f7a1bd149a40f237850c91baa9b6987de6c67061 backported half of this commit, so this commit backports the other half.. ] We make many calls that end up in SSHMeta.Exec. It would pass context.Background to SSHMeta.ExecContext. Unfortunately, the way ExecContext is written relies on the context expiring eventually, otherwise it will leak goroutines (and also leave the session open, potentially). We now pass in a cancellable context and cancel it when we exit Exec. There is still no direct timeout in this case. Signed-off-by: Ray Bejjani <ray@covalent.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
d069999 test: Fix NodeCleanMetadata by using --overwrite [ upstream commit 6a27fc16583927b476e3cef8b4ee048f3547e5f2 ] Fixes the following errors: ``` cmd: "kubectl annotate nodes k8s1 io.cilium.network.ipv4-pod-cidr" exitCode: 1 duration: 76.869764ms stdout: stderr: error: at least one annotation update is required ``` Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io> 27 June 2019, 00:14:05 UTC
b937d41 test: add timeout to `waitToDeleteCilium` helper function [ upstream commit 1f88ae5b3456db21384be5028ca7af6a4b5e24cf ] If the desired state within the function was never reached, it would loop infinitely. This may be the cause behind some CI tests getting stuck recently. Even if it is not the root cause of said issues, having some timeout associated with the operation is reasonable to ensure that said function terminates. Also provide context down to functions with SSH into the VMs, so that we do not hang forever in the case that an SSH session is not able to be established. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Ian Vernon <ian@cilium.io> 19 June 2019, 23:31:13 UTC
51bf80e .travis: update travis golang to 1.12.5 [ upstream commit 421157a7689a3c4fd685eab00f5af4f113f443c1 ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@covalent.io> 19 June 2019, 23:31:13 UTC
d27fc24 Don't set debug to true in monitor test [ upstream commit 32c51a54ca205643368a6c6f8934f70b808f4721 ] Signed-off-by: Maciej Kwiek <maciej@covalent.io> 19 June 2019, 23:31:13 UTC
ba9abd8 pkg/lock: fix RUnlockIgnoreTime [ upstream commit 47032fcfb241a2457c0e761a638a6ec5d56b3e2a ] These functions were wrongly executing Unlocks internally instead of RUnlocks. Fixes: 8e976dd1598b ("pkg/lock: add detector if a lock was held for more than n seconds") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@covalent.io> 19 June 2019, 23:31:13 UTC
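A minimal illustration of the bug class, using a plain wrapper around sync.RWMutex (the lock-held-time tracking of pkg/lock is elided): the read-unlock path must call RUnlock, not Unlock.

```go
package lock

import "sync"

// RWMutex wraps sync.RWMutex; the bug fixed above was calling Unlock()
// on the inner mutex where RUnlock() was required.
type RWMutex struct {
	inner sync.RWMutex
}

func (m *RWMutex) Lock()    { m.inner.Lock() }
func (m *RWMutex) Unlock()  { m.inner.Unlock() }
func (m *RWMutex) RLock()   { m.inner.RLock() }

// RUnlockIgnoreTime must release the *read* lock; releasing the write
// lock here corrupts the mutex state for concurrent readers.
func (m *RWMutex) RUnlockIgnoreTime() {
	m.inner.RUnlock()
}
```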
c545ace daemon: fix endpoint restore when endpoints are not available [ upstream commit ef04e0f5e2e16c421942bfb18d01f34136b04ba6 ] As containers can be removed between cilium-agent restarts, their endpoints will no longer be valid and can't be restored. When they were not restored in this case, the channel signaling that they were not restored was never written to, leaving the goroutine running for the lifetime of the cilium-agent. Fixes: 5035a582d952 ("daemon: synchronously add endpoints to endpointmanager in \`regenerateRestoredEndpoints\`") Fixes: 20a49da060fd ("Consistently check for liveness of endpoint when re-locking (#5116)") Fixes: f59daacc1382 ("daemon: on restore, run identity allocation in the background") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Maciej Kwiek <maciej@covalent.io> 19 June 2019, 23:31:13 UTC
f1c4b8c Preload vagrant boxes in k8s upstream jenkinsfile [ upstream commit 0b7f92a3df68fcad8ded552057c8a09e8fdd2d36 ] Signed-off-by: Maciej Kwiek <maciej@covalent.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
67df575 pkg/health: Fix IPv6 URL format in HTTP probe [ upstream commit 69037fe562f3e66e242424ca3461c872054ddf79 ] The current URL address formatting for HTTP probes does not support IPv6 literal addresses very well. Since an IPv6 address itself contains ':' it should be enclosed within '[' and ']' to be able to properly parse the port from the address. E.g for an endpoint `f00d::a10:0:0:a26e`, we will make an HTTP request: `http://f00d::a10:0:0:a26e:4240/v1beta/hello` which throws an error: `Get http://f00d::a10:0:0:a26e:4240/v1beta/hello: invalid URL port ":a10:0:0:a26e:4240"`. Instead, the request could be something like: `http://[f00d::a10:0:0:a26e]:4240/v1beta/hello` * Enclose health endpoint addresses within brackets during HTTP probe. This doesn't affect the behavior for IPv4 literal addresses. * In case of an error, set the probe result `status` to the error message instead of the generic 'Connection timed out' message. If there aren't already reasons for having the generic message, then it might help the user debug probe issues. Fixes #7804 Signed-off-by: ifeanyi <ify1992@yahoo.com> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
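A small runnable Go example of the bracketing behaviour the commit relies on; net.JoinHostPort from the standard library adds brackets for IPv6 literals and leaves IPv4 addresses untouched (the URL path is taken from the commit message).

```go
package main

import (
	"fmt"
	"net"
)

// probeURL builds the probe address with net.JoinHostPort, which
// encloses IPv6 literals in brackets so the port can be parsed.
func probeURL(addr string, port int) string {
	hostPort := net.JoinHostPort(addr, fmt.Sprintf("%d", port))
	return fmt.Sprintf("http://%s/v1beta/hello", hostPort)
}

func main() {
	fmt.Println(probeURL("f00d::a10:0:0:a26e", 4240)) // http://[f00d::a10:0:0:a26e]:4240/v1beta/hello
	fmt.Println(probeURL("10.0.0.1", 4240))           // http://10.0.0.1:4240/v1beta/hello
}
```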
ec8aa02 test: use context with timeout to ensure that Cilium log gathering takes <= 5 minutes [ upstream commit 1cf9c4de3ce183e597aee4545f16a13cd5436b7f ] If the test environment is in a bad state, the log gathering steps can stall the CI, causing it to timeout. Provide context to functions which is connected to a timeout so that once said timeout is reached, the log-gathering will short-circuit and exit quickly. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
581f034 k8s: Introduce test for multiple From/To selectors [ upstream commit e8a227b134bc464e959e3ba0ea1f082ab70c82b1 ] This ensures that if we have multiple selectors in a `From` or `To` clause, that we combine them according to the proper k8s semantics. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
6fcc19c k8s: Fix policies with multiple From/To selectors [ upstream commit 37b4d12fcc27265425e3382662aa228c84898046 ] Instead of combining all selectors of the ingress `From` clause into a single IngressRule (and thus accidentally performing a logical `and` for differently typed selectors), this commit introduces a new `IngressRule` for each k8s selector (and thereby logically `or`ing them). The same logic is also applied for `EgressRule`s generated from a `To` clause. If specified, L4 ports are pushed down into the rules generated from the adjacent `From`/`To` clause to maintain proper semantics. Fixes: #8231 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
90f24fe test: create session and run commands asynchronously [ upstream commit 47637ec12a67ec1b0a4bd94375246fef6857fa0c ] The existing code surrounding the use of `context.Context` to enforce timeouts is correct, but the way that commands are run currently in the CI code makes their use somewhat useless. This is because previously, we called `newSession` and `runCommand` synchronously *before* we select on the deadline of provided contexts being reached. This means that if the calls to create sessions / run commands hang forever, we will never reach the timeout provided to us via Context. In order to ensure that we do not block forever on running of commands / creation of sessions, do the aforementioned in a goroutine. Unfortunately, there is no way via the SSH library we currently use in the CI to propagate Context to creation of sessions, running commands, etc., so in the case that Context deadlines are reached, we try to close the session we have created so as not to leak goroutines. Add a test which ensures that we correctly exit upon the deadline being reached for a given context. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
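A hedged sketch of the asynchronous-execution pattern described above, with invented function names: the blocking call runs in a goroutine so the caller can still honour the context deadline, and the session close hook is invoked on timeout as a best effort against goroutine leaks.

```go
package helpers

import (
	"context"
	"errors"
)

// runWithContext runs the blocking call in a goroutine so the caller
// can honor ctx even if the call itself hangs; closeFn is invoked on
// timeout so the goroutine is not left waiting on an open session.
func runWithContext(ctx context.Context, run func() error, closeFn func()) error {
	done := make(chan error, 1)
	go func() { done <- run() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		closeFn() // best effort: unblock the goroutine by closing the session
		return errors.New("command timed out: " + ctx.Err().Error())
	}
}
```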
75004c3 test: bump to k8s 1.14.3 [ upstream commit ad057a5ba4e893c94bff8dc684b2da373d70163c ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
fe2df0e test: error out if no-spec policies is allowed in k8s >= 1.15 [ upstream commit 50a99b4a02ff211fa1f7b42406898f6a1c7570e0 ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
8e3b30c test/provision: upgrade k8s 1.15 to 1.15.0-beta.2 [ upstream commit 42e70e8b828301d489441ea8384198f87372b51e ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
f7a1bd1 test: have timeout for `Exec` [ upstream commit e667025a268412321139f2a4d7cd388053f7f5da ] Otherwise, cases which cause subsequent child function calls to get stuck upon receiving from `ctx.Done()` will never be called, as the previously provided context was never canceled. Bounding all executions to take at most four minutes is a reasonable expectation for our CI. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Ray Bejjani <ray@covalent.io> 13 June 2019, 16:23:08 UTC
fb22e5a pkg/kvstore: introduced a dedicated session for locks [ upstream commit c1b05dfba81a0b3095b165920e685f59c52e76a5 ] If a cilium agent was restarted while holding a distributed lock in the KVStore, that lock would only be released after the default TTL of the Cilium agent expired, 15 minutes. This commit introduces a new session dedicated to locks with a TTL of 25 seconds, so the lock is released after 25 seconds if the cilium agent is restarted. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
9e7010e pkg/kvstore: implement new *IfLocked methods for etcd [ upstream commit 84291c30ece5c9f9796dbf490929e1558e1f6ec5 ] Implemented all *IfLocked methods for etcd. Consul will be done afterwards as the majority of our users are using etcd. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
17576d2 kvstore/allocator: make the allocator aware of kvstore lock holding [ upstream commit 74545efb52e301780c7ac9ce65470d8898a05af4 ] The allocator is an important piece in the architecture, for this reason we should only perform operations in the KVStore if the allocator is still holding the lock of the KVStore. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
81dd4e6 pkg/kvstore: add Comparator() to KVLocker [ upstream commit 334dbe453e62fd2c9dce98a4c381bf35909756e7 ] Comparator() will help the implementations of each KVStore to detect if they are still holding the lock when performing an operation in the KVStore. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
8e62b0a pkg/kvstore: add new *IfLocked methods to perform txns [ upstream commit 2b1f20c18461faefe124239a3d9b2542ce1092d3 ] *IfLocked methods should be used when the kvstore client is holding a lock and each operation should only be performed if the kvstore client is holding the lock. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
0dae31a test: bump k8s 1.13 to 1.13.7 [ upstream commit 688699b277b81450ea0e5d6107353f02deff940b ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
16ee6d7 test: Enable IPv6 forwarding in test VMs [ upstream commit 22e9cc364ccab2f76fdef208bbeb2def17bdf0e1 ] This issue became visible after noticing in #7869 that there currently isn't IPv6 node connectivity via health endpoints. Signed-off-by: ifeanyi <ify1992@yahoo.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
98cdcb6 docs: Remove architecture target links [ upstream commit 02ab7ab5589479646cf2935dc5a6648f4bf5f744 ] Several attempts to fix these URL targets have been made, and none of them seem to work in all setups (eg, RTD, the render-docs container, etc). Since we can't seem to figure out a good way to fix the link, just remove the target link. Users can right click and open the URL to view the image in a larger size. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
018925d test: add serial ports to CI VMs [ upstream commit ec9019ac1961f865bc0e280cd1b36bce0bcb50df ] This will allow us to connect to a VM that is not reachable via SSH so we can debug what went wrong with the VM. The VM can be accessed with: ``` socat -d -d ./k8s1-1.14-ttyS0.sock PTY screen /dev/pts/1 115200 ``` Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
08bbb25 *.Jenkinsfile: remove leftover failFast [ upstream commit 0739d7d79e04be1f60cbc15ba333460d528baab0 ] Fixes: 166d9c2333c0 ("Separate envs for tests in jenkins k8s pipeline") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
b75dc68 endpoint: make sure `updateRegenerationStatistics` is called within anonymous function [ upstream commit 5c0974c6bf1d8172c3a20d35d587e432ec52b28b ] This ensures that the value of `retErr` that is passed into the function is the value after `regenerate` returns, not the value at the time the function was deferred, in which case it would be nil. An example of this behavior when deferred directly: https://play.golang.org/p/YRRm4VpiI92 (incorrect). And when wrapped in an anonymous function: https://play.golang.org/p/NWxyhID5Eeg (correct). Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 12 June 2019, 22:53:02 UTC
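A self-contained Go example of the defer pitfall this commit fixes (names are illustrative): deferring the call directly evaluates the error argument at defer time, while wrapping it in an anonymous function reads the final value.

```go
package main

import "fmt"

// report stands in for updateRegenerationStatistics.
func report(err error) { fmt.Println("finished with:", err) }

// Deferring report(err) directly captures err while it is still nil;
// wrapping the call in an anonymous function defers the *call*, so err
// is read after doWork returns, which is the fix described above.
func doWork() (err error) {
	defer func() { report(err) }() // correct: sees the final value of err
	// defer report(err)           // incorrect: err is evaluated as nil here
	err = fmt.Errorf("regeneration failed")
	return err
}

func main() { _ = doWork() }
```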
1de682e Prepare for v1.5.3 Signed-off by: Ian Vernon <ian@cilium.io> 06 June 2019, 19:44:00 UTC
911eb3f test: do not spawn goroutines to wait for canceled context in `RunCommandContext` [ upstream commit cf14b6a19283bd994f3dc6cd429c29457c40adef ] It was observed while running `gops stack` for the Ginkgo test suite locally, that we were leaking goroutines that were getting stuck while waiting for SSH sessions to finish. We accrued over 1000 of these per K8s CI run: ``` 16 unique stack traces 960 occurences. Sample stack trace: github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a) /Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57 github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220) /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c ``` This example shows that there were over 960 goroutines stuck on `session.Wait()`. Whenever we run a command via SSH, we call `runCommand`. When we call `runCommand`, it calls `session.Run`, which calls `session.Start()` and `session.Wait()`. I observed that that calling `Wait()` on a session which already has had `Run` invoked will never return, even if we try to call `session.Close()` before invoking `session.Wait()`. This indicates that our logic for trying to kill the session if the context which is provided to `RunCommandContext` is canceled is flawed, as waiting for the session to finish before closing it will block infinitely. I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`), and I see that the session receives an EOF message *before* we even try to close the session: ``` >------- session started, and session.Run() invoked 2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)} 2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes 2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}} 2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes 2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes >------- EOF sent on channel (not by us; we have not closed the sesion yet! 
2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0} 2019/06/05 08:16:59 decoding(2): data packet - 181 bytes 2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac} 2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes 2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes 2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes 2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0} >------- we try to close the session, and receive the following error failed to close session: EOF ``` It appears that we cannot close the session, since an EOF has already been sent for it. I am not exactly sure where this comes from. I've posted an issue / question in the GitHub repository for golang: https://github.com/golang/go/issues/32453 . Our attempts to send signals (SIGHUP and SIGINT) are met by this same EOF error as well; there is no point on waiting for the session to finish in this case, so just try to close it and move on, and not leak goroutines that will be stuck forever. When running `gops` now against the Gingko test suite, we no longer accrue a ton of these goroutines blocked on `session.Wait()` - the biggest # of occcurrences for "similar" goroutines is at most 2 in a stack trace captured below, for example: ``` $ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out | head -n 15 14 unique stack traces 2 occurences. Sample stack trace: internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0) /usr/local/go/src/runtime/netpoll.go:173 +0x66 internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738) /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a ... ``` Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
a049140 node/store: Do not delete node key in kvstore on node registration failure [ upstream commit e8cb4205e5027cec21dc726691f7e5347a34438f ] When registration fails, do not attempt to remove a possibly existing key in the kvstore. This may in fact remove a key that is already present from a previous agent run. If the agent fails to update the key it will eventually expire via the lease. Removing it on the first failed attempt to update will trigger an unwanted delete notification. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
15fdc43 kvstore/store: Do not remove local key on sync failure [ upstream commit 960da244c42d1d9d806a9c76bbfc2620ccd30bb4 ] The only failure scenario is if the key could not be updated/inserted. Removing the local key may remove an already existing key in the kvstore. To keep the behavior of UpdateLocalKeySync() consistent in all cases, the store is locked and the local key is only updated after successful insertion into the kvstore. Removing the key erroneously can lead to a delete notification being triggered for the own key. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
cd1cbc7 node: Delay handling of node delete events received via kvstore [ upstream commit 3137ad5eee6c617e1acc8d5b9fcae7c899a77c97 ] As for other kvstore keys, each node will protect its own node key and re-create it as needed on loss of kvstore state. However, when immediately acting upon receiving a node delete event via the kvstore, the removal of routes and similar can cause etcd itself to become unreachable when etcd is hosted as a pod. A subsequent re-addition of the key can then not be received and the node is lost forever. There are several more complex long-term options including relying on k8s node state. For a short-term resolution, introduce a 30 second delay when handling node delete events from the kvstore and require that the node not be re-created in that timeframe. This workaround works regardless of how node discovery is performed. A potential side effect of this arises when a node re-appears with different IP addressing. In that scenario, the k8s node delete event will forcefully remove all state. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
5aeccef test/provision: bump k8s 1.12 to 1.12.9 [ upstream commit fcb7a5a3171e8bf5d11bcf868b7e3f6c2bbf6b76 ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
e282209 pkg/kvstore: do not always UpdateIfDifferent with and without lease [ upstream commit 361c99dd1082d63384748b44e78098281a9c5808 ] We should only update the value of the given key because of a differing lease if we are trying to update its lease as well. Fixes: 0259ad0568fd ("pkg/kvstore: perform update if value or lease are different") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
5941dd8 Don't overwrite minRequired in WaitforNPods [ upstream commit 1105fb428d2f38c3ce65215a519d3d9738d31807 ] If `minRequired` was set to 0, it was overwritten on the first pass of `body` function (since it's in this function's closure) and if some pods were being deleted during the test startup time, it could cause `minRequired` to be higher than possible with test pods number. Signed-off-by: Maciej Kwiek <maciej@covalent.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
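A simplified sketch of the closure bug, with invented helper functions: computing a per-attempt local value keeps the captured minRequired intact across retries.

```go
package main

import "fmt"

// waitForNPods illustrates the fix: writing to the captured minRequired
// inside the retry body would change it for every later attempt, so a
// local `required` is derived each pass instead. countReady and
// countAll are stand-ins for querying pod state.
func waitForNPods(minRequired int, countReady, countAll func() int, attempts int) bool {
	body := func() bool {
		required := minRequired
		if required == 0 {
			// Default to "all currently scheduled pods" without
			// clobbering the captured minRequired.
			required = countAll()
		}
		return countReady() >= required
	}
	for i := 0; i < attempts; i++ {
		if body() {
			return true
		}
	}
	return false
}

func main() {
	ready, all := 3, 3
	fmt.Println(waitForNPods(0, func() int { return ready }, func() int { return all }, 3))
}
```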
cab47ed daemon: Don't log endpoint restore if IP alloc fails [ upstream commit f063e7e4b22a075f49c0f19477942cfbb9902dbe ] In the case where the IP for a restoring endpoint cannot be re-allocated, log this and error out earlier, before reporting that the endpoint is being restored. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
1fe3479 daemon: Refactor individual endpoint restore [ upstream commit 6f01080db685816bf0d69f0f92623b86acd29378 ] Factor out aspects of restoring an endpoint which require directly interacting with other daemon-managed components. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
a7688a0 test: provide context which will be canceled to `CiliumExecContext` [ upstream commit 42a899c30348dd77d5314d1478b45cb0fc94049f ] Providing just `context.Background()` means that the context provided to `CiliumExecContext` is never canceled. Provide a context that uses the existing 4 minute timeout, and cancel the context after the call to `CiliumExecContext` is finished so that goroutines don't leak. Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> 06 June 2019, 07:41:16 UTC
3907fc9 Jenkinsfile: backport all Jenkinsfile from master We have made lots of changes in jenkins that were not and should have been backported into this branch. For this reason this commit is a copy of all Jenkinsfiles present in master with the k8s versions being tested changed to be v1.5 specific. Signed-off-by: André Martins <andre@cilium.io> 05 June 2019, 15:32:54 UTC
2686534 doc: Document regressions in 1.5.0 and 1.5.1 Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 13:59:46 UTC
fce6b8f Prepare for release v1.5.2 Signed-off by: Ian Vernon <ian@cilium.io> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 13:59:46 UTC
e39374b test: Disable unstable K8sDatapathConfig Encapsulation Check connectivity with transparent encryption and VXLAN encapsulation Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
9e010c5 Add kvstore quorum check to Cilium precheck [ upstream commit 10a0e337f0e845d5fa772db8a9f7d162d6d05500 ] Signed-off-by: Maciej Kwiek <maciej@covalent.io> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
9f756cb pkg/kvstore: acquire a random initlock [ upstream commit 394c93ec0fa69c9c6731944fe5adcb27d75265c6 ] If all cilium agents try to acquire the same lock, they can wrongly conclude that etcd does not have quorum, when in reality one client may have been killed while holding the lock and its lease hasn't expired yet. Having all agents acquire a random path makes it much less likely for 2 agents to contend on the same path. Fixes: 680b2ee4b96c ("kvstore: Wait for kvstore to reach quorum") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
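A sketch of what a randomized lock path could look like; the path layout and function name are assumptions, not the actual Cilium key schema.

```go
package kvstore

import (
	"fmt"
	"math/rand"
)

// randomInitLockPath picks a random lock key for the quorum probe so
// that agents do not all contend on one well-known key, where a dead
// client's unexpired lease can look like lost quorum.
func randomInitLockPath() string {
	return fmt.Sprintf("cilium/.initlock/%016x", rand.Uint64())
}
```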
46777c9 kvstore: Wait for kvstore to reach quorum [ upstream commit 680b2ee4b96c335c8873301b869100a0614fc00c ] So far, the channel returned by the Connected() function of the kvstore client was closed when the kvstore could be connected to. This did not guarantee that the kvstore actually had quorum and operations could still block and timeout afterwards. Wait for a distributed lock to be acquired to identify the moment quorum has been reached. Also indicate the quorum status in the kvstore status of `cilium status` for etcd. Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
fe5e8f5 ipcache: Fix automatic recovery of deleted ipcache entries [ upstream commit f7064bf5242e4e85b2b512c2e0cb8cbfceb22379 ] The existing delete recovery logic was relying on upsert() to be called on kvReferenceCounter. This was never done, though, so when the delete handler checked for local existence, no entry was ever found. Remove the obsolete upsert() handler of kvReferenceCounter as no reference counting is needed and insert an entry into marshaledIPIDPairs from UpsertIPToKVStore() directly. Fixes: b76bb7f504a ("ipcache: Fix refcounting of endpoint IPCache entries") Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
0b10a34 tests, k8s: add monitor dump helper for debugging [ upstream commit c7f32d6b12a69c298ecaded5a141a6f01561dc49 ] So far the MonitorStart() command was only a helper for the SSHMeta type. As a result, we couldn't use it for the Kubernetes tests under k8sT/. Thus, add support for MonitorStart() for the Kubectl type along with an example usage. Having the helper avoids others having to reimplement it. I added the option to specify a file name where the contents of the dump should be saved, in case one wants to trace multiple exec calls to curl and whatnot without appending to the same log. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
2d5baaa bugtool: add raw dumps of all lb and lb-related maps [ upstream commit e9d0989fa270256576934ad6b0e13629a0bef7ab ] It turned out to be tremendously helpful to have these at hand when debugging some of the recent issues from lb handling. For newer LLVMs we can do pretty-printing for them in future. Also add a few other misc commands from iproute2 that could be helpful in future. Fixes: #8081 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
8a1c4ba pkg/labels: ignore all labels that match the regex "annotation.*" [ upstream commit 53c2e816bceb415f5a18bb15b578565403a3b834 ] This will help to decrease the cardinality of labels used to create a security identity since pod annotations can be used for numerous reasons and aren't used for network policy enforcement. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
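A hedged Go sketch of such filtering; the regex below approximates the pattern named in the commit title, and the function is illustrative rather than the actual pkg/labels code.

```go
package labels

import "regexp"

// annotationLabel approximates the regex named in the commit title;
// labels matching it are ignored when computing a security identity.
var annotationLabel = regexp.MustCompile(`^annotation.*`)

// filterIdentityLabels drops annotation-derived labels so that pod
// annotations do not inflate the number of security identities.
func filterIdentityLabels(all map[string]string) map[string]string {
	kept := make(map[string]string, len(all))
	for k, v := range all {
		if annotationLabel.MatchString(k) {
			continue
		}
		kept[k] = v
	}
	return kept
}
```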
f6c8e38 docs: Add note about keeping enable-legacy-services [ upstream commit 86ddc72e61534e940938f49403a9c5e97bf8ebe4 ] When `--enable-legacy-services` is set to true, we consider the legacy maps as a source of the truth. Thus, if a user disables and afterwards enables the option, the svc-v2 maps will be lost. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
2d86de3 docs: Add note about running preflight-with-rm-svc-v2.yaml [ upstream commit a115fceb992ed5fa1d084650c9acbb151594f335 ] This is needed for users who want to do the v1.5 -> v1.4 -> v1.5 upgrade. Main reason for requiring this is that in v1.4 we do not maintain the svc-v2 BPF maps, so they become stale. When doing the service restoration from legacy to v2, we do not compare endpoint entries of the maps (neither does `syncLBMapsWithK8s`), so it's better to remove the v2 maps to force full restoration from the legacy maps. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC
a4c2e6a examples: Add preflight DaemonSet for svc-v2 removal [ upstream commit 04d11b5e57fb2ffd2ddeb7dcdfc1ad2dbaa2f881 ] This DaemonSet is going to be used by users who upgraded to Cilium v1.5, then downgraded to <v1.5 and want to upgrade to v1.5 again. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Thomas Graf <thomas@cilium.io> 04 June 2019, 10:27:50 UTC