https://github.com/JuliaLang/julia
Revision 174fa0e8af7d63725a34def188848670f13d3247 authored by Keno Fischer on 11 March 2020, 22:19:44 UTC, committed by Keno Fischer on 12 March 2020, 03:19:14 UTC
This removes a bad test from Distributed. The failure here looks similar to #35081,
but is not quite the same (though the failure is also suppressed by the fix in #35081).
The issue in #35081 is that the `Distributed` test tries to use the `@distributed` macro,
on the main test cluster, which in turn initializes the lazy connections between the workers
in the all_to_all cluster topology. Now, in #35081, the issue was that this initialization on the
other worker was delayed and in the meantime the test harness had reaped the worker running
the Distributed test, so having the other worker connect to it failed (with ECONNREFUSED).

Here we face a similar issue. As in #35081, the use of the `@distributed` macro causes the lazy
links to be initialized, but here the issue is that the Distributed tests call the internal function
`bind_client_port` which tries to allocate a socket on the common client port. In the actual Distributed
source, we only ever call this function together with the `SO_REUSEPORT` setsocketopt, and we rely
this option being set on all sockets initialized on our client port. Here, however, we do not set this option,
so whether or not the subsequent worker-to-worker connection (triggered by the distributed macro
from #35081) succeeds depends on whether the socket created here was reaped by GC (and likely
has subsequently exited the CLOSE_WAIT state) allowing its reallocation. If not, we observe the CI
error (but with EADDRINUSE rather than ECONNREFUSED).

This PR just deletes the problematic test. I don't think it has much value. The tested function is an internal
implementation detail and is not exported. It is exercised plenty during regular cluster construction.
One could try to test the higher level `socket_reuse_port` function, but to do so properly would basically
amount to building the inverse logic of that function into that test, which seems unnecessary.

For future reference, to aid in debugging similar issues, here's a dtrace script I used to print the
sequence of socket operations performed by Julia. It showed this sequence, though until we found
#35081, I didn't understand how it came to be. Nevertheless, it may be helpful for future similar
issues:

```
inline string JULIA = "julia";

#pragma D option quiet

/* From sys/socket.h */
inline int af_inet = 2;
inline int af_inet6 = 30;
inline int so_reuseport = 0x0200;

syscall::setsockopt:entry /execname==JULIA && arg2==so_reuseport/ {
    self->start = timestamp;
    self->fd = arg0;
}

syscall::setsockopt:return /execname==JULIA && self->start/ {
    printf("%-11s %-6d %-4d %-16s %-3d SO_REUSEPORT %d\n", probefunc, pid, arg0, execname,
        self->fd, errno);
    self->start = 0;
}

/* Adapted from https://github.com/brendangregg/DTrace-book-scripts/blob/master/Chap6/soconnect.d */
dtrace:::BEGIN
{
    /* Add translations as desired from /usr/include/sys/errno.h */
    err[0]            = "Success";
    err[EINTR]        = "Interrupted syscall";
    err[EIO]          = "I/O error";
    err[EACCES]       = "Permission denied";
    err[ENETDOWN]     = "Network is down";
    err[ENETUNREACH]  = "Network unreachable";
    err[ECONNRESET]   = "Connection reset";
    err[ECONNREFUSED] = "Connection refused";
    err[ETIMEDOUT]    = "Timed out";
    err[EHOSTDOWN]    = "Host down";
    err[EHOSTUNREACH] = "No route to host";
    err[EINPROGRESS]  = "In progress";

    printf("%-11s %-6s %-4s %-16s %-3s %-16s %-5s %8s %s\n", "SYSCALL", "PID", "FD", "PROCESS", "FAM",
        "ADDRESS", "PORT", "LAT(us)", "RESULT");
}


syscall::connect*:entry, syscall::bind:entry /execname==JULIA/
{
    /* assume this is sockaddr_in until we can examine family */
    this->s = (struct sockaddr_in *)copyin(arg1, sizeof (struct sockaddr));
    self->fd = arg0;
    this->f = this->s->sin_family;
}

syscall::connect*:entry, syscall::bind:entry /this->f == af_inet && execname==JULIA/
{
    self->family = this->f;
    self->port = ntohs(this->s->sin_port);
    self->address = inet_ntop(self->family, (void *)&this->s->sin_addr);
    self->start = timestamp;
}


syscall::connect*:entry, syscall::bind:entry /this->f == af_inet6 && execname==JULIA/
{
    /* refetch for sockaddr_in6 */
    this->s6 = (struct sockaddr_in6 *)copyin(arg1,
        sizeof (struct sockaddr_in6));
    self->family = this->f;
    self->port = ntohs(this->s6->sin6_port);
    self->address = inet_ntop(self->family, (struct in6_addr *)&this->s6->sin6_addr);
    self->start = timestamp;
}

syscall::connect*:return, syscall::bind:return /self->start && execname==JULIA/
{
    this->delta = (timestamp - self->start) / 1000;
    printf("%-11s %-6d %-4d %-16s %-3d %-16s %-5d %8d %d\n", probefunc, pid, self->fd, execname,
        self->family, self->address, self->port, this->delta, errno);
    self->family = 0;
    self->address = 0;
    self->port = 0;
    self->start = 0;
}
```
1 parent 5c21be3
History
Tip revision: 174fa0e8af7d63725a34def188848670f13d3247 authored by Keno Fischer on 11 March 2020, 22:19:44 UTC
Distributed: Delete bad test
Tip revision: 174fa0e
File Mode Size
.devcontainer
base
contrib
deps
doc
etc
src
stdlib
test
ui
.appveyor.yml -rw-r--r-- 2.1 KB
.gitattributes -rw-r--r-- 65 bytes
.gitignore -rw-r--r-- 273 bytes
.mailmap -rw-r--r-- 11.0 KB
.travis.yml -rw-r--r-- 6.1 KB
CITATION.bib -rw-r--r-- 2.6 KB
CONTRIBUTING.md -rw-r--r-- 21.4 KB
HISTORY.md -rw-r--r-- 275.6 KB
LICENSE.md -rw-r--r-- 5.0 KB
Make.inc -rw-r--r-- 41.4 KB
Makefile -rw-r--r-- 27.0 KB
NEWS.md -rw-r--r-- 6.2 KB
README.md -rw-r--r-- 6.3 KB
VERSION -rw-r--r-- 10 bytes
sysimage.mk -rw-r--r-- 3.8 KB

README.md

back to top