Description
This issue started with this post on the users list: https://www.mail-archive.com/[email protected]/msg34892.html
I Zoom'ed with Scott, and we dug into this a bit. The root cause of the problem appears to be the do_child() function in the default ODLS module. Specifically:
long fd, fdmax = sysconf(_SC_OPEN_MAX);
This line calls sysconf() to determine how many FDs to close in the child process that was just forked (before the exec).
On Scott's system, the value returned from this call is -1, which gets interpreted as long_max (i.e., in the billions). His system appears to hang, but it's not hung -- it's really just that the child is looping billions of times calling close() in this loop:
ompi/orte/mca/odls/default/odls_default_module.c
Lines 404 to 411 in 8a1f456
for(fd=3; fd<fdmax; fd++) {
    if (
#if OPAL_PMIX_V1
        fd != cd->opts.p_internal[1] &&
#endif
        fd != write_fd) {
        close(fd);
    }
Scott is running:
We interactively added an opal_output() in the ODLS default component and saw:
- The value returned by sysconf() is -1, which is interpreted as long_max (something in the billions).
- After the call, errno was set to 22/Invalid argument (although I neglected to set errno to 0 before the call to sysconf(), so I'm not 100% sure that that errno value is from that call to sysconf())
- The value of _SC_OPEN_MAX is 5 (which is the same as it is on my Intel MacOS 12.3.1 machine)
This is happening on Open MPI 4.1.x, but since this code hasn't changed in forever, I suspect it's happening on all versions of Open MPI / PRTE.