Closed
Description
While triaging some long-standing failures I've been seeing in MTT on the 5.0.x and master branches, I've noticed that on several platforms the tests in ibm/collective/intercomm periodically time out and are marked as failed.
These tests appear to hang intermittently in the MPI_Comm_accept/connect phase, with backtraces like:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00002b104e30e193 in PMIx_Connect () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libpmix.so.0
#2 0x00002b104baa42b1 in ompi_dpm_connect_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#3 0x00002b104bae4a32 in PMPI_Comm_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#4 0x000000000040159b in main (argc=1, argv=0x7ffc80db1b08) at iscatterv_inter.c:67
I'm not seeing these failures on the 4.0.x and 4.1.x branches, but I haven't tried to reproduce them there.
Note that it can take quite a few back-to-back runs before one of these tests hangs. iscatterv_inter seems to be one of the more reliable reproducers.
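Since the hang only shows up after many back-to-back runs, a small driver loop may help with reproduction. The following is a minimal sketch, not part of MTT or the ibm test suite: the helper name `run_until_hang`, the 120 second timeout, and the `mpirun -np 4 ./iscatterv_inter` launch line are all assumptions to adjust for the local setup. It reruns a command until one run exceeds the timeout, then leaves that run alive so a debugger can be attached to collect a backtrace.

```python
import subprocess
import sys


def run_until_hang(cmd, max_runs=100, timeout_s=120.0):
    """Run cmd back to back until one run exceeds timeout_s.

    Returns (run_index, process) for the first run that times out, with the
    process left running so a debugger can be attached, or None if every
    run finished within the timeout.
    """
    for run_index in range(1, max_runs + 1):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Treat a timeout as a suspected hang and stop looping.
            return run_index, proc
    return None


if __name__ == "__main__":
    # Hypothetical launch line: process count and binary path are assumptions.
    cmd = sys.argv[1:] or ["mpirun", "-np", "4", "./iscatterv_inter"]
    result = run_until_hang(cmd)
    if result is None:
        print("no hang observed")
    else:
        run_index, proc = result
        print(f"run {run_index} appears hung; launcher pid {proc.pid} left running")
```

The suspected-hung launcher is deliberately not killed, so it and its ranks need to be cleaned up by hand after inspection.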