MTT: several intercomm tests periodically hang in call to PMIx_Connect #8958

@hppritcha

While triaging some long-standing MTT failures on the 5.0.x and master branches, I've noticed that on several platforms the tests in ibm/collective/intercomm periodically time out and are marked as failed.

These tests appear to hang intermittently in the MPI_Comm_accept/connect phase, with tracebacks like:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00002b104e30e193 in PMIx_Connect () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libpmix.so.0
#2  0x00002b104baa42b1 in ompi_dpm_connect_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#3  0x00002b104bae4a32 in PMPI_Comm_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#4  0x000000000040159b in main (argc=1, argv=0x7ffc80db1b08) at iscatterv_inter.c:67

I'm not seeing these failures on the 4.0.x and 4.1.x branches, but I haven't tried to reproduce the hang on those branches.

Note that it can take quite a few back-to-back runs before one of these tests hangs. iscatterv_inter seems to be one of the more reliable reproducers.
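
Since the hang only shows up after many back-to-back runs, a small loop with a per-run timeout may help anyone trying to reproduce it outside MTT. The following is only a sketch: the helper name `run_until_hang`, the run count, the timeout, and the launch line in the comment are all assumptions for illustration, not what MTT actually does.

```python
import subprocess
import sys


def run_until_hang(cmd, max_runs=200, timeout_s=120.0):
    """Run cmd back to back and report the first run that looks hung.

    Returns the 1-based index of the first run that exceeds timeout_s,
    or None if all max_runs runs finish in time. Exit codes are ignored
    on purpose: only a timeout is treated as a hang.
    """
    for attempt in range(1, max_runs + 1):
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the direct child before raising.
            return attempt
    return None


# Hypothetical usage; the process count and binary path are assumptions:
#   hung_on = run_until_hang(["mpirun", "-np", "4", "./iscatterv_inter"])
#   print("no hang seen" if hung_on is None else f"hang on run {hung_on}")
```

Note that `subprocess.run` kills the launcher once the timeout expires, so this only detects the hang. To capture a traceback like the one above, the hung processes would have to be left running and inspected with a debugger instead.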
