Description
Hi, I've been seeing some crashes related to btl/sm and XPMEM during MPI_Finalize().
Environment:
Open MPI 5.0.x (#b640590) (from git)
CentOS 8, aarch64
Example execution:
$(which mpirun) --host localhost:160 --mca coll basic,libnbc --mca pml ob1 --mca btl sm,self --mca smsc xpmem osu_bcast
Backtrace:
Program terminated with signal SIGBUS, Bus error.
(gdb) bt
#0 0x0000ffffae4ed550 in mca_btl_sm_check_fboxes () at ../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
#1 mca_btl_sm_component_progress () at btl_sm_component.c:578
#2 0x0000ffffae4717a8 in opal_progress () at runtime/opal_progress.c:224
#3 0x0000ffffaeaa625c in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:299
#4 0x00000000004019f0 in main (argc=<optimized out>, argv=<optimized out>) at osu_bcast.c:119
I claim that it is related to XPMEM, because the crash goes away if I set smsc=cma (and because it is XPMEM that traditionally triggers bus errors?). This is an aarch64 system, but I can also reproduce the error on an x86 one. For what it's worth, I do remember a similar (or the same?) bug from before smsc's time, so it might not be directly related to smsc.
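For anyone who wants to compare the two cases, here is a minimal sketch of switching the single-copy component to CMA through the environment. It assumes the standard Open MPI convention that an MCA parameter can be set with an OMPI_MCA_-prefixed environment variable. I have not verified the exact precedence on this build, so drop the explicit smsc option from the command line when trying it.

```shell
# Select the CMA single-copy component instead of XPMEM via the environment.
# Assumption: this is equivalent to passing "--mca smsc cma" to mpirun.
export OMPI_MCA_smsc=cma

# Then rerun the example execution above, leaving out "--mca smsc xpmem"
# so the command line does not override the environment setting.
```

With this setting the run should correspond to the smsc=cma case described above, where the SIGBUS does not appear.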