Bus error with btl/sm+XPMEM in MPI_Finalize() #9868

@gkatev

Hi, I've been seeing some crashes related to btl/sm and XPMEM during MPI_Finalize().

Environment:

Open MPI 5.0.x (#b640590) (from git)
CentOS 8, aarch64

Example execution:

$(which mpirun) --host localhost:160 --mca coll basic,libnbc --mca pml ob1 --mca btl sm,self --mca smsc xpmem osu_bcast

Backtrace:

Program terminated with signal SIGBUS, Bus error.
(gdb) bt
#0  0x0000ffffae4ed550 in mca_btl_sm_check_fboxes () at ../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
#1  mca_btl_sm_component_progress () at btl_sm_component.c:578
#2  0x0000ffffae4717a8 in opal_progress () at runtime/opal_progress.c:224
#3  0x0000ffffaeaa625c in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:299
#4  0x00000000004019f0 in main (argc=<optimized out>, argv=<optimized out>) at osu_bcast.c:119

I believe this is related to XPMEM, because the crash goes away if I set smsc=cma (and because XPMEM is the usual trigger for bus errors?). This is an aarch64 system, but I can also reproduce the error on an x86 system. For what it's worth, I remember a similar (or the same?) bug from before smsc existed, so it might not be directly related to smsc.
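
For comparison, this is the variant of the example execution above that does not crash for me. It is only a sketch of the command line: everything is unchanged except the smsc value, and it assumes the cma component is available in the build.

```shell
# Same run as the example execution, but selecting CMA instead of XPMEM
# for shared-memory single-copy (only the smsc value differs).
$(which mpirun) --host localhost:160 --mca coll basic,libnbc --mca pml ob1 --mca btl sm,self --mca smsc cma osu_bcast
```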
