BTL vader (and sm) crash when filesystem of session directory is too small #4553

@jsquyres

Götz Waschk (@LaHaine) reported in a thread starting at https://www.mail-archive.com/[email protected]/msg30820.html that he saw crashes with the vader and sm BTLs once a job went above a certain number of processes (1024, in his case).

Later in the thread, it was determined that the /tmp filesystem where the session directory was located (and where the vader and sm BTLs put their memory-mapped files) was too small -- the job crashed once it filled up the filesystem.
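
To make the "too small" condition concrete, here is a minimal sketch of how free space in the session directory's filesystem could be checked up front with POSIX `statvfs()`. The function name `session_dir_has_space` and its return convention are hypothetical, made up for illustration; this is not existing Open MPI code.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper (not an Open MPI function).
 * Returns  1 if the filesystem holding `dir` has at least `needed` bytes
 *            available to unprivileged processes,
 *          0 if it has less,
 *         -1 if statvfs() itself fails (e.g. the directory does not exist). */
int session_dir_has_space(const char *dir, uint64_t needed)
{
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        return -1;
    }

    /* f_bavail counts free blocks usable by non-root users,
     * in units of f_frsize bytes. */
    uint64_t avail = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;

    return avail >= needed ? 1 : 0;
}
```

A check like this is only advisory: other processes on the node can consume the space between the check and the actual mapping, so it can improve the error message but cannot replace handling the failure at allocation time.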

Open MPI shouldn't crash in this case. It would be 97% better if we emitted an opal_show_help() message saying specifically what happened (i.e., that we effectively ran out of space in the session directory) and then died gracefully. Segv'ing -- or otherwise crashing -- feels like an avoidable error, and doesn't help the user diagnose what went wrong or how to fix it.

Note that the stack traces cited on the email thread were from the v1.10.x series, and probably aren't useful for checking exactly where this is happening on master and more recent release series (indeed, the sm BTL was removed starting with Open MPI v3.0.x; I list the sm BTL here simply because we're still supporting the v2.1.x series). But it shouldn't be hard to duplicate this error in a controlled environment and find where exactly the "out of space" issue is causing vader to crash (and sm, if we care).

Hopefully, this will lead to a fairly easy fix of emitting a show_help message and killing the job in an orderly fashion.
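
As a rough illustration of what "fail in an orderly fashion" could look like, the sketch below reserves the backing file's blocks with `posix_fallocate()` before mapping it. The assumption here, which the thread does not confirm, is that the crash comes from touching mapped pages that the filesystem cannot back: `ftruncate()` can succeed on a full filesystem because the file stays sparse, and the failure then surfaces later as a signal instead of an error code. The name `create_backing_segment` is hypothetical, and `fprintf()` stands in for the show_help message.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function).
 * Creates `path`, reserves `size` bytes of real storage for it, and maps it
 * shared. Returns the mapping, or NULL after printing a diagnostic. */
void *create_backing_segment(const char *path, size_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "cannot create %s: %s\n", path, strerror(errno));
        return NULL;
    }

    /* posix_fallocate() returns the error number directly (it does not set
     * errno). ENOSPC here means the filesystem is too small, and we learn
     * that before any page of the mapping is touched. */
    int rc = posix_fallocate(fd, 0, (off_t)size);
    if (rc != 0) {
        fprintf(stderr, "cannot reserve %zu bytes for %s: %s\n",
                size, path, strerror(rc));
        close(fd);
        unlink(path);
        return NULL;
    }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (seg == MAP_FAILED) {
        fprintf(stderr, "cannot map %s: %s\n", path, strerror(errno));
        unlink(path);
        return NULL;
    }

    return seg;
}
```

A caller that gets NULL back can report the problem and abort the job cleanly. The trade-off is that the blocks are allocated eagerly, so startup pays the full cost of the segment even if parts of it are never used.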
