Signal forwarding is broken in ompi-release #2075

@artpol84

I'm seeing litter on our Jenkins server. The reason is that the signal sent by timeout is not properly propagated to the application processes in v2.x. The following example hangs:

$ timeout -s SIGSEGV 2m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 12,13 sleep 10000
[jenkins03:02612] *** Process received signal ***
[jenkins03:02612] Signal: Segmentation fault (11)
[jenkins03:02612] Signal code:  (0)
[jenkins03:02612] Failing at address: 0x10af00000a33
[jenkins03:02612] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff6898100]
[jenkins03:02612] [ 1] /usr/lib64/libc.so.6(epoll_wait+0x33)[0x7ffff65be7a3]
[jenkins03:02612] [ 2] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(+0x8ec93)[0x7ffff785ac93]
[jenkins03:02612] [ 3] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x170)[0x7ffff785e6e0]
[jenkins03:02612] [ 4] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x40541a]
[jenkins03:02612] [ 5] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403730]
[jenkins03:02612] [ 6] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff64e9b15]
[jenkins03:02612] [ 7] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403649]
[jenkins03:02612] *** End of error message ***
oshrun: Forwarding signal 18 to job

The master branch is not affected:

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --report-state-on-timeout --get-stack-traces --timeout 20 --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 10,11 sleep 10000
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 20 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [12390,0]
    Num apps: 1 Num procs: 1    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: jenkins03 PID: 2845   State: RUNNING  ExitCode 0

DATA FOR JOB: [12390,1]
    Num apps: 1 Num procs: 8    JobState: RUNNING   Abort: False
    Num launched: 8 Num reported: 0 Num terminated: 0

    Procs:
        Rank: 0 Node: jenkins03 PID: 2853   State: RUNNING  ExitCode 0
        Rank: 1 Node: jenkins03 PID: 2854   State: RUNNING  ExitCode 0
        Rank: 2 Node: jenkins03 PID: 2855   State: RUNNING  ExitCode 0
        Rank: 3 Node: jenkins03 PID: 2856   State: RUNNING  ExitCode 0
        Rank: 4 Node: jenkins03 PID: 2857   State: RUNNING  ExitCode 0
        Rank: 5 Node: jenkins03 PID: 2858   State: RUNNING  ExitCode 0
        Rank: 6 Node: jenkins03 PID: 2859   State: RUNNING  ExitCode 0
        Rank: 7 Node: jenkins03 PID: 2860   State: RUNNING  ExitCode 0

Waiting for stack traces (this may take a few moments)...
STACK TRACE FOR PROC [[12390,1],0] (jenkins03, PID 2853)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],1] (jenkins03, PID 2854)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],2] (jenkins03, PID 2855)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],3] (jenkins03, PID 2856)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],4] (jenkins03, PID 2857)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],5] (jenkins03, PID 2858)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],6] (jenkins03, PID 2859)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],7] (jenkins03, PID 2860)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()
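
For anyone who wants to check for this kind of litter from a CI script, here is a rough sketch of how leftover application processes could be detected once the launcher has been signalled. It is not part of Open MPI: `still_running` and `wait_for_exit` are hypothetical helper names, and the sketch assumes the PIDs are already known (for example from the "Rank ... PID" lines above) and belong to the same user that runs the check.

```shell
#!/bin/sh
# Sketch only: still_running and wait_for_exit are hypothetical helpers,
# not Open MPI tools.

# still_running PID: succeeds while the process exists.
# Signal 0 delivers nothing; it only probes for existence. This assumes the
# process belongs to the calling user, otherwise kill fails with EPERM and
# the process is reported as gone.
still_running() {
    kill -0 "$1" 2>/dev/null
}

# wait_for_exit PID SECONDS: poll once per second.
# Returns 0 if the process disappears within SECONDS,
# returns 1 if it is still there afterwards (i.e. it is litter).
wait_for_exit() {
    pid=$1
    remaining=$2
    while still_running "$pid"; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

As an illustration, `wait_for_exit 2853 10 || echo "PID 2853 is still running"` would flag the first rank from the report above if it outlived the launcher by more than 10 seconds.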
