I'm seeing leftover processes littering our Jenkins server. The cause is that the signal sent by timeout is not properly propagated to the application processes in v2.x. The following example hangs:
$ timeout -s SIGSEGV 2m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 12,13 sleep 10000
[jenkins03:02612] *** Process received signal ***
[jenkins03:02612] Signal: Segmentation fault (11)
[jenkins03:02612] Signal code: (0)
[jenkins03:02612] Failing at address: 0x10af00000a33
[jenkins03:02612] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff6898100]
[jenkins03:02612] [ 1] /usr/lib64/libc.so.6(epoll_wait+0x33)[0x7ffff65be7a3]
[jenkins03:02612] [ 2] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(+0x8ec93)[0x7ffff785ac93]
[jenkins03:02612] [ 3] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x170)[0x7ffff785e6e0]
[jenkins03:02612] [ 4] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x40541a]
[jenkins03:02612] [ 5] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403730]
[jenkins03:02612] [ 6] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff64e9b15]
[jenkins03:02612] [ 7] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403649]
[jenkins03:02612] *** End of error message ***
oshrun: Forwarding signal 18 to job
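Until the propagation is fixed, one possible stopgap on the CI side is to run the launcher in its own process group and signal the whole group on timeout, rather than relying on the launcher to forward the signal. The following is only a minimal sketch of that idea, not something taken from Open MPI or the Jenkins setup: the function name `run_with_group_timeout` and its parameters are hypothetical, and it only reaches processes that stay in the launcher's process group. Whether the ranks started by oshrun do so is an assumption this issue does not establish.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_s, sig=signal.SIGTERM, grace_s=5.0):
    """Run cmd in its own session; on timeout, signal its whole process group.

    Returns (returncode, timed_out). Hypothetical helper for illustration.
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its PID is also the process-group ID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    # Ask every process in the group to stop, not just the launcher.
    os.killpg(pgid, sig)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    # Escalate to SIGKILL for anything in the group that is still alive.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already empty
    return proc.wait(), True
```

A call such as `run_with_group_timeout(["sh", "-c", "sleep 30 & sleep 30"], 0.5)` returns with `timed_out` set to `True` and terminates both `sleep` processes, because the background one shares the group. Processes that move themselves into a different process group would not be reached this way.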
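master is not affected. Here the same job is run with oshrun's own --timeout 20 option instead of the external timeout command: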
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --report-state-on-timeout --get-stack-traces --timeout 20 --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 10,11 sleep 10000
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:
Timeout: 20 seconds
The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [12390,0]
Num apps: 1 Num procs: 1 JobState: ALL DAEMONS REPORTED Abort: False
Num launched: 0 Num reported: 1 Num terminated: 0
Procs:
Rank: 0 Node: jenkins03 PID: 2845 State: RUNNING ExitCode 0
DATA FOR JOB: [12390,1]
Num apps: 1 Num procs: 8 JobState: RUNNING Abort: False
Num launched: 8 Num reported: 0 Num terminated: 0
Procs:
Rank: 0 Node: jenkins03 PID: 2853 State: RUNNING ExitCode 0
Rank: 1 Node: jenkins03 PID: 2854 State: RUNNING ExitCode 0
Rank: 2 Node: jenkins03 PID: 2855 State: RUNNING ExitCode 0
Rank: 3 Node: jenkins03 PID: 2856 State: RUNNING ExitCode 0
Rank: 4 Node: jenkins03 PID: 2857 State: RUNNING ExitCode 0
Rank: 5 Node: jenkins03 PID: 2858 State: RUNNING ExitCode 0
Rank: 6 Node: jenkins03 PID: 2859 State: RUNNING ExitCode 0
Rank: 7 Node: jenkins03 PID: 2860 State: RUNNING ExitCode 0
Waiting for stack traces (this may take a few moments)...
STACK TRACE FOR PROC [[12390,1],0] (jenkins03, PID 2853)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],1] (jenkins03, PID 2854)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],2] (jenkins03, PID 2855)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],3] (jenkins03, PID 2856)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],4] (jenkins03, PID 2857)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],5] (jenkins03, PID 2858)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],6] (jenkins03, PID 2859)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
STACK TRACE FOR PROC [[12390,1],7] (jenkins03, PID 2860)
#0 0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
#1 0x0000000000403e5f in rpl_nanosleep ()
#2 0x0000000000403cc0 in xnanosleep ()
#3 0x00000000004016cd in main ()
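The per-rank lines in the job state report above are regular enough that a cleanup script could pull out the PIDs still marked RUNNING. The sketch below assumes the line layout is exactly as pasted in this issue (the real output may use different whitespace, which is why the pattern is lenient about spaces); `PROC_LINE` and `running_procs` are hypothetical names, not part of Open MPI. It also returns the entry from the first job block (PID 2845 in the report above), so a caller would need to decide whether to treat that one differently.

```python
import re

# Matches per-rank lines such as:
#   Rank: 0 Node: jenkins03 PID: 2853 State: RUNNING ExitCode 0
PROC_LINE = re.compile(
    r"Rank:\s*(?P<rank>\d+)\s+Node:\s*(?P<node>\S+)\s+PID:\s*(?P<pid>\d+)"
    r"\s+State:\s*(?P<state>.+?)\s+ExitCode\s+(?P<exit_code>-?\d+)"
)


def running_procs(report):
    """Return (node, pid) pairs for every rank reported as RUNNING."""
    found = []
    for line in report.splitlines():
        m = PROC_LINE.search(line)
        if m and m.group("state") == "RUNNING":
            found.append((m.group("node"), int(m.group("pid"))))
    return found


sample = """\
Procs:
Rank: 0 Node: jenkins03 PID: 2853 State: RUNNING ExitCode 0
Rank: 1 Node: jenkins03 PID: 2854 State: RUNNING ExitCode 0
"""
print(running_procs(sample))
```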