Closed
A bunch of Mellanox Jenkins runs have been failing on master with this kind of error:
18:32:14 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 8 --map-by dist -mca rmaps_dist_device mlx4_0 -x TEST_CLOSEST_NUMA -x TEST_PHYS_ID_COUNT -x TEST_CORE_ID_COUNT /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-2/jenkins_scripts/jenkins/ompi/mindist_test
18:32:16
18:32:16 Success rank - 4: only one NUMA is scheduled.
18:32:16
18:32:16 Success rank - 6: only one NUMA is scheduled.
18:32:16
18:32:16 Success rank - 0: only one NUMA is scheduled.
18:32:16
18:32:16 Success rank - 2: only one NUMA is scheduled.
18:32:16
18:32:16 Error rank - 5: scheduled on wrong NUMA node - 1, should be 0
18:32:16
18:32:16 Error rank - 3: scheduled on wrong NUMA node - 1, should be 0
18:32:16
18:32:16 Error rank - 1: scheduled on wrong NUMA node - 1, should be 0
18:32:16
18:32:16 Error rank - 7: scheduled on wrong NUMA node - 1, should be 0
18:32:17 -------------------------------------------------------
18:32:17 Primary job terminated normally, but 1 process returned
18:32:17 a non-zero exit code. Per user-direction, the job has been aborted.
18:32:17 -------------------------------------------------------
18:32:17 --------------------------------------------------------------------------
18:32:17 mpirun detected that one or more processes exited with non-zero status, thus causing
18:32:17 the job to be terminated. The first process to do so was:
18:32:17
18:32:17 Process name: [[34935,1],5]
18:32:17 Exit code: 1
18:32:17 --------------------------------------------------------------------------
In #1612 (comment), @jladd-mlnx reported that the PMIx external component changes (i.e., the changes from that PR) seem to be what broke this test.
@rhc54 Can you investigate?
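For triage, the per-rank results in a log like the one above can be summarized with a small script. This is only a sketch: it assumes the `Success rank - N:` and `Error rank - N: scheduled on wrong NUMA node - A, should be B` line formats shown in the log, and the function name `summarize_mindist_log` is hypothetical, not part of the Jenkins scripts or of Open MPI.

```python
import re

# Line formats are taken from the Jenkins log in this issue, e.g.:
#   "18:32:16 Error rank - 5: scheduled on wrong NUMA node - 1, should be 0"
#   "18:32:16 Success rank - 4: only one NUMA is scheduled."
ERROR_RE = re.compile(
    r"Error rank - (\d+): scheduled on wrong NUMA node - (\d+), should be (\d+)"
)
SUCCESS_RE = re.compile(r"Success rank - (\d+):")


def summarize_mindist_log(text):
    """Return (ok_ranks, wrong_ranks) parsed from mindist_test output.

    ok_ranks is a sorted list of ranks that reported success.
    wrong_ranks maps rank -> (actual_numa, expected_numa).
    Hypothetical helper for illustration only.
    """
    ok_ranks = []
    wrong_ranks = {}
    for line in text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            rank, actual, expected = (int(g) for g in match.groups())
            wrong_ranks[rank] = (actual, expected)
            continue
        match = SUCCESS_RE.search(line)
        if match:
            ok_ranks.append(int(match.group(1)))
    return sorted(ok_ranks), dict(sorted(wrong_ranks.items()))
```

Run against the log in this issue, it reports the even ranks (0, 2, 4, 6) as successful and the odd ranks (1, 3, 5, 7) as placed on NUMA node 1 instead of 0.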