Description
We first noticed this in the v3.0.x release stream, as a difference in behavior from the v2.x release stream. I believe it also impacts v3.1.x and master.
This is fallout from pushing the mapping/ordering mechanism to the backend nodes, and likely from some of the improvements for the DVM and comm_spawn. Any fix would need to be careful not to break or hinder those features.
SPMD case
We are launching mpirun from node c712f6n01, and using the following hostfile to land the first few ranks on a remote node (c712f6n04) before using the local node.
shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
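For reference, hello_c here is just a basic MPI hello world that prints the rank, job size, hostname, and PID, and echoes any argument it is given (the MPMD case below relies on that). The exact source is not shown in this issue; a minimal sketch that would produce output in roughly the format of the listings below is:

/* Minimal sketch of a hello_c-style test program (not the exact source used
 * in this issue; the output formatting here is an assumption based on the
 * listings below). */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Print rank, size, hostname, and PID; echo argv[1] (e.g., "A"/"B")
     * so MPMD runs can show which app context a rank came from. */
    printf("%d/%2d) [%s] %d Hello, world!%s%s\n",
           rank, size, host, (int)getpid(),
           argc > 1 ? " " : "", argc > 1 ? argv[1] : "");

    MPI_Finalize();
    return 0;
}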
In v2.x you can see the order is preserved, with ranks 0 and 1 on c712f6n04 followed by ranks 2 and 3 on c712f6n01 (where mpirun/the HNP resides):
shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
0/ 8) [c712f6n04] 56846 Hello, world!
1/ 8) [c712f6n04] 56847 Hello, world!
2/ 8) [c712f6n01] 65059 Hello, world!
3/ 8) [c712f6n01] 65060 Hello, world!
4/ 8) [c712f6n03] 62352 Hello, world!
5/ 8) [c712f6n03] 62353 Hello, world!
6/ 8) [c712f6n02] 74593 Hello, world!
7/ 8) [c712f6n02] 74594 Hello, world!
In v3.0.x you can see that the HNP node is always first in the list, followed by the ordered list from the hostfile less the HNP node. This puts ranks 0 and 1 on c712f6n01 instead of on c712f6n04, where the user wanted them.
shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
0/ 8) [c712f6n01] 64629 Hello, world!
1/ 8) [c712f6n01] 64630 Hello, world!
2/ 8) [c712f6n04] 56447 Hello, world!
3/ 8) [c712f6n04] 56448 Hello, world!
4/ 8) [c712f6n03] 61943 Hello, world!
5/ 8) [c712f6n03] 61944 Hello, world!
6/ 8) [c712f6n02] 74189 Hello, world!
7/ 8) [c712f6n02] 74190 Hello, world!
The expected result should match v2.x's behavior with this hostfile, preserving the ordering:
shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
0/ 8) [c712f6n04] 56846 Hello, world!
1/ 8) [c712f6n04] 56847 Hello, world!
2/ 8) [c712f6n01] 65059 Hello, world!
3/ 8) [c712f6n01] 65060 Hello, world!
4/ 8) [c712f6n03] 62352 Hello, world!
5/ 8) [c712f6n03] 62353 Hello, world!
6/ 8) [c712f6n02] 74593 Hello, world!
7/ 8) [c712f6n02] 74594 Hello, world!
MPMD case
Again, we are launching mpirun from node c712f6n01.
Consider these two hostfiles containing different orderings of these four machines.
shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$ cat hostfile-c
c712f6n04 slots=2
c712f6n02 slots=2
c712f6n03 slots=2
c712f6n01 slots=2
The hello world program prints its argument in its output to make it clear which app context it originated from (the A and B values in the output below).
In v2.x we get some odd behavior in the second app context's mapping (likely due to the bookmark not being reset quite right - notice the iteration step between the app context assignments of ranks 3 and 4):
shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --hostfile ./hostfile-c ./hello_c B | sort
0/ 8) [c712f6n04] 56926 Hello, world! A
1/ 8) [c712f6n01] 65108 Hello, world! A
2/ 8) [c712f6n03] 62435 Hello, world! A
3/ 8) [c712f6n02] 74671 Hello, world! A
4/ 8) [c712f6n02] 74672 Hello, world! B
5/ 8) [c712f6n03] 62436 Hello, world! B
6/ 8) [c712f6n01] 65109 Hello, world! B
7/ 8) [c712f6n04] 56927 Hello, world! B
In v3.0.x we get a more consistent pattern, but not quite what we want:
shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --map-by node --hostfile ./hostfile-c ./hello_c B | sort
0/ 8) [c712f6n01] 64736 Hello, world! A
1/ 8) [c712f6n04] 56615 Hello, world! A
2/ 8) [c712f6n03] 62110 Hello, world! A
3/ 8) [c712f6n02] 74355 Hello, world! A
4/ 8) [c712f6n01] 64737 Hello, world! B
5/ 8) [c712f6n04] 56616 Hello, world! B
6/ 8) [c712f6n03] 62111 Hello, world! B
7/ 8) [c712f6n02] 74356 Hello, world! B
The expected result should be as follows, preserving the ordering from each app context's hostfile:
0/ 8) [c712f6n04] 64736 Hello, world! A
1/ 8) [c712f6n01] 56615 Hello, world! A
2/ 8) [c712f6n03] 62110 Hello, world! A
3/ 8) [c712f6n02] 74355 Hello, world! A
4/ 8) [c712f6n04] 64737 Hello, world! B
5/ 8) [c712f6n02] 56616 Hello, world! B
6/ 8) [c712f6n03] 62111 Hello, world! B
7/ 8) [c712f6n01] 74356 Hello, world! B
Discussion
The ordering in the v3.0.x series has to do with the orte_node_pool ordering. In this list the HNP node is always first, followed by hosts in the order they are discovered. In the MPMD case, for example, the first time we see node c712f6n04 is in the first hostfile (hostfile-b), so it is added to the orte_node_pool in the second position, just after the HNP, and so on through the rest of the first hostfile. When the second hostfile (hostfile-c) is processed, its hosts are already in the orte_node_pool, so we do not re-add them.
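As a rough illustration of that pool-building behavior (this is not ORTE's actual code; only the host names and the HNP-first rule are taken from the example above), consider a dedupe-append over the two hostfiles:

/* Illustrative sketch only (not ORTE source): appending only newly seen
 * hosts to a single global pool, with the HNP always first, discards the
 * per-hostfile ordering of the second app context. */
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

static const char *pool[MAX_NODES];
static int pool_len = 0;

static void add_if_new(const char *host)
{
    for (int i = 0; i < pool_len; i++)
        if (0 == strcmp(pool[i], host))
            return;                      /* already discovered: not re-added */
    pool[pool_len++] = host;
}

int main(void)
{
    const char *hostfile_b[] = { "c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02" };
    const char *hostfile_c[] = { "c712f6n04", "c712f6n02", "c712f6n03", "c712f6n01" };

    add_if_new("c712f6n01");                                /* HNP node is always first */
    for (int i = 0; i < 4; i++) add_if_new(hostfile_b[i]);  /* first app context */
    for (int i = 0; i < 4; i++) add_if_new(hostfile_c[i]);  /* all duplicates by now */

    for (int i = 0; i < pool_len; i++)
        printf("pool[%d] = %s\n", i, pool[i]);
    /* Result: c712f6n01, c712f6n04, c712f6n03, c712f6n02 -- neither
     * hostfile-b's nor hostfile-c's ordering survives. */
    return 0;
}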
The RMAPS components work with the true ordering from the hostfile when they make their mapping decisions (e.g., in orte_rmaps_rr_map), but that context is lost when ORTE moves into the orte_plm_base_launch_apps state. During the application launch, when we pack the job launch message in orte_util_encode_nodemap, we use the orte_node_pool ordering to structure the launch message, and that structure determines the rank ordering. The result is incorrect with respect to the per-app-context hostfile.
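To make the consequence concrete, here is a small sketch (again, not the actual encode/launch code) comparing rank assignment driven by the pool order against the hostfile order the user asked for in the SPMD case:

/* Illustrative sketch only (not orte_util_encode_nodemap itself): if the
 * launch message is structured by walking the node pool rather than the
 * hostfile order RMAPS used, rank assignment follows the pool. */
#include <stdio.h>

int main(void)
{
    /* Pool order from the SPMD case: HNP first, then hosts as discovered. */
    const char *pool[]     = { "c712f6n01", "c712f6n04", "c712f6n03", "c712f6n02" };
    /* Order the user asked for in hostfile-b. */
    const char *hostfile[] = { "c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02" };
    int slots = 2, rank;

    rank = 0;
    for (int n = 0; n < 4; n++)
        for (int s = 0; s < slots; s++)
            printf("pool order:     rank %d -> %s\n", rank++, pool[n]);

    rank = 0;
    for (int n = 0; n < 4; n++)
        for (int s = 0; s < slots; s++)
            printf("hostfile order: rank %d -> %s\n", rank++, hostfile[n]);

    /* The first loop reproduces the v3.0.x placement (ranks 0,1 on the HNP
     * node); the second matches the v2.x / expected placement. */
    return 0;
}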
I think this is a legitimate bug to fix, as it prevents users from controlling the rank ordering with respect to node order. However, after digging into this a bit, it looks like a pretty invasive change to make (and a delicate one at that so we don't break any other expected behavior in the process). I need to reflect on this a bit more, but wanted to post the issue for discussion.