
hostfile ordering not honored when HNP is used in allocation #4327

@jjhursey

Description

We first noticed this in the v3.0.x release stream as a difference in behavior from the v2.x release stream. I believe it also impacts v3.1.x and master.

This is fallout from pushing the mapping/ordering mechanism to the backend nodes, and likely from some of the improvements for the DVM and comm_spawn. Any fix would need to be careful not to break or hinder those features.

SPMD case

We are launching mpirun from node c712f6n01, and using the following hostfile to land the first few ranks on a remote node (c712f6n04) before using the local node.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2

In v2.x you can see the order is preserved, with ranks 0,1 on c712f6n04 followed by ranks 2,3 on c712f6n01 (where mpirun/HNP resides):

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

In v3.0.x you can see that the HNP node is always first in the list, followed by the ordered list from the hostfile less the HNP node. This puts ranks 0,1 on c712f6n01 instead of on c712f6n04 as the user intended.

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n01] 64629 Hello, world!
  1/  8) [c712f6n01] 64630 Hello, world!
  2/  8) [c712f6n04] 56447 Hello, world!
  3/  8) [c712f6n04] 56448 Hello, world!
  4/  8) [c712f6n03] 61943 Hello, world!
  5/  8) [c712f6n03] 61944 Hello, world!
  6/  8) [c712f6n02] 74189 Hello, world!
  7/  8) [c712f6n02] 74190 Hello, world!

The expected result should match the v2.x behavior with this hostfile, preserving the ordering:

shell$ mpirun --hostfile ./hostfile-b ./hello_c | sort
  0/  8) [c712f6n04] 56846 Hello, world!
  1/  8) [c712f6n04] 56847 Hello, world!
  2/  8) [c712f6n01] 65059 Hello, world!
  3/  8) [c712f6n01] 65060 Hello, world!
  4/  8) [c712f6n03] 62352 Hello, world!
  5/  8) [c712f6n03] 62353 Hello, world!
  6/  8) [c712f6n02] 74593 Hello, world!
  7/  8) [c712f6n02] 74594 Hello, world!

MPMD case

Again, we are launching mpirun from node c712f6n01.

Consider these two hostfiles, which contain different orderings of the same four machines.

shell$ cat hostfile-b
c712f6n04 slots=2
c712f6n01 slots=2
c712f6n03 slots=2
c712f6n02 slots=2
shell$ cat hostfile-c
c712f6n04 slots=2
c712f6n02 slots=2
c712f6n03 slots=2
c712f6n01 slots=2

The hello world program prints its argument in its output to make it clear which app context each rank originated from (the A and B values in the output below).
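For reference, the test source is not included in this issue; the following is a minimal sketch of what hello_c is assumed to look like (names and output format are approximations):

/*
 * A minimal sketch of what hello_c is assumed to look like; the actual test
 * source is not included in this issue. It prints rank/size, the hostname,
 * the pid, and any extra command-line argument (the "A"/"B" app context tag).
 */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &len);

    printf("%3d/%3d) [%s] %d Hello, world!%s%s\n",
           rank, size, hostname, (int)getpid(),
           argc > 1 ? " " : "", argc > 1 ? argv[1] : "");

    MPI_Finalize();
    return 0;
}

Built with, e.g., mpicc -o hello_c hello_c.c.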

In v2.x we get some odd behavior in the mapping of the second app context (likely because the bookmark is not reset quite right; notice the iteration step between the app context assignments of ranks 3 and 4):

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n04] 56926 Hello, world! A
  1/  8) [c712f6n01] 65108 Hello, world! A
  2/  8) [c712f6n03] 62435 Hello, world! A
  3/  8) [c712f6n02] 74671 Hello, world! A
  4/  8) [c712f6n02] 74672 Hello, world! B
  5/  8) [c712f6n03] 62436 Hello, world! B
  6/  8) [c712f6n01] 65109 Hello, world! B
  7/  8) [c712f6n04] 56927 Hello, world! B

In v3.0.x we get a more consistent pattern, but not quite what we want:

shell$ mpirun --np 4 --map-by node --hostfile ./hostfile-b ./hello_c A : --np 4 --map-by node --hostfile ./hostfile-c ./hello_c B | sort
  0/  8) [c712f6n01] 64736 Hello, world! A
  1/  8) [c712f6n04] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n01] 64737 Hello, world! B
  5/  8) [c712f6n04] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n02] 74356 Hello, world! B

The expected result should be as follows, preserving the ordering of each app context's hostfile:

  0/  8) [c712f6n04] 64736 Hello, world! A
  1/  8) [c712f6n01] 56615 Hello, world! A
  2/  8) [c712f6n03] 62110 Hello, world! A
  3/  8) [c712f6n02] 74355 Hello, world! A
  4/  8) [c712f6n04] 64737 Hello, world! B
  5/  8) [c712f6n02] 56616 Hello, world! B
  6/  8) [c712f6n03] 62111 Hello, world! B
  7/  8) [c712f6n01] 74356 Hello, world! B

Discussion

The ordering in the v3.0.x series comes from the orte_node_pool ordering. In this list the HNP node is always first, followed by hosts in the order they are discovered. In the MPMD case, for example, the first time we see node c712f6n04 is in the first hostfile (hostfile-b), so it is added to the orte_node_pool in the second position, just after the HNP, and so on through the first hostfile. When the second hostfile (hostfile-c) is encountered, the hosts are already in the orte_node_pool, so we don't re-add them.
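A simplified standalone illustration of that discovery order (this is not the actual ORTE code or data structure, just a model of the behavior described above) might look like:

/*
 * Simplified model (not the actual ORTE implementation) of how the
 * orte_node_pool ends up ordered: the HNP node is seeded first, and each
 * host from each hostfile is appended only the first time it is seen.
 */
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

static const char *pool[MAX_NODES];
static int pool_len = 0;

static void add_unique(const char *host)
{
    for (int i = 0; i < pool_len; i++) {
        if (0 == strcmp(pool[i], host)) {
            return;              /* already known: keeps its original position */
        }
    }
    pool[pool_len++] = host;     /* first sighting: appended at the end */
}

int main(void)
{
    const char *hostfile_b[] = { "c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02" };
    const char *hostfile_c[] = { "c712f6n04", "c712f6n02", "c712f6n03", "c712f6n01" };

    add_unique("c712f6n01");                         /* HNP node always goes first */
    for (int i = 0; i < 4; i++) add_unique(hostfile_b[i]);
    for (int i = 0; i < 4; i++) add_unique(hostfile_c[i]);

    /* Prints c712f6n01, c712f6n04, c712f6n03, c712f6n02: the HNP first, then
     * hosts in first-seen order, matching the v3.0.x placement shown above. */
    for (int i = 0; i < pool_len; i++) {
        printf("pool[%d] = %s\n", i, pool[i]);
    }
    return 0;
}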

The RMAPS mechanism works with the true ordering from the hostfile when it makes its mapping decisions (e.g., in orte_rmaps_rr_map), but that context is lost when ORTE moves into the orte_plm_base_launch_apps state. During application launch, when we pack the job launch message in orte_util_encode_nodemap, we use the orte_node_pool ordering to structure the launch message, and that ordering determines the rank ordering. It is incorrect with respect to the per-app-context hostfile.
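To make the resulting mismatch concrete for the SPMD case, here is a small standalone comparison (again a model of the observed behavior, not the real ORTE code paths) of the order the user asked for versus the order the ranks actually land in:

/*
 * Model of the SPMD case above (2 slots per node): the mapper honors the
 * hostfile order, but the launch message is structured by the pool order,
 * so rank numbering follows the pool order instead.
 */
#include <stdio.h>

int main(void)
{
    const char *hostfile_order[] = { "c712f6n04", "c712f6n01", "c712f6n03", "c712f6n02" };
    const char *pool_order[]     = { "c712f6n01", "c712f6n04", "c712f6n03", "c712f6n02" };
    const int slots = 2;

    for (int n = 0, rank = 0; n < 4; n++) {
        for (int s = 0; s < slots; s++, rank++) {
            printf("rank %d: hostfile order says %s, v3.0.x places it on %s\n",
                   rank, hostfile_order[n], pool_order[n]);
        }
    }
    return 0;
}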

I think this is a legitimate bug to fix, as it prevents users from controlling the rank ordering with respect to node order. However, after digging into this a bit, it looks like a pretty invasive change to make (and a delicate one at that, so we don't break any other expected behavior in the process). I need to reflect on this a bit more, but I wanted to post the issue for discussion.
