Background information
What version of Open MPI are you using?
v5.0.0rc7
Describe how Open MPI was installed
tarball
Please describe the system on which you are running
- Operating system/version: Linux 4.19.0-18-cloud-amd64 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
- Network type: TCP/IP
Details of the problem
I am trying to make a distributed system built on Open MPI continue running after a node failure. To do this, I first need to detect and handle the failure.
I am using Open MPI v5.0.0rc7, launched with "--with-ft ulfm", and have set "MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN)". However, the node failure does not seem to be returned as an error that the code can handle.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int window_buffer = 0;
    if (my_rank == 1)
    {
        window_buffer = 12345;
    }

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_fence(0, window);

    int value_fetched;
    if (my_rank == 0)
    {
        // Network fails. Attempt to fetch the value from the MPI process 1 window
        system("sudo iptables -A OUTPUT -d 10.166.0.18 -j DROP");
        system("sudo iptables -A INPUT -s 10.166.0.18 -j DROP");
        int err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);

        // Handle error
        if (err)
        {
            printf("Received error from MPI_Get: %d\n", err);
        }

        // reset firewall
        system("sudo iptables --flush");
    }

    MPI_Win_fence(0, window);
    MPI_Win_free(&window);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
$ /home/ompi5rc7/bin/mpic++ example.cpp
$ /home/ompi5rc7/bin/mpirun --with-ft ulfm -n 2 --hostfile ../hosts ./a.out
--------------------------------------------------------------------------
WARNING: The selected 'osc' module 'rdma' is not tested for post-failure
operation, yet you have requested support for fault tolerance.
When using this component, normal failure free operation is expected;
However, failures may cause the application to abort, crash or deadlock.
In this framework, the following components are tested to operate under
failure scenarios: {}
--------------------------------------------------------------------------
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
< long wait here >
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
node-died
But I couldn't open the help file:
(null). Sorry!
--------------------------------------------------------------------------
I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out" but get the same output. I am not using RDMA.
Is it possible to print out the error code after a node failure?