
OpenMPI hangs on MPI_Test with InfiniBand and high message rate #4863

@Noxoomo

Description


Background information

I have a GPU-based application that uses MPI to transfer messages and data between several nodes.
It works fine on 1 GB/s networks, but deadlocks when I switch to InfiniBand.

System description and OpenMPI details

I've reproduced the issue on two clusters with InfiniBand networks. I used several OpenMPI versions built from source for all runs (see details below).

The first cluster consists of dual-socket Intel servers with Ubuntu 12.04, NVIDIA GPUs, and Mellanox Technologies MT27500 Family [ConnectX-3] adapters for InfiniBand. On this cluster I tried OpenMPI 2.1.2 and OpenMPI 3.0.

The second cluster has dual-socket Intel servers with Ubuntu 16.04, NVIDIA GPUs, and Mellanox Technologies MT27700 Family [ConnectX-4] adapters for InfiniBand. This cluster is IPv6-only, so on it OpenMPI was built from master with several patches to fix IPv6 compatibility.

All MPI builds have CUDA support.
I can provide more information on request.


Details of the problem

My application sends a lot of small messages. All sends and receives are asynchronous (MPI_Isend, MPI_Irecv), and I call MPI_Test in a loop to check each operation for completion. On InfiniBand networks, MPI_Test never reports completion for some MPI_Isend requests whose messages were already received by the other host (my application usually uses unique tags, so I logged every send and receive request and every successful MPI_Test call; the logs show that some messages posted with MPI_Isend were received, but the sender was never notified). As a result, my application waits forever.

I checked the same code with MVAPICH2 and everything works fine.

My application is complex, so I reproduced the same issue with the far simpler code attached below:


#include <iostream>
#include <memory>
#include <vector>

#include <mpi.h>

#define MPI_SAFE_CALL(cmd)                                                    \
   {                                                                          \
        int mpiErrNo = (cmd);                                                 \
        if (MPI_SUCCESS != mpiErrNo) {                                        \
            char msg[MPI_MAX_ERROR_STRING];                                   \
            int len;                                                          \
            MPI_Error_string(mpiErrNo, msg, &len);                            \
            std::cout << "MPI failed with error code " << mpiErrNo            \
                      << ": " << msg << std::endl;                            \
            MPI_Abort(MPI_COMM_WORLD, mpiErrNo);                              \
        }                                                                     \
    }


class TMpiRequest {
public:
    bool IsComplete() const {
        if (!Flag) {
            MPI_SAFE_CALL(MPI_Test(Request.get(), &Flag, &Status));
        }
        return static_cast<bool>(Flag);
    }


    TMpiRequest(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
    }

    TMpiRequest& operator=(TMpiRequest&& other) {
        if (this != &other) {
            this->Flag = other.Flag;
            this->Request.swap(other.Request);
            this->Status = other.Status;
            other.Clear();
        }
        return *this;
    }

    TMpiRequest() = default;

    ~TMpiRequest() = default;

    TMpiRequest(std::unique_ptr<MPI_Request>&& request)
            : Flag(0)
              , Request(std::move(request)) {
        IsComplete();
    }

    void Clear() {
        Flag = -1; // -1 marks an empty (moved-from) request; IsComplete() treats it as done
        Request = nullptr;
    }

private:
    mutable int Flag = -1;
    std::unique_ptr<MPI_Request> Request;
    mutable MPI_Status Status;
};


TMpiRequest ReadAsync(char* data, int dataSize, int sourceRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Irecv(data, dataSize, MPI_CHAR, sourceRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}


TMpiRequest WriteAsync(const char* data, int dataSize, int destRank, int tag) {
    std::unique_ptr<MPI_Request> request(new MPI_Request);
    MPI_SAFE_CALL(MPI_Issend(data, dataSize, MPI_CHAR, destRank, tag, MPI_COMM_WORLD, request.get()));
    return {std::move(request)};
}

int main(int argc, char** argv) {
    int providedLevel;
    int hostCount;
    int hostId;
    int threadLevel = MPI_THREAD_MULTIPLE;

    MPI_SAFE_CALL(MPI_Init_thread(&argc, &argv, threadLevel, &providedLevel));
    MPI_SAFE_CALL(MPI_Comm_size(MPI_COMM_WORLD, &hostCount));
    MPI_SAFE_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &hostId));

    int tag = 0;
    const int otherRank = hostId == 0 ? 1 : 0;

    for (int i = 0; i < 100000; ++i) {
        std::vector<TMpiRequest> requests;
        const int batchSize = 256; // originally: 16 + random.NextUniformL() % 256
        std::vector<std::vector<char> > data;

        for (int j = 0; j < batchSize; ++j) {
            ++tag;
            const bool isWrite = (2 * j < batchSize) == hostId;
            const int size = 127; // originally: random.NextUniformL() % 4096
            data.push_back(std::vector<char>());
            data.back().resize(size);
            if (isWrite) {
                requests.push_back(WriteAsync(data.back().data(), size, otherRank, tag));
            } else {
                requests.push_back(ReadAsync(data.back().data(), size, otherRank, tag));
            }
        }
        std::cout << "Send batch # " << i << " of size " << batchSize << std::endl;
        while (requests.size()) {
            std::vector<TMpiRequest> pending;
            for (auto& request : requests) {
                if (!request.IsComplete()) {
                    pending.push_back(std::move(request));
                }
            }
            requests.swap(pending);
        }
        std::cout << "Wait complete batch done " << batchSize << std::endl;
    }

    MPI_SAFE_CALL(MPI_Finalize());
}

Backtrace:

Thread 3 (Thread 0x7f285d984700 (LWP 424323)):
#0 0x00007f28612bfa13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f2860961438 in epoll_dispatch (base=0x107402400, tv=) at epoll.c:407
#2 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107402400, flags=1) at event.c:1630
#3 0x00007f285ee7089e in progress_engine () from /usr/local/openmpi-git/lib/openmpi/mca_pmix_pmix3x.so
#4 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 2 (Thread 0x7f285f8c4700 (LWP 424322)):
#0 0x00007f28612b374d in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f286096e0b8 in poll (__timeout=, __nfds=4, __fds=0x107102d00) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 poll_dispatch (base=0x107400000, tv=) at poll.c:165
#3 0x00007f28609647ff in opal_libevent2022_event_base_loop (base=0x107400000, flags=1) at event.c:1630
#4 0x00007f286091d8fe in progress_engine () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#5 0x00007f28615896ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007f28612bf41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 1 (Thread 0x7f28620b7740 (LWP 424321)):
#0 0x00007f286158d4ee in pthread_mutex_unlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f28589d1c42 in btl_openib_component_progress () from /usr/local/openmpi-git/lib/openmpi/mca_btl_openib.so
#2 0x00007f2860917d1c in opal_progress () from /usr/local/openmpi-git/lib/libopen-pal.so.0
#3 0x00007f28617e5c83 in ompi_request_default_test () from /usr/local/mpi/lib/libmpi.so.0
#4 0x00007f2861826fb9 in PMPI_Test () from /usr/local/mpi/lib/libmpi.so.0
#5 0x0000000000203068 in TMpiRequest::IsComplete (this=0x107021880) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:31
#6 0x0000000000202038 in main (argc=1, argv=0x7ffdb5c75f28) at /place/noxoomo/mini-arcadia/junk/noxoomo/failed_mpi/main.cpp:133
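The backtrace shows the main thread spinning in btl_openib_component_progress. One cross-check (my suggestion, using standard OpenMPI MCA switches; `./repro` is a placeholder for the compiled reproducer) is to exclude the openib BTL and rerun:

```shell
# Exclude the openib BTL so traffic falls back to TCP; if the hang
# disappears, the problem is isolated to the openib component seen
# in the backtrace.
mpirun -np 2 --mca btl ^openib ./repro
```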
