UCX OSC violates MPI standard with accumulate + fetch and op #4688

@hjelmn

Background information

The UCX OSC component includes an optimization for MPI_Fetch_and_op(). Unfortunately this optimization leads to incorrect results when mixing MPI_Fetch_and_op() with MPI_Accumulate().

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

master, v3.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from git checkout

Please describe the system on which you are running

  • Operating system/version: Linux nid00020 4.4.49-92.11.1_3.0-cray_ari_c #1 SMP Mon Dec 11 23:32:19 UTC 2017 (3.0.99) x86_64 x86_64 x86_64 GNU/Linux
  • Computer hardware: Cray XC-40
  • Network type: Aries

Details of the problem

See the following program. This program will be placed into MTT today:

https://gist.github.com/hjelmn/c8e54a8a6526b939703a6b894f186bab

The program is simple. Each rank performs an MPI_Accumulate() of 1024 int32_t elements on its left neighbor and an MPI_Fetch_and_op() on its right neighbor. This is a valid MPI program, yet it fails with osc/ucx. It passes with osc/rdma.

If this isn't fixed by v3.1.0 I recommend we software-disable the osc/ucx component until it is fixed since it is a correctness issue.
