
SPC: allow starting through MPI_T without explicitly setting MCA parameter #8065


Merged · 1 commit · Jan 4, 2021

Conversation

@devreal (Contributor) commented Sep 28, 2020

So far, reading SPC counters through MPI_T was conditional on the counters being attached via the MCA parameter mpi_spc_attach. This PR makes the following changes:

  1. Remove the MCA_BASE_PVAR_FLAG_CONTINUOUS flag so that counters can be started and stopped explicitly, and track the number of sessions that attach to each variable (see the sketch after this list).
  2. Inline calls that may be in the critical path (ompi_spc_record, ompi_spc_user_or_mpi, ompi_spc_update_watermark) to reduce overhead.
  3. Clean up some code (move includes, remove unused attributes, remove unused variable)
  4. Fix the SPC test to succeed even if an invalid pvar is encountered (e.g., a pvar that was de-registered by a component)
  5. Centralize counter property management and consolidate counter information into spc_event_t structure to ensure that only a single cache-line is touched during an update.
  6. Some additional minor changes for things I noticed (correct handling of timer and counter types, moved timer start/stop into header and fixed timer/watermark update functions)
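
To illustrate the first item, here is a minimal sketch of starting and reading an SPC counter purely through the MPI_T pvar interface, without setting mpi_spc_attach. The "spc"/"SPC" name substring used to locate a counter, the no-object binding check, and the integer read buffer are assumptions; real tools should inspect the names, datatypes, and bindings reported by MPI_T_pvar_get_info.

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, num_pvars;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num_pvars);

    MPI_T_pvar_session session;
    MPI_T_pvar_session_create(&session);

    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic, count;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        /* Skip invalid pvars (e.g. de-registered by a component), as the fixed test does. */
        if (MPI_SUCCESS != MPI_T_pvar_get_info(i, name, &name_len, &verbosity,
                                               &var_class, &dtype, &enumtype,
                                               desc, &desc_len, &bind,
                                               &readonly, &continuous, &atomic)) {
            continue;
        }
        /* Assumption: SPC pvar names contain "spc"/"SPC"; adjust to the real naming. */
        if (NULL == strstr(name, "spc") && NULL == strstr(name, "SPC")) {
            continue;
        }
        if (MPI_T_BIND_NO_OBJECT != bind) {      /* keep the sketch simple */
            continue;
        }

        MPI_T_pvar_handle handle;
        if (MPI_SUCCESS != MPI_T_pvar_handle_alloc(session, i, NULL, &handle, &count)) {
            continue;
        }
        if (!continuous) {
            /* Explicit start, possible once the pvar no longer carries the CONTINUOUS flag. */
            MPI_T_pvar_start(session, handle);
        }

        /* ... run the code to be profiled here ... */

        unsigned long long values[16] = {0};     /* assumes count <= 16 and an integer datatype */
        if (count <= 16) {
            MPI_T_pvar_read(session, handle, values);
            printf("%s = %llu\n", name, values[0]);
        }
        MPI_T_pvar_handle_free(session, &handle);
    }

    MPI_T_pvar_session_free(&session);
    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```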

@devreal (Contributor, Author) commented Sep 28, 2020

It looks like one of the test bots timed out.

@lanl-ompi (Contributor)

Can one of the admins verify this patch?

@jsquyres (Member)

@devreal Looks like there's now a conflict on this PR. Can you resolve it?

@devreal (Contributor, Author) commented Oct 26, 2020

@jsquyres The conflict should be resolved.

@bosilca Can you please review?

@bosilca (Member) left a comment

We designed this component so that the internals of the SPC management remain hidden rather than being exposed through static inline functions in the header file. The rest of the PR is OK, but moving all the update functions into the header is not.

@devreal (Contributor, Author) commented Oct 26, 2020

The intention was that inlining these calls would reduce the overhead when SPC is enabled at compile time but disabled at runtime by cutting out the unconditional call. Reverting that is fine with me.
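
As a rough illustration of that intent, here is a minimal sketch of the inlining idea under discussion; the names ompi_spc_active and ompi_spc_record_impl are placeholders, not the actual Open MPI symbols.

```c
#include <stdbool.h>

/* Sketch: the static inline wrapper checks a flag and only then pays for a
 * call into the non-inlined implementation, so the "SPC compiled in but not
 * attached" path costs a load and a branch instead of an unconditional call.
 * All names here are placeholders. */
extern bool ompi_spc_active;                       /* set when any counter is attached */
void ompi_spc_record_impl(unsigned int event_id, long long value);

static inline void ompi_spc_record(unsigned int event_id, long long value)
{
    if (ompi_spc_active) {                         /* cheap check in the critical path */
        ompi_spc_record_impl(event_id, value);     /* heavy work stays out of line */
    }
}
```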

@bosilca (Member) commented Oct 26, 2020

Last time we measured it, the cost of a function call was significantly lower than the cost of an atomic operation. Besides, SPC updates are supposed to be rare events, a few times per request at most.

@devreal (Contributor, Author) commented Nov 9, 2020

I added more commits to make sure that a counter update only touches a single cache line, and fixed a few more issues I came across while working on that (correct types used for counter and timer updates, a thread-safe watermark update, and a removed race in timers). It has gotten quite big, and I'm happy to split the latest changes into another PR if that is preferred.
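
As a purely hypothetical sketch of the single-cache-line idea (this is not the actual ompi_spc_t layout, just an illustration of packing the per-counter state into one aligned record):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical illustration: value, attachment count, per-counter flags and
 * timer state all live in one 64-byte-aligned record, so a counter update
 * touches a single cache line. A thread-safe watermark update would then be
 * a compare-exchange loop on `value`. NOT the actual ompi_spc_t definition. */
typedef struct {
    _Alignas(64) _Atomic int64_t value;  /* counter / watermark / accumulated cycles */
    int32_t  num_attached;               /* sessions currently attached via MPI_T    */
    uint32_t flags;                      /* e.g. "is timer" / "is watermark" bits    */
    uint64_t start_cycles;               /* last timer start, if a timer event       */
} spc_counter_sketch_t;
```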

@bosilca (Member) left a comment

Most of these changes look good and have the potential to ease the use of SPC while also decreasing the overhead. But before we merge this into master I would like to evaluate the new overheads and compare them with the original version.

@devreal (Contributor, Author) commented Nov 12, 2020

After digging through the generated assembly, I made some more changes to the code. I will squash everything down before merging.

Here are some measurements (10M rank-local 4-byte recv+isend+wait operations in a tight loop, timed with MPI_Wtime on my local box, with pml/ob1):

  1. With all counters enabled (--mca mpi_spc_attach all):
    a) Master: 3.763253s
    b) This branch: 3.498579s (~7% speedup)
  2. With no counters attached:
    a) Master: 2.410163s
    b) This branch: 2.260756s (~6% speedup)
  3. SPC disabled during configure: 2.198656s

There is still some small overhead compared to SPC being completely disabled at configure time. This is the worst case, though: there is no transfer of data between ranks and we're looking at about 220ns per iteration, so even small costs are visible. It turns out that reading the cycle counter is relatively costly in this case (~6% of the time spent with all counters enabled, mostly due to the lfence before rdtsc). The rest is additional loads, conditional jumps, and stores. Overall I think the overhead is reasonable, and the changes in this branch yield a slight reduction over what is currently in master.
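
For reference, a rough reconstruction of the micro-benchmark described above; the tag, the use of two separate 4-byte (MPI_INT) buffers, and the Irecv/Isend/Waitall structure are assumptions filled in from the description.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iterations = 10 * 1000 * 1000;   /* 10M rank-local messages */
    int rank, sendbuf = 0, recvbuf = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < iterations; i++) {
        /* Each rank sends 4 bytes to itself: post the receive, then the send, then wait. */
        MPI_Irecv(&recvbuf, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    double elapsed = MPI_Wtime() - start;

    printf("rank %d: %d iterations in %fs (%.1f ns/iter)\n",
           rank, iterations, elapsed, 1e9 * elapsed / iterations);
    MPI_Finalize();
    return 0;
}
```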

…erhead

- only make MCA parameters available if SPC is enabled

- do not compile SPC code if SPC is disabled

- move includes into ompi_spc.c

- allow counters to be enabled through MPI_T without setting MCA parameter

- inline counter update calls that are likely in the critical path

- fix test to succeed even if encountering invalid pvars

- move timer_[start|stop] to header and move attachment info into ompi_spc_t

There is no need to also store the name in the ompi_spc_t struct; we can use that space
for the attachment info instead to avoid accessing another cache line.

- make timer/watermark flags a property of the spc description

This is meant to make adding counters easier in the future by
centralizing the necessary information. Storing a copy of these flags
in the ompi_spc_t structure (without adding to its size) reduces
cache pollution for timer/watermark events.

- allocate ompi_spc_t objects with cache-alignment

This prevents objects from spanning multiple cache lines and thus
ensures that only one cache line is loaded per update.

- fix handling of timer and timer conversion

- only call opal_timer_base_get_cycles if necessary to reduce overhead

- Remove use of OPAL_UNLIKELY to improve code generated by GCC

It appears that GCC makes less effort in optimizing the unlikely path
and generates bloated code.

- Allocate ompi_spc_events statically to reduce loads in critical path

- duplicate comm_world only when dumping is requested

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal (Contributor, Author) commented Nov 12, 2020

I have squashed everything down into one commit. Should be ready to go once the tests have passed.

@bosilca bosilca merged commit a7e91d8 into open-mpi:master Jan 4, 2021
@devreal devreal deleted the fix_spc_mpit branch October 3, 2022 15:52