SPC: allow starting through MPI_T without explicitly setting MCA parameter #8065
Conversation
It looks like one of the test bots timed out.
Can one of the admins verify this patch?
@devreal Looks like there's now a conflict on this PR. Can you resolve it?
We designed this component so that the internals of the SPC management remain hidden instead of being exposed through static inline functions in the header file. The rest of the PR is OK, but not the part that moves all the update functions into the header.
The intention was that inlining these calls would reduce the overhead when SPC is enabled at compile time but disabled at runtime, by cutting out the unconditional call. Reverting that is fine with me.
Last time we measured the cost of a function call it was significantly lower than the cost of an atomic operation. Plus, SPC updates are supposed to be rare events, a few times per request at most.
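To make the inlining idea discussed above concrete, here is a minimal sketch of the pattern under discussion. The names `ompi_spc_enabled` and `ompi_spc_update_impl` are placeholders rather than the actual symbols, and `SPC_ENABLE` stands in for whatever configure-time switch the component uses: the header-level wrapper checks a runtime flag, so when SPC is compiled in but inactive the cost is a load and a branch instead of an unconditional call.

```c
/* Hedged sketch of the inlining pattern; names are placeholders. */
#include <stdbool.h>
#include <stdint.h>

extern bool ompi_spc_enabled;                          /* set via MCA param or MPI_T */
void ompi_spc_update_impl(int event, int64_t value);   /* real work stays in ompi_spc.c */

static inline void ompi_spc_update(int event, int64_t value)
{
#if SPC_ENABLE == 1
    /* Compiled in: a disabled counter costs only a load and a predicted branch,
     * no unconditional call into the .c file. */
    if (ompi_spc_enabled) {
        ompi_spc_update_impl(event, value);
    }
#else
    (void)event; (void)value;                          /* compiled out entirely */
#endif
}
```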
I added more commits to make sure that a counter update only touches a single cache line, and fixed a few more issues I came across while working on that (using the correct types for counter and timer updates, making the watermark update thread-safe, and removing a race in the timers). It's gotten quite big, and I'm happy to rip out the latest changes into another PR if that's wanted.
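As an illustration of the thread-safe watermark update mentioned above, a compare-and-swap loop is one way to make sure concurrent updates never lose a larger sample. This sketch uses C11 atomics and a made-up `spc_counter_t` type, not the OPAL atomics the actual code relies on.

```c
/* Hedged sketch of a thread-safe high-watermark update using C11 atomics. */
#include <stdatomic.h>
#include <stdint.h>

/* Pad/align so that updating one counter touches exactly one cache line. */
typedef struct {
    _Atomic int64_t value;
} __attribute__((aligned(64))) spc_counter_t;

static inline void spc_update_watermark(spc_counter_t *c, int64_t sample)
{
    int64_t cur = atomic_load_explicit(&c->value, memory_order_relaxed);
    /* Only raise the watermark; on a lost race 'cur' is refreshed and we retry. */
    while (sample > cur &&
           !atomic_compare_exchange_weak_explicit(&c->value, &cur, sample,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed value */
    }
}
```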
Most of these changes look good and have the potential to ease the use of SPC while also decreasing the overheads. But before we merge this into master I would like to evaluate the new overheads and compare them with the original version.
Force-pushed from f44164b to b3a5cf0.
After digging through the generated assembly code I made some more changes to the code. I will squash everything down before merging. Here are some measurements (10M rank-local 4B recv+isend+wait in a tight loop, time taken through
There is still some small overhead compared to when SPC is completely disabled at configure time. This is for the worst case, though, where there is no transfer of data between ranks, and we're looking at about 220 ns. This is mostly due to the additional loads and conditional jumps. It turns out that reading the cycle counter is relatively costly in this case (~6% of the time spent with all counters enabled, mostly due to the
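To illustrate the point that reading the cycle counter is relatively costly, the sketch below guards the timestamp read behind a per-event active flag, so a disabled timer pays only for a load and a branch. All names here (`spc_timer_t`, `read_cycle_counter`) are placeholders standing in for the real opal_timer_base_get_cycles() path.

```c
/* Placeholder sketch: only read the cycle counter when the timer is active. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     active;   /* attached via the MCA param or started through MPI_T */
    uint64_t value;    /* accumulated cycles */
    uint64_t start;    /* cycle count at timer start */
} spc_timer_t;

extern uint64_t read_cycle_counter(void);   /* stand-in for opal_timer_base_get_cycles() */

static inline void spc_timer_start(spc_timer_t *t)
{
    if (t->active) {                        /* skip the timestamp read when disabled */
        t->start = read_cycle_counter();
    }
}

static inline void spc_timer_stop(spc_timer_t *t)
{
    if (t->active) {
        t->value += read_cycle_counter() - t->start;
    }
}
```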
…erhead
- only make MCA parameters available if SPC is enabled
- do not compile SPC code if SPC is disabled
- move includes into ompi_spc.c
- allow counters to be enabled through MPI_T without setting MCA parameter
- inline counter update calls that are likely in the critical path
- fix test to succeed even if encountering invalid pvars
- move timer_[start|stop] to header and move attachment info into ompi_spc_t
  There is no need to store the name in the ompi_spc_t struct too; we can use that space for the attachment info instead to avoid accessing another cache line.
- make timer/watermark flags a property of the spc description
  This is meant to make adding counters easier in the future by centralizing the necessary information. Storing a copy of these flags in the ompi_spc_t structure (without adding to its size) reduces cache pollution for timer/watermark events.
- allocate ompi_spc_t objects with cache-alignment
  This prevents objects from spanning multiple cache lines and thus ensures that only one cache line is loaded per update.
- fix handling of timer and timer conversion
- only call opal_timer_base_get_cycles if necessary to reduce overhead
- Remove use of OPAL_UNLIKELY to improve code generated by GCC
  It appears that GCC makes less effort in optimizing the unlikely path and generates bloated code.
- Allocate ompi_spc_events statically to reduce loads in critical path
- duplicate comm_world only when dumping is requested

Signed-off-by: Joseph Schuchart <[email protected]>
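For readers unfamiliar with the cache-alignment items in the commit message above, here is a hedged sketch of the idea: pad each counter to a full cache line and align the statically allocated array, so that one update loads exactly one line. The field layout, sizes, and counter count below are assumptions, not the real `spc_event_t`/`ompi_spc_t` definitions.

```c
/* Hedged sketch of cache-aligned, statically allocated counter storage. */
#include <stdint.h>

#define SPC_CACHE_LINE   64   /* assumed cache-line size; OPAL knows the real one */
#define SPC_NUM_COUNTERS 128  /* placeholder number of counters */

/* Pad each counter to a full cache line so an update touches exactly one line. */
typedef struct {
    int64_t  value;
    int64_t  num_attached;   /* MPI_T sessions attached to this counter */
    uint32_t flags;          /* timer/watermark properties copied from the description */
    char     pad[SPC_CACHE_LINE - 2 * sizeof(int64_t) - sizeof(uint32_t)];
} spc_event_t;

/* Static, cache-aligned storage: no extra pointer load in the critical path. */
static spc_event_t ompi_spc_events[SPC_NUM_COUNTERS]
    __attribute__((aligned(SPC_CACHE_LINE)));
```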
Force-pushed from b3a5cf0 to d11f625.
I have squashed everything down into one commit. Should be ready to go once the tests have passed.
So far, reading SPC counters through MPI_T was conditioned on the counters being attached via the MCA parameter mpi_spc_attach. This PR makes the following changes:
- Drop the MCA_BASE_PVAR_FLAG_CONTINUOUS flag to enable explicit starting and stopping of counters, and track the number of sessions that attach to each variable.
- Inline counter update calls that are likely in the critical path (ompi_spc_record, ompi_spc_user_or_mpi, ompi_spc_update_watermark) to reduce overhead.
- Cache-align the spc_event_t structure to ensure that only a single cache line is touched during an update.
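Since the point of the PR is that counters can now be controlled through MPI_T alone, a hedged usage sketch may help. It uses only the standard MPI_T pvar API; picking the first pvar whose name contains "spc" is a simplification, and real code should match the exact counter name, datatype, and binding reported by MPI_T_pvar_get_info.

```c
/* Sketch: start and read an SPC counter through an MPI_T session, without
 * setting the mpi_spc_attach MCA parameter. Error checking omitted. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, num_pvars, idx = -1;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    /* Find some SPC performance variable by name (simplified matching). */
    MPI_T_pvar_get_num(&num_pvars);
    for (int i = 0; i < num_pvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum etype;
        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class, &dtype,
                            &etype, desc, &desc_len, &bind, &readonly,
                            &continuous, &atomic);
        if (NULL != strstr(name, "spc")) { idx = i; break; }
    }

    if (idx >= 0) {
        MPI_T_pvar_session session;
        MPI_T_pvar_handle handle;
        int count;
        long long value = 0;   /* assumes one long long element; check dtype/count */

        MPI_T_pvar_session_create(&session);
        /* NULL object handle assumes MPI_T_BIND_NO_OBJECT; check 'bind' otherwise. */
        MPI_T_pvar_handle_alloc(session, idx, NULL, &handle, &count);
        MPI_T_pvar_start(session, handle);   /* explicit start, the new capability */

        /* ... application communication to be counted goes here ... */

        MPI_T_pvar_read(session, handle, &value);
        printf("SPC pvar %d = %lld\n", idx, value);

        MPI_T_pvar_stop(session, handle);
        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```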