Skip to content

osc/sm: Fix a bus error on MPI_WIN_{POST,START}. #1259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 15, 2016

Conversation

kawashima-fj
Copy link
Member

A bus error occurs in sm OSC under the following conditions.

  • sparc64 or any other architectures which need strict alignment.
  • MPI_WIN_POST or MPI_WIN_START is called for a window created by sm OSC.
  • The communicator size is odd and greater than 3.

The lines 283-285 in current ompi/mca/osc/sm/osc_sm_component.c has the following code.

module->global_state = (ompi_osc_sm_global_state_t *) (module->segment_base);
module->node_states = (ompi_osc_sm_node_state_t *) (module->global_state + 1);
module->posts[0] = (uint64_t *) (module->node_states + comm_size);

The size of ompi_osc_sm_node_state_t is multiples of 4 but not multiples of 8. So if comm_size is odd, module->posts[0] does not aligned to 8. This causes a bus error when accessing module->posts[i][j].

This patch fixes the alignment of module->posts[0] by inserting a padding between node_states and module->posts[0].

@hjelmn Please review.
I don't know whether this is the best fix method.
If you have a better method, tell me or create a new PR please.

v2.x and v1.10 branches have the same problem. I'll create PRs for them if this PR is accepted.

@hppritcha hppritcha added the bug label Dec 24, 2015
@hppritcha
Copy link
Member

@miked-mellanox could you check why the make dist failed for the mlnx jenkins?

@ggouaillardet
Copy link
Contributor

@kawashima-fj @hjelmn wouldn't it be easier to have the posts at the beginning ?
here is an alternate patch

diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index 2e2948e..91da943 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -280,9 +280,9 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
         module->posts = calloc (comm_size, sizeof (module->posts[0]));
         if (NULL == module->posts) return OMPI_ERR_TEMP_OUT_OF_RESOURCE;

-        module->global_state = (ompi_osc_sm_global_state_t *) (module->segment_base);
+        module->posts[0] = (uint64_t *) (module->segment_base);
+        module->global_state = (ompi_osc_sm_global_state_t *) (module->posts + comm_size * post_size);
         module->node_states = (ompi_osc_sm_node_state_t *) (module->global_state + 1);
-        module->posts[0] = (uint64_t *) (module->node_states + comm_size);

         for (i = 0, total = state_size + posts_size ; i < comm_size ; ++i) {
             if (i > 0) {

@kawashima-fj
Copy link
Member Author

Yes, it will be good.
The component owner (Nathan) will decide 😄

@kawashima-fj
Copy link
Member Author

In Gilles' patch, module->posts should be module->posts[0].
This is the updated patch and I confirmed it works fine.

diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index 2e2948e..91da943 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -280,9 +280,9 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
         module->posts = calloc (comm_size, sizeof (module->posts[0]));
         if (NULL == module->posts) return OMPI_ERR_TEMP_OUT_OF_RESOURCE;

-        module->global_state = (ompi_osc_sm_global_state_t *) (module->segment_base);
+        module->posts[0] = (uint64_t *) (module->segment_base);
+        module->global_state = (ompi_osc_sm_global_state_t *) (module->posts[0] + comm_size * post_size);
         module->node_states = (ompi_osc_sm_node_state_t *) (module->global_state + 1);
-        module->posts[0] = (uint64_t *) (module->node_states + comm_size);

         for (i = 0, total = state_size + posts_size ; i < comm_size ; ++i) {
             if (i > 0) {

@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2016

@hjelmn Can you have a look at this?

@hjelmn
Copy link
Member

hjelmn commented Jan 5, 2016

:bot:retest:

@hjelmn
Copy link
Member

hjelmn commented Jan 5, 2016

Hmm, either I can commit the corrected patch or you can update this PR with that patch.

A bus error occurs in sm OSC under the following conditions.

- sparc64 or any other architectures which need strict alignment.
- `MPI_WIN_POST` or `MPI_WIN_START` is called for a window created
  by sm OSC.
- The communicator size is odd and greater than 3.

The lines 283-285 in current `ompi/mca/osc/sm/osc_sm_component.c` has
the following code.

```c
module->global_state = (ompi_osc_sm_global_state_t *) (module->segment_base);
module->node_states = (ompi_osc_sm_node_state_t *) (module->global_state + 1);
module->posts[0] = (uint64_t *) (module->node_states + comm_size);
```

The size of `ompi_osc_sm_node_state_t` is multiples of 4 but not
multiples of 8. So if `comm_size` is odd, `module->posts[0]` does
not aligned to 8. This causes a bus error when accessing
`module->posts[i][j]`.

This patch fixes the alignment of `module->posts[0]` by setting
`module->posts[0]` first.
@kawashima-fj
Copy link
Member Author

@hjelmn Please review again. I pushed a new commit based on Gilles' patch.

@kawashima-fj
Copy link
Member Author

@hjelmn Can you have a look at this?

@ggouaillardet
Copy link
Contributor

@hjelmn what is the status of this PR ?
did you commit the fix already ?

@hjelmn
Copy link
Member

hjelmn commented Feb 5, 2016

Need to retest. If it comes back clean will merge and PR to 2.x.

:bot:retest:

hjelmn added a commit that referenced this pull request Mar 15, 2016
osc/sm: Fix a bus error on MPI_WIN_{POST,START}.
@hjelmn hjelmn merged commit deae9e5 into open-mpi:master Mar 15, 2016
@kawashima-fj kawashima-fj deleted the pr/osc-sm-align branch April 13, 2016 04:26
jsquyres added a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
…mmon-symbols-if-git

Makefile.am: only check for common symbols on dev builds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants