Skip to content

Make clusterwide config upload a separate stage #1266

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 4, 2021
Merged

Conversation

Steap2448
Copy link
Contributor

@Steap2448 Steap2448 commented Feb 15, 2021

For historical reasons, there were two issues in the patch_clusterwide logics:

  1. The 5-second timeout for all communications was hardcoded.
  2. No optimizations were undertaken there, it was was suboptimal and wasteful.

It was OK when the config was small and there were only a few instances in the cluster, but it's a problem for big installations nowadays.

This patch introduces a new patch_clusterwide stage - upload (consolidated in a separate module). It spreads the data across instances in a network-efficient manner.

Also in this patch:

  • Change issues_test.lua and remove unnecessary checks from cypress sigstop.spec.js because the SIGStOP doesn't result in config lock and mismatch anymore.

I didn't forget about

  • Tests
  • Changelog
  • Documentation

Close #1145

@Steap2448 Steap2448 changed the title Clusterwide-upload module for config distribution Clusterwide-upload module for a config distribution Feb 15, 2021
@Steap2448 Steap2448 force-pushed the 2pc-prepare-stage branch 5 times, most recently from d1edaea to 037e24b Compare February 19, 2021 19:39
@rosik rosik force-pushed the 2pc-prepare-stage branch 2 times, most recently from bef2018 to 94f3001 Compare February 24, 2021 14:56
@rosik rosik changed the title Clusterwide-upload module for a config distribution Make clusterwide config upload a separate stage Feb 24, 2021
@rosik rosik force-pushed the 2pc-prepare-stage branch 2 times, most recently from 43d6481 to 9321d8e Compare February 24, 2021 16:50
@rosik rosik marked this pull request as ready for review February 24, 2021 16:58
@rosik rosik force-pushed the 2pc-prepare-stage branch 11 times, most recently from 16bdca7 to d03611e Compare February 25, 2021 22:25
@rosik rosik requested a review from olegrok February 26, 2021 13:36
@rosik rosik force-pushed the 2pc-prepare-stage branch from d03611e to 63fb0fc Compare March 1, 2021 11:01
rosik added a commit that referenced this pull request Mar 3, 2021
Recently we've encountered aweird test behavior: a test (unit.upload
from #1266) succeded when run alone, but failed in conjunction with
unit.cfg_errors test. The reason is in monkeypatching and reload of lua
modules used there.

With this patch all test functions are moved to a simple empty server
and run over net.box (remote-control)
@rosik rosik mentioned this pull request Mar 3, 2021
3 tasks
rosik added a commit that referenced this pull request Mar 3, 2021
Recently we've encountered a weird test behavior: a test (`unit.upload`
from #1266) succeded when run alone, but failed in conjunction with
`unit.cfg_errors` test. The reason is in monkeypatching and reloading of
lua modules used there.

With this patch, all test functions are moved to a simple empty server
and run over netbox (remote-control).
rosik added a commit that referenced this pull request Mar 3, 2021
Recently we've encountered a weird test behavior: a test (`unit.upload`
from #1266) succeded when run alone, but failed in conjunction with
`unit.cfg_errors` test. The reason is in monkeypatching and reloading of
lua modules used there.

With this patch, all test functions are moved to a simple empty server
and run over netbox (remote-control).
@rosik rosik force-pushed the 2pc-prepare-stage branch from 63fb0fc to 9c71a65 Compare March 3, 2021 16:24

local upload_path = get_upload_path(upload_id)
local ok, err = utils.mktree(fio.dirname(upload_path))
fiber.testcancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testcancel raises. You won't perform vars.upload_fibers[upload_id] = nil if this function raises an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok. In this case the upload_fibers will be cleaned by the cleanup stage which cancelled the fiber.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but it's not so obvious, please add a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, in fact we can't guarantee that cleanup stage will be executed someday in case of temporary problems with network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but it's not so obvious, please add a comment.

Done

However, in fact we can't guarantee that cleanup stage will be executed someday in case of temporary problems with network.

The necessity of persisting upload_fibers is to avoid a race between begin and cleanup. The necessity of cleaning up the upload_fibers is to avoid canceling an innocent fiber that belongs to the iproto fiber pool and has already been reused by that time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sum up, if the cleanup stage wasn't called for some reason, the garbage will remain in /tmp and in lua context. And it's ok for now.

local upload_path = get_upload_path(upload_id)
local payload_path = fio.pathjoin(upload_path, 'payload')
local ok, err = utils.file_write(payload_path, payload)
fiber.testcancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testcancel raises. You won't perform vars.upload_fibers[upload_id] = nil if this function raises an error.

local random_path = utils.randomize_path(upload_path)

local ok = fio.rename(upload_path, random_path)
local _errno = errno()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's redundant action to call errno here. It's not needed if ok is true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main danger of errno is that it may be overridden by some other consequent fio operation. I've changed this place, but in another way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local _errno
local ok, err = fio.some_operation()
if not ok then
   _errno = errno()
end

No redefinition, to excess actions.


local ok = fio.rmtree(random_path)
local _errno = errno()
if not ok then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's redundant action to call errno here. It's not needed if ok is true.

Comment on lines +311 to +314
_G.__cartridge_upload_begin = function(...) return errors.pcall('E', upload_begin, ...) end
_G.__cartridge_upload_transmit = function(...) return errors.pcall('E', upload_transmit, ...) end
_G.__cartridge_upload_finish = function(...) return errors.pcall('E', upload_finish, ...) end
_G.__cartridge_upload_cleanup = function(...) return errors.pcall('E', upload_cleanup, ...) end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't actually need to pass ... here because you exactly know a number of arguments - one upload_id.
varargs are bit slower than usual function call with determined number of arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to sacrifice that performance for the sake of code brevity. Hope it's not too much.

Comment on lines +24 to +25
{env = {TARANTOOL_UPLOAD_PREFIX = g.upload_path_1}},
{env = {TARANTOOL_UPLOAD_PREFIX = g.upload_path_1}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's reasonable to increase number of servers up to 5 or even more. In case of problems with concurrency and shared resource problems we have more chances to catch a problem in CI.

@rosik rosik merged commit dfd75b9 into master Mar 4, 2021
@rosik rosik deleted the 2pc-prepare-stage branch March 4, 2021 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[2pt] Uploading big config fails on large clusters
4 participants