Skip to content

[API server] graceful upgrade with client retry #6048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Jun 25, 2025
Merged

Conversation

aylei
Copy link
Collaborator

@aylei aylei commented Jun 20, 2025

This PR introduce graceful upgrade support with a client-side retry mechanism, which break down the upgrade process to the following states:

  1. Serving state: all requests are served
  2. User initiate an rolling-upgrade, kubelet send SIGTERM to API server, API server enters Cordon state. Behavior:
    • For new requests: return 503 (ask for client retry)
    • For internal requests: cancel
    • For /logs requests: cancel and mark it should be retried, the client will get retry error when streaming the log or call /api/get on the logs request
    • For other requests: wait for them to complete
  3. All requests finish or the wait timeout reaches, enter Terminating state:
    • Nginx returns 503 for all requests, which asks client to retry
  4. New server Pod created, enter Serving state:
    • All the requests that are waiting for retry on the server will get through to the API server.

The key differences with our previous graceful rolling-upgrade:

  • The entire process is now fully automated;
  • Client retry ensures there is no error occurs if the graceful rolling upgrade get completed within the timeout, where timeout is configurable;

Known issues:

  • There is still a downtime since during the rolling-upgrade, new requests have to wait. This will be addressed in the future by introducing multi-replicas support for API server;
  • Nginx sends 503 to client whenever there is no live backends (e.g. the API server is crashed and cannot be recovered), which may cause the client retries for a very long time (now the default retry timeout is 10 minutes).

UX:

Message when log tailing is interrupted by server upgrade:

image

Recovered after the upgrade complete:

image

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

aylei added 7 commits June 20, 2025 23:46
@aylei aylei marked this pull request as ready for review June 23, 2025 16:12
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei! The PR looks awesome! Left some questions below. Also, is it possible to have a README for how to test this?

aylei and others added 7 commits June 24, 2025 10:48
Co-authored-by: Zhanghao Wu <[email protected]>
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
@aylei
Copy link
Collaborator Author

aylei commented Jun 24, 2025

Signed-off-by: Aylei <[email protected]>
@aylei
Copy link
Collaborator Author

aylei commented Jun 25, 2025

aylei added 2 commits June 25, 2025 11:33
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this possible @aylei! LGTM once the tests passed.

We need the following two things to be added as well:

  1. Docs for the behavior of the client side retry
  2. Test script to avoid regression

aylei added 2 commits June 25, 2025 13:02
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
@aylei
Copy link
Collaborator Author

aylei commented Jun 25, 2025

@aylei
Copy link
Collaborator Author

aylei commented Jun 25, 2025

/quicktest-core

Signed-off-by: Aylei <[email protected]>
@aylei aylei merged commit 469d2b5 into master Jun 25, 2025
16 checks passed
@aylei aylei deleted the graceful-upgrade branch June 25, 2025 07:45
@aylei aylei mentioned this pull request Jul 4, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants