Skip to content

KAFKA-18845: Fix flaky QuorumControllerTest#testUncleanShutdownBrokerElrEnabled #19240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 20, 2025

Conversation

FrankYang0529
Copy link
Member

@FrankYang0529 FrankYang0529 commented Mar 19, 2025

There're two root causes:

  1. When we unclean shutdown brokerToBeTheLeader, we didn't wait for the result. That means when we send heartbeat to unfence broker, it has chance to use stale broker epoch to send the request. [0]
  2. We use different replica directory to unclean shutdown broker. Even if broker is unfenced, it cannot get an online directory, so the brokerToBeTheLeader cannot be elected as a new leader. [1]

[0]

// Unclean shutdown should not remove the last known ELR members.
active.registerBroker(
anonymousContextFor(ApiKeys.BROKER_REGISTRATION),
new BrokerRegistrationRequestData().
setBrokerId(brokerToBeTheLeader).
setClusterId(active.clusterId()).
setFeatures(features).
setIncarnationId(Uuid.randomUuid()).
setPreviousBrokerEpoch(brokerEpochs.get(brokerToBeTheLeader)).
setLogDirs(List.of(Uuid.randomUuid())).
setListeners(listeners));
// Unfence the last one in the ELR, it should be elected.
sendBrokerHeartbeatToUnfenceBrokers(active, List.of(brokerToBeTheLeader), brokerEpochs);

[1]

@Override
public boolean test(int brokerId) {
if (!isAcceptableLeader.test(brokerId)) {
return false;
}
Uuid replicaDirectory = partition.directory(brokerId);
return clusterControl.hasOnlineDir(brokerId, replicaDirectory);
}

Reviewers: David Arthur [email protected]

@github-actions github-actions bot added tests Test fixes (including flaky tests) kraft small Small PRs labels Mar 19, 2025
@FrankYang0529 FrankYang0529 requested a review from chia7712 March 19, 2025 11:53
@@ -376,10 +375,12 @@ public void testElrEnabledByDefault() throws Throwable {
}
}

@Flaky("KAFKA-18845")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FrankYang0529 we need to keep the annotation here until the test proves itself to no longer be flaky. I have updated the wiki https://cwiki.apache.org/confluence/display/KAFKA/Flaky+Test+Management to reflect this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reminder. I added Flaky back and will remove it after we have sufficient data.

setListeners(listeners));
brokerEpochs.put(brokerToBeTheLeader, replyLeader.get().epoch());
partition = active.replicationControl().getPartition(topicIdFoo, 0);
assertArrayEquals(new int[]{brokerToBeTheLeader}, partition.elr, partition.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a real message instead of just the partition if the assertion fails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more content for it. Thanks.

Copy link
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mumrah mumrah merged commit 71875ec into apache:trunk Mar 20, 2025
25 checks passed
@FrankYang0529 FrankYang0529 deleted the KAFKA-18845 branch March 21, 2025 01:30
chia7712 pushed a commit that referenced this pull request Apr 15, 2025
…ownBrokerElrEnabled (#19403)

It has been around two weeks since fixing
QuorumControllerTest#testUncleanShutdownBrokerElrEnabled PR
#19240 was merged. There is no flaky
result after 2025/03/21, so it has enough evidence to prove the flaky is
fixed. It's good to remove flaky tag.

Reviewers: Chia-Ping Tsai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kraft small Small PRs tests Test fixes (including flaky tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants