fix: prevent integer overflow in candle backend sequence length calcu… #681

xrwang8 · 2025-07-11T03:52:21Z

Fix integer overflow in Candle backend sequence length calculation

What does this PR do?

This PR fixes a critical integer overflow bug in the Candle backend that causes CUDA driver crashes and massive memory allocation requests (~18.4 EB) when processing certain batch configurations.

Root Cause: The issue occurs in backends/candle/src/lib.rs where sequence lengths are calculated using unsigned integer subtraction without overflow protection:

(batch.cumulative_seq_lengths[i + 1] - batch.cumulative_seq_lengths[i]) as usize

When cumulative_seq_lengths[i] > cumulative_seq_lengths[i + 1] (due to malformed batch data), the subtraction underflows, producing a very large u32 value that gets cast to usize, resulting in massive memory allocation requests that crash the CUDA driver.

Solution: Replace unsafe subtraction with checked_sub() to detect overflow conditions and fail fast with a clear error message:

batch.cumulative_seq_lengths[i + 1]
    .checked_sub(batch.cumulative_seq_lengths[i])
    .expect("Invalid cumulative sequence lengths: sequence lengths must be non-decreasing")
    as usize

Symptoms this fixes:

CUDA driver errors: CUDA_ERROR_UNKNOWN
HAMi OOM errors with extremely large memory requests
Service becoming unresponsive after initial successful requests
nvidia-smi showing corrupted memory usage displays

Error logs example:

[HAMI-core ERROR]: Device 0 OOM 18446744073702267944 / 24146608128
DriverError(CUDA_ERROR_UNKNOWN, "<Failure when calling cuGetErrorString()>")

Impact:

Fixes critical CUDA driver crashes and OOM errors
Improves error diagnostics and debugging experience
Maintains full backward compatibility for valid inputs
Negligible performance impact (single integer comparison per sequence)

Testing: Added unit test to verify the fix correctly detects and handles invalid cumulative sequence lengths.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@Narsil

…lation This commit fixes a critical integer overflow bug in the Candle backend that causes CUDA driver crashes and massive memory allocation requests. The issue occurred when calculating sequence lengths using unsigned integer subtraction without overflow protection: (batch.cumulative_seq_lengths[i + 1] - batch.cumulative_seq_lengths[i]) as usize When cumulative_seq_lengths[i] > cumulative_seq_lengths[i + 1], the subtraction underflows, producing a very large u32 value that gets cast to usize, resulting in memory allocation requests of ~18.4 EB. Signed-off-by: xrwang8 <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: prevent integer overflow in candle backend sequence length calcu… #681

fix: prevent integer overflow in candle backend sequence length calcu… #681

xrwang8 commented Jul 11, 2025 •

edited

Loading

Uh oh!

Uh oh!

fix: prevent integer overflow in candle backend sequence length calcu… #681

Are you sure you want to change the base?

fix: prevent integer overflow in candle backend sequence length calcu… #681

Conversation

xrwang8 commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix integer overflow in Candle backend sequence length calculation

What does this PR do?

Before submitting

Who can review?

Uh oh!

Uh oh!

xrwang8 commented Jul 11, 2025 •

edited

Loading