-
Notifications
You must be signed in to change notification settings - Fork 984
Optimize partition_validity function used in sort kernels #7937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize partition_validity function used in sort kernels #7937
Conversation
- Preallocate vectors based on known null counts - Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of `Array::is_valid`
@alamb can you run your benchmarking script for the sort kernels on this PR? Improvements look unexpectedly good to me 😄 |
Pretty good @jhorstmann , i tested locally for one case, it is actually 1.2 faster: sort i32 nulls to indices 2^10
time: [3.9156 µs 3.9226 µs 3.9306 µs]
change: [−21.573% −21.387% −21.194%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe |
🤖 |
🤖: Benchmark completed Details
|
Close/reopen to retrigger CI |
The bechmak results also look good to me -- the only one that reports something slow is already so fast I think it is mostly measurement error
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at this code carefully and I think it is correct. Nicely done @jhorstmann -- I had a few suggestions for being pedantic with safety that I think should be addressed prior to merge but all in all very nice work
I agree, probably measurement overhead. But Update: looked at this benchmark in a profiler, no call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it
🚀 FYI @Dandandan and @zhuqi-lucas as you may be interested in this one too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for the great work, and i added it to this epic
One potential improvement for further is:
Using word-level (u64) bit scanning.
} | ||
}); | ||
|
||
assert_eq!(null_idx, null_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use debug_assert_eq?
I am not sure if it will have less improvement for using assert_eq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure these asserts can never fail, but they also don't add any overhead considering the loop above does about 2 comparisons per array element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jhorstmann for checking, i was thinking, we will have many batches(Array) for datafusion big query.
🤖 |
🤖: Benchmark completed Details
|
Still looking really good -- thank you @zhuqi-lucas for your suggestion |
Thank you @alamb @jhorstmann , looks good from the latest performance! |
Thanks again |
…map scan (up to 30% faster) (#7962) # Which issue does this PR close? This PR is follow-up for: #7937 I want to experiment the performance for Using word-level (u64) bit scanning: Details: #7937 (review) # Rationale for this change Using word-level (u64) bit scanning Use set_indices to implement this, but we need u32 index , so i also add set_indices_u32, the performance shows %7 improvement comparing to set_indices then to case to u32. # What changes are included in this PR? Using word-level (u64) bit scanning Use set_indices to implement this, but we need u32 index , so i also add set_indices_u32, the performance shows %7 improvement comparing to set_indices then to case to u32. # Are these changes tested? Yes, add unit test also fuzz testing, also existed testing coverage sort fuzz. # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Optimize
partition_validity
function used in sort kernelsNullBuffer::is_valid
instead ofArray::is_valid
spare_capacity_mut
instead of usingpush
Rationale for this change
Microbenchmark results for
sort_kernels
compared tomain
, only looking at benchmarks matching "nulls to indices":Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
No, the method is internal to the sort kernels.