Optimize partition_validity function used in sort kernels #7937

jhorstmann · 2025-07-16T08:02:53Z

Which issue does this PR close?

Optimize partition_validity function used in sort kernels

- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of `Array::is_valid`
- Avoid capacity checks inside the loop by writing to `spare_capacity_mut` instead of using `push`
Closes Optimize sort kernels partition_validity method #7936.
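
The three techniques listed above can be sketched together in a small, std-only example. This is not the code from arrow-ord/src/sort.rs: the `Validity` struct is a hypothetical stand-in for Arrow's `NullBuffer`, and the function signature is an assumption made for illustration only.

```rust
/// Hypothetical stand-in for a validity bitmap (not Arrow's `NullBuffer`):
/// bit `i` set means row `i` is valid.
struct Validity<'a> {
    bits: &'a [u8],
    len: usize,
}

impl Validity<'_> {
    /// Concrete, inlinable method: no trait object, so no dynamic dispatch.
    #[inline]
    fn is_valid(&self, i: usize) -> bool {
        (self.bits[i / 8] >> (i % 8)) & 1 == 1
    }

    /// Counted by scanning here; assumed to be cheap or cached in a real buffer.
    fn null_count(&self) -> usize {
        (0..self.len).filter(|&i| !self.is_valid(i)).count()
    }
}

/// Splits row indices into (valid, null), allocating each vector exactly once.
fn partition_validity(v: &Validity) -> (Vec<u32>, Vec<u32>) {
    let null_count = v.null_count();
    let valid_count = v.len - null_count;

    // Preallocate based on the known counts.
    let mut valid: Vec<u32> = Vec::with_capacity(valid_count);
    let mut nulls: Vec<u32> = Vec::with_capacity(null_count);

    // Write into the uninitialized tail instead of calling `push`, so the
    // loop never has to check whether the vector needs to grow.
    let valid_slots = valid.spare_capacity_mut();
    let null_slots = nulls.spare_capacity_mut();
    let (mut valid_idx, mut null_idx) = (0, 0);

    for i in 0..v.len {
        if v.is_valid(i) {
            valid_slots[valid_idx].write(i as u32);
            valid_idx += 1;
        } else {
            null_slots[null_idx].write(i as u32);
            null_idx += 1;
        }
    }

    assert_eq!(valid_idx, valid_count);
    assert_eq!(null_idx, null_count);

    // SAFETY: exactly `valid_count` and `null_count` slots were initialized
    // above (checked by the asserts), and both fit in the allocated capacity.
    unsafe {
        valid.set_len(valid_count);
        nulls.set_len(null_count);
    }
    (valid, nulls)
}
```

The slice indexing in this sketch is still bounds checked; what it avoids is the growth check and possible reallocation that `push` performs on every call.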

Rationale for this change

Microbenchmark results for sort_kernels compared to main, only looking at benchmarks matching "nulls to indices":

sort i32 nulls to indices 2^10
                        time:   [4.9325 µs 4.9370 µs 4.9422 µs]
                        change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
                        Performance has improved.

sort i32 nulls to indices 2^12
                        time:   [20.096 µs 20.209 µs 20.327 µs]
                        change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
                        Performance has improved.

sort f32 nulls to indices 2^12
                        time:   [26.329 µs 26.366 µs 26.406 µs]
                        change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-10] nulls to indices 2^12
                        time:   [70.667 µs 70.762 µs 70.886 µs]
                        change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-100] nulls to indices 2^12
                        time:   [101.98 µs 102.44 µs 102.99 µs]
                        change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
                        No change in performance detected.

sort string[0-400] nulls to indices 2^12
                        time:   [84.952 µs 85.024 µs 85.102 µs]
                        change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] nulls to indices 2^12
                        time:   [72.486 µs 72.664 µs 72.893 µs]
                        change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[100] nulls to indices 2^12
                        time:   [71.354 µs 71.606 µs 71.902 µs]
                        change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[1000] nulls to indices 2^12
                        time:   [73.088 µs 73.195 µs 73.311 µs]
                        change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[10] nulls to indices 2^12
                        time:   [32.592 µs 32.654 µs 32.731 µs]
                        change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[0-400] nulls to indices 2^12
                        time:   [32.981 µs 33.074 µs 33.189 µs]
                        change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view_inlined[0-12] nulls to indices 2^12
                        time:   [28.467 µs 28.496 µs 28.529 µs]
                        change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] dict nulls to indices 2^12
                        time:   [94.463 µs 94.503 µs 94.542 µs]
                        change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
                        Performance has improved.

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No, the method is internal to the sort kernels.

jhorstmann · 2025-07-16T10:11:26Z

@alamb can you run your benchmarking script for the sort kernels on this PR? Improvements look unexpectedly good to me 😄

zhuqi-lucas · 2025-07-16T11:10:52Z

Pretty good, @jhorstmann. I tested one case locally, and it is actually about 1.2x faster:

sort i32 nulls to indices 2^10
                        time:   [3.9156 µs 3.9226 µs 3.9306 µs]
                        change: [−21.573% −21.387% −21.194%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
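
As a rough aid for reading these numbers: a relative change in run time can be converted to a speedup factor with `1 / (1 + change)`. The helper below is a hypothetical illustration and is not part of Criterion or of this PR. For the −21.387% change reported above it gives roughly 1.27.

```rust
/// Converts a relative change in run time (for example -0.21387 for -21.387%)
/// into a speedup factor, defined as old_time / new_time.
/// Hypothetical helper for illustration only.
fn speedup_from_change(change: f64) -> f64 {
    // new_time = old_time * (1 + change), so old_time / new_time = 1 / (1 + change)
    1.0 / (1.0 + change)
}
```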

alamb · 2025-07-16T15:14:39Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-sort-partition-validity (850fddf) to c40830e diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-sort-partition-validity
Results will be posted here when complete

alamb · 2025-07-16T15:32:49Z

🤖: Benchmark completed

Details

group                                                   main                                   optimize-sort-partition-validity
-----                                                   ----                                   --------------------------------
lexsort (bool, bool) 2^12                               1.00    117.6±0.53µs        ? ?/sec    1.00    117.4±0.40µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    163.3±0.30µs        ? ?/sec    1.01    164.2±0.35µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.5±0.42µs        ? ?/sec    1.02     46.4±0.18µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.1±0.28µs        ? ?/sec    1.00    213.7±0.47µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.5±0.09µs        ? ?/sec    1.00     38.5±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.1±0.13µs        ? ?/sec    1.00     40.8±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.1±0.16µs        ? ?/sec    1.00     79.0±0.23µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    213.1±0.51µs        ? ?/sec    1.00    213.7±0.23µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     55.8±0.28µs        ? ?/sec    1.00     55.8±0.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    261.7±0.42µs        ? ?/sec    1.00    261.3±0.46µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     88.0±0.23µs        ? ?/sec    1.00     87.6±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     89.0±0.26µs        ? ?/sec    1.00     88.4±0.22µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.02    100.9±2.32µs        ? ?/sec    1.00     99.3±0.25µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    261.6±0.52µs        ? ?/sec    1.00    261.3±0.57µs        ? ?/sec
rank f32 2^12                                           1.00     69.2±0.26µs        ? ?/sec    1.00     68.9±0.20µs        ? ?/sec
rank f32 nulls 2^12                                     1.03     36.2±0.09µs        ? ?/sec    1.00     35.2±0.09µs        ? ?/sec
rank string[10] 2^12                                    1.00    250.0±0.49µs        ? ?/sec    1.01    251.9±0.29µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    120.6±0.23µs        ? ?/sec    1.01    121.4±0.24µs        ? ?/sec
sort f32 2^12                                           1.00     65.1±0.28µs        ? ?/sec    1.01     65.7±0.25µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     29.8±0.14µs        ? ?/sec    1.00     29.7±0.19µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.32     70.8±0.18µs        ? ?/sec    1.00     53.5±0.27µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.9±0.46µs        ? ?/sec    1.00     72.7±0.20µs        ? ?/sec
sort i32 2^10                                           1.00      7.8±0.02µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 2^12                                           1.00     37.8±0.11µs        ? ?/sec    1.00     37.9±0.17µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.7±0.01µs        ? ?/sec    1.00      4.7±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.1±0.06µs        ? ?/sec    1.00     20.1±0.05µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.28      9.8±0.06µs        ? ?/sec    1.00      7.7±0.34µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.20     52.5±0.22µs        ? ?/sec    1.00     43.9±0.11µs        ? ?/sec
sort i32 to indices 2^10                                1.01     11.2±0.03µs        ? ?/sec    1.00     11.1±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.00     52.8±0.18µs        ? ?/sec    1.00     52.8±0.18µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.02µs        ? ?/sec    1.12      7.2±0.02µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.03µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.04    184.9±1.59µs        ? ?/sec    1.00    178.4±0.41µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    347.5±0.83µs        ? ?/sec    1.01    352.2±0.95µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.05    150.5±1.03µs        ? ?/sec    1.00    144.0±0.29µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    267.6±0.61µs        ? ?/sec    1.00    268.6±0.74µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.05    160.4±0.33µs        ? ?/sec    1.00    153.2±0.65µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    290.7±0.93µs        ? ?/sec    1.01    293.7±0.99µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.06    149.2±0.56µs        ? ?/sec    1.00    141.4±0.48µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    254.2±1.10µs        ? ?/sec    1.01    256.1±2.02µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.06    144.8±0.22µs        ? ?/sec    1.00    137.1±0.50µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    251.9±0.77µs        ? ?/sec    1.00    252.5±1.75µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.04    178.9±0.33µs        ? ?/sec    1.00    171.3±0.69µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    315.2±0.57µs        ? ?/sec    1.00    316.7±0.47µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.06    147.0±0.39µs        ? ?/sec    1.00    139.0±0.43µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    248.0±0.50µs        ? ?/sec    1.01    249.3±0.99µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.12     86.9±0.19µs        ? ?/sec    1.00     77.3±0.25µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    125.1±0.35µs        ? ?/sec    1.01    126.5±0.24µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.16     70.4±0.25µs        ? ?/sec    1.00     60.8±0.36µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    106.2±0.29µs        ? ?/sec    1.00    106.0±0.28µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.15     66.9±0.39µs        ? ?/sec    1.00     58.4±0.23µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     93.8±0.86µs        ? ?/sec    1.00     93.4±0.65µs        ? ?/sec

alamb · 2025-07-16T15:45:39Z

Close/reopen to retrigger CI

alamb · 2025-07-16T15:46:18Z

The benchmark results also look good to me -- the only one that reports something slow is already so fast I think it is mostly measurement error

sort primitive run 2^12 1.00 6.4±0.02µs ? ?/sec 1.12 7.2±0.02µs ? ?/sec

alamb

I looked at this code carefully and I think it is correct. Nicely done @jhorstmann -- I had a few suggestions for being pedantic with safety that I think should be addressed prior to merge but all in all very nice work

arrow-ord/src/sort.rs

jhorstmann · 2025-07-16T16:11:35Z

The benchmark results also look good to me -- the only one that reports something slow is already so fast I think it is mostly measurement error

sort primitive run 2^12 1.00 6.4±0.02µs ? ?/sec 1.12 7.2±0.02µs ? ?/sec

I agree, probably measurement overhead. But sort_to_indices for run-end-encoded (REE) arrays does actually call partition_validity and then ignores the result, so it's also possible that the compiler previously optimized that away. I'll take another look and maybe handle REE arrays earlier in that function.

Update: looked at this benchmark in a profiler, no call to partition_validity to be seen, so this seems to have been random fluctuation.

alamb

Love it

alamb · 2025-07-17T13:52:10Z

🚀

FYI @Dandandan and @zhuqi-lucas as you may be interested in this one too

zhuqi-lucas

LGTM, thank you for the great work. I added it to this epic:

#7937

One potential further improvement is:

Using word-level (u64) bit scanning.
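
To illustrate what word-level scanning means here, the sketch below walks a bitmap 64 bits at a time and uses `trailing_zeros` to jump directly to each set bit. The function name `set_bit_positions` and the `&[u64]` layout are assumptions for this example, not Arrow APIs.

```rust
/// Returns the positions of all set bits below `len`, where bit 0 of
/// `words[0]` is position 0. Hypothetical helper for illustration only.
fn set_bit_positions(words: &[u64], len: usize) -> Vec<u32> {
    let mut out = Vec::new();
    for (word_idx, &word) in words.iter().enumerate() {
        let base = word_idx * 64;
        let mut w = word;
        // An all-zero word is skipped with a single comparison.
        while w != 0 {
            // Index of the lowest set bit in the remaining word.
            let idx = base + w.trailing_zeros() as usize;
            if idx >= len {
                // Bits are visited in ascending order, so the rest of this
                // word lies beyond the logical length as well.
                break;
            }
            out.push(idx as u32);
            // Clear the lowest set bit.
            w &= w - 1;
        }
    }
    out
}
```

Applied to a validity bitmap, the set bits give the valid indices; running the same loop over the inverted words (`!word`) would give the null indices. The work per word scales with the number of set bits instead of always testing all 64 positions.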

zhuqi-lucas · 2025-07-17T15:20:41Z

arrow-ord/src/sort.rs

+                }
+            });
+
+            assert_eq!(null_idx, null_count);


Is it possible to use debug_assert_eq?

I am not sure whether using assert_eq reduces the improvement.

I'm pretty sure these asserts can never fail, but they also don't add any overhead considering the loop above does about 2 comparisons per array element.

Thank you @jhorstmann for checking. I was thinking that we will have many batches (arrays) in large DataFusion queries.

alamb · 2025-07-17T15:30:20Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-sort-partition-validity (c42d303) to c40830e diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-sort-partition-validity
Results will be posted here when complete

alamb · 2025-07-17T15:48:48Z

🤖: Benchmark completed

Details

group                                                   main                                   optimize-sort-partition-validity
-----                                                   ----                                   --------------------------------
lexsort (bool, bool) 2^12                               1.01    118.0±0.49µs        ? ?/sec    1.00    117.1±0.93µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    163.7±0.96µs        ? ?/sec    1.00    163.8±0.35µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.5±0.10µs        ? ?/sec    1.00     45.6±0.06µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.6±0.29µs        ? ?/sec    1.03    219.0±0.62µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.6±0.06µs        ? ?/sec    1.00     38.4±0.08µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     41.1±0.06µs        ? ?/sec    1.00     41.0±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.0±0.12µs        ? ?/sec    1.00     79.4±0.17µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    214.5±0.58µs        ? ?/sec    1.00    214.1±0.43µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     55.9±0.17µs        ? ?/sec    1.00     56.1±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    261.5±0.72µs        ? ?/sec    1.00    261.3±0.69µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     87.5±0.25µs        ? ?/sec    1.00     87.7±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     88.7±0.20µs        ? ?/sec    1.00     88.9±0.17µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00    100.1±0.47µs        ? ?/sec    1.00    100.4±0.33µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    261.4±0.64µs        ? ?/sec    1.00    261.7±1.02µs        ? ?/sec
rank f32 2^12                                           1.00     69.2±0.38µs        ? ?/sec    1.00     69.5±0.28µs        ? ?/sec
rank f32 nulls 2^12                                     1.00     35.9±0.09µs        ? ?/sec    1.01     36.1±0.12µs        ? ?/sec
rank string[10] 2^12                                    1.00    250.7±0.58µs        ? ?/sec    1.00    250.6±0.51µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    120.4±0.26µs        ? ?/sec    1.00    120.2±0.27µs        ? ?/sec
sort f32 2^12                                           1.00     65.4±0.33µs        ? ?/sec    1.00     65.3±0.26µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     29.7±0.10µs        ? ?/sec    1.00     29.8±0.12µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.38     69.9±0.43µs        ? ?/sec    1.00     50.6±0.13µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.7±0.39µs        ? ?/sec    1.01     73.1±0.45µs        ? ?/sec
sort i32 2^10                                           1.00      7.8±0.03µs        ? ?/sec    1.00      7.7±0.02µs        ? ?/sec
sort i32 2^12                                           1.00     37.8±0.11µs        ? ?/sec    1.00     37.8±0.14µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.7±0.01µs        ? ?/sec    1.00      4.7±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.1±0.04µs        ? ?/sec    1.00     20.2±0.07µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.39     10.0±0.10µs        ? ?/sec    1.00      7.2±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.28     52.4±0.15µs        ? ?/sec    1.00     40.8±0.13µs        ? ?/sec
sort i32 to indices 2^10                                1.00     11.2±0.03µs        ? ?/sec    1.00     11.2±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.00     52.9±0.33µs        ? ?/sec    1.00     53.1±0.28µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.02µs        ? ?/sec    1.01      6.5±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.02µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.07    185.0±0.90µs        ? ?/sec    1.00    173.4±0.43µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    348.5±2.03µs        ? ?/sec    1.01    350.9±1.20µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.09    151.7±0.76µs        ? ?/sec    1.00    139.4±0.53µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    268.9±1.78µs        ? ?/sec    1.00    267.7±0.70µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.07    160.9±0.43µs        ? ?/sec    1.00    149.8±1.46µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    292.7±1.31µs        ? ?/sec    1.00    293.1±3.53µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.08    149.5±2.17µs        ? ?/sec    1.00    138.0±0.52µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    255.5±2.42µs        ? ?/sec    1.00    254.6±1.12µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.09    145.2±1.20µs        ? ?/sec    1.00    133.6±0.37µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    253.1±0.60µs        ? ?/sec    1.00    252.9±3.12µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.06    178.8±0.52µs        ? ?/sec    1.00    168.0±0.44µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    315.0±0.58µs        ? ?/sec    1.00    316.3±0.95µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.09    147.5±0.36µs        ? ?/sec    1.00    135.9±1.57µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    248.0±0.50µs        ? ?/sec    1.00    248.6±2.70µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.17     86.6±0.26µs        ? ?/sec    1.00     73.8±0.21µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    125.0±0.20µs        ? ?/sec    1.00    125.5±0.20µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.19     70.0±0.58µs        ? ?/sec    1.00     58.8±0.34µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    106.1±0.32µs        ? ?/sec    1.00    106.1±0.35µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.21     67.0±0.39µs        ? ?/sec    1.00     55.5±0.60µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     93.3±0.65µs        ? ?/sec    1.00     93.8±1.24µs        ? ?/sec

alamb · 2025-07-17T18:37:33Z

Still looking really good -- thank you @zhuqi-lucas for your suggestion

zhuqi-lucas · 2025-07-18T03:12:54Z

Thank you @alamb @jhorstmann, the latest performance results look good!

alamb · 2025-07-18T11:56:01Z

Thanks again

zhuqi-lucas · 2025-07-18T13:10:38Z

LGTM, thank you for the great work. I added it to this epic:

#7937

One potential further improvement is:

Using word-level (u64) bit scanning.

I created a follow-up experiment for this possible improvement:

#7962

…map scan (up to 30% faster) (#7962)

# Which issue does this PR close?

This PR is a follow-up to #7937. I want to experiment with the performance of word-level (u64) bit scanning. Details: #7937 (review)

# Rationale for this change

Use word-level (u64) bit scanning. This is implemented with set_indices, but we need u32 indices, so I also added set_indices_u32; it shows a 7% improvement compared to using set_indices and then casting to u32.

# What changes are included in this PR?

Use word-level (u64) bit scanning. This is implemented with set_indices, but we need u32 indices, so I also added set_indices_u32; it shows a 7% improvement compared to using set_indices and then casting to u32.

# Are these changes tested?

Yes: unit tests and fuzz testing were added, and the existing sort fuzz tests also cover this.

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>

Optimize partition_validity used in sort kernels

850fddf

- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of `Array::is_valid`

mbrobbel approved these changes Jul 16, 2025

alamb closed this Jul 16, 2025

alamb reopened this Jul 16, 2025

alamb approved these changes Jul 16, 2025

jhorstmann added 2 commits July 16, 2025 19:53

Extract null_count variable

8011217

Add asserts and safety comment

c42d303

github-actions bot added the arrow Changes to the arrow crate label Jul 16, 2025

alamb approved these changes Jul 17, 2025

zhuqi-lucas mentioned this pull request Jul 17, 2025

[EPIC] A collection of improvement for the performance for sort and compare and gc, etc #7802

Open

12 tasks

zhuqi-lucas approved these changes Jul 17, 2025

zhuqi-lucas reviewed Jul 17, 2025

alamb merged commit 233dad3 into apache:main Jul 18, 2025
17 checks passed

zhuqi-lucas mentioned this pull request Jul 18, 2025

Perf: improve sort via partition_validity to use fast path for bit map scan (up to 30% faster) #7962

Merged

alamb mentioned this pull request Jul 28, 2025

Optimize sort kernels partition_validity method #7936

Closed
