
Optimize partition_validity function used in sort kernels #7937


Merged: 3 commits merged on Jul 18, 2025

Conversation

jhorstmann
Contributor

Which issue does this PR close?

Optimize partition_validity function used in sort kernels

  • Preallocate vectors based on known null counts
  • Avoid dynamic dispatch by calling NullBuffer::is_valid instead of Array::is_valid
  • Avoid capacity checks inside loop by writing to spare_capacity_mut instead of using push
  • Closes #7936 (Optimize sort kernels partition_validity method).
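
As a rough illustration of the three points above, here is a minimal std-only sketch. It does not use the arrow crate: `Validity` is a hypothetical stand-in for `NullBuffer`, and the function body is an assumption about the general shape of the technique, not the code changed in this PR.

```rust
/// Hypothetical stand-in for a validity bitmap (not the arrow `NullBuffer`).
/// Bit `i` set means element `i` is valid (non-null).
struct Validity<'a> {
    bits: &'a [u8],
    len: usize,
}

impl Validity<'_> {
    /// Concrete (statically dispatched) bit test.
    fn is_valid(&self, i: usize) -> bool {
        (self.bits[i / 8] >> (i % 8)) & 1 == 1
    }

    /// Computed by scanning here; assumed to be known up front in a real buffer.
    fn null_count(&self) -> usize {
        (0..self.len).filter(|&i| !self.is_valid(i)).count()
    }
}

/// Split `0..len` into (valid indices, null indices).
fn partition_validity(v: &Validity) -> (Vec<u32>, Vec<u32>) {
    let null_count = v.null_count();
    let valid_count = v.len - null_count;

    // Preallocate both vectors to their exact final sizes.
    let mut valid: Vec<u32> = Vec::with_capacity(valid_count);
    let mut nulls: Vec<u32> = Vec::with_capacity(null_count);

    {
        // Write into the uninitialized tail instead of calling `push`,
        // so the loop never has to check for or trigger a reallocation.
        let valid_slots = valid.spare_capacity_mut();
        let null_slots = nulls.spare_capacity_mut();
        let (mut valid_idx, mut null_idx) = (0, 0);

        for i in 0..v.len {
            if v.is_valid(i) {
                valid_slots[valid_idx].write(i as u32);
                valid_idx += 1;
            } else {
                null_slots[null_idx].write(i as u32);
                null_idx += 1;
            }
        }

        // These checks are what make the `set_len` calls below sound.
        assert_eq!(valid_idx, valid_count);
        assert_eq!(null_idx, null_count);
    }

    // SAFETY: exactly `valid_count` and `null_count` slots were initialized
    // above, and both values are within the allocated capacities.
    unsafe {
        valid.set_len(valid_count);
        nulls.set_len(null_count);
    }
    (valid, nulls)
}
```

For a four-element bitmap `0b0000_1101` this yields valid indices `[0, 2, 3]` and null indices `[1]`. The slice indexing still performs a bounds check per write; the saving in this sketch comes from skipping the capacity check that `push` performs.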

Rationale for this change

Microbenchmark results for sort_kernels compared to main, only looking at benchmarks matching "nulls to indices":

sort i32 nulls to indices 2^10
                        time:   [4.9325 µs 4.9370 µs 4.9422 µs]
                        change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
                        Performance has improved.

sort i32 nulls to indices 2^12
                        time:   [20.096 µs 20.209 µs 20.327 µs]
                        change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
                        Performance has improved.

sort f32 nulls to indices 2^12
                        time:   [26.329 µs 26.366 µs 26.406 µs]
                        change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-10] nulls to indices 2^12
                        time:   [70.667 µs 70.762 µs 70.886 µs]
                        change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-100] nulls to indices 2^12
                        time:   [101.98 µs 102.44 µs 102.99 µs]
                        change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
                        No change in performance detected.

sort string[0-400] nulls to indices 2^12
                        time:   [84.952 µs 85.024 µs 85.102 µs]
                        change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] nulls to indices 2^12
                        time:   [72.486 µs 72.664 µs 72.893 µs]
                        change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[100] nulls to indices 2^12
                        time:   [71.354 µs 71.606 µs 71.902 µs]
                        change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[1000] nulls to indices 2^12
                        time:   [73.088 µs 73.195 µs 73.311 µs]
                        change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[10] nulls to indices 2^12
                        time:   [32.592 µs 32.654 µs 32.731 µs]
                        change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[0-400] nulls to indices 2^12
                        time:   [32.981 µs 33.074 µs 33.189 µs]
                        change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view_inlined[0-12] nulls to indices 2^12
                        time:   [28.467 µs 28.496 µs 28.529 µs]
                        change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] dict nulls to indices 2^12
                        time:   [94.463 µs 94.503 µs 94.542 µs]
                        change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
                        Performance has improved.

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No, the method is internal to the sort kernels.

@jhorstmann
Contributor Author

@alamb can you run your benchmarking script for the sort kernels on this PR? Improvements look unexpectedly good to me 😄

@zhuqi-lucas
Contributor

Pretty good @jhorstmann, I tested one case locally, and it is actually about 1.2x faster:

sort i32 nulls to indices 2^10
                        time:   [3.9156 µs 3.9226 µs 3.9306 µs]
                        change: [−21.573% −21.387% −21.194%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
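
To relate a reported change to a speedup factor, the arithmetic can be sketched as follows. This assumes the percentage is the relative change in run time compared to the baseline; the helper name `speedup_from_change` is made up for illustration.

```rust
/// Convert a relative change in run time (in percent, negative = faster)
/// into a speedup factor: old_time / new_time.
fn speedup_from_change(change_percent: f64) -> f64 {
    // new_time = old_time * (1 + change / 100)
    1.0 / (1.0 + change_percent / 100.0)
}
```

Under that assumption, a change of −21.387% corresponds to a speedup of roughly 1.27x.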

@alamb
Contributor

alamb commented Jul 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-sort-partition-validity (850fddf) to c40830e diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-sort-partition-validity
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 16, 2025

🤖: Benchmark completed

Details

group                                                   main                                   optimize-sort-partition-validity
-----                                                   ----                                   --------------------------------
lexsort (bool, bool) 2^12                               1.00    117.6±0.53µs        ? ?/sec    1.00    117.4±0.40µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    163.3±0.30µs        ? ?/sec    1.01    164.2±0.35µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.5±0.42µs        ? ?/sec    1.02     46.4±0.18µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.1±0.28µs        ? ?/sec    1.00    213.7±0.47µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.5±0.09µs        ? ?/sec    1.00     38.5±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.1±0.13µs        ? ?/sec    1.00     40.8±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.1±0.16µs        ? ?/sec    1.00     79.0±0.23µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    213.1±0.51µs        ? ?/sec    1.00    213.7±0.23µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     55.8±0.28µs        ? ?/sec    1.00     55.8±0.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    261.7±0.42µs        ? ?/sec    1.00    261.3±0.46µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     88.0±0.23µs        ? ?/sec    1.00     87.6±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     89.0±0.26µs        ? ?/sec    1.00     88.4±0.22µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.02    100.9±2.32µs        ? ?/sec    1.00     99.3±0.25µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    261.6±0.52µs        ? ?/sec    1.00    261.3±0.57µs        ? ?/sec
rank f32 2^12                                           1.00     69.2±0.26µs        ? ?/sec    1.00     68.9±0.20µs        ? ?/sec
rank f32 nulls 2^12                                     1.03     36.2±0.09µs        ? ?/sec    1.00     35.2±0.09µs        ? ?/sec
rank string[10] 2^12                                    1.00    250.0±0.49µs        ? ?/sec    1.01    251.9±0.29µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    120.6±0.23µs        ? ?/sec    1.01    121.4±0.24µs        ? ?/sec
sort f32 2^12                                           1.00     65.1±0.28µs        ? ?/sec    1.01     65.7±0.25µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     29.8±0.14µs        ? ?/sec    1.00     29.7±0.19µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.32     70.8±0.18µs        ? ?/sec    1.00     53.5±0.27µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.9±0.46µs        ? ?/sec    1.00     72.7±0.20µs        ? ?/sec
sort i32 2^10                                           1.00      7.8±0.02µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 2^12                                           1.00     37.8±0.11µs        ? ?/sec    1.00     37.9±0.17µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.7±0.01µs        ? ?/sec    1.00      4.7±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.1±0.06µs        ? ?/sec    1.00     20.1±0.05µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.28      9.8±0.06µs        ? ?/sec    1.00      7.7±0.34µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.20     52.5±0.22µs        ? ?/sec    1.00     43.9±0.11µs        ? ?/sec
sort i32 to indices 2^10                                1.01     11.2±0.03µs        ? ?/sec    1.00     11.1±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.00     52.8±0.18µs        ? ?/sec    1.00     52.8±0.18µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.02µs        ? ?/sec    1.12      7.2±0.02µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.03µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.04    184.9±1.59µs        ? ?/sec    1.00    178.4±0.41µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    347.5±0.83µs        ? ?/sec    1.01    352.2±0.95µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.05    150.5±1.03µs        ? ?/sec    1.00    144.0±0.29µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    267.6±0.61µs        ? ?/sec    1.00    268.6±0.74µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.05    160.4±0.33µs        ? ?/sec    1.00    153.2±0.65µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    290.7±0.93µs        ? ?/sec    1.01    293.7±0.99µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.06    149.2±0.56µs        ? ?/sec    1.00    141.4±0.48µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    254.2±1.10µs        ? ?/sec    1.01    256.1±2.02µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.06    144.8±0.22µs        ? ?/sec    1.00    137.1±0.50µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    251.9±0.77µs        ? ?/sec    1.00    252.5±1.75µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.04    178.9±0.33µs        ? ?/sec    1.00    171.3±0.69µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    315.2±0.57µs        ? ?/sec    1.00    316.7±0.47µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.06    147.0±0.39µs        ? ?/sec    1.00    139.0±0.43µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    248.0±0.50µs        ? ?/sec    1.01    249.3±0.99µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.12     86.9±0.19µs        ? ?/sec    1.00     77.3±0.25µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    125.1±0.35µs        ? ?/sec    1.01    126.5±0.24µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.16     70.4±0.25µs        ? ?/sec    1.00     60.8±0.36µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    106.2±0.29µs        ? ?/sec    1.00    106.0±0.28µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.15     66.9±0.39µs        ? ?/sec    1.00     58.4±0.23µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     93.8±0.86µs        ? ?/sec    1.00     93.4±0.65µs        ? ?/sec
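
For readers unfamiliar with this table layout, the leading number in each column appears to be the run time relative to the faster of the two branches. The sketch below works under that assumption; `relative_ratios` is a hypothetical helper, not part of the benchmark script.

```rust
/// Given two timings in microseconds, return each one divided by the
/// faster of the two, so the faster side is always 1.00.
fn relative_ratios(main_us: f64, branch_us: f64) -> (f64, f64) {
    let fastest = main_us.min(branch_us);
    (main_us / fastest, branch_us / fastest)
}
```

For the "sort f32 nulls to indices 2^12" row, 70.8 µs against 53.5 µs gives about 1.32 and 1.00. Small differences from the printed ratios are expected because the table shows rounded timings.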

@alamb
Contributor

alamb commented Jul 16, 2025

Close/reopen to retrigger CI

@alamb alamb closed this Jul 16, 2025
@alamb alamb reopened this Jul 16, 2025
@alamb
Contributor

alamb commented Jul 16, 2025

The benchmark results also look good to me -- the only one that reports a slowdown is already so fast that I think it is mostly measurement error

sort primitive run 2^12 1.00 6.4±0.02µs ? ?/sec 1.12 7.2±0.02µs ? ?/sec

Contributor

@alamb alamb left a comment


I looked at this code carefully and I think it is correct. Nicely done @jhorstmann -- I had a few suggestions for being pedantic with safety that I think should be addressed prior to merge but all in all very nice work

@jhorstmann
Contributor Author

jhorstmann commented Jul 16, 2025

The benchmark results also look good to me -- the only one that reports a slowdown is already so fast that I think it is mostly measurement error

sort primitive run 2^12 1.00 6.4±0.02µs ? ?/sec 1.12 7.2±0.02µs ? ?/sec

I agree, probably measurement overhead. But sort_to_indices for run-end-encoded (REE) arrays does actually call partition_validity and then ignores the result, so it's also possible that the compiler previously optimized that away. I'll take another look and maybe handle REE arrays earlier in that function.

Update: looked at this benchmark in a profiler, no call to partition_validity to be seen, so this seems to have been random fluctuation.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 16, 2025
Contributor

@alamb alamb left a comment


Love it

@alamb
Contributor

alamb commented Jul 17, 2025

🚀

FYI @Dandandan and @zhuqi-lucas as you may be interested in this one too

Contributor

@zhuqi-lucas zhuqi-lucas left a comment


LGTM, thank you for the great work. I added it to this epic:

#7937

One potential further improvement is:

Using word-level (u64) bit scanning.
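
A minimal sketch of what word-level scanning could look like, assuming the validity bitmap is available as little-endian `u64` words (bit `i` lives in `words[i / 64]` at position `i % 64`). The function name and layout are illustrative assumptions, not the arrow API:

```rust
/// Collect the indices of set bits (valid) and unset bits (null) among the
/// first `len` bits, visiting only the bits that are actually set in each word.
fn partition_by_words(words: &[u64], len: usize) -> (Vec<u32>, Vec<u32>) {
    let mut valid = Vec::new();
    let mut nulls = Vec::new();

    for (w, &word) in words.iter().enumerate() {
        let base = w * 64;
        if base >= len {
            break;
        }
        // Mask off padding bits in the last word that lie beyond `len`.
        let bits_in_word = (len - base).min(64);
        let mask = if bits_in_word == 64 {
            u64::MAX
        } else {
            (1u64 << bits_in_word) - 1
        };

        // Valid entries: repeatedly take the lowest set bit, then clear it.
        let mut set = word & mask;
        while set != 0 {
            let bit = set.trailing_zeros() as usize;
            valid.push((base + bit) as u32);
            set &= set - 1;
        }

        // Null entries: same scan over the inverted word.
        let mut unset = !word & mask;
        while unset != 0 {
            let bit = unset.trailing_zeros() as usize;
            nulls.push((base + bit) as u32);
            unset &= unset - 1;
        }
    }
    (valid, nulls)
}
```

The potential gain is that each inner loop runs once per set bit rather than once per element, so a word that is all valid or all null costs a single scan instead of 64 individual bit tests on each side.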

}
});

assert_eq!(null_idx, null_count);
Contributor

@zhuqi-lucas zhuqi-lucas Jul 17, 2025


Is it possible to use debug_assert_eq here?

I am not sure whether using assert_eq reduces the improvement.

Contributor Author


I'm pretty sure these asserts can never fail, but they also don't add any overhead considering the loop above does about 2 comparisons per array element.

Contributor


Thank you @jhorstmann for checking. I was thinking that we will have many batches (arrays) for large DataFusion queries.

@alamb
Contributor

alamb commented Jul 17, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-sort-partition-validity (c42d303) to c40830e diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-sort-partition-validity
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 17, 2025

🤖: Benchmark completed

Details

group                                                   main                                   optimize-sort-partition-validity
-----                                                   ----                                   --------------------------------
lexsort (bool, bool) 2^12                               1.01    118.0±0.49µs        ? ?/sec    1.00    117.1±0.93µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    163.7±0.96µs        ? ?/sec    1.00    163.8±0.35µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.5±0.10µs        ? ?/sec    1.00     45.6±0.06µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    213.6±0.29µs        ? ?/sec    1.03    219.0±0.62µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.6±0.06µs        ? ?/sec    1.00     38.4±0.08µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     41.1±0.06µs        ? ?/sec    1.00     41.0±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     79.0±0.12µs        ? ?/sec    1.00     79.4±0.17µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    214.5±0.58µs        ? ?/sec    1.00    214.1±0.43µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     55.9±0.17µs        ? ?/sec    1.00     56.1±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    261.5±0.72µs        ? ?/sec    1.00    261.3±0.69µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     87.5±0.25µs        ? ?/sec    1.00     87.7±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     88.7±0.20µs        ? ?/sec    1.00     88.9±0.17µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00    100.1±0.47µs        ? ?/sec    1.00    100.4±0.33µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    261.4±0.64µs        ? ?/sec    1.00    261.7±1.02µs        ? ?/sec
rank f32 2^12                                           1.00     69.2±0.38µs        ? ?/sec    1.00     69.5±0.28µs        ? ?/sec
rank f32 nulls 2^12                                     1.00     35.9±0.09µs        ? ?/sec    1.01     36.1±0.12µs        ? ?/sec
rank string[10] 2^12                                    1.00    250.7±0.58µs        ? ?/sec    1.00    250.6±0.51µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    120.4±0.26µs        ? ?/sec    1.00    120.2±0.27µs        ? ?/sec
sort f32 2^12                                           1.00     65.4±0.33µs        ? ?/sec    1.00     65.3±0.26µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     29.7±0.10µs        ? ?/sec    1.00     29.8±0.12µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.38     69.9±0.43µs        ? ?/sec    1.00     50.6±0.13µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.7±0.39µs        ? ?/sec    1.01     73.1±0.45µs        ? ?/sec
sort i32 2^10                                           1.00      7.8±0.03µs        ? ?/sec    1.00      7.7±0.02µs        ? ?/sec
sort i32 2^12                                           1.00     37.8±0.11µs        ? ?/sec    1.00     37.8±0.14µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.7±0.01µs        ? ?/sec    1.00      4.7±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.1±0.04µs        ? ?/sec    1.00     20.2±0.07µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.39     10.0±0.10µs        ? ?/sec    1.00      7.2±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.28     52.4±0.15µs        ? ?/sec    1.00     40.8±0.13µs        ? ?/sec
sort i32 to indices 2^10                                1.00     11.2±0.03µs        ? ?/sec    1.00     11.2±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.00     52.9±0.33µs        ? ?/sec    1.00     53.1±0.28µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.02µs        ? ?/sec    1.01      6.5±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.02µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.07    185.0±0.90µs        ? ?/sec    1.00    173.4±0.43µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    348.5±2.03µs        ? ?/sec    1.01    350.9±1.20µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.09    151.7±0.76µs        ? ?/sec    1.00    139.4±0.53µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    268.9±1.78µs        ? ?/sec    1.00    267.7±0.70µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.07    160.9±0.43µs        ? ?/sec    1.00    149.8±1.46µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    292.7±1.31µs        ? ?/sec    1.00    293.1±3.53µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.08    149.5±2.17µs        ? ?/sec    1.00    138.0±0.52µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    255.5±2.42µs        ? ?/sec    1.00    254.6±1.12µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.09    145.2±1.20µs        ? ?/sec    1.00    133.6±0.37µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    253.1±0.60µs        ? ?/sec    1.00    252.9±3.12µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.06    178.8±0.52µs        ? ?/sec    1.00    168.0±0.44µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    315.0±0.58µs        ? ?/sec    1.00    316.3±0.95µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.09    147.5±0.36µs        ? ?/sec    1.00    135.9±1.57µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    248.0±0.50µs        ? ?/sec    1.00    248.6±2.70µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.17     86.6±0.26µs        ? ?/sec    1.00     73.8±0.21µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    125.0±0.20µs        ? ?/sec    1.00    125.5±0.20µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.19     70.0±0.58µs        ? ?/sec    1.00     58.8±0.34µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    106.1±0.32µs        ? ?/sec    1.00    106.1±0.35µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.21     67.0±0.39µs        ? ?/sec    1.00     55.5±0.60µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     93.3±0.65µs        ? ?/sec    1.00     93.8±1.24µs        ? ?/sec

@alamb
Contributor

alamb commented Jul 17, 2025

Still looking really good -- thank you @zhuqi-lucas for your suggestion

@zhuqi-lucas
Contributor

Thank you @alamb @jhorstmann , looks good from the latest performance!

@alamb alamb merged commit 233dad3 into apache:main Jul 18, 2025
17 checks passed
@alamb
Contributor

alamb commented Jul 18, 2025

Thanks again

@zhuqi-lucas
Contributor

I created a follow-up experiment for the word-level (u64) bit scanning improvement suggested above:

#7962

alamb added a commit that referenced this pull request Jul 29, 2025
…map scan (up to 30% faster) (#7962)

# Which issue does this PR close?

This PR is follow-up for:

#7937 

I want to measure the performance of word-level (u64) bit
scanning:

Details:


#7937 (review)

# Rationale for this change

Using word-level (u64) bit scanning

Use set_indices to implement this. Since we need u32 indices, I also added
set_indices_u32, which shows a 7% improvement compared to calling
set_indices and then casting to u32.

# What changes are included in this PR?

Using word-level (u64) bit scanning

Use set_indices to implement this. Since we need u32 indices, I also added
set_indices_u32, which shows a 7% improvement compared to calling
set_indices and then casting to u32.

# Are these changes tested?

Yes, added unit tests and fuzz tests; the existing sort fuzz tests also
cover this.


# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Optimize sort kernels partition_validity method
4 participants