-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
PERF: lexsort_depth #47511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: lexsort_depth #47511
Conversation
pandas/_libs/algos.pyx
Outdated
assert arr.dtype.name == 'int64' | ||
vecs[i] = <int64_t*>cnp.PyArray_DATA(arr) | ||
|
||
# Assume uniqueness?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a comment meaning "check if we should assume uniqueness" or more along the lines of "TODO: should we assume uniqueness?" if the latter, can you make that explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from is_lexsorted
, but it's a comment which was originally added in 49675be back in 2011. I think it's intended as a todo (i.e. if some levels are duplicated, then maybe it's more efficient to not check each one separately), but should be fine to remove
pandas/_libs/algos.pyx
Outdated
# No need to check levels for which we know input isn't lexsorted. | ||
break | ||
cur = vecs[k][i] | ||
pre = vecs[k][i -1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace around -
for k in range(nlevels, 0, -1): | ||
if libalgos.is_lexsorted(int64_codes[:k]): | ||
return k | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the perf boost come from moving the loop into cython or from using a different algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
different algorithm - the main
one first checks whether all levels are lexsorted, then checks whether the first
n_levels-1 are lexsorted, then checks whether the first n_levels-2 are lexsorted, etc. So it requires going
through the whole array many times. The algo in this branch just needs to go through the array once, keeping
track of the depth along the way, and exiting early if possible.
Do we still need is_lexsorted? |
is_lexsorted would still be more efficient if all we're checking is whether something is lexsorted, as opposed to
|
does this help with any algos generally? e.g. try some indexing / multiindex benchs and see if any change |
It makes a noticeable difference if there's a wide multiindex, the rest doesn't change much. ASV runs (with a wide multiindex benchmark added):
run 1:
run 2:
run 3:
Reckon it's worth it, or is it too little benefit to justify adding code? |
TBH I think this is too little benefit - #47459 already improved Apologies for having taken up your time with this |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Currently, to find the lexsort_depth,
is_lexsorted
is repeatedly called.This is unnecessary, it should be possible to keep track of the lexsort depth whilst looping through the elements. In particular, if the first level isn't lexsorted, and there are many levels, the repeatedly calling
is_lexsorted
is slowTiming example:
On main:
On this branch:
Note that if the index's codes are already lexsorted (e.g. by commenting out
df.loc[1, 1] = 2
), this doesn't make a difference (0.008 seconds
in both cases)I think there's alraedy a whatsnew note that could cover this, but I'll check again later
EDIT might be better to use this instead of is_lexsorted to avoid duplication, will check this soon