Skip to content

PERF: lexsort_depth #47511

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
Closed

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Jun 26, 2022

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Currently, to find the lexsort_depth, is_lexsorted is repeatedly called.

This is unnecessary, it should be possible to keep track of the lexsort depth whilst looping through the elements. In particular, if the first level isn't lexsorted, and there are many levels, the repeatedly calling is_lexsorted is slow

Timing example:

import pstats, cProfile

import pandas as pd
import numpy as np

n = 200
df = pd.DataFrame(np.ones((3, n)))
df.loc[1, 1] = 2
df = df.set_index(df.columns.tolist()).copy()

def main():
    df.index._lexsort_depth


cProfile.runctx("main()", globals(), locals(), "Profile.prof")

s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats('time').print_stats()

On main:

         221516 function calls in 0.647 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20100    0.155    0.000    0.562    0.000 _dtype.py:328(_name_get)
    20100    0.102    0.000    0.264    0.000 numerictypes.py:356(issubdtype)
    40200    0.097    0.000    0.142    0.000 numerictypes.py:282(issubclass_)
    80400    0.087    0.000    0.087    0.000 {built-in method builtins.issubclass}
      200    0.083    0.000    0.644    0.003 {pandas._libs.algos.is_lexsorted}
    20100    0.063    0.000    0.327    0.000 _dtype.py:314(_name_includes_bit_suffix)
    20100    0.033    0.000    0.033    0.000 {method 'format' of 'str' objects}
    20100    0.026    0.000    0.026    0.000 _dtype.py:24(_kind_name)
        1    0.001    0.001    0.647    0.647 multi.py:3841(_lexsort_depth)
      200    0.001    0.000    0.001    0.000 {pandas._libs.algos.ensure_int64}

On this branch:

         2417 function calls in 0.008 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      200    0.002    0.000    0.006    0.000 _dtype.py:328(_name_get)
      200    0.001    0.000    0.003    0.000 numerictypes.py:356(issubdtype)
      400    0.001    0.000    0.001    0.000 numerictypes.py:282(issubclass_)
        1    0.001    0.001    0.007    0.007 {pandas._libs.algos.lexsort_depth}
      800    0.001    0.000    0.001    0.000 {built-in method builtins.issubclass}
      200    0.001    0.000    0.001    0.000 {pandas._libs.algos.ensure_int64}
      200    0.001    0.000    0.003    0.000 _dtype.py:314(_name_includes_bit_suffix)
      200    0.000    0.000    0.000    0.000 {method 'format' of 'str' objects}
        1    0.000    0.000    0.001    0.001 multi.py:3843(<listcomp>)
      200    0.000    0.000    0.000    0.000 _dtype.py:24(_kind_name)

Note that if the index's codes are already lexsorted (e.g. by commenting out df.loc[1, 1] = 2), this doesn't make a difference (0.008 seconds in both cases)

I think there's alraedy a whatsnew note that could cover this, but I'll check again later

EDIT might be better to use this instead of is_lexsorted to avoid duplication, will check this soon

@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 26, 2022 18:12
@MarcoGorelli MarcoGorelli added the Performance Memory or execution speed performance label Jun 26, 2022
assert arr.dtype.name == 'int64'
vecs[i] = <int64_t*>cnp.PyArray_DATA(arr)

# Assume uniqueness??
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a comment meaning "check if we should assume uniqueness" or more along the lines of "TODO: should we assume uniqueness?" if the latter, can you make that explicit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from is_lexsorted, but it's a comment which was originally added in 49675be back in 2011. I think it's intended as a todo (i.e. if some levels are duplicated, then maybe it's more efficient to not check each one separately), but should be fine to remove

# No need to check levels for which we know input isn't lexsorted.
break
cur = vecs[k][i]
pre = vecs[k][i -1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace around -

for k in range(nlevels, 0, -1):
if libalgos.is_lexsorted(int64_codes[:k]):
return k
return 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the perf boost come from moving the loop into cython or from using a different algorithm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

different algorithm - the main one first checks whether all levels are lexsorted, then checks whether the first
n_levels-1 are lexsorted, then checks whether the first n_levels-2 are lexsorted, etc. So it requires going
through the whole array many times. The algo in this branch just needs to go through the array once, keeping
track of the depth along the way, and exiting early if possible.

@jbrockmendel
Copy link
Member

Do we still need is_lexsorted?

@MarcoGorelli
Copy link
Member Author

Do we still need is_lexsorted?

is_lexsorted would still be more efficient if all we're checking is whether something is lexsorted, as opposed to
what the depth is
example:

In [11]: list_of_arrays = [np.ones((100_000,), dtype='int64') for _ in range(2)]

In [12]: list_of_arrays[1][1] = 2

In [13]: %%timeit
    ...: lexsort_depth(list_of_arrays)
    ...: 
    ...: 
655 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [14]: %%timeit
    ...: is_lexsorted(list_of_arrays)
    ...: 
    ...: 
15.2 µs ± 17 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

@jreback jreback added this to the 1.5 milestone Jul 3, 2022
@jreback
Copy link
Contributor

jreback commented Jul 3, 2022

does this help with any algos generally? e.g. try some indexing / multiindex benchs and see if any change

@MarcoGorelli MarcoGorelli marked this pull request as draft July 8, 2022 07:19
@MarcoGorelli
Copy link
Member Author

It makes a noticeable difference if there's a wide multiindex, the rest doesn't change much. ASV runs (with a wide multiindex benchmark added):

asv continuous -f 1.1 upstream/main 39d90f729b7b3b94bedccb2b1930b6d50f5d6827 -b ^multiindex_object

run 1:

       before           after         ratio
     [e915b0a4]       [39d90f72]
     <main>           <perf-lexsort-depth>
-        49.2±1μs         26.2±1μs     0.53  multiindex_object.GetLoc.time_wide_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

run 2:

       before           after         ratio
     [e915b0a4]       [39d90f72]
     <main>           <perf-lexsort-depth>
-        52.0±1μs       26.9±0.2μs     0.52  multiindex_object.GetLoc.time_wide_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

run 3:

       before           after         ratio
     [e915b0a4]       [39d90f72]
     <main>           <perf-lexsort-depth>
-      50.6±0.7μs       27.2±0.7μs     0.54  multiindex_object.GetLoc.time_wide_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Reckon it's worth it, or is it too little benefit to justify adding code?

@MarcoGorelli MarcoGorelli marked this pull request as ready for review July 10, 2022 09:23
@MarcoGorelli
Copy link
Member Author

Reckon it's worth it, or is it too little benefit to justify adding code?

TBH I think this is too little benefit - #47459 already improved is_lexsorted, and MultiIndexes don't typically have many levels, so this isn't worth adding extra code to have to maintain

Apologies for having taken up your time with this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants