performance regression in ewm.corr(pairwise=True) #17917

@grumpyquant

Problem description

The deprecation of Panel in the 0.20.x releases introduced a severe performance regression in ewm.corr(pairwise=True) in the common case where it is called on a long time series (e.g. a DataFrame with 1 million rows and 6 columns). The culprit is the concat call in the last 3 lines of this section of core/window.py:

                # TODO: not the most efficient (perf-wise)
                # though not bad code-wise
                from pandas import Panel, MultiIndex, concat

                with warnings.catch_warnings(record=True):
                    p = Panel.from_dict(results).swapaxes('items', 'major')
                    if len(p.major_axis) > 0:
                        p.major_axis = arg1.columns[p.major_axis]
                    if len(p.minor_axis) > 0:
                        p.minor_axis = arg2.columns[p.minor_axis]

                if len(p.items):
                    result = concat(
                        [p.iloc[i].T for i in range(len(p.items))],
                        keys=p.items)

The result is converted from a Panel to a DataFrame by building one small transposed DataFrame per item and concatenating them all. After the swapaxes call the items axis is the time axis, which is typically very long, so this is far slower for me than the 0.19 releases.
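For anyone who wants to measure this on their own install, here is a minimal timing sketch. The helper name time_pairwise_ewm_corr, the column names, the span and the random data are all made up for illustration; only DataFrame.ewm(...).corr(pairwise=True) is the call under discussion. Raise n_rows towards 1 million to approximate the case described above.

```python
import time

import numpy as np
import pandas as pd


def time_pairwise_ewm_corr(n_rows=100_000, n_cols=6, span=20, seed=0):
    """Time ewm.corr(pairwise=True) on random data (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    columns = ["c{}".format(i) for i in range(n_cols)]
    df = pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)

    start = time.perf_counter()
    result = df.ewm(span=span).corr(pairwise=True)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    result, elapsed = time_pairwise_ewm_corr()
    print("result shape:", result.shape)
    print("elapsed seconds: {:.3f}".format(elapsed))
```

Running the same script under 0.19 and 0.20.3 should show the difference, although the return type differs between those releases, so the shape line may need adjusting there.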

My solution was to replace the last 3 lines with:

                    result = DataFrame(
                        p.values.reshape((p.shape[0], p.shape[1]*p.shape[2])),
                        index=p.items,
                        columns=MultiIndex.from_product((arg1.columns, arg2.columns))
                    )
                    result = result.stack(dropna=False)

This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.
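One way to sanity-check the idea outside pandas internals is sketched below. It is not the code from core/window.py. A plain 3-D NumPy array with axes (time, arg1 columns, arg2 columns) stands in for the Panel, and the function names pairwise_concat and pairwise_reshape are hypothetical. The fast path also does the final rearrangement with a NumPy transpose rather than stack(dropna=False), so it only demonstrates that one reshape can produce the same layout as the per-item concat.

```python
import numpy as np
import pandas as pd


def pairwise_concat(values, times, cols1, cols2):
    """Slow path: one small transposed DataFrame per time point, then concat."""
    frames = [
        pd.DataFrame(values[i], index=cols1, columns=cols2).T
        for i in range(len(times))
    ]
    return pd.concat(frames, keys=times)


def pairwise_reshape(values, times, cols1, cols2):
    """Fast path: a single transpose and reshape of the whole 3-D block."""
    n_t, n1, n2 = values.shape
    # Move the arg2 axis next to the time axis so rows become (time, arg2 col).
    flat = values.transpose(0, 2, 1).reshape(n_t * n2, n1)
    index = pd.MultiIndex.from_product([times, cols2])
    return pd.DataFrame(flat, index=index, columns=cols1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    values = rng.standard_normal((4, 3, 2))  # (time, arg1 cols, arg2 cols)
    times = list(range(4))
    cols1 = pd.Index(["a", "b", "c"])
    cols2 = pd.Index(["x", "y"])

    slow = pairwise_concat(values, times, cols1, cols2)
    fast = pairwise_reshape(values, times, cols1, cols2)
    print(np.array_equal(slow.values, fast.values))
```

Both functions return a frame indexed by (time, arg2 column) with arg1 columns across the top. The slow path creates one DataFrame per time point, while the fast path creates exactly one.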

I'd really appreciate getting a fix in, though. Judging by the TODO comment, the developers were already aware of the performance cost when they added this code. I'm happy to rework the above if it has problems.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Numeric Operations (arithmetic, comparison, and logical operations); Performance (memory or execution speed performance)
