Extreme performance issue in pandas 1.0.3 when setting a new column with DatetimeIndex #34531

Closed
@derHeinzer

In pandas 1.0.3, when a column is added to a DataFrame whose MultiIndex has a level with a datetime-like dtype, and the index of the values being set is not identical to the frame's index, the values are explicitly cast to object dtype in multi.py. Those object-typed values are later converted back to Timestamps, which takes a long time for large DataFrames.
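The precondition described above (the two indexes are not identical, so the assignment has to align the new values to the frame's index) can be seen on a tiny frame. This is only a sketch: the sizes and the name `subset` are illustrative assumptions, and it shows the alignment behaviour, not the internal cast in multi.py.

```python
import numpy as np
import pandas as pd

# Scaled-down version of the setup in this report: 3 ids x 4 dates
# (sizes chosen for illustration only).
iterables = [range(3), pd.date_range("2020-01-01", periods=4)]
idx = pd.MultiIndex.from_product(iterables, names=["id", "date"])
df = pd.DataFrame({"value": np.arange(12, dtype=float)}, index=idx)

# Drop the first date of each id so the two indexes are no longer identical.
subset = df[df.index.get_level_values("date") != pd.Timestamp("2020-01-01")]

# The assignment aligns subset["value"] to df.index; index entries that are
# missing from the subset end up as NaN in the new column.
df["new_col"] = subset["value"]

n_missing = int(df["new_col"].isna().sum())
print(n_missing)  # one missing row per id
```

When the indexes are identical, no realignment is needed, which is consistent with the report that the slowdown only appears for non-identical indexes.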

Comparing pandas 0.22.0 and 1.0.3 gives 0.124 seconds vs. 35.274 seconds on my machine with the following reproducible setup:
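For a quick wall-clock comparison between pandas versions without a profiler, something like the following could be used. It is a sketch under assumptions: the helper name `time_misaligned_setitem` and the reduced sizes are hypothetical, and the numbers it prints will differ from the ones quoted above.

```python
import time

import numpy as np
import pandas as pd


def time_misaligned_setitem(n_ids: int, n_dates: int) -> float:
    """Return the seconds spent assigning a column whose index differs from the frame's."""
    iterables = [range(n_ids), pd.date_range("2020-01-01", periods=n_dates)]
    idx = pd.MultiIndex.from_product(iterables, names=["id", "date"])
    df = pd.DataFrame({"value": np.random.randn(n_ids * n_dates)}, index=idx)
    # Drop the first date of each id so the indexes are not identical.
    subset = df[df.index.get_level_values("date") != pd.Timestamp("2020-01-01")]

    start = time.perf_counter()
    df["new_col"] = subset["value"]  # only the assignment is timed
    return time.perf_counter() - start


# Hypothetical, scaled-down sizes; increase them to approach the reported setup.
elapsed = time_misaligned_setitem(n_ids=200, n_dates=50)
print(f"pandas {pd.__version__}: {elapsed:.3f} s")
```

Running the same script in two environments with different pandas versions gives a rough side-by-side figure.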

Build reproducible setup

import numpy as np
import pandas as pd

# 10000 ids x 200 dates = 2,000,000 rows
iterables = [range(10000), pd.date_range('2020-01-01', periods=200)]
idx = pd.MultiIndex.from_product(iterables, names=['id', 'date'])
df = pd.DataFrame(data=np.random.randn(10000 * 200), index=idx, columns=["value"])
new_col = df[df.index.get_level_values(1) != pd.to_datetime('2020-01-01')] # drop first record of each id
print(df.shape, new_col.shape)

Profile performance of the column assignment

import cProfile
pr = cProfile.Profile()
pr.enable()
df['new_col'] = new_col['value']
pr.disable()
pr.print_stats(sort="cumulative")  # same ordering as the legacy sort=2
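The full profile output is long. A possible variant, assuming a scaled-down frame so it runs quickly, routes the profile through `pstats` and prints only the ten entries with the highest cumulative time. The sizes and the names `buffer` and `report` are illustrative assumptions.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Scaled-down setup (100 ids x 20 dates) so the sketch finishes quickly.
iterables = [range(100), pd.date_range("2020-01-01", periods=20)]
idx = pd.MultiIndex.from_product(iterables, names=["id", "date"])
df = pd.DataFrame(data=np.random.randn(100 * 20), index=idx, columns=["value"])
new_col = df[df.index.get_level_values(1) != pd.to_datetime("2020-01-01")]

# Profile only the column assignment.
pr = cProfile.Profile()
pr.enable()
df["new_col"] = new_col["value"]
pr.disable()

# Sort by cumulative time and keep the top 10 entries.
buffer = io.StringIO()
stats = pstats.Stats(pr, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the outermost expensive calls first, which makes it easier to see where in the assignment path the time goes.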

Labels: Benchmark, Datetime, Indexing, MultiIndex, good first issue
