Align should consider the dtype when creating new columns #31874

Open
@TomAugspurger

Currently, when align has to create a column that exists in only one of the two objects, it creates a new float64 column filled with NaN, regardless of the dtype of the original column.

In [1]: import pandas as pd

In [2]: a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp('2000'), pd.NaT]})

In [3]: b = pd.DataFrame({"A": [1, 2]})

In [4]: a.align(b)[1].dtypes
Out[4]:
A      int64
B    float64
dtype: object

I think it'd be more useful for the dtype of each new column to match the dtype of the corresponding column in the other object.

# proposed behavior
In [4]: a.align(b)[1].dtypes
Out[4]:
A             int64
B    datetime64[ns]
dtype: object

The newly created B column has dtype datetime64[ns], the same as a.B.
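Until something like this exists, the proposed dtypes can be approximated after the fact. The following is only a sketch, not how align would implement it: align_keep_dtypes is a hypothetical helper name, and it assumes unique column names. It rebuilds each column that align created from an empty slice of the source column, so the source dtype is kept when that dtype can hold a missing value.

```python
import pandas as pd


def align_keep_dtypes(left, right):
    """Hypothetical helper: align two DataFrames, then rebuild the columns
    that align created so they take the dtype of the source column."""
    left_al, right_al = left.align(right)

    pairs = ((left, right, right_al), (right, left, left_al))
    for src, dst_orig, dst in pairs:
        # Columns present after alignment but absent from the original frame.
        for col in dst.columns.difference(dst_orig.columns):
            # Reindexing an empty slice gives an all-missing column that keeps
            # the source dtype whenever that dtype has an NA value (e.g. NaT).
            dst[col] = src[col].iloc[:0].reindex(dst.index)
    return left_al, right_al


a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp("2000"), pd.NaT]})
b = pd.DataFrame({"A": [1, 2]})

_, b_aligned = align_keep_dtypes(a, b)
print(b_aligned.dtypes)
```

Columns whose dtype cannot represent a missing value, such as int64, are still upcast by reindex, so this only changes the outcome for dtypes like datetime64 that have their own NA value.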

This proposal would make the fill_value keyword a bit more complex.

  1. The default of np.nan would change to None, which means "the right NA value for the dtype".
  2. We would maybe need to accept a Mapping so users could specify fill values per column.
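To make the two points above concrete, here is a rough sketch of how a fill value might be resolved for one new column. Everything in it is an assumption made for illustration: resolve_fill_value is a hypothetical function, not an existing pandas API, and the lookup rules are just one possible reading of "the right NA value for the dtype".

```python
from collections.abc import Mapping

import numpy as np
import pandas as pd


def resolve_fill_value(column, dtype, fill_value=None):
    """Hypothetical: pick the fill value for one newly created column."""
    # A Mapping supplies per-column values; columns it omits use the default.
    if isinstance(fill_value, Mapping):
        fill_value = fill_value.get(column)
    # An explicit scalar (or mapping entry) always wins.
    if fill_value is not None:
        return fill_value

    # None means "the right NA value for the dtype".
    dtype = pd.api.types.pandas_dtype(dtype)
    if isinstance(dtype, pd.api.extensions.ExtensionDtype):
        return dtype.na_value
    if dtype.kind in "mM":  # NumPy datetime64 / timedelta64
        return pd.NaT
    return np.nan


print(resolve_fill_value("B", "datetime64[ns]"))
print(resolve_fill_value("A", "int64", {"A": 0}))
```

One consequence of this reading is that None can no longer be used as a fill value itself, which seems acceptable since it is not a meaningful fill for most dtypes.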

I think this would make the workaround in #31679 unnecessary, as we'd have the correct dtype going into the operation.


If we think this is a good idea, it's probably an API-breaking change. We might be able to deprecate the current behavior cleanly by (ab)using fill_value: we would warn whenever align creates new columns and fill_value was not passed explicitly.

if new_columns and fill_value is no_default:
    warnings.warn(
        "Creating new float64 columns filled with NaN. In the future... "
        "Specify fill_value=None to accept the future behavior now.",
        FutureWarning,
    )
    fill_value = np.nan  # keep the current behavior until the default changes

Unfortunately, the warning would also fire implicitly, since binops align their operands in the background. I'm not sure how to get around that, aside from instructing users to align explicitly first.
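For reference, "explicitly align first" would look roughly like the following with today's API. The frames and the choice of fill_value=0 are made up for illustration; the point is only that the caller, rather than the binop, decides how new columns are filled.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
b = pd.DataFrame({"A": [10, 20]})

# Implicit alignment inside the binop: B is created in b as NaN.
implicit = a + b

# Explicit alignment: the fill value for new columns is chosen up front,
# so the binop itself no longer has to create any columns.
a_al, b_al = a.align(b, fill_value=0)
explicit = a_al + b_al

print(implicit)
print(explicit)
```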
