Description
Currently, when aligning two objects creates a new column, align creates a float64 column filled with NaNs, regardless of that column's dtype in the other object.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp('2000'), pd.NaT]})
In [3]: b = pd.DataFrame({"A": [1, 2]})
In [4]: a.align(b)[1].dtypes
Out[4]:
A      int64
B    float64
dtype: object
I think it'd be more useful for each new column to take its dtype from the corresponding column in the other object.
# proposed behavior
In [4]: a.align(b)[1].dtypes
Out[4]:
A             int64
B    datetime64[ns]
dtype: object
The newly created B column has dtype datetime64[ns], the same as a.B.
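Until something like this exists, the proposed result can be approximated on top of the current align. This is only a rough sketch: align_keep_dtypes is a hypothetical helper name, not a pandas API, and it relies on the fact that reindexing an empty slice of a column keeps that column's dtype.

```python
import pandas as pd


def align_keep_dtypes(left, right, **kwargs):
    """Hypothetical helper: like left.align(right), but columns created by
    the alignment take their dtype from the other frame where possible."""
    new_left, new_right = left.align(right, **kwargs)

    def fix(aligned, original, other):
        aligned = aligned.copy()
        for col in aligned.columns:
            if col not in original.columns and col in other.columns:
                # Reindexing an empty slice keeps the source dtype and fills
                # with its NA value (NaT for datetimes). Dtypes without an NA
                # value, such as int64, still come back as float64.
                aligned[col] = other[col].iloc[:0].reindex(aligned.index)
        return aligned

    return fix(new_left, left, right), fix(new_right, right, left)


a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp("2000"), pd.NaT]})
b = pd.DataFrame({"A": [1, 2]})
print(align_keep_dtypes(a, b)[1].dtypes)
```

With the frames from the example, the new B column in the second result should have the same dtype as a.B and contain only NaT.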
This proposal would make the fill_value keyword a bit more complex.
- The default of np.nan would change to None, which means "the right NA value for the dtype".
- We might need to accept a Mapping so users could specify fill values per column.
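To make the two bullets concrete, here is one way the resolution could work. Everything here is an assumption for illustration: resolve_fill_value and its signature are hypothetical, and the real lookup inside pandas may well look different.

```python
from collections.abc import Mapping

import numpy as np
import pandas as pd


def resolve_fill_value(column, dtype, fill_value=None):
    """Hypothetical: pick the fill value for a newly created column."""
    if isinstance(fill_value, Mapping):
        # Per-column override; unlisted columns fall back to the dtype's NA.
        fill_value = fill_value.get(column)
    if fill_value is not None:
        return fill_value
    dtype = pd.api.types.pandas_dtype(dtype)
    if isinstance(dtype, pd.api.extensions.ExtensionDtype):
        # Extension dtypes declare their own NA value.
        return dtype.na_value
    if dtype.kind in "mM":
        # datetime64 and timedelta64 use NaT.
        return pd.NaT
    # Everything else gets NaN, so int64 would still be upcast to float64.
    return np.nan


print(resolve_fill_value("B", "datetime64[ns]"))
print(resolve_fill_value("A", "int64", {"A": 0}))
```

One caveat with this sketch: since None means "use the dtype's NA value", a Mapping cannot be used to request None itself as a fill value.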
I think this would make the workaround in #31679 unnecessary, as we'd have the correct dtype going into the operation.
If we think this is a good idea, it's probably an API breaking change. We might be able to deprecate this cleanly by (ab)using fill_value. We would warn when creating new columns.
if new_columns and fill_value is no_default:
    warnings.warn(
        "Creating new float64 columns filled with NaN. In the future... "
        "Specify fill_value=None to accept the future behavior now.",
        FutureWarning,
    )
    fill_value = np.nan  # keep the current behavior during the deprecation
Unfortunately, the warning would also fire during binops, where alignment happens in the background. I'm not sure how to get around that, aside from instructing users to explicitly align first.
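For reference, a self-contained version of the warning snippet above might look like the following. The names are placeholders: no_default here is a local sentinel standing in for whatever marker align would use, and resolve_align_fill_value is a hypothetical function, not existing pandas code.

```python
import warnings

import numpy as np

# Hypothetical sentinel meaning "the user did not pass fill_value".
no_default = object()


def resolve_align_fill_value(new_columns, fill_value=no_default):
    """Hypothetical: choose the fill value during the deprecation period."""
    if fill_value is no_default:
        if new_columns:
            # Only warn when the alignment actually creates columns.
            warnings.warn(
                "Creating new float64 columns filled with NaN. In the future... "
                "Specify fill_value=None to accept the future behavior now.",
                FutureWarning,
                stacklevel=2,
            )
        # Keep the current behavior while the deprecation is in effect.
        return np.nan
    # An explicit value (including None) is passed through without warning.
    return fill_value
```

Passing fill_value explicitly, either np.nan or None, skips the warning, which is the escape hatch users would need when the alignment happens inside a binop.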