From 1fb0adb5a7b1ba909fb34a19b1d27c7eccdf4119 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Thu, 22 Dec 2022 21:31:28 +0000 Subject: [PATCH 01/27] [skip ci] pdep6 draft --- web/pandas/pdeps/0006-ban-upcasting.md | 140 +++++++++++++++++++++++++ 1 file changed, 140 insertions(+) create mode 100644 web/pandas/pdeps/0006-ban-upcasting.md diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md new file mode 100644 index 0000000000000..b209d49a281d1 --- /dev/null +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -0,0 +1,140 @@ +# PDEP-6: Ban upcasting in setitem-like operations + +- Created: 23 December 2022 +- Status: Draft +- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402) +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche)) +- Revision: 1 + +## Abstract + +The suggestion is that setitem-like operations would +not change a ``Series``' dtype. + +Current behaviour: +```python +In [1]: ser = pd.Series([1, 2, 3], dtype='int64') + +In [2]: ser[2] = 'potage' + +In [3]: ser # dtype changed to 'object'! +Out[3]: +0 1 +1 2 +2 potage +dtype: object +``` + +Suggested behaviour: + +```python +In [1]: ser = pd.Series([1, 2, 3]) + +In [2]: ser[2] = 'potage' # raises! +--------------------------------------------------------------------------- +TypeError: Invalid value 'potage' for dtype int64 +``` + +## Motivation and Scope + +Currently, pandas is extremely flexible in handling different dtypes. +However, this can potentially hide bugs, break user expectations, and unnecessarily copy data. 
+ +An example of it hiding a bug is: +```python +In [9]: ser = pd.Series(pd.date_range('2000', periods=3)) + +In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64 + +In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object +``` + +The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as: +- ``ser[0] == 2``; +- ``ser.fillna(0, inplace=True)``; +- ``ser.where(ser.isna(), 0, inplace=True)`` + +There may be more. What is explicitly excluded from this PDEP is any operation would have no change +of operating inplace to begin with, such as: +- ``ser.diff()``; +- ``pd.concat([ser, other])``; +- ``ser.mean()``. + +These would keep being allowed to change Series' dtypes. + +## Detailed description + +Concretely, the suggestion is: +- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype; +- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise. + +For a start, this would involve: + +1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in + + ```python + value = extract_array(value, extract_numpy=True) + try: + casted = np_can_hold_element(values.dtype, value) + except LossySetitemError: + # current dtype cannot store value, coerce to common dtype + nb = self.coerce_to_target_dtype(value) + return nb.setitem(indexer, value) + else: + ``` + +2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places. + +The above would already require several hundreds of tests to be adjusted. + +### Ban upcasting altogether, or just upcasting to ``object``? + +The trickiest part of this proposal concerns what to do when setting a float in an integer column: + +```python +In [1]: ser = pd.Series([1, 2, 3]) + +In [2]: ser[0] = 1.5 +``` + +Possibly options could be: +1. just raise; +2. 
convert the float value to ``int``, preserve the Series' dtype; +3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions. + +Let us compare with what other libraries do: +- ``numpy``: option 2 +- ``cudf``: option 2 +- ``polars``: option 2 +- ``R data.frame``: option 3 +- ``pandas`` (nullable dtype): option 1 + +If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable: +someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. + +Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity +to the codebase and to tests, and would be confusing to teach. + +Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would +also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach: +"if you try to set an element of a ``Series`` to a new value, then that value must be compatible +with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for +``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule +starts to become confusing. + +## Usage and Impact + +This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs. + +Unfortunately, it would also risk annoy users who might have been intentionally upcasting. + +Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call, +I think it would be more beneficial to the community at large to err on the side of strictness. + +## Timeline + +Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0. 
+ +### PDEP History + +- 23 December 2022: Initial draft From 5456787378e41590bc72379689014e0916cdcc45 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Sat, 24 Dec 2022 07:47:26 +0000 Subject: [PATCH 02/27] [skip ci] reword --- web/pandas/pdeps/0006-ban-upcasting.md | 37 +++++++++++++++----------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index b209d49a281d1..fed9b185ac8fd 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -38,7 +38,8 @@ TypeError: Invalid value 'potage' for dtype int64 ## Motivation and Scope Currently, pandas is extremely flexible in handling different dtypes. -However, this can potentially hide bugs, break user expectations, and unnecessarily copy data. +However, this can potentially hide bugs, break user expectations, and copy data +in what looks like it should be an inplace operation. An example of it hiding a bug is: ```python @@ -97,30 +98,34 @@ In [1]: ser = pd.Series([1, 2, 3]) In [2]: ser[0] = 1.5 ``` +This isn't necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being +numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer. + Possibly options could be: -1. just raise; -2. convert the float value to ``int``, preserve the Series' dtype; -3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions. +1. just raise, forcing users to be explicit; +2. convert the float value to ``int`` before setting it; +3. limit "banning upcasting" to when the upcasted dtype is ``object``. 
Let us compare with what other libraries do: - ``numpy``: option 2 - ``cudf``: option 2 - ``polars``: option 2 -- ``R data.frame``: option 3 -- ``pandas`` (nullable dtype): option 1 +- ``R data.frame``: just upcasts (like pandas does now for non-nullable dtypes); +- ``pandas`` (nullable dtypes): option 1 +- ``datatable``: option 1 -If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable: +Option ``2`` would be a breaking behaviour change in pandas. Further, +if the objective of this PDEP is to prevent bugs, then this is also not desirable: someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. -Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity -to the codebase and to tests, and would be confusing to teach. +Option ``3`` would be inconsistent with the nullable dtypes' behaviour. It would also add +complexity to the codebase and to tests. It would be hard to teach, as instead of +being able to teach a simple rule, there would be a rule with exceptions. Finally, it opens +the door to other exceptions, such as not upcasting to ``'int16'`` when trying to set an +element of a ``'int8'`` ``Series`` to ``128``. -Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would -also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach: -"if you try to set an element of a ``Series`` to a new value, then that value must be compatible -with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for -``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule -starts to become confusing. +Option ``1`` is the maximally safe one in terms of protecting users from bugs, being +consistent with the current behaviour of nullable dtypes, and in being simple to teach. 
## Usage and Impact @@ -128,7 +133,7 @@ This would make pandas stricter, so there should not be any risk of introducing Unfortunately, it would also risk annoy users who might have been intentionally upcasting. -Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call, +Given that users can get around this as simply as with a ``.astype({'my_column': float})`` call, I think it would be more beneficial to the community at large to err on the side of strictness. ## Timeline From 02ff7354e19f5bfdff59e247eeeb0600a74af245 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Sat, 24 Dec 2022 07:52:24 +0000 Subject: [PATCH 03/27] [skip ci] compare with DataFrames.jl --- web/pandas/pdeps/0006-ban-upcasting.md | 1 + 1 file changed, 1 insertion(+) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index fed9b185ac8fd..5ab74e720e6de 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -113,6 +113,7 @@ Let us compare with what other libraries do: - ``R data.frame``: just upcasts (like pandas does now for non-nullable dtypes); - ``pandas`` (nullable dtypes): option 1 - ``datatable``: option 1 +- ``DataFrames.jl``: option 1 Option ``2`` would be a breaking behaviour change in pandas. 
Further, if the objective of this PDEP is to prevent bugs, then this is also not desirable: From e3cc381e396e6fcf9085ac5d1d17df4b40ec89dd Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Sat, 24 Dec 2022 17:22:13 +0000 Subject: [PATCH 04/27] [skip ci] note about loss of precision --- web/pandas/pdeps/0006-ban-upcasting.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 5ab74e720e6de..769aa311e36be 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -119,11 +119,14 @@ Option ``2`` would be a breaking behaviour change in pandas. Further, if the objective of this PDEP is to prevent bugs, then this is also not desirable: someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. -Option ``3`` would be inconsistent with the nullable dtypes' behaviour. It would also add -complexity to the codebase and to tests. It would be hard to teach, as instead of -being able to teach a simple rule, there would be a rule with exceptions. Finally, it opens -the door to other exceptions, such as not upcasting to ``'int16'`` when trying to set an -element of a ``'int8'`` ``Series`` to ``128``. +There are several downsides to option ``3``: +- it would be inconsistent with the nullable dtypes' behaviour; +- it would also add complexity to the codebase and to tests; +- it would be hard to teach, as instead of being able to teach a simple rule, + there would be a rule with exceptions; +- there would be a risk of loss of precision; +- it opens the door to other exceptions, such as not upcasting to ``'int16'`` + when trying to set an element of a ``'int8'`` ``Series`` to ``128``. Option ``1`` is the maximally safe one in terms of protecting users from bugs, being consistent with the current behaviour of nullable dtypes, and in being simple to teach. 
From f6298e918818ff8bf844b417d743a352e6aeed16 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 30 Dec 2022 09:03:41 +0000 Subject: [PATCH 05/27] [skip ci] add examples of operations which would raise --- web/pandas/pdeps/0006-ban-upcasting.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 769aa311e36be..3fa3bd2c7d0d0 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -51,17 +51,24 @@ In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't ``` The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as: -- ``ser[0] == 2``; -- ``ser.fillna(0, inplace=True)``; -- ``ser.where(ser.isna(), 0, inplace=True)`` +- ``ser[0] = 2.5``; +- ``ser.fillna('foo', inplace=True)``; +- ``ser.where(ser.isna(), 'foo', inplace=True)`` +- ``ser.iloc[0] = 2.5`` +- ``ser.loc[0] = 2.5`` +- ``ser[:] = 2.5`` There may be more. What is explicitly excluded from this PDEP is any operation would have no change of operating inplace to begin with, such as: - ``ser.diff()``; - ``pd.concat([ser, other])``; -- ``ser.mean()``. +- ``ser.mean()``; +- ``df.loc[0, 'col1'] = 2.5``. -These would keep being allowed to change Series' dtypes. +These would keep being allowed to change Series' dtypes. Note that setting element of a column of a +``DataFrame`` would not raise, as that sets the elements in a new block manager (rather than in the +original one), +see https://github.com/pandas-dev/pandas/blob/4e4be0bfa8f74b9d453aa4163d95660c04ffea0c/pandas/core/internals/managers.py#L1361-L1362. ## Detailed description @@ -84,7 +91,7 @@ For a start, this would involve: else: ``` -2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places. +2. 
making a similar change in ``Block.where``, ``Block.putmask``, and likewise for ``EABlock`` (and possibly in more places). The above would already require several hundreds of tests to be adjusted. From dffef424dc0e1cf83bd55d31a947610e7a498263 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 30 Dec 2022 10:37:42 +0000 Subject: [PATCH 06/27] [skip ci] note about DataFrame.__setitem__ --- web/pandas/pdeps/0006-ban-upcasting.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 3fa3bd2c7d0d0..861e68134e9e7 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -63,7 +63,7 @@ of operating inplace to begin with, such as: - ``ser.diff()``; - ``pd.concat([ser, other])``; - ``ser.mean()``; -- ``df.loc[0, 'col1'] = 2.5``. +- ``df.loc[0, 'col1'] = 2.5`` (if ``df`` is not a single block). These would keep being allowed to change Series' dtypes. Note that setting element of a column of a ``DataFrame`` would not raise, as that sets the elements in a new block manager (rather than in the From 9fa767591f4791a757b1a9f12af9a9b25f7597f0 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 30 Dec 2022 15:49:24 +0000 Subject: [PATCH 07/27] [skip ci] notes about dataframe case --- web/pandas/pdeps/0006-ban-upcasting.md | 31 ++++++++++++++------------ 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 861e68134e9e7..2363b2847c833 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -50,25 +50,28 @@ In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64 In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object ``` -The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as: -- ``ser[0] = 
2.5``; +The scope of this PDEP is limited to setitem-like operations on Series. +For example, starting with +```python +df = DataFrame({'a': [1, 2, np.nan], 'b': [4, 5, 6]}) +ser = df['a'].copy() +``` +then the following would all raise: +- ``ser[0] = 'foo'``; - ``ser.fillna('foo', inplace=True)``; - ``ser.where(ser.isna(), 'foo', inplace=True)`` -- ``ser.iloc[0] = 2.5`` -- ``ser.loc[0] = 2.5`` -- ``ser[:] = 2.5`` +- ``ser.iloc[0] = 'foo'`` +- ``ser.loc[0] = 'foo'`` +- ``df.loc[0, 'a'] = 'foo'`` -There may be more. What is explicitly excluded from this PDEP is any operation would have no change -of operating inplace to begin with, such as: +Examples of operations which would not raise are: - ``ser.diff()``; -- ``pd.concat([ser, other])``; +- ``pd.concat([ser, ser.astype(object)])``; - ``ser.mean()``; -- ``df.loc[0, 'col1'] = 2.5`` (if ``df`` is not a single block). +- ``df.loc[:, 'a'] = 'foo'`` (debatable, as is the one below) +- ``ser[:] = 'foo'`` -These would keep being allowed to change Series' dtypes. Note that setting element of a column of a -``DataFrame`` would not raise, as that sets the elements in a new block manager (rather than in the -original one), -see https://github.com/pandas-dev/pandas/blob/4e4be0bfa8f74b9d453aa4163d95660c04ffea0c/pandas/core/internals/managers.py#L1361-L1362. +These would keep being allowed to change Series' dtypes. ## Detailed description @@ -91,7 +94,7 @@ For a start, this would involve: else: ``` -2. making a similar change in ``Block.where``, ``Block.putmask``, and likewise for ``EABlock`` (and possibly in more places). +2. making a similar change in ``Block.where``, ``Block.putmask``, ``EABackedBlock.where``, and ``EABackedBlock.putmask``. The above would already require several hundreds of tests to be adjusted. 
From 2ce6ff02160f705be3b5cdfff3266247bc625f42 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Tue, 3 Jan 2023 14:51:25 +0000 Subject: [PATCH 08/27] [skip ci] remove special-casing of full slice --- web/pandas/pdeps/0006-ban-upcasting.md | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 2363b2847c833..595b1b8646e64 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -60,18 +60,20 @@ then the following would all raise: - ``ser[0] = 'foo'``; - ``ser.fillna('foo', inplace=True)``; - ``ser.where(ser.isna(), 'foo', inplace=True)`` +- ``ser.fillna('foo', inplace=False)``; +- ``ser.where(ser.isna(), 'foo', inplace=False)`` - ``ser.iloc[0] = 'foo'`` - ``ser.loc[0] = 'foo'`` - ``df.loc[0, 'a'] = 'foo'`` +- ``df.loc[:, 'a'] = 'foo'`` (debatable, as is the one below) +- ``ser[:] = 'foo'`` Examples of operations which would not raise are: - ``ser.diff()``; - ``pd.concat([ser, ser.astype(object)])``; - ``ser.mean()``; -- ``df.loc[:, 'a'] = 'foo'`` (debatable, as is the one below) -- ``ser[:] = 'foo'`` - -These would keep being allowed to change Series' dtypes. +- ``ser[0] = 3.``; +- ``ser[0] = 3``. ## Detailed description @@ -94,7 +96,13 @@ For a start, this would involve: else: ``` -2. making a similar change in ``Block.where``, ``Block.putmask``, ``EABackedBlock.where``, and ``EABackedBlock.putmask``. +2. making a similar change in: + - ``Block.where``; + - ``Block.putmask``; + - ``EABackedBlock.setitem``; + - ``EABackedBlock.where``; + - ``EABackedBlock.putmask``; + - ``_iLocIndexer._setitem_single_column``; The above would already require several hundreds of tests to be adjusted. 
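The change to ``Block.setitem`` described in this patch can be pictured with a toy model. This is only a sketch in plain Python, not pandas internals: ``ToyBlock``, ``can_hold`` and the ``strict`` flag are hypothetical names invented for illustration, a Python ``list`` and a ``type`` stand in for the block's values and dtype, and the exception raised in strict mode is a placeholder.

```python
class LossySetitemError(Exception):
    """Toy stand-in for the error raised when a value does not fit the dtype."""


def can_hold(dtype, value):
    # Hypothetical check: return the value if it fits `dtype`, otherwise raise.
    if not isinstance(value, dtype):
        raise LossySetitemError(f"{value!r} does not fit {dtype.__name__}")
    return value


class ToyBlock:
    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    def setitem(self, index, value, strict):
        try:
            casted = can_hold(self.dtype, value)
        except LossySetitemError:
            if strict:
                # Proposed behaviour: refuse instead of changing the dtype.
                raise ValueError(
                    f"Invalid value {value!r} for dtype {self.dtype.__name__}"
                ) from None
            # Current behaviour: fall back to a common dtype (here: object).
            self.dtype = object
            casted = value
        self.values[index] = casted
        return self


# With the fallback in place, the dtype changes silently.
legacy = ToyBlock([1, 2, 3], int).setitem(2, "potage", strict=False)
print(legacy.dtype.__name__, legacy.values)

# Without the fallback, the same assignment raises and nothing is modified.
proposed = ToyBlock([1, 2, 3], int)
try:
    proposed.setitem(2, "potage", strict=True)
except ValueError as err:
    print(err)
```

In this toy, dropping the ``except`` fallback is what turns a silent dtype change into an error. The same idea would apply to the ``where`` and ``putmask`` variants listed above.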
From 46268315d0f2f8994a9c098cd0a8ad774762b4ff Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Tue, 3 Jan 2023 15:42:29 +0000 Subject: [PATCH 09/27] [skip ci] minor fixups --- web/pandas/pdeps/0006-ban-upcasting.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 595b1b8646e64..ef4e89e89482e 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -65,8 +65,8 @@ then the following would all raise: - ``ser.iloc[0] = 'foo'`` - ``ser.loc[0] = 'foo'`` - ``df.loc[0, 'a'] = 'foo'`` -- ``df.loc[:, 'a'] = 'foo'`` (debatable, as is the one below) -- ``ser[:] = 'foo'`` +- ``df.loc[:, 'a'] = 'foo'`` +- ``ser[:] = 'foo'``. Examples of operations which would not raise are: - ``ser.diff()``; @@ -119,7 +119,7 @@ In [2]: ser[0] = 1.5 This isn't necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer. -Possibly options could be: +Possible options could be: 1. just raise, forcing users to be explicit; 2. convert the float value to ``int`` before setting it; 3. limit "banning upcasting" to when the upcasted dtype is ``object``. @@ -143,8 +143,7 @@ There are several downsides to option ``3``: - it would be hard to teach, as instead of being able to teach a simple rule, there would be a rule with exceptions; - there would be a risk of loss of precision; -- it opens the door to other exceptions, such as not upcasting to ``'int16'`` - when trying to set an element of a ``'int8'`` ``Series`` to ``128``. +- it opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``. Option ``1`` is the maximally safe one in terms of protecting users from bugs, being consistent with the current behaviour of nullable dtypes, and in being simple to teach. 
From 1217a4e3c5231f5c8132d138893a54dcb92590c0 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Tue, 3 Jan 2023 16:15:57 +0000 Subject: [PATCH 10/27] [skip ci] add examples with boolean masks --- web/pandas/pdeps/0006-ban-upcasting.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index ef4e89e89482e..2193230c55589 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -43,18 +43,20 @@ in what looks like it should be an inplace operation. An example of it hiding a bug is: ```python -In [9]: ser = pd.Series(pd.date_range('2000', periods=3)) +In[9]: ser = pd.Series(pd.date_range("2000", periods=3)) -In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64 +In[10]: ser[2] = "2000-01-04" # works, is converted to datetime64 -In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object +In[11]: ser[ + 2 +] = "2000-01-04x" # almost certainly a typo - but pandas doesn't error, it upcasts to object ``` The scope of this PDEP is limited to setitem-like operations on Series. For example, starting with ```python -df = DataFrame({'a': [1, 2, np.nan], 'b': [4, 5, 6]}) -ser = df['a'].copy() +df = DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]}) +ser = df["a"].copy() ``` then the following would all raise: - ``ser[0] = 'foo'``; @@ -66,7 +68,9 @@ then the following would all raise: - ``ser.loc[0] = 'foo'`` - ``df.loc[0, 'a'] = 'foo'`` - ``df.loc[:, 'a'] = 'foo'`` -- ``ser[:] = 'foo'``. +- ``ser[:] = 'foo'``; +- ``df.loc[[True, False, True], 'a'] = 'foo'`` +- ``ser[[True, False, True]] = 'foo'``. Examples of operations which would not raise are: - ``ser.diff()``; @@ -111,9 +115,9 @@ The above would already require several hundreds of tests to be adjusted. 
The trickiest part of this proposal concerns what to do when setting a float in an integer column: ```python -In [1]: ser = pd.Series([1, 2, 3]) +In[1]: ser = pd.Series([1, 2, 3]) -In [2]: ser[0] = 1.5 +In[2]: ser[0] = 1.5 ``` This isn't necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being From a930df1ba80d7795640baa64b5d25e91010f58f0 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 17 Mar 2023 09:16:17 +0000 Subject: [PATCH 11/27] use more generic indexer in example, clarify the enlargement is out of scope --- web/pandas/pdeps/0006-ban-upcasting.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 2193230c55589..29b7a363599b8 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -1,7 +1,7 @@ # PDEP-6: Ban upcasting in setitem-like operations - Created: 23 December 2022 -- Status: Draft +- Status: Under discussion - Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402) - Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche)) - Revision: 1 @@ -47,9 +47,7 @@ In[9]: ser = pd.Series(pd.date_range("2000", periods=3)) In[10]: ser[2] = "2000-01-04" # works, is converted to datetime64 -In[11]: ser[ - 2 -] = "2000-01-04x" # almost certainly a typo - but pandas doesn't error, it upcasts to object +In[11]: ser[2] = "2000-01-04x" # typo - but pandas doesn't error, it upcasts to object ``` The scope of this PDEP is limited to setitem-like operations on Series. 
@@ -66,11 +64,8 @@ then the following would all raise: - ``ser.where(ser.isna(), 'foo', inplace=False)`` - ``ser.iloc[0] = 'foo'`` - ``ser.loc[0] = 'foo'`` -- ``df.loc[0, 'a'] = 'foo'`` -- ``df.loc[:, 'a'] = 'foo'`` -- ``ser[:] = 'foo'``; -- ``df.loc[[True, False, True], 'a'] = 'foo'`` -- ``ser[[True, False, True]] = 'foo'``. +- ``df.loc[indexer, 'a'] = 'foo'`` +- ``ser[indexer] = 'foo'``; Examples of operations which would not raise are: - ``ser.diff()``; @@ -161,6 +156,16 @@ Unfortunately, it would also risk annoy users who might have been intentionally Given that users can get around this as simply as with a ``.astype({'my_column': float})`` call, I think it would be more beneficial to the community at large to err on the side of strictness. +## Out of scope + +Enlargement. For example: +```python +ser = pd.Series([1, 2, 3]) +ser[len(ser)] = 4.5 +``` +There is arguably a larger conversation to be had about whether that should be allowed +at all. To keep this proposal focused, it is intentionally excluded from the scope. + ## Timeline Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0. 
From 02bab00d5077044443387c171d303ce131baab6d Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Mon, 20 Mar 2023 17:14:46 +0000 Subject: [PATCH 12/27] dont call workaround "easy" --- web/pandas/pdeps/0006-ban-upcasting.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 29b7a363599b8..ce342cba2882e 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -47,7 +47,7 @@ In[9]: ser = pd.Series(pd.date_range("2000", periods=3)) In[10]: ser[2] = "2000-01-04" # works, is converted to datetime64 -In[11]: ser[2] = "2000-01-04x" # typo - but pandas doesn't error, it upcasts to object +In[11]: ser[2] = "2000-01-04x" # typo - but pandas does not error, it upcasts to object ``` The scope of this PDEP is limited to setitem-like operations on Series. @@ -82,7 +82,7 @@ Concretely, the suggestion is: For a start, this would involve: -1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in +1. changing ``Block.setitem`` such that it does not have an ``except`` block in ```python value = extract_array(value, extract_numpy=True) @@ -115,7 +115,7 @@ In[1]: ser = pd.Series([1, 2, 3]) In[2]: ser[0] = 1.5 ``` -This isn't necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being +This is not necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer. Possible options could be: @@ -153,8 +153,8 @@ This would make pandas stricter, so there should not be any risk of introducing Unfortunately, it would also risk annoy users who might have been intentionally upcasting. 
-Given that users can get around this as simply as with a ``.astype({'my_column': float})`` call, -I think it would be more beneficial to the community at large to err on the side of strictness. +Given that users could still get the current behaviour by first explicitly casting to float, +it would be more beneficial to the community at large to err on the side of strictness. ## Out of scope From 875cf4c53313a8b0aca72766c25198b35622314c Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Mon, 20 Mar 2023 17:17:35 +0000 Subject: [PATCH 13/27] define indexer --- web/pandas/pdeps/0006-ban-upcasting.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index ce342cba2882e..cf2704360e807 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -65,7 +65,10 @@ then the following would all raise: - ``ser.iloc[0] = 'foo'`` - ``ser.loc[0] = 'foo'`` - ``df.loc[indexer, 'a'] = 'foo'`` -- ``ser[indexer] = 'foo'``; +- ``ser[indexer] = 'foo'`` + +where ``indexer`` could be a slice, a mask, a single value, a list or array of values, +or any other allowed indexer. Examples of operations which would not raise are: - ``ser.diff()``; From 9dcf8d48170ead39f83f44458f6c5ff353af15ff Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 24 Mar 2023 11:35:34 +0000 Subject: [PATCH 14/27] clarify --- web/pandas/pdeps/0006-ban-upcasting.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index cf2704360e807..5db185e52d798 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -156,8 +156,9 @@ This would make pandas stricter, so there should not be any risk of introducing Unfortunately, it would also risk annoy users who might have been intentionally upcasting. 
-Given that users could still get the current behaviour by first explicitly casting to float, -it would be more beneficial to the community at large to err on the side of strictness. +Given that users could still get the current behaviour by first explicitly casting the Series +to float, it would be more beneficial to the community at large to err on the side +of strictness. ## Out of scope From d15a7c13f9ba632d259660c137666ef846555406 Mon Sep 17 00:00:00 2001 From: MarcoGorelli <> Date: Fri, 24 Mar 2023 16:28:56 +0000 Subject: [PATCH 15/27] wip --- web/pandas/pdeps/0006-ban-upcasting.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md index 5db185e52d798..8a037e5411f6d 100644 --- a/web/pandas/pdeps/0006-ban-upcasting.md +++ b/web/pandas/pdeps/0006-ban-upcasting.md @@ -9,7 +9,7 @@ ## Abstract The suggestion is that setitem-like operations would -not change a ``Series``' dtype. +not change a ``Series`` dtype (nor that of a ``DataFrame``'s column). Current behaviour: ```python @@ -32,7 +32,7 @@ In [1]: ser = pd.Series([1, 2, 3]) In [2]: ser[2] = 'potage' # raises! 
--------------------------------------------------------------------------- -TypeError: Invalid value 'potage' for dtype int64 +ValueError: Invalid value 'potage' for dtype int64 ``` ## Motivation and Scope @@ -57,13 +57,12 @@ df = DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]}) ser = df["a"].copy() ``` then the following would all raise: -- ``ser[0] = 'foo'``; - ``ser.fillna('foo', inplace=True)``; - ``ser.where(ser.isna(), 'foo', inplace=True)`` - ``ser.fillna('foo', inplace=False)``; - ``ser.where(ser.isna(), 'foo', inplace=False)`` -- ``ser.iloc[0] = 'foo'`` -- ``ser.loc[0] = 'foo'`` +- ``ser.iloc[indexer] = 'foo'`` +- ``ser.loc[indexer] = 'foo'`` - ``df.loc[indexer, 'a'] = 'foo'`` - ``ser[indexer] = 'foo'`` @@ -75,6 +74,7 @@ Examples of operations which would not raise are: - ``pd.concat([ser, ser.astype(object)])``; - ``ser.mean()``; - ``ser[0] = 3.``; +- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``; - ``ser[0] = 3``. ## Detailed description @@ -106,7 +106,9 @@ For a start, this would involve: - ``EABackedBlock.putmask``; - ``_iLocIndexer._setitem_single_column``; -The above would already require several hundreds of tests to be adjusted. +The above would already require several hundreds of tests to be adjusted. Note that once +implementation starts, the list of locations to change may turn out to be slightly +different. ### Ban upcasting altogether, or just upcasting to ``object``? @@ -122,7 +124,7 @@ This is not necessarily a sign of a bug, because the user might just be thinking numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer. Possible options could be: -1. just raise, forcing users to be explicit; +1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``); 2. convert the float value to ``int`` before setting it; 3. limit "banning upcasting" to when the upcasted dtype is ``object``. 
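Option 1 as reworded in this hunk (accept round floats, raise on anything else) can be sketched in a few lines. This is an illustration in plain Python rather than pandas code: ``value_for_int_series`` is a hypothetical helper, and the exception type and message are placeholders.

```python
def value_for_int_series(value):
    """Return `value` as an int if nothing is lost by doing so, else raise."""
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        # Round floats such as 1.0 can be stored without changing the dtype.
        return int(value)
    # Anything else (1.01, NaN, strings, ...) would need an upcast, so refuse.
    raise ValueError(f"Invalid value {value!r} for an integer Series")


print(value_for_int_series(1.0))

for bad in (1.01, "potage"):
    try:
        value_for_int_series(bad)
    except ValueError as err:
        print(err)
```

The sketch keeps a single rule ("the value must fit the existing dtype") with no special case for ``int`` to ``float``, which is the teachability argument made for option 1.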
@@ -144,7 +146,7 @@ There are several downsides to option ``3``:
 - it would also add complexity to the codebase and to tests;
 - it would be hard to teach, as instead of being able to teach a simple rule,
   there would be a rule with exceptions;
-- there would be a risk of loss of precision;
+- there would be a risk of loss of precision and or overflow;
 - it opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``.

 Option ``1`` is the maximally safe one in terms of protecting users from bugs, being

From 2c214c7f5eb11621d75235bc9ca57683341eef0e Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Fri, 24 Mar 2023 16:43:01 +0000
Subject: [PATCH 16/27] split up examples, assorted cleanups, clarify scope

---
 web/pandas/pdeps/0006-ban-upcasting.md | 28 +++++++++++++++-----------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 8a037e5411f6d..16c6d3d9bc99c 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -57,17 +57,22 @@ df = DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
 ser = df["a"].copy()
 ```
 then the following would all raise:
-- ``ser.fillna('foo', inplace=True)``;
-- ``ser.where(ser.isna(), 'foo', inplace=True)``
-- ``ser.fillna('foo', inplace=False)``;
-- ``ser.where(ser.isna(), 'foo', inplace=False)``
-- ``ser.iloc[indexer] = 'foo'``
-- ``ser.loc[indexer] = 'foo'``
-- ``df.loc[indexer, 'a'] = 'foo'``
-- ``ser[indexer] = 'foo'``
-
-where ``indexer`` could be a slice, a mask, a single value, a list or array of values,
-or any other allowed indexer.
+
+- setitem-like operations:
+  - ``ser.fillna('foo', inplace=True)``;
+  - ``ser.where(ser.isna(), 'foo', inplace=True)``
+  - ``ser.fillna('foo', inplace=False)``;
+  - ``ser.where(ser.isna(), 'foo', inplace=False)``
+- setitem indexing operations (where ``indexer`` could be a slice, a mask,
+  a single value, a list or array of values, or any other allowed indexer):
+  - ``ser.iloc[indexer] = 'foo'``
+  - ``ser.loc[indexer] = 'foo'``
+  - ``df.iloc[indexer, 0] = 'foo'``
+  - ``df.loc[indexer, 'a'] = 'foo'``
+  - ``ser[indexer] = 'foo'``
+
+It may be desirable to expand the top list to ``mask``, ``replace``, and ``update``,
+but to keep the scope of the PDEP down, they are excluded for now.

 Examples of operations which would not raise are:
 - ``ser.diff()``;
@@ -104,7 +109,6 @@ For a start, this would involve:
   - ``EABackedBlock.setitem``;
   - ``EABackedBlock.where``;
   - ``EABackedBlock.putmask``;
-  - ``_iLocIndexer._setitem_single_column``;

 The above would already require several hundreds of tests to be adjusted. Note that once
 implementation starts, the list of locations to change may turn out to be slightly

From 1868f3cd42ffcf1f0ba0016f7cd78bb0646d1fca Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Thu, 30 Mar 2023 14:59:59 +0100
Subject: [PATCH 17/27] mention df.index.intersection

---
 web/pandas/pdeps/0006-ban-upcasting.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 16c6d3d9bc99c..ece6ff663381e 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -71,16 +71,17 @@ then the following would all raise:
   - ``df.loc[indexer, 'a'] = 'foo'``
   - ``ser[indexer] = 'foo'``

-It may be desirable to expand the top list to ``mask``, ``replace``, and ``update``,
+It may be desirable to expand the top list to ``replace`` and ``update``,
 but to keep the scope of the PDEP down, they are excluded for now.
 Examples of operations which would not raise are:
 - ``ser.diff()``;
 - ``pd.concat([ser, ser.astype(object)])``;
 - ``ser.mean()``;
-- ``ser[0] = 3.``;
+- ``ser[0] = 3``; # same dtype
+- ``ser[0] = 3.``; # 3.0 is a 'round' float and so compatible with 'int64' dtype
 - ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``;
-- ``ser[0] = 3``.
+- ``df.index.intersection(ser.index)``.

 ## Detailed description

From 80841d27b71c0bb416dc1b0e9de98719ae2e7bae Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Thu, 6 Apr 2023 17:04:47 +0100
Subject: [PATCH 18/27] make explicit that option 1 was chosen in this pdep

---
 web/pandas/pdeps/0006-ban-upcasting.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index ece6ff663381e..d56e3be0e4269 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -156,6 +156,7 @@ There are several downsides to option ``3``:

 Option ``1`` is the maximally safe one in terms of protecting users from bugs, being
 consistent with the current behaviour of nullable dtypes, and in being simple to teach.
+Therefore, the option chosen by this PDEP is option 1.

 ## Usage and Impact

From 0c4bdff1a1f14fb18721f2a0441519b7544ded55 Mon Sep 17 00:00:00 2001
From: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Thu, 6 Apr 2023 19:57:42 +0100
Subject: [PATCH 19/27] clarify option 3

Co-authored-by: Joris Van den Bossche
---
 web/pandas/pdeps/0006-ban-upcasting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index d56e3be0e4269..794e7a5358039 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -131,7 +131,7 @@ numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just wha
 Possible options could be:
 1. only accept round floats (e.g.
``1.0``) and raise on anything else (e.g. ``1.01``);
 2. convert the float value to ``int`` before setting it;
-3. limit "banning upcasting" to when the upcasted dtype is ``object``.
+3. limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64) .

 Let us compare with what other libraries do:
 - ``numpy``: option 2

From e6f0c7f737e7a7d89ac5ce3881e87ada524ca094 Mon Sep 17 00:00:00 2001
From: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Thu, 6 Apr 2023 19:58:02 +0100
Subject: [PATCH 20/27] clarify option 2

Co-authored-by: Joris Van den Bossche
---
 web/pandas/pdeps/0006-ban-upcasting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 794e7a5358039..86328a904398d 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -130,7 +130,7 @@ numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just wha

 Possible options could be:
 1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``);
-2. convert the float value to ``int`` before setting it;
+2. convert the float value to ``int`` before setting it (i.e. silently round all float values);
 3. limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64) .
 Let us compare with what other libraries do:

From 368ad209d25283359e302dc798abdd45f5230bce Mon Sep 17 00:00:00 2001
From: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Thu, 6 Apr 2023 21:46:26 +0100
Subject: [PATCH 21/27] correct "risk annoy" to "risk annoying" so as to not risk annoying reviewers

---
 web/pandas/pdeps/0006-ban-upcasting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 86328a904398d..cd96a35a8f894 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -162,7 +162,7 @@ Therefore, the option chosen by this PDEP is option 1.

 This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs.

-Unfortunately, it would also risk annoy users who might have been intentionally upcasting.
+Unfortunately, it would also risk annoying users who might have been intentionally upcasting.

 Given that users could still get the current behaviour by first explicitly casting the Series
 to float, it would be more beneficial to the community at large to err on the side

From a0ae1fd78c83199b183fbc4bdab836d4a5daf048 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Sat, 8 Apr 2023 13:33:57 +0100
Subject: [PATCH 22/27] add faq

---
 web/pandas/pdeps/0006-ban-upcasting.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index d56e3be0e4269..1dd8ba5b40775 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -178,6 +178,18 @@ ser[len(ser)] = 4.5
 There is arguably a larger conversation to be had about whether that should be allowed
 at all. To keep this proposal focused, it is intentionally excluded from the scope.

+## F.A.Q.
+
+**Q: What happens if setting ``1.0`` in an ``int8`` Series?**
+
+**A**: The current behavior would be to insert ``1.0`` as ``1`` and keep the dtype
+  as ``int8``. So, this would not change.
+
+**Q: What happens if setting ``1_000_000.0`` in an ``int8`` Series?**
+
+**A**: The current behavior would be to upcast to ``int32``. So under this PDEP,
+  it would instead raise.
+
 ## Timeline

 Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.

From 3e220f0bf151bc0fc9f77a08c175ed9cc49e8c6b Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Tue, 11 Apr 2023 10:14:25 +0100
Subject: [PATCH 23/27] add example with 16.000000000000001 to faq

---
 web/pandas/pdeps/0006-ban-upcasting.md | 29 +++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 4d2c6e33a45f3..15188317dbf4f 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -50,7 +50,7 @@ In[10]: ser[2] = "2000-01-04" # works, is converted to datetime64
 In[11]: ser[2] = "2000-01-04x" # typo - but pandas does not error, it upcasts to object
 ```

-The scope of this PDEP is limited to setitem-like operations on Series.
+The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns).
 For example, starting with
 ```python
 df = DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
@@ -71,7 +71,7 @@ then the following would all raise:
   - ``df.loc[indexer, 'a'] = 'foo'``
   - ``ser[indexer] = 'foo'``

-It may be desirable to expand the top list to ``replace`` and ``update``,
+It may be desirable to expand the top list to ``Series.replace`` and ``Series.update``,
 but to keep the scope of the PDEP down, they are excluded for now.
 Examples of operations which would not raise are:
@@ -122,7 +122,24 @@ The trickiest part of this proposal concerns what to do when setting a float in
 ```python
 In[1]: ser = pd.Series([1, 2, 3])

-In[2]: ser[0] = 1.5
+In [2]: ser
+Out[2]:
+0    1
+1    2
+2    3
+dtype: int64
+
+In[3]: ser[0] = 1.5 # what should this do?
+```
+
+The current behaviour is to upcast to 'float64':
+```python
+In [4]: ser
+Out[4]:
+0    1.5
+1    2.0
+2    3.0
+dtype: float64
 ```

 This is not necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being
@@ -190,6 +207,12 @@ at all. To keep this proposal focused, it is intentionally excluded from the sco
 **A**: The current behavior would be to upcast to ``int32``. So under this PDEP,
   it would instead raise.

+**Q: What happens in setting ``16.000000000000001`` in an `int8`` Series?**
+
+**A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the
+  same number. So, it would be inserted as ``1`` and the dtype would not change
+  (just like what happens now, there would be no change here).
+
 ## Timeline

 Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.

From 0b023170299af998e9b45b654883109a508cc891 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Tue, 11 Apr 2023 10:21:40 +0100
Subject: [PATCH 24/27] minor clarification (when constructing it)

---
 web/pandas/pdeps/0006-ban-upcasting.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 15188317dbf4f..a70d96795568d 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -143,7 +143,8 @@ dtype: float64
 ```

 This is not necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being
-numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer.
+numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer
+when constructing it.

 Possible options could be:
 1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``);

From 50f0a410471ff231f7d54b1400785aec8d404847 Mon Sep 17 00:00:00 2001
From: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Date: Tue, 11 Apr 2023 16:59:46 +0100
Subject: [PATCH 25/27] Update web/pandas/pdeps/0006-ban-upcasting.md

Co-authored-by: Irv Lustig
---
 web/pandas/pdeps/0006-ban-upcasting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index a70d96795568d..1e1445a001389 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -211,7 +211,7 @@ at all. To keep this proposal focused, it is intentionally excluded from the sco
 **Q: What happens in setting ``16.000000000000001`` in an `int8`` Series?**

 **A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the
-  same number. So, it would be inserted as ``1`` and the dtype would not change
+  same number. So, it would be inserted as ``16`` and the dtype would not change
   (just like what happens now, there would be no change here).

 ## Timeline

From cc7756230b101dc48084e2b3f4ed3bcd313ab312 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Tue, 11 Apr 2023 17:39:21 +0100
Subject: [PATCH 26/27] add example of maybe_convert_to_int function

---
 web/pandas/pdeps/0006-ban-upcasting.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index a70d96795568d..692793498a593 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -214,6 +214,17 @@ at all. To keep this proposal focused, it is intentionally excluded from the sco
   same number.
So, it would be inserted as ``1`` and the dtype would not change
   (just like what happens now, there would be no change here).

+**Q: What if I want ``1.0000000001`` to be inserted as ``1.0`` in an `'int8'` Series?**
+
+**A**: You may want to define your own helper function, such as
+  ```python
+  >>> def maybe_convert_to_int(x: int | float, tolerance: float):
+      if np.abs(x - round(x)) < tolerance:
+          return round(x)
+      return x
+  ```
+  which you could adapt according to your needs.
+
 ## Timeline

 Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.

From 8211f37a47970ac0cd41eeea19c69233c5c1c70f Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Fri, 21 Apr 2023 14:14:45 +0100
Subject: [PATCH 27/27] change status to accepted

---
 web/pandas/pdeps/0006-ban-upcasting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index d26b3f6ec99dd..325c25313af53 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -1,7 +1,7 @@
 # PDEP-6: Ban upcasting in setitem-like operations

 - Created: 23 December 2022
-- Status: Under discussion
+- Status: Accepted
 - Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402)
 - Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))
 - Revision: 1
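
The FAQ entries in this series describe which float values an ``int8`` Series would accept under option 1, and suggest a user-defined helper for near-round values. A minimal sketch of those rules in plain Python might look as follows. It does not use pandas internals: ``fits_int8`` is a hypothetical name introduced only for illustration, ``maybe_convert_to_int`` is a variant of the FAQ helper that uses the built-in ``abs`` instead of ``np.abs``, and the default ``tolerance`` is an assumed value rather than anything the PDEP prescribes.

```python
from typing import Union

Number = Union[int, float]


def maybe_convert_to_int(x: Number, tolerance: float = 1e-9) -> Number:
    """Variant of the FAQ helper: snap x to the nearest int when it is within `tolerance`."""
    nearest = round(x)
    if abs(x - nearest) < tolerance:
        return nearest
    return x


def fits_int8(x: Number) -> bool:
    """Hypothetical check mirroring option 1: only whole numbers in the int8 range pass."""
    return float(x).is_integer() and -128 <= x <= 127


# Round float: the FAQ says this is inserted as 1 and the dtype stays int8.
print(fits_int8(1.0))
# Whole number, but outside the int8 range: the FAQ says this would raise.
print(fits_int8(1_000_000.0))
# Python parses this literal as exactly 16.0, so it counts as a round float.
print(fits_int8(16.000000000000001))
# Not round: rejected unless the caller converts it explicitly first.
print(fits_int8(1.0000000001))
print(fits_int8(maybe_convert_to_int(1.0000000001)))
```

Under these assumptions, a value for which ``fits_int8`` returns ``False`` corresponds to a setitem that would raise once the PDEP is enforced, so a caller who wants the lenient behaviour has to opt in by converting the value (or casting the Series) beforehand.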