| 1 | +# PDEP-6: Ban upcasting in setitem-like operations |
| 2 | + |
| 3 | +- Created: 23 December 2022 |
| 4 | +- Status: Accepted |
| 5 | +- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402) |
| 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche)) |
| 7 | +- Revision: 1 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +This PDEP proposes that setitem-like operations should not
| 12 | +change the dtype of a ``Series`` (nor that of a ``DataFrame`` column).
| 13 | + |
| 14 | +Current behaviour: |
| 15 | +```python |
| 16 | +In [1]: ser = pd.Series([1, 2, 3], dtype='int64') |
| 17 | + |
| 18 | +In [2]: ser[2] = 'potage' |
| 19 | + |
| 20 | +In [3]: ser # dtype changed to 'object'! |
| 21 | +Out[3]: |
| 22 | +0         1
| 23 | +1         2
| 24 | +2    potage
| 25 | +dtype: object |
| 26 | +``` |
| 27 | + |
| 28 | +Suggested behaviour: |
| 29 | + |
| 30 | +```python |
| 31 | +In [1]: ser = pd.Series([1, 2, 3]) |
| 32 | + |
| 33 | +In [2]: ser[2] = 'potage' # raises! |
| 34 | +--------------------------------------------------------------------------- |
| 35 | +ValueError: Invalid value 'potage' for dtype int64 |
| 36 | +``` |
| 37 | + |
| 38 | +## Motivation and Scope |
| 39 | + |
| 40 | +Currently, pandas is extremely flexible in handling different dtypes. |
| 41 | +However, this can potentially hide bugs, break user expectations, and copy data |
| 42 | +in what looks like it should be an inplace operation. |
| 43 | + |
| 44 | +An example of it hiding a bug is: |
| 45 | +```python |
| 46 | +In [9]: ser = pd.Series(pd.date_range("2000", periods=3))
| 47 | +
| 48 | +In [10]: ser[2] = "2000-01-04"  # works, is converted to datetime64
| 49 | +
| 50 | +In [11]: ser[2] = "2000-01-04x"  # typo - but pandas does not error, it upcasts to object
| 51 | +``` |
| 52 | + |
| 53 | +The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns). |
| 54 | +For example, starting with |
| 55 | +```python |
| 56 | +df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
| 57 | +ser = df["a"].copy() |
| 58 | +``` |
| 59 | +then the following would all raise: |
| 60 | + |
| 61 | +- setitem-like operations: |
| 62 | + - ``ser.fillna('foo', inplace=True)``; |
| 63 | + - ``ser.where(ser.isna(), 'foo', inplace=True)`` |
| 64 | + - ``ser.fillna('foo', inplace=False)``; |
| 65 | + - ``ser.where(ser.isna(), 'foo', inplace=False)`` |
| 66 | +- setitem indexing operations (where ``indexer`` could be a slice, a mask, |
| 67 | + a single value, a list or array of values, or any other allowed indexer): |
| 68 | + - ``ser.iloc[indexer] = 'foo'`` |
| 69 | + - ``ser.loc[indexer] = 'foo'`` |
| 70 | + - ``df.iloc[indexer, 0] = 'foo'`` |
| 71 | + - ``df.loc[indexer, 'a'] = 'foo'`` |
| 72 | + - ``ser[indexer] = 'foo'`` |
| 73 | + |
| 74 | +It may be desirable to expand the top list to ``Series.replace`` and ``Series.update``, |
| 75 | +but to keep the scope of the PDEP down, they are excluded for now. |
| 76 | + |
| 77 | +Examples of operations which would not raise are: |
| 78 | +- ``ser.diff()``; |
| 79 | +- ``pd.concat([ser, ser.astype(object)])``; |
| 80 | +- ``ser.mean()``; |
| 81 | +- ``ser[0] = 3``; # same dtype |
| 82 | +- ``ser[0] = 3.``; # 3.0 is a 'round' float and so compatible with 'int64' dtype |
| 83 | +- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``; |
| 84 | +- ``df.index.intersection(ser.index)``. |
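
As a rough illustration of why operations like these are unaffected: they return a new object rather than writing into an existing one, so they remain free to pick a wider result dtype. The snippet below is a minimal sketch using only public pandas calls; the variable names are illustrative and it starts from its own ``int64`` Series rather than the ``ser`` defined above.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="int64")

# concat builds a new Series, so a wider result dtype is allowed.
combined = pd.concat([ser, ser.astype(object)])

# diff also returns a new Series; the leading NaN forces a float dtype.
diffed = ser.diff()

# Setting a value of the same dtype leaves the dtype untouched.
ser[0] = 3

print(combined.dtype, diffed.dtype, ser.dtype)
```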
| 85 | + |
| 86 | +## Detailed description |
| 87 | + |
| 88 | +Concretely, the suggestion is: |
| 89 | +- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype; |
| 90 | +- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise. |
| 91 | + |
| 92 | +For a start, this would involve: |
| 93 | + |
| 94 | +1. changing ``Block.setitem`` such that it does not have an ``except`` block in |
| 95 | + |
| 96 | + ```python |
| 97 | + value = extract_array(value, extract_numpy=True)
| 98 | + try:
| 99 | +     casted = np_can_hold_element(values.dtype, value)
| 100 | + except LossySetitemError:
| 101 | +     # current dtype cannot store value, coerce to common dtype
| 102 | +     nb = self.coerce_to_target_dtype(value)
| 103 | +     return nb.setitem(indexer, value)
| 104 | + else:
| 105 | + ``` |
| 106 | + |
| 107 | +2. making a similar change in: |
| 108 | + - ``Block.where``; |
| 109 | + - ``Block.putmask``; |
| 110 | + - ``EABackedBlock.setitem``; |
| 111 | + - ``EABackedBlock.where``; |
| 112 | + - ``EABackedBlock.putmask``; |
| 113 | + |
| 114 | +The above would already require several hundred tests to be adjusted. Note that once
| 115 | +implementation starts, the list of locations to change may turn out to be slightly |
| 116 | +different. |
| 117 | + |
| 118 | +### Ban upcasting altogether, or just upcasting to ``object``? |
| 119 | + |
| 120 | +The trickiest part of this proposal concerns what to do when setting a float in an integer column: |
| 121 | + |
| 122 | +```python |
| 123 | +In [1]: ser = pd.Series([1, 2, 3])
| 124 | +
| 125 | +In [2]: ser
| 126 | +Out[2]:
| 127 | +0    1
| 128 | +1    2
| 129 | +2    3
| 130 | +dtype: int64
| 131 | +
| 132 | +In [3]: ser[0] = 1.5  # what should this do?
| 133 | +``` |
| 134 | + |
| 135 | +The current behaviour is to upcast to 'float64': |
| 136 | +```python |
| 137 | +In [4]: ser |
| 138 | +Out[4]: |
| 139 | +0    1.5
| 140 | +1    2.0
| 141 | +2    3.0
| 142 | +dtype: float64 |
| 143 | +``` |
| 144 | + |
| 145 | +This is not necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being |
| 146 | +numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer |
| 147 | +when constructing it. |
| 148 | + |
| 149 | +Possible options could be: |
| 150 | +1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``); |
| 151 | +2. convert the float value to ``int`` before setting it (i.e. silently round all float values); |
| 152 | +3. limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64).
| 153 | + |
| 154 | +Let us compare with what other libraries do: |
| 155 | +- ``numpy``: option 2 |
| 156 | +- ``cudf``: option 2 |
| 157 | +- ``polars``: option 2 |
| 158 | +- ``R data.frame``: just upcasts (like pandas does now for non-nullable dtypes); |
| 159 | +- ``pandas`` (nullable dtypes): option 1 |
| 160 | +- ``datatable``: option 1 |
| 161 | +- ``DataFrames.jl``: option 1 |
| 162 | + |
| 163 | +Option ``2`` would be a breaking behaviour change in pandas. Further, |
| 164 | +if the objective of this PDEP is to prevent bugs, then this is also not desirable: |
| 165 | +someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. |
| 166 | + |
| 167 | +There are several downsides to option ``3``: |
| 168 | +- it would be inconsistent with the nullable dtypes' behaviour; |
| 169 | +- it would also add complexity to the codebase and to tests; |
| 170 | +- it would be hard to teach, as instead of being able to teach a simple rule, |
| 171 | + there would be a rule with exceptions; |
| 172 | +- there would be a risk of loss of precision and/or overflow;
| 173 | +- it opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``. |
| 174 | + |
| 175 | +Option ``1`` is the maximally safe one in terms of protecting users from bugs, being |
| 176 | +consistent with the current behaviour of nullable dtypes, and in being simple to teach. |
| 177 | +Therefore, the option chosen by this PDEP is option 1. |
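
The rule behind option 1 can be sketched in plain Python. This is only an illustration of the intended semantics, not the pandas implementation: ``check_int_setitem`` is a hypothetical helper name, and the sketch ignores range checks (for example, whether the value fits in ``int8``).

```python
def check_int_setitem(value, dtype_name="int64"):
    """Hypothetical helper: return ``value`` as an int if that is lossless, else raise."""
    # Plain integers are always compatible (bool is excluded here for simplicity).
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    # 'Round' floats such as 1.0 can be stored without losing information.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    # Anything else would have required changing the dtype, so raise instead.
    raise ValueError(f"Invalid value {value!r} for dtype {dtype_name}")


for candidate in (3, 1.0, 1.01, "potage"):
    try:
        print(repr(candidate), "->", check_int_setitem(candidate))
    except ValueError as exc:
        print(repr(candidate), "-> raises:", exc)
```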
| 178 | + |
| 179 | +## Usage and Impact |
| 180 | + |
| 181 | +This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs. |
| 182 | + |
| 183 | +Unfortunately, it would also risk annoying users who might have been intentionally upcasting. |
| 184 | + |
| 185 | +Given that users could still get the current behaviour by first explicitly casting the Series |
| 186 | +to float, it would be more beneficial to the community at large to err on the side |
| 187 | +of strictness. |
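
For instance, a user who does want floats in an integer Series can opt in explicitly before setting. A minimal sketch, using only the public ``Series.astype`` method:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])     # pandas infers an integer dtype here
ser = ser.astype("float64")    # opt in to floats explicitly
ser[0] = 1.5                   # no dtype change is needed any more

print(ser.dtype)
```

Because the cast happens up front, the later assignment never needs to change the dtype, so it does not depend on any upcasting behaviour.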
| 188 | + |
| 189 | +## Out of scope |
| 190 | + |
| 191 | +Enlargement. For example: |
| 192 | +```python |
| 193 | +ser = pd.Series([1, 2, 3]) |
| 194 | +ser[len(ser)] = 4.5 |
| 195 | +``` |
| 196 | +There is arguably a larger conversation to be had about whether that should be allowed |
| 197 | +at all. To keep this proposal focused, it is intentionally excluded from the scope. |
| 198 | + |
| 199 | +## F.A.Q. |
| 200 | + |
| 201 | +**Q: What happens if setting ``1.0`` in an ``int8`` Series?** |
| 202 | + |
| 203 | +**A**: The current behavior is to insert ``1.0`` as ``1`` and keep the dtype
| 204 | + as ``int8``. This would not change.
| 205 | + |
| 206 | +**Q: What happens if setting ``1_000_000.0`` in an ``int8`` Series?** |
| 207 | + |
| 208 | +**A**: The current behavior is to upcast to ``int32``. Under this PDEP,
| 209 | + it would instead raise.
| 210 | + |
| 211 | +**Q: What happens if setting ``16.000000000000001`` in an ``int8`` Series?**
| 212 | + |
| 213 | +**A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the |
| 214 | + same number. So, it would be inserted as ``16`` and the dtype would not change |
| 215 | + (just like what happens now, there would be no change here). |
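
This can be checked directly in Python, since both literals are parsed to the same double-precision value. The second literal below is an added contrast, not part of the question above:

```python
x = 16.000000000000001   # closer to 16.0 than to any other representable float
y = 16.00000000000001    # one digit shorter: far enough from 16.0 to be distinct

print(x == 16.0, x.is_integer())
print(y == 16.0, y.is_integer())
```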
| 216 | + |
| 217 | +**Q: What if I want ``1.0000000001`` to be inserted as ``1.0`` in an ``'int8'`` Series?**
| 218 | + |
| 219 | +**A**: You may want to define your own helper function, such as |
| 220 | + ```python |
| 221 | + def maybe_convert_to_int(x: int | float, tolerance: float) -> int | float:
| 222 | +     if abs(x - round(x)) < tolerance:
| 223 | +         return round(x)
| 224 | +     return x
| 225 | + ``` |
| 226 | + which you could adapt according to your needs. |
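
A short usage sketch of such a helper follows. The helper is restated with the built-in ``abs`` so the snippet runs on its own, and the tolerance of ``1e-6`` is an arbitrary choice for illustration:

```python
def maybe_convert_to_int(x, tolerance):
    # Snap x to the nearest integer only if it is within `tolerance` of it.
    if abs(x - round(x)) < tolerance:
        return round(x)
    return x


near_one = maybe_convert_to_int(1.0000000001, tolerance=1e-6)
not_near = maybe_convert_to_int(1.5, tolerance=1e-6)

print(repr(near_one))   # an int, safe to set in an integer Series
print(repr(not_near))   # left unchanged, so setting it would still raise
```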
| 227 | + |
| 228 | +## Timeline |
| 229 | + |
| 230 | +Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0. |
| 231 | + |
| 232 | +### PDEP History |
| 233 | + |
| 234 | +- 23 December 2022: Initial draft |