Commit 910f159

MarcoGorelli, jorisvandenbossche, and Dr-Irv authored
PDEP-6: Ban upcasting in setitem-like operations (#50424)
Co-authored-by: MarcoGorelli <>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
1 parent 494590d commit 910f159

1 file changed: 234 additions and 0 deletions
@@ -0,0 +1,234 @@
# PDEP-6: Ban upcasting in setitem-like operations

- Created: 23 December 2022
- Status: Accepted
- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))
- Revision: 1

## Abstract

The proposal is that setitem-like operations should
not change the dtype of a ``Series`` (nor that of a ``DataFrame`` column).

Current behaviour:
```python
In [1]: ser = pd.Series([1, 2, 3], dtype='int64')

In [2]: ser[2] = 'potage'

In [3]: ser  # dtype changed to 'object'!
Out[3]:
0         1
1         2
2    potage
dtype: object
```

Suggested behaviour:

```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser[2] = 'potage'  # raises!
---------------------------------------------------------------------------
ValueError: Invalid value 'potage' for dtype int64
```

## Motivation and Scope

Currently, pandas is extremely flexible in handling different dtypes.
However, this flexibility can hide bugs, break user expectations, and copy data
in what looks like it should be an inplace operation.

An example of it hiding a bug is:
```python
In [9]: ser = pd.Series(pd.date_range("2000", periods=3))

In [10]: ser[2] = "2000-01-04"  # works, is converted to datetime64

In [11]: ser[2] = "2000-01-04x"  # typo - but pandas does not error, it upcasts to object
```

The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns).
For example, starting with
```python
import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
ser = df["a"].copy()
```
then the following would all raise:

- setitem-like operations:
    - ``ser.fillna('foo', inplace=True)``;
    - ``ser.where(ser.isna(), 'foo', inplace=True)``
    - ``ser.fillna('foo', inplace=False)``;
    - ``ser.where(ser.isna(), 'foo', inplace=False)``
- setitem indexing operations (where ``indexer`` could be a slice, a mask,
  a single value, a list or array of values, or any other allowed indexer):
    - ``ser.iloc[indexer] = 'foo'``
    - ``ser.loc[indexer] = 'foo'``
    - ``df.iloc[indexer, 0] = 'foo'``
    - ``df.loc[indexer, 'a'] = 'foo'``
    - ``ser[indexer] = 'foo'``

It may be desirable to expand the top list to include ``Series.replace`` and ``Series.update``,
but to keep the scope of the PDEP down, they are excluded for now.

Examples of operations which would not raise are:
- ``ser.diff()``;
- ``pd.concat([ser, ser.astype(object)])``;
- ``ser.mean()``;
- ``ser[0] = 3``;  # same dtype
- ``ser[0] = 3.``;  # 3.0 is a 'round' float and so compatible with 'int64' dtype
- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``;
- ``df.index.intersection(ser.index)``.

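As a rough illustration of the non-raising cases listed above, the sketch below runs several of them on the example frame. The variable names (``diffed``, ``combined``, ``average``) are chosen here for illustration only. None of these operations writes an incompatible value into an existing column, so they are expected to behave the same with or without this PDEP.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
ser = df["a"].copy()  # float64, because the column contains NaN

# These return new objects instead of writing into ``ser``.
diffed = ser.diff()
combined = pd.concat([ser, ser.astype(object)])  # the result is object, ``ser`` is untouched
average = ser.mean()

# Setting values that the existing dtype can hold keeps the dtype as it is.
ser[0] = 3
ser[0] = 3.0

# Assigning a whole column replaces it rather than setting into it.
df["a"] = pd.date_range("2020-01-01", periods=3)
```

The distinction that matters is between writing into existing data (in scope) and producing or assigning a new object (out of scope).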
## Detailed description

Concretely, the suggestion is:
- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype;
- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.

For a start, this would involve:

1. changing ``Block.setitem`` such that it does not have an ``except`` block in

    ```python
    value = extract_array(value, extract_numpy=True)
    try:
        casted = np_can_hold_element(values.dtype, value)
    except LossySetitemError:
        # current dtype cannot store value, coerce to common dtype
        nb = self.coerce_to_target_dtype(value)
        return nb.setitem(indexer, value)
    else:
    ```

2. making a similar change in:
    - ``Block.where``;
    - ``Block.putmask``;
    - ``EABackedBlock.setitem``;
    - ``EABackedBlock.where``;
    - ``EABackedBlock.putmask``;

The above would already require several hundred tests to be adjusted. Note that once
implementation starts, the list of locations to change may turn out to be slightly
different.

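The effect of dropping the ``except`` branch can be sketched on a plain NumPy array. This is a toy model under stated assumptions, not pandas internals: ``LossyError``, ``can_hold``, ``setitem_upcasting`` and ``setitem_strict`` are hypothetical names, and only integer arrays receiving int or float scalars are handled.

```python
import numpy as np


class LossyError(Exception):
    """Hypothetical stand-in for the internal LossySetitemError."""


def can_hold(dtype: np.dtype, value: float) -> None:
    # Toy check: an integer array can only hold whole numbers.
    if dtype.kind == "i" and not float(value).is_integer():
        raise LossyError(value)


def setitem_upcasting(values: np.ndarray, idx: int, value: float) -> np.ndarray:
    # Models the current behaviour: on failure, coerce to a common dtype and retry.
    try:
        can_hold(values.dtype, value)
    except LossyError:
        values = values.astype(np.result_type(values.dtype, np.float64))
    values[idx] = value
    return values


def setitem_strict(values: np.ndarray, idx: int, value: float) -> np.ndarray:
    # Models the proposal: there is no fallback, so the dtype cannot change.
    try:
        can_hold(values.dtype, value)
    except LossyError as err:
        raise ValueError(f"Invalid value {value!r} for dtype {values.dtype}") from err
    values[idx] = value
    return values


arr = np.array([1, 2, 3], dtype="int64")
upcast = setitem_upcasting(arr.copy(), 0, 1.5)  # becomes a float64 copy
kept = setitem_strict(arr.copy(), 0, 2.0)       # stays int64
```

In the upcasting variant the fallback silently returns a new array of a different dtype, which is the hidden copy that the motivation section describes.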
### Ban upcasting altogether, or just upcasting to ``object``?

The trickiest part of this proposal concerns what to do when setting a float in an integer column:

```python
In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 1.5  # what should this do?
```

The current behaviour is to upcast to 'float64':
```python
In [4]: ser
Out[4]:
0    1.5
1    2.0
2    3.0
dtype: float64
```

This is not necessarily a sign of a bug, because the user might just be thinking of their ``Series`` as being
numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just what pandas happened to infer
when constructing it.

Possible options could be:
1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``);
2. convert the float value to ``int`` before setting it (i.e. silently round all float values);
3. limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64).

Let us compare with what other libraries do:
- ``numpy``: option 2
- ``cudf``: option 2
- ``polars``: option 2
- ``R data.frame``: just upcasts (like pandas does now for non-nullable dtypes);
- ``pandas`` (nullable dtypes): option 1
- ``datatable``: option 1
- ``DataFrames.jl``: option 1

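To make option 2 concrete, here is a minimal sketch of what NumPy does when a non-round float is assigned into an integer array. The array name ``arr`` is illustrative. NumPy neither raises nor upcasts: it casts the value to the array's dtype, which discards the fractional part.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int64")
arr[0] = 1.5  # no error and no upcast: the value is cast to int64

stored = int(arr[0])  # the fractional part has been dropped
```

Here ``stored`` is ``1`` and ``arr.dtype`` is still ``int64``, so the ``0.5`` is lost without any signal to the caller.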
Option ``2`` would be a breaking behaviour change in pandas. Further,
if the objective of this PDEP is to prevent bugs, then this is also not desirable:
someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.

There are several downsides to option ``3``:
- it would be inconsistent with the nullable dtypes' behaviour;
- it would also add complexity to the codebase and to tests;
- it would be hard to teach, as instead of being able to teach a simple rule,
  there would be a rule with exceptions;
- there would be a risk of loss of precision and/or overflow;
- it opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``.

Option ``1`` is the maximally safe one in terms of protecting users from bugs, being
consistent with the current behaviour of nullable dtypes, and in being simple to teach.
Therefore, the option chosen by this PDEP is option 1.

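The nullable-dtype behaviour that option 1 aligns with can be checked with a short sketch such as the one below. It assumes a pandas version with the nullable ``Int64`` dtype. The exact exception type and message are not specified by this document, so the sketch catches both ``TypeError`` and ``ValueError`` and only records that something was raised.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="Int64")  # nullable integer dtype

ser[0] = 1.0  # a round float is accepted and stored as the integer 1

try:
    ser[1] = 1.5  # a non-round float cannot be stored losslessly
except (TypeError, ValueError) as err:
    outcome = f"raised {type(err).__name__}"
else:
    outcome = "no error"
```

After running this, ``ser`` is expected to keep its ``Int64`` dtype, with only the first element having been rewritten.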
## Usage and Impact

This would make pandas stricter: operations that used to upcast silently would raise instead,
so there should not be any risk of introducing silent bugs. If anything, this would help prevent bugs.

Unfortunately, it would also risk annoying users who might have been intentionally upcasting.

Given that users could still get the current behaviour by first explicitly casting the Series
to float, it would be more beneficial to the community at large to err on the side
of strictness.

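The explicit-cast workaround mentioned above might look like the following minimal sketch. It uses only ``Series.astype`` and ordinary setitem, and the name ``ser`` is illustrative.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="int64")

# Opt in to floats explicitly, then set the value as before.
ser = ser.astype("float64")
ser[0] = 1.5
```

Because the dtype change is now a separate, visible step, the setitem itself no longer has to change the dtype.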
## Out of scope

Enlargement. For example:
```python
ser = pd.Series([1, 2, 3])
ser[len(ser)] = 4.5
```
There is arguably a larger conversation to be had about whether that should be allowed
at all. To keep this proposal focused, it is intentionally excluded from the scope.

## F.A.Q.

**Q: What happens if setting ``1.0`` in an ``int8`` Series?**

**A**: The current behavior would be to insert ``1.0`` as ``1`` and keep the dtype
as ``int8``. So, this would not change.

**Q: What happens if setting ``1_000_000.0`` in an ``int8`` Series?**

**A**: The current behavior would be to upcast to ``int32``. So under this PDEP,
it would instead raise.

**Q: What happens if setting ``16.000000000000001`` in an ``int8`` Series?**

**A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the
same number. So, it would be inserted as ``16`` and the dtype would not change
(just like what happens now, there would be no change here).

**Q: What if I want ``1.0000000001`` to be inserted as ``1.0`` in an ``int8`` Series?**

**A**: You may want to define your own helper function, such as
```python
import numpy as np


def maybe_convert_to_int(x: int | float, tolerance: float) -> int | float:
    # Snap x to the nearest integer when it lies within `tolerance` of it.
    if np.abs(x - round(x)) < tolerance:
        return round(x)
    return x
```
which you could adapt according to your needs.

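As a usage sketch for the helper from the last answer, the snippet below repeats its definition so that it runs on its own, then applies it before setting into an ``int8`` Series. The tolerance ``1e-6`` and the sample values are arbitrary choices for illustration.

```python
from __future__ import annotations

import numpy as np
import pandas as pd


def maybe_convert_to_int(x: int | float, tolerance: float) -> int | float:
    # Snap x to the nearest integer when it lies within `tolerance` of it.
    if np.abs(x - round(x)) < tolerance:
        return round(x)
    return x


ser = pd.Series([1, 2, 3], dtype="int8")

# 5.0000000001 is within 1e-6 of 5, so the helper returns the int 5.
snapped = maybe_convert_to_int(5.0000000001, tolerance=1e-6)
ser[0] = snapped

# 1.5 is not close to any integer, so it is returned unchanged.
unchanged = maybe_convert_to_int(1.5, tolerance=1e-6)
```

Values that the helper leaves unchanged, such as ``1.5`` here, would still be rejected by a strict setitem, so the caller decides how much rounding is acceptable.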
## Timeline

Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.

### PDEP History

- 23 December 2022: Initial draft
