-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: ChainedAssignmentError for CoW not working when setitem is called from cython #51315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
A small reproducer from the pyreadstats case, is the following function compiled by cython:
If you call that, it incorrectly triggers the ChainedAssignmentError: def test_setitem_from_cython():
df = DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
df2 = _test_setitem_from_cython(df) |
I'm not a huge fan of using refcounting, since it seems very fragile(it depends on a magic number). (I also don't think this approach will work for accessors too, since accessors store a reference to the DataFrame, so that could prevent the error from happenning) One thing we could try is using (There is also zero chance of that working with PyPy, since the finalizer will not get called when the refcount hits zero like in CPython because there is no refcount) Digging through the call stack seems more robust, but I think that that is going to absolutely kill perf. |
How would this be able to distinguish between "valid" cases of objects from getitem that at a later stage go out of scope, versus the cases where it goes out of scope on the same line directly after the setitem call? |
Uh oh!
There was an error while loading. Please reload this page.
#49467 added the feature to detect "chained assignment" and raise an error for this when CoW is enabled (since it will then never work):
This is done based on the refcount of the calling DataFrame of
__setitem__
: if this is only a temporary variable in a chain, it has a lower refcount than when it is assigned to a variable (usingsys.getrefcount
).It seemed that we could robustly check for this with the above approach, but we discovered that this fails if the setitem call is done from a function compiled with cython. We actually encounter this in our own test suite, in the read_spss tests using
pyreadstats
(those are skipped for CoW now).pyreadstats
has some functionality written in cython, and for example this function does a valid (non-chained) setitem call to update a column of a dataframe. However, this incorrectly triggers the ChainedAssignmentError.This is because the cython generated code results in a lower refcount when getting to
DataFrame/Series.__setitem__
(a refcount of 2 instead of 3 for a chained case, but so for a valid non-chained case it can give a refcount of 3, which triggers the error).My current understanding is that both when calling a pure python function that gets interpreted, or calling a compiled cython function, it ends up calling
PyObject_SetItem
(which will then eventually call the__setitem__
). However, when calling this through the interpreter, there is an extra refcount increment from having the object on the stack.It's an annoying issue, and I am not directly sure of a good solution. Some options that I can currently think of:
The text was updated successfully, but these errors were encountered: