-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH/PERF: optional mask in nullable dtypes #42012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -404,20 +453,17 @@ def isin(self, values) -> BooleanArray: # type: ignore[override] | |||
from pandas.core.arrays import BooleanArray | |||
|
|||
result = isin(self._data, values) | |||
if self._hasna: | |||
if self._mask is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple changes like this where _mask is not None
was used instead of _hasna
. _hasna
seems more resistant to any future refactoring in how the mask works, but then we need an additional assert self._mask is not None
afterwards anyway for mypy
.
@mzeitlin11 thanks a lot for looking into this! There is one problem I recently realized while thinking this through (but didn't yet comment about that on the issue), and that is: how to ensure "views" still work? So for example, assume I create an array with no NAs (the mask will be None), then I create a view of this array (with a slice, or the
One possible solution would be to create a small Such a class could also have some helper methods then like However, one problem with this (that I only realize now while writing it down), is that if using such a shared Mask object, we would need to keep track of the offset/length on the Array class in case the array is sliced (because the mask would present the full mask of the parent array, and when asking for the mask array or whether there are NAs present, we of course only need the sliced part), which seems like a considerable complication .. Another option would be to only support "read-only views" (so that if an array is a view of another, it can't be mutated), but that's certainly something that would need to be discussed. |
Thanks for bringing this up, that's a serious complication that never crossed my mind :) For now I'll try messing around with some workarounds like you mention, though I worry they might complicate the implementation too much to be worth it. |
I think the only way to handle this properly is for the second array to be a view on the original IntegerArray object, rather than it's component ndarrays like we do in I don't know exactly what this would look like in code, but maybe some kind of |
Thanks for the insight @TomAugspurger and @jorisvandenbossche! Will close this for now until I make any progress on a new approach. |
Benchmarks:
isna
benchmark was somewhat surprising, but happens because of the following:The
argmax
performance decrease is caused by theisna
slowdown because there are a bunch ofisna.any()
calls. I think this decrease could be prevented (and performance increased overall) by addingargmax
to our masked reductions (or avoiding the repeatedisna.any()
some other way), but that seems better left for a different pr.The
TimeLogicalOps
benchmark slowdown is occurring because of themask.any()
check added in theBooleanArray
constructor - this slowdown could be avoided by more aggressively figuring out whenmask
can (or won't) be None, but that would complicate the logic somewhat, so would plan to leave as a potential followup.Finally, the slowdown in
IntegerArray
andBooleanArray
is also due to this addedmask.any()
check. This could be mitigated by determining earlier when missing values are not possible in thecoerce_to_array
functions, but that adds reasonable complexity to already complex functions.