-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Cythonized GroupBy mad #20024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cythonized GroupBy mad #20024
Conversation
@@ -12,7 +12,7 @@ | |||
|
|||
AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew', | |||
'mad', 'std', 'var', 'sem'] | |||
AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad'] | |||
AGG_FUNCTIONS_WITH_SKIPNA = ['skew'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking to why I did this - since mad
uses the mean
behind the scenes I figured it made sense to rely on mean
to do as much of the heavy lifting as possible. Unfortunately mean
doesn't currently handle the skipna
parameter but I think it's worth addressing within that function and allowing that to pass through to mad
rather than implementing specifically within mad
.
As always glad to open an issue for that if you agree on approach
pandas/core/groupby.py
Outdated
|
||
# Wrap in a try..except to catch a TypeError with bool data | ||
# Ideally this would be implemented in `mean` instead of here | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As touched on in the comments I ideally would not want this try...except
and think instead that the mean
application should be throwing the error. While mean
does raise for object
types, something like pd.Series([True, False, True]).mean()
is entirely valid and therefore this ends up throwing a TypeError in subsequent operation
I think mean
should raise on boolean data and have that pass through. Will open separate issue if you agree
pandas/tests/groupby/test_groupby.py
Outdated
@@ -1300,17 +1301,6 @@ def test_non_cython_api(self): | |||
g = df.groupby('A') | |||
gni = df.groupby('A', as_index=False) | |||
|
|||
# mad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was failing with this change because it allowed for mad
operations on object
types, returning NaN
rather than raising. I figured the explicit tests added elsewhere made more sense and that raising is preferable to NaN
There's also some code that tests any
further down within this test. Didn't check in detail but can probably be removed on account of changes done in #19722. Assuming I have updates I'll include in the next batch or alternately open up a separate issue
i think we r going to remove mad (there is an issue about this); so not sure we should add this |
I assume you are referencing #11787. FWIW it would be a pretty trivial change here to support 'median' as well. OK with deprecating, but I will say the one thing that we may want to consider is how users would roll this on their own. Operating on a series is for sure straightforward: abs(df['val'] - df['val'].mean()).mean() But attempting the same on a GroupBy object will Raise abs(df.groupby('key') - df.groupby('key').mean()).mean()
ValueError: Unable to coerce to Series, length must be 1: given 2 So the user would be responsible for some heavier lifting on that side, assuming there's no built-in way for the plain |
try with a group by and then transform |
Assuming you mean abs(df - df.groupby('key').transform('mean')).mean() It's possible but then requires the user to explicitly drop the grouped field(s) somewhere in the operation. A similar implementation (which I used here) doesn't add the grouped fields but is definitely more verbose than what would be required for the Series / DataFrame counterparts abs(df.groupby('key').shift(0) - df.groupby('key').transform('mean')).mean() |
not sure why u need .shift |
If you remove the shift you get ValueError: Unable to coerce to Series, length must be 1: given 2 The inner mean is the "centering" op and the outer is the resulting op. I think this comment sums up the possible combinations pretty well |
In xarray we support arithmetic with GroupBy objects so your example would actually work:
It would be interesting to explore porting this syntax to pandas. Xarray users find it pretty useful for doing these sorts of grouped normalizations (which are common in climate science). |
@shoyer that could be a viable option here, and certainly make the mad operation across Do you know where that is implemented in xarray? Dug through the source but nothing was immediately apparent to me |
We use some awkward machinery for defining binary ops in xarray (you'd need to figure out how to do this for pandas), but here's where the core groupby arithmetic logic is defined: I think you could do something pretty similar for pandas, though obviously the implementation would be pretty different (e.g., you could use |
@WillAyd ideally like to move the groupby cython routines to pandas/core/groupby/cython before we attempt this |
Closing due to potential deprecation of method cited in #11787 |
git diff upstream/master -u -- "*.py" | flake8 --diff