Cythonized GroupBy mad #20024

WillAyd · 2018-03-07T00:20:50Z

progress towards PERF: Discrepancy in groupby methods #19165
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

       before           after         ratio
     [01b91c26]       [9d7f0ac9]
+         678±3μs         807±10μs     1.19  groupby.GroupByMethods.time_method('object', 'value_counts')
+       116±0.8μs          135±1μs     1.16  groupby.GroupByMethods.time_method('object', 'shift')
+        777±10μs         888±10μs     1.14  groupby.GroupByMethods.time_method('object', 'unique')
+      71.6±0.9μs       81.4±0.4μs     1.14  groupby.GroupByMethods.time_method('object', 'size')
-           730ms       2.87±0.4ms     0.00  groupby.GroupByMethods.time_method('int', 'mad')
-           1.23s       3.05±0.5ms     0.00  groupby.GroupByMethods.time_method('float', 'mad')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

WillAyd · 2018-03-07T00:23:03Z

pandas/tests/groupby/test_whitelist.py

@@ -12,7 +12,7 @@

 AGG_FUNCTIONS = ['sum', 'prod', 'min', 'max', 'median', 'mean', 'skew',
                 'mad', 'std', 'var', 'sem']
-AGG_FUNCTIONS_WITH_SKIPNA = ['skew', 'mad']
+AGG_FUNCTIONS_WITH_SKIPNA = ['skew']


To explain this change: since mad uses the mean behind the scenes, I figured it made sense to rely on mean to do as much of the heavy lifting as possible. Unfortunately, mean doesn't currently handle the skipna parameter, but I think it's worth addressing within that function and letting it pass through to mad rather than implementing it specifically within mad.

As always, glad to open an issue for that if you agree on the approach.

WillAyd · 2018-03-07T00:25:11Z

pandas/core/groupby.py

+
+        # Wrap in a try..except to catch a TypeError with bool data
+        # Ideally this would be implemented in `mean` instead of here
+        try:


As touched on in the code comments, I would ideally not want this try...except and think instead that the mean application should be raising the error. While mean does raise for object types, something like pd.Series([True, False, True]).mean() is entirely valid, so bool data only triggers a TypeError in a subsequent operation.

I think mean should raise on boolean data and have that pass through. Will open a separate issue if you agree.
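
A small sketch of the asymmetry described above, using plain Series rather than the GroupBy code path touched by this PR (that substitution is an assumption made for illustration): mean accepts bool data but rejects object data.

```python
import pandas as pd

# Bool data is accepted: True counts as 1 and False as 0
bool_mean = pd.Series([True, False, True]).mean()

# Object (string) data is rejected with a TypeError
try:
    pd.Series(["a", "b"], dtype=object).mean()
    object_raised = False
except TypeError:
    object_raised = True

print(bool_mean, object_raised)
```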

WillAyd · 2018-03-07T00:28:02Z

pandas/tests/groupby/test_groupby.py

@@ -1300,17 +1301,6 @@ def test_non_cython_api(self):
        g = df.groupby('A')
        gni = df.groupby('A', as_index=False)

-        # mad


This test was failing with this change because it allowed for mad operations on object types, returning NaN rather than raising. I figured the explicit tests added elsewhere made more sense and that raising is preferable to NaN.

There's also some code further down in this test that tests any. I didn't check it in detail, but it can probably be removed on account of the changes made in #19722. If I have further updates I'll include that in the next batch, or alternatively open a separate issue.

jreback · 2018-03-07T00:29:27Z

I think we are going to remove mad (there is an issue about this), so I'm not sure we should add this.

WillAyd · 2018-03-07T00:43:33Z

I assume you are referencing #11787. FWIW it would be a pretty trivial change here to support 'median' as well.

OK with deprecating, but one thing we may want to consider is how users would roll this on their own. Operating on a Series is straightforward:

abs(df['val'] - df['val'].mean()).mean()

But attempting the same on a GroupBy object raises:

abs(df.groupby('key') - df.groupby('key').mean()).mean()
ValueError: Unable to coerce to Series, length must be 1: given 2

So the user would be responsible for some heavier lifting on that side, assuming there's no built-in way for the plain GroupBy object to handle subtraction of its aggregated result.
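
For reference, a minimal sketch of the hand-rolled version, assuming a hypothetical frame with one grouping column `key` and one numeric column `val` (both names and the helper `mean_abs_dev` are illustrative, not part of this PR). It goes through `apply`, i.e. the slow Python-level path that this PR is trying to avoid.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 10.0, 20.0]})


def mean_abs_dev(s):
    """Mean absolute deviation around the mean of a Series."""
    return (s - s.mean()).abs().mean()


# Whole-column result
series_mad = mean_abs_dev(df["val"])

# Per-group result: the helper is called once per group
grouped_mad = df.groupby("key")["val"].apply(mean_abs_dev)

print(series_mad)
print(grouped_mad)
```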

jreback · 2018-03-07T00:45:27Z

Try with a groupby and then transform; this is a typical pattern.

WillAyd · 2018-03-07T00:52:45Z

Assuming you mean

abs(df - df.groupby('key').transform('mean')).mean()

It's possible but then requires the user to explicitly drop the grouped field(s) somewhere in the operation.

A similar implementation (which I used here) doesn't add the grouped fields but is definitely more verbose than what would be required for the Series / DataFrame counterparts

abs(df.groupby('key').shift(0) - df.groupby('key').transform('mean')).mean()

jreback · 2018-03-07T00:56:18Z

Not sure why you need .shift. Also, you don't need the final mean; not sure what that is for.

WillAyd · 2018-03-07T01:01:25Z

If you remove the shift you get

ValueError: Unable to coerce to Series, length must be 1: given 2

The inner mean is the "centering" op and the outer is the resulting op. I think this comment sums up the possible combinations pretty well
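
A sketch of the groupby-then-transform pattern discussed in this exchange, under the assumption of a hypothetical frame with a `key` column and a single numeric `val` column. Selecting `val` explicitly sidesteps the need to drop the grouping column, and the deviations are grouped a second time so that the outer mean is taken per group.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 10.0, 20.0]})

grouped = df.groupby("key")["val"]

# Inner mean ("centering"): broadcast each group's mean back to its rows
centered = df["val"] - grouped.transform("mean")

# Outer mean: average the absolute deviations within each group
grouped_mad = centered.abs().groupby(df["key"]).mean()

print(grouped_mad)
```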

shoyer · 2018-03-07T02:56:10Z

In xarray we support arithmetic with GroupBy objects so your example would actually work:

abs(df.groupby('key') - df.groupby('key').mean()).mean()

It would be interesting to explore porting this syntax to pandas. Xarray users find it pretty useful for doing these sorts of grouped normalizations (which are common in climate science).

WillAyd · 2018-03-08T19:17:27Z

@shoyer that could be a viable option here, and certainly make the mad operation across DataFrame / Series / GroupBy consistent while opening up some other possibilities (ex: easy demeaning).

Do you know where that is implemented in xarray? Dug through the source but nothing was immediately apparent to me

shoyer · 2018-03-08T19:20:10Z

We use some awkward machinery for defining binary ops in xarray (you'd need to figure out how to do this for pandas), but here's where the core groupby arithmetic logic is defined:
https://github.com/pydata/xarray/blob/870e4eaf1895cfeffdc27dab61ad739e67133777/xarray/core/groupby.py#L301-L332

I think you could do something pretty similar for pandas, though obviously the implementation would be pretty different (e.g., you could use .loc instead of .sel()).
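
To make the idea concrete, here is a rough sketch of what such grouped arithmetic could look like for pandas. The helper `subtract_group_stat` is hypothetical: it is not an existing pandas or xarray API and is not taken from the linked xarray code. It only illustrates using .loc to align a per-group result with the original rows.

```python
import pandas as pd


def subtract_group_stat(df, key, stat):
    """Subtract a per-group statistic from every row of its group.

    `stat` is assumed to be a DataFrame indexed by the group labels,
    e.g. the result of df.groupby(key).mean().
    """
    # Look up one row of `stat` per row of `df`, by group label
    aligned = stat.loc[df[key].to_numpy()]
    aligned.index = df.index
    values = df.drop(columns=key)
    return values - aligned[values.columns]


# Hypothetical example data
df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "val": [1.0, 3.0, 10.0, 20.0]})

demeaned = subtract_group_stat(df, "key", df.groupby("key").mean())
grouped_mad = demeaned.abs().groupby(df["key"]).mean()

print(demeaned)
print(grouped_mad)
```

With something like this in place, demeaning and mad become the same two-step recipe: align, then aggregate again.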

jreback · 2018-07-08T15:53:36Z

@WillAyd ideally I'd like to move the groupby cython routines to pandas/core/groupby/cython before we attempt this

WillAyd · 2018-11-13T06:28:00Z

Closing due to the potential deprecation of this method, as discussed in #11787

WillAyd added 6 commits March 6, 2018 13:04

Added GroupBy mad tests

31f1799

Merge remote-tracking branch 'upstream/master' into grp-mad-perf

962f324

Implemented Cythonized GroupBy mad; fixed tests

0c10369

Fixed support issue for series; added tests

192253f

Code refactor / cleanup

9d7f0ac

Updated whatsnew

57152e6

gfyoung added Groupby Performance Memory or execution speed performance labels Mar 8, 2018

WillAyd mentioned this pull request Mar 8, 2018

ENH: Add Support for GroupBy Numeric Operations #20060

Open

WillAyd added 3 commits August 6, 2018 20:52

Merge remote-tracking branch 'upstream/master' into grp-mad-perf

f1a3860

Test fixup

5307ac3

LINT fixup

5cab1eb

jbrockmendel mentioned this pull request Aug 11, 2018

Cythonized GroupBy Quantile #20405

Merged

4 tasks

WillAyd closed this Nov 13, 2018

WillAyd deleted the grp-mad-perf branch January 16, 2020 00:34
