Aggregate transforms #1924
Conversation
Looks good to me so far! Nothing blocking, though there are probably some weird corner cases that could be debugged if we dig deep enough.
src/transforms/aggregate.js (Outdated)
'If a string, *groups* is assumed to be a reference to a data array',
'in the parent trace object.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
Is there a difference between the way backticks and asterisks render in the docs?
Ah, good catch - I'm not quite sure how it renders, but we've tried to use backticks for attribute names and asterisks for string attribute values; looks like I mixed them up a bit here.
All good. I didn't notice which was which. 😄
For the record, this special attribute / value formatting isn't currently used anywhere. I originally thought it could be used on https://plot.ly/javascript/reference/, but no one ever got around to implementing it.
src/transforms/aggregate.js (Outdated)
'A reference to the data array in the parent trace to aggregate.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
'about the marker color array.',
Subtle, but maybe "to aggregate over the marker color array?"
src/transforms/aggregate.js (Outdated)
},
func: {
    valType: 'enumerated',
    values: ['count', 'sum', 'avg', 'min', 'max', 'first', 'last'],
Can't imagine median would be very popular… mode…? variance…?
* as distinct from undefined which means this array isn't present in the input
* missing arrays can still be aggregate outputs for *count* aggregations.
*/
var arrayAttrArray = PlotSchema.findArrayAttributes(traceOut);
Of subtle interest is the fact that `arrayAttrs` does include transform data arrays themselves. I glossed over this in groupby since it wasn't critically important for performance (iterated groupbys are definitely a corner case) and since the groups are already decided by the time this transform modifies its own `arrayAttrs`, so the result is correct. Perhaps it should filter `transform[i]` for `i > transformIn._index`…
Ah, interesting... I will have to look for cases where this matters; I can't quite tell offhand whether transforming the earlier transforms is merely unnecessary or can actually lead to bugs. Combine that with groupby happening first, and the condition is a bit more complicated than just `i > transformIn._index` anyway.
I think it's just a tiny bit of unnecessary work we can 🔪 at some point.
src/transforms/aggregate.js (Outdated)
func: {
    valType: 'enumerated',
    values: ['count', 'sum', 'avg', 'min', 'max', 'first', 'last'],
    dflt: 'first',
By default, presumably any data array just returns the first entry as a way of dealing with the issue of needing to specify operations for all data arrays?
it('always executes groupby before aggregate', function() {
    // aggregate and groupby wouldn't commute, but groupby always happens first
    // because it has a `transform`, and aggregate has a `calcTransform`
Is this desirable? Should it be possible to group and then apply an aggregation to each group?
You mean the other way around - should it be possible to aggregate and then group? Yes, of course; there are definitely use cases for it - say you want to do a `'count'` aggregation, then group by that count: one group for all entries with one sample, a separate group for entries with two samples, and so on. But it seems like for the moment we have a deeper structural problem with that, so this test documents it.
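A hypothetical sketch of that use case - aggregate first, then group by the resulting counts. This ordering is not currently possible, and the groupby step below is written loosely just to illustrate the intent:

```js
// NOT currently supported: groupby always runs before aggregate.
var trace = {
    x: ['a', 'a', 'b', 'c', 'c'],
    y: [1, 2, 3, 4, 5],
    transforms: [
        // step 1: count entries per x value -> y would become [2, 1, 2]
        {type: 'aggregate', groups: 'x', aggregations: [{target: 'y', func: 'count'}]},
        // step 2 (hypothetical): one group per distinct count - here the literal
        // array stands in for "the counts computed above", which is exactly the
        // reference we can't express today
        {type: 'groupby', groups: [2, 1, 2]}
    ]
};
```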
Ah, correct. The other way around. 👍
we have a deeper structural problem
... to say the least. 😕
// groups can be any type, but until we implement binning they
// will always compare as strings - so 1 === '1' === 1.0 !== '1.0'
groups: [1, 2, '1', 1.0, 1],
aggregations: [
This would be a case where `Lib.keyedContainer` would be usable, though since we don't modify it internally, it probably serves no purpose. But for the workspace it could be possible to import it from the plotly lib. It basically just helps you use this arrangement as if it were a direct key-value store.
Perhaps @domoritz is interested in this as well.
Thanks for pinging me @jackparmer. Will there be a way to provide custom aggregators, like user-defined aggregates in databases?
Unlikely - for portability and security we do not allow any code in the plot JSON. But I definitely would not be averse to adding more functions. Can you give me examples of the kind of thing people do with this? All the custom aggregator docs seem to talk about "concatenate" - that seems generally useful enough that we should add it (it would require an additional parameter for the join string). @rreusser suggested some others (median, mode, variance), and I could throw in RMS... what else? I know we're not going to cover all edge cases this way, but we should try to grab all the semi-common ones. I was also imagining adding some special ones for x/y/z coordinates to generate error bars automatically (min/mean/max or mean +/- std. dev).
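Purely for illustration, a `concat` aggregation with the extra join-string parameter mentioned above might look something like this (neither the function nor the `separator` attribute exists in this PR; both are hypothetical):

```js
// Hypothetical - not implemented in this PR
var aggregation = {
    target: 'text',
    func: 'concat',
    separator: ', '  // the additional parameter for the join string
};
```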
src/transforms/aggregate.js (Outdated)
var arrayOut = new Array(groupings.length);
for(var i = 0; i < groupings.length; i++) {
    arrayOut[i] = func(arrayIn, groupings[i]);
One more comment: is it necessary to split the arrays, or would it be possible to use online algorithms to just make one pass without splitting into arrays? Like mean and variance, for example. Going through the list with online algorithms in mind:
- count: easy
- sum: easy
- avg: see Algorithms_for_calculating_variance
- min: easy
- max: easy
- first: easy
- last: easy, maybe?
- variance: see Algorithms_for_calculating_variance, maybe requires a bit of per-grouping storage
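A minimal sketch of the one-pass approach being suggested, using Welford's online algorithm for mean and variance (this is not the implementation in the PR, just an illustration of constant per-group storage):

```js
// Welford's online algorithm: one pass, constant per-group state,
// no need to split the input into per-group arrays.
function makeOnlineStats() {
    var n = 0, mean = 0, m2 = 0;
    return {
        add: function(x) {
            n++;
            var delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        },
        mean: function() { return mean; },
        // sample variance; use m2 / n instead for the population variant
        variance: function() { return n > 1 ? m2 / (n - 1) : 0; }
    };
}

// usage: keep one accumulator per group and update it as values stream in
var stats = makeOnlineStats();
[2, 4, 4, 4, 5, 5, 7, 9].forEach(stats.add);
// stats.mean() === 5, stats.variance() ≈ 4.571 (sample)
```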
Good point - I think for the moment I'll leave it as is; all the existing aggregations would be fairly easy to replicate online (and yes, `'last'` is easy - perhaps even easier than `'first'`, you just always replace the output with the new value) but some of the others we've listed in the comments might be trickier (median and mode, in particular). I'll keep this in mind though as a potential performance gain for later.
To note from a private convo with @rreusser - the case where this is most important is when you have a large number of small groups - thousands of groups aggregating just a couple of items each - in which case creating all the little `groupings` arrays entails significant overhead. If we see unreasonable drag in this case, switching to online algorithms would be the fix.
Oh. Yeah. Easy. Obv.
src/transforms/aggregate.js (Outdated)
].join(' ')
},
aggregations: {
    _isLinkedToArray: 'style',
For the record, the `_isLinkedToArray` values are used in the Python API to build the graph objects, e.g. `go.Annotation`. So we should change this line to `_isLinkedToArray: 'aggregation'`.
This won't do anything at the moment, as transforms and anything inside them don't have corresponding Python graph objects (I think), but we might as well stay consistent.
oh haha copy/paste error - thanks.
src/transforms/aggregate.js (Outdated)
},
aggregations: {
    _isLinkedToArray: 'style',
    array: {
Why not `target` as in `groupby` and `filter`?
`array` -> `target` in ffc4ee2
To recap:
In filter, `target` = data by which filter is applied
In sort, `target` = data by which data is sorted
In groupby, `groups` = data by which trace is grouped (should this be `target`?)
In aggregate, `target` = data by which trace is aggregated
In groupby, `groups` = data by which trace is grouped (should this be `target`?)

aggregate has both `groups` (in the main container) and `target` (in each `aggregation`) - I suppose we could make all of these be `target`, but that's starting to sound a bit confusing.
Ah, I thought maybe I was missing one. That works for me.
src/transforms/aggregate.js (Outdated)
// of a valid array attribute - or an unused array attribute with "count"
if(array && (arrayAttrs[array] || (func === 'count' && arrayAttrs[array] === undefined))) {
    arrayAttrs[array] = 0;
    aggregationsOut.push(aggregationOut);
So in general, `aggregationsIn.length !== aggregationsOut.length`. Hmm, that might cause issues in the workspace (but I'm sure you thought about that 😏). Maybe we could instead add an `aggregations[i].enabled` attribute to make it more similar to other `isLinkedToArray` items?
Good idea. We're never going to be able to ensure equal in/out lengths, as we have to add entries for missing arrays - but we can at least ensure that all inputs contribute and that the extras go at the end of the array, so entries that do appear in the input have the same index in the output. Added `enabled` -> 6d2d32c
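For reference, a sketch of what an aggregations array with the `enabled` flag might look like (assuming, as with other enabled flags, that it defaults to true):

```js
var transform = {
    type: 'aggregate',
    groups: 'x',
    aggregations: [
        {target: 'y', func: 'sum'},
        // a disabled entry keeps its slot in the array (so in/out indices line up)
        // but contributes no aggregation
        {target: 'marker.size', func: 'avg', enabled: false}
    ]
};
```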
transforms: [{
    type: 'aggregate',
    // groups can be any type, but until we implement binning they
    // will always compare as strings - so 1 === '1' === 1.0 !== '1.0'
// missing array - the entry is ignored
{array: '', func: 'avg'},
{array: 'x', func: 'sum'},
// non-numerics will not count toward numerator or denominator for avg
Maybe we could add a `nansum` func down the road.
It would be nice to add a line about this behavior in the attribute `description`.
or another attribute (that could apply to other functions too) for how to handle bad values.
Expanded description along with adding a few more functions in b6ccc01, and (lint) style -> aggregation
median, mode, rms, stddev and some improved docs
{target: 'x', func: 'mode'},
{target: 'y', func: 'median'},
{target: 'marker.size', func: 'rms'},
{target: 'marker.line.width', func: 'stddev'}
Maybe:
{target: 'marker.line.width', func: 'stddev', population: true | false}
or
{target: 'marker.line.width', func: 'stddev', sample: true | false}
Or @alexcjohnson suggests:
{
target: 'marker.line.width',
func: 'stddev',
variant: 'population' | 'sample'
}
Are we going to want these config keys to go in that container and not conflict with anything else? It feels just a little loose to lump their config in with the parent container, as opposed to
{
target: 'marker.line.width',
func: 'stddev',
opts: {
variant: 'population' | 'sample'
}
}
But then that's just getting verbose for no reason.
Attribute variants are often set under `*mode` in other contexts. I'd vote for:
funcmode: 'population' | 'sample'
The only downside of `funcmode` might be that the enum values are coupled to `type`. So if `concat` has `funcmode` too, then do we just set dynamic enum values based on the coerced `type`? Does that affect how options and defaults are represented in the docs?
This should be fine. Yes, we'll have to describe it in detail in `description` and manage it dynamically in `supplyDefaults`, but that seems to me better than making different names for each `func` that gets varying `mode`s.
💃 for me w/ or w/o
// this is debatable: should a count of 1 return sample stddev of
// 0 or undefined?
if(!norm) return 0;
return Math.sqrt((total2 - (total * total / cnt)) / norm);
After all the discussion, I felt compelled to verify the formula, though I really do trust it's correct. But anyway, I ran it against a simple online calculator and copied the function into a super quick test: https://gist.github.com/rreusser/ffe0a6cea78dd153c6eea1d54ac11ea1
It gets a hearty 👍 from me. 👏
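For anyone else wanting to sanity-check it, a quick sketch comparing the one-pass computational form above to the plain two-pass definition (variable names mirror the snippet; this is not the gist linked above):

```js
// One-pass form, as in the snippet: sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))
function stddevOnePass(arr) {
    var total = 0, total2 = 0, cnt = arr.length, norm = cnt - 1;
    for(var i = 0; i < cnt; i++) {
        total += arr[i];
        total2 += arr[i] * arr[i];
    }
    if(!norm) return 0;
    return Math.sqrt((total2 - (total * total / cnt)) / norm);
}

// Two-pass sample standard deviation, for comparison
function stddevTwoPass(arr) {
    var mean = arr.reduce(function(a, b) { return a + b; }, 0) / arr.length;
    var ss = arr.reduce(function(a, b) { return a + (b - mean) * (b - mean); }, 0);
    return arr.length > 1 ? Math.sqrt(ss / (arr.length - 1)) : 0;
}

// e.g. both give ≈ 2.138 for [2, 4, 4, 4, 5, 5, 7, 9]
```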
Actually, make that 💃
Adds an `aggregate` transform type, which is actually a closer analog of SQL `GROUP BY` than the `groupby` transform is, because it coalesces all values in each group to a single output and leaves the results all in a single trace. Each array attribute needs an aggregation function; if not provided we fall back on `"first"`, i.e. the first value encountered in the group.

Note that some data type / aggregation function combinations don't make much sense, but we allow them anyway: date sums (we add milliseconds since 1970 and then convert back to a date), category sums (we add category serial numbers), category averages (we average serial numbers, then round). The tests include an example to show how this plays out.

I did NOT implement any sort of binning in this PR. That will be a separate PR so it can add binning to `groupby` as well.

cc @rreusser @etpinard
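A minimal usage sketch of the transform as described, using the attribute names settled on in the review (`groups`, `aggregations`, `target`, `func`):

```js
Plotly.newPlot('graph', [{
    mode: 'markers',
    x: ['a', 'b', 'a', 'b', 'a'],
    y: [1, 2, 3, 4, 5],
    marker: {size: [10, 20, 30, 40, 50]},
    transforms: [{
        type: 'aggregate',
        groups: 'x', // coalesce rows sharing the same x value
        aggregations: [
            {target: 'y', func: 'avg'},          // y becomes the per-group average
            {target: 'marker.size', func: 'max'}
            // any array attribute without an aggregation falls back to 'first'
        ]
    }]
}]);
```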