Aggregate transforms #1924
Conversation
Looks good to me so far! Nothing blocking, though there are probably some weird corner cases that could be debugged if we dig deep enough.
src/transforms/aggregate.js (Outdated)
'If a string, *groups* is assumed to be a reference to a data array',
'in the parent trace object.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
Is there a difference between the way backticks and asterisks render in the docs?
Ah, good catch - I'm not quite sure how it renders, but we've tried to use backticks for attribute names and asterisks for string attribute values; looks like I mixed them up a bit here.
All good. I didn't notice which was which. 😄
For the record, this special attribute / value formatting isn't currently used anywhere. I originally thought it could be used on https://plot.ly/javascript/reference/, but no one ever got around to implementing it.
src/transforms/aggregate.js (Outdated)
'A reference to the data array in the parent trace to aggregate.',
'To aggregate by nested variables, use *.* to access them.',
'For example, set `groups` to *marker.color* to aggregate',
'about the marker color array.',
Subtle, but maybe "to aggregate over the marker color array?"
src/transforms/aggregate.js (Outdated)
},
func: {
    valType: 'enumerated',
    values: ['count', 'sum', 'avg', 'min', 'max', 'first', 'last'],
Can't imagine median would be very popular… mode…? variance…?
* as distinct from undefined which means this array isn't present in the input
* missing arrays can still be aggregate outputs for *count* aggregations.
*/
var arrayAttrArray = PlotSchema.findArrayAttributes(traceOut);
Of subtle interest is the fact that `arrayAttrs` does include transform data arrays themselves. I glossed over this in groupby since it wasn't critically important for performance (iterated groupbys are definitely a corner case) and since the groups are already decided by the time this transform modifies its own `arrayAttrs`, so the result is correct. Perhaps it should filter `transform[i]` for `i > transformIn._index`…
Ah, interesting... I will have to look for cases where this matters; I can't quite tell offhand whether transforming the earlier transforms is merely unnecessary or can actually lead to bugs. Combine that with groupby happening first, and the condition is a bit more complicated than just `i > transformIn._index` anyway.
I think it's just a tiny bit of unnecessary work we can 🔪 at some point.
src/transforms/aggregate.js (Outdated)
func: {
    valType: 'enumerated',
    values: ['count', 'sum', 'avg', 'min', 'max', 'first', 'last'],
    dflt: 'first',
By default, presumably any data array just returns the first entry as a way of dealing with the issue of needing to specify operations for all data arrays?
it('always executes groupby before aggregate', function() {
    // aggregate and groupby wouldn't commute, but groupby always happens first
    // because it has a `transform`, and aggregate has a `calcTransform`
Is this desirable? Should it be possible to group and then apply an aggregation to each group?
You mean the other way around - should it be possible to aggregate and then group? Yes, of course; there are definitely use cases for it - say you want to do a `'count'` aggregation, then group by that count: one group for all entries with one sample, a separate group for entries with two samples, and so on. But it seems like for the moment we have a deeper structural problem with that, so this test documents it.
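A hypothetical sketch of that use case - aggregate first, then group by the resulting counts. This ordering is not currently possible, and the groupby step below is written loosely just to illustrate the intent:

```js
// NOT currently supported: groupby always runs before aggregate.
var trace = {
    x: ['a', 'a', 'b', 'c', 'c'],
    y: [1, 2, 3, 4, 5],
    transforms: [
        // step 1: count entries per x value -> y would become [2, 1, 2]
        {type: 'aggregate', groups: 'x', aggregations: [{target: 'y', func: 'count'}]},
        // step 2 (hypothetical): one group per distinct count - here the literal
        // array stands in for "the counts computed above", which is exactly the
        // reference we can't express today
        {type: 'groupby', groups: [2, 1, 2]}
    ]
};
```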
Ah, correct. The other way around. 👍
we have a deeper structural problem
... to say the least. 😕
// groups can be any type, but until we implement binning they
// will always compare as strings - so 1 === '1' === 1.0 !== '1.0'
groups: [1, 2, '1', 1.0, 1],
aggregations: [
This would be a case where `Lib.keyedContainer` would be usable, though since we don't modify it internally, it probably serves no purpose. But for the workspace it could be possible to import it from the plotly lib. It basically just helps you use this arrangement as if it were a direct key-value store.
Perhaps @domoritz is interested in this as well.
Thanks for pinging me @jackparmer. Will there be a way to provide custom aggregators, like user-defined aggregates in databases?
Unlikely - for portability and security we do not allow any code in the plot JSON. But I definitely would not be averse to adding more functions. Can you give me examples of the kind of thing people do with this? All the custom aggregator docs seem to talk about "concatenate" - that seems generally useful enough that we should add it (it would require an additional parameter for the join string). @rreusser suggested some others (median, mode, variance), and I could throw in RMS... what else? I know we're not going to cover all edge cases this way, but we should try to grab all the semi-common ones. I was also imagining adding some special ones for x/y/z coordinates to generate error bars automatically (min/mean/max or mean +/- std. dev).
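Purely for illustration, a `concat` aggregation with the extra join-string parameter mentioned above might look something like this (neither the function nor the `separator` attribute exists in this PR; both are hypothetical):

```js
// Hypothetical - not implemented in this PR
var aggregation = {
    target: 'text',
    func: 'concat',
    separator: ', '  // the additional parameter for the join string
};
```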
src/transforms/aggregate.js (Outdated)
var arrayOut = new Array(groupings.length);
for(var i = 0; i < groupings.length; i++) {
    arrayOut[i] = func(arrayIn, groupings[i]);
One more comment: is it necessary to split the arrays, or would it be possible to use online algorithms to just make one pass without splitting into arrays? Like mean and variance, for example. Going through the list with online algorithms in mind:
- count: easy
- sum: easy
- avg: see Algorithms_for_calculating_variance
- min: easy
- max: easy
- first: easy
- last: easy, maybe?
- variance: see Algorithms_for_calculating_variance, maybe requires a bit of per-grouping storage
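A minimal sketch of the one-pass approach being suggested, using Welford's online algorithm for mean and variance (this is not the implementation in the PR, just an illustration of constant per-group storage):

```js
// Welford's online algorithm: one pass, constant per-group state,
// no need to split the input into per-group arrays.
function makeOnlineStats() {
    var n = 0, mean = 0, m2 = 0;
    return {
        add: function(x) {
            n++;
            var delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        },
        mean: function() { return mean; },
        // sample variance; use m2 / n instead for the population variant
        variance: function() { return n > 1 ? m2 / (n - 1) : 0; }
    };
}

// usage: keep one accumulator per group and update it as values stream in
var stats = makeOnlineStats();
[2, 4, 4, 4, 5, 5, 7, 9].forEach(stats.add);
// stats.mean() === 5, stats.variance() ≈ 4.571 (sample)
```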
Good point - I think for the moment I'll leave it as is; all the existing aggregations would be fairly easy to replicate online (and yes, `'last'` is easy - perhaps even easier than `'first'`, you just always replace the output with the new value) but some of the others we've listed in the comments might be trickier (median and mode, in particular). I'll keep this in mind though as a potential performance gain for later.
To note from a private convo with @rreusser - the case where this is most important is when you have a large number of small groups - thousands of groups aggregating just a couple of items each - in which case creating all the little `groupings` arrays entails significant overhead. If we see unreasonable drag in this case, switching to online algorithms would be the fix.
Oh. Yeah. Easy. Obv.
src/transforms/aggregate.js (Outdated)
].join(' ')
},
aggregations: {
    _isLinkedToArray: 'style',
For the record, the `_isLinkedToArray` values are used in the Python API to build the graph objects, e.g. `go.Annotation`. So we should change this line to `_isLinkedToArray: 'aggregation'`.
This won't do anything at the moment, as transforms and anything inside them don't have corresponding Python graph objects (I think), but we might as well stay consistent.
oh haha copy/paste error - thanks.
src/transforms/aggregate.js (Outdated)
},
aggregations: {
    _isLinkedToArray: 'style',
    array: {
Why not `target` as in `groupby` and `filter`?
`array` -> `target` in ffc4ee2
To recap:
In filter, `target` = data by which filter is applied
In sort, `target` = data by which data is sorted
In groupby, `groups` = data by which trace is grouped (should this be `target`?)
In aggregate, `target` = data by which trace is aggregated
In groupby, `groups` = data by which trace is grouped (should this be `target`?)

aggregate has both `groups` (in the main container) and `target` (in each `aggregation`) - I suppose we could make all of these be `target`, but that's starting to sound a bit confusing.
Ah, I thought maybe I was missing one. That works for me.
src/transforms/aggregate.js (Outdated)
// of a valid array attribute - or an unused array attribute with "count"
if(array && (arrayAttrs[array] || (func === 'count' && arrayAttrs[array] === undefined))) {
    arrayAttrs[array] = 0;
    aggregationsOut.push(aggregationOut);
So in general, `aggregationsIn.length !== aggregationsOut.length`. Hmm, that might cause issues in the workspace (but I'm sure you thought about that 😏). Maybe we could instead add an `aggregations[i].enabled` attribute to make it more similar to other `isLinkedToArray` items?
Good idea. We're never going to be able to ensure equal in/out lengths, as we have to add entries for missing arrays - but we can at least ensure that all inputs contribute and that the extras go at the end of the array, so entries that do appear in the input have the same index in the output. Added `enabled` -> 6d2d32c
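For reference, a sketch of what an aggregations array with the `enabled` flag might look like (assuming, as with other enabled flags, that it defaults to true):

```js
var transform = {
    type: 'aggregate',
    groups: 'x',
    aggregations: [
        {target: 'y', func: 'sum'},
        // a disabled entry keeps its slot in the array (so in/out indices line up)
        // but contributes no aggregation
        {target: 'marker.size', func: 'avg', enabled: false}
    ]
};
```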
transforms: [{
    type: 'aggregate',
    // groups can be any type, but until we implement binning they
    // will always compare as strings - so 1 === '1' === 1.0 !== '1.0'
// missing array - the entry is ignored
{array: '', func: 'avg'},
{array: 'x', func: 'sum'},
// non-numerics will not count toward numerator or denominator for avg
Maybe we could add a `nansum` func down the road.
It would be nice to add a line about this behavior in the attribute `description`.
or another attribute (that could apply to other functions too) for how to handle bad values.
Expanded description along with adding a few more functions in b6ccc01, and (lint) style -> aggregation
median, mode, rms, stddev and some improved docs
{target: 'x', func: 'mode'},
{target: 'y', func: 'median'},
{target: 'marker.size', func: 'rms'},
{target: 'marker.line.width', func: 'stddev'}
Maybe:
{target: 'marker.line.width', func: 'stddev', population: true | false}
or
{target: 'marker.line.width', func: 'stddev', sample: true | false}
Or @alexcjohnson suggests:
{
target: 'marker.line.width',
func: 'stddev',
variant: 'population' | 'sample'
}
Are we going to want these config keys to go in that container and not conflict with anything else? It feels just a little loose to lump their config in with the parent container, as opposed to
{
target: 'marker.line.width',
func: 'stddev',
opts: {
variant: 'population' | 'sample'
}
}
But then that's just getting verbose for no reason.
Attribute variants are often set under `*mode` in other contexts. I'd vote for:
funcmode: 'population' | 'sample'
The only downside of `funcmode` might be that the enum values are coupled to `type`. So if `concat` has `funcmode` too, then do we just set dynamic enum values based on the coerced `type`? Does that affect how options and defaults are represented in the docs?
This should be fine. Yes, we'll have to describe it in detail in `description` and manage it dynamically in `supplyDefaults`, but that seems to me better than making different names for each `func` that gets varying `mode`s.
💃 for me w/ or w/o
// this is debatable: should a count of 1 return sample stddev of
// 0 or undefined?
if(!norm) return 0;
return Math.sqrt((total2 - (total * total / cnt)) / norm);
After all the discussion, I felt compelled to verify the formula, though I really do trust it's correct. But anyway, I ran it against a simple online calculator and copied the function into a super quick test: https://gist.github.com/rreusser/ffe0a6cea78dd153c6eea1d54ac11ea1
It gets a hearty 👍 from me. 👏
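For anyone else wanting to sanity-check it, a quick sketch comparing the one-pass computational form above to the plain two-pass definition (variable names mirror the snippet; this is not the gist linked above):

```js
// One-pass form, as in the snippet: sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))
function stddevOnePass(arr) {
    var total = 0, total2 = 0, cnt = arr.length, norm = cnt - 1;
    for(var i = 0; i < cnt; i++) {
        total += arr[i];
        total2 += arr[i] * arr[i];
    }
    if(!norm) return 0;
    return Math.sqrt((total2 - (total * total / cnt)) / norm);
}

// Two-pass sample standard deviation, for comparison
function stddevTwoPass(arr) {
    var mean = arr.reduce(function(a, b) { return a + b; }, 0) / arr.length;
    var ss = arr.reduce(function(a, b) { return a + (b - mean) * (b - mean); }, 0);
    return arr.length > 1 ? Math.sqrt(ss / (arr.length - 1)) : 0;
}

// e.g. both give ≈ 2.138 for [2, 4, 4, 4, 5, 5, 7, 9]
```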
Actually, make that 💃
Adds an `aggregate` transform type, which is actually a closer analog of SQL `GROUP BY` than the `groupby` transform is, because it coalesces all values in each group to a single output and leaves the results all in a single trace. Each array attribute needs an aggregation function; if not provided we fall back on `"first"`, i.e. the first value encountered in the group.

Note that some data type / aggregation function combinations don't make much sense, but we allow them anyway: date sums (we add milliseconds since 1970 and then convert back to a date), category sums (we add category serial numbers), category averages (we average serial numbers, then round). The tests include an example to show how this plays out.

I did NOT implement any sort of binning in this PR. That will be a separate PR so it can add binning to `groupby` as well.

cc @rreusser @etpinard
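A minimal usage sketch of the transform as described, using the attribute names settled on in the review (`groups`, `aggregations`, `target`, `func`):

```js
Plotly.newPlot('graph', [{
    mode: 'markers',
    x: ['a', 'b', 'a', 'b', 'a'],
    y: [1, 2, 3, 4, 5],
    marker: {size: [10, 20, 30, 40, 50]},
    transforms: [{
        type: 'aggregate',
        groups: 'x', // coalesce rows sharing the same x value
        aggregations: [
            {target: 'y', func: 'avg'},          // y becomes the per-group average
            {target: 'marker.size', func: 'max'}
            // any array attribute without an aggregation falls back to 'first'
        ]
    }]
}]);
```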