implement partial aggregates (LArray.regroup and Axis.regroup) #361

gdementen · 2017-08-23T11:12:18Z

Implement an easier way to aggregate only part of an axis and leave other labels intact:

>>> a = ndtest(10)
>>> a.sum('a0;a1..a3 >> a13;a4;a5;a6;a7;a8;a9')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9
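
To pin down the intended semantics, here is a minimal pure-Python sketch that does not use larray at all. `partial_sum` is a hypothetical helper, plain lists stand in for the axis labels and the data, and the groups are assumed to be disjoint: each named group is summed and emitted at the position of its first member, while every other label passes through unchanged.

```python
def partial_sum(labels, values, groups):
    """Sum the labels listed in `groups`; keep all other labels as-is.

    `groups` maps a new label to the list of existing labels it replaces.
    This is an illustration only, not larray code.
    """
    # reverse lookup: existing label -> name of the group it belongs to
    group_of = {label: name for name, members in groups.items() for label in members}
    result = {}  # dicts keep insertion order, so a group sits where its first member was
    for label, value in zip(labels, values):
        key = group_of.get(label, label)
        result[key] = result.get(key, 0) + value
    return result


labels = [f"a{i}" for i in range(10)]
values = list(range(10))
print(partial_sum(labels, values, {"a13": ["a1", "a2", "a3"]}))
# {'a0': 0, 'a13': 6, 'a4': 4, 'a5': 5, 'a6': 6, 'a7': 7, 'a8': 8, 'a9': 9}
```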

gdementen · 2017-10-27T07:35:07Z

Technically, that should not be too hard (*), but I am unsure about the syntax:

>>> a.partial_agg(sum, 'a1..a3 >> a13')
>>> a.partial_sum('a1..a3 >> a13')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9
>>> a.partial_mean('a1..a3 >> a13')
a   a0  a13   a4   a5   a6   a7   a8   a9
   0.0  2.0  4.0  5.0  6.0  7.0  8.0  9.0

(*) ~~either~~ create the group explicitly like above, ~~or split the array using LArray.split()~~

gdementen · 2017-10-27T07:45:45Z

Note that we must also support arbitrary (non-contiguous) groups and (maybe) overlapping groups, which will make an implementation via .split()/.chunks() mostly impossible:

>>> a.partial_sum('a1..a3 >> a13;a6..a8 >> a68')
a  a0  a13  a4  a5  a68  a9
    0    6   4   5   21   9
>>> a.partial_sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
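
Non-contiguous groups are easy to describe once the specification is turned into explicit label lists. The sketch below is a simplified stand-in for the string syntax used in these examples, not larray's actual parser: `parse_groups` is a hypothetical name, and only `;`, `,`, `..` and `>>` are handled.

```python
def parse_groups(spec, labels):
    """Parse 'a1..a3 >> a13;a6,a8 >> a68' into {'a13': [...], 'a68': [...]}.

    Parts without '>>' (single labels listed only to keep them) are skipped.
    """
    groups = {}
    for part in spec.split(";"):
        if ">>" not in part:
            continue
        key, name = (s.strip() for s in part.split(">>"))
        members = []
        for item in key.split(","):
            item = item.strip()
            if ".." in item:
                # a range is inclusive on both ends and follows the axis order
                start, stop = (s.strip() for s in item.split(".."))
                members.extend(labels[labels.index(start):labels.index(stop) + 1])
            else:
                members.append(item)
        groups[name] = members
    return groups


labels = [f"a{i}" for i in range(10)]
print(parse_groups("a1..a3 >> a13;a6..a8 >> a68", labels))
# {'a13': ['a1', 'a2', 'a3'], 'a68': ['a6', 'a7', 'a8']}
print(parse_groups("a1,a3,a4 >> a134;a6,a8 >> a68", labels))
# {'a134': ['a1', 'a3', 'a4'], 'a68': ['a6', 'a8']}
```

With explicit label lists, nothing prevents a label from appearing in two groups, which is exactly the case that a split into disjoint chunks cannot express.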

gdementen · 2017-11-01T16:56:55Z

Now that I think of it, it might be better to implement this as a method on Axis, so that we do not have to define extra aggregate methods and it works out of the box for any aggregate. The difficulty in that case is to find a good name for the method:

>>> arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This seems technically interesting, but the result is not very readable, and it is not obvious what it means.
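
The Axis-method idea can be modeled without larray as follows. This is only a sketch under stated assumptions: `regroup` and `aggregate` are hypothetical names, groups are assumed to be disjoint, and plain lists replace the axis and the array. The point is that once the partial specification is completed into a full grouping, any reduction works unchanged.

```python
from statistics import mean


def regroup(labels, groups):
    """Return (name, members) pairs covering every label of the axis.

    Explicit groups appear at the position of their first member; labels that
    belong to no group become single-label groups named after themselves.
    """
    group_of = {label: name for name, members in groups.items() for label in members}
    complete, seen = [], set()
    for label in labels:
        name = group_of.get(label)
        if name is None:
            complete.append((label, [label]))
        elif name not in seen:
            seen.add(name)
            complete.append((name, list(groups[name])))
    return complete


def aggregate(func, labels, values, complete_groups):
    """Apply any reduction `func` to each group of a complete grouping."""
    lookup = dict(zip(labels, values))
    return {name: func([lookup[m] for m in members]) for name, members in complete_groups}


labels = [f"a{i}" for i in range(10)]
values = list(range(10))
complete = regroup(labels, {"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]})
print(aggregate(sum, labels, values, complete))
# {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
print(aggregate(max, labels, values, complete)["a134"])
# 4
```

The same `complete` grouping can be passed with `mean` or any other reduction, which is the main attraction of putting the logic on the axis rather than adding one `partial_*` method per aggregate.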

gdementen · 2017-11-10T14:19:54Z

I don't think partial is clear enough. Is partial_grouping understandable enough?

>>> arr = ndtest(10)
>>> a = arr.a
>>> a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

gdementen · 2018-02-02T13:43:58Z

Maybe Axis.regroup()?

gdementen · 2018-03-27T07:28:48Z

When groupby is done, we will be able to do this via set_labels + groupby. That would be an improvement compared to the current situation, but maybe not good enough as it is still quite verbose and inefficient.

>>> arr = ndtest(10)
>>> arr.set_labels('a', {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
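
The two-step route can be mimicked in plain Python to show where the verbosity comes from. The helpers below (`expand_mapping`, `toy_set_labels`, `toy_groupby_sum`) are hypothetical stand-ins, not the larray methods of similar names: the relabeling step needs one entry per original label and produces duplicate labels, which the group-by step then merges.

```python
def expand_mapping(groups):
    """Turn {'a134': ['a1', 'a3', 'a4']} into {'a1': 'a134', 'a3': 'a134', 'a4': 'a134'}."""
    return {label: name for name, members in groups.items() for label in members}


def toy_set_labels(labels, mapping):
    """Rename labels; labels missing from `mapping` are kept. Duplicates are allowed."""
    return [mapping.get(label, label) for label in labels]


def toy_groupby_sum(labels, values):
    """Sum the values that share a label, keeping first-appearance order."""
    totals = {}
    for label, value in zip(labels, values):
        totals[label] = totals.get(label, 0) + value
    return totals


labels = [f"a{i}" for i in range(10)]
values = list(range(10))
mapping = expand_mapping({"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]})
relabeled = toy_set_labels(labels, mapping)
print(relabeled)
# ['a0', 'a134', 'a2', 'a134', 'a134', 'a5', 'a68', 'a7', 'a68', 'a9']
print(toy_groupby_sum(relabeled, values))
# {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
```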

gdementen · 2018-05-30T13:37:56Z

Other ideas:

support passing groups to set_labels (the last example requires "allow omitting axis (guess_axis) for set_labels", #634)

>>> arr = ndtest(10)
>>> # I like this because it simply generalizes what we already have. We might want to implement this regardless of this "partial grouping" feature
>>> arr.set_labels('a', {X.a['a1,a3,a4']: 'a134', X.a['a6,a8']: 'a68'}).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> # or even reuse an existing group label (this might be going too far?)
>>> arr.set_labels('a', (X.a['a1,a3,a4'] >> 'a134', X.a['a6,a8'] >> 'a68')).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> # ... but this would be practical
>>> arr.set_labels('a1,a3,a4 >> a134;a6,a8 >> a68').groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

use a keyword argument to aggregate functions (unsure about the name of the kwarg though)

>>> arr.sum('a1,a3,a4 >> a134;a6,a8 >> a68', partial_agg=True) # or "partial" or "keep_other" or ...
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This gets awkward when we want to combine partial and non partial aggregates.

use a "modifier attribute" to aggregate functions

>>> arr.partial.sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

use a "regroup" method on LArray (probably in combination with Axis.regroup)

>>> arr.regroup('a1,a3,a4 >> a134;a6,a8 >> a68').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> groups = arr.a.regroup('a1,a3,a4 >> a134;a6,a8 >> a68')
>>> arr.sum(groups)
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This is currently my preferred option, but (the way I see it) it would benefit from a Grid class: LArray.regroup() would return such an object. This Grid thing is more or less implemented in my local branch to implement #635.
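
For comparison, the keyword-argument and modifier-attribute ideas can be modeled on a toy class. Everything here is an assumption made for illustration: `ToyArray`, `_Partial` and the `keep_other` keyword are hypothetical, groups are assumed to be disjoint, and none of this is larray code. The modifier is just a proxy that forwards any aggregate call with the flag switched on.

```python
class ToyArray:
    """One-dimensional labeled array; a toy stand-in, not larray's LArray."""

    def __init__(self, labels, values):
        self.labels = list(labels)
        self.values = list(values)

    def _reduce(self, func, groups, keep_other):
        group_of = {label: name for name, members in groups.items() for label in members}
        buckets = {}
        for label, value in zip(self.labels, self.values):
            if label in group_of:
                buckets.setdefault(group_of[label], []).append(value)
            elif keep_other:
                # ungrouped labels survive only in "partial" mode
                buckets[label] = [value]
        return {name: func(vals) for name, vals in buckets.items()}

    def sum(self, groups, keep_other=False):
        return self._reduce(sum, groups, keep_other)

    def max(self, groups, keep_other=False):
        return self._reduce(max, groups, keep_other)

    @property
    def partial(self):
        return _Partial(self)


class _Partial:
    """Modifier: forwards any aggregate call with keep_other=True."""

    def __init__(self, array):
        self._array = array

    def __getattr__(self, name):
        method = getattr(self._array, name)
        return lambda groups: method(groups, keep_other=True)


arr = ToyArray([f"a{i}" for i in range(10)], range(10))
groups = {"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]}
print(arr.sum(groups))
# {'a134': 8, 'a68': 14}
print(arr.partial.sum(groups))
# {'a0': 0, 'a134': 8, 'a2': 2, 'a5': 5, 'a68': 14, 'a7': 7, 'a9': 9}
```

In this model the keyword has to be threaded through every aggregate method, while the proxy is written once and picks up new aggregates automatically.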

gdementen · 2019-01-09T10:14:04Z

Given that #635 and the Grid class are slow in coming, we might want to implement Axis.regroup already. It would be very easy to do and would already help our users quite a bit.

gdementen · 2019-01-30T09:39:46Z

Here is some hacky code I did for BM. The goal was to offer an API as close as possible to the future LArray.regroup without depending on the groupby feature:

import numpy as np

class RegrouperMethod(object):
    def __init__(self, array, name, groups):
        self.array = array
        self.name = name
        if not isinstance(groups, tuple):
            groups = (groups,)
        groups = tuple(array._prepare_aggregate(name, groups))
        assert len(groups) == 1, "regroup only supports groups on one axis so far"
        if not isinstance(groups[0], tuple):
            groups = (groups,)
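        new_groups = []
        for axis_groups in groups:
            axis = axis_groups[0].axis
            new_group = []
            # walk the axis in order so that the result keeps the original label order
            for l in axis:
                lfound = False
                for g in axis_groups:
                    first_elem = g[0] if isinstance(g.key, (tuple, list, np.ndarray, slice)) else g
                    if l in g:
                        lfound = True
                        # emit each group only once, at the position of its first label
                        if l == first_elem:
                            new_group.append(g)
                # labels which are not part of any group are kept as-is
                if not lfound:
                    new_group.append(l)
            new_groups.append(tuple(new_group))
        self.groups = tuple(new_groups)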

    def __call__(self, *args, **kwargs):
        args = self.groups + args
        return getattr(self.array, self.name)(*args, **kwargs)

class Regrouper(object):
    def __init__(self, array, groups):
        self.array = array
        self.groups = groups

    def __getattr__(self, attr):
        return RegrouperMethod(self.array, attr, self.groups)

def regroup(array, groups):
    return Regrouper(array, groups)

Usage is like this:

>>> arr = ndtest((3, 4))
>>> arr
a\b  b0  b1  b2  b3
 a0   0   1   2   3
 a1   4   5   6   7
 a2   8   9  10  11
>>> regroup(arr, 'b1,b3 >> b13').sum()
a\b  b0  b13  b2
 a0   0    4   2
 a1   4   12   6
 a2   8   20  10

alixdamman · 2019-01-30T09:52:57Z

If we implement the groupby feature one day, I wonder whether the existence of regroup will be confusing.
A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand?
Will regroup be interesting for other users?

gdementen · 2019-01-30T11:05:26Z

> If we implement the groupby feature one day,

It is not an if, it is a when. It is just a matter of me being back on larray code after dc2019 is done.

> I wonder whether the existence of regroup will be confusing.

It is always a tradeoff, but I think that in this case the benefits outweigh the costs.

> A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand?

You know the answer to this question: it is obviously no.

> Will regroup be interesting for other users?

Yes, it is a very common need, at least in our institution.

gdementen · 2022-10-25T14:45:49Z

I stumbled on the need with a slight variation: amg had to regroup "parts" of some combined axes. I wrote two different versions to solve her problem: a more limited but more efficient one, and a more general but less efficient one. The limited one handles only prefixes (i.e. the first part of the combined axis). The second one works for any "part" of the combined axis, but it splits the axis, does the aggregate, then recombines the axes.

from larray import X, isnan, ndtest

def sum_prefixes(array, axis, prefixes, combined_prefix, sep='_'):
    axis = array.axes[axis]
    all_prefixes, suffixes = axis.split(sep=sep)
    starts_with_prefixes = axis.startingwith(prefixes[0])
    for prefix in prefixes[1:]:
        starts_with_prefixes = starts_with_prefixes.union(axis.startingwith(prefix))
    aggregated_groups = tuple(starts_with_prefixes.endingwith(s) >> f'{combined_prefix}{sep}{s}' for s in suffixes)
    other_groups = tuple(axis[:].difference(starts_with_prefixes))
    return array.sum(aggregated_groups + other_groups)

def split_axes_sum(array, combined_axis, group, sep='_'):
    orig_combined_axis = array.axes[combined_axis]
    split_axes = orig_combined_axis.split(sep=sep)
    split_array = array.split_axes(combined_axis, sep=sep)
    split_axis = split_array.axes[group.axis]
    nans = isnan(split_array)
    added_labels = nans[nans].axes[combined_axis]
    agg_array = split_array.sum((group,) + tuple(split_axis[:].difference(group)))
    combined_array = agg_array.combine_axes(split_axes)
    new_combined_axis = combined_array.axes[combined_axis]
    return combined_array.drop(added_labels.intersection(new_combined_axis))

>>> arr = ndtest('a_b=BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B')
>>> sum_prefixes(arr, 'a_b', ['BR', 'WA', 'FL'], 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.a['BR, WA, FL'] >> 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.b['B, C'] >> 'BC')
a_b  BR_BC  BR_A  WA_BC  FL_BC  FL_D  FR_BC  FR_A  DE_BC
       1.0   0.0    5.0    4.0   5.0    0.0   6.0    7.0
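
The prefix case boils down to rewriting each combined label and summing the labels that collide. Here is a pure-Python sketch of that logic, assuming plain lists instead of an array; `sum_prefixes_toy` is a hypothetical name and it does not reproduce the larray implementation above, which places the aggregated groups first.

```python
def sum_prefixes_toy(labels, values, prefixes, combined_prefix, sep="_"):
    """Merge every label whose prefix is in `prefixes` under `combined_prefix`."""
    totals = {}
    for label, value in zip(labels, values):
        prefix, _, suffix = label.partition(sep)
        if prefix in prefixes:
            label = f"{combined_prefix}{sep}{suffix}"
        # labels with other prefixes are accumulated under their own name
        totals[label] = totals.get(label, 0) + value
    return totals


labels = ["BR_A", "BR_B", "WA_B", "WA_C", "FL_C", "FL_D", "FR_A", "DE_B"]
values = list(range(8))
print(sum_prefixes_toy(labels, values, ["BR", "WA", "FL"], "BE"))
# {'BE_A': 0, 'BE_B': 3, 'BE_C': 7, 'BE_D': 5, 'FR_A': 6, 'DE_B': 7}
```

The output order here is the order of first appearance, which happens to match the larray result for this data only because all merged labels come before the untouched ones.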

This could one day be solved via some kind of pattern syntax, but it is hard to imagine something that is powerful enough and still readable:

>>> arr.sum('a_b[BR_{prod:*}, WA_{prod:*}, FL_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_*] >> BE_*')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
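
One way to picture such a pattern syntax is as a thin layer over regular expressions. The sketch below is purely speculative: `compile_pattern` and `rename` are hypothetical helpers, the `{name:*}` placeholder is taken from the second example above, and anything else in the pattern (such as `(BR|WA|FL)`) is passed to `re` as-is.

```python
import re


def compile_pattern(pattern):
    """Translate '{name:*}' placeholders into named regex groups."""
    # everything outside the placeholders is deliberately left as regex syntax
    regex = re.sub(r"\{(\w+):\*\}", r"(?P<\1>.+)", pattern)
    return re.compile(f"^{regex}$")


def rename(label, pattern, target):
    """Return the new label if `label` matches `pattern`, else the label itself."""
    match = compile_pattern(pattern).match(label)
    return target.format(**match.groupdict()) if match else label


labels = ["BR_A", "BR_B", "WA_B", "WA_C", "FL_C", "FL_D", "FR_A", "DE_B"]
print([rename(label, "(BR|WA|FL)_{prod:*}", "BE_{prod}") for label in labels])
# ['BE_A', 'BE_B', 'BE_B', 'BE_C', 'BE_C', 'BE_D', 'FR_A', 'DE_B']
```

Summing the values that end up under the same new label then gives the aggregate; the hard part remains the design of a syntax that users can read, not the matching itself.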

gdementen added the enhancement label Aug 23, 2017

gdementen added the priority: high label Oct 20, 2017

alixdamman added this to the 0.29 milestone Dec 18, 2017

alixdamman modified the milestones: 0.29, 0.30 Mar 16, 2018

alixdamman modified the milestones: 0.30, 0.31 Jul 18, 2018

gdementen changed the title ~~implement partial aggregates~~ implement partial aggregates (LArray.regroup and Axis.regroup) Jan 9, 2019

gdementen removed this from the 0.31 milestone Aug 1, 2019

alixdamman added this to the nice_to_have milestone Oct 10, 2019

gdementen removed this from the nice_to_have milestone Nov 14, 2019

gdementen mentioned this issue Dec 23, 2020

generalize set_labels to support group keys #906

Open
