implement partial aggregates (LArray.regroup and Axis.regroup) #361

Open
gdementen opened this issue Aug 23, 2017 · 12 comments

Comments

@gdementen
Contributor

gdementen commented Aug 23, 2017

Implement an easier way to aggregate only part of an axis and leave other labels intact:

>>> a = ndtest(10)
>>> a.sum('a0;a1..a3 >> a13;a4;a5;a6;a7;a8;a9')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9
@gdementen
Contributor Author

gdementen commented Oct 27, 2017

Technically, that should not be too hard (*), but I am unsure about the syntax:

>>> a.partial_agg(sum, 'a1..a3 >> a13')
>>> a.partial_sum('a1..a3 >> a13')
a  a0  a13  a4  a5  a6  a7  a8  a9
    0    6   4   5   6   7   8   9
>>> a.partial_mean('a1..a3 >> a13')
a   a0  a13   a4   a5   a6   a7   a8   a9
   0.0  2.0  4.0  5.0  6.0  7.0  8.0  9.0

(*) either create the group explicitly like above, or split the array using LArray.split()
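
As a rough illustration of the split-based route mentioned in the footnote, here is a minimal pure-Python sketch that does not use larray at all. `partial_agg_range` and its parameters are hypothetical names chosen for this example; it only handles a single contiguous range.

```python
from statistics import mean


def partial_agg_range(labels, values, first, last, new_label, func=sum):
    """Aggregate the contiguous label range first..last (inclusive) into
    new_label and leave every other label untouched (hypothetical helper)."""
    start = labels.index(first)
    stop = labels.index(last) + 1
    # split into three chunks: before the range, the range itself, after it
    new_labels = labels[:start] + [new_label] + labels[stop:]
    new_values = values[:start] + [func(values[start:stop])] + values[stop:]
    return new_labels, new_values


labels = [f"a{i}" for i in range(10)]
values = list(range(10))  # same data as ndtest(10)

sum_labels, sums = partial_agg_range(labels, values, "a1", "a3", "a13")
mean_labels, means = partial_agg_range(labels, values, "a1", "a3", "a13", func=mean)
print(sum_labels)
print(sums)
```

Passing the aggregate as `func` mirrors the `partial_agg(sum, ...)` spelling; the per-function spellings (`partial_sum`, `partial_mean`) would just be thin wrappers around it.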

@gdementen
Contributor Author

gdementen commented Oct 27, 2017

Note that we must also support arbitrary (non-contiguous) groups and (maybe) overlapping groups, which will make an implementation via .split()/.chunks() mostly impossible:

>>> a.partial_sum('a1..a3 >> a13;a6..a8 >> a68')
a  a0  a13  a4  a5  a68  a9
    0    6   4   5   21   9
>>> a.partial_sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
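
To make the non-contiguous and overlapping cases concrete, here is a pure-Python sketch based on group membership instead of splitting. `partial_agg` is a hypothetical name, and the overlap behaviour (a shared label contributes to every group that lists it) is an assumption, since the thread leaves that case open.

```python
def partial_agg(labels, values, groups, func=sum):
    """groups maps each new label to the original labels it aggregates.
    A new label takes the position of its first member; labels which
    belong to no group are kept as they are (hypothetical helper)."""
    by_label = dict(zip(labels, values))
    # new labels indexed by the original label at which they must be emitted
    first_of = {}
    for name, members in groups.items():
        first_of.setdefault(members[0], []).append(name)
    grouped = {m for members in groups.values() for m in members}
    out_labels, out_values = [], []
    for label in labels:
        for name in first_of.get(label, []):
            out_labels.append(name)
            out_values.append(func([by_label[m] for m in groups[name]]))
        if label not in grouped:
            out_labels.append(label)
            out_values.append(by_label[label])
    return out_labels, out_values


labels = [f"a{i}" for i in range(10)]
values = list(range(10))

# non-contiguous groups, as in the second example above
nc_labels, nc_values = partial_agg(
    labels, values, {"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]})

# overlapping groups: a3 is counted in both a13 and a35 (assumed semantics)
ov_labels, ov_values = partial_agg(
    labels, values, {"a13": ["a1", "a2", "a3"], "a35": ["a3", "a4", "a5"]})
```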

@gdementen
Contributor Author

gdementen commented Nov 1, 2017

Now that I think of it, it might be better to implement this as a method on Axis, so that we do not have to define extra aggregate methods and it works out of the box for any aggregate. The difficulty in that case is to find a good name for the method:

>>> arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This seems technically interesting, but it is not very readable, nor is it obvious what it means.

@gdementen
Contributor Author

I don't think partial is clear enough. Is partial_grouping understandable enough?

>>> arr = ndtest(10)
>>> a = arr.a
>>> a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
 a['a1', 'a3', 'a4'] >> 'a134',
 a['a2'],
 a['a5'],
 a['a6', 'a8'] >> 'a68',
 a['a7'],
 a['a9'])
>>> arr.sum(a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68'))
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
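
The appeal of the Axis-level method is that the grouping is computed once and can then be handed to any aggregate. A small pure-Python sketch of that separation follows; `partial_grouping` and `aggregate` here are hypothetical stand-ins working on plain lists, not larray methods.

```python
def partial_grouping(labels, groups):
    """Return a complete grouping of the axis as (name, members) pairs:
    each requested group sits at the position of its first member and
    every ungrouped label becomes a singleton group."""
    grouped = {m for members in groups.values() for m in members}
    grouping = []
    for label in labels:
        for name, members in groups.items():
            if members[0] == label:
                grouping.append((name, tuple(members)))
        if label not in grouped:
            grouping.append((label, (label,)))
    return grouping


def aggregate(labels, values, grouping, func):
    """Apply any aggregate function to each group of a precomputed grouping."""
    by_label = dict(zip(labels, values))
    return {name: func(by_label[m] for m in members) for name, members in grouping}


labels = [f"a{i}" for i in range(10)]
values = list(range(10))
grouping = partial_grouping(labels, {"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]})

totals = aggregate(labels, values, grouping, sum)
maxima = aggregate(labels, values, grouping, max)
```

Because the grouping is just data, no `partial_sum`, `partial_max`, etc. variants are needed; the same `grouping` works with every aggregate.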

@alixdamman alixdamman added this to the 0.29 milestone Dec 18, 2017
@gdementen
Contributor Author

Maybe Axis.regroup()?

@alixdamman alixdamman modified the milestones: 0.29, 0.30 Mar 16, 2018
@gdementen
Contributor Author

When groupby is done, we will be able to do this via set_labels + groupby. That would be an improvement compared to the current situation, but maybe not good enough as it is still quite verbose and inefficient.

>>> arr = ndtest(10)
>>> arr.set_labels('a', {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
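
For readers unfamiliar with the relabel-then-group idea, here is a pure-Python sketch of what "set_labels + groupby" amounts to. It does not use larray (groupby does not exist there yet at this point in the thread); `relabel_then_groupby_sum` is a hypothetical helper, and the mapping is derived from compact groups to show where the verbosity comes from.

```python
def relabel_then_groupby_sum(labels, values, mapping):
    """Rename labels through mapping, then sum the values which end up
    sharing a label. Output order is the order of first appearance."""
    totals = {}
    for label, value in zip(labels, values):
        new_label = mapping.get(label, label)
        totals[new_label] = totals.get(new_label, 0) + value
    return totals


labels = [f"a{i}" for i in range(10)]
values = list(range(10))

# the explicit per-label mapping is what makes this approach verbose;
# it can be derived from the compact "group name -> members" form
groups = {"a134": ["a1", "a3", "a4"], "a68": ["a6", "a8"]}
mapping = {member: name for name, members in groups.items() for member in members}

result = relabel_then_groupby_sum(labels, values, mapping)
print(result)
```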

@gdementen
Contributor Author

gdementen commented May 30, 2018

Other ideas:

>>> arr = ndtest(10)
>>> # I like this, because it simply generalizes what we already have. We might want to implement this regardless of this "partial grouping" feature
>>> arr.set_labels('a', {X.a['a1,a3,a4']: 'a134', X.a['a6,a8']: 'a68'}).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> # or even reuse an existing group label (this might be going too far?)
>>> arr.set_labels('a', (X.a['a1,a3,a4'] >> 'a134', X.a['a6,a8'] >> 'a68')).groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> # ... but this would be practical
>>> arr.set_labels('a1,a3,a4 >> a134;a6,a8 >> a68').groupby('a').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
  • use a keyword argument to aggregate functions (unsure about the name of the kwarg though)
>>> arr.sum('a1,a3,a4 >> a134;a6,a8 >> a68', partial_agg=True) # or "partial" or "keep_other" or ...
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This gets awkward when we want to combine partial and non-partial aggregates.

  • use a "modifier attribute" to aggregate functions
>>> arr.partial.sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
  • use a "regroup" method on LArray (probably in combination with Axis.regroup)
>>> arr.regroup('a1,a3,a4 >> a134;a6,a8 >> a68').sum()
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9
>>> groups = arr.a.regroup('a1,a3,a4 >> a134;a6,a8 >> a68')
>>> arr.sum(groups)
a  a0  a134  a2  a5  a68  a7  a9
    0     8   2   5   14   7   9

This is my currently preferred option, but (the way I see it) it would benefit from a Grid class: LArray.regroup() would return such an object. This Grid thing is more or less implemented in my local branch to implement #635.

@alixdamman alixdamman modified the milestones: 0.30, 0.31 Jul 18, 2018
@gdementen gdementen changed the title implement partial aggregates implement partial aggregates (LArray.regroup and Axis.regroup) Jan 9, 2019
@gdementen
Contributor Author

Given that #635 and the Grid class are slow in coming, we might want to implement Axis.regroup already, which would be very easy to do and would already help our users quite a bit.

@gdementen
Contributor Author

Here is some hacky code I did for BM. The goal was to offer an API as close as possible to the future LArray.regroup without depending on the groupby feature:

import numpy as np


class RegrouperMethod(object):
    def __init__(self, array, name, groups):
        self.array = array
        self.name = name
        if not isinstance(groups, tuple):
            groups = (groups,)
        # note: _prepare_aggregate is a private LArray method, hence the "hacky"
        groups = tuple(array._prepare_aggregate(name, groups))
        assert len(groups) == 1, "regroup only supports groups on one axis so far"
        if not isinstance(groups[0], tuple):
            groups = (groups,)
        new_groups = []
        for axis_groups in groups:
            axis = axis_groups[0].axis
            new_group = []
            # walk the axis in order: emit each group where its first label
            # appears and keep labels which belong to no group as they are
            for l in axis:
                lfound = False
                for g in axis_groups:
                    first_elem = g[0] if isinstance(g.key, (tuple, list, np.ndarray, slice)) else g
                    if l in g:
                        lfound = True
                        if l == first_elem:
                            new_group.append(g)
                if not lfound:
                    new_group.append(l)
            new_groups.append(tuple(new_group))
        self.groups = tuple(new_groups)

    def __call__(self, *args, **kwargs):
        args = self.groups + args
        return getattr(self.array, self.name)(*args, **kwargs)

class Regrouper(object):
    def __init__(self, array, groups):
        self.array = array
        self.groups = groups

    def __getattr__(self, attr):
        return RegrouperMethod(self.array, attr, self.groups)

def regroup(array, groups):
    return Regrouper(array, groups)

Usage is like this:

>>> arr = ndtest((3, 4))
>>> arr
a\b  b0  b1  b2  b3
 a0   0   1   2   3
 a1   4   5   6   7
 a2   8   9  10  11
>>> regroup(arr, 'b1,b3 >> b13').sum()
a\b  b0  b13  b2
 a0   0    4   2
 a1   4   12   6
 a2   8   20  10

@alixdamman
Collaborator

If we implement the groupby feature one day, I wonder if the existence of regroup will not be confusing.
A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand?
Will regroup be interesting for other users?

@gdementen
Contributor Author

gdementen commented Jan 30, 2019

If we implement the groupby feature one day,

It is not an if, it is a when. It is just a matter of me being back on larray code after dc2019 is done.

I wonder if the existence of regroup will not be confusing.

It is always a tradeoff, but I think that in this case the benefits outweigh the costs.

A more general question is: do we need to take the risk of making LArray's API incomprehensible by including each specific demand?

You know the answer to this question: it is obviously no.

Will regroup be interesting for other users?

Yes, it is a very common need, at least in our institution.

@gdementen gdementen removed this from the 0.31 milestone Aug 1, 2019
@alixdamman alixdamman added this to the nice_to_have milestone Oct 10, 2019
@gdementen gdementen removed this from the nice_to_have milestone Nov 14, 2019
@gdementen
Contributor Author

I stumbled on the need with a slight variation: amg had to regroup "parts" of some combined axes. I wrote two different versions to solve her problem: a more limited but more efficient one, and a more general but less efficient one. The limited one handles only prefixes (i.e. the first part of the combined axis). The general one works for any "part" of the combined axis, but it splits the axis, does the aggregate, then recombines the axes.

def sum_prefixes(array, axis, prefixes, combined_prefix, sep='_'):
    axis = array.axes[axis]
    all_prefixes, suffixes = axis.split(sep=sep)
    starts_with_prefixes = axis.startingwith(prefixes[0])
    for prefix in prefixes[1:]:
        starts_with_prefixes = starts_with_prefixes.union(axis.startingwith(prefix))
    aggregated_groups = tuple(starts_with_prefixes.endingwith(s) >> f'{combined_prefix}{sep}{s}' for s in suffixes)
    other_groups = tuple(axis[:].difference(starts_with_prefixes))
    return array.sum(aggregated_groups + other_groups)

def split_axes_sum(array, combined_axis, group, sep='_'):
    orig_combined_axis = array.axes[combined_axis]
    split_axes = orig_combined_axis.split(sep=sep)
    split_array = array.split_axes(combined_axis, sep=sep)
    split_axis = split_array.axes[group.axis]
    nans = isnan(split_array)
    added_labels = nans[nans].axes[combined_axis]
    agg_array = split_array.sum((group,) + tuple(split_axis[:].difference(group)))
    combined_array = agg_array.combine_axes(split_axes)
    new_combined_axis = combined_array.axes[combined_axis]
    return combined_array.drop(added_labels.intersection(new_combined_axis))

>>> arr = ndtest('a_b=BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B')
>>> sum_prefixes(arr, 'a_b', ['BR', 'WA', 'FL'], 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.a['BR, WA, FL'] >> 'BE')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> split_axes_sum(arr, 'a_b', X.b['B, C'] >> 'BC')
a_b  BR_BC  BR_A  WA_BC  FL_BC  FL_D  FR_BC  FR_A  DE_BC
       1.0   0.0    5.0    4.0   5.0    0.0   6.0    7.0
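
To show the underlying idea of both helpers without larray, here is a pure-Python sketch that regroups one "part" of each combined label directly on the label strings. `regroup_part` and its parameters are hypothetical. Since it never builds the dense split array, it needs no NaN handling, and for the suffix case it may not reproduce the exact label set and order shown above.

```python
def regroup_part(labels, values, part, members, new_label, sep="_"):
    """Sum the entries whose combined label has one of `members` at
    position `part` (0 = prefix), renaming that part to `new_label`.
    Other labels are kept; order is the order of first appearance."""
    totals = {}
    for label, value in zip(labels, values):
        parts = label.split(sep)
        if parts[part] in members:
            parts[part] = new_label
        key = sep.join(parts)
        totals[key] = totals.get(key, 0) + value
    return totals


labels = "BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B".split(",")
values = list(range(8))  # same data as the ndtest call above

by_prefix = regroup_part(labels, values, 0, {"BR", "WA", "FL"}, "BE")
by_suffix = regroup_part(labels, values, 1, {"B", "C"}, "BC")
print(by_prefix)
print(by_suffix)
```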

This could one day be solved via some kind of pattern syntax, but it is hard to imagine something powerful enough and still readable:

>>> arr.sum('a_b[BR_{prod:*}, WA_{prod:*}, FL_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
>>> arr.sum('a_b[(BR|WA|FL)_*] >> BE_*')
a_b  BE_A  BE_B  BE_C  BE_D  FR_A  DE_B
      0.0   3.0   7.0   5.0   6.0   7.0
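
One way to get a feel for such a pattern syntax is to emulate it with ordinary regular expressions. The sketch below is pure Python using the standard `re` module, not a proposal for the actual larray syntax; `sum_by_pattern` is a hypothetical helper, and the named group `prod` plays the role of `{prod:*}` in the strings above.

```python
import re


def sum_by_pattern(labels, values, pattern, replacement):
    """Rename every label fully matching `pattern` using the `replacement`
    template, then sum the values which end up sharing a label."""
    regex = re.compile(pattern)
    totals = {}
    for label, value in zip(labels, values):
        match = regex.fullmatch(label)
        key = match.expand(replacement) if match else label
        totals[key] = totals.get(key, 0) + value
    return totals


labels = "BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B".split(",")
values = list(range(8))

# rough regex equivalent of 'a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}'
result = sum_by_pattern(labels, values, r"(?:BR|WA|FL)_(?P<prod>.*)", r"BE_\g<prod>")
print(result)
```

The regex version is powerful but arguably illustrates the readability concern: the capture and back-reference syntax is noisier than any of the three candidate spellings.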
