implement partial aggregates (LArray.regroup and Axis.regroup) #361
Technically, that should not be too hard (*), but I am unsure about the syntax:
>>> a.partial_agg(sum, 'a1..a3 >> a13')
>>> a.partial_sum('a1..a3 >> a13')
a a0 a13 a4 a5 a6 a7 a8 a9
0 6 4 5 6 7 8 9
>>> a.partial_mean('a1..a3 >> a13')
a a0 a13 a4 a5 a6 a7 a8 a9
0.0 2.0 4.0 5.0 6.0 7.0 8.0 9.0

(*) Note that we must also support arbitrary (non-contiguous) groups and (maybe) overlapping groups, which will make an implementation via .split()/.chunks() mostly impossible:
>>> a.partial_sum('a1..a3 >> a13;a6..a8 >> a68')
a a0 a13 a4 a5 a68 a9
0 6 4 5 21 9
>>> a.partial_sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9 |
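To make the intended semantics concrete, here is a minimal pure-Python sketch of a partial aggregate over one axis. It is not larray code: `partial_aggregate` and its arguments are hypothetical names used only for illustration, and it assumes each aggregated label is emitted at the position of the first member of its group in axis order, which is what the outputs above suggest.

```python
def partial_aggregate(labels, values, groups, func=sum):
    """Aggregate only the labels listed in `groups`; keep all others as-is.

    `groups` maps a new label to the list of existing labels it replaces.
    Groups may be non-contiguous and may overlap.
    """
    by_label = dict(zip(labels, values))
    # For each grouped label, record the name(s) of the group(s) it belongs to.
    owners = {}
    for name, members in groups.items():
        for member in members:
            owners.setdefault(member, []).append(name)

    out_labels, out_values, emitted = [], [], set()
    for label in labels:
        if label not in owners:
            # Label is not part of any group: leave it intact.
            out_labels.append(label)
            out_values.append(by_label[label])
            continue
        for name in owners[label]:
            # Emit each group once, where its first member appears on the axis.
            if name not in emitted:
                emitted.add(name)
                out_labels.append(name)
                out_values.append(func([by_label[m] for m in groups[name]]))
    return out_labels, out_values


labels = [f'a{i}' for i in range(10)]
values = list(range(10))
print(partial_aggregate(labels, values,
                        {'a134': ['a1', 'a3', 'a4'], 'a68': ['a6', 'a8']}))
```

Because membership is looked up per label rather than by slicing the axis, non-contiguous and overlapping groups need no special handling, which is the part a .split()/.chunks() approach would struggle with.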
Now that I think of it, it might be better to implement this as a method on Axis, so that we do not have to define extra aggregate methods and it works out of the box for any aggregate. The difficulty in that case is to find a good name for the method:
>>> arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
a['a1', 'a3', 'a4'] >> 'a134',
a['a2'],
a['a5'],
a['a6', 'a8'] >> 'a68',
a['a7'],
a['a9'])
>>> arr.sum(arr.a.partial('a1,a3,a4 >> a134;a6,a8 >> a68'))
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This seems technically interesting, but it is not very readable, and what it means is not obvious. |
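As a rough illustration of what such an Axis method would have to compute, the following pure-Python sketch expands a partial group specification into one entry per output label. `parse_groups` and `complete_grouping` are hypothetical helpers, not larray functions; the parser only understands the `label,label >> name;...` form and does not handle ranges such as `a1..a3`.

```python
def parse_groups(spec):
    """Parse 'a1,a3,a4 >> a134;a6,a8 >> a68' into [(name, [labels]), ...]."""
    parsed = []
    for part in spec.split(';'):
        members, name = part.split('>>')
        parsed.append((name.strip(), [m.strip() for m in members.split(',')]))
    return parsed


def complete_grouping(axis_labels, spec):
    """Return one entry per output label.

    An entry is either a plain label (left intact) or a (name, members)
    pair, placed where the first listed member of the group sits on the axis.
    """
    groups = parse_groups(spec)
    result = []
    for label in axis_labels:
        containing = [g for g in groups if label in g[1]]
        if not containing:
            result.append(label)
        for name, members in containing:
            if label == members[0]:
                result.append((name, members))
    return result


axis_labels = [f'a{i}' for i in range(10)]
for entry in complete_grouping(axis_labels, 'a1,a3,a4 >> a134;a6,a8 >> a68'):
    print(entry)
```

The returned sequence has the same shape as the tuple shown above, so it could be passed unchanged to any aggregate that accepts a list of groups.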
I don't think >>> arr = ndtest(10)
>>> a = arr.a
>>> a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68')
(a['a0'],
a['a1', 'a3', 'a4'] >> 'a134',
a['a2'],
a['a5'],
a['a6', 'a8'] >> 'a68',
a['a7'],
a['a9'])
>>> arr.sum(a.partial_grouping('a1,a3,a4 >> a134;a6,a8 >> a68'))
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9 |
Maybe Axis.regroup()? |
When groupby is done, we will be able to do this via set_labels + groupby. That would be an improvement compared to the current situation, but maybe not good enough, as it is still quite verbose and inefficient.
>>> arr = ndtest(10)
>>> arr.set_labels('a', {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9 |
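The relabel-then-group idea can be sketched without larray as follows. `relabel_and_group` is a hypothetical helper, and the sketch assumes that grouping keeps labels in order of first appearance, which is what the output above shows.

```python
def relabel_and_group(labels, values, mapping, func=sum):
    """Rename labels via `mapping`, then aggregate values sharing a label."""
    renamed = [mapping.get(label, label) for label in labels]
    buckets = {}  # dicts keep insertion order, so first appearance wins
    for label, value in zip(renamed, values):
        buckets.setdefault(label, []).append(value)
    return list(buckets), [func(v) for v in buckets.values()]


labels = [f'a{i}' for i in range(10)]
values = list(range(10))
mapping = {'a1': 'a134', 'a3': 'a134', 'a4': 'a134', 'a6': 'a68', 'a8': 'a68'}
print(relabel_and_group(labels, values, mapping))
```

One limitation of this route is visible in the sketch: each original label maps to exactly one new label, so overlapping groups cannot be expressed.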
Other ideas:
>>> arr = ndtest(10)
>>> # I like this, because it simply generalizes what we already have. We might want to implement this regardless of this "partial grouping" feature
>>> arr.set_labels('a', {X.a['a1,a3,a4']: 'a134', X.a['a6,a8']: 'a68'}).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> # or even reuse an existing group label (this might be going too far?)
>>> arr.set_labels('a', (X.a['a1,a3,a4'] >> 'a134', X.a['a6,a8'] >> 'a68')).groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> # ... but this would be practical
>>> arr.set_labels('a1,a3,a4 >> a134;a6,a8 >> a68').groupby('a').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> arr.sum('a1,a3,a4 >> a134;a6,a8 >> a68', partial_agg=True) # or "partial" or "keep_other" or ...
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This gets awkward when we want to combine partial and non-partial aggregates.
>>> arr.partial.sum('a1,a3,a4 >> a134;a6,a8 >> a68')
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> arr.regroup('a1,a3,a4 >> a134;a6,a8 >> a68').sum()
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
>>> groups = arr.a.regroup('a1,a3,a4 >> a134;a6,a8 >> a68')
>>> arr.sum(groups)
a a0 a134 a2 a5 a68 a7 a9
0 8 2 5 14 7 9
This is my currently preferred option, but (the way I see it) it would benefit from a Grid class: LArray.regroup() would return such an object. This Grid thing is more or less implemented in my local branch to implement #635. |
Given that #635 and the Grid class are slow in coming, we might want to implement Axis.regroup already, which would be very easy to do and would already help our users quite a bit. |
Here is some hacky code I did for BM. The goal was to offer an API as close as possible to the future LArray.regroup without depending on the groupby feature:

import numpy as np


class RegrouperMethod(object):
    def __init__(self, array, name, groups):
        self.array = array
        self.name = name
        if not isinstance(groups, tuple):
            groups = (groups,)
        # let larray translate the groups (strings, etc.) to real groups
        groups = tuple(array._prepare_aggregate(name, groups))
        assert len(groups) == 1, "regroup only supports groups on one axis so far"
        if not isinstance(groups[0], tuple):
            groups = (groups,)
        new_groups = []
        for axis_groups in groups:
            axis = axis_groups[0].axis
            new_group = []
            for l in axis:
                lfound = False
                for g in axis_groups:
                    first_elem = g[0] if isinstance(g.key, (tuple, list, np.ndarray, slice)) else g
                    if l in g:
                        lfound = True
                        # insert the group where its first label is on the axis
                        if l == first_elem:
                            new_group.append(g)
                # labels which are not part of any group are kept intact
                if not lfound:
                    new_group.append(l)
            new_groups.append(tuple(new_group))
        self.groups = tuple(new_groups)

    def __call__(self, *args, **kwargs):
        args = self.groups + args
        return getattr(self.array, self.name)(*args, **kwargs)


class Regrouper(object):
    def __init__(self, array, groups):
        self.array = array
        self.groups = groups

    def __getattr__(self, attr):
        return RegrouperMethod(self.array, attr, self.groups)


def regroup(array, groups):
    return Regrouper(array, groups)

Usage is like this:
>>> arr = ndtest((3, 4))
>>> arr
a\b b0 b1 b2 b3
a0 0 1 2 3
a1 4 5 6 7
a2 8 9 10 11
>>> regroup(arr, 'b1,b3 >> b13').sum()
a\b b0 b13 b2
a0 0 4 2
a1 4 12 6
a2 8 20 10 |
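The forwarding trick used above (an object that stores the completed groups and prepends them to whichever aggregate method is called) can be seen in isolation in this small sketch. `ToySeries` and `ToyRegrouper` are hypothetical stand-ins written for this illustration; they do not use or mimic the real larray classes beyond the general idea.

```python
class ToySeries:
    """Hypothetical 1-D labelled container; not an larray class."""

    def __init__(self, labels, values):
        self.data = dict(zip(labels, values))

    def _aggregate(self, func, groups):
        # Each group is either a single label or a (name, members) pair.
        out = {}
        for group in groups:
            if isinstance(group, tuple):
                name, members = group
                out[name] = func([self.data[m] for m in members])
            else:
                out[group] = self.data[group]
        return out

    def sum(self, *groups):
        return self._aggregate(sum, groups)

    def max(self, *groups):
        return self._aggregate(max, groups)


class ToyRegrouper:
    """Stores the completed groups and forwards any aggregate method."""

    def __init__(self, series, groups):
        self.series = series
        self.groups = tuple(groups)

    def __getattr__(self, name):
        # Only called for attributes not found normally (e.g. sum, max).
        method = getattr(self.series, name)
        # Prepend the stored groups to whatever the caller passes.
        return lambda *args, **kwargs: method(*(self.groups + args), **kwargs)


row = ToySeries(['b0', 'b1', 'b2', 'b3'], [0, 1, 2, 3])
regrouped = ToyRegrouper(row, ['b0', ('b13', ['b1', 'b3']), 'b2'])
print(regrouped.sum())
print(regrouped.max())
```

The benefit of this design is that no per-aggregate method (partial_sum, partial_mean, ...) has to be written: any method of the wrapped object that accepts groups works through the same proxy.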
If we implement the |
It is not an if, it is a when. It is just a matter of me being back on larray code after dc2019 is done.
It is always a tradeoff, but I think that in this case the benefits outweigh the costs.
You know the answer to this question: it is obviously no.
Yes, it is a very common need, at least in our institution. |
I stumbled on the need with a slight variation: amg had to regroup "parts" of some combined axes. I wrote two different versions to solve her problem: a more limited but more efficient one, and a more general but less efficient one. The limited one handles only prefixes (i.e. the first part of the combined axis). The second one works for any "part" of the combined axis, but it splits the axis, does the aggregate, then recombines the axes.

def sum_prefixes(array, axis, prefixes, combined_prefix, sep='_'):
    axis = array.axes[axis]
    all_prefixes, suffixes = axis.split(sep=sep)
    # union of all labels starting with any of the given prefixes
    starts_with_prefixes = axis.startingwith(prefixes[0])
    for prefix in prefixes[1:]:
        starts_with_prefixes = starts_with_prefixes.union(axis.startingwith(prefix))
    # one aggregated group per suffix, named <combined_prefix><sep><suffix>
    aggregated_groups = tuple(starts_with_prefixes.endingwith(s) >> f'{combined_prefix}{sep}{s}' for s in suffixes)
    other_groups = tuple(axis[:].difference(starts_with_prefixes))
    return array.sum(aggregated_groups + other_groups)


def split_axes_sum(array, combined_axis, group, sep='_'):
    orig_combined_axis = array.axes[combined_axis]
    split_axes = orig_combined_axis.split(sep=sep)
    split_array = array.split_axes(combined_axis, sep=sep)
    split_axis = split_array.axes[group.axis]
    # combinations which did not exist on the original combined axis are NaN after the split
    nans = isnan(split_array)
    added_labels = nans[nans].axes[combined_axis]
    agg_array = split_array.sum((group,) + tuple(split_axis[:].difference(group)))
    combined_array = agg_array.combine_axes(split_axes)
    new_combined_axis = combined_array.axes[combined_axis]
    # drop the combinations which were only introduced by the split
    return combined_array.drop(added_labels.intersection(new_combined_axis))

>>> arr = ndtest('a_b=BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B')
>>> sum_prefixes(arr, 'a_b', ['BR', 'WA', 'FL'], 'BE')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> split_axes_sum(arr, 'a_b', X.a['BR, WA, FL'] >> 'BE')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> split_axes_sum(arr, 'a_b', X.b['B, C'] >> 'BC')
a_b BR_BC BR_A WA_BC FL_BC FL_D FR_BC FR_A DE_BC
1.0 0.0 5.0 4.0 5.0 0.0 6.0 7.0
This could one day be solved via some kind of pattern syntax, but it is hard to imagine something powerful enough and still readable:
>>> arr.sum('a_b[BR_{prod:*}, WA_{prod:*}, FL_{prod:*}] >> BE_{prod}')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> arr.sum('a_b[(BR|WA|FL)_{prod:*}] >> BE_{prod}')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0
>>> arr.sum('a_b[(BR|WA|FL)_*] >> BE_*')
a_b BE_A BE_B BE_C BE_D FR_A DE_B
0.0 3.0 7.0 5.0 6.0 7.0 |
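Until such a syntax exists, the same regrouping can be approximated with ordinary regular expressions. The sketch below is pure Python and independent of larray; `regroup_by_pattern` is a hypothetical helper, and the regex notation is standard `re` syntax rather than the pattern syntax imagined above.

```python
import re


def regroup_by_pattern(labels, values, pattern, replacement, func=sum):
    """Rename labels fully matching `pattern`, then aggregate equal labels."""
    regex = re.compile(pattern)
    buckets = {}  # insertion-ordered: new labels appear at first occurrence
    for label, value in zip(labels, values):
        match = regex.fullmatch(label)
        # Labels that do not match the pattern keep their original name.
        new_label = match.expand(replacement) if match else label
        buckets.setdefault(new_label, []).append(value)
    return list(buckets), [func(v) for v in buckets.values()]


labels = 'BR_A,BR_B,WA_B,WA_C,FL_C,FL_D,FR_A,DE_B'.split(',')
values = list(range(8))
print(regroup_by_pattern(labels, values, r'(?:BR|WA|FL)_(.*)', r'BE_\1'))
```

The capture group plays the role of `{prod}` in the proposed syntax: whatever follows the prefix is carried over to the new label, so labels with the same suffix end up in the same group.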
Implement an easier way to aggregate only part of an axis and leave other labels intact: