make (partial) disaggregation easier #199
Comments
The loop can be rewritten more generically as:
>>> for g in group:
...     start, stop = g.eval().split(':')
...     target = a[start:stop]
...     expand[g, target] = 1 / len(target)
and |
A first proof of concept:
def disag_array(array, source_axis, target_axis, mapping=None, fixoverlap=True):
    source_axis = array.axes[source_axis]
    disag_array = zeros((source_axis, target_axis))
    for source_group in source_axis:
        source_label = source_group.eval()
        target_labels = source_label if mapping is None else mapping[source_label]
        target_group = target_axis[target_labels]
        disag_array[source_group, target_group] = 1 / len(target_group)
    if fixoverlap:
        disag_array /= (disag_array > 0).sum(source_axis)
    return disag_array

# make a method out of this
def disag(array, source_axis, target_axis, mapping=None):
    return array @ disag_array(array, source_axis, target_axis, mapping)
then:
>>> arr = ndtest(3)
>>> arr
a | a0 | a1 | a2
| 0 | 1 | 2
>>> agg = arr.sum('a1:a2;a0:a1').rename('a', 'group')
>>> agg
group | a1:a2 | a0:a1
| 3 | 1
>>> disag(agg, x.group, arr.a)
a | a0 | a1 | a2
  | 0.5 | 1.0 | 1.5
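To make the arithmetic behind that result explicit, here is a minimal sketch in plain Python, with dicts standing in for labelled arrays. The helper names `disag_weights` and `disaggregate` are hypothetical and are not part of larray. The sketch only mirrors what the proof of concept does: an even split per group, then a division by the number of groups that cover each target label.

```python
def disag_weights(groups, fix_overlap=True):
    """Return {source_label: {target_label: weight}} for an even split.

    `groups` maps each aggregated label to the list of labels it covers.
    (Hypothetical helper, for illustration only.)
    """
    weights = {src: {tgt: 1 / len(targets) for tgt in targets}
               for src, targets in groups.items()}
    if fix_overlap:
        # count how many source groups contribute to each target label
        counts = {}
        for row in weights.values():
            for tgt in row:
                counts[tgt] = counts.get(tgt, 0) + 1
        # divide by that count, so overlapping contributions are averaged
        for row in weights.values():
            for tgt in row:
                row[tgt] /= counts[tgt]
    return weights


def disaggregate(values, weights):
    """Spread each aggregated value over its target labels."""
    result = {}
    for src, row in weights.items():
        for tgt, w in row.items():
            result[tgt] = result.get(tgt, 0.0) + values[src] * w
    return result


groups = {'a1:a2': ['a1', 'a2'], 'a0:a1': ['a0', 'a1']}
agg = {'a1:a2': 3, 'a0:a1': 1}
result = disaggregate(agg, disag_weights(groups))
```

With these inputs, `a1` receives 1.5 from `a1:a2` and 0.5 from `a0:a1`; the overlap fix halves both contributions, which gives the 1.0 shown in the output.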
Here is a new version:
import collections.abc
import numpy as np

def disag_array(array, source_axis, groups=None, target_axis=None, fixoverlap=True):
    source_axis = array.axes[source_axis]
    # collections.Sequence/Mapping were removed in Python 3.10: use collections.abc
    if isinstance(groups, collections.abc.Sequence):
        assert len(groups) == len(source_axis)
        if target_axis is None:
            target_axis = Axis(groups[0].axis, np.unique(np.concatenate([g.eval() for g in groups])))
        groups = dict(zip(source_axis, groups))
    if isinstance(groups, collections.abc.Mapping):
        if target_axis is None:
            groups = [groups[source_group.eval()] for source_group in source_axis]
            groups_labels = [g.eval() if isinstance(g, Group) else g for g in groups]
            target_name = groups[0].axis if isinstance(groups[0], Group) else None
            target_axis = Axis(target_name, np.unique(np.concatenate(groups_labels)))
    if target_axis is None:
        raise ValueError('must specify groups, target_axis or both')
    disag_array = zeros((source_axis, target_axis))
    for source_group in source_axis:
        target_labels = source_group.eval() if groups is None else groups[source_group]
        # make sure we have a group in case mapping returned raw labels
        target_group = target_axis[target_labels]
        disag_array[source_group, target_group] = 1 / len(target_group)
    if fixoverlap:
        disag_array /= (disag_array > 0).sum(source_axis)
    return disag_array

# make a method out of this
def disag(array, source_axis, groups=None, target_axis=None):
    return (array * disag_array(array, source_axis, groups, target_axis)).sum(source_axis)
then:
>>> disag(agg2, x.group, (arr.a['a1:a2'], arr.a['a0:a1']))
a | a0 | a1 | a2
| 0.5 | 1.0 | 1.5
>>> disag(agg2, x.group, target_axis=arr.a)
a | a0 | a1 | a2
| 0.5 | 1.0 | 1.5
>>> disag(agg, agegr, age.by(5))
Points to improve:
>>> disag(agg, dict(zip(agegr, age.by(5))))
>>> disag(agg, {'a1:a2': ['a1', 'a2'], 'a0:a1': ['a0', 'a1']})
Note that the @ optimization currently only works when the disag array is 2D, mostly because @ relies on axis positions. It should be possible to transpose to make it work in all cases, though I am not sure it is worth it.
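As a rough illustration of the transpose idea, here is a NumPy-only sketch (plain arrays, not larray objects) that moves the source axis to the last position, applies `@` with a 2D weight matrix, and moves the new axis back. The function name `apply_disag` is hypothetical, and the weights are the overlap-fixed ones for the two groups `a1:a2` and `a0:a1` used earlier in this thread.

```python
import numpy as np


def apply_disag(data, weights, axis):
    """Contract `axis` of `data` with the first axis of a 2D `weights` matrix.

    The disaggregated axis ends up where the source axis was.
    (Hypothetical helper, for illustration only.)
    """
    # move the source axis last so that @ contracts over it
    moved = np.moveaxis(data, axis, -1)
    out = moved @ weights              # shape: (..., n_target)
    # put the new axis back at the original position
    return np.moveaxis(out, -1, axis)


# rows: source groups 'a1:a2' and 'a0:a1'; columns: targets a0, a1, a2
weights = np.array([[0.0, 0.25, 0.5],
                    [0.5, 0.25, 0.0]])
# the group axis is axis 0; axis 1 is some unrelated extra dimension
data = np.array([[3.0, 6.0],
                 [1.0, 2.0]])
result = apply_disag(data, weights, axis=0)
```

Here `result` has shape `(3, 2)`: the group axis of length 2 is replaced in place by the target axis of length 3, and the other dimension is left alone.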
Katia needs this too, but to disaggregate only a few labels. I don't know whether the above code already handles that case; we should make sure it does.
The implementation so far assumes the aggregated cells correspond to sums. We should also support not dividing by the length of the group (for when the aggregation was a mean). Are there other ways to infer the values automatically (possibly a user-defined function)? If users want an uneven split, I guess we should redirect them to an explicit disag array, which is probably worth mentioning in the disag function documentation. In addition, we could provide some special syntax to create it, since providing a full disag array seems overkill when disaggregating only a few labels:
>>> disag(agg, {'a1:a2': {'a1': 0.4, 'a2': 0.6}, 'a0:a1': {'a0': 0.7, 'a1': 0.3}})
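The three cases discussed here (sum, mean, explicit uneven split) could be told apart when building the weights. The following is a sketch in plain Python; the function `build_weights` and its `how` parameter are assumptions made for illustration and are not an existing larray API.

```python
def build_weights(groups, how='sum'):
    """Return {source_label: {target_label: weight}}.

    groups[source] is either a list of target labels (even split) or a
    dict {target: share} giving an explicit, possibly uneven, split.
    (Hypothetical helper and parameter names, for illustration only.)
    """
    weights = {}
    for src, targets in groups.items():
        if isinstance(targets, dict):
            # explicit shares: use them as given
            weights[src] = dict(targets)
        elif how == 'sum':
            # the aggregate was a sum: split it evenly
            weights[src] = {t: 1 / len(targets) for t in targets}
        elif how == 'mean':
            # the aggregate was a mean: every target gets the value itself
            weights[src] = {t: 1.0 for t in targets}
        else:
            raise ValueError(f"unsupported aggregation: {how!r}")
    return weights


even = build_weights({'a1:a2': ['a1', 'a2']})
mean = build_weights({'a1:a2': ['a1', 'a2']}, how='mean')
uneven = build_weights({'a1:a2': {'a1': 0.4, 'a2': 0.6}})
```

Accepting a dict of shares in the same argument as a list of labels matches the nested-dict syntax proposed above, so a user who only needs to split a few labels unevenly would not have to build a full disag array.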
Partial disaggregation is done quite often in user models:
>>> arr = ndtest('axis_v1=a0,a12,a3')
>>> arr
axis_v1  a0  a12  a3
          0    1   2
>>> arr.rename('axis_v1', 'axis_v2').set_labels('axis_v2', {'a12': 'a1'}).insert(0, after='a1', label='a2')
axis_v2  a0  a1  a2  a3
          0   1   0   2
It would be good to have something nicer for this. The special syntax above would help:
>>> disag(arr, {'a12': {'a1': 1, 'a2': 0}})
axis_v2  a0  a1  a2  a3
          0   1   0   2
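The intended behaviour of such a partial disaggregation can be sketched in plain Python on a list of labels and a list of values. The function `partial_disag` below is hypothetical and only illustrates the semantics suggested by the example: labels absent from the mapping are kept unchanged, while mapped labels are replaced in place by their new labels.

```python
def partial_disag(labels, values, splits):
    """Replace some labels by several new ones, leaving the others untouched.

    `splits` maps an old label to {new_label: share}; each new value is the
    old value multiplied by its share.
    (Hypothetical helper, for illustration only.)
    """
    new_labels, new_values = [], []
    for label, value in zip(labels, values):
        if label in splits:
            # expand this label in place, keeping the original ordering
            for new_label, share in splits[label].items():
                new_labels.append(new_label)
                new_values.append(value * share)
        else:
            new_labels.append(label)
            new_values.append(value)
    return new_labels, new_values


labels, values = partial_disag(['a0', 'a12', 'a3'], [0, 1, 2],
                               {'a12': {'a1': 1, 'a2': 0}})
```

With this input, `labels` and `values` line up with the `a0 a1 a2 a3` / `0 1 0 2` output shown in the thread.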
We need to find some way to make this easier (and fix all the bugs I just came across) -- I will open separate issues for each bug, but the main feature request will remain: