Skip to content

BUG: Fix groupby with "as_index" for categorical multi #13204 #13394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.18.2.txt
Original file line number Diff line number Diff line change
Expand Up @@ -527,3 +527,4 @@ Bug Fixes


- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
- Bug in ``groupby`` with ``as_index=False`` returns all NaN's when grouping on multiple columns including a categorical one (:issue:`13204`)
38 changes: 34 additions & 4 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2250,7 +2250,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
self.grouper = to_timedelta(self.grouper)

def __repr__(self):
return 'Grouping(%s)' % self.name
return 'Grouping({0})'.format(self.name)

def __iter__(self):
return iter(self.indices)
Expand Down Expand Up @@ -3741,9 +3741,39 @@ def _reindex_output(self, result):
return result

levels_list = [ping.group_index for ping in groupings]
index = MultiIndex.from_product(levels_list, names=self.grouper.names)
d = {self.obj._get_axis_name(self.axis): index, 'copy': False}
return result.reindex(**d).sortlevel(axis=self.axis)
index, _ = MultiIndex.from_product(
levels_list, names=self.grouper.names).sortlevel()

if self.as_index:
d = {self.obj._get_axis_name(self.axis): index, 'copy': False}
return result.reindex(**d)

# GH 13204
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't something like this solve the issue? This is conceptually what as_index=False is anyhow.

3760)_reindex_output()
-> indexer = self.grouper.result_index.get_indexer(index)
(Pdb) result.set_index(['cat','A']).reindex(index).reset_index()
   cat   A      B
0    1  10  101.0
1    1  11    NaN
2    2  10    NaN
3    2  11  205.0
4    3  10    NaN
5    3  11    NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, this is correct (it's slightly slower because of dropping and inserting columns to the data frame - but overall much cleaner).

In a general case, not all groupers have to be columns. So, probably, we have to manually find which groupers are columns and remove them from result, which will make the code even more obscure.

Something like this

result.set_index(self.grouper.result_index).reindex(index).reset_index()

fails because reset_index raises ValueError: cannot insert A, already exists.

And there is also a (probably rare but not unlikely) corner case, where a non-column grouper has the same name as some other (non-grouping) column in result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes agreed. I think that you need to iterate thrue the grouper.names and check if they are already in the axis (or not) and just adjust this. Speed is not an issue here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it can probably be done but there's one issue here: if we have a non-column grouper, should it become a column of the final result? This will happen after reset_index(). And its name can clash with a name of some other column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [7]: df = DataFrame({'cat': [1,2,3],'A': [10, 11, 11],'B': [101, 102, 103]})

In [8]: df.groupby(['cat','A']).sum()
Out[8]: 
          B
cat A      
1   10  101
2   11  102
3   11  103

In [9]: df.groupby(['cat','A'],as_index=False).sum()
Out[9]: 
   cat   A    B
0    1  10  101
1    2  11  102
2    3  11  103

In [10]: f = lambda r: df.loc[r, 'A'] % 2

In [11]: df.groupby(['cat',f],as_index=False).sum()
Out[11]: 
   cat   A    B
0    1  10  101
1    2  11  102
2    3  11  103

It appears non-named groupers (e.g. not in_axis) are excluded, for non-categoricals (which makes sense its en emphereal grouper, not something we started with).

Copy link
Contributor Author

@pijucha pijucha Jun 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean something like this where df1 is some other data frame

df = pd.DataFrame({'cat': pd.Categorical([1, 2, 2], [1, 2, 3]),
                     'A': [10, 11, 11], 'B': [101, 102, 103]})
df1 = pd.DataFrame({'B' : [3, 4, 5]})

df.groupby(['cat', df1['B']]).sum()
Out[21]: 
          A      B
cat B             
1   3  10.0  101.0
    4   NaN    NaN
    5   NaN    NaN
2   3  11.0  102.0
    4  11.0  103.0
    5   NaN    NaN
3   3   NaN    NaN
    4   NaN    NaN
    5   NaN    NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've got it. I'll try to update it later in the evening.

# Here, the categorical in-axis groupers, which need to be fully
# expanded, are columns in `result`. An idea is to do:
# result = result.set_index(self.grouper.names)
# .reindex(index).reset_index()
# but special care has to be taken because of possible not-in-axis
# groupers.
# So, we manually select and drop the in-axis grouper columns,
# reindex `result`, and then reset the in-axis grouper columns.

# Select in-axis groupers
in_axis_grps = [(i, ping.name) for (i, ping)
in enumerate(groupings) if ping.in_axis]
g_nums, g_names = zip(*in_axis_grps)

result = result.drop(labels=list(g_names), axis=1)

# Set a temp index and reindex (possibly expanding)
result = result.set_index(self.grouper.result_index
).reindex(index, copy=False)

# Reset in-axis grouper columns
# (using level numbers `g_nums` because level names may not be unique)
result = result.reset_index(level=g_nums)

return result.reset_index(drop=True)

def _iterate_column_groupbys(self):
for i, colname in enumerate(self._selected_obj.columns):
Expand Down
51 changes: 51 additions & 0 deletions pandas/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -6304,6 +6304,47 @@ def test_groupby_categorical_two_columns(self):
nan, nan, nan, nan, 200, 34]}, index=idx)
tm.assert_frame_equal(res, exp)

def test_groupby_multi_categorical_as_index(self):
# GH13204
df = DataFrame({'cat': Categorical([1, 2, 2], [1, 2, 3]),
'A': [10, 11, 11],
'B': [101, 102, 103]})
result = df.groupby(['cat', 'A'], as_index=False).sum()
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3],
'A': [10, 11, 10, 11, 10, 11],
'B': [101.0, nan, nan, 205.0, nan, nan]},
columns=['cat', 'A', 'B'])
tm.assert_frame_equal(result, expected)

# function grouper
f = lambda r: df.loc[r, 'A']
result = df.groupby(['cat', f], as_index=False).sum()
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3],
'A': [10.0, nan, nan, 22.0, nan, nan],
'B': [101.0, nan, nan, 205.0, nan, nan]},
columns=['cat', 'A', 'B'])
tm.assert_frame_equal(result, expected)

# another not in-axis grouper (conflicting names in index)
s = Series(['a', 'b', 'b'], name='cat')
result = df.groupby(['cat', s], as_index=False).sum()
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3],
'A': [10.0, nan, nan, 22.0, nan, nan],
'B': [101.0, nan, nan, 205.0, nan, nan]},
columns=['cat', 'A', 'B'])
tm.assert_frame_equal(result, expected)

# is original index dropped?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, no, it's a comment. It should be rather "Check whether the original index is dropped".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh ok

expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3],
'A': [10, 11, 10, 11, 10, 11],
'B': [101.0, nan, nan, 205.0, nan, nan]},
columns=['cat', 'A', 'B'])

for name in [None, 'X', 'B', 'cat']:
df.index = Index(list("abc"), name=name)
result = df.groupby(['cat', 'A'], as_index=False).sum()
tm.assert_frame_equal(result, expected, check_index_type=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Added the tests. (Hope this is what you meant.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. what needed changing in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing. Just added these tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

Copy link
Contributor Author

@pijucha pijucha Jun 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand. With as_index=False, the aggregated output has RangeIndex (edit: actually, it's Int64Index([0, 1, 2]) here) as an index:

df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [1,2,3]})
df.groupby(["A","C"], as_index=False).sum()
Out[34]: 
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2

and now, after .reset_index() it's just reset as a new column.

df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[35]: 
   index  A  C  B
0      0  1  1  2
1      1  1  2  2
2      2  1  3  2

It has nothing to do with groupers being categorical.

Didn't you mean rather as_index=True?

df.groupby(["A","C"],as_index=True).sum().reset_index()
Out[36]: 
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is wrong though. #13394 (comment) There shouldn't be an index column at all.

Are you still convinced something may be wrong here? I think it's correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was looking at something weird. looks ok.


def test_groupby_apply_all_none(self):
# Tests to make sure no errors if apply function returns all None
# values. Issue 9684.
Expand Down Expand Up @@ -6431,6 +6472,16 @@ def test_numpy_compat(self):
tm.assertRaisesRegexp(UnsupportedFunctionCall, msg,
getattr(g, func), foo=1)

def test_grouping_string_repr(self):
# GH 13394
mi = MultiIndex.from_arrays([list("AAB"), list("aba")])
df = DataFrame([[1, 2, 3]], columns=mi)
gr = df.groupby(df[('A', 'a')])

result = gr.grouper.groupings[0].__repr__()
expected = "Grouping(('A', 'a'))"
tm.assert_equal(result, expected)


def assert_fp_equal(a, b):
assert (np.abs(a - b) < 1e-12).all()
Expand Down