-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix groupby with "as_index" for categorical multi #13204 #13394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6304,6 +6304,47 @@ def test_groupby_categorical_two_columns(self): | |
nan, nan, nan, nan, 200, 34]}, index=idx) | ||
tm.assert_frame_equal(res, exp) | ||
|
||
def test_groupby_multi_categorical_as_index(self): | ||
# GH13204 | ||
df = DataFrame({'cat': Categorical([1, 2, 2], [1, 2, 3]), | ||
'A': [10, 11, 11], | ||
'B': [101, 102, 103]}) | ||
result = df.groupby(['cat', 'A'], as_index=False).sum() | ||
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3], | ||
'A': [10, 11, 10, 11, 10, 11], | ||
'B': [101.0, nan, nan, 205.0, nan, nan]}, | ||
columns=['cat', 'A', 'B']) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# function grouper | ||
f = lambda r: df.loc[r, 'A'] | ||
result = df.groupby(['cat', f], as_index=False).sum() | ||
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3], | ||
'A': [10.0, nan, nan, 22.0, nan, nan], | ||
'B': [101.0, nan, nan, 205.0, nan, nan]}, | ||
columns=['cat', 'A', 'B']) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# another not in-axis grouper (conflicting names in index) | ||
s = Series(['a', 'b', 'b'], name='cat') | ||
result = df.groupby(['cat', s], as_index=False).sum() | ||
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3], | ||
'A': [10.0, nan, nan, 22.0, nan, nan], | ||
'B': [101.0, nan, nan, 205.0, nan, nan]}, | ||
columns=['cat', 'A', 'B']) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# is original index dropped? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this a question? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, no, it's a comment. It should be rather "Check whether the original index is dropped". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ahh ok |
||
expected = DataFrame({'cat': [1, 1, 2, 2, 3, 3], | ||
'A': [10, 11, 10, 11, 10, 11], | ||
'B': [101.0, nan, nan, 205.0, nan, nan]}, | ||
columns=['cat', 'A', 'B']) | ||
|
||
for name in [None, 'X', 'B', 'cat']: | ||
df.index = Index(list("abc"), name=name) | ||
result = df.groupby(['cat', 'A'], as_index=False).sum() | ||
tm.assert_frame_equal(result, expected, check_index_type=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback Added the tests. (Hope this is what you meant.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes. what needed changing in the code? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nothing. Just added these tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this example is wrong though. #13394 (comment) There shouldn't be an index column at all. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't quite understand. With df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [1,2,3]})
df.groupby(["A","C"], as_index=False).sum()
Out[34]:
A C B
0 1 1 2
1 1 2 2
2 1 3 2 and now, after df.groupby(["A","C"], as_index=False).sum().reset_index()
Out[35]:
index A C B
0 0 1 1 2
1 1 1 2 2
2 2 1 3 2 It has nothing to do with groupers being categorical. Didn't you mean rather df.groupby(["A","C"],as_index=True).sum().reset_index()
Out[36]:
A C B
0 1 1 2
1 1 2 2
2 1 3 2 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Are you still convinced something may be wrong here? I think it's correct. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I was looking at something weird. looks ok. |
||
|
||
def test_groupby_apply_all_none(self): | ||
# Tests to make sure no errors if apply function returns all None | ||
# values. Issue 9684. | ||
|
@@ -6431,6 +6472,16 @@ def test_numpy_compat(self): | |
tm.assertRaisesRegexp(UnsupportedFunctionCall, msg, | ||
getattr(g, func), foo=1) | ||
|
||
def test_grouping_string_repr(self): | ||
# GH 13394 | ||
mi = MultiIndex.from_arrays([list("AAB"), list("aba")]) | ||
df = DataFrame([[1, 2, 3]], columns=mi) | ||
gr = df.groupby(df[('A', 'a')]) | ||
|
||
result = gr.grouper.groupings[0].__repr__() | ||
expected = "Grouping(('A', 'a'))" | ||
tm.assert_equal(result, expected) | ||
|
||
|
||
def assert_fp_equal(a, b): | ||
assert (np.abs(a - b) < 1e-12).all() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't something like this solve the issue? This is conceptually what
as_index=False
is anyhow.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, this is correct (it's slightly slower because of dropping and inserting columns to the data frame - but overall much cleaner).
In a general case, not all groupers have to be columns. So, probably, we have to manually find which groupers are columns and remove them from
result
, which will make the code even more obscure.Something like this
fails because
reset_index
raisesValueError: cannot insert A, already exists
.And there is also a (probably rare but not unlikely) corner case, where a non-column grouper has the same name as some other (non-grouping) column in
result
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes agreed. I think that you need to iterate thrue the grouper.names and check if they are already in the axis (or not) and just adjust this. Speed is not an issue here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it can probably be done but there's one issue here: if we have a non-column grouper, should it become a column of the final
result
? This will happen afterreset_index()
. And its name can clash with a name of some other column.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears non-named groupers (e.g. not
in_axis
) are excluded, for non-categoricals (which makes sense its en emphereal grouper, not something we started with).Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean something like this where
df1
is some other data frameThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've got it. I'll try to update it later in the evening.