BUG?: getitem with a MultiIndex returns a Series only when the lower level is "" #50805

Closed
rhshadrach opened this issue Jan 18, 2023 · 3 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@rhshadrach
Member

rhshadrach commented Jan 18, 2023

xref #46944

import pandas as pd

columns1 = pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
df1 = pd.DataFrame([[1, 2]], columns=columns1)
print(df1["a"])
#    a2
# 0   1

columns2 = pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
df2 = pd.DataFrame([[1, 2]], columns=columns2)
print(df2["a"])
# 0    1
# Name: a, dtype: int64

The first case produces a DataFrame, whereas the second case produces a Series. I don't think this is intentional. This gives rise to a difference in DataFrameGroupBy._selected_obj and DataFrameGroupBy._obj_with_exclusions which can lead to erroneous results (#50804 is one example).
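To make the type difference explicit, here is a minimal sketch that rebuilds the two frames above and checks what `df["a"]` returns in each case. The list-style selection `df2[["a"]]` at the end is not part of the report; it is included on the assumption that a caller wants a DataFrame regardless of the lower-level label.

```python
import pandas as pd

# Same two frames as in the report above.
columns1 = pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
df1 = pd.DataFrame([[1, 2]], columns=columns1)

columns2 = pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
df2 = pd.DataFrame([[1, 2]], columns=columns2)

# Selecting the top-level key "a" yields different container types.
type1 = type(df1["a"]).__name__
type2 = type(df2["a"]).__name__
print(type1, type2)

# Assumption: passing a list of keys keeps the result two-dimensional,
# even when the lower level is the empty string.
frame2 = df2[["a"]]
print(type(frame2).__name__, frame2.shape)
```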

Currently, df2.groupby("a") is allowed whereas df1.groupby("a") raises. So returning a DataFrame in the second case would resolve the groupby inconsistency as well.

One can do df1.groupby(("a", "a2")) successfully, so I don't think there is a worry about making certain ops not possible.
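The three groupby calls mentioned above can be sketched as follows. The exact exception type for `df1.groupby("a")` is not stated in the report, so the sketch catches any exception and only records its class name; `ngroups` is used simply as a cheap way to confirm that a groupby object was built.

```python
import pandas as pd

columns1 = pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
df1 = pd.DataFrame([[1, 2]], columns=columns1)

columns2 = pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
df2 = pd.DataFrame([[1, 2]], columns=columns2)

# df2["a"] is a Series, so "a" is accepted as a grouping key.
ngroups_df2 = df2.groupby("a").ngroups

# df1["a"] is a DataFrame, so the same key is rejected.
# The exception type is deliberately not assumed here.
try:
    df1.groupby("a")
    raised = None
except Exception as exc:
    raised = type(exc).__name__

# Spelling out the full column tuple selects a single column and works.
ngroups_df1 = df1.groupby(("a", "a2")).ngroups

print(ngroups_df2, raised, ngroups_df1)
```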

cc @phofl for any thoughts

@rhshadrach rhshadrach added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jan 18, 2023
@phofl
Member

phofl commented Jan 18, 2023

Not really any thoughts, but we had an issue about this as well. I looked into why this was done the way it's done: it has been around forever and is tested explicitly.

@rhshadrach
Member Author

rhshadrach commented Jan 19, 2023

I think this could be supported in groupby with just a few lines, but I do find it somewhat odd behavior. I'm guessing an empty string lets some columns behave as if they are multi-indexed whereas other columns do not (or just have fewer levels). I didn't see it mentioned anywhere in the docs, but could have missed it.
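One place where pandas itself produces such columns is `reset_index` on a frame with MultiIndex columns, which pads the new column's label with `""` (the default `col_fill`). The sketch below illustrates that padding and the resulting Series lookup. Whether this is the original motivation for the special case is an assumption, not something confirmed in this thread; the frame and the names `key`, `x`, `p`, `q` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two-level columns and a named index.
columns = pd.MultiIndex.from_tuples([("x", "p"), ("x", "q")])
df = pd.DataFrame([[1, 2]], columns=columns, index=pd.Index([10], name="key"))

# reset_index moves "key" into the columns; the missing lower level
# is filled with the default col_fill, the empty string.
out = df.reset_index()
first_col = out.columns[0]
print(first_col)

# Because the lower level is "", selecting "key" gives a Series,
# so the column behaves as if it had a single level.
key_col = out["key"]
print(type(key_col).__name__)
```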

I'd support removing this special-case behavior, but I'm not going to push for it. It seems like added complexity that doesn't offer much in the way of benefits (but maybe it does and I just don't see it).

@rhshadrach
Member Author

The recommendation of using _obj_with_exclusions instead of _selected_obj in #46944 (comment) has the added benefit of handling this behavior without any other change to groupby.
