Skip to content

BUG: idxmin & idxmax axis = 1 str reducer for transform #50329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 5, 2023

Conversation

iofall
Copy link
Contributor

@iofall iofall commented Dec 17, 2022

To Do:

  • Fix failing test_api_consistency due to changed signatures.
  • Fix _shared_docs for idxmin to reflect the changes in the signature.

Description

  • .transform("idxmin") & .transform("idxmax") now works with axis=1 groupby objects.
  • The default axis=0 argument is ignored. Instead we directly use the grouper's axis.
  • Note that this behavior is different from a generic idxmin call which still respects the axis argument.

@rhshadrach rhshadrach added Bug Groupby Apply Apply, Aggregate, Transform, Map labels Dec 18, 2022
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!

As it's somewhat relevant here, I'll mention that I think we should deprecate the axis argument across most (all?) groupby methods where it is not useful, but this is to be done separately.

axis: Axis = 0,
skipna: bool = True,
numeric_only: bool = False,
self, axis: Axis = 0, skipna: bool = True, numeric_only: bool = False, **kwargs
Copy link
Member

@rhshadrach rhshadrach Dec 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be adding kwargs to the function signature. Whenever self.axis=1 and axis = 0 in idxmin we will raise (whether using with transform or not), so I'd suggest instead just ignoring the axis argument altogether in this case.

Assuming this sounds like a good route, can you change the axis argument in the docstring to state that it is ignored. Right now we're using the shared_docs; I'd recommend copying this and adding it as an explicit docstring to this method - that would then allow us in the future to add groupby-specific docs for this method and correct some difference in behavior between DataFrame and DataFrameGroupBy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback, it does sound like a good route. As for the axis parameter, it does not seem intuitive to raise when we are stating in the docs that the argument will be ignored. We could however, instead of ignoring axis, just mention in the docs that axis and self.axis should match. If they don't match, we will raise with the message - they should match.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the axis parameter, it does not seem intuitive to raise when we are stating in the docs that the argument will be ignored.

I think this is in reply to my sentence Whenever self.axis=1 and axis = 0 in idxmin we will raise (whether using with transform or not), so I'd suggest.... When I said "we will raise", I mean "currently pandas raises". In other words, I think it's okay to ignore the axis argument because currently no one can be using .groupby(..., axis=1).idxmin(axis=0).

We could however, instead of ignoring axis, just mention in the docs that axis and self.axis should match. If they don't match, we will raise with the message - they should match.

I agree this is a viable route, but I think a bit odd. This is perhaps an over simplification, but it's like having a function

def foo(x=1):
    if x != 2:
        raise ValueError("You must set x to be 2")
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I said "we will raise", I mean "currently pandas raises". In other words, I think it's okay to ignore the axis argument because currently no one can be using .groupby(..., axis=1).idxmin(axis=0).

Okay now I understand what you mean. For now we will ignore axis, in the future remove it and directly take axis from the grouper.

@rhshadrach rhshadrach changed the title FIX idxmin axis = 1 str reducer for transform BUG: idxmin axis = 1 str reducer for transform Dec 18, 2022
@iofall
Copy link
Contributor Author

iofall commented Dec 23, 2022

I have implemented the required changes in the latest commit.

We ignore the axis parameter, we take it directly based on the self.axis and have mentioned the same in the docs.

Comment on lines 1931 to 1933
>>> df = pd.DataFrame({{'consumption': [10.51, 103.11, 55.48],
... 'co2_emissions': [37.2, 19.66, 1712]}},
... index=['Pork', 'Wheat Products', 'Beef'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doctest appears to be failing because of the {{. I think keeping a single brace should fix it, but not sure if we need it as an escape character.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the previous docstring had two {{ due to its use as a template; agreed there should only be one.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, small requests below. Also, would you be willing to do idxmax here as well?

Also needs a whatsnew entry in 2.0.0 under the bugfix section within groupby.

The axis to use. 0 or 'index' for row-wise, 1 or 'columns' for column-wise.
The axis argument is ignored, instead we use the grouper's axis.

.. versionchanged:: 1.5.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.0.0 here; in generally only regression fixes make it to patch versions.

dtype: object
"""

axis = self.axis
axis = DataFrame._get_axis_number(axis)
Copy link
Member

@rhshadrach rhshadrach Dec 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L1958 here is now unnecessary, as self.axis is always 0 or 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense

@iofall
Copy link
Contributor Author

iofall commented Dec 27, 2022

Looks good, small requests below. Also, would you be willing to do idxmax here as well?

Sure, I will do that.

@rhshadrach
Copy link
Member

@iofall - I think this looks good, just need to resolve a whatsnew conflict

@iofall iofall changed the title BUG: idxmin axis = 1 str reducer for transform BUG: idxmin & idxmax axis = 1 str reducer for transform Dec 30, 2022
@@ -944,7 +944,7 @@ Groupby/resample/rolling
- Bug in :meth:`.SeriesGroupBy.nunique` would incorrectly raise when the grouper was an empty categorical and ``observed=True`` (:issue:`21334`)
- Bug in :meth:`.SeriesGroupBy.nth` would raise when grouper contained NA values after subsetting from a :class:`DataFrameGroupBy` (:issue:`26454`)
- Bug in :meth:`DataFrame.groupby` would not include a :class:`.Grouper` specified by ``key`` in the result when ``as_index=False`` (:issue:`50413`)
-
- Bug in :meth:`.DataFrameGroupBy.transform` and :meth:`.SeriesGroupBy.transform` would raise incorrectly when grouper had ``axis=1`` for ``"idxmin"`` and ``"idxmax"`` arguments (:issue:`45986`). The method now ignores the ``axis`` argument, instead we use the grouper's ``axis``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this one sentence, no period, ending with (:issue:`45986`) as the other lines do. I think it's okay to leave off the last sentence.

@iofall
Copy link
Contributor Author

iofall commented Dec 31, 2022

I looked at the failing test, we might need to remove / rewrite it, since it depends on supplying axis = 1 to idxmax which we are ignoring.

gb = df.groupby("A")
res = gb.idxmax(axis=1)

Here, we groupby on line 509, this happens with axis = 0, so we do get some groups. But if we change the same line and put axis = 1, we don't get any groups.

>>> gb = df.groupby("A")
>>> gb.groups
{1: [0, 3, 6], 2: [1, 4, 7], 3: [2, 5, 8], 4: [9]}

>>> gb = df.groupby("A", axis = 1)
>>> gb.groups
{}

Thus the res variable comes out as an empty DataFrame, failing the test.

I am not sure how to proceed but we can try:

  • The same thing we had done before, honor the axis argument if given otherwise use self.axis
  • Change the test to check for a different result that does not rely on groups across different axes.

@rhshadrach
Copy link
Member

I was under the impression groupby(..., axis=0).idxmin(axis=1) would raise, but it looks like this is not the case.

The same thing we had done before, honor the axis argument if given otherwise use self.axis

+1 on this

@iofall
Copy link
Contributor Author

iofall commented Jan 2, 2023

I was under the impression groupby(..., axis=0).idxmin(axis=1) would raise, but it looks like this is not the case.

Sorry I did not convey this properly, it does raise but only when the number of rows and columns mismatch in the issue's minimal example.

# raises
df = pd.DataFrame({"a": [1, 2,], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
result = df.groupby([0, 1, 1], axis=0).transform("idxmin", axis = 1)

# passes
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]}, index=["x", "y", "z"])
result = df.groupby([0, 1, 1], axis=0).transform("idxmin", axis = 1)

+1 on this

What do you recommend?

  • Put axis in kwargs, that way we can check if it exists in the kwargs and handle accordingly.
  • Remove the default axis = 0 and make it a required argument.
  • Or make axis = None as the default. If axis is None, we use self.axis, else use whatever value supplied by the user.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 3, 2023

@iofall - The code

df = pd.DataFrame({"a": [1, 2,], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
result = df.groupby([0, 1, 1], axis=0)

raises because you have two rows to group and you're specifying groups for three rows.

Or make axis = None as the default. If axis is None, we use self.axis, else use whatever value supplied by the user.

This sounds like a great way forward.

@iofall
Copy link
Contributor Author

iofall commented Jan 4, 2023

raises because you have two rows to group and you're specifying groups for three rows.

Sorry I meant even this raises -

df = pd.DataFrame({"a": [1, 2,], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
result = df.groupby([0, 1, 1], axis=1).transform("idxmin", axis=0)

@iofall
Copy link
Contributor Author

iofall commented Jan 4, 2023

I think we have the right changes now, we have completed what the issue had asked for:

This example now works and all the tests are passing as well.

df = DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
result = df.groupby([0, 0, 1], axis=1).transform("first")

However, examples like these still raise

df = pd.DataFrame({"a": [1, 2,], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])
result = df.groupby([0, 1, 1], axis=1).transform("idxmin", axis=0)

Although this should be included in another issue or PR, but here is some investigation

def reset_identity(values):
# reset the identities of the components
# of the values to prevent aliasing
for v in com.not_none(*values):
ax = v._get_axis(self.axis)
ax._reset_identity()
return values
Specifically for idxmin and idxmax, when the indexes mismatch, we end up at this function, which tries to reset the axis for each of the vs in the values that are obtained after grouping. However Series objects don't have axis which results in the error we see in the above idxmin call.

If we put a conditional such that axis is not checked when v is a Series, we do get some output shown below:

     a    b    c
a    x  NaN  NaN
b  NaN    x    x
c  NaN    x    x

This might help solve this issue.

@rhshadrach
Copy link
Member

Thanks for the update, will take a look in the next day or so. Just wanted to give my thoughts here:

Although this should be included in another issue or PR, but here is some investigation

I think the axis argument should be deprecated entirely; see #50405.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach added this to the 2.0 milestone Jan 5, 2023
def idxmax(
self,
axis: Axis = 0,
axis: Axis = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing mypy to fail on the CI. Need to type-hint as Axis | None. Same for idxmin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have changed it

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; failure is unrelated (SQL test)

@rhshadrach rhshadrach merged commit 3b26841 into pandas-dev:main Jan 5, 2023
@rhshadrach
Copy link
Member

Thanks @iofall - very nice!

@iofall
Copy link
Contributor Author

iofall commented Jan 8, 2023

Thanks @iofall - very nice!

Thanks! It was fun 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants