BUG: DataFrameGroupBy.transform unnecessarily coerces dtype #42617

Open
rhshadrach opened this issue Jul 19, 2021 · 2 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug, Dtype Conversions (Unexpected or buggy dtype conversions), Groupby

Comments

rhshadrach (Member) commented Jul 19, 2021

Comparing

import pandas as pd

# With an object column ('c') on the frame
df = pd.DataFrame({'a': [1], 'b': [2.], 'c': ['3']})
res = df.groupby([0]).transform(lambda x: x.max() - x.min())
print(res)

# Without the object column
df = pd.DataFrame({'a': [1], 'b': [2.]})
res = df.groupby([0]).transform(lambda x: x.max() - x.min())
print(res)

gives

   a    b
0  0  0.0
     a    b
0  0.0  0.0

Note that in the first result, the dtype of column a starts and ends as integer, whereas in the second, it starts as integer and ends as float.

When the object column (c) is on the frame, we call _transform_item_by_item, which operates on each column individually, giving the expected dtype in the result. Without the object column, we take the slowpath, which calls apply. apply returns a Series whose index is ['a', 'b'] with values [0.0, 0.0]: the float dtype of column b coerces the value for column a, so the final result is all floats.
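The coercion can be seen outside of groupby. The following is a minimal sketch, not the internal pandas code path: it applies the same lambda to the whole frame with DataFrame.apply and, for comparison, to each column on its own. The names func, reduced, and per_column_dtypes are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2.0]})
func = lambda x: x.max() - x.min()

# Whole-frame apply: the per-column results are collected into one Series,
# which can hold only a single dtype, so the integer result is upcast.
reduced = df.apply(func)
print(reduced.dtype)  # float64

# Column by column: each result is computed independently and keeps
# the dtype of its own column.
per_column_dtypes = {
    name: pd.Series([func(col)]).dtype for name, col in df.items()
}
print(per_column_dtypes)
```

Because a Series has one dtype for all of its values, any reduction that returns one value per column through apply will mix an integer column with a float column into floats.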

The first example above is deprecated (will raise in a future version), but the second example will still be valid and result in the wrong dtype.

rhshadrach added the Bug, Groupby, Dtype Conversions, and Apply labels Jul 19, 2021
rhshadrach added this to the Contributions Welcome milestone Jul 19, 2021
rhshadrach (Member, Author) commented Jul 20, 2021

Thinking about this more, if the user could control what input was being given to the transform function (i.e. column-by-column vs. the entire frame), we could default to the fastpath and then allow them to use the slowpath if they so desired. By having arguments control the input (and hence the output), the result above would no longer be surprising. It would also mean cleaner code paths internally (as opposed to try A, fall back to B).
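To make the idea concrete, here is a rough sketch of what an explicit switch could look like. The helper transform_with_mode and its by_column argument are hypothetical and are not part of pandas; the sketch also assumes func reduces its input (one scalar per column, or one Series per group frame) and is written with plain group iteration rather than the real transform machinery.

```python
import pandas as pd

def transform_with_mode(df, keys, func, by_column=True):
    """Hypothetical helper (not a pandas API): broadcast a reduction per group."""
    pieces = []
    for _, group in df.groupby(keys):
        if by_column:
            # func sees one column (a Series) at a time, so each result
            # keeps the dtype of its own column.
            data = {name: func(col) for name, col in group.items()}
        else:
            # func sees the whole group frame; the reduction comes back as a
            # single Series across columns, which has one common dtype.
            data = dict(func(group).items())
        # Broadcast the per-column scalars over the rows of the group.
        pieces.append(pd.DataFrame(data, index=group.index))
    return pd.concat(pieces).reindex(df.index)

df = pd.DataFrame({"a": [1, 3, 5], "b": [2.0, 4.0, 8.0]})
keys = pd.Series(["x", "x", "y"], index=df.index)
func = lambda x: x.max() - x.min()

by_col = transform_with_mode(df, keys, func, by_column=True)
whole = transform_with_mode(df, keys, func, by_column=False)
print(by_col.dtypes)
print(whole.dtypes)
```

With the mode chosen by the caller, the dtype of column a follows directly from the argument rather than from which internal path happened to succeed.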

@michael-michelotti

@rhshadrach I was working on a tangentially related issue and had a similar idea. See writeup here:
#26840

I thought there were a couple of other things it might be worth looking at regarding the _choose_path method as well.

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022