Skip to content

BUG: Changing way of applying numpy functions on Series #33775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

mproszewska
Copy link
Contributor

@mproszewska
Copy link
Contributor Author

Building documentation returns

/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/sphinxext/ipython_directive.py:1023: UserWarning: Code input with no code at /home/runner/work/pandas/pandas/doc/source/user_guide/computation.rst, line 622

which looks like issue #26788

@TomAugspurger
Copy link
Contributor

Is this going to kill performance for something like the following?

s = pd.Series(list(range(100_000))
s.apply(np.sin)

@mproszewska
Copy link
Contributor Author

Is this going to kill performance for something like the following?

s = pd.Series(list(range(100_000))
s.apply(np.sin)

I think so, but I feel like it's more appropriate use of the apply function. I think that using numpy functions in the old way should be possible but not like that.

@@ -3847,9 +3847,6 @@ def f(x):
f = func

with np.errstate(all="ignore"):
if isinstance(f, np.ufunc):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we want to do this?
is going to kill perf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #33492 describes that. I also added test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said I am not inclided to accept this w/o proof that this does not kill perf. We likely don't have asv's for this (or even using a np.ufunc in apply).

Now if you add those and either perf is not degraded, or you actually catch the TypeError then I would be ok.

@@ -3847,9 +3847,6 @@ def f(x):
f = func

with np.errstate(all="ignore"):
if isinstance(f, np.ufunc):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said I am not inclided to accept this w/o proof that this does not kill perf. We likely don't have asv's for this (or even using a np.ufunc in apply).

Now if you add those and either perf is not degraded, or you actually catch the TypeError then I would be ok.

@jreback jreback added the Performance Memory or execution speed performance label Apr 25, 2020
@mproszewska
Copy link
Contributor Author

I felt like numpy functions could be called differently e.g. np.sin(series).
I could add if that checks if series dtype is numpy array.
However, I wouldn't expect that numpy array would be used as dtype, I would usually expect inferred object dtype.

@TomAugspurger
Copy link
Contributor

If we're going to change this then we'll need to deprecate the current behavior first. I haven't looked closely at the original issue, but at a glance it looks like it's dealing with nested data, which isn't especially well-supported in pandas. It's not clear to me that the gain from changing the implementation here is worth the effect to existing users.

@mproszewska
Copy link
Contributor Author

Maybe better solution would be adding new parameter e.g if parameter force_mapping is set to True, function will be applied on each element separately.

@pep8speaks
Copy link

pep8speaks commented May 10, 2020

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-10 21:18:01 UTC

@TomAugspurger
Copy link
Contributor

I don't like just adding the force_mapping keyword. It's not clear from the name why you would want it.

If we want to do this, I think the solution is to actually deprecate the current behavior of ufuncs in apply and instruct people to use .pipe(ufunc) rather than .apply(ufunc). But I'm not sure right now if it's worth the churn.

@jreback
Copy link
Contributor

jreback commented May 11, 2020

I agree with @TomAugspurger here, would be ok with showing a PerformanceWarning, but unlikely to be able to change / deprecate this behavior and cannot kill performance like this.

@mproszewska
Copy link
Contributor Author

What do you mean by showing PerformanceWarning? Where should be this warning included?
Should I provide better parameter name and description? or are you completely against idea behind that PR?

@jreback
Copy link
Contributor

jreback commented May 15, 2020

What do you mean by showing PerformanceWarning? Where should be this warning included?
Should I provide better parameter name and description? or are you completely against idea behind that PR?

we are -1 on adding any options to this function. you cannot simply change the way something like this works w/o showing the perf benefit (or drag in this case). I suspect this is likely going to be bad.

So the only way this could possibly go forward is to detect cases where this is likey to be used (e.g. using a ufunc on an object dtype), then dispatch that way. You might need to then show a PerformanceWarning in that case.

@TomAugspurger
Copy link
Contributor

@jreback do you have a preference for PerformanceWarning vs. deprecating the current behavior. IMO, deprecating the current behavior is the best option: We don't slow down people's code immediately. Gives them time to update their code from .apply(np.sin) to .pipe(np.sin).

The downside is that we can't fix the bug immediately. But IMO not slowing down this code without warning is more important than fixing the bug that (I think) only shows up for nested data.

@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

I don’t think we should change this at all including deprecation

@jreback
Copy link
Contributor

jreback commented Jun 9, 2020

closing this as I don't see positions changing here. If you want to raise an issue and get consensus on a change okie dokie.

@jreback jreback closed this Jun 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: apply for series of numpy arrays falis with np functions
5 participants