BUG: DataFrame.agg crashing with user-defined function #41768
Comments
To add to this (tested with 1.2.4),

    print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work

also does not work. This is confusing to me, since the same dictionary works with groupby:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1, 2, 3, 0],
                       [4, 5, 6, 0],
                       [7, 8, 9, 0],
                       [np.nan, np.nan, np.nan, 0]],
                      columns=['A', 'B', 'C', 'idx'])
    print(df.groupby('idx').agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # works!
    print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work!
It would be really great if someone could explain here what the actual cause of this error is. It's hard to even find an acceptable workaround without that information.
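Until the cause is pinned down, one way to sidestep the question is to skip `DataFrame.agg` and call each function on the whole column yourself. This is only a minimal sketch, not an official pandas recommendation; `agg_whole_columns` is a hypothetical helper name, and the lambdas stand in for whatever user-defined functions you have.

```python
import numpy as np
import pandas as pd


def agg_whole_columns(df, spec):
    """Hypothetical helper: call each function exactly once on the full column."""
    # Each function receives df[col] as a Series, never a single element.
    return pd.Series({col: func(df[col]) for col, func in spec.items()})


df = pd.DataFrame(
    [[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 9, 0], [np.nan, np.nan, np.nan, 0]],
    columns=["A", "B", "C", "idx"],
)

result = agg_whole_columns(
    df,
    {"A": lambda s: s.max(), "B": lambda s: s.min(), "C": lambda s: s.mean()},
)
print(result)
```

Because the functions are called directly, the result does not depend on how `agg` dispatches callables; the Series methods used here skip NaN by default.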
@PorcelainMouse If I'm reading the source code right, the problem is that
And the reason I'm not sure why the
I think here is another bizarre effect of what I believe is the same issue. I still can't explain this one fully, though. It seems as if pandas is somehow tracking whether the Series passed to the function gets aggregated, even if that value is not returned?

    import pandas as pd

    # Example data of shape (3, 5)
    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9],
        'D': [10, 11, 12],
        'E': [13, 14, 15]
    })

    mean = df["A"].mean()

    def custom_agg_simple(x):
        return x.mean()

    def custom_agg_constant(x):
        return 1

    def custom_agg_constant_with_useless_call(x):
        x.mean()
        return 1

    print(df.agg({"A": custom_agg_simple}))
    # Output:
    # A    2.0

    print(df.agg({"A": custom_agg_constant}))
    # Output:
    #    A
    # 0  1
    # 1  1
    # 2  1

    print(df.agg({"A": custom_agg_constant_with_useless_call}))
    # Output:
    # A    1

EDIT: OK, I understand now what is going on. As mentioned in the previous comment, it is the situation that

To observe this in action, you could use this function as the aggregation:

    import numpy as np

    def custom_sum(x):
        print(type(x))
        if not isinstance(x, pd.Series):
            print("Error")
            raise ValueError()
        return np.sum(x)

Which prints

This also hints at a workaround. You can use a custom decorator to always force the second codepath. This might look something like this:

    from functools import wraps

    def allow_only_series(func):
        @wraps(func)
        def wrapper(x):
            # Reject anything that is not a whole Series.
            if not isinstance(x, pd.Series):
                raise ValueError("Only Series allowed")
            return func(x)
        return wrapper

Wrapping this around your custom function always forces the

Still strange... So I hope this behaviour will be streamlined in the future.
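As a usage sketch of the decorator above: the snippet below applies it to a constant-returning function and to a sum, then passes both to `df.agg`. It assumes, as described in this thread, that the wrapped function ends up being called with the whole Series once the element-wise attempt is rejected; the function names `constant_agg` and `series_sum` are made up for illustration, and behaviour may differ between pandas versions.

```python
from functools import wraps

import numpy as np
import pandas as pd


def allow_only_series(func):
    @wraps(func)
    def wrapper(x):
        # Raising here rejects element-wise calls, so only a whole Series gets through.
        if not isinstance(x, pd.Series):
            raise ValueError("Only Series allowed")
        return func(x)
    return wrapper


@allow_only_series
def constant_agg(x):
    # Illustrative: returns a constant regardless of the input.
    return 1


@allow_only_series
def series_sum(x):
    # Illustrative: sum over the whole column.
    return np.sum(x)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
result = df.agg({"A": constant_agg, "B": series_sum})
print(result)
```

With the decorator in place, each column should yield a single value (one row per key) rather than one value per element.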
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
Code Sample
Problem description
Passing a lambda or user-defined function causes pd.DataFrame.agg to crash.
When setting a breakpoint in the user-defined function, I noticed that the input is not a Series containing all values of the column, but a single scalar value from that column. This seems to be isolated to DataFrame.agg, since my workaround was to add a trivial group by and make use of GroupBy.agg, which works fine.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : b5958ee
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-143-generic
Version : #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.18.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.1.2
setuptools : 54.2.0
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None