BUG: DataFrame.agg crashing with user-defined function #41768

Open
2 tasks done
JoakimZachrisson opened this issue Jun 1, 2021 · 4 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug

Comments

@JoakimZachrisson

JoakimZachrisson commented Jun 1, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])

print(df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean)))  # Works fine
print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))  # Does not work

Problem description

Passing a lambda or user-defined function causes pd.DataFrame.agg to crash.

Traceback (most recent call last):
  File "/home/tobii.intra/jzn/.PyCharm2018.3/config/scratches/scratch_200.py", line 22, in <module>
    print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 7379, in aggregate
    result_in_dict = relabel_result(result, func, columns, order)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/aggregation.py", line 342, in relabel_result
    s = s[col_idx_order]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2912, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"

When I set a breakpoint in the user-defined function, I noticed that its input is not a Series containing all values of the column but a single scalar value from that column. This seems to be isolated to DataFrame.agg: my workaround was to add a trivial group-by and use GroupBy.agg, which works fine.
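The group-by workaround mentioned above could look roughly like the following. This is only a sketch: the constant grouping key (`constant_key`) is an assumption about what "trivial group by" means here, and the string names "max" and "min" stand in for the built-in functions used in the code sample.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [np.nan, np.nan, np.nan]],
    columns=["A", "B", "C"],
)

# Assumption: a constant key puts every row into one group, so each
# aggregation function receives the whole column as a Series.
constant_key = np.zeros(len(df), dtype=int)

result = df.groupby(constant_key).agg(
    x=("A", "max"),
    y=("B", "min"),
    z=("C", lambda s: np.mean(s)),  # user-defined function
)

# One row (the single group, labelled 0) with columns x, y and z.
print(result)
```

Unlike DataFrame.agg with the same keyword arguments, the result is a one-row DataFrame indexed by the group label, so it may need reshaping afterwards.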

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-143-generic
Version : #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.18.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.1.2
setuptools : 54.2.0
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@JoakimZachrisson JoakimZachrisson added the Bug and Needs Triage labels Jun 1, 2021
@JoakimZachrisson JoakimZachrisson changed the title from "BUG:" to "BUG: DataFrame.agg crashing with user-defined function" Jun 1, 2021
@zerothi

zerothi commented Jun 17, 2021

To add to this (tested with 1.2.4)

print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work

also does not work. This is, to me, confusing since it works with groupby statements:

df = pd.DataFrame([[1, 2, 3, 0],
                   [4, 5, 6, 0],
                   [7, 8, 9, 0],
                   [np.nan, np.nan, np.nan, 0]],
                  columns=['A', 'B', 'C', 'idx'])

print(df.groupby('idx').agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # works!
print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work!

@mroeschke mroeschke added the Apply label and removed the Needs Triage label Aug 21, 2021
@PorcelainMouse

It would be really great if someone could explain, here, what is the actual cause of this error. It's hard to even find an acceptable workaround without that information.

@wjandrea
Contributor

wjandrea commented Oct 20, 2023

@PorcelainMouse If I'm reading the source code right, the problem is that s.apply(f) is tried first and only if that fails is f(s) called (in SeriesApply.agg). For more details see #53208.

np.mean is special-cased to skip that part (via SeriesApply.agg → Apply.agg → get_cython_func).

And the reason s.apply(lambda x: np.mean(x)) doesn't fail is that "The average is taken over the flattened array by default, otherwise over the specified axis", and for a scalar x, np.array(x).flatten() is equivalent to np.array([x]). As a workaround you could use np.mean(x, axis=0), which raises a numpy.AxisError on a scalar, which crashes out of s.apply(f) and into f(s).
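The scalar behaviour described above can be checked with NumPy alone. The sketch below uses two hypothetical helper names (`mean_default`, `mean_axis0`) and catches the error by its base classes, since AxisError derives from both ValueError and IndexError.

```python
import numpy as np


def mean_default(x):
    # No axis given: a scalar is treated like a one-element array.
    return np.mean(x)


def mean_axis0(x):
    # Explicit axis: a scalar has no axis 0, so NumPy raises.
    return np.mean(x, axis=0)


scalar = 3.0
column = np.array([3.0, 6.0, 9.0])

print(mean_default(scalar))  # succeeds, which is why s.apply(f) does not fail

try:
    mean_axis0(scalar)
    raised = False
except (ValueError, IndexError):  # AxisError subclasses both
    raised = True

print(raised)
print(mean_axis0(column))  # a real 1-D input still works with axis=0
```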

I'm not sure why the KeyError is occurring though. I haven't looked at that part. I got here from this Stack Overflow question that doesn't involve it.

@AKuederle

AKuederle commented Jun 7, 2024

Here is another bizarre effect of what I think is the same issue. I still can't fully explain this one.

It seems as if pandas somehow tracks whether the Series passed to the function has been aggregated, even if the aggregated value is not returned?

import pandas as pd

# Example data with some random numbers of shape (3,5)
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
    'D': [10, 11, 12],
    'E': [13, 14, 15]
})

mean = df["A"].mean()

def custom_agg_simple(x):
    return x.mean()

def custom_agg_constant(x):
    return 1

def custom_agg_constant_with_useless_call(x):
    x.mean()
    return 1

print(df.agg({"A": custom_agg_simple}))
# Output:
# A    2.0
print(df.agg({"A": custom_agg_constant}))
# Output:
#    A
# 0  1
# 1  1
# 2  1
print(df.agg({"A": custom_agg_constant_with_useless_call}))
# Output:
#    A   1

EDIT:

OK, I now understand what is going on. As mentioned in the previous comment, agg first tries s.apply(f) and only falls back to f(s) if that fails.
apply first tries to call the function on each element. This of course fails when I call .mean() or any other Series-specific method on the object, so it falls back to the f(s) code path, which results in a different output shape.
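To make the two code paths concrete, here is a toy model of that dispatch. It is not pandas' actual implementation: `agg_like`, `constant` and `mean_via_len` are hypothetical names, and catching every Exception is a simplification.

```python
import pandas as pd


def agg_like(series, func):
    """Toy model of the two-step dispatch (hypothetical, not pandas code)."""
    try:
        # Step 1: element-wise attempt, func sees one scalar at a time.
        return series.apply(func)
    except Exception:
        # Step 2: fallback, func sees the whole Series.
        return func(series)


def constant(x):
    return 1  # never touches x, so the element-wise attempt succeeds


def mean_via_len(x):
    return x.sum() / len(x)  # fails on a scalar, which triggers the fallback


s = pd.Series([1, 2, 3], name="A")
per_element = agg_like(s, constant)       # one result per row
whole_series = agg_like(s, mean_via_len)  # a single aggregated value

print(per_element)
print(whole_series)
```

In this model the output shape depends only on whether the function happens to raise when given a scalar, which matches the behaviour observed in the example above.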

To observe this in action you could use this function as aggregation:

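import numpy as np
import pandas as pd

def custom_sum(x):
    # Show what agg actually passes in on each attempt.
    print(type(x))
    if not isinstance(x, pd.Series):
        # Reject scalars so that agg falls back to calling f(s).
        print("Error")
        raise ValueError()
    return np.sum(x)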

This prints int first, then Error, then pd.Series.

This also hints at a workaround. You can use a custom decorator to always force the second codepath. This might look something like this:

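from functools import wraps

import pandas as pd

def allow_only_series(func):
    @wraps(func)
    def wrapper(x):
        # Raising on anything but a Series makes the element-wise attempt fail.
        if not isinstance(x, pd.Series):
            raise ValueError("Only Series allowed")
        return func(x)
    return wrapper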

Wrapping this around your custom function always forces the f(s) code path.
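As a usage sketch, the decorator can be applied to the constant-returning function from the earlier example. The decorator is repeated here only to keep the snippet self-contained. The exact result may depend on the pandas version, so the snippet prints it rather than assuming a particular layout.

```python
from functools import wraps

import pandas as pd


def allow_only_series(func):
    @wraps(func)
    def wrapper(x):
        # Reject scalars so the element-wise attempt cannot succeed.
        if not isinstance(x, pd.Series):
            raise ValueError("Only Series allowed")
        return func(x)
    return wrapper


@allow_only_series
def custom_agg_constant(x):
    return 1


df = pd.DataFrame({"A": [1, 2, 3]})

# With the wrapper, the function only ever completes on a whole Series,
# so column A is reduced to a single value instead of one value per row.
result = df.agg({"A": custom_agg_constant})
print(result)
```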

Still strange... So I hope this behaviour will be streamlined in the future.

6 participants