BUG: DataFrame.agg crashing with user-defined function #41768

JoakimZachrisson · 2021-06-01T16:01:14Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])

print(df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean)))  # Works fine
print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))  # Does not work

Problem description

Passing a lambda or other user-defined function causes pd.DataFrame.agg to crash.

Traceback (most recent call last):
  File "/home/tobii.intra/jzn/.PyCharm2018.3/config/scratches/scratch_200.py", line 22, in <module>
    print(df.agg(x=('A', max), y=('B', 'min'), z=('C', lambda x: np.mean(x))))
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 7379, in aggregate
    result_in_dict = relabel_result(result, func, columns, order)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/aggregation.py", line 342, in relabel_result
    s = s[col_idx_order]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2912, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/tobii.intra/jzn/.virtualenvs/venv/lib/python3.7/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"

When I set a breakpoint in the user-defined function, I noticed that the input is not a Series containing all values of the column, but a single scalar value from that column. This seems to be isolated to DataFrame.agg: my workaround was to add a trivial group-by and use GroupBy.agg, which works fine.
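The group-by workaround mentioned above might look like the following minimal sketch. The constant key column `_g` is a hypothetical name introduced only for illustration, and the string names 'max' and 'min' stand in for the built-ins used in the report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])

# Put every row into a single group via a constant key ("_g" is an
# illustrative name), so that GroupBy.agg hands each column to the
# lambda as a whole Series instead of one scalar at a time.
out = (
    df.assign(_g=0)
      .groupby('_g')
      .agg(x=('A', 'max'), y=('B', 'min'), z=('C', lambda s: np.mean(s)))
)
print(out)
```

The result is a one-row DataFrame indexed by the dummy key, so a final `out.iloc[0]` would recover a flat Series if that shape is preferred.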

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-143-generic
Version : #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.18.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.1.2
setuptools : 54.2.0
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


zerothi · 2021-06-17T07:54:44Z

To add to this (tested with 1.2.4)

print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work

also does not work. This is, to me, confusing since it works with groupby statements:

df = pd.DataFrame([[1, 2, 3, 0],
                   [4, 5, 6, 0],
                   [7, 8, 9, 0],
                   [np.nan, np.nan, np.nan, 0]],
                  columns=['A', 'B', 'C', 'idx'])

print(df.groupby('idx').agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # works!
print(df.agg({'A': max, 'B': min, 'C': lambda x: np.mean(x)}))  # Does not work!

PorcelainMouse · 2023-01-10T21:41:18Z

It would be really great if someone could explain, here, what is the actual cause of this error. It's hard to even find an acceptable workaround without that information.

wjandrea · 2023-10-20T18:16:49Z

@PorcelainMouse If I'm reading the source code right, the problem is that s.apply(f) is tried first and only if that fails is f(s) called (in SeriesApply.agg). For more details see #53208.

np.mean is special-cased to skip that part (via SeriesApply.agg → Apply.agg → get_cython_func).

The reason s.apply(lambda x: np.mean(x)) doesn't fail is that, per the NumPy docs, "The average is taken over the flattened array by default, otherwise over the specified axis", and for a scalar x, np.array(x).flatten() → np.array([x]). As a workaround you could use np.mean(x, axis=0), which raises a numpy.AxisError on a scalar; that makes s.apply(f) fail, so agg falls through to f(s).
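A minimal sketch of the NumPy behaviour described above, using a small illustrative Series rather than the data from the report: on a scalar, `np.mean` simply returns the value, so an element-wise call never fails, whereas passing `axis=0` makes the scalar case raise.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 6.0, 9.0], name="C")  # illustrative data

f = lambda x: np.mean(x)

# Element-wise: every scalar is its own mean, so nothing fails and
# the result has the same length as the input.
elementwise = s.apply(f)
print(elementwise.tolist())  # [3.0, 6.0, 9.0]

# Called once on the whole Series, the same function reduces.
reduced = f(s)
print(reduced)  # 6.0

# With axis=0, a scalar input raises AxisError, a ValueError subclass.
try:
    np.mean(3.0, axis=0)
    raised = None
except ValueError as exc:
    raised = type(exc).__name__
print(raised)  # AxisError
```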

I'm not sure why the KeyError is occurring though. I haven't looked at that part. I got here from this Stack Overflow question that doesn't involve it.

AKuederle · 2024-06-07T11:18:27Z

Here is another bizarre effect of what I think is the same issue. I still can't fully explain this one.

It seems like pandas is somehow tracking whether the Series passed to the function gets aggregated, even if the aggregated value is never returned?

import pandas as pd

# Example data with some random numbers of shape (3,5)
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
    'D': [10, 11, 12],
    'E': [13, 14, 15]
})

mean = df["A"].mean()

def custom_agg_simple(x):
    return x.mean()

def custom_agg_constant(x):
    return 1

def custom_agg_constant_with_useless_call(x):
    x.mean()
    return 1

print(df.agg({"A": custom_agg_simple}))
# Output:
# A    2.0
print(df.agg({"A": custom_agg_constant}))
# Output:
#    A
# 0  1
# 1  1
# 2  1
print(df.agg({"A": custom_agg_constant_with_useless_call}))
# Output:
# A    1

EDIT:

OK, I now understand what is going on. As mentioned in the previous comment, agg first tries s.apply(f) and only falls back to f(s) if that fails.
apply calls the function once per element. That fails when the function calls .mean() or any other Series-specific method on its argument, so agg falls back to the f(s) code path, which results in a different output shape.

To observe this in action you could use this function as aggregation:

import numpy as np
import pandas as pd

def custom_sum(x):
    # Show what pandas actually passes in on each attempt.
    print(type(x))
    if not isinstance(x, pd.Series):
        print("Error")
        raise ValueError()
    return np.sum(x)

This prints <class 'int'> first, then Error, then <class 'pandas.core.series.Series'>.
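The fallback described in this thread can be modelled with a few lines of plain Python. This is a simplified sketch, not pandas' actual implementation: the helper name `agg_like` is hypothetical, and the set of caught exceptions is an assumption.

```python
import pandas as pd

def agg_like(s, f):
    """Simplified model of the try-apply-then-fall-back behaviour."""
    try:
        # First attempt: call f once per element, as Series.apply does.
        return s.apply(f)
    except (ValueError, AttributeError, TypeError):  # assumed exception set
        # Fallback: call f once on the whole Series.
        return f(s)

s = pd.Series([1, 2, 3], name="A")

# The element-wise call succeeds, so no reduction happens.
constant = agg_like(s, lambda x: 1)

# A plain int has no .mean(), so the element-wise call fails and f(s) runs.
reduced = agg_like(s, lambda x: x.mean())

def const_with_call(x):
    x.mean()  # the result is unused, but the call fails on a plain int
    return 1

forced = agg_like(s, const_with_call)

print(constant.tolist(), reduced, forced)
```

Under this model, the three shapes in the example above follow directly from whether the function happens to raise on a scalar.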

This also hints at a workaround. You can use a custom decorator to always force the second codepath. This might look something like this:

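from functools import wraps

import pandas as pd

def allow_only_series(func):
    @wraps(func)
    def wrapper(x):
        # Raising on non-Series input makes the element-wise attempt fail.
        if not isinstance(x, pd.Series):
            raise ValueError("Only Series allowed")
        return func(x)
    return wrapper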

Wrapping this around your custom function always forces the f(s) code path.
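As a usage sketch, the decorator can be applied to the constant-returning function from the earlier example. The decorator is repeated here only so the snippet runs on its own.

```python
from functools import wraps

import pandas as pd

def allow_only_series(func):
    @wraps(func)
    def wrapper(x):
        # Reject anything that is not a whole Series.
        if not isinstance(x, pd.Series):
            raise ValueError("Only Series allowed")
        return func(x)
    return wrapper

@allow_only_series
def custom_agg_constant(x):
    return 1

df = pd.DataFrame({"A": [1, 2, 3]})

# The wrapper never accepts a scalar, so the function only ever runs
# on the whole column and the result is reduced to one value per column.
result = df.agg({"A": custom_agg_constant})
print(result)
```

With the wrapper in place, the result should hold a single value for column A instead of one row per input element.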

Still strange... So I hope this behaviour will be streamlined in the future.

JoakimZachrisson added the Bug and Needs Triage labels on Jun 1, 2021

JoakimZachrisson changed the title from "BUG:" to "BUG: DataFrame.agg crashing with user-defined function" on Jun 1, 2021

mroeschke added the Apply label and removed the Needs Triage label on Aug 21, 2021

mroeschke mentioned this issue on May 13, 2023 in "BUG: DataFrame.agg not returning a reduced result when providing a lambda" #53208 (Closed)
