DOC: groupby aligns Series when passed as groupands #15244

jcrist · 2017-01-27T19:01:08Z

Calling groupby with an unaligned index on the same frame in multiple threads can result in incorrect results. The following example demonstrates:

import numpy as np
import pandas as pd
from multiprocessing.pool import ThreadPool


def build_frame(N, seed):
    rs = np.random.RandomState(seed)
    amounts = rs.exponential(0.1, N) * 100
    ids = np.random.randint(0, 500, N)
    df = pd.DataFrame({'amount': amounts, 'id': ids})
    # XXX: call groupby once before makes everything pass.  Note that the
    # pre-filter by amount is necessary, without it things still fail.
    #df[df.amount < 0].groupby(df.id)
    return df


def f(x):
    return x[x.amount < 20].groupby(x.id).amount.count()


N = 100000
NTHREADS = 8
SEED = 100

df = build_frame(N, SEED)

pool = ThreadPool(NTHREADS)
args = [df] * NTHREADS * 2
r1 = pool.map(f, args)
r2 = pool.map(f, args)

# Print out lengths, which don't always match
print([len(i) for i in r1])
print([len(i) for i in r2])

# Check that results are equivalent
matches = [a.equals(b) for (a, b) in zip(r1, r2)]
assert all(matches)

On my machine, running python test.py results in:

[45, 471, 500, 429, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500]
[500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500]
Traceback (most recent call last):
  File "test.py", line 38, in <module>
    assert all(matches)
AssertionError

A few notes:

Calling groupby once beforehand fixes everything. (see comment in build_frame). To me this indicates that the groupby call sets up some state (that's cached) that may not be threadsafe. Also, the second call to pool.map always returns correct results, only the first calls fail.
Doing this without threading results in the same answer every time.
Without filtering by amount, the result is also always correct.

Tested with pandas master, as well as pandas 0.19.0 and up.

The text was updated successfully, but these errors were encountered:

jreback · 2017-01-27T19:20:37Z

def f(x):
    df = x[x.amount < 20]
    return df.groupby(df.id).amount.count()

in your original function, you are passing x.id which certainly can change as the filtered frame will be new, but the grouping index is the original. That is very unsafe in general (though pandas may take it)

jorisvandenbossche · 2017-01-27T19:22:06Z

I can't seem to reproduce it on my machine.

However, you are passing a grouper with a different shape. I actually didn't know that pandas is aligning the grouper. You can also pass the string: df.groupby('id').amount.count()

jreback · 2017-01-27T19:23:12Z

I think @jorisvandenbossche has it right.

.groupby is aligning the group-and (x.id). So this 'works', but I suspect this is in general an unsafe operation (from a thread point of view).

If you dont't align then this will be ok I think (IOW, if you pass in the group-and from the same frame).

certainly no guarantees :<

jcrist · 2017-01-27T19:41:13Z

Yeah, I think this is why this didn't show up until now - I didn't know pandas supported unaligned indices passed to groupby. Showed up in a user bug dask/dask#1876.

I can fix this in dask by manually aligning beforehand, but the threadsafe issue still stands. Out of curiousity, why does this fail only on the first call? Some index structure being built up and then cached on later calls?

jorisvandenbossche · 2017-01-27T19:41:34Z

The passed series (df.id) gets reindexed (

pandas/pandas/core/groupby.py

Line 2580 in ba05744

return grouper.reindex(axis)._values

), but that still shouldn't modify the original object?

jcrist · 2017-01-27T19:43:09Z

Hmmm, well something is getting modified, as it only fails the first time.

jreback · 2017-01-27T20:12:35Z

its shouldn't ever modify the original object, only the groupby object itself has state, but that could be the problem. IOW, this is a cached_property, which I suppose could be interrupted and if the groupby object is shared......

jcrist · 2017-01-27T20:23:10Z

This fails in the code above though, where only the original frame is shared (neither the groupby or the filtered frame is shared). But df.id is the same object for all calls - could that be modified by the reindex? Or am I misunderstanding what's being cached here?

jcrist · 2017-01-27T20:49:16Z

This seems to only fail if the index is longer than the grouped frame. Swapping the filter onto the index passes every time (not that this is recommended):

def f(x):
    return x.groupby(x[x.amount < 20].id).amount.count()

closes pandas-dev#15789 closes pandas-dev#15244

jcrist changed the title ~~Groupby on filtered column not threadsafe~~ Groupby with unaligned index not threadsafe Jan 27, 2017

jcrist mentioned this issue Jan 27, 2017

Random values/wrong type when grouping a filtered DataFrame by an unfiltered column dask/dask#1876

Closed

jreback added the Groupby label Jan 27, 2017

kernc mentioned this issue Feb 7, 2017

.groupby by should indicate it aligns the passed in Series #15338

Closed

jreback added the Docs label Mar 23, 2017

jreback added this to the Next Major Release milestone Mar 23, 2017

jreback changed the title ~~Groupby with unaligned index not threadsafe~~ DOC: groupby aligns Series when passed as groupands Mar 23, 2017

kernc mentioned this issue Mar 23, 2017

DOC: .groupby() aligns Series, accepts ndarray #15789

Closed

4 tasks

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jreback closed this as completed in 56ccad8 Mar 23, 2017

mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017

DOC: .groupby() aligns Series, accepts ndarray

e11a037

closes pandas-dev#15789 closes pandas-dev#15244

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

DOC: groupby aligns Series when passed as groupands #15244

DOC: groupby aligns Series when passed as groupands #15244

jcrist commented Jan 27, 2017 •

edited by jreback

Loading

jreback commented Jan 27, 2017

Uh oh!

jorisvandenbossche commented Jan 27, 2017

Uh oh!

jreback commented Jan 27, 2017 •

edited

Loading

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jorisvandenbossche commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jreback commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

Uh oh!

DOC: groupby aligns Series when passed as groupands #15244

DOC: groupby aligns Series when passed as groupands #15244

Comments

jcrist commented Jan 27, 2017 • edited by jreback Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

jreback commented Jan 27, 2017

Uh oh!

jorisvandenbossche commented Jan 27, 2017

Uh oh!

jreback commented Jan 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jorisvandenbossche commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jreback commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017

Uh oh!

jcrist commented Jan 27, 2017 •

edited by jreback

Loading

jreback commented Jan 27, 2017 •

edited

Loading