BUG: unwanted numeric coercion after groupby-apply #14423

waqarmalik · 2016-10-14T04:13:53Z

xref #14873 (boolean casts)
xref #14849 (datetime)

A small, complete example of the issue

import pandas as pd

def predictions(tool):
    out = pd.Series(index=['p1', 'p2', 'useTime'], dtype=object)
    if 'step1' in list(tool.State):
        out['p1'] = str(tool[tool.State == 'step1'].Machine.values[0])
    if 'step2' in list(tool.State):
        out['p2'] = str(tool[tool.State == 'step2'].Machine.values[0])
        out['useTime'] = str(tool[tool.State == 'step2'].oTime.values[0])
    return out


df1 = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'],
                   'State': ['step1', 'step2', 'step1', 'step2'],
                   'oTime': ['', '2016-09-19 05:24:33', '', '2016-09-19 23:59:04'],
                   'Machine': ['23', '36L', '36R', '36R']})

df2 = df1.copy()
df2.oTime = pd.to_datetime(df2.oTime)


pred1 = df1.groupby('Key').apply(predictions)
pred2 = df2.groupby('Key').apply(predictions)

print(pred1)
print(pred2)

Actual Output:

      p1   p2              useTime
Key                               
A    36R  36R  2016-09-19 23:59:04
B     23  36L  2016-09-19 05:24:33
       p1   p2                        useTime
Key                                          
A     NaN  36R  2016-09-19T23:59:04.000000000
B    23.0  36L  2016-09-19T05:24:33.000000000

Expected Output

pred1 and pred2 should have the same values in column p1.
pred1 is correct, whereas in pred2 column p1 is coerced to float64: '23' becomes 23.0 and '36R' becomes NaN.

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


jreback · 2016-10-14T10:21:43Z

So I think we have a duplicate of this already; need to search for it. In any event, I think it's doing a coercing conversion. This should strictly be a soft conversion from object -> numeric. So the following works (though I think the existing code should actually work correctly; maybe something is not getting passed through).

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 3c376e3..a86e6d6 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -10,6 +10,7 @@ from pandas.compat import(
     zip, range, long, lzip,
     callable, map
 )
+import pandas as pd
 from pandas import compat
 from pandas.compat.numpy import function as nv
 from pandas.compat.numpy import _np_version_under1p8
@@ -3446,7 +3447,7 @@ class NDFrameGroupBy(GroupBy):
                 # as we are stacking can easily have object dtypes here
                 so = self._selected_obj
                 if (so.ndim == 2 and so.dtypes.isin(_DATELIKE_DTYPES).any()):
-                    result = result._convert(numeric=True)
+                    result = result.apply(lambda x: pd.to_numeric(x, errors='ignore'))
                     date_cols = self._selected_obj.select_dtypes(
                         include=list(_DATELIKE_DTYPES)).columns
                     date_cols = date_cols.intersection(result.columns)

A pull request with tests would be welcome.

As an aside, what you are doing inside the .apply is completely inefficient and non-idiomatic.
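
To make the distinction between a coercing and a soft conversion concrete, here is a minimal sketch on a single column shaped like `p1` from the report. The helper name `soft_to_numeric` is hypothetical and not part of pandas; it stands in for the `errors='ignore'` call in the diff above, which newer pandas releases have, as far as I know, deprecated.

```python
import pandas as pd

# One object-dtype column like `p1`: one value parses as a number, one does not.
p1 = pd.Series(["36R", "23"], index=["A", "B"], dtype=object)

# Coercing conversion: unparseable values become NaN and the column turns
# into float64, which matches the symptom in the report.
coerced = pd.to_numeric(p1, errors="coerce")


def soft_to_numeric(col):
    """Hypothetical helper: convert only if every value parses, else return col unchanged."""
    try:
        return pd.to_numeric(col)  # default errors="raise"
    except (ValueError, TypeError):
        return col


# Soft conversion: the mixed column is left alone, a fully numeric one is converted.
soft = soft_to_numeric(p1)
all_numeric = soft_to_numeric(pd.Series(["1", "2"], dtype=object))

print(coerced)
print(soft)
print(all_numeric)
```

The point of the soft variant is that a column is either converted as a whole or not touched at all, so string identifiers such as '36R' are never silently replaced by NaN.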

waqarmalik · 2016-10-14T16:58:34Z

Tested and the suggested change works on a much larger data set too.

As an aside, I'd like to find better ways to do it -- groupby followed by extracting key parameters from each group. I couldn't devise a way to make aggregate work. Could you provide some suggestions on improving this? I've set up another page on stackoverflow for the discussion.

http://stackoverflow.com/questions/40032039/pandas-groupby-apply-weird-behavior-with-series
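
One possible way to get the same table without a Python-level function per group is to reshape instead of apply. This is only a sketch under assumptions, not an approach suggested in this thread: it assumes both 'step1' and 'step2' occur somewhere in the State column, and it keeps useTime as a datetime rather than converting it to a string.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": ["B", "B", "A", "A"],
    "State": ["step1", "step2", "step1", "step2"],
    "oTime": ["", "2016-09-19 05:24:33", "", "2016-09-19 23:59:04"],
    "Machine": ["23", "36L", "36R", "36R"],
})
df["oTime"] = pd.to_datetime(df["oTime"])  # empty strings become NaT

# Keep the first row per (Key, State), mirroring `.values[0]` in the original.
first = df.drop_duplicates(subset=["Key", "State"])

# Move State into the columns: one row per Key, one column per (field, State).
wide = first.set_index(["Key", "State"])[["Machine", "oTime"]].unstack("State")

# Pick out the three values of interest by their (field, State) label.
result = pd.DataFrame({
    "p1": wide[("Machine", "step1")],
    "p2": wide[("Machine", "step2")],
    "useTime": wide[("oTime", "step2")],
})

print(result)
```

Because no per-group function is involved, this sidesteps the apply code path discussed above; a Key that lacks one of the states simply gets a missing value in the corresponding column.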

GH Bug pandas-dev#14423 During a group-by/apply on a DataFrame, in the presence of one or more DateTime-like columns, Pandas would incorrectly coerce the type of all other columns to numeric. E.g. a String column would be coerced to numeric, producing NaNs. Fix the issue, and add a test.
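
A regression check for this report could look roughly like the sketch below. It is not the test that was added to pandas; it only reruns the example from this issue and asserts that the string column survives. The column selection before `.apply` is my own addition to keep newer pandas versions quiet, and the assertions are expected to hold only on a release that includes the fix (0.20.0 or later, going by the milestone).

```python
import pandas as pd


def predictions(tool):
    # Same per-group extraction as in the report.
    out = pd.Series(index=["p1", "p2", "useTime"], dtype=object)
    if "step1" in list(tool.State):
        out["p1"] = str(tool[tool.State == "step1"].Machine.values[0])
    if "step2" in list(tool.State):
        out["p2"] = str(tool[tool.State == "step2"].Machine.values[0])
        out["useTime"] = str(tool[tool.State == "step2"].oTime.values[0])
    return out


df1 = pd.DataFrame({
    "Key": ["B", "B", "A", "A"],
    "State": ["step1", "step2", "step1", "step2"],
    "oTime": ["", "2016-09-19 05:24:33", "", "2016-09-19 23:59:04"],
    "Machine": ["23", "36L", "36R", "36R"],
})
df2 = df1.copy()
df2["oTime"] = pd.to_datetime(df2["oTime"])  # adds a datetime-like column

cols = ["State", "Machine", "oTime"]
pred1 = df1.groupby("Key")[cols].apply(predictions)
pred2 = df2.groupby("Key")[cols].apply(predictions)

# With the fix, the datetime column must not change how p1 and p2 come out.
assert pred2["p1"].tolist() == ["36R", "23"]
assert pred2["p1"].tolist() == pred1["p1"].tolist()
assert pred2["p2"].tolist() == pred1["p2"].tolist()
```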

closes pandas-dev#14423
closes pandas-dev#15421
closes pandas-dev#15670

During a group-by/apply on a DataFrame, in the presence of one or more DateTime-like columns, Pandas would incorrectly coerce the type of all other columns to numeric. E.g. a String column would be coerced to numeric, producing NaNs.

Author: Greg Williams

Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime

jreback added the Bug, Groupby, Difficulty Novice, and Dtype Conversions labels Oct 14, 2016

jreback added this to the Next Major Release milestone Oct 14, 2016

jreback changed the title ~~Weird behavior in groupby-apply~~ BUG: unwanted numeric coercion after groupby-apply Oct 14, 2016

wes-turner mentioned this issue Dec 10, 2016

groupby type coercion dependent on presence of datetime column in grouped data #14849

Closed

jreback mentioned this issue Dec 13, 2016

BUG: groupby.agg coercing booleans #14873

Closed

masongallo mentioned this issue Dec 31, 2016

groupby casting to int64 #15027

Closed

jreback mentioned this issue Jan 11, 2017

Aggregate with pd.Series.nunique in datetime column has weird result #15112

Closed

jorisvandenbossche mentioned this issue Feb 16, 2017

Unexpected string->float conversion in DataFrame.groupby().apply() #15421

Closed

jreback mentioned this issue Mar 13, 2017

Date Type Corrupting Other Types in Group-by/Apply #15670

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 13, 2017

gwpdt mentioned this issue Mar 14, 2017

BUG: Group-by numeric type-coercion with datetime #15680

Closed

gwpdt added a commit to gwpdt/pandas that referenced this issue Mar 16, 2017

TST: Rename and expand test_numeric_coercion

e1ed104

Rename test_numeric_coercion to test_apply_numeric_coercion_when_datetime, and add tests for GH pandas-dev#15421 and pandas-dev#14423

jreback closed this as completed in 37e5f78 Mar 16, 2017

toobaz mentioned this issue Jul 20, 2017

cast to float when using groupby.agg with function returning int on float input #17035

Closed
