BUG: unwanted numeric coercion after groupby-apply #14423

Closed
waqarmalik opened this issue Oct 14, 2016 · 2 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Milestone

Comments

@waqarmalik

waqarmalik commented Oct 14, 2016

xref #14873 (boolean casts)
xref #14849 (datetime)

A small, complete example of the issue

import pandas as pd

def predictions(tool):
    out = pd.Series(index=['p1', 'p2', 'useTime'], dtype=object)
    if 'step1' in list(tool.State):
        out['p1'] = str(tool[tool.State == 'step1'].Machine.values[0])
    if 'step2' in list(tool.State):
        out['p2'] = str(tool[tool.State == 'step2'].Machine.values[0])
        out['useTime'] = str(tool[tool.State == 'step2'].oTime.values[0])
    return out


df1 = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'],
                   'State': ['step1', 'step2', 'step1', 'step2'],
                   'oTime': ['', '2016-09-19 05:24:33', '', '2016-09-19 23:59:04'],
                   'Machine': ['23', '36L', '36R', '36R']})

df2 = df1.copy()
df2.oTime = pd.to_datetime(df2.oTime)


pred1 = df1.groupby('Key').apply(predictions)
pred2 = df2.groupby('Key').apply(predictions)

print(pred1)
print(pred2)

Actual Output:

      p1   p2              useTime
Key                               
A    36R  36R  2016-09-19 23:59:04
B     23  36L  2016-09-19 05:24:33
       p1   p2                        useTime
Key                                          
A     NaN  36R  2016-09-19T23:59:04.000000000
B    23.0  36L  2016-09-19T05:24:33.000000000

Expected Output

pred1 and pred2 should have the same values in column p1.
pred1 is correct, whereas in pred2 column p1 is coerced to float64: '23' becomes 23.0 and '36R' becomes NaN.
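
The expectation above can be written as a small check. This is only a sketch: the helper name first_machine and the trimmed-down frame are illustrative assumptions, not code from the report, and the assertion is expected to hold only on a pandas release that includes a fix for this issue.

```python
import pandas as pd

def first_machine(group):
    # Hypothetical helper: first Machine value of the group, kept as a string.
    return pd.Series({'p1': str(group['Machine'].iloc[0])}, dtype=object)

df = pd.DataFrame({
    'Key': ['B', 'B', 'A', 'A'],
    'Machine': ['23', '36L', '36R', '36R'],
    # A datetime column is what triggered the unwanted coercion in the report.
    'oTime': pd.to_datetime([None, '2016-09-19 05:24:33',
                             None, '2016-09-19 23:59:04']),
})

# Select the non-key columns explicitly so only they are passed to the function.
result = df.groupby('Key')[['Machine', 'oTime']].apply(first_machine)

# With the fix, p1 keeps the original strings instead of becoming float64.
assert result['p1'].tolist() == ['36R', '23']
```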

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Oct 14, 2016

So I think we have a duplicate of this already; need to search for it. In any event I think it's doing a coercing conversion. This should strictly be a soft conversion from object -> numeric. So the following works (though I think the existing code should actually work correctly; maybe something is not getting passed through).

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 3c376e3..a86e6d6 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -10,6 +10,7 @@ from pandas.compat import(
     zip, range, long, lzip,
     callable, map
 )
+import pandas as pd
 from pandas import compat
 from pandas.compat.numpy import function as nv
 from pandas.compat.numpy import _np_version_under1p8
@@ -3446,7 +3447,7 @@ class NDFrameGroupBy(GroupBy):
                 # as we are stacking can easily have object dtypes here
                 so = self._selected_obj
                 if (so.ndim == 2 and so.dtypes.isin(_DATELIKE_DTYPES).any()):
-                    result = result._convert(numeric=True)
+                    result = result.apply(lambda x: pd.to_numeric(x, errors='ignore'))
                     date_cols = self._selected_obj.select_dtypes(
                         include=list(_DATELIKE_DTYPES)).columns
                     date_cols = date_cols.intersection(result.columns)
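
To illustrate the distinction drawn above between a coercing and a soft conversion, here is a minimal sketch. The helper soft_to_numeric is a hypothetical name introduced for this example; it uses try/except rather than errors='ignore' because that option is deprecated in recent pandas releases.

```python
import pandas as pd

def soft_to_numeric(col):
    # Hypothetical helper: convert only if every value parses as a number,
    # otherwise hand the column back unchanged.
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

s = pd.Series(['23', '36R'], dtype=object)

# Coercing conversion: unparseable values are replaced with NaN.
coerced = pd.to_numeric(s, errors='coerce')

# Soft conversion: '36R' cannot be parsed, so the strings are left alone.
soft = soft_to_numeric(s)

print(coerced.tolist())
print(soft.tolist())
```

The coercing path is what turns '36R' into NaN in the report; the soft path leaves a mixed column untouched and converts only columns that are numeric throughout.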

A pull request with tests would be welcome.

As an aside, what you are doing inside the .apply is completely inefficient and non-idiomatic.

@jreback jreback added Bug Groupby Difficulty Novice Dtype Conversions Unexpected or buggy dtype conversions labels Oct 14, 2016
@jreback jreback added this to the Next Major Release milestone Oct 14, 2016
@jreback jreback changed the title Weird behavior in groupby-apply BUG: unwanted numeric coercion after groupby-apply Oct 14, 2016
@waqarmalik
Author

waqarmalik commented Oct 14, 2016

Tested and the suggested change works on a much larger data set too.

As an aside, I'd like to find better ways to do it -- groupby followed by extracting key parameters from each group. I couldn't devise a way to make aggregate work. Could you provide some suggestions on improving this? I've set up another page on Stack Overflow for the discussion.

http://stackoverflow.com/questions/40032039/pandas-groupby-apply-weird-behavior-with-series
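
One possible way to avoid .apply here, offered as a sketch rather than as the approach the maintainers had in mind: filter the rows for each State, keep the first row per Key, and let index alignment assemble the result. The variable names step1, step2 and out are illustrative, and unlike the original function this keeps useTime as a datetime instead of a string.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'],
                   'State': ['step1', 'step2', 'step1', 'step2'],
                   'oTime': ['', '2016-09-19 05:24:33', '', '2016-09-19 23:59:04'],
                   'Machine': ['23', '36L', '36R', '36R']})
df['oTime'] = pd.to_datetime(df['oTime'])

# First row per Key for each state, indexed by Key.
step1 = df[df['State'] == 'step1'].drop_duplicates('Key').set_index('Key')
step2 = df[df['State'] == 'step2'].drop_duplicates('Key').set_index('Key')

# Columns are aligned on Key; a Key missing a state would get a missing value.
out = pd.DataFrame({'p1': step1['Machine'],
                    'p2': step2['Machine'],
                    'useTime': step2['oTime']}).sort_index()
print(out)
```

Because no per-group Python function is called, nothing is stacked and re-inferred afterwards, so the Machine strings are never candidates for numeric conversion.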

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 13, 2017
gwpdt added a commit to gwpdt/pandas that referenced this issue Mar 14, 2017
GH Bug pandas-dev#14423

During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric.  E.g. a String column would be coerced to
numeric, producing NaNs.

Fix the issue, and add a test.
gwpdt added a commit to gwpdt/pandas that referenced this issue Mar 16, 2017
Rename test_numeric_coercion to
test_apply_numeric_coercion_when_datetime, and add tests for GH pandas-dev#15421
and pandas-dev#14423
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#14423
closes pandas-dev#15421
closes pandas-dev#15670

During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric.  E.g. a String column would be coerced to
numeric, producing NaNs.

Author: Greg Williams <[email protected]>

Closes pandas-dev#15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017