Converting from categorical to int ignores NaNs #28406

hnykda · 2019-09-12T10:25:07Z

Code Sample, a copy-pastable example if possible

In [6]: s = pd.Series([1, 0, None], dtype='category')

In [7]: s
Out[7]: 
0      1
1      0
2    NaN
dtype: category
Categories (2, int64): [0, 1]

In [8]: s.astype(int)
Out[8]: 
0                      1
1                      0
2   -9223372036854775808  # <- this is unexpected
dtype: int64

Problem description

When converting a categorical Series back to an integer column, NaN is converted to an incorrect negative integer value.

Expected Output

I would expect NaN in a categorical to convert to NaN in IntX (nullable integer) or float.

When I try s.astype('Int8'), I get the error "dtype not understood".

Output of `pd.show_versions()`

In [147]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.2.13-arch1-1-ARCH
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.2
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : 2.7.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.14.1
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None


TomAugspurger · 2019-09-12T17:13:04Z

I would expect s.astype(int) to raise, the same as s.astype(float).astype(int)

To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

dsaxton · 2019-09-12T17:40:53Z

> I would expect s.astype(int) to raise, the same as s.astype(float).astype(int)
>
> To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

It seems like the weird value gets introduced here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L523 . Would this be a numpy issue?

In [1]: import numpy as np                                                      

In [2]: arr = np.array([1.0, 0.0, np.nan])                                      

In [3]: np.array(arr, dtype=np.int64)                                            
Out[3]: array([                   1,                    0, -9223372036854775808])

hnykda · 2019-09-12T18:09:34Z

> I would expect s.astype(int) to raise, the same as s.astype(float).astype(int)

Well, but that's a bit sad because it means that once you convert to categoricals, you cannot get back :-( .

TomAugspurger · 2019-09-12T18:13:25Z

> It seems like the weird value gets introduced here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L523 . Would this be a numpy issue?

Thanks for tracking it down. I think that's just NumPy's defined behavior.

We might want astype_nansafe there (should also test with inf).

> Well, but that's a bit sad because it means that once you convert to categoricals, you cannot get back :-( .

What do you mean?

hnykda · 2019-09-12T18:40:05Z

> What do you mean?

Well, I meant something like if you did s = pd.Series([1, None]).astype(int).astype('category') and now wanted to get back to int somehow. The workaround seems to be s.astype('float').astype('Int8'), for example.

I understand it's tricky, especially when dealing with nullable integers. It's true that raising is probably better than trying to be somehow clever.

TomAugspurger · 2019-09-12T18:56:57Z

Sorry, I'm still not understanding. In your example,

s = pd.Series([1, None]).astype(int)

raises. The conversion from float to int is what raises, not to or from categorical.
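
A minimal sketch of the point above, for illustration: the float-to-int cast is the step that raises, with no categorical involved. The helper name `cast_raises` is hypothetical, and the infinity case is included because it was mentioned earlier in the thread. The exact exception message differs between pandas versions, so only the `ValueError` type is checked.

```python
import numpy as np
import pandas as pd


def cast_raises(values) -> bool:
    """Return True if casting a float64 Series built from `values` to int raises ValueError."""
    s = pd.Series(values, dtype="float64")
    try:
        s.astype(int)
    except ValueError:
        return True
    return False


nan_raises = cast_raises([1, None])    # NaN cannot be represented as an integer
inf_raises = cast_raises([1, np.inf])  # neither can infinity
clean_raises = cast_raises([1, 0])     # finite values cast without error
```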

hnykda · 2019-09-12T19:28:12Z

Sorry, I wrote it wrongly. I meant if you do:

In [3]: s = pd.Series([1, None], dtype='Int8').astype('category')

and now wanted to get back to numerics, or more specifically e.g. Int8. You have to do

In [6]: s.astype(float).astype('Int8')

while it would be super cool to do just s.astype('Int8')
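
The two-step workaround can be sketched as follows. This uses the series from the original report (a categorical with plain int64 categories) rather than one built from `Int8`, which is an assumption made to keep the example small; behavior of the direct `s.astype('Int8')` path depends on the pandas version and is not shown.

```python
import pandas as pd

# Categorical series with a missing value, as in the original report.
s = pd.Series([1, 0, None], dtype="category")

# Go through float first (NaN is representable there), then to nullable Int8.
roundtrip = s.astype(float).astype("Int8")
```

The intermediate float step preserves the missing value as NaN, which the nullable integer dtype then stores as a missing entry instead of a sentinel integer.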

TomAugspurger · 2019-09-12T19:33:23Z

> s.astype('Int8')

That should work just fine. As I said earlier

> To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

dsaxton · 2019-09-12T23:25:10Z

> Thanks for tracking it down. I think that's just NumPy's defined behavior.
>
> We might want astype_nansafe there (should also test with inf).

Seems that astype_nansafe expects an ndarray but self is Categorical here. Any thoughts on the best workaround?

TomAugspurger · 2019-09-13T21:07:19Z

Hmm, OK.

It's a bit unfortunate, but I think we'll need to include something like

if is_integer_dtype(dtype):
    if self.isna().any():
        raise ValueError(...)
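
As a standalone illustration of that guard, here is a rough sketch outside of pandas internals. The function `categorical_to_numpy` and its error message are hypothetical and are not the code that was merged; it only shows the idea of checking for missing values before an integer cast.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def categorical_to_numpy(cat: pd.Categorical, dtype) -> np.ndarray:
    """Hypothetical helper: cast a Categorical to a NumPy dtype,
    refusing to turn missing values into integers."""
    dtype = np.dtype(dtype)  # restrict this sketch to NumPy dtypes
    if is_integer_dtype(dtype) and cat.isna().any():
        raise ValueError("Cannot convert float NaN to integer")
    return np.asarray(cat).astype(dtype)


clean = categorical_to_numpy(pd.Categorical([1, 0]), "int64")

try:
    categorical_to_numpy(pd.Categorical([1, 0, None]), "int64")
    guard_triggered = False
except ValueError:
    guard_triggered = True

# Casting to float is still allowed, since NaN is representable there.
as_float = categorical_to_numpy(pd.Categorical([1, 0, None]), "float64")
```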

TomAugspurger added the Categorical (Categorical Data Type) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Sep 12, 2019

TomAugspurger added this to the Contributions Welcome milestone Sep 12, 2019

TomAugspurger added the Effort Low label Sep 12, 2019

dsaxton mentioned this issue Sep 13, 2019

BUG: Don't cast categorical nan to int #28438

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Sep 18, 2019

jreback closed this as completed in #28438 Sep 18, 2019
