Converting from categorical to int ignores NaNs #28406

Closed
hnykda opened this issue Sep 12, 2019 · 10 comments · Fixed by #28438
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

@hnykda

hnykda commented Sep 12, 2019

Code Sample, a copy-pastable example if possible

In [6]: s = pd.Series([1, 0, None], dtype='category')

In [7]: s
Out[7]:
0      1
1      0
2    NaN
dtype: category
Categories (2, int64): [0, 1]

In [8]: s.astype(int)
Out[8]:
0                      1
1                      0
2   -9223372036854775808  # <- this is unexpected
dtype: int64

Problem description

When converting a categorical Series back to an integer column, NaN is converted to an incorrect negative integer value.

Expected Output

I would expect NaN in a categorical to convert to NaN in an IntX (nullable integer) or float column.

When I try s.astype('Int8'), I get a "dtype not understood" error.

Output of pd.show_versions()

In [147]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.2.13-arch1-1-ARCH
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.2
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : 2.7.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.14.1
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@TomAugspurger TomAugspurger added Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 12, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Sep 12, 2019
@TomAugspurger
Contributor

I would expect s.astype(int) to raise, the same as s.astype(float).astype(int).

To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

@dsaxton
Member

dsaxton commented Sep 12, 2019

I would expect s.astype(int) to raise, the same as s.astype(float).astype(int).

To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

It seems like the weird value gets introduced here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L523 . Would this be a numpy issue?

In [1]: import numpy as np

In [2]: arr = np.array([1.0, 0.0, np.nan])

In [3]: np.array(arr, dtype=np.int64)
Out[3]: array([                   1,                    0, -9223372036854775808])

@hnykda
Author

hnykda commented Sep 12, 2019

I would expect s.astype(int) to raise, the same as s.astype(float).astype(int).

Well, but that's a bit sad because it means that once you convert to categoricals, you cannot get back :-( .

@TomAugspurger
Contributor

It seems like the weird value gets introduced here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L523 . Would this be a numpy issue?

Thanks for tracking it down. I think that's just NumPy's defined behavior.

We might want astype_nansafe there (should also test with inf).
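As a rough illustration of the kind of check being discussed, the sketch below refuses to cast a float array to an integer dtype when it contains NaN or inf. The helper name `cast_to_int_checked` is hypothetical and is not pandas' internal `astype_nansafe`; it only shows the idea of validating before the NumPy cast.

```python
import numpy as np


def cast_to_int_checked(values, dtype=np.int64):
    """Cast float values to an integer dtype, refusing NaN and +/-inf.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    """
    arr = np.asarray(values, dtype=np.float64)
    # np.isfinite is False for NaN, inf and -inf, so one check covers all three.
    if not np.isfinite(arr).all():
        raise ValueError("cannot convert non-finite values (NaN or inf) to integer")
    return arr.astype(dtype)


# Finite input is cast normally.
print(cast_to_int_checked([1.0, 0.0]))

# NaN and inf are both rejected instead of being silently cast.
for bad in ([1.0, np.nan], [1.0, np.inf]):
    try:
        cast_to_int_checked(bad)
    except ValueError as exc:
        print("refused:", exc)
```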

Well, but that's a bit sad because it means that once you convert to categoricals, you cannot get back :-( .

What do you mean?

@hnykda
Author

hnykda commented Sep 12, 2019

What do you mean?

Well, I meant something like: if you did s = pd.Series([1, None]).astype(int).astype('category') and then wanted to get back to int somehow. The workaround seems to be s.astype('float').astype('Int8'), for example.

I understand it's tricky, especially when dealing with nullable integers. It's true that raising is probably better than trying to be somehow clever.

@TomAugspurger
Contributor

Sorry, I'm still not understanding. In your example,

s = pd.Series([1, None]).astype(int)

raises. The conversion from float to int is what raises, not to or from categorical.
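A small sketch of that point, with no categorical involved: the missing value makes the Series float, and the float-to-int conversion is the step that fails. The exact exception class and message may differ between pandas versions, so the example only catches `ValueError`.

```python
import pandas as pd

# None becomes NaN, which forces a float dtype on construction.
s = pd.Series([1, None])
print(s.dtype)

try:
    s.astype(int)
except ValueError as exc:
    # The float -> int step rejects the NaN.
    print("astype(int) raised:", exc)
```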

@hnykda
Author

hnykda commented Sep 12, 2019

Sorry, I wrote that incorrectly. I meant if you do:

In [3]: s = pd.Series([1, None], dtype='Int8').astype('category')

and now wanted to get back to numerics, or more specifically e.g. Int8. You have to do

In [6]: s.astype(float).astype('Int8')

while it would be super cool to do just s.astype('Int8')
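For reference, a minimal sketch of the two-step workaround. As an assumption, it uses the categorical Series from the original report (int64 categories) rather than one derived from an Int8 Series, since that path behaves the same way across pandas versions that have nullable integers.

```python
import pandas as pd

s = pd.Series([1, 0, None], dtype="category")

# Step 1: categorical -> float keeps the missing value as NaN.
as_float = s.astype(float)

# Step 2: float -> nullable Int8 turns the NaN into a missing integer entry.
as_int8 = as_float.astype("Int8")

print(as_int8)
```

The intermediate float step is what carries the missing value across; going straight to a plain NumPy integer dtype has nowhere to put it.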

@TomAugspurger
Contributor

s.astype('Int8')

That should work just fine. As I said earlier

To get a nullable integer, s.astype("Int8") should work, but let's leave that as a second issue (we may already have one for it, not sure).

@dsaxton
Member

dsaxton commented Sep 12, 2019

Thanks for tracking it down. I think that's just NumPy's defined behavior.

We might want astype_nansafe there (should also test with inf).

Seems that astype_nansafe expects an ndarray but self is Categorical here. Any thoughts on the best workaround?

@TomAugspurger
Contributor

Hmm, OK.

It's a bit unfortunate, but I think we'll need to include something like

if is_integer_dtype(dtype):
    if self.isna().any():
        raise ValueError(...)
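A standalone sketch of how that guard might behave, written outside of pandas internals. The function `categorical_to_numpy` and its error message are hypothetical; the actual change would presumably sit inside `Categorical.astype`, where `self` is the Categorical. Only NumPy dtypes are handled here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def categorical_to_numpy(cat, dtype):
    """Convert a Categorical to a NumPy array of `dtype`, guarding against NaN.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    dtype = np.dtype(dtype)
    # Integers cannot represent a missing value, so refuse instead of
    # letting NumPy produce an arbitrary number for NaN.
    if is_integer_dtype(dtype) and cat.isna().any():
        raise ValueError("cannot cast a categorical with missing values to integer")
    return np.asarray(cat).astype(dtype)


cat = pd.Categorical([1, 0, None])

# Float can hold NaN, so this conversion goes through.
print(categorical_to_numpy(cat, float))

# The integer conversion is rejected by the guard.
try:
    categorical_to_numpy(cat, int)
except ValueError as exc:
    print("raised:", exc)
```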

@jreback jreback modified the milestones: Contributions Welcome, 1.0 Sep 18, 2019
4 participants