BUG: 'Series.to_numpy(dtype=, na_value=)' behaves differently with 'pd.NA' and 'np.nan' #48951

bbassett-tibco · 2022-10-05T03:27:00Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

x1 = pd.Series([1, 2, None, 4])
x2 = pd.Series([1, 2, np.nan, 4])
x3 = pd.Series([1, 2, pd.NA, 4])

print(x1.to_numpy('int32', na_value=0))
# [1 2 0 4]

print(x2.to_numpy('int32', na_value=0))
# [1 2 0 4]

print(x3.to_numpy('int32', na_value=0))
# Traceback (most recent call last):
#   File "<input>", line 14, in <module>
#     print(x3.to_numpy('int32', na_value=0))
#   File "C:\src\venv\w310\lib\site-packages\pandas\core\base.py", line 535, in to_numpy
#     result = np.asarray(self._values, dtype=dtype)
# TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'

print(x3.to_numpy('float64', na_value=0))
# Traceback (most recent call last):
#   File "<input>", line 22, in <module>
#     print(x3.to_numpy('float64', na_value=0))
#   File "C:\src\venv\w310\lib\site-packages\pandas\core\base.py", line 535, in to_numpy
#     result = np.asarray(self._values, dtype=dtype)
# TypeError: float() argument must be a string or a real number, not 'NAType'

Issue Description

A missing value in a Series created with None or np.nan can be replaced using Series.to_numpy(dtype=, na_value=), but the same call on a Series created with pd.NA raises an exception. Both arguments must be specified to trigger the behavior.
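One thing worth checking is the dtype each constructor infers, since that decides which conversion path to_numpy takes. The snippet below is only a quick inspection of the three Series from the example, not a statement about pandas internals.

```python
import numpy as np
import pandas as pd

x1 = pd.Series([1, 2, None, 4])
x2 = pd.Series([1, 2, np.nan, 4])
x3 = pd.Series([1, 2, pd.NA, 4])

# None and np.nan are both stored as NaN in a float64 Series,
# while pd.NA forces the Series to fall back to object dtype.
for name, s in [("None", x1), ("np.nan", x2), ("pd.NA", x3)]:
    print(name, s.dtype)
# None float64
# np.nan float64
# pd.NA object
```

So the first two cases are plain float64 data, and only the pd.NA case holds Python objects.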

Expected Behavior

Since all three values (None, np.nan, and pd.NA) represent missing values, all three should behave the same. In the reproducible example above, every print statement should report [1 2 0 4] (or [1. 2. 0. 4.] for the fourth, 'float64', case).
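The expected behavior can be sketched as "fill the missing positions first, then cast". The helper below, to_numpy_filled, is a hypothetical name used only for illustration; it is not part of pandas and is not meant to describe how pandas implements to_numpy.

```python
import numpy as np
import pandas as pd


def to_numpy_filled(series, dtype, na_value):
    """Hypothetical helper: replace missing values, then cast to dtype."""
    mask = series.isna().to_numpy()
    # Go through an object array so that any na_value fits before the cast.
    values = series.to_numpy(dtype=object, copy=True)
    values[mask] = na_value
    return values.astype(dtype)


x1 = pd.Series([1, 2, None, 4])
x2 = pd.Series([1, 2, np.nan, 4])
x3 = pd.Series([1, 2, pd.NA, 4])

for s in (x1, x2, x3):
    print(to_numpy_filled(s, "int32", na_value=0))
# [1 2 0 4]
# [1 2 0 4]
# [1 2 0 4]
```

Because the mask comes from Series.isna(), None, np.nan, and pd.NA are all treated the same way, which is the behavior requested here.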

Installed Versions

INSTALLED VERSIONS
------------------
commit : 87cfe4e
python : 3.10.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 22.2.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None


abatomunkuev · 2022-10-05T04:46:12Z

Is it because pd.NA is experimental? See "Experimental new features".

The exception is raised on line 535. Is it a problem with NumPy?

pandas/pandas/core/base.py

Line 535 in 87cfe4e

result = np.asarray(self._values, dtype=dtype)
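To see whether the failure depends on pandas at all, the same cast can be tried on a bare object array. This is a minimal sketch that only mirrors the np.asarray call on that line; it assumes nothing about the rest of to_numpy.

```python
import numpy as np
import pandas as pd

# An object array that still contains pd.NA, as the Series values would.
values = np.array([1, 2, pd.NA, 4], dtype=object)

error = None
try:
    np.asarray(values, dtype="int64")
except TypeError as exc:
    # NumPy converts each element to an integer, and pd.NA does not
    # support that conversion.
    error = exc
    print(type(exc).__name__, exc)
```

NumPy raises here because pd.NA is still in the array when the cast happens, so the question is less about NumPy itself and more about na_value not being applied before the conversion.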

phofl · 2022-10-05T08:02:32Z

Could you post this in #48891?

MarcoGorelli · 2022-10-05T08:26:14Z

I'm not sure this is the same as #48891; I think this can already be considered a bug.

The following should work:

In [5]: pd.Series([1, 2, pd.NA, 4]).to_numpy('int64', na_value=0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 pd.Series([1, 2, pd.NA, 4]).to_numpy('int64', na_value=0)

File ~/pandas-dev/pandas/core/base.py:540, in IndexOpsMixin.to_numpy(self, dtype, copy, na_value, **kwargs)
    535     bad_keys = list(kwargs.keys())[0]
    536     raise TypeError(
    537         f"to_numpy() got an unexpected keyword argument '{bad_keys}'"
    538     )
--> 540 result = np.asarray(self._values, dtype=dtype)
    541 # TODO(GH-24345): Avoid potential double copy
    542 if copy or na_value is not lib.no_default:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'

I think this is more related to #48864

MarcoGorelli · 2022-10-06T12:56:18Z

This works if you start with a nullable type:

In [1]: x3 = pd.Series([1, 2, pd.NA, 4], dtype='Int64')

In [2]: print(x3.to_numpy('int64', na_value=1))
[1 2 1 4]

In [3]: x3.to_numpy('int64', na_value=1)
Out[3]: array([1, 2, 1, 4])

The issue arises if you start with dtype=object.

Looking into this

baloe · 2022-11-30T09:31:42Z

Oh yes, I was about to open an issue, too, and name it

to_numpy(): na_value ignored when converting object-type pandas data

floattypeseries = pd.Series( [1,2,None], dtype='Float64')
objecttypeseries = floattypeseries.astype('object')

floattypeseries.to_numpy(dtype=float, na_value=np.nan)   # → succeeds, 'array([ 1.,  2., nan])'
objecttypeseries.to_numpy(dtype=float, na_value=np.nan)  # → fails with  'TypeError: float() argument must be a string or a number, not 'NAType''

Looking for a drop-in replacement as a workaround, my first idea was to use

.to_numpy(na_value=np.nan).astype(float)

but this fails for integer-type pandas data (not containing any <NA>).
This seems to work:

.astype('object').to_numpy(na_value=np.nan).astype(float)
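For convenience, that chain can be wrapped in a small function and tried on a few dtypes. The name to_float_array and the extra integer Series are made up for this sketch, and it has only been reasoned through for the cases shown.

```python
import numpy as np
import pandas as pd


def to_float_array(series):
    """Hypothetical wrapper around the workaround chain above."""
    # Casting to object first means NaN can always be written into the
    # array before the final conversion to float.
    return series.astype("object").to_numpy(na_value=np.nan).astype(float)


floattypeseries = pd.Series([1, 2, None], dtype="Float64")
objecttypeseries = floattypeseries.astype("object")
inttypeseries = pd.Series([1, 2, 3])  # plain integers, no missing values

for s in (floattypeseries, objecttypeseries, inttypeseries):
    print(to_float_array(s))
```

The cost is an extra pass through an object array, so this is a stopgap for affected versions rather than something to keep once to_numpy handles na_value directly.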

bbassett-tibco added the Bug and Needs Triage labels Oct 5, 2022

MarcoGorelli added the NA - MaskedArrays label and removed the Needs Triage label Oct 5, 2022

MarcoGorelli mentioned this issue Oct 7, 2022

BUG: 'Series.to_numpy(dtype=, na_value=)' behaves differently with 'pd.NA' and 'np.nan' #48951 #48971

Closed

5 tasks

phofl mentioned this issue Dec 30, 2022

BUG: to_numpy not respecting na_value before converting to array #50506

Merged

6 tasks

mroeschke closed this as completed in #50506 Jan 5, 2023
