BUG: ExtensionArrays whose elements are non-numeric numpy arrays crash Series.__repr__() #33770

Closed
3 tasks done
frreiss opened this issue Apr 24, 2020 · 0 comments · Fixed by #33771
Labels
Bug; ExtensionArray (Extending pandas with custom dtypes or arrays.); Output-Formatting (__repr__ of pandas objects, to_string)

frreiss commented Apr 24, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Prerequisites:

$ pip install pandas numpy memoized-property
$ pip install git+https://github.com/frreiss/text-extensions-for-pandas

Reproduce the problem using TensorArray, our extension type for storing tensors in a pandas Series:

>>> import pandas as pd
>>> import numpy as np
>>> import text_extensions_for_pandas as tp
>>> # Integers work
>>> int_tensors = np.array([[1, 2], [3, 4]])
>>> int_tensor_series = pd.Series(tp.TensorArray(int_tensors))
>>> int_tensor_series

0   [1 2]
1   [3 4]
dtype: TensorType

>>> # Boolean values don't work
>>> bool_tensors = np.array([[True, False], [False, True]])
>>> bool_tensor_series = pd.Series(tp.TensorArray(bool_tensors))
>>> bool_tensor_series
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/pd/covid-notebooks/env/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

[...many lines of stack trace...]

~/pd/covid-notebooks/env/lib/python3.7/site-packages/pandas/io/formats/format.py in _format_strings(self)
   1255         fmt_values = []
   1256         for i, v in enumerate(vals):
-> 1257             if not is_float_type[i] and leading_space:
   1258                 fmt_values.append(" {v}".format(v=_format(v)))
   1259             elif is_float_type[i]:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Problem description

Series.__repr__() invokes ExtensionArrayFormatter to render extension types. If the individual elements managed by an ExtensionArray are numpy arrays (or slices of a larger numpy array), then ExtensionArrayFormatter uses pandas' facilities for rendering numpy arrays. These facilities consist of the base class GenericArrayFormatter and subclasses such as FloatArrayFormatter that handle specific types (see pandas/io/formats/format.py). The formatters for numeric types can render numpy arrays with more than one dimension, but the base class GenericArrayFormatter cannot, so non-numeric elements such as booleans hit the ValueError shown above. The limitation appears to be an oversight. I will submit a small pull request with a fix in a few minutes.
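
The failure mode can be illustrated with numpy alone. The sketch below is not the pandas source. It assumes that a per-element mask ends up with the same shape as the multi-dimensional values, so that indexing it by row yields an array rather than a single bool. The names vals, mask_2d, mask_1d and row_flags are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Stand-in for the values exposed by an ExtensionArray of 2-element boolean tensors.
vals = np.array([[True, False], [False, True]])

# Hypothetical per-element mask with the same shape as `vals`
# (for example, the result of an element-wise "is not missing" check).
mask_2d = np.ones(vals.shape, dtype=bool)

# Indexing by row yields an array, and numpy refuses to treat it as a single bool.
try:
    if not mask_2d[0]:
        pass
    raised = False
except ValueError:
    raised = True

# Collapsing every axis after the first leaves one flag per row,
# which can then be used safely in an `if` statement.
mask_1d = np.all(mask_2d, axis=tuple(range(1, mask_2d.ndim)))
row_flags = [bool(flag) for flag in mask_1d]
```

Reducing over the trailing axes is one possible way to keep a row-by-row loop working for elements of any dimensionality. Whether the eventual fix takes this approach is not stated in this report.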

Expected Output

The above code should output something like this:

0   [True False]
1   [False True]
dtype: TensorType

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.49.0

frreiss added the Bug and Needs Triage labels Apr 24, 2020
jreback added this to the 1.1 milestone Apr 24, 2020
jreback added the ExtensionArray and Output-Formatting labels and removed the Needs Triage label Apr 24, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this issue May 10, 2020