-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
EA: support basic 2D operations #27142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d673008
b2a837b
c2fd1b1
bed5563
29d9dc2
f3ce13c
4fd24c1
c0505ee
c81daeb
4d77dbe
d43ef30
91c979b
bc220c3
421b5a3
7c6df89
203504c
bc12f01
2f18dae
eb0645d
2540933
5b60fb5
96f3ae2
92a0a56
81191bf
91639dd
34cc9e9
444f9f7
7c15b74
768d75d
3b7b2b2
177bfb0
f5cba22
174a1da
1d78fbe
41b49d9
fc331b8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -888,6 +888,7 @@ Other API changes | |
- :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array. (:issue:`21801`) | ||
- :meth:`DataFrame.to_hdf` and :meth:`Series.to_hdf` will now raise a ``NotImplementedError`` when saving a :class:`MultiIndex` with extention data types for a ``fixed`` format. (:issue:`7775`) | ||
- Passing duplicate ``names`` in :meth:`read_csv` will now raise a ``ValueError`` (:issue:`17346`) | ||
- :meth:`Categorical.ravel` will now return a :class:`Categorical` instead of a NumPy array. (:issue:`27153`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this change in 0.25? (IIRC yes?), if not can you break it out There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think we only recently deprecated the old behavior. will double-check and see whats appropriate |
||
|
||
.. _whatsnew_0250.deprecations: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
""" | ||
Utilities for implementing 2D compatibility for 1D ExtensionArrays. | ||
""" | ||
from functools import wraps | ||
from typing import Tuple | ||
|
||
import numpy as np | ||
|
||
from pandas._libs.lib import is_integer | ||
|
||
|
||
def implement_2d(cls): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what is the reasoning behind making this a decorator as opposed to just having base class method? it seems much simpler and more obvious. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The main pushback on this idea was the onus put on EA authors who only support 1D. ATM this decorator mainly just renames But the step after this is to patch There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
less magic than a metaclass (and my first attempt at using a metaclass failed)
Defined There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, right, because simply decorating @implements_2d
class ExtensionArray: would work for subclasses right? It has to be a metaclass. I'll poke around at the meta class approach to see what's going on. May be too magical. But if we do go with a |
||
""" | ||
A decorator to take a 1-dimension-only ExtensionArray subclass and make | ||
it support limited 2-dimensional operations. | ||
""" | ||
from pandas.core.arrays import ExtensionArray | ||
|
||
# For backwards-compatibility, if an EA author implemented __len__ | ||
# but not size, we use that __len__ method to get an array's size. | ||
has_size = cls.size is not ExtensionArray.size | ||
has_shape = cls.shape is not ExtensionArray.shape | ||
has_len = cls.__len__ is not ExtensionArray.__len__ | ||
|
||
if not has_size and has_len: | ||
cls.size = property(cls.__len__) | ||
cls.__len__ = ExtensionArray.__len__ | ||
|
||
elif not has_size and has_shape: | ||
|
||
@property | ||
def size(self) -> int: | ||
return np.prod(self.shape) | ||
|
||
cls.size = size | ||
|
||
orig_copy = cls.copy | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
@wraps(orig_copy) | ||
def copy(self): | ||
result = orig_copy(self) | ||
result._shape = self._shape | ||
return result | ||
|
||
cls.copy = copy | ||
|
||
orig_getitem = cls.__getitem__ | ||
|
||
def __getitem__(self, key): | ||
if self.ndim == 1: | ||
return orig_getitem(self, key) | ||
|
||
key = expand_key(key, self.shape) | ||
if is_integer(key[0]): | ||
assert key[0] in [0, -1] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't understand this at a glance. Who does it have to be in Just from looking at it, I would expect this to break this arr = pd.array([1, 2, 3], dtype="Int64")
# arr = arra.reshape(...)
arr[1] There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I wrote this with only (1, N) in mind |
||
result = orig_getitem(self, key[1]) | ||
return result | ||
|
||
if isinstance(key[0], slice): | ||
if slice_contains_zero(key[0]): | ||
result = orig_getitem(self, key[1]) | ||
result._shape = (1, result.size) | ||
return result | ||
|
||
raise NotImplementedError(key) | ||
# TODO: ellipses? | ||
raise NotImplementedError(key) | ||
|
||
cls.__getitem__ = __getitem__ | ||
|
||
orig_take = cls.take | ||
|
||
# kwargs for compat with Interval | ||
# allow_fill=None instead of False is for compat with Categorical | ||
def take(self, indices, allow_fill=None, fill_value=None, axis=0, **kwargs): | ||
if self.ndim == 1 and axis == 0: | ||
return orig_take( | ||
self, indices, allow_fill=allow_fill, fill_value=fill_value, **kwargs | ||
) | ||
|
||
if self.ndim != 2 or self.shape[0] != 1: | ||
raise NotImplementedError | ||
if axis not in [0, 1]: | ||
raise ValueError(axis) | ||
if kwargs: | ||
raise ValueError( | ||
"kwargs should not be passed in the 2D case, " | ||
"are only included for compat with Interval" | ||
) | ||
|
||
if axis == 1: | ||
result = orig_take( | ||
self, indices, allow_fill=allow_fill, fill_value=fill_value | ||
) | ||
result._shape = (1, result.size) | ||
return result | ||
|
||
# For axis == 0, because we only support shape (1, N) | ||
# there are only limited indices we can accept | ||
if len(indices) != 1: | ||
# TODO: we could probably support zero-len here | ||
raise NotImplementedError | ||
|
||
def take_item(n): | ||
if n == -1: | ||
seq = [fill_value] * self.shape[1] | ||
return type(self)._from_sequence(seq) | ||
else: | ||
return self[n, :] | ||
|
||
arrs = [take_item(n) for n in indices] | ||
result = type(self)._concat_same_type(arrs) | ||
result.shape = (len(indices), self.shape[1]) | ||
return result | ||
|
||
cls.take = take | ||
|
||
orig_iter = cls.__iter__ | ||
|
||
def __iter__(self): | ||
if self.ndim == 1: | ||
for obj in orig_iter(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIRC from one of the proofs-of-concept this actually mattered, but I dont remember which direction it mattered in |
||
yield obj | ||
else: | ||
for n in range(self.shape[0]): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can this case be written as something like
? I worry a bit about the cost of calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Definitely dont want to call orig_iter here since we want to yield ndim-1 arrays |
||
yield self[n] | ||
|
||
cls.__iter__ = __iter__ | ||
|
||
return cls | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure if we are still doing this approach, but if we are I would break out each of the implementations into a well named function that takes cls. then There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. ATM the relevant discussion about how to proceed is in Tom's PR here |
||
|
||
|
||
def slice_contains_zero(slc: slice) -> bool: | ||
if slc == slice(None): | ||
return True | ||
if slc == slice(0, None): | ||
return True | ||
if slc == slice(0, 1): | ||
return True | ||
if slc.start == slc.stop: | ||
# Note: order matters here, since we _dont_ want this to catch | ||
# the slice(None) case. | ||
return False | ||
raise NotImplementedError(slc) | ||
|
||
|
||
def expand_key(key, shape): | ||
ndim = len(shape) | ||
if ndim != 2 or shape[0] != 1: | ||
raise NotImplementedError | ||
if not isinstance(key, tuple): | ||
key = (key, slice(None)) | ||
if len(key) != 2: | ||
raise ValueError(key) | ||
|
||
if is_integer(key[0]) and key[0] not in [0, -1]: | ||
raise ValueError(key) | ||
|
||
return key | ||
|
||
|
||
def can_safe_ravel(shape: Tuple[int, ...]) -> bool: | ||
""" | ||
Check if an array with the given shape can be ravelled unambiguously | ||
regardless of column/row order. | ||
|
||
Parameters | ||
---------- | ||
shape : tuple[int] | ||
|
||
Returns | ||
------- | ||
bool | ||
""" | ||
if len(shape) == 1: | ||
return True | ||
if len(shape) > 2: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this come up in practice? I'm having a hard time judging whether NotImplementedError or False is the right behavior in this case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think NotImplementedError is more accurate, but for the current implementation this should never be reached |
||
raise NotImplementedError(shape) | ||
if shape[0] == 1 or shape[1] == 1: | ||
# column-like or row-like | ||
return True | ||
return False | ||
|
||
|
||
def tuplify_shape(size: int, shape, restrict=True) -> Tuple[int, ...]: | ||
""" | ||
Convert a passed shape into a valid tuple. | ||
Following ndarray.reshape, we accept either `reshape(a, b)` or | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't suppose NumPy has a function we can borrow here? I'm not aware of one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. numpy/numpy#13768 is the closest thing I found, but I agree it seems like the kind of thing that should exist. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here's a somewhat dumb approach:
Tested as:
|
||
`reshape((a, b))`, the latter being canonical. | ||
|
||
Parameters | ||
---------- | ||
size : int | ||
shape : tuple | ||
restrict : bool, default True | ||
Whether to restrict to shapes (N), (1,N), and (N,1) | ||
|
||
Returns | ||
------- | ||
tuple[int, ...] | ||
""" | ||
if len(shape) == 0: | ||
raise ValueError("shape must be a non-empty tuple of integers", shape) | ||
|
||
if len(shape) == 1: | ||
if is_integer(shape[0]): | ||
pass | ||
else: | ||
shape = shape[0] | ||
if not isinstance(shape, tuple): | ||
raise ValueError("shape must be a non-empty tuple of integers", shape) | ||
|
||
if not all(is_integer(x) for x in shape): | ||
raise ValueError("shape must be a non-empty tuple of integers", shape) | ||
|
||
if any(x < -1 for x in shape): | ||
raise ValueError("Invalid shape {shape}".format(shape=shape)) | ||
|
||
if -1 in shape: | ||
if shape.count(-1) != 1: | ||
raise ValueError("Invalid shape {shape}".format(shape=shape)) | ||
idx = shape.index(-1) | ||
others = [n for n in shape if n != -1] | ||
prod = np.prod(others) | ||
dim = size // prod | ||
shape = shape[:idx] + (dim,) + shape[idx + 1 :] | ||
|
||
if np.prod(shape) != size: | ||
raise ValueError( | ||
"Product of shape ({shape}) must match " | ||
"size ({size})".format(shape=shape, size=size) | ||
) | ||
|
||
num_gt1 = len([x for x in shape if x > 1]) | ||
if num_gt1 > 1 and restrict: | ||
raise ValueError( | ||
"The default `reshape` implementation is limited to " | ||
"shapes (N,), (N,1), and (1,N), not {shape}".format(shape=shape) | ||
) | ||
return shape |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,7 @@ | |
from pandas._typing import ArrayLike | ||
from pandas.core import ops | ||
from pandas.core.algorithms import _factorize_array, unique | ||
from pandas.core.arrays._reshaping import can_safe_ravel, tuplify_shape | ||
from pandas.core.missing import backfill_1d, pad_1d | ||
from pandas.core.sorting import nargsort | ||
|
||
|
@@ -83,7 +84,7 @@ class ExtensionArray: | |
* _from_sequence | ||
* _from_factorized | ||
* __getitem__ | ||
* __len__ | ||
* __len__ *or* size | ||
* dtype | ||
* nbytes | ||
* isna | ||
|
@@ -160,6 +161,12 @@ class ExtensionArray: | |
# Don't override this. | ||
_typ = "extension" | ||
|
||
# Whether this class supports 2D arrays natively. If so, set _allows_2d | ||
# to True and override reshape, ravel, and T. Otherwise, apply the | ||
# `implement_2d` decorator to use default implementations of limited | ||
# 2D functionality. | ||
_allows_2d = False | ||
|
||
# ------------------------------------------------------------------------ | ||
# Constructors | ||
# ------------------------------------------------------------------------ | ||
|
@@ -319,7 +326,7 @@ def __len__(self) -> int: | |
------- | ||
length : int | ||
""" | ||
raise AbstractMethodError(self) | ||
return self.shape[0] | ||
|
||
def __iter__(self): | ||
""" | ||
|
@@ -334,6 +341,7 @@ def __iter__(self): | |
# ------------------------------------------------------------------------ | ||
# Required attributes | ||
# ------------------------------------------------------------------------ | ||
_shape = None | ||
|
||
@property | ||
def dtype(self) -> ExtensionDtype: | ||
|
@@ -347,14 +355,33 @@ def shape(self) -> Tuple[int, ...]: | |
""" | ||
Return a tuple of the array dimensions. | ||
""" | ||
return (len(self),) | ||
if self._shape is not None: | ||
return self._shape | ||
|
||
# Default to 1D | ||
length = self.size | ||
return (length,) | ||
|
||
@shape.setter | ||
def shape(self, value): | ||
size = np.prod(value) | ||
if size != self.size: | ||
raise ValueError("Implied size must match actual size.") | ||
self._shape = value | ||
|
||
@property | ||
def ndim(self) -> int: | ||
""" | ||
Extension Arrays are only allowed to be 1-dimensional. | ||
""" | ||
return 1 | ||
return len(self.shape) | ||
|
||
@property | ||
def size(self) -> int: | ||
""" | ||
The number of elements in this array. | ||
""" | ||
raise AbstractMethodError(self) | ||
|
||
@property | ||
def nbytes(self) -> int: | ||
|
@@ -933,6 +960,26 @@ def _formatter(self, boxed: bool = False) -> Callable[[Any], Optional[str]]: | |
# Reshaping | ||
# ------------------------------------------------------------------------ | ||
|
||
def reshape(self, *shape): | ||
""" | ||
Return a view on this array with the given shape. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would be good to document that only There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good thinking. ATM I wrote most of this with only (1, N) in mind |
||
""" | ||
# numpy accepts either a single tuple or an expanded tuple | ||
shape = tuplify_shape(self.size, shape) | ||
result = self.view() | ||
result._shape = shape | ||
return result | ||
|
||
@property | ||
def T(self) -> ABCExtensionArray: | ||
""" | ||
Return a transposed view on self. | ||
""" | ||
if not can_safe_ravel(self.shape): | ||
raise NotImplementedError | ||
shape = self.shape[::-1] | ||
return self.reshape(shape) | ||
|
||
def ravel(self, order="C") -> ABCExtensionArray: | ||
""" | ||
Return a flattened view on this array. | ||
|
@@ -947,10 +994,12 @@ def ravel(self, order="C") -> ABCExtensionArray: | |
|
||
Notes | ||
----- | ||
- Because ExtensionArrays are 1D-only, this is a no-op. | ||
- The "order" argument is ignored, is for compatibility with NumPy. | ||
""" | ||
return self | ||
if not can_safe_ravel(self.shape): | ||
raise NotImplementedError | ||
shape = (self.size,) | ||
return self.reshape(shape) | ||
|
||
@classmethod | ||
def _concat_same_type( | ||
|
Uh oh!
There was an error while loading. Please reload this page.