BUG: read_parquet does not respect index for arrow dtype backend #51726


Closed · wants to merge 4 commits

Conversation

phofl
Member

@phofl phofl commented Mar 1, 2023

@phofl phofl requested a review from mroeschke March 1, 2023 22:36
@phofl phofl added the Arrow pyarrow functionality label Mar 1, 2023
@phofl phofl added this to the 2.0 milestone Mar 1, 2023
@phofl phofl added the IO Parquet parquet, feather label Mar 1, 2023
@rachtsingh

Hey, thanks for tackling this so quickly, I appreciate it. One additional request is that RangeIndex can have a name, which would be inside params.

If it helps, there's an implementation here already: https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/pandas_compat.py#L943 and Wes introduced the dict-based index description here. It feels like an under-defined format so could be hard to write tests for this.

@phofl
Member Author

phofl commented Mar 2, 2023

Thanks, adjusted.

```python
)
index_columns = pa_table.schema.pandas_metadata.get("index_columns", [])
result_dc = {
    col_name: arrays.ArrowExtensionArray(pa_col)
```
Member

Is there a reason not to use `types_mapper` here (as is done for our own masked nullable arrays)?

All of this index handling is already done by pyarrow, and using `to_pandas()` as in the other code paths would ensure it is done consistently.

Member Author

This is a good idea, but it looks like arrow is not applying the types mapper to the index either. Not sure why we didn't do it like this initially, though.

I tried:

```python
types_mapper = lambda x: pd.ArrowDtype(x)
```

Member

Yeah, that's a bug in pyarrow (well, when this keyword was implemented the Index did not yet support EAs, so at that point it wasn't needed to consider the index as well). This was recently reported (apache/arrow#34283), and we can ensure to fix it for the next release in April.

Another reason to go the types_mapper way is that users can define a custom ExtensionArray that has its own conversion from pyarrow->pandas defined, and the current code here would ignore that.

Short term, an option to overcome the Index bug could be to convert the Index manually back to an Arrow-backed array. That's of course a bit wasteful in case it was not a zero-copy conversion, but for people following the latest pyarrow releases it should only be a short-term issue.

Member Author

@phofl phofl Mar 3, 2023

I switched to types_mapper=pd.ArrowDtype in #51766

If converting the Index is on your agenda, I'd avoid implementing this ourselves, since it would basically only be needed for a couple of weeks.

Thoughts?

If you agree, we can just close this.

Member

From my limited understanding of https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/arrow_to_pandas.cc I used the manual conversion to avoid a conversion from numpy(?)

```cpp
// Functions for pandas conversion via NumPy
```

Member Author

They are going through `pandas_dtype.__from_arrow__(arr)`, which receives an arrow array, so we should be good?

Member

Yes, whenever pyarrow detects an extension dtype for a column (either in the metadata, the pyarrow type itself, or the `types_mapper` keyword), we don't actually convert to numpy but directly pass the pyarrow array to `dtype.__from_arrow__`.

Member

Got it, thanks for the confirmation. I think #51766 should be sufficient then.

@mroeschke
Member

Was this supposed to be addressed in #51766?

@phofl phofl closed this Mar 14, 2023
@phofl
Member Author

phofl commented Mar 14, 2023

Not directly, but I added types_mapper support for the index in pyarrow a couple of days ago. We will get this out of the box with pyarrow 12.0, so I wouldn't do anything on our side.

@phofl phofl deleted the 51717 branch August 28, 2023 21:09
Development

Successfully merging this pull request may close these issues.

BUG: pd.read_parquet drops indexes when mode.dtype_backend='pyarrow'
4 participants