pd.json_normalize doesn't return data with index from series #51452
Comments
It sure looks weird, but I don't think it is a bug. You perform the join on two dataframes whose rows are in a different order because of the sort you performed. If you do this:
df1j = pd.json_normalize(df1['json_data'].sort_index())
df2j = pd.json_normalize(df2['json_data'].sort_index())
display(df1.join(df1j))
display(df2.join(df2j))
you'll see it is again as expected. The behavior you described happens because iterating over a sorted DataFrame returns the items in sorted order, not in index order. By the way, it is surprising to me that this works at all, as |
I encountered a similar issue while using pd.json_normalize with a filtered DataFrame. It seems that the resulting json normalized series does not retain the original index. Since filtering is a common operation in data analysis, it would be great if json_normalize could preserve the original index to ensure consistency in the data when performing subsequent operations.
|
The current implementation of |
The root problem here is that Python duck-typing allows any array-like-of-dicts to pass through. Out in the real world, I bet any amount of money that the most common use case for json_normalize is a pandas Series:
And the principle of least surprise says the index of an incoming pandas Series should be persisted. It is a bug in the existing code that should not have made it into prod. Unfortunately, fixing it is a breaking change, so for now we should have a toggleable param. Proposed solution:
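The commenter's proposed code is not reproduced in this thread. As a rough illustration only, an opt-in toggle could be sketched in user code as below. The function name `json_normalize_keep_index` and the `keep_index` parameter are hypothetical and not part of pandas. The sketch assumes each input element yields exactly one output row, so it is not meant for use with `record_path`.

```python
import pandas as pd


def json_normalize_keep_index(data, keep_index=True, **kwargs):
    """Hypothetical wrapper around pd.json_normalize (not a pandas API).

    When `data` is a Series and `keep_index` is True, the Series index is
    copied onto the result. Assumes one output row per input element.
    """
    if isinstance(data, pd.Series):
        # Convert to a plain list so rows come out in positional order.
        result = pd.json_normalize(data.tolist(), **kwargs)
        if keep_index:
            # Positional order matches data.index, so labels line up again.
            result.index = data.index
        return result
    # Non-Series input: defer entirely to pandas.
    return pd.json_normalize(data, **kwargs)


s = pd.Series([{"a": 1}, {"a": 2}], index=[10, 20])
kept = json_normalize_keep_index(s)                     # index labels 10, 20
reset = json_normalize_keep_index(s, keep_index=False)  # index labels 0, 1
```

Defaulting the toggle the other way (off) would keep existing behavior for current callers, which is the backward-compatibility concern raised above.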
|
TLDR; pd.json_normalize creates a new index for the normalised series.
Issue
pd.json_normalize returns a new index for the normalised series, rather than the one supplied. Given the other methods in pandas, this seems like it violates the 'principle of least surprise'? This caught me out for longer than I would care to admit :D
Minimum working example:

The join on the two indices gives an unexpected result, as the two rows are no longer consistent. When looking at the data returned from json_normalize, we can see that the index of the returned data has been reset, which ultimately means the join is not as expected.
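The mismatch described above can be sketched as follows. This is not the reporter's original example: the data and the `id` column are made up for illustration, and only the `json_data` column name is taken from the comments. The Series is converted with `.tolist()` so that the positional behavior is explicit.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "json_data": [{"x": 1}, {"x": 2}, {"x": 3}],
    }
)

# Sorting reorders the rows but keeps the original index labels (2, 1, 0).
df_sorted = df.sort_values("id", ascending=False)

# The records are normalised in positional order and get a fresh 0..n-1 index.
normalized = pd.json_normalize(df_sorted["json_data"].tolist())

# join aligns on index labels, so label 2 (x=3 in json_data) is paired
# with the normalised row labelled 2, which holds x=1.
joined = df_sorted.join(normalized)

# Reassigning the original index before joining restores the pairing.
fixed = df_sorted.join(normalized.set_axis(df_sorted.index, axis=0))
```

In `joined`, the `x` column disagrees with `json_data` for the first and last rows; in `fixed`, every row agrees.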