-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH:column-wise DataFrame.fillna and duplicated DataFrame.fillna with Series and Dict #30922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2aa3738
to
cc7037e
Compare
pandas/core/generic.py
Outdated
"with dict/Series column " | ||
"by column" | ||
) | ||
for label in self.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accessing by label like typically causes issues when users have duplicate label names - have you tested that case by chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
I didn't. Thanks for pointing out. i change the logic and add test cases.
1b4a89e
to
ad97507
Compare
6f79400
to
6060212
Compare
[[100, 100, 3], [100, 5, 100], [7, 200, 200]], | ||
columns=list("ABB"), | ||
index=[0, 0, 1], | ||
dtype="float64", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be integers no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
There may be disagreement, what i intended is keeping data type same before. Although column's type in df
is float because of "np.nan", anyway each column's in df
is float. So if "fillna" changes data type, it means "fillna" fills NA and also changes data type. I think this is not right and if user want to change data types, then it is user's share.
Nevertheless if you think this must be integers, i will change it.
For that reason, "downcast" parameter is.
6bc8fe2
to
8a25980
Compare
fde8087
to
8346780
Compare
8346780
to
ec012e9
Compare
pandas/core/generic.py
Outdated
|
||
if axis == 1: | ||
result = result.T | ||
|
||
return result if not inplace else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't make any sense if we have transposed
rather do
self._update_inplace(result._data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Okay i change logic. but two ".T" operations need. because what i try to do is axis==1 and axis==0 both can handle. current ".fillna" can't handle duplicated columns.
pandas/tests/frame/test_missing.py
Outdated
# disable this for now | ||
with pytest.raises(NotImplementedError, match="column by column"): | ||
df.fillna(df.max(1), axis=1) | ||
expected = DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a new test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
I'm confused. you mean
- just delete those part of codes which fails to pass test and make a new test
- change this "test_fillna_dict_series" as a new test
give me more specific what you wrote.
ec012e9
to
258659f
Compare
258659f
to
a3245f9
Compare
pandas/core/generic.py
Outdated
@@ -6005,20 +6005,25 @@ def fillna( | |||
) | |||
|
|||
elif isinstance(value, (dict, ABCSeries)): | |||
new_data = self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy / transpose are both potentially very expensive operations - is there a way to do this without requiring both of those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
Okay. copy conditional. remove transpose.
pandas/core/generic.py
Outdated
obj = result[k] | ||
obj.fillna(v, limit=limit, inplace=True, downcast=downcast) | ||
return result if not inplace else None | ||
new_data.iloc[:, i] = new_data.iloc[:, i].fillna( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would rather build up a list of the results here in a list comprehension
a3245f9
to
621d2a7
Compare
5e5d272
to
36bfd78
Compare
387ca34
to
3763e31
Compare
f66afef
to
c434366
Compare
0c64e73
to
5b3c363
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls merge master as well
"by column" | ||
) | ||
for i, item in enumerate(temp_data.items()): | ||
label, content = item |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't make sense with the axis here; you are updating the same column whether axis==0 or 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Yes. but, filled value is different whether axis==0 or 1. And 'downcast' works properly when execute column-based.
de6d4f7
to
587f619
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge master and move the release note
587f619
to
38f3657
Compare
@proost can you fix merge conflict and try to fix CI failure? |
@proost is this still active? If yes can you merge master and fix the CI failure? |
Closing in favor of #38352 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
description:
Access "DataFrame" each column, fills up NA values using "Series.fillna"
and what i found is if dataframe is duplicated, ".fillna" not guarantee filling NA. So i change them can fill NA.
Q. Why access column base not index base?
A. problem of Index base f".illna" is not preserve column's dtype. so i use column base.
Q. Why assign in column new values not use inplace=True ?
A. To avoid chained indexing. If column and index are both duplicated, chained indexing happens, i want to avoid this situation.