ENH add fill_value feature to pd.get_dummies #15926
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #15926      +/-   ##
==========================================
+ Coverage   90.96%    90.97%    +<.01%
==========================================
  Files         145       145
  Lines       49557     49573       +16
==========================================
+ Hits        45081     45097       +16
  Misses       4476      4476

Continue to review full report at Codecov.
Codecov Report

@@            Coverage Diff             @@
##           master    #15926      +/-   ##
==========================================
+ Coverage   90.98%    90.99%    +<.01%
==========================================
  Files         145       145
  Lines       49556     49618       +62
==========================================
+ Hits        45090     45150       +60
- Misses       4466      4468        +2

Continue to review full report at Codecov.
A couple of comments, mostly around dtypes.
I'm a bit hesitant to add this, now that I see the issues around the argument forcing float output :/ We're trying to avoid new APIs that will "break" when we fix NaN behavior in pandas 2.
pandas/tests/test_reshape.py
Outdated
    expected = expected[['C', 'A_b', 'A_nan', 'B_c', 'B_nan']]
    assert_frame_equal(result, expected)

    self.df = DataFrame({'A': ['a', 'b', 'a'],
This should just be df = ..., correct? Not self.df.
Whoops. :(
pandas/tests/test_reshape.py
Outdated
    res_na = get_dummies(s, dummy_na='keep', sparse=self.sparse)
    exp_na = DataFrame({'a': {0: 1, 1: 0, 2: np.nan},
                        'b': {0: 0, 1: 1, 2: np.nan}},
                       dtype=np.float16)
@jreback have you added a pytest.mark for marking stuff that will change in pandas 2? These "should" have int dtype in pandas 2.
Not yet. Don't use float16, though.
What type then? float32? I was trying to save space, but the dynamic range of a uint8 might require a larger float?
yeah, float32 is the best we can do for now. (And FYI, that's another reason to drop the NaNs; then you can use a small int type.)
Yep. Sparse support helps with this though.
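For reference, the ranges in question can be checked directly:

import numpy as np

print(np.iinfo(np.uint8).max)    # 255
print(np.finfo(np.float16).max)  # 65500.0, so float16 covers uint8's range
print(np.finfo(np.float32).max)  # ~3.4e38; float16 support in pandas is
                                 # spotty, hence the float32 suggestion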
pandas/core/reshape.py
Outdated
@@ -1280,7 +1320,10 @@ def get_empty_Frame(data, sparse):
    else:
        dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=0)

    if not dummy_na:
    if dummy_na == 'keep':
        dummy_mat = dummy_mat.astype(np.float16)
If I have a DataFrame like

df = pd.DataFrame({"A": ['a', 'b', np.nan], 'B': [1, 2, 3]})

what is the resulting .dtypes? Does it force all columns to float16, even if there wasn't a NaN in that set (e.g. 'B')? Also, this will force float16 even if there were no NaNs to begin with.
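For context, a minimal sketch of the question against the released get_dummies (dtypes shown are what pandas of this era produces; the 'keep' branch above would upcast):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ['a', 'b', np.nan], 'B': [1, 2, 3]})

print(pd.get_dummies(df).dtypes)
# B      int64   <- numeric column passes through untouched
# A_a    uint8   <- dummy columns; the NaN row is all zeros today
# A_b    uint8
# The question: with dummy_na='keep', do only the A_* columns become
# float, or does every dummy column get upcast even without NaNs?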
Yes, that's right on both accounts. It is pretty straightforward to add more careful type handling if we think it is worth it. To be honest, it is not clear what should be done with types. Should we allow a different type for each column being dummied? Should we return float16 if any column has a NaN and uint8 otherwise? There is not a clear right answer, AFAICT.
you can make them float32, I guess. float16 is barely supported.

I get hesitating. Let me know your decision before we do a lot of work.
Alright. Everything but what the option should be called is fixed, and the linting too, hopefully. :/
pandas/core/reshape.py
Outdated
dummy_na : bool or 'keep', default False
    If True, then add a column to indicate NaNs otherwise if
    False NaNs are ignored. If set to 'keep', NaNs will be propagated
    into the final result.
versionadded tag
Done.
pandas/core/reshape.py
Outdated
@@ -1126,6 +1128,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
    1  0  1  0
    2  0  0  1

    >>> s1 = ['a', 'b', np.nan]
can you add examples covering the bases with dummy_na?
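For illustration, the requested examples would look something like this (dummy_na='keep' is the behavior proposed in this PR, not a released option):

import numpy as np
import pandas as pd

s1 = ['a', 'b', np.nan]

pd.get_dummies(s1)                   # NaN row becomes all zeros
pd.get_dummies(s1, dummy_na=True)    # adds an explicit NaN indicator column
pd.get_dummies(s1, dummy_na='keep')  # proposed: NaN propagates into the row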
pandas/core/reshape.py
Outdated
@@ -1126,6 +1130,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
    1  0  1  0
    2  0  0  1

    >>> s1 = ['a', 'b', np.nan]
I can remove this duplicated line maybe?
ENH add dummy_na='keep' feature to pd.get_dummies
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -366,6 +366,7 @@ Other Enhancements
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
- ``pandas.io.json.json_normalize()`` has gained a ``sep`` option that accepts ``str`` to separate joined fields; the default is ".", which is backward compatible. (:issue:`14883`)
- ``pd.read_csv()`` will now raise a ``csv.Error`` error whenever an end-of-file character is encountered in the middle of a data row (:issue:`15913`)
- `pd.get_dummies()` now accepts the `dummy_na='keep'` keyword which propagates missing values into the dummy columns instead of setting them to zero. (:issue:`15923`)
double back-ticks around pd.get_dummies() and dummy_na='keep'
pandas/core/reshape.py
Outdated
    False NaNs are ignored. If set to 'keep', NaNs will be propagated
    into the final result.

    .. versionadded:: 0.20.0
columns : list-like, default None
need a blank line after this (or sphinx complains)
pandas/core/reshape.py
Outdated
@@ -1126,6 +1130,12 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
    1  0  1  0
    2  0  0  1

    >>> pd.get_dummies(s1, dummy_na='keep')
can you add examples with the other dummy_na values (same s1)?
pandas/core/reshape.py
Outdated
    return get_empty_Frame(data, sparse)

number_of_cols = len(levels)

# If all NaN values are passed, then we add a NaN column
# (and put NaN in it below).
if dummy_na == 'keep' and number_of_cols == 0:
have a test for this?
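A sketch of what such a test could look like; the expected frame is only a guess at the intended 'keep' semantics for all-NaN input, based on the code comment in the diff above:

import numpy as np
import pandas as pd
from pandas.util.testing import assert_frame_equal  # 0.x-era import path

def test_all_nan_keep():
    # All-NaN input with dummy_na='keep' (proposed in this PR): per the
    # comment above, a single NaN column is added and filled with NaN.
    result = pd.get_dummies([np.nan, np.nan], dummy_na='keep')
    expected = pd.DataFrame({np.nan: [np.nan, np.nan]})
    assert_frame_equal(result, expected)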
pandas/core/reshape.py
Outdated
                     sparse_index=IntIndex(N, ixs), fill_value=0,
                     dtype=np.float32)
else:
    sarr = SparseArray(
can you pull the SparseArray creation out of the if/then? You can simply pass in an already-created sarr (of the correct dtype); you don't actually need to pass the dtype.
pandas/tests/test_reshape.py
Outdated
    'A_a': [1, 0, 1, np.nan],
    'A_b': [0, 1, 0, np.nan],
    'B_b': [1, 1, 0, np.nan],
    'B_c': [0, 0, 1, np.nan]})
just pass columns=cols on creation
you can pass dtype=... as well
pandas/tests/test_reshape.py
Outdated
@@ -551,11 +576,23 @@ def test_basic_drop_first_NA(self):
                         ['b', nan], 1)
    assert_frame_equal(res_na, exp_na)

    res_na = get_dummies(s_NA, dummy_na='keep', sparse=self.sparse,
                         drop_first=True)
    exp_na = DataFrame({'b': {0: 0,
this is awkward construction; don't use a dict-of-dicts, a dict-of-lists is much more natural.
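A quick illustration of the two construction styles:

import pandas as pd

# dict-of-dicts: inner keys are index labels, which is hard to scan
awkward = pd.DataFrame({'b': {0: 0, 1: 1, 2: 0}})

# dict-of-lists: positional, reads like the column it builds
natural = pd.DataFrame({'b': [0, 1, 0]})

assert awkward.equals(natural)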
pandas/tests/test_reshape.py
Outdated
res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
                          drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
assert_frame_equal(res_just_na, exp_just_na)

res_just_na = get_dummies([nan], dummy_na='keep', sparse=self.sparse,
                          drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
don't use bare numpy arrays like this; use index=range(1) instead
pandas/tests/test_reshape.py
Outdated
    'A_b': [0, 1, 0, np.nan],
    'B_c': [0, 0, 1, np.nan]})
cols = ['A_b', 'B_c']
expected[cols] = expected[cols].astype(np.float32)
same as above
pandas/tests/test_reshape.py
Outdated
    'A_nan': [0, 0, 0, 1],
    'B_c': [0, 0, 1, 0],
    'B_nan': [0, 0, 0, 1]})
cols = ['A_b', 'A_nan', 'B_c', 'B_nan']
same as above
@jreback I'm happy to address these comments. Almost all of this testing code was copied from other parts of the test suite. Should all of that other code be cleaned up as well? Also, are we set on the interface re: the other discussion on the issue thread?
Sure, if you'd like. I haven't looked at all the code (haha), so for sure some non-conforming code exists. (Things like cleaning up other code are usually better in another PR, though.)
I am ok with the API. @TomAugspurger @jorisvandenbossche
Could you summarize the final API from that other issue? Like one comment said, dropna is a bit weird since we aren't ever dropping rows or cols.
In a sense, the current behavior is like fillna=0. You do the dummy encoding, with NaN -> NaN, then fill those output NaNs with 0. Your proposal is to not fill them. That's all assuming dummy_na=False. You wouldn't want to allow fillna=None or nan.
So I could see either fillna=0, or the third value for dummy_na like now. Not a strong preference between those two. With fillna, would there be weird interactions with dummy_na?
Yeah, I think this is where the rub is, and I 100% agree! Right now the code is filling missing values by default using something like fillna(0). Here are the main considerations compiled in one spot:
So I think the best solution is to not fill missing values by default at all and let people use the already-provided public API methods to do this. This keeps the API purely idiomatic. Further, it would let people do things like drop rows with missing values before or after they apply get_dummies. Thus my suggested API is:
The current API with the PR as it stands now (i.e., with the dummy_na='keep' option) is:
IMO, the simpler solution of just not filling missing values with zero by default makes the code easier to reason about, easier to maintain since there is not a maze of options, and fundamentally more consistent with the other pandas API conventions. OFC it is possible to go with the second solution above, but this adds more complication to the code, deviates from other parts of the pandas API, and leaves the code harder to reason about.
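Spelled out, the suggested idiom (a sketch assuming the proposed propagate-NaN default; under the current released behavior the fillna(0) would be a no-op, since zeros are already filled in):

import numpy as np
import pandas as pd

data = pd.Series(['a', 'b', np.nan])

# Explicit, idiomatic opt-in to zero filling:
data_final = pd.get_dummies(data).fillna(0)

# Or drop the missing rows up front instead:
data_dropped = pd.get_dummies(data.dropna())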
@beckermr fillna is a function name; the one you are looking for is fill_value, which is pretty common (e.g. see .unstack or .reindex). So that actually might be a nice alternative here, though it would have to be fill_value=0 to preserve the existing behavior.
Thanks for the summary, will give feedback later.
Jeff is right, anywhere I used fillna it should have been fill_value.
Ahhhhh @jreback, thanks! Are there any other examples where fill_value defaults to something other than missing? I searched the code base and I cannot find any, so using fill_value=0 would still deviate. We keep coming up with alternative solutions that generate their own conflicts with other parts of the pandas API. I think this all points to the fact that filling with zero by default is fundamentally in conflict with the rest of the API and so should be deprecated.
So this does deviate from the API (as you have noted) 😄 The question is whether it is worth breaking this (and it would be a big break, I think). As @jorisvandenbossche noted, this is generally a 'final' function, so filling implicitly is nice.
See my response to this point:

Basically, pandas cannot promise that filling zeros is the right thing to do. Making as few assumptions as possible about people's input data and letting them do things like data_final = pd.get_dummies(data).fillna(0) is the preferred solution, since it is explicit, idiomatic with respect to the rest of the API, and respectful of the user, giving them the option to do what they like. Breaking this basic part of the API for just this one function, for this one case, using an assumption about the input data that we cannot possibly ensure is always true, seems like an easy choice to me: don't break the API. Filling and dropping missing values is a core use of pandas, and this bit of the API should be as pure as possible. Further, there are major version changes for a reason. Pandas regularly pushes backwards-incompatible changes in an effort to improve the API and make the code better. My suggested solution is very much in line with this ongoing work.
I'm +1 on adding a fill_value keyword. Fortunately, we can defer discussion on what the default should be to the future. @beckermr, just in case it's not clear, I think your argument is completely valid. I just am on the fence about whether it's worth deprecating and changing in the future. While pandas has been aggressive about changing APIs, we're trying to become more stable. Are you OK with implementing fill_value with the current default for now?
I see your point of course, but I think the purity of the API on this core function is very important. I suspect that as the API is cleaned up, there will be less to change. In other words, once it is all internally consistent with clearly correct choices, no more of these changes that break stuff will need to be done in the first place. The other thing to be said is that supporting old features which have turned out to be problematic indefinitely is a bad practice too, since it hampers the development of the code base as a whole. I do get being queasy about making these changes. However, I don't think people will realistically ever move away from pandas, unless they move away from Python too. Given this, it is better to fix stuff sooner rather than later so that pandas can further mature into the core tool it already is. I am happy with adding fill_value.
@beckermr so just to be clear, we would leave the default as-is; ok with deprecation actually (though it could be just a …)
Yes, this is close to my proposed solution. I would …
I do not know the difference in how …
yes, you set …
Ohhhh ok @jreback. Sounds good.
@jreback @TomAugspurger Are we all in agreement then? I am happy to make the changes OFC.
good point, …
to be extra defensive you can check if …
Sounds good. A simple if block is not terribly difficult. :)
maybe better is to use …
I see what you mean, but I think this should be a separate issue that can be addressed in another PR. I am guessing the downcasting logic is repeated a lot in the code base? Probably better to look at the other uses and make sure that something centralized makes sense?
no, we don't downcast on scalars anywhere (we do on arrays, but that is elsewhere and much more complicated). This is a nice way to add general functionality. I always want code that is more general-purpose in nature; if it's implemented multiple times it's wasteful and generally not as well tested.
OK, I am not a huge fan of PRs that put a lot of new code in more than one spot if it is not needed, but happy to move the logic there.
but that's exactly the point. You need to re-use as much infrastructure as possible. I know where most of the bodies are buried; otherwise pandas would never be DRY.
OFC - I have a working draft here but still need to refactor a bit. I will get to it later.
pandas/core/reshape.py
Outdated
    else:
        any_null = False
except TypeError:
    any_null = False
this block sucks. I am sure pandas has an internal for this. I will try and find it later.
reshaping is a bit of a mess
pandas/core/reshape.py
Outdated
elif 'float' in str(fill_value_dtype):
    output_dtype = np.float32
else:
    raise ValueError('`fill_value` must be `np.nan`, an int or a float type!')
A more general version of this block needs to be moved to infer_dtype_from_scalar.
yep, all for that! (Something like this you can do here or in another PR.) An alternative is to strip out the dtype inference and do it in another PR (before this one); that makes this one much cleaner.
pandas/tests/test_reshape.py
Outdated
@@ -326,6 +327,25 @@ def test_include_na(self):
                          dtype=np.uint8)
    tm.assert_numpy_array_equal(res_just_na.values, exp_just_na.values)

    # Add `fill_value` keyword GH15926
comment still needs to be moved.
Another to-do is to add tests for the dtype inference.
yep, oftentimes I am fixing one thing, then I decide to generalize. So usually I will pull that logic into a stand-alone PR, then put what I was doing on top. It's much cleaner / more clear. 1 PR, 1 thing.
@jreback I took another pass at this. I still need to add docs in the whatsnew for the downcast options. However, a 50% review here would be good to make sure there are not any major issues.
just about right actually, though we want to consolidate this impl with what is already done in to_numeric; I pasted the code here. I should have pointed you towards that... sorry.
pandas/core/reshape.py
Outdated
@@ -1075,7 +1077,7 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
    If appending prefix, separator/delimiter to use. Or pass a
    list or dictionary as with `prefix.`
dummy_na : bool, default False
    Add a column to indicate NaNs, if False NaNs are ignored.
    Add a column to indicate NaNs if True.
add what happens if False
pandas/core/reshape.py
Outdated
The default behavior of filling with zeros will be deprecated
in the future, and using this default will now raise a
`FutureWarning`.
maybe also some color on result dtypes
pandas/tests/types/test_cast.py
Outdated
@@ -86,6 +86,98 @@ def test_datetime_with_timezone(self):


class TestInferDtype(object):

    def test_infer_dtype_from_scalar_downcast_basic(self):
ideally these would be parameterized tests (on the dtype)
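A sketch of the suggested parametrization (the downcast keyword is the one this PR proposes to add to infer_dtype_from_scalar, so the call is hypothetical):

import numpy as np
import pytest

from pandas.types.cast import infer_dtype_from_scalar  # module path as of this PR

@pytest.mark.parametrize('dtype', [np.int8, np.int16, np.int32, np.int64])
def test_infer_dtype_from_scalar_downcast(dtype):
    # Hypothetical: with downcast=True, a small scalar should come back
    # as the narrowest dtype regardless of the input width.
    result_dtype, _ = infer_dtype_from_scalar(dtype(1), downcast=True)
    assert result_dtype == np.int8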
pandas/types/cast.py
Outdated
@@ -322,8 +325,52 @@ def infer_dtype_from_scalar(val, pandas_dtype=False):
    whether to infer dtype including pandas extension types.
    If False, scalar belongs to pandas extension types is inferred as
    object
downcast : bool, default False
I would use this same signature for downcast here. We worked through this for a while:
Signature: pd.to_numeric(arg, errors='raise', downcast=None)
Docstring:
Convert argument to a numeric type.
Parameters
----------
arg : list, tuple, 1-d array, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If 'raise', then invalid parsing will raise an exception
- If 'coerce', then invalid parsing will be set as NaN
- If 'ignore', then invalid parsing will return the input
downcast : {'integer', 'signed', 'unsigned', 'float'} , default None
If not None, and if the data has been successfully cast to a
numerical dtype (or if the data was numeric to begin with),
downcast that resulting data to the smallest numerical dtype
possible according to the following rules:
- 'integer' or 'signed': smallest signed int dtype (min.: np.int8)
- 'unsigned': smallest unsigned int dtype (min.: np.uint8)
- 'float': smallest float dtype (min.: np.float32)
As this behaviour is separate from the core conversion to
numeric values, any errors raised during the downcasting
will be surfaced regardless of the value of the 'errors' input.
In addition, downcasting will only occur if the size
of the resulting data's dtype is strictly larger than
the dtype it is to be cast to, so if none of the dtypes
checked satisfy that specification, no downcasting will be
performed on the data.
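For reference, that downcast option in action with the released to_numeric:

import pandas as pd

s = pd.Series([1, 2, 3])
print(pd.to_numeric(s, downcast='integer').dtype)   # int8
print(pd.to_numeric(s, downcast='unsigned').dtype)  # uint8
print(pd.to_numeric(s, downcast='float').dtype)     # float32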
pandas/types/cast.py
Outdated
""" | ||
|
||
def _downcast_dtype(dtype, val): | ||
if 'float' in str(dtype): | ||
if ((val > np.finfo(np.float16).min and |
just decide between float32/64. We don't really support float16 anywhere, so we never want to return it.
Yeah, I was going to do this, but then you get weird stuff where you can feed a float16 to the code without downsizing and get back float16, while if you feed a float16 with downsize=True, you get back float32, which is bigger. Are we cool with that?
then simply detect float16 and force to float32
pandas/types/cast.py
Outdated
                val < np.finfo(np.float16).max) or val is np.nan):
            return np.float16
        elif (val > np.finfo(np.float32).min and
              val < np.finfo(np.float32).max):
actually, you could probably take this exact code from to_numeric, put it in a function, call it here (wrapping the scalar), and replace it in to_numeric:
if downcast is not None and is_numeric_dtype(values):
    typecodes = None
    if downcast in ('integer', 'signed'):
        typecodes = np.typecodes['Integer']
    elif downcast == 'unsigned' and np.min(values) >= 0:
        typecodes = np.typecodes['UnsignedInteger']
    elif downcast == 'float':
        typecodes = np.typecodes['Float']

        # pandas support goes only to np.float32,
        # as float dtypes smaller than that are
        # extremely rare and not well supported
        float_32_char = np.dtype(np.float32).char
        float_32_ind = typecodes.index(float_32_char)
        typecodes = typecodes[float_32_ind:]

    if typecodes is not None:
        # from smallest to largest
        for dtype in typecodes:
            if np.dtype(dtype).itemsize <= values.dtype.itemsize:
                values = maybe_downcast_to_dtype(values, dtype)

                # successful conversion
                if values.dtype == dtype:
                    break
pandas/types/cast.py
Outdated
        elif (val > np.finfo(np.float32).min and
              val < np.finfo(np.float32).max):
            return np.float32
        else:
this would replace your _downcast_dtype, though make it another function in pandas.types.cast; call it maybe_downcast_itemsize (not to confuse it with maybe_downcast_to_dtype, which is similar but doesn't reduce the itemsize).
@jreback Sorry for being MIA. I have been sick. :( I am going to try and make a final pass at this tomorrow, but then I will be out of contact for a week or so. I don't want to delay any releases, so feel free to push this to v0.21.0 if you need to!
@jreback I have taken another pass at this. I moved some of the downsizing logic outside of the function that infers types, since it didn't really work there. Basically, the downsizing functions are built to be used with array-like types, and the infer functions are for scalars. In fact, the downsizing logic in to_numeric …

I don't have any more time to spend on this PR, besides some very small edits, so either we merge it or I need to move on. Feel free of course to pick it up yourself and get it merged. I think it is in a good spot without expanding to basically refactor all of the type checking/changing functionality.
Ack. Generalizing this functionality is now introducing test failures in other places. I really think trying to DRY this up is not needed in this case.
@beckermr this looks pretty good actually. I am not sure I will be able to get to this for 0.20.0 but will at some point.
@@ -372,6 +372,8 @@ Other Enhancements
- :func:`MultiIndex.remove_unused_levels` has been added to facilitate :ref:`removing unused levels <advanced.shown_levels>`. (:issue:`15694`)
- ``pd.read_csv()`` will now raise a ``ParserError`` error whenever any parsing error occurs (:issue:`15913`, :issue:`15925`)
- ``pd.read_csv()`` now supports the ``error_bad_lines`` and ``warn_bad_lines`` arguments for the Python parser (:issue:`15925`)
- ``pd.get_dummies()`` now accepts the ``fill_value`` keyword which specifies how to fill NaN values in the dummy variables. (:issue:`15923`)
- ``pd.types.cast`` has a new function ``maybe_downcast_itemsize`` which can be used to reduce the width of numeric types. (:issue:`15923`)
2nd entry is not necessary, this is internal.
@@ -382,6 +384,19 @@ Other Enhancements

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
needs a ref
Deprecate Automatic Zero Filling of Missing Values in ``pd.get_dummies``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`get_dummies` function currently fills NaN values with zero by default. This behavior is in conflict with the rest of the pandas API since NaN values should be filled with ``fillna`` or a ``fill_value`` keyword, and NaN values should be propagated through pandas transformations. In the future, :func:`get_dummies` will propagate NaN values by default. (:issue:`15923`)
double back-ticks around NaN, please. And please use line-breaks.
fill_value : scalar, default None
    Value to fill NaNs with. If no missing values are found or NaN is not
    used to fill them, the returned data type will be the smallest
    width type that can represent the returned values. See
don't list internal functions.
isnotfinite = []
for v in vals:
    try:
        isnotfinite.append(~np.isfinite(v))
this is pretty complicated logic and needs to be refactored
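One possible simplification, assuming the block's goal is just "does the data contain any nulls" (np.isfinite raises TypeError on object values, hence the original try/except; pd.isnull handles mixed object data directly):

import numpy as np
import pandas as pd

vals = np.array(['a', 'b', np.nan], dtype=object)
any_null = pd.isnull(vals).any()  # True, no try/except needed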
    return get_empty_Frame(data, sparse)

# Record NaN values before we munge the codes, GH15926
nan_codes_msk = codes == -1
nan_codes_msk -> mask
or you can call this check
though in-line this is pretty reasonable; I don't see a reason to create a variable here, it's less readable actually.
For future notes: you have to cache the NaN values, since anything with code == -1 will get overwritten if you pass dummy_na=True.
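In factorize terms, the point being made:

import pandas as pd

codes, levels = pd.factorize(pd.Series(['a', 'b', np.nan]))
print(codes)               # [ 0  1 -1]: NaN is coded as -1

mask = codes == -1         # cache the NaN positions first...
codes[mask] = len(levels)  # ...because dummy_na remaps -1 to a real
                           # column id, losing the information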
# Infer the proper output dtype.
# GH15926
vals = data.values.ravel() if hasattr(data, 'values') else data
using .values is pretty much a no-no, as with mixed dtypes this is completely non-performant. The issue is that we should not have multiple dtypes; this needs more work on this type of checking.
The issue is that we should not have multiple dtypes.

What do you mean here? data could definitely have multiple dtypes, right?
# GH15926
sp_indices = [None] * len(dummy_cols)
sp_fill = [None] * len(dummy_cols)
for code in np.unique(codes[codes != -1]):
use pd.unique
for code in np.unique(codes[codes != -1]):
    # Non-zero value in sparse array if value is of the level,
    # or the value is NaN and it is filled non-zero and
    # it is not the dummy column for NaNs.
this still feels pretty inefficient, but maybe not an issue in practice.
     a    b  NaN
0    1    0    0
1    0    1    0
2  NaN  NaN    1
Is this the output we want? In the case of a dummy_na column, I would expect the row to contain 0/1's, not NaN/1?
Yes this is the output. If you want zeros, supply fill_value=0. The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels. We of course do not know what label the example is since the label is missing.
There is also the added benefit of making the handling and filling of missing values consistent across pandas, so it presents a uniform API to users.
The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels.

Yes, but you ask for a dummy column, which means: treat NaNs as a separate value, and thus make a separate column for it. In a dummies output, you can only have one 1 across the columns, so this means the others have to be 0.

If you want zeros, supply fill_value=0

Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.
Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.

Agreed. I'm struggling to think of a reason you would ever want fill_value=np.nan, dummy_na=True... but I don't think we should raise an exception in this case, if the user specifies those arguments.
Just because you can't think of a reason doesn't mean it doesn't exist. People could fill in a prior for missing values and still want to mark examples which are missing. The point here is to be principled and consistent in terms of the pandas API, while also not making assumptions about other people's data.
The consistency here is not a good argument, in my opinion. With the default (which we are changing here), the NaNs will be preserved, consistent with how this is typically dealt with in pandas. But by supplying dummy_na=True, the user explicitly wants a non-typical behaviour, i.e. creating a column for NaN values. In that case the consistency argument is a bit moot in my opinion, and we should do what is sensible for this specific use-case, which is IMO returning only 0/1's.

When using the dummy_na column, this means that you are no longer treating NaN as a 'missing value' (in the sense of 'I don't know which one of the other values it is'), but as one of the valid values. The consequence is IMO that the other columns should be 0.

People could fill in a prior for missing values and still want to mark examples which are missing.

I don't understand this one.

But I don't think we should raise an exception in this case, if the user specifies those arguments.

Agreed. We could document that the fill_value argument is ignored when using dummy_na=True?
I explained an example of this in an earlier thread. Suppose you know you are missing labels at random and you have a statistically representative data set. Then filling in missing labels with the fraction of each label from the data you have is a good assumption. You might in this case still want to mark which examples had missing labels too.

The point I have made many times is that the fewer assumptions pandas makes about the intentions of users and their data, the better. The pandas user base is broad and covers many academic disciplines, from the social sciences to astronomy, plus a broad range of tech and industry domains. Assumptions built into pandas from one domain may not make sense for the others. The goal should be to provide general, configurable tooling to work with data, based on a consistent API with idiomatic syntax. Not filling missing values with zeros by default and having people supply a fill_value keyword meets these criteria.
@beckermr this was a nice idea. Closing, but if you want to rebase, please comment to re-open and we can revisit.
git diff upstream/master --name-only -- '*.py' | flake8 --diff
Closes #15923