ENH add fill_value feature to pd.get_dummies #15926

Closed · wants to merge 9 commits

Conversation

@beckermr commented Apr 6, 2017

Closes #15923

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15926 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #15926      +/-   ##
==========================================
+ Coverage   90.96%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49557    49573      +16     
==========================================
+ Hits        45081    45097      +16     
  Misses       4476     4476
Flag        Coverage Δ
#multiple   88.73% <100%> (ø) ⬆️
#single     40.6% <0%> (-0.02%) ⬇️

Impacted Files           Coverage Δ
pandas/core/reshape.py   99.3% <100%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15926 into master will increase coverage by <.01%.
The diff coverage is 92.18%.


@@            Coverage Diff             @@
##           master   #15926      +/-   ##
==========================================
+ Coverage   90.98%   90.99%   +<.01%     
==========================================
  Files         145      145              
  Lines       49556    49618      +62     
==========================================
+ Hits        45090    45150      +60     
- Misses       4466     4468       +2
Flag        Coverage Δ
#multiple   88.76% <92.18%> (+0.01%) ⬆️
#single     40.53% <9.37%> (-0.04%) ⬇️

Impacted Files            Coverage Δ
pandas/core/reshape.py    99.3% <100%> (+0.02%) ⬆️
pandas/types/cast.py      85.5% <81.48%> (-0.24%) ⬇️
pandas/indexes/multi.py   96.7% <0%> (ø) ⬆️
pandas/io/parsers.py      95.65% <0%> (+0.01%) ⬆️
pandas/core/common.py     91.03% <0%> (+0.34%) ⬆️
pandas/types/common.py    94.48% <0%> (+0.83%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TomAugspurger left a comment

A couple of comments, mostly around dtypes.

I'm a bit hesitant to add this, now that I see the issues around the argument forcing float output :/ We're trying to avoid new APIs that will "break" when we fix NaN behavior in pandas 2.

expected = expected[['C', 'A_b', 'A_nan', 'B_c', 'B_nan']]
assert_frame_equal(result, expected)

self.df = DataFrame({'A': ['a', 'b', 'a'],
Contributor:

This should just be df = ... correct? Not self.df

Author:

Whoops. :(

res_na = get_dummies(s, dummy_na='keep', sparse=self.sparse)
exp_na = DataFrame({'a': {0: 1, 1: 0, 2: np.nan},
'b': {0: 0, 1: 1, 2: np.nan}},
dtype=np.float16)
Contributor:

@jreback have you added a pytest.mark for marking stuff that will change in pandas2? these "should" have int dtype in pandas 2.

Contributor:

not yet. don't use float16 though

Author:

What type then? Float32? I was trying to save space but the dynamic range of a uint8 might require a larger float?

Contributor:

yeah, float32 is the best we can do for now. (And FYI, that's another reason to drop the NaN's: then you can use a small int type.)
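A small sketch of that suggestion: dropping the NaNs before encoding leaves nothing to propagate, so the compact integer output is kept (uint8 in the pandas of this era; recent versions return bool).

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# with the NaNs dropped first there is nothing to propagate, so no
# float upcast is needed and the dummy columns stay small integers
dummies = pd.get_dummies(s.dropna())
print(dummies.dtypes)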

Author:

Yep. Sparse support helps with this though.

@@ -1280,7 +1320,10 @@ def get_empty_Frame(data, sparse):
 else:
     dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=0)

-if not dummy_na:
+if dummy_na == 'keep':
+    dummy_mat = dummy_mat.astype(np.float16)
Contributor:

If I have a DataFrame like

df = pd.DataFrame({"A": ['a', 'b', np.nan], 'B': [1, 2, 3]})

What is the resulting .dtypes? Does it force all columns to float16, even if there wasn't a NaN in that column (e.g. 'B')?

Also, this will force it to float16, even if there were no NaNs to begin with.

Author:

Yes, that's right on both counts. It is pretty straightforward to add more careful type handling if we think it is worth it. To be honest, it is not clear what should be done with types.

Should we allow multiple different types for each column being dummied?

Should we return float16 if any column has a nan and uint8 otherwise?

There is not a clear right answer afaict.

Contributor:

you can make them float32 I guess. float16 is barely supported.

@beckermr commented Apr 6, 2017

I get the hesitation. Let me know your decision before we do a lot of work.

@beckermr commented Apr 6, 2017

Alright. Everything but what the option should be called is fixed, and hopefully the linting too. :/

dummy_na : bool or 'keep', default False
    If True, add a column to indicate NaNs; otherwise, if False,
    NaNs are ignored. If set to 'keep', NaNs will be propagated
    into the final result.
Contributor:

versionadded tag

Author:

Done.

@@ -1126,6 +1128,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> s1 = ['a', 'b', np.nan]
Contributor:

can you add examples covering the bases with dummy_na

Author:

Not sure what you mean? See below.
[screenshot: the dummy_na examples in the updated docstring]
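For readers without the image, the docstring examples referred to here likely resembled the following reconstruction (output shown as in pandas 0.x, which returned uint8 columns):

>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0
>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1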

@@ -1126,6 +1130,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> s1 = ['a', 'b', np.nan]
Author:

I can remove this duplicated line maybe?

@jreback changed the title from "ENH add dummy_na='keep feature to pd.get_dummies" to "ENH add dummy_na='keep' feature to pd.get_dummies" Apr 7, 2017
@jreback added the Enhancement, Missing-data, and Reshaping labels Apr 7, 2017
@@ -366,6 +366,7 @@ Other Enhancements
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
- ``pandas.io.json.json_normalize()`` has gained a ``sep`` option that accepts ``str`` to separate joined fields; the default is ".", which is backward compatible. (:issue:`14883`)
- ``pd.read_csv()`` will now raise a ``csv.Error`` error whenever an end-of-file character is encountered in the middle of a data row (:issue:`15913`)
- `pd.get_dummies()` now accepts the `dummy_na='keep'` keyword which propagates missing values into the dummy columns instead of setting them to zero. (:issue:`15923`)
Contributor:

use double back-ticks around pd.get_dummies() and dummy_na='keep'

False NaNs are ignored. If set to 'keep', NaNs will be propagated
into the final result.

.. versionadded:: 0.20.0
columns : list-like, default None
Contributor:

need a blank line after this (or sphinx complains)

Author:

I am confused. Here are the other docs in this function.

[screenshot: the existing parameter docs in get_dummies]

You can see I followed what was done in the other cases. Do these other ones need to be changed too? How did the docs compile at all?

@@ -1126,6 +1130,12 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> pd.get_dummies(s1, dummy_na='keep')
Contributor:

can you add examples with the other dummy_na (same s1)

@beckermr commented Apr 7, 2017

These are already there as I stated above. See this screen shot.

[screenshot: the dummy_na examples in the docstring]

return get_empty_Frame(data, sparse)

number_of_cols = len(levels)

# If all NaN values are passed, then we add a NaN column
# (and put NaN in it below).
if dummy_na == 'keep' and number_of_cols == 0:
Contributor:

have a test for this?

sparse_index=IntIndex(N, ixs), fill_value=0,
dtype=np.float32)
else:
sarr = SparseArray(
Contributor:

can you pull the SparseArray creation out of the if/then? You can simply pass in an already-created sarr (of the correct dtype); you don't actually need to pass the dtype.

'A_a': [1, 0, 1, np.nan],
'A_b': [0, 1, 0, np.nan],
'B_b': [1, 1, 0, np.nan],
'B_c': [0, 0, 1, np.nan]})
Contributor:

just pass the columns=cols on creation

Contributor:

you can pass dtype=... as well

@@ -551,11 +576,23 @@ def test_basic_drop_first_NA(self):
['b', nan], 1)
assert_frame_equal(res_na, exp_na)

res_na = get_dummies(s_NA, dummy_na='keep', sparse=self.sparse,
drop_first=True)
exp_na = DataFrame({'b': {0: 0,
Contributor:

this is awkward construction; don't use a dict-of-dicts, a plain dict-of-lists is much more natural.

res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
assert_frame_equal(res_just_na, exp_just_na)

res_just_na = get_dummies([nan], dummy_na='keep', sparse=self.sparse,
drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
Contributor:

don't use bare numpy arrays like this; use index=range(1) instead

'A_b': [0, 1, 0, np.nan],
'B_c': [0, 0, 1, np.nan]})
cols = ['A_b', 'B_c']
expected[cols] = expected[cols].astype(np.float32)
Contributor:

same as above

'A_nan': [0, 0, 0, 1],
'B_c': [0, 0, 1, 0],
'B_nan': [0, 0, 0, 1]})
cols = ['A_b', 'A_nan', 'B_c', 'B_nan']
Contributor:

same as above

@beckermr commented Apr 7, 2017

@jreback I'm happy to address these comments. Almost all of this testing code was copied from other parts of the test suite. Should all of that other code be cleaned up as well?

Also, are we set on the interface re: the other discussion on the issue thread?

@jreback commented Apr 7, 2017

> @jreback I'm happy to address these comments. Almost all of this testing code was copied from other parts of the test suite. Should all of that other code be cleaned up as well?

sure, if you'd like. I haven't looked at all the code (haha), so for sure some non-conforming code exists. (Things like cleaning up other code are usually better in another PR though.)

@jreback commented Apr 7, 2017

I am ok with the API. @TomAugspurger @jorisvandenbossche

@TomAugspurger commented Apr 7, 2017 via email

@beckermr commented Apr 7, 2017

Yeah I think this is where the rub is and 100% agree! Right now the code is filling missing values by default using something like fillna=0.

Here are the main considerations compiled in one spot:

  1. I searched the rest of the code base and found no examples of a fillna keyword in any public API. In fact, pandas already provides a centralized way to fill missing values, so adding something that is different here doesn't make sense. There should be one, and hopefully only one, way to do things.
  2. There is also the further value checking that would need to be done with a fillna keyword.
  3. As I stated in the other issue, dropna=True is weird since we are not dropping rows. In the other parts of pandas public API, I have only seen this keyword used in reduction-like operations (i.e., summing, taking an average, etc.) where you want the operation to ignore missing values. This usage would deviate from those parts of the API.
  4. A further reason dropna is weird is that pandas also already provides a public API to drop data, namely the dropna method. Again, there should be one and only one way to do things.

So I think the best solution is to not fill missing values by default at all and let people use the already provided public API methods to do this. This keeps the API purely idiomatic. Further it would let people do things like drop rows with missing values before or after they apply the get_dummies function, making processing with these functions more robust to changes in the order of operations, more idempotent, etc.

Thus my suggested API is

  1. Don't fill any NaN's in any situation.
  2. The dummy_na keyword just adds the extra column with a flag if there are any NaNs for each row.

The current API with the PR as it stands now (i.e., with the dummy_na='keep' option) does this

  1. By default (dummy_na=False), all NaNs are set to zero.
  2. If dummy_na=True, fill NaNs with zero and add the column of missing value flags.
  3. If dummy_na='keep', don't add an extra missing value column and don't fill missing values with zero.

IMO, the simpler solution of just not filling missing values with zero by default makes the code easier to reason about, easier to maintain since there is not a maze of options, and fundamentally more consistent with the other pandas API conventions.

OFC it is possible to go with the second solution above, but this adds more complication to the code, deviates from other parts of the pandas API, and leaves the code harder to reason about.
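A sketch contrasting the three behaviors just listed; note that dummy_na='keep' existed only in this PR and was never merged, so the last case is emulated with a mask rather than a real keyword:

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# dummy_na=False (default): the NaN row becomes all zeros
d_default = pd.get_dummies(s)

# dummy_na=True: adds a NaN indicator column; the other cells stay zero
d_flagged = pd.get_dummies(s, dummy_na=True)

# dummy_na='keep' (proposed here, never merged): propagate NaN instead
# of zero-filling, emulated by masking rows where the input is missing
d_keep = pd.get_dummies(s).astype('float64')
d_keep[s.isna()] = np.nan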

@jreback commented Apr 7, 2017

@beckermr fillna is a function name, the one you are looking for is fill_value which is pretty common. e.g. see .unstack or .reindex

so that actually might be a nice alternative here, though it would have to be fill_value=0 to preserve the existing behavior.

@TomAugspurger commented Apr 7, 2017 via email

@beckermr commented Apr 7, 2017

Ahhhhh @jreback. Thanks!

Are there any other examples where fill_value is set to something besides NaN or None?

I searched the code base and I cannot find any. So using fill_value=0 would also deviate from the API of (the rest? most of the rest?) of the code.

We keep coming up with alternative solutions that generate their own conflicts with other parts of the pandas API. I think this all points to the fact that filling with zero by default is fundamentally in conflict with the rest of the API and so should be deprecated.

@jreback commented Apr 7, 2017

fill_value=np.nan is generally the default.

So this does deviate from the API (as you have noted) 😄

The question is whether it is worth breaking this (and it would be a big break, I think).

as @jorisvandenbossche noted, this is generally a 'final' function, so filling implicitly is nice.
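For reference, the fill_value convention jreback mentions, as seen in .reindex:

import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

s.reindex(['a', 'b', 'c'])                # new label becomes NaN (the usual default)
s.reindex(['a', 'b', 'c'], fill_value=0)  # explicit fill with 0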

@beckermr commented Apr 7, 2017

See my response to this point:

The current API makes too many assumptions. Right now it asserts that if a label is missing, it is NONE of the labels you have. This cannot possibly be true in all cases.

Instead, if NaNs were always propagated by default (even if an additional missing label column is added) then users would be required to reason about why their labels are missing and how they should fill nulls.

For example, if labels are missing at random and the rest of the data is representative, then filling the missing one hot encoded rows with the fraction of each label would be a pretty good way to fill nulls.

Of course pandas should not do this either since it should not be making assumptions about the input data.

Basically, pandas cannot promise that filling zeros is the right thing to do. Making as few assumptions as possible about people's input data and letting them do things like

data_final = pd.get_dummies(data).fillna(0)

is the preferred solution since it is explicit, idiomatic with respect to the rest of the API, and respectful to the user to give them the option to do what they like.

Breaking this basic part of the API for just this one function, for this one case, using an assumption about the input data that we cannot possibly ensure is always true, seems like an easy choice to me: don't break the API. Filling and dropping missing values is a core use of pandas, and this bit of the API should be as pure as possible.

Further, there are major version changes for a reason. Pandas regularly pushes backwards-incompatible changes in an effort to improve the API and make the code better. My suggested solution is very much in line with this ongoing work.

@TomAugspurger commented Apr 7, 2017

I'm +1 on adding a fill_value with the default of 0.
I'm -1 on outright breaking API, by having the default be None or np.nan (not sure if this is being proposed or not).
I'm on the fence about adding fill_value=0 with a deprecation warning that it will change to None in the future, probably -0 at this point.

Fortunately, we can defer discussion on what the default should be in the future.
Adding fill_value=0 is backwards compatible, and lets us start a deprecation cycle if we decide to.

@beckermr just in case it's not clear, I think your argument is completely valid. I am just on the fence about whether it's worth deprecating and changing in the future. While pandas has been aggressive about changing APIs, we're trying to become more stable. Are you OK with implementing fill_value=0 for now, and kicking the deprecation stuff down the road?

@beckermr commented Apr 7, 2017

I see your point of course, but I think the purity of the API on this core function is very important.

I suspect that as the API is cleaned up, there will be less to change. In other words, once it is all internally consistent with clearly correct choices, no more of these changes that break stuff will need to be done in the first place.

The other thing to be said is that supporting old features which have turned out to be problematic indefinitely is a bad practice too since it hampers the development of the code base as a whole.

I do get being queasy about making these changes. However, I don't think people will realistically ever move away from pandas unless they move away from Python too. Given this, it is better to fix stuff sooner rather than later so that pandas can further mature into the core tool it already is.

I am happy with adding fill_value=0. I think we should also announce the deprecation of the zero default in favor of a NaN default now. This will kick off the deprecation cycle and fix this as soon as realistically possible without really breaking stuff, which I am not in favor of.

@jreback commented Apr 7, 2017

@beckermr so just to be clear, we would leave dummy_na (as is) and add fill_value=None (with a deprecation warning), where it would still fill with 0?

ok with deprecation actually (though could be just a DeprecationWarning), rather than a full-fledged FutureWarning.

@beckermr commented Apr 7, 2017

Yes, this is close to my proposed solution. I would

  1. Leave dummy_na as is.
  2. Add fill_value=0. (It is weird to me for it to be set to None but for us to fill with zero anyway, so just be explicit and change the keyword later. Edit: You might have meant this. It was not clear to me. Sorry!)
  3. Add either a DeprecationWarning or FutureWarning saying that fill_value will default to NaN or None in the future.
  4. At some point, change the default and remove the warning.

I do not know the difference in how DeprecationWarning and FutureWarning are used so advice on this is much appreciated.

@jreback commented Apr 7, 2017

yes you set fill_value=None to signal that someone hasn't passed anything, so you know to warn.
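A minimal sketch of that sentinel pattern (helper name and message are hypothetical, not the PR's actual code):

import warnings

def _resolve_fill_value(fill_value=None):
    # None is the sentinel for "user passed nothing": warn about the
    # upcoming change of default, then keep the historical 0-fill
    if fill_value is None:
        warnings.warn(
            "the default of filling NaNs with 0 is deprecated; pass "
            "fill_value=0 explicitly to keep the current behavior",
            FutureWarning, stacklevel=2)
        fill_value = 0
    return fill_value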

@beckermr commented Apr 7, 2017

Ohhhh ok @jreback. Sounds good.

@beckermr commented Apr 7, 2017

@jreback @TomAugspurger Are we all in agreement then?

I am happy to make the changes OFC.

@jreback commented Apr 8, 2017

> Cool. For signed ints (i.e., fill_value=-1), should I use int8 or int32?

good point, int8 is fine

@jreback commented Apr 8, 2017

to be extra defensive you can check whether fill_value is in the range of np.iinfo(np.int8).max (and .min); otherwise you can make it an int64 (annoying, but please test as well).
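A sketch of that range check (helper name hypothetical):

import numpy as np

def _int_fill_dtype(fill_value):
    # use int8 when the scalar fits its range, otherwise fall back to int64
    info = np.iinfo(np.int8)
    if info.min <= fill_value <= info.max:
        return np.int8
    return np.int64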

@beckermr commented Apr 8, 2017

Sounds good. A simple if block is not terribly difficult. :)

@jreback commented Apr 8, 2017

maybe better is to use pandas.types.cast.infer_dtype_from_scalar and add an argument downcast=False there (which you will set to True) that can implement this logic for int/float.

@beckermr commented Apr 8, 2017

I see what you mean, but I think this should be a separate issue that can be addressed in another PR. I am guessing the downcasting logic is repeated a lot in the code base? Probably better to look at the other uses and make sure that something centralized makes sense?

@jreback commented Apr 8, 2017

> I see what you mean, but I think this should be a separate issue that can be addressed in another PR. I am guessing the downcasting logic is repeated a lot in the code base? Probably better to look at the other uses and make sure that something centralized makes sense?

no, we don't downcast on scalars anywhere (we do on arrays, but that is elsewhere and much more complicated). This is a nice way to add general functionality. I always want to add code that is more general purpose in nature; if it's implemented multiple times it's wasteful and generally not as well tested.

@beckermr commented Apr 8, 2017

OK I am not a huge fan of making PRs that put a lot of new code in more than one spot if it is not needed, but happy to move the logic there.

@jreback commented Apr 8, 2017

> OK I am not a huge fan of making PRs that put a lot of new code in more than one spot if it is not needed, but happy to move the logic there.

but that's exactly the point. You need to re-use as much infrastructure as possible. I know where most of the bodies are buried; otherwise pandas would never be DRY.

@beckermr commented Apr 8, 2017

OFC - I have a working draft here but still need to refactor a bit. I will get to it later.

    else:
        any_null = False
except TypeError:
    any_null = False
Author:

this block sucks. I am sure pandas has an internal for this. I will try and find it later.

Contributor:

reshaping is a bit of a mess

elif 'float' in str(fill_value_dtype):
    output_dtype = np.float32
else:
    raise ValueError('`fill_value` must be `np.nan`, an int or a float type!')
Author:

A more general version of this block needs to be moved to infer_dtype_from_scalar.

Contributor:

yep all for that! (something like this you can do here / or in another PR).

an alternative is to strip out the dtype inference and do it in another PR (before this one). That makes this much cleaner.

@@ -326,6 +327,25 @@ def test_include_na(self):
dtype=np.uint8)
tm.assert_numpy_array_equal(res_just_na.values, exp_just_na.values)

# Add `fill_value` keyword GH15926
Author:

comment still needs to be moved.

@beckermr commented Apr 8, 2017

Another to do is to add tests for the dtype inference.

@jreback commented Apr 8, 2017

> Another to do is to add tests for the dtype inference.

yep, oftentimes I am fixing one thing, then I decide to generalize. So usually I will pull that logic into a stand-alone PR, then put what I was doing on top. It's much cleaner / more clear. 1 PR, 1 thing.

@beckermr commented Apr 8, 2017

@jreback I took another pass at this. I still need to add docs in the whatsnew for the downcast options. However, a 50% review here would be good to make sure there are not any major issues.

@jreback left a comment

just about right actually, though we want to consolidate this impl with what is already done in to_numeric; I pasted the code here.

I should have pointed you towards that... sorry.

@@ -1075,7 +1077,7 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix.`
dummy_na : bool, default False
-    Add a column to indicate NaNs, if False NaNs are ignored.
+    Add a column to indicate NaNs if True.
Contributor:

add what happens if False

The default behavior of filling with zeros will be deprecated
in the future, and using this default will now raise a
`FutureWarning`.

Contributor:

maybe also some color on result dtypes

@@ -86,6 +86,98 @@ def test_datetime_with_timezone(self):

class TestInferDtype(object):

def test_infer_dtype_from_scalar_downcast_basic(self):
Contributor:

ideally these would be parameterized tests (on the dtype)

@@ -322,8 +325,52 @@ def infer_dtype_from_scalar(val, pandas_dtype=False):
whether to infer dtype including pandas extension types.
If False, scalar belongs to pandas extension types is inferred as
object
downcast : bool, default False
Contributor:

I would use this same signature for downcast here. We worked through this for a while:

Signature: pd.to_numeric(arg, errors='raise', downcast=None)
Docstring:
Convert argument to a numeric type.

Parameters
----------
arg : list, tuple, 1-d array, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
   - If 'raise', then invalid parsing will raise an exception
   - If 'coerce', then invalid parsing will be set as NaN
   - If 'ignore', then invalid parsing will return the input
downcast : {'integer', 'signed', 'unsigned', 'float'} , default None
   If not None, and if the data has been successfully cast to a
   numerical dtype (or if the data was numeric to begin with),
   downcast that resulting data to the smallest numerical dtype
   possible according to the following rules:

   - 'integer' or 'signed': smallest signed int dtype (min.: np.int8)
   - 'unsigned': smallest unsigned int dtype (min.: np.uint8)
   - 'float': smallest float dtype (min.: np.float32)

   As this behaviour is separate from the core conversion to
   numeric values, any errors raised during the downcasting
   will be surfaced regardless of the value of the 'errors' input.

   In addition, downcasting will only occur if the size
   of the resulting data's dtype is strictly larger than
   the dtype it is to be cast to, so if none of the dtypes
   checked satisfy that specification, no downcasting will be
   performed on the data.

"""

def _downcast_dtype(dtype, val):
if 'float' in str(dtype):
if ((val > np.finfo(np.float16).min and
Contributor:

just decide between float32/64. We don't really support float16 anywhere, so we never want to return this

@beckermr commented Apr 8, 2017

Yeah I was going to do this, but then you get weird stuff where you can feed a float16 to the code without downcast and get back float16. Then if you feed a float16 with downcast=True, you get back float32, which is bigger. Are we cool with that?

Contributor:

then simply detect float16 and force to float32

                val < np.finfo(np.float16).max) or val is np.nan):
            return np.float16
        elif (val > np.finfo(np.float32).min and
                val < np.finfo(np.float32).max):
Contributor:

actually, you could probably take this exact code from to_numeric, put it in a function, and call it here (wrapping the scalar), and replace it in to_numeric.

    if downcast is not None and is_numeric_dtype(values):
        typecodes = None

        if downcast in ('integer', 'signed'):
            typecodes = np.typecodes['Integer']
        elif downcast == 'unsigned' and np.min(values) >= 0:
            typecodes = np.typecodes['UnsignedInteger']
        elif downcast == 'float':
            typecodes = np.typecodes['Float']

            # pandas support goes only to np.float32,
            # as float dtypes smaller than that are
            # extremely rare and not well supported
            float_32_char = np.dtype(np.float32).char
            float_32_ind = typecodes.index(float_32_char)
            typecodes = typecodes[float_32_ind:]

        if typecodes is not None:
            # from smallest to largest
            for dtype in typecodes:
                if np.dtype(dtype).itemsize <= values.dtype.itemsize:
                    values = maybe_downcast_to_dtype(values, dtype)

                    # successful conversion
                    if values.dtype == dtype:
                        break

        elif (val > np.finfo(np.float32).min and
                val < np.finfo(np.float32).max):
            return np.float32
        else:
Contributor:

this would replace your _downcast_dtype; though make it another function in pandas.types.cast

call it maybe_downcast_itemsize (not to confuse it with the maybe_downcast_to_dtype, which is similar, but doesn't reduce the itemsize)
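One way such a scalar maybe_downcast_itemsize could reuse the to_numeric machinery is to wrap the value in a one-element array; this is a sketch, not the PR's implementation:

import numpy as np
import pandas as pd

def maybe_downcast_itemsize(val, downcast):
    # downcast is 'integer', 'signed', 'unsigned', or 'float', exactly
    # as in pd.to_numeric; returns the reduced-width dtype for the scalar
    return pd.to_numeric(np.array([val]), downcast=downcast).dtype

maybe_downcast_itemsize(-1, 'integer')  # dtype('int8')
maybe_downcast_itemsize(0.5, 'float')   # dtype('float32')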

@beckermr commented

@jreback Sorry for being MIA. I have been sick. :(

I am going to try and make a final pass at this tomorrow, but then I will be out of contact for a week or so. I don't want to delay any releases, so feel free to push this to v0.21.0 if you need to!

@beckermr commented

@jreback I have taken another pass at this. I moved some of the downcasting logic outside of the type-inference function since it didn't really work there. Basically, the downcasting functions are built to be used with array-like types, and the infer functions are for scalars. In fact, the downcast logic in to_numeric actually ignores scalars altogether. :(

I don't have any more time to spend on this PR, besides some very small edits, so either we merge it or I need to move on. Feel free of course to pick it up yourself and get it merged.

I think it is in a good spot without expanding to basically refactor all of the type checking/changing functionality.

@beckermr commented

Ack. Generalizing this functionality is now introducing test failures in other places. I really think trying to DRY this up is not needed in this case.

@jreback left a comment

@beckermr this looks pretty good actually. I am not sure I will be able to get to this for 0.20.0 but will at some point.

@@ -372,6 +372,8 @@ Other Enhancements
- :func:`MultiIndex.remove_unused_levels` has been added to facilitate :ref:`removing unused levels <advanced.shown_levels>`. (:issue:`15694`)
- ``pd.read_csv()`` will now raise a ``ParserError`` error whenever any parsing error occurs (:issue:`15913`, :issue:`15925`)
- ``pd.read_csv()`` now supports the ``error_bad_lines`` and ``warn_bad_lines`` arguments for the Python parser (:issue:`15925`)
- ``pd.get_dummies()`` now accepts the ``fill_value`` keyword which specifies how to fill NaN values in the dummy variables. (:issue:`15923`)
- ``pd.types.cast`` has a new function ``maybe_downcast_itemsize`` which can be used to reduce the width of numeric types. (:issue:`15923`)
Contributor:

2nd entry is not necessary, this is internal.

@@ -382,6 +384,19 @@ Other Enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Contributor:

needs a ref

Deprecate Automatic Zero Filling of Missing Values in ``pd.get_dummies``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`get_dummies` function currently fills NaN values with zero by default. This behavior is in conflict with the rest of the pandas API since NaN values should be filled with ``fillna`` or a ``fill_value`` keyword, and NaN values should be propagated through pandas transformations. In the future, :func:`get_dummies` will propagate NaN values by default. (:issue:`15923`)
Contributor:

double back-ticks around NaN. Please use line-breaks.

fill_value : scalar, default None
    Value to fill NaNs with. If no missing values are found or NaN is not
    used to fill them, the returned data type will be the smallest
    width type that can represent the returned values. See
Contributor:

don't list internal functions.

isnotfinite = []
for v in vals:
    try:
        isnotfinite.append(~np.isfinite(v))
Contributor:

this is pretty complicated logic and needs to be refactored

return get_empty_Frame(data, sparse)

# Record NaN values before we munge the codes, GH15826
nan_codes_msk = codes == -1
Contributor:

nan_codes_msk -> mask

Contributor:

or can call this check

Contributor:

though in-line this is pretty reasonable; I don't see a reason to create a variable here, it's less readable actually.

Author:

For future notes: you have to cache the NaN values, since anything with code == -1 will get overwritten if you pass dummy_na=True.
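A small illustration of the factorized codes in question:

import numpy as np
import pandas as pd

codes, levels = pd.factorize(pd.Series(['a', 'b', np.nan]))
# codes is array([ 0,  1, -1]): NaN is encoded as -1, so the NaN mask
# must be captured before dummy_na handling remaps -1 to a real column
mask = codes == -1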


# Infer the proper output dtype.
# GH15926
vals = data.values.ravel() if hasattr(data, 'values') else data
Contributor:

using .values is pretty much a no-no, as with mixed dtypes this is completely non-performant. The issue is that we should not have multiple dtypes. This needs more work on this type of checking.

Contributor:

> The issue is that we should not have multiple dtypes.

What do you mean here? data could definitely have multiple dtypes, right?

# GH15926
sp_indices = [None] * len(dummy_cols)
sp_fill = [None] * len(dummy_cols)
for code in np.unique(codes[codes != -1]):
Contributor:

use pd.unique

for code in np.unique(codes[codes != -1]):
    # Non-zero value in the sparse array if the value is of the level,
    # or the value is NaN, it is filled non-zero,
    # and it is not the dummy column for NaNs.
Contributor:

this still feels pretty inefficient, but maybe not an issue in practice.

@jreback self-assigned this Apr 11, 2017
     a    b  NaN
0    1    0    0
1    0    1    0
2  NaN  NaN    1
Member:

Is this the output we want? In case of a dummy_na column, I would expect the row to contain 0/1's, and not NaN/1 ?

Author:

Yes this is the output. If you want zeros, supply fill_value=0. The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels. We of course do not know what label the example is since the label is missing.

Author:

There is also the added benefit of making the handling of missing values and filling them consistent across pandas so it presents a consistent API to users.

Member:

> The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels.

Yes, but you ask for a dummy column, which means: treat NaNs as a separate value, and thus make a separate column for it.
In a dummies output, you can only have one 1 in one of the columns, so this means the others have to be 0.

> If you want zeros, supply fill_value=0

Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.

Contributor:

> Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.

Agreed. I'm struggling to think of a reason you would ever want fill_value=np.nan, dummy_na=True... But I don't think we should raise an exception in this case, if the user specifies those arguments.

Author:

Just because you can't think of a reason doesn't mean it doesn't exist. People could fill in a prior for missing values and still want to mark examples which are missing.

The point here is to be principled and consistent in terms of the pandas API while also not making assumptions about other people's data.

Member:

The consistency here is not a good argument in my opinion. With the default (which we are changing here), the NaNs will be preserved, consistent with how this is typically dealt with in pandas.
But by supplying dummy_na=True, the user explicitly wants non-typical behaviour, i.e. creating a column for NaN values. In that case the consistency argument is a bit moot in my opinion, and we should do what is sensible for this specific use-case, which is IMO returning only 0/1's.

When using the dummy_na column, this means that you are no longer treating NaN as a 'missing value' (in the sense of 'I don't know which one of the other values it is'), but as one of the valid values. The consequence is IMO that the other columns should be 0.

> People could fill in a prior for missing values and still want to mark examples which are missing.

I don't understand this one.

> But I don't think we should raise an exception in this case, if the user specifies those arguments.

Agreed. We could document that the fill_value argument is ignored when using dummy_na=True?

Author:

I explained an example of this in an earlier thread. Suppose you know you are missing labels at random and you have a statistically representative data set. Then filling in missing labels with the fraction of each label from the data you have is a good assumption. You might in this case still want to mark which examples had missing labels too.

The point I have made many times is that the fewer assumptions pandas makes about the intentions of users and their data, the better. The pandas user base is broad and covers many academic disciplines, from the social sciences to astronomy, plus a broad range of tech and industry domains. Assumptions built into pandas from one domain may not make sense for the others. The goal should be to provide general, configurable tooling to work with data based on a consistent API with idiomatic syntax. Not filling missing values with zeros by default and having people supply a fill_value keyword meets these criteria.
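A sketch of that fraction-filling workflow under the proposed NaN-propagating default (emulated by masking, since this PR was never merged):

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a', np.nan])

# propagate NaN through the encoding (the proposed default behavior)
d = pd.get_dummies(s).astype('float64')
d[s.isna()] = np.nan

# fill the missing rows with each label's observed fraction
# (a = 2/3, b = 1/3); a separate indicator can still mark missing rows
d = d.fillna(d.mean())
was_missing = s.isna().astype(int)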

@jreback commented Aug 1, 2017

@beckermr this was a nice idea. Closing, but if you want to rebase, please comment to re-open and we can revisit.

@jreback closed this Aug 1, 2017
Labels: Enhancement, Missing-data, Reshaping
Development

Successfully merging this pull request may close these issues.

BUG pd.get_dummies does not propagate NaN correctly
4 participants