ENH add fill_value feature to pd.get_dummies #15926

Closed · wants to merge 9 commits

Conversation

@beckermr commented Apr 6, 2017

Closes #15923

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15926 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #15926      +/-   ##
==========================================
+ Coverage   90.96%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49557    49573      +16     
==========================================
+ Hits        45081    45097      +16     
  Misses       4476     4476
Flag        Coverage Δ
#multiple   88.73% <100%> (ø) ⬆️
#single     40.6% <0%> (-0.02%) ⬇️

Impacted Files           Coverage Δ
pandas/core/reshape.py   99.3% <100%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codecov bot commented Apr 6, 2017

Codecov Report

Merging #15926 into master will increase coverage by <.01%.
The diff coverage is 92.18%.


@@            Coverage Diff             @@
##           master   #15926      +/-   ##
==========================================
+ Coverage   90.98%   90.99%   +<.01%     
==========================================
  Files         145      145              
  Lines       49556    49618      +62     
==========================================
+ Hits        45090    45150      +60     
- Misses       4466     4468       +2
Flag        Coverage Δ
#multiple   88.76% <92.18%> (+0.01%) ⬆️
#single     40.53% <9.37%> (-0.04%) ⬇️

Impacted Files            Coverage Δ
pandas/core/reshape.py    99.3% <100%> (+0.02%) ⬆️
pandas/types/cast.py      85.5% <81.48%> (-0.24%) ⬇️
pandas/indexes/multi.py   96.7% <0%> (ø) ⬆️
pandas/io/parsers.py      95.65% <0%> (+0.01%) ⬆️
pandas/core/common.py     91.03% <0%> (+0.34%) ⬆️
pandas/types/common.py    94.48% <0%> (+0.83%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TomAugspurger left a comment

A couple of comments, mostly around dtypes.

I'm a bit hesitant to add this, now that I see the issues around the argument forcing float output :/ We're trying to avoid new APIs that will "break" when we fix NaN behavior in pandas 2.

expected = expected[['C', 'A_b', 'A_nan', 'B_c', 'B_nan']]
assert_frame_equal(result, expected)

self.df = DataFrame({'A': ['a', 'b', 'a'],
Contributor:

This should just be df = ... correct? Not self.df

Author:

Whoops. :(

res_na = get_dummies(s, dummy_na='keep', sparse=self.sparse)
exp_na = DataFrame({'a': {0: 1, 1: 0, 2: np.nan},
'b': {0: 0, 1: 1, 2: np.nan}},
dtype=np.float16)
Contributor:

@jreback have you added a pytest.mark for marking stuff that will change in pandas2? these "should" have int dtype in pandas 2.

Contributor:

not yet. don't use float16 though

Author:

What type then? Float32? I was trying to save space but the dynamic range of a uint8 might require a larger float?

Contributor:

yeah, float32 is the best we can do for now. (And FYI, that's another reason to drop the NaN's: then you can use a small int type.)
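A small sketch of that suggestion: dropping the NaNs before encoding leaves nothing to propagate, so the compact integer output is kept (uint8 in the pandas of this era; recent versions return bool).

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# with the NaNs dropped first there is nothing to propagate, so no
# float upcast is needed and the dummy columns stay small integers
dummies = pd.get_dummies(s.dropna())
print(dummies.dtypes)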

Author:

Yep. Sparse support helps with this though.

@@ -1280,7 +1320,10 @@ def get_empty_Frame(data, sparse):
 else:
     dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=0)

-if not dummy_na:
+if dummy_na == 'keep':
+    dummy_mat = dummy_mat.astype(np.float16)
Contributor:

If I have a DataFrame like

df = pd.DataFrame({"A": ['a', 'b', np.nan], 'B': [1, 2, 3]})

What is the resulting .dtypes? Does it force all columns to float16, even if there wasn't a NaN in that column (e.g. 'B')?

Also, this will force it to float16, even if there were no NaNs to begin with.

Author:

Yes, that's right on both counts. It is pretty straightforward to add more careful type handling if we think it is worth it. To be honest, it is not clear what should be done with types.

Should we allow multiple different types for each column being dummied?

Should we return float16 if any column has a nan and uint8 otherwise?

There is not a clear right answer afaict.

Contributor:

you can make them float32 I guess. float16 is barely supported.

@beckermr commented Apr 6, 2017

I get the hesitation. Let me know your decision before we do a lot of work.

@beckermr commented Apr 6, 2017

Alright. Everything but what the option should be called is fixed, and hopefully the linting too. :/

dummy_na : bool or 'keep', default False
    If True, add a column to indicate NaNs; otherwise, if False,
    NaNs are ignored. If set to 'keep', NaNs will be propagated
    into the final result.
Contributor:

versionadded tag

Author:

Done.

@@ -1126,6 +1128,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> s1 = ['a', 'b', np.nan]
Contributor:

can you add examples covering the bases with dummy_na

Author:

Not sure what you mean? See below.
[screenshot: the dummy_na examples in the updated docstring]
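For readers without the image, the docstring examples referred to here likely resembled the following reconstruction (output shown as in pandas 0.x, which returned uint8 columns):

>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0
>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1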

@@ -1126,6 +1130,14 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> s1 = ['a', 'b', np.nan]
Author:

I can remove this duplicated line maybe?

@jreback changed the title from "ENH add dummy_na='keep feature to pd.get_dummies" to "ENH add dummy_na='keep' feature to pd.get_dummies" Apr 7, 2017
@jreback added the Enhancement, Missing-data, and Reshaping labels Apr 7, 2017
@@ -366,6 +366,7 @@ Other Enhancements
- ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
- ``pandas.io.json.json_normalize()`` has gained a ``sep`` option that accepts ``str`` to separate joined fields; the default is ".", which is backward compatible. (:issue:`14883`)
- ``pd.read_csv()`` will now raise a ``csv.Error`` error whenever an end-of-file character is encountered in the middle of a data row (:issue:`15913`)
- `pd.get_dummies()` now accepts the `dummy_na='keep'` keyword which propagates missing values into the dummy columns instead of setting them to zero. (:issue:`15923`)
Contributor:

use double back-ticks around pd.get_dummies() and dummy_na='keep'

False NaNs are ignored. If set to 'keep', NaNs will be propagated
into the final result.

.. versionadded:: 0.20.0
columns : list-like, default None
Contributor:

need a blank line after this (or sphinx complains)

Author:

I am confused. Here are the other docs in this function.

[screenshot: the existing parameter docs in get_dummies]

You can see I followed what was done in the other cases. Do these other ones need to be changed too? How did the docs compile at all?

@@ -1126,6 +1130,12 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
1 0 1 0
2 0 0 1

>>> pd.get_dummies(s1, dummy_na='keep')
Contributor:

can you add examples with the other dummy_na (same s1)

@beckermr commented Apr 7, 2017

These are already there as I stated above. See this screen shot.

[screenshot: the dummy_na examples in the docstring]

return get_empty_Frame(data, sparse)

number_of_cols = len(levels)

# If all NaN values are passed, then we add a NaN column
# (and put NaN in it below).
if dummy_na == 'keep' and number_of_cols == 0:
Contributor:

have a test for this?

sparse_index=IntIndex(N, ixs), fill_value=0,
dtype=np.float32)
else:
sarr = SparseArray(
Contributor:

can you pull the SparseArray creation out of the if/then? You can simply pass in an already-created sarr (of the correct dtype); you don't actually need to pass the dtype.

'A_a': [1, 0, 1, np.nan],
'A_b': [0, 1, 0, np.nan],
'B_b': [1, 1, 0, np.nan],
'B_c': [0, 0, 1, np.nan]})
Contributor:

just pass the columns=cols on creation

Contributor:

you can pass dtype=... as well

@@ -551,11 +576,23 @@ def test_basic_drop_first_NA(self):
['b', nan], 1)
assert_frame_equal(res_na, exp_na)

res_na = get_dummies(s_NA, dummy_na='keep', sparse=self.sparse,
drop_first=True)
exp_na = DataFrame({'b': {0: 0,
Contributor:

this is awkward construction; don't use a dict-of-dicts, a plain dict-of-lists is much more natural.

res_just_na = get_dummies([nan], dummy_na=True, sparse=self.sparse,
drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
assert_frame_equal(res_just_na, exp_just_na)

res_just_na = get_dummies([nan], dummy_na='keep', sparse=self.sparse,
drop_first=True)
exp_just_na = DataFrame(index=np.arange(1))
Contributor:

don't use bare numpy arrays like this; use index=range(1) instead

'A_b': [0, 1, 0, np.nan],
'B_c': [0, 0, 1, np.nan]})
cols = ['A_b', 'B_c']
expected[cols] = expected[cols].astype(np.float32)
Contributor:

same as above

'A_nan': [0, 0, 0, 1],
'B_c': [0, 0, 1, 0],
'B_nan': [0, 0, 0, 1]})
cols = ['A_b', 'A_nan', 'B_c', 'B_nan']
Contributor:

same as above

@beckermr commented Apr 7, 2017

@jreback I'm happy to address these comments. Almost all of this testing code was copied from other parts of the test suite. Should all of that other code be cleaned up as well?

Also, are we set on the interface re: the other discussion on the issue thread?

@jreback commented Apr 7, 2017

> @jreback I'm happy to address these comments. Almost all of this testing code was copied from other parts of the test suite. Should all of that other code be cleaned up as well?

sure, if you'd like. I haven't looked at all the code (haha), so for sure some non-conforming code exists. (Things like cleaning up other code are usually better in another PR though.)

@jreback commented Apr 7, 2017

I am ok with the API. @TomAugspurger @jorisvandenbossche

@TomAugspurger commented Apr 7, 2017 via email

@beckermr commented Apr 7, 2017

Yeah I think this is where the rub is and 100% agree! Right now the code is filling missing values by default using something like fillna=0.

Here are the main considerations compiled in one spot:

  1. I searched the rest of the code base and found no examples of a fillna keyword in any public API. In fact, pandas already provides a centralized way to fill missing values, so adding something that is different here doesn't make sense. There should be one, and hopefully only one, way to do things.
  2. There is also the further value checking that would need to be done with a fillna keyword.
  3. As I stated in the other issue, dropna=True is weird since we are not dropping rows. In the other parts of pandas public API, I have only seen this keyword used in reduction-like operations (i.e., summing, taking an average, etc.) where you want the operation to ignore missing values. This usage would deviate from those parts of the API.
  4. A further reason dropna is weird is that pandas also already provides a public API to drop data, namely the dropna method. Again, there should be one and only one way to do things.

So I think the best solution is to not fill missing values by default at all and let people use the already provided public API methods to do this. This keeps the API purely idiomatic. Further it would let people do things like drop rows with missing values before or after they apply the get_dummies function, making processing with these functions more robust to changes in the order of operations, more idempotent, etc.

Thus my suggested API is

  1. Don't fill any NaN's in any situation.
  2. The dummy_na keyword just adds the extra column with a flag if there are any NaNs for each row.

The current API with the PR as it stands now (i.e., with the dummy_na='keep' option) does this

  1. By default (dummy_na=False), all NaNs are set to zero.
  2. If dummy_na=True, fill NaNs with zero and add the column of missing value flags.
  3. If dummy_na='keep', don't add an extra missing value column and don't fill missing values with zero.

IMO, the simpler solution of just not filling missing values with zero by default makes the code easier to reason about, easier to maintain since there is not a maze of options, and fundamentally more consistent with the other pandas API conventions.

OFC it is possible to go with the second solution above, but this adds more complication to the code, deviates from other parts of the pandas API, and leaves the code harder to reason about.
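A sketch contrasting the three behaviors just listed; note that dummy_na='keep' existed only in this PR and was never merged, so the last case is emulated with a mask rather than a real keyword:

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# dummy_na=False (default): the NaN row becomes all zeros
d_default = pd.get_dummies(s)

# dummy_na=True: adds a NaN indicator column; the other cells stay zero
d_flagged = pd.get_dummies(s, dummy_na=True)

# dummy_na='keep' (proposed here, never merged): propagate NaN instead
# of zero-filling, emulated by masking rows where the input is missing
d_keep = pd.get_dummies(s).astype('float64')
d_keep[s.isna()] = np.nan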

@jreback commented Apr 7, 2017

@beckermr fillna is a function name, the one you are looking for is fill_value which is pretty common. e.g. see .unstack or .reindex

so that actually might be a nice alternative here, though it would have to be fill_value=0 to preserve the existing behavior.

@TomAugspurger commented Apr 7, 2017 via email

@beckermr commented Apr 7, 2017

Ahhhhh @jreback. Thanks!

Are there any other examples where fill_value is set to something besides NaN or None?

I searched the code base and I cannot find any. So using fill_value=0 would also deviate from the API of (the rest? most of the rest?) of the code.

We keep coming up with alternative solutions that generate their own conflicts with other parts of the pandas API. I think this all points to the fact that filling with zero by default is fundamentally in conflict with the rest of the API and so should be deprecated.

@jreback commented Apr 7, 2017

fill_value=np.nan is generally the default.

So this does deviate from the API (as you have noted) 😄

The question is whether it is worth breaking this (and it would be a big break, I think).

as @jorisvandenbossche noted, this is generally a 'final' function, so filling implicitly is nice.
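For reference, the fill_value convention jreback mentions, as seen in .reindex:

import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

s.reindex(['a', 'b', 'c'])                # new label becomes NaN (the usual default)
s.reindex(['a', 'b', 'c'], fill_value=0)  # explicit fill with 0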

@beckermr commented Apr 7, 2017

See my response to this point:

The current API makes too many assumptions. Right now it asserts that if a label is missing, it is NONE of the labels you have. This cannot possibly be true in all cases.

Instead, if NaNs were always propagated by default (even if an additional missing label column is added) then users would be required to reason about why their labels are missing and how they should fill nulls.

For example, if labels are missing at random and the rest of the data is representative, then filling the missing one hot encoded rows with the fraction of each label would be a pretty good way to fill nulls.

Of course pandas should not do this either since it should not be making assumptions about the input data.

Basically, pandas cannot promise that filling zeros is the right thing to do. Making as few assumptions as possible about people's input data and letting them do things like

data_final = pd.get_dummies(data).fillna(0)

is the preferred solution since it is explicit, idiomatic with respect to the rest of the API, and respectful to the user to give them the option to do what they like.

Breaking this basic part of the API for just this one function, for this one case, using an assumption about the input data that we cannot possibly ensure is always true, seems like an easy choice to me: don't break the API. Filling and dropping missing values is a core use of pandas, and this bit of the API should be as pure as possible.

Further, there are major version changes for a reason. Pandas regularly pushes backwards-incompatible changes in an effort to improve the API and make the code better. My suggested solution is very much in line with this ongoing work.

@TomAugspurger commented Apr 7, 2017

I'm +1 on adding a fill_value with the default of 0.
I'm -1 on outright breaking API, by having the default be None or np.nan (not sure if this is being proposed or not).
I'm on the fence about adding fill_value=0 with a deprecation warning that it will change to None in the future, probably -0 at this point.

Fortunately, we can defer discussion on what the default should be in the future.
Adding fill_value=0 is backwards compatible, and lets us start a deprecation cycle if we decide to.

@beckermr just in case it's not clear, I think your argument is completely valid. I am just on the fence about whether it's worth deprecating and changing in the future. While pandas has been aggressive about changing APIs, we're trying to become more stable. Are you OK with implementing fill_value=0 for now, and kicking the deprecation stuff down the road?

@beckermr commented Apr 7, 2017

I see your point of course, but I think the purity of the API on this core function is very important.

I suspect that as the API is cleaned up, there will be less to change. In other words, once it is all internally consistent with clearly correct choices, no more of these changes that break stuff will need to be done in the first place.

The other thing to be said is that supporting old features which have turned out to be problematic indefinitely is a bad practice too since it hampers the development of the code base as a whole.

I do get being queasy about making these changes. However, I don't think people will realistically ever move away from pandas unless they move away from Python too. Given this, it is better to fix stuff sooner rather than later so that pandas can further mature into the core tool it already is.

I am happy with adding fill_value=0. I think we should also announce the deprecation of the zero default in favor of a NaN default now. This will kick off the deprecation cycle and fix this as soon as realistically possible without really breaking stuff, which I am not in favor of.

@jreback commented Apr 7, 2017

@beckermr so just to be clear, we would leave dummy_na (as is) and add fill_value=None (with a deprecation warning), where it would still fill with 0?

ok with deprecation actually (though could be just a DeprecationWarning), rather than a full-fledged FutureWarning.

@beckermr commented Apr 7, 2017

Yes, this is close to my proposed solution. I would

  1. Leave dummy_na as is.
  2. Add fill_value=0. (It is weird to me for it to be set to None but for us to fill with zero anyway, so just be explicit and change the keyword later. Edit: You might have meant this. It was not clear to me. Sorry!)
  3. Add either a DeprecationWarning or FutureWarning saying that fill_value will default to NaN or None in the future.
  4. At some point, change the default and remove the warning.

I do not know the difference in how DeprecationWarning and FutureWarning are used so advice on this is much appreciated.

@jreback commented Apr 7, 2017

yes you set fill_value=None to signal that someone hasn't passed anything, so you know to warn.
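A minimal sketch of that sentinel pattern (helper name and message are hypothetical, not the PR's actual code):

import warnings

def _resolve_fill_value(fill_value=None):
    # None is the sentinel for "user passed nothing": warn about the
    # upcoming change of default, then keep the historical 0-fill
    if fill_value is None:
        warnings.warn(
            "the default of filling NaNs with 0 is deprecated; pass "
            "fill_value=0 explicitly to keep the current behavior",
            FutureWarning, stacklevel=2)
        fill_value = 0
    return fill_value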

@beckermr commented Apr 7, 2017

Ohhhh ok @jreback. Sounds good.

@beckermr commented Apr 7, 2017

@jreback @TomAugspurger Are we all in agreement then?

I am happy to make the changes OFC.

@jreback commented Apr 8, 2017

> Cool. For signed ints (i.e., fill_value=-1), should I use int8 or int32?

good point, int8 is fine

@jreback commented Apr 8, 2017

to be extra defensive you can check whether fill_value is in the range of np.iinfo(np.int8).max (and .min); otherwise you can make it an int64 (annoying, but please test as well).
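A sketch of that range check (helper name hypothetical):

import numpy as np

def _int_fill_dtype(fill_value):
    # use int8 when the scalar fits its range, otherwise fall back to int64
    info = np.iinfo(np.int8)
    if info.min <= fill_value <= info.max:
        return np.int8
    return np.int64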

@beckermr commented Apr 8, 2017

Sounds good. A simple if block is not terribly difficult. :)

@jreback commented Apr 8, 2017

maybe better is to use pandas.types.cast.infer_dtype_from_scalar and add an argument downcast=False there (which you will set to True) that can implement this logic for int/float.

@beckermr commented Apr 8, 2017

I see what you mean, but I think this should be a separate issue that can be addressed in another PR. I am guessing the downcasting logic is repeated a lot in the code base? Probably better to look at the other uses and make sure that something centralized makes sense?

@jreback commented Apr 8, 2017

> I see what you mean, but I think this should be a separate issue that can be addressed in another PR. I am guessing the downcasting logic is repeated a lot in the code base? Probably better to look at the other uses and make sure that something centralized makes sense?

no, we don't downcast on scalars anywhere (we do on arrays, but that is elsewhere and much more complicated). This is a nice way to add general functionality. I always want to add code that is more general purpose in nature; if it's implemented multiple times it's wasteful and generally not as well tested.

@beckermr commented Apr 8, 2017

OK I am not a huge fan of making PRs that put a lot of new code in more than one spot if it is not needed, but happy to move the logic there.

@jreback commented Apr 8, 2017

> OK I am not a huge fan of making PRs that put a lot of new code in more than one spot if it is not needed, but happy to move the logic there.

but that's exactly the point. You need to re-use as much infrastructure as possible. I know where most of the bodies are buried; otherwise pandas would never be DRY.

@beckermr commented Apr 8, 2017

OFC - I have a working draft here but still need to refactor a bit. I will get to it later.

    else:
        any_null = False
except TypeError:
    any_null = False
Author:

this block sucks. I am sure pandas has an internal for this. I will try and find it later.

Contributor:

reshaping is a bit of a mess

elif 'float' in str(fill_value_dtype):
    output_dtype = np.float32
else:
    raise ValueError('`fill_value` must be `np.nan`, an int or a float type!')
Author:

A more general version of this block needs to be moved to infer_dtype_from_scalar.

Contributor:

yep all for that! (something like this you can do here / or in another PR).

an alternative is to strip out the dtype inference and do it in another PR (before this one). That makes this much cleaner.

@@ -326,6 +327,25 @@ def test_include_na(self):
dtype=np.uint8)
tm.assert_numpy_array_equal(res_just_na.values, exp_just_na.values)

# Add `fill_value` keyword GH15926
Author:

comment still needs to be moved.

@beckermr commented Apr 8, 2017

Another to do is to add tests for the dtype inference.

@jreback commented Apr 8, 2017

> Another to do is to add tests for the dtype inference.

yep, oftentimes I am fixing one thing, then I decide to generalize. So usually I will pull that logic into a stand-alone PR, then put what I was doing on top. It's much cleaner / more clear. 1 PR, 1 thing.

@beckermr commented Apr 8, 2017

@jreback I took another pass at this. I still need to add docs in the whatsnew for the downcast options. However, a 50% review here would be good to make sure there are not any major issues.

@jreback left a comment

just about right actually, though we want to consolidate this impl with what is already done in to_numeric; I pasted the code here.

I should have pointed you towards that... sorry.

@@ -1075,7 +1077,7 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix.`
dummy_na : bool, default False
-    Add a column to indicate NaNs, if False NaNs are ignored.
+    Add a column to indicate NaNs if True.
Contributor:

add what happens if False

The default behavior of filling with zeros will be deprecated
in the future, and using this default will now raise a
`FutureWarning`.

Contributor:

maybe also some color on result dtypes

@@ -86,6 +86,98 @@ def test_datetime_with_timezone(self):

class TestInferDtype(object):

def test_infer_dtype_from_scalar_downcast_basic(self):
Contributor:

ideally these would be parameterized tests (on the dtype)

@@ -322,8 +325,52 @@ def infer_dtype_from_scalar(val, pandas_dtype=False):
whether to infer dtype including pandas extension types.
If False, scalar belongs to pandas extension types is inferred as
object
downcast : bool, default False
Contributor:

I would use this same signature for downcast here. We worked through this for a while:

Signature: pd.to_numeric(arg, errors='raise', downcast=None)
Docstring:
Convert argument to a numeric type.

Parameters
----------
arg : list, tuple, 1-d array, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
   - If 'raise', then invalid parsing will raise an exception
   - If 'coerce', then invalid parsing will be set as NaN
   - If 'ignore', then invalid parsing will return the input
downcast : {'integer', 'signed', 'unsigned', 'float'} , default None
   If not None, and if the data has been successfully cast to a
   numerical dtype (or if the data was numeric to begin with),
   downcast that resulting data to the smallest numerical dtype
   possible according to the following rules:

   - 'integer' or 'signed': smallest signed int dtype (min.: np.int8)
   - 'unsigned': smallest unsigned int dtype (min.: np.uint8)
   - 'float': smallest float dtype (min.: np.float32)

   As this behaviour is separate from the core conversion to
   numeric values, any errors raised during the downcasting
   will be surfaced regardless of the value of the 'errors' input.

   In addition, downcasting will only occur if the size
   of the resulting data's dtype is strictly larger than
   the dtype it is to be cast to, so if none of the dtypes
   checked satisfy that specification, no downcasting will be
   performed on the data.

"""

def _downcast_dtype(dtype, val):
if 'float' in str(dtype):
if ((val > np.finfo(np.float16).min and
Contributor:

just decide between float32/64. We don't really support float16 anywhere, so we never want to return this

@beckermr commented Apr 8, 2017

Yeah I was going to do this, but then you get weird stuff where you can feed a float16 to the code without downcast and get back float16. Then if you feed a float16 with downcast=True, you get back float32, which is bigger. Are we cool with that?

Contributor:

then simply detect float16 and force to float32

                val < np.finfo(np.float16).max) or val is np.nan):
            return np.float16
        elif (val > np.finfo(np.float32).min and
                val < np.finfo(np.float32).max):
Contributor:

actually, you could probably take this exact code from to_numeric, put it in a function, and call it here (wrapping the scalar), and replace it in to_numeric.

    if downcast is not None and is_numeric_dtype(values):
        typecodes = None

        if downcast in ('integer', 'signed'):
            typecodes = np.typecodes['Integer']
        elif downcast == 'unsigned' and np.min(values) >= 0:
            typecodes = np.typecodes['UnsignedInteger']
        elif downcast == 'float':
            typecodes = np.typecodes['Float']

            # pandas support goes only to np.float32,
            # as float dtypes smaller than that are
            # extremely rare and not well supported
            float_32_char = np.dtype(np.float32).char
            float_32_ind = typecodes.index(float_32_char)
            typecodes = typecodes[float_32_ind:]

        if typecodes is not None:
            # from smallest to largest
            for dtype in typecodes:
                if np.dtype(dtype).itemsize <= values.dtype.itemsize:
                    values = maybe_downcast_to_dtype(values, dtype)

                    # successful conversion
                    if values.dtype == dtype:
                        break

        elif (val > np.finfo(np.float32).min and
                val < np.finfo(np.float32).max):
            return np.float32
        else:
Contributor:

this would replace your _downcast_dtype; though make it another function in pandas.types.cast

call it maybe_downcast_itemsize (not to confuse it with the maybe_downcast_to_dtype, which is similar, but doesn't reduce the itemsize)
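One way such a scalar maybe_downcast_itemsize could reuse the to_numeric machinery is to wrap the value in a one-element array; this is a sketch, not the PR's implementation:

import numpy as np
import pandas as pd

def maybe_downcast_itemsize(val, downcast):
    # downcast is 'integer', 'signed', 'unsigned', or 'float', exactly
    # as in pd.to_numeric; returns the reduced-width dtype for the scalar
    return pd.to_numeric(np.array([val]), downcast=downcast).dtype

maybe_downcast_itemsize(-1, 'integer')  # dtype('int8')
maybe_downcast_itemsize(0.5, 'float')   # dtype('float32')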

@beckermr commented

@jreback Sorry for being MIA. I have been sick. :(

I am going to try and make a final pass at this tomorrow, but then I will be out of contact for a week or so. I don't want to delay any releases, so feel free to push this to v0.21.0 if you need to!

@beckermr commented

@jreback I have taken another pass at this. I moved some of the downcasting logic outside of the type-inference function since it didn't really work there. Basically, the downcasting functions are built to be used with array-like types, and the infer functions are for scalars. In fact, the downcast logic in to_numeric actually ignores scalars altogether. :(

I don't have any more time to spend on this PR, besides some very small edits, so either we merge it or I need to move on. Feel free of course to pick it up yourself and get it merged.

I think it is in a good spot without expanding to basically refactor all of the type checking/changing functionality.

@beckermr commented

Ack. Generalizing this functionality is now introducing test failures in other places. I really think trying to DRY this up is not needed in this case.

@jreback left a comment

@beckermr this looks pretty good actually. I am not sure I will be able to get to this for 0.20.0 but will at some point.

@@ -372,6 +372,8 @@ Other Enhancements
- :func:`MultiIndex.remove_unused_levels` has been added to facilitate :ref:`removing unused levels <advanced.shown_levels>`. (:issue:`15694`)
- ``pd.read_csv()`` will now raise a ``ParserError`` error whenever any parsing error occurs (:issue:`15913`, :issue:`15925`)
- ``pd.read_csv()`` now supports the ``error_bad_lines`` and ``warn_bad_lines`` arguments for the Python parser (:issue:`15925`)
- ``pd.get_dummies()`` now accepts the ``fill_value`` keyword which specifies how to fill NaN values in the dummy variables. (:issue:`15923`)
- ``pd.types.cast`` has a new function ``maybe_downcast_itemsize`` which can be used to reduce the width of numeric types. (:issue:`15923`)
Contributor:

2nd entry is not necessary, this is internal.

@@ -382,6 +384,19 @@ Other Enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Contributor:

needs a ref

Deprecate Automatic Zero Filling of Missing Values in ``pd.get_dummies``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`get_dummies` function currently fills NaN values with zero by default. This behavior is in conflict with the rest of the pandas API since NaN values should be filled with ``fillna`` or a ``fill_value`` keyword, and NaN values should be propagated through pandas transformations. In the future, :func:`get_dummies` will propagate NaN values by default. (:issue:`15923`)
Contributor:

double back-ticks around NaN. Please use line-breaks.

fill_value : scalar, default None
    Value to fill NaNs with. If no missing values are found or NaN is not
    used to fill them, the returned data type will be the smallest
    width type that can represent the returned values. See
Contributor:

don't list internal functions.

isnotfinite = []
for v in vals:
    try:
        isnotfinite.append(~np.isfinite(v))
Contributor:

this is pretty complicated logic and needs to be refactored

return get_empty_Frame(data, sparse)

# Record NaN values before we munge the codes, GH15826
nan_codes_msk = codes == -1
Contributor:

nan_codes_msk -> mask

Contributor:

or can call this check

Contributor:

though in-line this is pretty reasonable; I don't see a reason to create a variable here, it's less readable actually.

Author:

For future notes: you have to cache the NaN values, since anything with code == -1 will get overwritten if you pass dummy_na=True.
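A small illustration of the factorized codes in question:

import numpy as np
import pandas as pd

codes, levels = pd.factorize(pd.Series(['a', 'b', np.nan]))
# codes is array([ 0,  1, -1]): NaN is encoded as -1, so the NaN mask
# must be captured before dummy_na handling remaps -1 to a real column
mask = codes == -1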


# Infer the proper output dtype.
# GH15926
vals = data.values.ravel() if hasattr(data, 'values') else data
Contributor:

using .values is pretty much a no-no, as with mixed dtypes this is completely non-performant. The issue is that we should not have multiple dtypes. This needs more work on this type of checking.

Contributor:

> The issue is that we should not have multiple dtypes.

What do you mean here? data could definitely have multiple dtypes, right?

# GH15926
sp_indices = [None] * len(dummy_cols)
sp_fill = [None] * len(dummy_cols)
for code in np.unique(codes[codes != -1]):
Contributor:

use pd.unique

for code in np.unique(codes[codes != -1]):
    # Non-zero value in the sparse array if the value is of the level,
    # or the value is NaN, it is filled non-zero,
    # and it is not the dummy column for NaNs.
Contributor:

this still feels pretty inefficient, but maybe not an issue in practice.

@jreback self-assigned this Apr 11, 2017
     a    b  NaN
0    1    0    0
1    0    1    0
2  NaN  NaN    1
Member:

Is this the output we want? In case of a dummy_na column, I would expect the row to contain 0/1's, and not NaN/1 ?

Author:

Yes this is the output. If you want zeros, supply fill_value=0. The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels. We of course do not know what label the example is since the label is missing.

Author:

There is also the added benefit of making the handling of missing values and filling them consistent across pandas so it presents a consistent API to users.

Member:

> The point here is that a NaN means we do not know what label it is. Putting a zero in the missing spots would indicate that the example is NOT one of those labels.

Yes, but you ask for a dummy column, which means: treat NaNs as a separate value, and thus make a separate column for it.
In a dummies output, you can only have one 1 in one of the columns, so this means the others have to be 0.

> If you want zeros, supply fill_value=0

Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.

Contributor:

> Yes, you can do that, but I think fill_value=NaN is a poor default for the case of dummy_na=True.

Agreed. I'm struggling to think of a reason you would ever want fill_value=np.nan, dummy_na=True... But I don't think we should raise an exception in this case, if the user specifies those arguments.

Author:

Just because you can't think of a reason doesn't mean it doesn't exist. People could fill in a prior for missing values and still want to mark examples which are missing.

The point here is to be principled and consistent in terms of the pandas API while also not making assumptions about other people's data.

Member:

The consistency here is not a good argument in my opinion. With the default (which we are changing here), the NaNs will be preserved, consistent with how this is typically dealt with in pandas.
But by supplying dummy_na=True, the user explicitly wants non-typical behaviour, i.e. creating a column for NaN values. In that case the consistency argument is a bit moot in my opinion, and we should do what is sensible for this specific use-case, which is IMO returning only 0/1's.

When using the dummy_na column, this means that you are no longer treating NaN as a 'missing value' (in the sense of 'I don't know which one of the other values it is'), but as one of the valid values. The consequence is IMO that the other columns should be 0.

> People could fill in a prior for missing values and still want to mark examples which are missing.

I don't understand this one.

> But I don't think we should raise an exception in this case, if the user specifies those arguments.

Agreed. We could document that the fill_value argument is ignored when using dummy_na=True?

Author:

I explained an example of this in an earlier thread. Suppose you know you are missing labels at random and you have a statistically representative data set. Then filling in missing labels with the fraction of each label from the data you have is a good assumption. You might in this case still want to mark which examples had missing labels too.

The point I have made many times is that the fewer assumptions pandas makes about the intentions of users and their data, the better. The pandas user base is broad and covers many academic disciplines, from the social sciences to astronomy, plus a broad range of tech and industry domains. Assumptions built into pandas from one domain may not make sense for the others. The goal should be to provide general, configurable tooling to work with data based on a consistent API with idiomatic syntax. Not filling missing values with zeros by default and having people supply a fill_value keyword meets these criteria.
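A sketch of that fraction-filling workflow under the proposed NaN-propagating default (emulated by masking, since this PR was never merged):

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a', np.nan])

# propagate NaN through the encoding (the proposed default behavior)
d = pd.get_dummies(s).astype('float64')
d[s.isna()] = np.nan

# fill the missing rows with each label's observed fraction
# (a = 2/3, b = 1/3); a separate indicator can still mark missing rows
d = d.fillna(d.mean())
was_missing = s.isna().astype(int)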

@jreback commented Aug 1, 2017

@beckermr this was a nice idea. Closing, but if you want to rebase, please comment to re-open and we can revisit.

@jreback closed this Aug 1, 2017
Labels: Enhancement, Missing-data, Reshaping
Development

Successfully merging this pull request may close these issues.

BUG pd.get_dummies does not propagate NaN correctly
4 participants