Default value for missing values on merge #20007

Closed

dniku opened this issue Mar 5, 2018 · 16 comments
Labels

Dtype Conversions (Unexpected or buggy dtype conversions) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

dniku commented Mar 5, 2018

This is a reopening of #1836. The suggestion there was to add a parameter to pd.merge, such as fillvalue, whose value would be used instead of NaN for missing values. This isn't simply solved by fillna since adding NaN to columns casts them to float.

#1836 also asked to provide an example where this would be useful. Admittedly, in my case there might be a simpler solution than merge, but anyway.

I have a DataFrame with a single column which is basically an index: it contains distinct numbers. I also have a DataFrame where one column contains some (but not all) values from that index, while the other columns contain useful data. I want to extend this second DataFrame to include all values from the index, filling the other columns with zeros. I do this by calling

pd.merge(df_with_index, smaller_df_with_data, on='col_index', how='outer').fillna(0)

and end up with a DataFrame where all columns except for col_index are cast to float.
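A minimal sketch of the behavior described above (the column and value names here are made up for illustration, not taken from the original data):

```python
import pandas as pd

# df_with_index holds every key; smaller_df_with_data covers only some of them
df_with_index = pd.DataFrame({"col_index": [1, 2, 3]})
smaller_df_with_data = pd.DataFrame({"col_index": [1, 2], "value": [10, 20]})

out = pd.merge(df_with_index, smaller_df_with_data,
               on="col_index", how="outer").fillna(0)
# The outer merge introduces NaN into "value", which upcasts the
# int64 column to float64 before fillna ever runs.
```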

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.23-1-MANJARO
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: None
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
jreback (Contributor) commented Mar 7, 2018

You can add .fillna(0, downcast='infer') if you want.

This is just adding more complexity to an already very complex pd.merge. We already have .fillna for this purpose.
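A sketch of that suggestion with made-up frames. Note that the downcast argument of fillna was deprecated in pandas 2.2, so an explicit astype is shown here as the forward-compatible equivalent:

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 2, 3]})
right = pd.DataFrame({"k": [1, 2], "v": [10, 20]})

merged = pd.merge(left, right, on="k", how="outer")
# "v" is float64 here because the unmatched key produced a NaN.
# Filling the NaN and casting back recovers the integer dtype:
filled = merged.fillna(0).astype({"v": "int64"})
```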

jreback closed this as completed Mar 7, 2018
jreback added the Reshaping and Dtype Conversions labels Mar 7, 2018
jreback added this to the won't fix milestone Mar 7, 2018
TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018
GSanchis commented Nov 23, 2018

So I just stumbled into this thread looking for the solution the OP is asking.

I have a different case: a DataFrame with NaNs, which should remain NaNs, that I need to merge with another DataFrame which has only partial information for the first one:

df1 = pd.read_csv("all.csv")
df2 = pd.read_csv("activity.csv")
df3 = df1.merge(df2, left_on='dealId', right_on='dealId')
df3 = df3.fillna(0)

In this example, all.csv has missing values in some columns and rows, and I don't want to make up fake values for them (later on I will most likely want to filter out those rows, but only in some cases, depending on whether I am selecting the column with NaNs). activity.csv contains activity data for only some of the dealIds in all.csv, and I do want the dealIds that are not in activity.csv to be filled with 0 (no activity). However, with the code above I end up with a DataFrame where the NaNs from the original all.csv are also replaced by 0, which is not what I intend.

One possible workaround would be something like this:

import numpy as np

some_very_odd_value = 123123123123123123
df1 = pd.read_csv("all.csv").fillna(some_very_odd_value)
df2 = pd.read_csv("activity.csv")
df3 = df1.merge(df2, left_on='dealId', right_on='dealId')
df3 = df3.replace(some_very_odd_value, np.nan)

But this feels like something I should not be doing.
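One way to avoid the sentinel is to restrict fillna to the columns introduced by the merge, so pre-existing NaNs survive. A sketch with made-up data, assuming only the columns coming from df2 should be filled:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"dealId": [1, 2, 3], "price": [100.0, np.nan, 300.0]})
df2 = pd.DataFrame({"dealId": [1, 3], "activity": [5.0, 7.0]})

merged = df1.merge(df2, on="dealId", how="left")
# Fill only the columns that came from df2; df1's own NaNs stay NaN.
cols_from_df2 = df2.columns.difference(["dealId"])
merged[cols_from_df2] = merged[cols_from_df2].fillna(0)
```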

th0ger commented Apr 27, 2020

I have a similar issue: working only with string data, the missing values still become NaN, which makes no sense in a string column. My target is to have None values.

The proposed fix .fillna(value=None, downcast='infer') does not work, because fillna thinks the value parameter is missing:
"ValueError: Must specify a fill 'value' or 'method'."

goerlitz commented

Same issue here. Like @GSanchis I want to left-join (merge) two DataFrames where, in my case, the second one has a single column of ints or strings whose missing values need to be filled with 0 or an empty string, respectively.
This is a very common use case in many data science/data mining tasks. But the automatic conversion to float for capturing NaN, with no means to fill the NaN values without much effort, always breaks stuff.
With all due respect for the greatness of pandas, I really don't understand why such basic requirements for data manipulation won't be fixed.

ecureuil commented

Hello, this feature is needed: on very large and complex datasets we focus on memory size, and pandas changes column types to float32 even if you have int8 :(

TomAugspurger (Contributor) commented Sep 9, 2020

A similar issue affects things implemented based on align (#31874)

allComputableThings commented

+1. fillna is no solution.
If your merge produces nulls for an int64 column, casting to float is a disaster: large integer values lose precision in the cast.
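The precision point can be checked directly; float64 has a 53-bit significand, so int64 values above 2**53 cannot round-trip (a minimal sketch):

```python
import numpy as np

big = np.int64(2**53 + 1)       # representable exactly as int64
as_float = np.float64(big)      # what a NaN-forced upcast would store
round_trip = np.int64(as_float)
# round_trip is 2**53, not 2**53 + 1: the low bit was rounded away
```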

leonarduschen (Contributor) commented

+1 we need this feature

Perhaps something similar to fill_value in reindex?
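For reference, reindex already shows the requested behavior: its fill_value is applied directly to the rows it introduces, so no NaN ever appears and the dtype is preserved (small sketch):

```python
import pandas as pd

s = pd.Series([10, 20], index=[0, 1])
# The row introduced for index 2 gets 0 directly instead of NaN,
# so the int64 dtype survives.
r = s.reindex([0, 1, 2], fill_value=0)
```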

buhtz commented Sep 7, 2021

fillna() is definitely not a solution but a workaround.

  • You want an int column and have an int column.
  • But merge() converts that int column to float (because of missing values).
  • After that you set the missing values to something you want (e.g. 0).
  • And then you convert the whole column back(!) to int.

Does anyone disagree that this is only a workaround and not a solution?

But I am not sure whether we need a fillna argument in merge(), because we could also discuss why merge() does the conversion at all! pandas.NA is not always float anymore!

You can have NA in integer columns.

# pandas 1.2.5
>>> df = pd.DataFrame({'n': [1, 3, pd.NA]}, dtype='Int64')
>>> df.n
0       1
1       3
2    <NA>
Name: n, dtype: Int64

Question: Why doesn't merge() handle the NA type by itself?

phofl (Member) commented Sep 7, 2021

It does handle the NA type. NA is the missing value indicator for extension dtypes; e.g. if you are merging

left = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
right = pd.DataFrame({"a": [1, 2, 4], "b": 1}, dtype="Int64")

pd.merge(left, right, how="outer")

you keep Int64 dtype. But if you change the dtypes to int64 np.nan is used, hence the conversion to float.
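The comparison above, runnable end to end (same frames as in the comment; the int64 variants are added here to show the contrast):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
right = pd.DataFrame({"a": [1, 2, 4], "b": 1}, dtype="Int64")

out_ext = pd.merge(left, right, how="outer")
# Nullable extension dtype survives: "b" stays Int64, with <NA>
# marking the rows that had no match.

out_np = pd.merge(left.astype("int64"), right.astype("int64"), how="outer")
# With plain numpy int64, np.nan is the only available missing
# marker, so "b" is upcast to float64.
```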

buhtz commented Sep 7, 2021

Thanks, I understand now.

But maybe the docs could be improved here? I am not sure about it.

Besides explaining the technical details and background, just add an FAQ-like section to the merge() docs: "Why does merge convert my integer column to float, and how do I solve this?"

For newbies it is quite hard to understand the difference between Int64 and int64.

phofl (Member) commented Sep 7, 2021

We have sections in the user guide explaining the nullable dtypes. Since this is relevant for all functions that may produce missing values, I don't think adding an explanation to merge specifically is desirable.

allComputableThings commented Sep 8, 2021 via email

MarcoGorelli (Member) commented

If you think the docs can be improved, then pull requests are welcome.

> If your data consists of categorical values, then perhaps Pandas is not for you

Why? What's the issue with categorical values?

allComputableThings commented Sep 8, 2021 via email

MarcoGorelli (Member) commented

Please show a minimal reproducible example - this isn't what I'm seeing (though I may be misunderstanding):

In [3]: left = pd.DataFrame({"a": [1, 2, 3]}, dtype="Int64")
   ...: right = pd.DataFrame({"a": [1, 2, 4], "b": 1}, dtype="Int64")
   ...: right['b'] = right['b'].astype('category')
   ...: 
   ...: pd.merge(left, right, how="outer")
Out[3]: 
   a    b
0  1    1
1  2    1
2  3  NaN
3  4    1
