ENH: Reduce DF/Series to smallest possible dtype #14158

hnykda · 2016-09-05T17:28:46Z

Hi,

I have been currently working with dataset mostly consisting of binary features (column has only 1, 0 or NaN). I know there is a category dtype, but that is not usable in some cases, e.g. when you want to store it HDF and can't use tables format (e.g. because of to wide tables). I have datasets with about 50k rows and 7k columns and it is wasting a lot of memory storing 0/1 as float64 (because int64 can't handle NaNs - but it is still terribly big). In memory size is about 15 GB, while reduced is about 4GB and HDF files are of course much smaller.

So I use this function to reduce a Series to smallest possible dtype, effectively reducing the size of the dataset (up to 8 times from 64b -> 8b):

import numpy as np
def safely_reduce_dtype(ser):  # pandas.Series or numpy.array  
    orig_dtype = "".join([x for x in ser.dtype.name if x.isalpha()]) # float/int
    mx = 1
    for val in ser.values:
        new_itemsize = np.min_scalar_type(val).itemsize
        if mx < new_itemsize:
            mx = new_itemsize
    new_dtype = orig_dtype + str(mx * 8)
    return new_dtype # or converts the pandas.Series by ser.astype(new_dtype)

it's far from perfect and probably some edge cases may occur. It could be definitely enhanced somehow, take this as a first proposal.

I think it could be added as a utility function, something like pd.to_numeric or pd.to_datetime.

What do you think?

Example

>>> import pandas
>>> serie = pd.Series([1,0,1,0], dtype='int32')
>>> safely_reduce_dtype(serie)
dtype('int8')

>>> float_serie = pd.Series([1,0,1,0])
>>> safely_reduce_dtype(float_serie)
dtype('float8')  # from float64

or when returning ser.astype(new_type):

>>> import numpy as np
>>> import pandas as pd

>>> rands = np.random.randint(1,100, 10000)
>>> ser_orig = pd.Series(rands)
>>> ser_reduced = safely_reduce_dtype(ser_orig)
>>> print(ser_orig.memory_usage(), ser_reduced.memory_usage())
80080 10080

The text was updated successfully, but these errors were encountered:

jreback · 2016-09-05T17:34:42Z

you mean like this: #13352 (this is in 0.19.0, rc coming soon)

In [7]: s = pd.Series([1,0,1,0])

In [8]: s
Out[8]: 
0    1
1    0
2    1
3    0
dtype: int64

In [9]: pd.to_numeric?

In [10]: pd.to_numeric(s, downcast='float')
Out[10]: 
0    1.0
1    0.0
2    1.0
3    0.0
dtype: float32

In [11]: pd.to_numeric(s, downcast='integer')
Out[11]: 
0    1
1    0
2    1
3    0
dtype: int8

in general you won't be able to go below float32. float16 is *barelysupported, andfloat8`` is nonsensical.

hnykda · 2016-09-05T17:35:15Z

Ahh. Sorry, didn't know that.

jorisvandenbossche · 2016-09-06T14:24:32Z

@hnykda No reason you should already have known it, since it's not yet released :-)

jreback closed this as completed Sep 5, 2016

jreback added Enhancement Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request labels Sep 5, 2016

jreback added this to the No action milestone Sep 5, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: Reduce DF/Series to smallest possible dtype #14158

ENH: Reduce DF/Series to smallest possible dtype #14158

hnykda commented Sep 5, 2016 •

edited

Loading

jreback commented Sep 5, 2016

Uh oh!

hnykda commented Sep 5, 2016

Uh oh!

jorisvandenbossche commented Sep 6, 2016

Uh oh!

Uh oh!

ENH: Reduce DF/Series to smallest possible dtype #14158

ENH: Reduce DF/Series to smallest possible dtype #14158

Comments

hnykda commented Sep 5, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Example

jreback commented Sep 5, 2016

Uh oh!

hnykda commented Sep 5, 2016

Uh oh!

jorisvandenbossche commented Sep 6, 2016

Uh oh!

hnykda commented Sep 5, 2016 •

edited

Loading