Skip to content

ENH: Reduce DF/Series to smallest possible dtype #14158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hnykda opened this issue Sep 5, 2016 · 3 comments
Closed

ENH: Reduce DF/Series to smallest possible dtype #14158

hnykda opened this issue Sep 5, 2016 · 3 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Enhancement

Comments

@hnykda
Copy link

hnykda commented Sep 5, 2016

Hi,

I have been currently working with dataset mostly consisting of binary features (column has only 1, 0 or NaN). I know there is a category dtype, but that is not usable in some cases, e.g. when you want to store it HDF and can't use tables format (e.g. because of to wide tables). I have datasets with about 50k rows and 7k columns and it is wasting a lot of memory storing 0/1 as float64 (because int64 can't handle NaNs - but it is still terribly big). In memory size is about 15 GB, while reduced is about 4GB and HDF files are of course much smaller.

So I use this function to reduce a Series to smallest possible dtype, effectively reducing the size of the dataset (up to 8 times from 64b -> 8b):

import numpy as np
def safely_reduce_dtype(ser):  # pandas.Series or numpy.array  
    orig_dtype = "".join([x for x in ser.dtype.name if x.isalpha()]) # float/int
    mx = 1
    for val in ser.values:
        new_itemsize = np.min_scalar_type(val).itemsize
        if mx < new_itemsize:
            mx = new_itemsize
    new_dtype = orig_dtype + str(mx * 8)
    return new_dtype # or converts the pandas.Series by ser.astype(new_dtype)

it's far from perfect and probably some edge cases may occur. It could be definitely enhanced somehow, take this as a first proposal.

I think it could be added as a utility function, something like pd.to_numeric or pd.to_datetime.

What do you think?

Example

>>> import pandas
>>> serie = pd.Series([1,0,1,0], dtype='int32')
>>> safely_reduce_dtype(serie)
dtype('int8')

>>> float_serie = pd.Series([1,0,1,0])
>>> safely_reduce_dtype(float_serie)
dtype('float8')  # from float64  

or when returning ser.astype(new_type):

>>> import numpy as np
>>> import pandas as pd

>>> rands = np.random.randint(1,100, 10000)
>>> ser_orig = pd.Series(rands)
>>> ser_reduced = safely_reduce_dtype(ser_orig)
>>> print(ser_orig.memory_usage(), ser_reduced.memory_usage())
80080 10080
@jreback
Copy link
Contributor

jreback commented Sep 5, 2016

you mean like this: #13352 (this is in 0.19.0, rc coming soon)

In [7]: s = pd.Series([1,0,1,0])

In [8]: s
Out[8]: 
0    1
1    0
2    1
3    0
dtype: int64

In [9]: pd.to_numeric?

In [10]: pd.to_numeric(s, downcast='float')
Out[10]: 
0    1.0
1    0.0
2    1.0
3    0.0
dtype: float32

In [11]: pd.to_numeric(s, downcast='integer')
Out[11]: 
0    1
1    0
2    1
3    0
dtype: int8

in general you won't be able to go below float32. float16 is *barelysupported, andfloat8`` is nonsensical.

@jreback jreback closed this as completed Sep 5, 2016
@hnykda
Copy link
Author

hnykda commented Sep 5, 2016

Ahh. Sorry, didn't know that.

@jreback jreback added Enhancement Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request labels Sep 5, 2016
@jreback jreback added this to the No action milestone Sep 5, 2016
@jorisvandenbossche
Copy link
Member

@hnykda No reason you should already have known it, since it's not yet released :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Enhancement
Projects
None yet
Development

No branches or pull requests

3 participants