ENH: Reduce DF/Series to smallest possible dtype #14158
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Duplicate Report (duplicate issue or pull request), Enhancement
Hi,
I have been working with a dataset that consists mostly of binary features (each column contains only 1, 0 or NaN). I know there is a `category` dtype, but it is not usable in some cases, e.g. when you want to store the data in HDF and can't use the `table` format (e.g. because the tables are too wide). I have datasets with about 50k rows and 7k columns, and storing 0/1 as `float64` wastes a lot of memory (because `int64` can't handle NaNs, and it would still be terribly big anyway). The in-memory size is about 15 GB, while the reduced version is about 4 GB, and the HDF files are of course much smaller as well.
So I use this function to reduce a Series to the smallest possible dtype, effectively reducing the size of the dataset (up to 8 times, from 64 bits to 8 bits):
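A minimal sketch of how such a reduction could look, not necessarily the exact function used here. The name `smallest_dtype` is hypothetical, and the sketch assumes plain NumPy-backed numeric Series; it only ever shrinks within the same kind (float stays float, int stays int) and checks floats by round-tripping the non-NaN values:

```python
import numpy as np
import pandas as pd


def smallest_dtype(ser: pd.Series) -> np.dtype:
    """Return the smallest same-kind NumPy dtype that holds every value of `ser`.

    Hypothetical helper: non-numeric, extension-typed or empty Series are
    returned with their dtype unchanged.
    """
    if not isinstance(ser.dtype, np.dtype) or ser.dtype.kind not in "iuf":
        return ser.dtype
    values = ser.to_numpy()
    if values.size == 0:
        return ser.dtype

    if ser.dtype.kind == "f":
        # Ignore NaNs, then accept a smaller float only if the values
        # survive a round trip without any change.
        present = values[~np.isnan(values)]
        with np.errstate(over="ignore"):  # too-large values become inf and fail the check
            for candidate in (np.float16, np.float32):
                if np.array_equal(present.astype(candidate).astype(values.dtype), present):
                    return np.dtype(candidate)
        return ser.dtype

    # Integers: pick the first candidate whose range covers min and max.
    if ser.dtype.kind == "i":
        candidates = (np.int8, np.int16, np.int32)
    else:
        candidates = (np.uint8, np.uint16, np.uint32)
    lo, hi = values.min(), values.max()
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    return ser.dtype


# A binary column with a missing value is stored as float64 by default.
ser = pd.Series([1.0, 0.0, np.nan])
reduced = ser.astype(smallest_dtype(ser))
```

Note that a float column containing NaN can only go down to `float16` this way (a 4x reduction); the 8x case applies to integer columns without missing values.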
It's far from perfect, and some edge cases will probably occur. It could definitely be enhanced; take this as a first proposal.
I think it could be added as a utility function, something like `pd.to_numeric` or `pd.to_datetime`. What do you think?
Example
or when returning `ser.astype(new_type)`: