Fastpath for to_datetime when providing ISO format as keyword? #8154

Closed
jorisvandenbossche opened this issue Sep 1, 2014 · 7 comments
Labels: API Design, Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@jorisvandenbossche
Member

Say you have to parse some nicely ISO-formatted date strings: you can parse these with to_datetime very fast. But if you were 'overcautious' and provided format="%Y-%m-%d %H:%M:%S" for safety, this seems to be around 20 times slower.
Would it be possible to provide a fastpath for certain provided format strings (as already exists for %Y%m%d, I think)?

In [129]: s = pd.Series(pd.date_range('2000-01-01', periods=1000, freq='H'))

In [130]: s_as_dt_strings = s.apply(lambda x: x.strftime("%Y-%m-%dT%H:%M:%S.%f"))

In [131]: %timeit pd.to_datetime(s_as_dt_strings)
1000 loops, best of 3: 406 µs per loop

In [132]: %timeit pd.to_datetime(s_as_dt_strings, format="%Y-%m-%dT%H:%M:%S.%f")
100 loops, best of 3: 9.73 ms per loop

In [133]: s_as_dt_strings = s.apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))

In [134]: %timeit pd.to_datetime(s_as_dt_strings)
1000 loops, best of 3: 361 µs per loop

In [135]: %timeit pd.to_datetime(s_as_dt_strings, format="%Y-%m-%d %H:%M:%S")
100 loops, best of 3: 8.36 ms per loop

For non-standard formats, providing format does give a big improvement:

In [136]: s_as_dt_strings = s.apply(lambda x: x.strftime("%Y/%m/%d %H:%M:%S"))

In [137]: %timeit pd.to_datetime(s_as_dt_strings)
10 loops, best of 3: 92.2 ms per loop

In [138]: %timeit pd.to_datetime(s_as_dt_strings, format="%Y/%m/%d %H:%M:%S")
100 loops, best of 3: 9.08 ms per loop

@jreback
Contributor

jreback commented Sep 1, 2014

If you provide a format, I think it bypasses the infer_datetime_format option, which is much faster if it knows the format.

So we need to short-circuit on certain formats.

@jorisvandenbossche
Member Author

But infer_datetime_format is False by default, no?

In [134]: %timeit pd.to_datetime(s_as_dt_strings)
1000 loops, best of 3: 382 µs per loop

In [135]: %timeit pd.to_datetime(s_as_dt_strings, infer_datetime_format=True)
1000 loops, best of 3: 963 µs per loop

In [136]: %timeit pd.to_datetime(s_as_dt_strings, format="%Y-%m-%dT%H:%M:%S.%f")
100 loops, best of 3: 9.36 ms per loop

But indeed, the fact that it is faster with infer_datetime_format (which, I would think, guesses the same format as the one I provide with format=) seems to indicate that providing format= could also be faster by using the same kind of short-circuiting.

@jreback
Contributor

jreback commented Sep 1, 2014

exactly

I think it needs a simple check, when format is provided, of whether we can fastpath it (e.g. for recognized formats); if so, just follow the infer_datetime_format=True path and ignore the format. I think this is done for YYYYMMDD (only) atm.

@jreback
Contributor

jreback commented Sep 1, 2014

or, in the basic format where we don't need to infer at all, just pass it through (e.g. your first timeit with the 3rd format)

@jorisvandenbossche
Member Author

With infer_datetime_format there is indeed a bypass for iso formatted strings: https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L251, so this should maybe also be done with provided formats
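
The short circuit discussed in the comments above can be sketched in plain Python, without pandas. This is only an illustration of the idea, not the pandas implementation: the allow-list `ISO_LIKE_FORMATS` and the function `parse_all` are hypothetical names, and the standard library's `datetime.fromisoformat` stands in for a dedicated ISO parser while `datetime.strptime` stands in for the general format-driven path.

```python
from datetime import datetime

# Hypothetical allow-list of ISO-like format strings (an assumption for this
# sketch, not a list taken from pandas).
ISO_LIKE_FORMATS = {
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%f",
}


def parse_all(strings, fmt):
    """Parse date strings, short-circuiting when fmt is a recognized ISO format."""
    if fmt in ISO_LIKE_FORMATS:
        # Fast path: hand the strings to the ISO parser and ignore fmt.
        # Caveat: this accepts any ISO-like variant, e.g. a "T" separator
        # even when fmt asked for a space.
        return [datetime.fromisoformat(s) for s in strings]
    # General path: format-driven parsing for everything else.
    return [datetime.strptime(s, fmt) for s in strings]
```

For example, `parse_all(["2000-01-01 00:00:00"], "%Y-%m-%d %H:%M:%S")` takes the fast path, while `parse_all(["2000/01/01 00:00:00"], "%Y/%m/%d %H:%M:%S")` falls back to `strptime`. Whether ignoring the format on the fast path is acceptable is a design choice, since it loosens the validation that passing a format was meant to provide.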

@jreback jreback added this to the 0.15.1 milestone Sep 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@chris-b1
Contributor

chris-b1 commented Oct 6, 2015

@jorisvandenbossche, fyi this is also closed by PR #10615

@jreback
Contributor

jreback commented Oct 6, 2015

thanks

@jreback jreback closed this as completed Oct 6, 2015