pd.to_datetime much slower with supplied format than when format is inferred #10178
Comments
@dsimmie as I explained to you, this is as expected. It is doing regex matching. You are welcome to have a look.
See the code here: https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L277 There are fast paths for ISO8601 strings and %Y%m%d. Others hit the regex engine.
Maybe we should clarify
no,
you can certainly profile this if you'd like: https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L2257
Hi Jeff. You said: "If you have repeated non-ISO dates it will help a lot. Since you have an ISO date it doesn't make much difference (as the parser is in c anyhow)". It does make a difference, however: supplying the single unchanging format slows this operation significantly. The date format I put in, %Y-%m-%d, is a valid ISO8601 date, if perhaps not a datetime. Seeing how '%Y%m%d' is given a fast path, perhaps that string could also get one. https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L297

    if format == '%Y%m%d' or format == '%Y-%m-%d':
        try:
            result = _attempt_YYYYMMDD(arg, coerce=coerce)
        except:
            raise ValueError("cannot convert the input to '%Y%m%d' date format")

A change like that would necessitate a change to _attempt_YYYYMMDD that would involve splitting on the hyphen, and I don't know if that is agreeable. I don't really mind that inferring is quicker than being told, but it is certainly not obvious behaviour when you have put in an ISO8601 date string to start.
@dsimmie I misspoke a bit: there is no caching with repeated dates (it could be done and I have seen it done, though I'm not sure of the utility; this would be a cache not on the format but on the actual date values themselves), which is sort of a separate issue. The change you propose will be much slower. Rather, you could map certain ISO8601-like formats to the generic format (which is fast-pathed), e.g. something like:
This is what
OK, thanks for the clarification and your time. It would be nice if that format string '%Y-%m-%d' was fast-pathed... I agree my solution was naive; I haven't seen any of this code before and haven't read it in any detail yet.
Update: this is now a bit repetitive, but I was already typing: I think the point here is that there is a fast path for ISO8601-formatted strings. With So we could do this checking for fastpath after @dsimmie reopening this, as this is a valid improvement I think.
agreed, this is a valid issue. (and the fix is pretty straightforward as I describe above)
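A minimal sketch of the routing idea discussed in this thread, written for illustration only. It is not the code from pandas or from the eventual fix: the names `ISO8601_LIKE_FORMATS`, `format_is_iso8601_like`, and `choose_parser` are hypothetical, the set of formats is an assumed sample, and treating an omitted format as always taking the ISO8601 path is a simplification.

```python
# Hypothetical sketch, not pandas' actual implementation: decide which parsing
# path an explicit strftime-style format would take if ISO8601-like formats
# were mapped onto the generic ISO8601 fast path.

# Assumed sample of formats that describe an ISO8601-style layout.
ISO8601_LIKE_FORMATS = {
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%f",
}


def format_is_iso8601_like(fmt):
    """Return True if fmt is one of the assumed ISO8601-style formats."""
    return fmt in ISO8601_LIKE_FORMATS


def choose_parser(fmt):
    """Return a label for the path a format would take in this sketch."""
    # Simplification: with no format given, assume the ISO8601 parser is used.
    if fmt is None or format_is_iso8601_like(fmt):
        return "iso8601_fast_path"
    # The thread mentions an existing dedicated fast path for '%Y%m%d'.
    if fmt == "%Y%m%d":
        return "yyyymmdd_fast_path"
    # Everything else falls back to the slower regex-based strptime route.
    return "strptime"


print(choose_parser("%Y-%m-%d"))
print(choose_parser("%Y%m%d"))
print(choose_parser("%d/%m/%Y"))
```

The point of the design is that the check happens once per call on the format string, not once per value, so it adds negligible cost before dispatching to a parser.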
closed by #10615
Converting a column of date strings with pd.to_datetime is much slower when a date format is supplied than when the format is inferred. I would have thought there should be less work to do when the format is known (and supplied).
To test
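The original snippet is not preserved in this copy of the issue. As a rough stand-in, here is a minimal sketch of such a comparison, assuming a reasonably recent pandas and Python 3. The test data (10,000 generated daily dates) and the helper names `parse_inferred` and `parse_with_format` are my own choices, not taken from the report, and timings will vary by pandas version and machine.

```python
import timeit

import pandas as pd

# Hypothetical test data: 10,000 ISO-style date strings (not from the report).
dates = pd.Series(
    pd.date_range("2000-01-01", periods=10000, freq="D").strftime("%Y-%m-%d")
)


def parse_inferred():
    # Let pandas work out the format on its own.
    return pd.to_datetime(dates)


def parse_with_format():
    # Supply the format explicitly, as in the issue title.
    return pd.to_datetime(dates, format="%Y-%m-%d")


# Both calls should give the same values; only the timing may differ.
assert parse_inferred().equals(parse_with_format())

t_inferred = timeit.timeit(parse_inferred, number=5)
t_format = timeit.timeit(parse_with_format, number=5)
print("inferred: {:.4f}s  explicit format: {:.4f}s".format(t_inferred, t_format))
```

On versions that include the fix referenced in this thread, the two timings may well be close; the gap described here was reported against pandas 0.15.2.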
This plot is taken from this S/O post which shows the difference over a larger range of sizes (and compared to other methods).
INSTALLED VERSIONS
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.2.0-b1
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: 1.4.0
rpy2: None
sqlalchemy: 1.0.0
pymysql: None
psycopg2: None