-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Raise on parse int overflow #47167 #47168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
8d0efca
27629f0
bfb0b89
661c853
ccb6f61
a3b458a
994a634
567fd58
43bcb22
61a36a5
927ddad
d992af5
498b93d
9e1fbbc
ef91ab5
270eb90
dd5cd0e
64047b4
d812c32
b1f83b9
8f4c947
a935ac9
0c9f4e8
8353cba
60b3018
3e5f929
ba40923
3d72cf2
3f39a5b
7cf208f
520cae3
88d8650
d96d6b0
3508a9f
2c16f74
485dcfc
5896e01
b276196
a1a6764
c9e8a92
39b5c91
92fab59
f653e96
8b406ab
8d782fb
09e6773
a545602
0439322
b392f32
fca428c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -1189,19 +1189,32 @@ cdef class TextReader: | |||||||
return result, na_count | ||||||||
|
||||||||
elif is_integer_dtype(dtype): | ||||||||
try: | ||||||||
result, na_count = _try_int64(self.parser, i, start, | ||||||||
end, na_filter, na_hashset) | ||||||||
if user_dtype and na_count is not None: | ||||||||
if na_count > 0: | ||||||||
raise ValueError(f"Integer column has NA values in column {i}") | ||||||||
except OverflowError: | ||||||||
result = _try_uint64(self.parser, i, start, end, | ||||||||
na_filter, na_hashset) | ||||||||
if user_dtype and dtype == "uint64": | ||||||||
result = _try_uint64(self.parser, i, start, | ||||||||
end, na_filter, na_hashset) | ||||||||
na_count = 0 | ||||||||
else: | ||||||||
try: | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the pattern here is to _try_dtype then if that fails try another one, can you just do that here (rather than all of this if/then logic). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried to stay with the pattern There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm I agree with @jreback, this is hard to read There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also agree and added a |
||||||||
result, na_count = _try_int64(self.parser, i, start, | ||||||||
end, na_filter, na_hashset) | ||||||||
except OverflowError as err: | ||||||||
if user_dtype and dtype == "int64": | ||||||||
raise err | ||||||||
result = _try_uint64(self.parser, i, start, | ||||||||
end, na_filter, na_hashset) | ||||||||
na_count = 0 | ||||||||
else: | ||||||||
if user_dtype and (na_count is not None) and (na_count > 0): | ||||||||
raise ValueError(f"Integer column has NA values in column {i}") | ||||||||
|
||||||||
if result is not None and dtype != "int64": | ||||||||
result = result.astype(dtype) | ||||||||
if result is not None and dtype not in ("int64", "uint64"): | ||||||||
casted = result.astype(dtype) | ||||||||
if (casted == result).all(): | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks expensive. Can you run asvs? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, thanks. I thought the same thing, when I saw exactly this check being applied for parsing (and more generally setting) with the nullable integer extension dtypes pandas/pandas/core/arrays/integer.py Lines 53 to 55 in 3bf2cb1
Just to clarify: The additional check is never performed when parsing with automatically inferred dtype (since TextReader.dtype_cast_order contains only int64 and no other integer dtypes), nor with user specified dtype int64. I just comitted another change that prevents unnecessarily running into this check when parsing with user specified dtype uint64. Just to be sure I ran some existing asvs that seemed most relevant and found no significant change:
I added a simple asv benchmark that triggers the relevant check. It is also not affected at all
A separate timeit on the comparison There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some additional remarks: For running the asvs in io.csv I had to:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel should There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. dtype_can_hold_range is specific to range objects. looks like we have an ndarray here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, comparing ndarrays here. I could change it to np.array_equal for readability, but apart from some safety boilerplate the same check is performed there: https://github.com/numpy/numpy/blob/50a74fb65fc752e77a2f9e9e2b7227629c2ba953/numpy/core/numeric.py#L2468 |
||||||||
result = casted | ||||||||
else: | ||||||||
raise TypeError( | ||||||||
f"cannot safely cast non-equivalent {result.dtype} to {dtype}" | ||||||||
) | ||||||||
|
||||||||
return result, na_count | ||||||||
|
||||||||
|
Uh oh!
There was an error while loading. Please reload this page.