-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Disallow non-increasing multi-index header arguments #47314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,6 +61,20 @@ def test_negative_multi_index_header(all_parsers, header): | |
parser.read_csv(StringIO(data), header=header) | ||
|
||
|
||
@pytest.mark.parametrize("header", [([0, 0]), ([1, 0])]) | ||
def test_nonincreasing_multi_index_header(all_parsers, header): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might be missing something, but the
Is there a reason why we don't try to make it work instead of raising an error? I get the code complexity argument, but I haven't looked if this would really add complexity. Also we should probably deprecate first. I would consider this a bug on our side right now, raising is an unexpected change in behavior. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @phofl Thanks for taking a look at my PR. I suspect that the complexity of handling decreasing header arguments is beyond my current pandas skills. The current code handles all the permutations of named/unnamed Index/MultiIndex on both axes in both the Python and C parsers, and I'd probably break it while trying to consume header rows out-of-sequence. Since the user can .swaplevel today, and since the current code fails silently on decreasing header arguments, I think raising a ValueError is an improvement. A more clever person may still implement decreasing header arguments in the future, being careful to replace On the other hand, since header=[n, n] does work maybe we should keep it. I can't imagine a use case for it, but that could be a lack of creativity on my part. Would you prefer we keep it so we don't have to log a deprecation and follow up later? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since passing the same object twice works right now (also gives the correct result), we would have to deprecate before removing. Can't think of an example either, but this does not mean that it is not out there. Decreasing headers work already for the python engine, so we would only have to fix the c engine. So I am -1 on raising Edit: Sorry, they don't work, but it is a trivial fix. Replacing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The c engine fix is equally simple, so I would propose just fixing it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Excellent! I'll close this PR and attempt the fix. Thanks for the investigation/encouragement. |
||
# see gh-47011 | ||
parser = all_parsers | ||
data = """1,2,3,4,5 | ||
6,7,8,9,10 | ||
11,12,13,14,15 | ||
""" | ||
with pytest.raises( | ||
ValueError, match="multi-index header elements must be increasing" | ||
): | ||
parser.read_csv(StringIO(data), header=header) | ||
|
||
|
||
@pytest.mark.parametrize("header", [True, False]) | ||
def test_bool_header_arg(all_parsers, header): | ||
# see gh-6114 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
any(h1 >= h2 for h1, h2 in zip(header, header[:1])
so this check can short circuitThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short circuit is a good idea. If header is already sorted, then
list(header) != sorted(set(header))
is either somewhat faster or comparable in speed toany(h1 >= h2 for h1, h2 in zip(header, header[:1])
, but if header is not sorted than the short circuit makes a big difference. (I got curious and tested up to len(header) = 10**6.)