Skip to content

BUG: Disallow non-increasing multi-index header arguments #47314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -850,6 +850,7 @@ I/O
- Bug in :func:`read_sas` returned ``None`` rather than an empty DataFrame for SAS7BDAT files with zero rows (:issue:`18198`)
- Bug in :class:`StataWriter` where value labels were always written with default encoding (:issue:`46750`)
- Bug in :class:`StataWriterUTF8` where some valid characters were removed from variable names (:issue:`47276`)
- Bug affecting :func:`read_csv`, :func:`read_excel`, :func:`read_html`, and :func:`read_table`, which accepted decreasing multi-index header arguments even though they don't work (:issue:`47011`)

Period
^^^^^^
Expand Down
2 changes: 2 additions & 0 deletions pandas/io/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -198,6 +198,8 @@ def validate_header_arg(header: object) -> None:
raise ValueError("header must be integer or list of integers")
if any(i < 0 for i in header):
raise ValueError("cannot specify multi-index header with negative integers")
if list(header) != sorted(set(header)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe any(h1 >= h2 for h1, h2 in zip(header, header[:1]) so this check can short circuit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short circuit is a good idea. If header is already sorted, then list(header) != sorted(set(header)) is either somewhat faster or comparable in speed to any(h1 >= h2 for h1, h2 in zip(header, header[:1]), but if header is not sorted than the short circuit makes a big difference. (I got curious and tested up to len(header) = 10**6.)

raise ValueError("multi-index header elements must be increasing")
return
if is_bool(header):
raise TypeError(
Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/io/parser/test_header.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,20 @@ def test_negative_multi_index_header(all_parsers, header):
parser.read_csv(StringIO(data), header=header)


@pytest.mark.parametrize("header", [([0, 0]), ([1, 0])])
def test_nonincreasing_multi_index_header(all_parsers, header):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but the [0, 0] case seems to work

    1   2   3   4   5
    1   2   3   4   5
0   6   7   8   9  10
1  11  12  13  14  15

Is there a reason why we don't try to make it work instead of raising an error? I get the code complexity argument, but I haven't looked if this would really add complexity.

Also we should probably deprecate first. I would consider this a bug on our side right now, raising is an unexpected change in behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl Thanks for taking a look at my PR.

I suspect that the complexity of handling decreasing header arguments is beyond my current pandas skills. The current code handles all the permutations of named/unnamed Index/MultiIndex on both axes in both the Python and C parsers, and I'd probably break it while trying to consume header rows out-of-sequence. Since the user can .swaplevel today, and since the current code fails silently on decreasing header arguments, I think raising a ValueError is an improvement. A more clever person may still implement decreasing header arguments in the future, being careful to replace header[-1] with max(header) in the current code.

On the other hand, since header=[n, n] does work maybe we should keep it. I can't imagine a use case for it, but that could be a lack of creativity on my part. Would you prefer we keep it so we don't have to log a deprecation and follow up later?

Copy link
Member

@phofl phofl Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since passing the same object twice works right now (also gives the correct result), we would have to deprecate before removing. Can't think of an example either, but this does not mean that it is not out there.

Decreasing headers work already for the python engine, so we would only have to fix the c engine. So I am -1 on raising

Edit: Sorry, they don't work, but it is a trivial fix. Replacing [header[-1] + 1] with [max(header) + 1] fixes it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The c engine fix is equally simple, so I would propose just fixing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! I'll close this PR and attempt the fix. Thanks for the investigation/encouragement.

# see gh-47011
parser = all_parsers
data = """1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
"""
with pytest.raises(
ValueError, match="multi-index header elements must be increasing"
):
parser.read_csv(StringIO(data), header=header)


@pytest.mark.parametrize("header", [True, False])
def test_bool_header_arg(all_parsers, header):
# see gh-6114
Expand Down