Skip to content

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
ajkaijanaho opened this issue Feb 1, 2021 · 10 comments · Fixed by #39586
Closed
2 of 3 tasks

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

ajkaijanaho opened this issue Feb 1, 2021 · 10 comments · Fixed by #39586
Assignees
Labels
IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@ajkaijanaho
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

from openpyxl import load_workbook
from pandas import read_excel

wb = load_workbook("testfile.xlsx")
read_excel(wb, engine="openpyxl")

Problem description

In pandas 1.1.5, the above code completes with no problems.

In pandas 1.2.1, it causes the following exception:

Traceback (most recent call last):
  File "c:/Users/akaijanaho/scratch/pandas-openpyxl-bug/bug.py", line 5, in <module>
    read_excel(wb, engine="openpyxl")
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 1057, in __init__
    ext = inspect_excel_format(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 938, in inspect_excel_format
    with get_handle(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\common.py", line 558, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\common.py", line 371, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'openpyxl.workbook.workbook.Workbook'>

The documentation does not specify Workbook as an acceptable value type for io, but supporting it seems reasonable and accords with the 1.1.5 behavior.

In my use case, we mainly parse an Excel file with openpyxl but use pandas with a specific sub-problem. We would like to reuse the same Workbook instead of having pandas re-read the file.

@ajkaijanaho ajkaijanaho added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 1, 2021
@twoertwein
Copy link
Member

twoertwein commented Feb 1, 2021

I will have time later today to look into this. Can you open/wrap the workbook with ExcelFile and then pass it to read_excel? I'm not too familiar with the ExcelFile syntax, something like the code below might avoid opening the file twice as a workaround for 1.2.1.

import pandas as pd

with pd.ExcelFile('test.xlsx', engine="openpyxl") as excel:
    dataframe = pd.read_excel(excel)

@lithomas1
Copy link
Member

Can reproduce on master.

>>> from openpyxl import load_workbook
>>> from pandas import read_excel
>>> wb = load_workbook("pandas/tests/io/data/excel/blank.xlsx") 
>>> read_excel(wb, engine="openpyxl")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\liwende\pandas\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 340, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 1063, in __init__
    content_or_path=path_or_buffer, storage_options=storage_options
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 945, in inspect_excel_format
    content_or_path, "rb", storage_options=storage_options, is_text=False
  File "C:\Users\liwende\pandas\pandas\io\common.py", line 576, in get_handle
    storage_options=storage_options,
  File "C:\Users\liwende\pandas\pandas\io\common.py", line 378, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'openpyxl.workbook.workbook.Workbook'>

@lithomas1
Copy link
Member

lithomas1 commented Feb 2, 2021

@twoertwein
I suspect that https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_base.py#L1072-L1077 (master) and https://github.com/pandas-dev/pandas/blob/1.2.x/pandas/io/excel/_base.py#L1065-L1070 (for pandas 1.2) should probably be only executed with no engine specified.
However, the bigger issue is that inspect_excel_format should be able to pick up that this is an openpyxl workbook, which it doesn't since it errors in get_handle(which makes sense as an openpyxl workbook is not a buffer I think).

@lithomas1 lithomas1 added the IO Excel read_excel, to_excel label Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Feb 2, 2021
@twoertwein
Copy link
Member

twoertwein commented Feb 2, 2021

Yes, it could probably be fixed by something like:

diff --git a/pandas/io/excel/_base.py b/pandas/io/excel/_base.py
index 213be7c..fb6ca55 100644
--- a/pandas/io/excel/_base.py
+++ b/pandas/io/excel/_base.py
@@ -1071,7 +1071,7 @@ class ExcelFile:

         if xlrd_version is not None and isinstance(path_or_buffer, xlrd.Book):
             ext = "xls"
-        else:
+        elif engine in (None, "xlrd", "auto"):
             ext = inspect_excel_format(
                 content_or_path=path_or_buffer, storage_options=storage_options
             )

@rhshadrach rhshadrach added Regression Functionality that used to work in a prior pandas version and removed Bug labels Feb 2, 2021
@rhshadrach rhshadrach modified the milestones: Contributions Welcome, 1.2.2 Feb 2, 2021
@rhshadrach
Copy link
Member

Thanks for the report @ajkaijanaho! @twoertwein - seems to me like the patch should be improving inspect_excel_format itself instead of avoiding calling it.

@ajkaijanaho
Copy link
Author

I will have time later today to look into this. Can you open/wrap the workbook with ExcelFile and then pass it to read_excel? I'm not too familiar with the ExcelFile syntax, something like the code below might avoid opening the file twice as a workaround for 1.2.1.

import pandas as pd

with pd.ExcelFile('test.xlsx', engine="openpyxl") as excel:
    dataframe = pd.read_excel(excel)

Not really useful for us, because in our code the Workbook object is created somewhere else and passed around the call chain quite a bit before we get to Pandas. Rewriting all those calls to pass around ExcelFile is not really worth the effort. Our workaround is to stay with 1.1.5 for now, and it works well enough.

@ajkaijanaho ajkaijanaho changed the title BUG: read_excel with Workbook and engine="openpxl" raises ValueError BUG: read_excel with Workbook and engine="openpyxl" raises ValueError Feb 2, 2021
@rhshadrach
Copy link
Member

rhshadrach commented Feb 2, 2021

@twoertwein - I was wrong above, your suggested patch is much better. We should really avoid calling inspect_excel_format unless necessary in order to not have to check (and maintain) all sorts of dependency instances.

@twoertwein
Copy link
Member

@twoertwein - I was wrong above, your suggested patch is much better. We should really avoid calling inspect_excel_format unless necessary in order to not have to check (and maintain) all sorts of dependency instances.

Depending on which behavior is expected, this simple elif-patch is probably not enough. If a user provides a workbook compatible with one of the engines but does not specify an engine explicitly, do we need to auto-detect the engine from the workbook type? If we need that, it should probably go into inspect_excel_format.

@rhshadrach
Copy link
Member

@twoertwein: Comparing to 1.1.x:

        if engine is None:
            engine = "xlrd"
            if isinstance(path_or_buffer, (BufferedIOBase, RawIOBase)):
                if _is_ods_stream(path_or_buffer):
                    engine = "odf"
            else:
                ext = os.path.splitext(str(path_or_buffer))[-1]
                if ext == ".ods":
                    engine = "odf"
        if engine not in self._engines:
            raise ValueError(f"Unknown engine: {engine}")

It appears to me that passing a Workbook with engine=None would raise (need to test this though). Assuming that, the patch should be a minimal change: only restore support for a openpyxl Workbook when passing engine='openpyxl', and in more generality, avoid the unnecessary type-detection when engine is specified.

Supporting engine=None with Workbooks would be an additional feature, which might be welcome, and should go into 1.3 (or later). My only hesitation here is that it seems the implementation would have to special-case all engines and their individual workbook (and worksheet?) types which I think is a bit undesirable. If there is an implementation that is free of this, then I'd certainly be +1.

@ajkaijanaho
Copy link
Author

Depending on which behavior is expected, this simple elif-patch is probably not enough. If a user provides a workbook compatible with one of the engines but does not specify an engine explicitly, do we need to auto-detect the engine from the workbook type? If we need that, it should probably go into inspect_excel_format.

The read_excel documentation is quite explicit about this point: "engine str, default None If io is not a buffer or path, this must be set to identify io"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
4 participants