BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

ajkaijanaho · 2021-02-01T13:16:35Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample

from openpyxl import load_workbook
from pandas import read_excel

wb = load_workbook("testfile.xlsx")
read_excel(wb, engine="openpyxl")

Problem description

In pandas 1.1.5, the above code completes with no problems.

In pandas 1.2.1, it causes the following exception:

Traceback (most recent call last):
  File "c:/Users/akaijanaho/scratch/pandas-openpyxl-bug/bug.py", line 5, in <module>
    read_excel(wb, engine="openpyxl")
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 1057, in __init__
    ext = inspect_excel_format(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\excel\_base.py", line 938, in inspect_excel_format
    with get_handle(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\common.py", line 558, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "C:\Users\akaijanaho\scratch\pandas-openpyxl-bug\venv\lib\site-packages\pandas\io\common.py", line 371, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'openpyxl.workbook.workbook.Workbook'>

The documentation does not specify Workbook as an acceptable value type for io, but supporting it seems reasonable and accords with the 1.1.5 behavior.

In my use case, we mainly parse an Excel file with openpyxl but use pandas with a specific sub-problem. We would like to reuse the same Workbook instead of having pandas re-read the file.

The text was updated successfully, but these errors were encountered:

twoertwein · 2021-02-01T17:02:54Z

I will have time later today to look into this. Can you open/wrap the workbook with ExcelFile and then pass it to read_excel? I'm not too familiar with the ExcelFile syntax, something like the code below might avoid opening the file twice as a workaround for 1.2.1.

import pandas as pd

with pd.ExcelFile('test.xlsx', engine="openpyxl") as excel:
    dataframe = pd.read_excel(excel)

lithomas1 · 2021-02-02T02:04:34Z

Can reproduce on master.

>>> from openpyxl import load_workbook
>>> from pandas import read_excel
>>> wb = load_workbook("pandas/tests/io/data/excel/blank.xlsx") 
>>> read_excel(wb, engine="openpyxl")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\liwende\pandas\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 340, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 1063, in __init__
    content_or_path=path_or_buffer, storage_options=storage_options
  File "C:\Users\liwende\pandas\pandas\io\excel\_base.py", line 945, in inspect_excel_format
    content_or_path, "rb", storage_options=storage_options, is_text=False
  File "C:\Users\liwende\pandas\pandas\io\common.py", line 576, in get_handle
    storage_options=storage_options,
  File "C:\Users\liwende\pandas\pandas\io\common.py", line 378, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'openpyxl.workbook.workbook.Workbook'>

lithomas1 · 2021-02-02T02:10:21Z

@twoertwein
I suspect that https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_base.py#L1072-L1077 (master) and https://github.com/pandas-dev/pandas/blob/1.2.x/pandas/io/excel/_base.py#L1065-L1070 (for pandas 1.2) should probably be only executed with no engine specified.
However, the bigger issue is that inspect_excel_format should be able to pick up that this is an openpyxl workbook, which it doesn't since it errors in get_handle(which makes sense as an openpyxl workbook is not a buffer I think).

twoertwein · 2021-02-02T02:41:52Z

Yes, it could probably be fixed by something like:

diff --git a/pandas/io/excel/_base.py b/pandas/io/excel/_base.py
index 213be7c..fb6ca55 100644
--- a/pandas/io/excel/_base.py
+++ b/pandas/io/excel/_base.py
@@ -1071,7 +1071,7 @@ class ExcelFile:

         if xlrd_version is not None and isinstance(path_or_buffer, xlrd.Book):
             ext = "xls"
-        else:
+        elif engine in (None, "xlrd", "auto"):
             ext = inspect_excel_format(
                 content_or_path=path_or_buffer, storage_options=storage_options
             )

rhshadrach · 2021-02-02T04:25:23Z

Thanks for the report @ajkaijanaho! @twoertwein - seems to me like the patch should be improving inspect_excel_format itself instead of avoiding calling it.

ajkaijanaho · 2021-02-02T07:36:51Z

I will have time later today to look into this. Can you open/wrap the workbook with ExcelFile and then pass it to read_excel? I'm not too familiar with the ExcelFile syntax, something like the code below might avoid opening the file twice as a workaround for 1.2.1.
import pandas as pd

with pd.ExcelFile('test.xlsx', engine="openpyxl") as excel:
    dataframe = pd.read_excel(excel)

Not really useful for us, because in our code the Workbook object is created somewhere else and passed around the call chain quite a bit before we get to Pandas. Rewriting all those calls to pass around ExcelFile is not really worth the effort. Our workaround is to stay with 1.1.5 for now, and it works well enough.

rhshadrach · 2021-02-02T23:38:33Z

@twoertwein - I was wrong above, your suggested patch is much better. We should really avoid calling inspect_excel_format unless necessary in order to not have to check (and maintain) all sorts of dependency instances.

twoertwein · 2021-02-03T02:06:13Z

@twoertwein - I was wrong above, your suggested patch is much better. We should really avoid calling inspect_excel_format unless necessary in order to not have to check (and maintain) all sorts of dependency instances.

Depending on which behavior is expected, this simple elif-patch is probably not enough. If a user provides a workbook compatible with one of the engines but does not specify an engine explicitly, do we need to auto-detect the engine from the workbook type? If we need that, it should probably go into inspect_excel_format.

rhshadrach · 2021-02-03T03:37:07Z

@twoertwein: Comparing to 1.1.x:

        if engine is None:
            engine = "xlrd"
            if isinstance(path_or_buffer, (BufferedIOBase, RawIOBase)):
                if _is_ods_stream(path_or_buffer):
                    engine = "odf"
            else:
                ext = os.path.splitext(str(path_or_buffer))[-1]
                if ext == ".ods":
                    engine = "odf"
        if engine not in self._engines:
            raise ValueError(f"Unknown engine: {engine}")

It appears to me that passing a Workbook with engine=None would raise (need to test this though). Assuming that, the patch should be a minimal change: only restore support for a openpyxl Workbook when passing engine='openpyxl', and in more generality, avoid the unnecessary type-detection when engine is specified.

Supporting engine=None with Workbooks would be an additional feature, which might be welcome, and should go into 1.3 (or later). My only hesitation here is that it seems the implementation would have to special-case all engines and their individual workbook (and worksheet?) types which I think is a bit undesirable. If there is an implementation that is free of this, then I'd certainly be +1.

ajkaijanaho · 2021-02-03T06:30:02Z

Depending on which behavior is expected, this simple elif-patch is probably not enough. If a user provides a workbook compatible with one of the engines but does not specify an engine explicitly, do we need to auto-detect the engine from the workbook type? If we need that, it should probably go into inspect_excel_format.

The read_excel documentation is quite explicit about this point: "engine str, default None If io is not a buffer or path, this must be set to identify io"

ajkaijanaho added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 1, 2021

lithomas1 added the IO Excel read_excel, to_excel label Feb 2, 2021

lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021

lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Feb 2, 2021

rhshadrach added Regression Functionality that used to work in a prior pandas version and removed Bug labels Feb 2, 2021

rhshadrach modified the milestones: Contributions Welcome, 1.2.2 Feb 2, 2021

ajkaijanaho changed the title ~~BUG: read_excel with Workbook and engine="openpxl" raises ValueError~~ BUG: read_excel with Workbook and engine="openpyxl" raises ValueError Feb 2, 2021

rhshadrach self-assigned this Feb 3, 2021

rhshadrach mentioned this issue Feb 3, 2021

REG: read_excel with engine specified raises on non-path/non-buffer #39586

Merged

4 tasks

twoertwein mentioned this issue Feb 4, 2021

REGR: do not try to infer the excel type from a workbook object #39587

Closed

4 tasks

jreback closed this as completed in #39586 Feb 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

ajkaijanaho commented Feb 1, 2021

twoertwein commented Feb 1, 2021 •

edited

Loading

Uh oh!

lithomas1 commented Feb 2, 2021

Uh oh!

lithomas1 commented Feb 2, 2021 •

edited

Loading

Uh oh!

twoertwein commented Feb 2, 2021 •

edited

Loading

Uh oh!

rhshadrach commented Feb 2, 2021

Uh oh!

ajkaijanaho commented Feb 2, 2021

Uh oh!

rhshadrach commented Feb 2, 2021 •

edited

Loading

Uh oh!

twoertwein commented Feb 3, 2021

Uh oh!

rhshadrach commented Feb 3, 2021

Uh oh!

ajkaijanaho commented Feb 3, 2021

Uh oh!

Uh oh!

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

BUG: read_excel with Workbook and engine="openpyxl" raises ValueError #39528

Comments

ajkaijanaho commented Feb 1, 2021

Code Sample

Problem description

twoertwein commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lithomas1 commented Feb 2, 2021

Uh oh!

lithomas1 commented Feb 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

twoertwein commented Feb 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhshadrach commented Feb 2, 2021

Uh oh!

ajkaijanaho commented Feb 2, 2021

Uh oh!

rhshadrach commented Feb 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

twoertwein commented Feb 3, 2021

Uh oh!

rhshadrach commented Feb 3, 2021

Uh oh!

ajkaijanaho commented Feb 3, 2021

Uh oh!

twoertwein commented Feb 1, 2021 •

edited

Loading

lithomas1 commented Feb 2, 2021 •

edited

Loading

twoertwein commented Feb 2, 2021 •

edited

Loading

rhshadrach commented Feb 2, 2021 •

edited

Loading