Skip to content

Index Dtype Not Preserved During read_fwf #21555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jlandercy opened this issue Jun 20, 2018 · 3 comments
Closed

Index Dtype Not Preserved During read_fwf #21555

jlandercy opened this issue Jun 20, 2018 · 3 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request IO Data IO issues that don't fit into a more specific label

Comments

@jlandercy
Copy link

jlandercy commented Jun 20, 2018

Code Sample (copy-pastable, MCVE)

Consider the following code:

import io
import pandas as pd

# Trial FWF file:
data = io.StringIO('x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n')

# Read and cast:
df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
# Then index:
df1.set_index(1, inplace=True)

# Read, cast and index at once:
data.seek(0)
df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

Problem description

As I understand the documentation about control switches:

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD of dtype conversion.

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame.
If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters
at the end of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)

Both output should be equal but it is not.

When indexing at once using index_col switch, column is inferred to be int and casted, making the switch dtype useless in this case.

>>> df1.index
Index(['001', '001', '001', '002', '002', '002', '003', '003'], dtype='object', name=1)

>>> df2.index
Int64Index([1, 1, 1, 2, 2, 2, 3, 3], dtype='int64', name=1)

>>> df1.equals(df2)
False

Expected Output

I think the expected output of:

df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

Should be equal to:

df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
df1.set_index(1, inplace=True)

If not, it just makes no sense to be able to protect columns from casting using dtype switch.
For this reason, I think it is a kind of slight bug or inconsistency.

Anyway, as provided in MCVE above, there exists a solution to circonvolve the problem.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@jlandercy jlandercy changed the title Switch index_col override dtype casting when reading data with read_fwf Switch index_col override dtype casting when reading data with read_fwf Jun 20, 2018
@WillAyd
Copy link
Member

WillAyd commented Jun 20, 2018

Hmm that does look buggy. Investigation and PRs are always welcome!

@WillAyd WillAyd added Bug IO Data IO issues that don't fit into a more specific label Dtype Conversions Unexpected or buggy dtype conversions labels Jun 20, 2018
@WillAyd WillAyd changed the title Switch index_col override dtype casting when reading data with read_fwf Index Dtype Not Preserved During read_fwf Jun 20, 2018
@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

i think might be an open issue about this

@WillAyd
Copy link
Member

WillAyd commented Apr 24, 2019

Closing as a duplicate of #9435

@WillAyd WillAyd closed this as completed Apr 24, 2019
@WillAyd WillAyd added the Duplicate Report Duplicate issue or pull request label Apr 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

No branches or pull requests

3 participants