Index Dtype Not Preserved During read_fwf #21555

jlandercy · 2018-06-20T12:04:43Z

Code Sample (copy-pastable, MCVE)

Consider the following code:

import io
import pandas as pd

# Trial FWF file:
data = io.StringIO('x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n')

# Read and cast:
df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
# Then index:
df1.set_index(1, inplace=True)

# Read, cast and index at once:
data.seek(0)
df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

Problem description

As I understand the documentation about control switches:

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD of dtype conversion.

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame.
If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters
at the end of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)

Both outputs should be equal, but they are not.

When the index is set in the same call through the index_col switch, the index column is inferred to be int and cast accordingly, so the dtype switch has no effect on it.

>>> df1.index
Index(['001', '001', '001', '002', '002', '002', '003', '003'], dtype='object', name=1)

>>> df2.index
Int64Index([1, 1, 1, 2, 2, 2, 3, 3], dtype='int64', name=1)

>>> df1.equals(df2)
False

Expected Output

I think the expected output of:

df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

should be equal to:

df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
df1.set_index(1, inplace=True)

If not, being able to protect columns from casting with the dtype switch makes little sense.
For this reason, I think this is a minor bug or inconsistency.

In any case, as shown in the MCVE above, there is a way to work around the problem: read and cast first, then set the index.
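
A minimal sketch of that workaround as a reusable function, assuming the same sample data as in the MCVE. The helper name `read_fwf_keep_index_dtype` is hypothetical and not part of pandas; it only wraps the two calls already shown above (`pd.read_fwf` without `index_col`, followed by `set_index`).

```python
import io

import pandas as pd


def read_fwf_keep_index_dtype(buf, index_col, **kwargs):
    """Hypothetical helper: read and cast first, then index."""
    # Read without index_col so the dtype mapping applies to every column.
    df = pd.read_fwf(buf, **kwargs)
    # Promote the already-cast column to the index afterwards.
    return df.set_index(index_col)


# Same trial FWF data as in the MCVE.
data = io.StringIO('x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n')

df = read_fwf_keep_index_dtype(
    data,
    index_col=1,
    widths=[2, 3, 1],
    header=None,
    dtype={0: str, 1: str, 2: int},
)
print(df.index.tolist())
```

Because the index is built from a column that was already read as str, the leading zeros survive. This relies only on the behaviour reported for `df1` in this issue.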

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-06-20T13:05:57Z

Hmm that does look buggy. Investigation and PRs are always welcome!

jreback · 2018-06-20T23:00:00Z

I think there might already be an open issue about this.

WillAyd · 2019-04-24T16:34:03Z

Closing as a duplicate of #9435

jlandercy changed the title ~~Switch index_col override dtype casting when reading data with read_fwf~~ Switch index_col override dtype casting when reading data with read_fwf Jun 20, 2018

WillAyd added the Bug, IO Data, and Dtype Conversions labels Jun 20, 2018

WillAyd changed the title ~~Switch index_col override dtype casting when reading data with read_fwf~~ Index Dtype Not Preserved During read_fwf Jun 20, 2018

WillAyd closed this as completed Apr 24, 2019

WillAyd added the Duplicate Report label Apr 24, 2019
