Commit 3d26bd6

ENH: add to/from_parquet with pyarrow & fastparquet

1 parent ecaeea1

14 files changed: +470 −4 lines

ci/requirements-2.7.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 27"
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format pyarrow fastparquet

ci/requirements-3.5.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 35"
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format pyarrow

ci/requirements-3.5_DOC.sh

Lines changed: 1 addition & 1 deletion
@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"
 
 pip install pandas-gbq
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format pyarrow fastparquet
 
 conda install -n pandas -c r r rpy2 --yes

ci/requirements-3.5_OSX.sh

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 35_OSX"
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format fastparquet

ci/requirements-3.6.pip

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+brotlipy

ci/requirements-3.6.run

Lines changed: 3 additions & 0 deletions
@@ -15,6 +15,9 @@ jinja2
 sqlalchemy
 pymysql
 feather-format
+pyarrow
+python-snappy
+fastparquet
 # psycopg2 (not avail on defaults ATM)
 beautifulsoup4
 s3fs

ci/requirements-3.6_WIN.run

Lines changed: 3 additions & 0 deletions
@@ -11,3 +11,6 @@ numexpr
 pytables
 matplotlib
 blosc
+fastparquet
+# not supported currently
+# pyarrow

doc/source/install.rst

Lines changed: 1 addition & 0 deletions
@@ -236,6 +236,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* `Parquet Format <https://parquet.apache.org/>`__: either `pyarrow <https://pyarrow.readthedocs.io/en/latest/>`__ or `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__ is necessary for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ libraries are available for compression support.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:
 
 * `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL

doc/source/io.rst

Lines changed: 64 additions & 0 deletions
@@ -43,6 +43,7 @@ object. The corresponding ``writer`` functions are object methods that are acces
 binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
 binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
 binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
+binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
 binary;`Msgpack <http://msgpack.org/index.html>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
 binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
 binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
@@ -4505,6 +4506,69 @@ Read from a feather file.
    import os
    os.remove('example.feather')
 
+
+.. _io.parquet:
+
+Parquet
+-------
+
+.. versionadded:: 0.20.0
+
+Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing
+data frames efficient, and to make sharing data across data analysis languages easy.
+
+Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz.
+
+Several caveats:
+
+- The format will NOT write an ``Index`` or ``MultiIndex`` for the ``DataFrame``, and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+- Duplicate column names and non-string column names are not supported.
+- Unsupported types include ``Period`` and actual Python object types; these will raise a helpful error
+  message on an attempt at serialization.
+
+See the documentation for `pyarrow <https://pyarrow.readthedocs.io/en/latest/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': list('abc'),
+                      'b': list(range(1, 4)),
+                      'c': np.arange(3, 6).astype('u1'),
+                      'd': np.arange(4.0, 7.0, dtype='float64'),
+                      'e': [True, False, True],
+                      'f': pd.Categorical(list('abc')),
+                      'g': pd.date_range('20130101', periods=3),
+                      'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                      'i': pd.date_range('20130101', periods=3, freq='ns')})
+
+   df
+   df.dtypes
+
+Write to a parquet file.
+
+.. ipython:: python
+
+   df.to_parquet('example_pa.pq', engine='pyarrow')
+   df.to_parquet('example_fp.pq', engine='fastparquet')
+
+Read from a parquet file.
+
+.. ipython:: python
+
+   result = pd.read_parquet('example_pa.pq', engine='pyarrow')
+   result = pd.read_parquet('example_fp.pq', engine='fastparquet')
+
+   # we preserve dtypes
+   result.dtypes
+
+.. ipython:: python
+   :suppress:
+
+   import os
+   os.remove('example_pa.pq')
+   os.remove('example_fp.pq')
+
 .. _io.sql:
 
 SQL Queries

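The caveat list in the new Parquet section above means a DataFrame with a non-default index must be reset before writing. A minimal sketch of the documented ``.reset_index()`` workaround (hypothetical file name; assumes pyarrow is installed):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]}, index=list('xyz'))

    # writing df directly would raise ValueError, since this commit
    # serializes only a default integer index; move it into a column first
    df.reset_index().to_parquet('indexed.pq', 'pyarrow')
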
doc/source/whatsnew/v0.20.0.txt

Lines changed: 1 addition & 0 deletions
@@ -274,6 +274,7 @@ Other Enhancements
 ^^^^^^^^^^^^^^^^^^
 
 - Integration with the ``feather-format``, including a new top-level ``pd.read_feather()`` and ``DataFrame.to_feather()`` method, see :ref:`here <io.feather>`.
+- Integration with the ``parquet`` format, including a new top-level ``pd.read_parquet()`` and ``DataFrame.to_parquet()`` method, see :ref:`here <io.parquet>`.
 - ``Series.str.replace()`` now accepts a callable, as replacement, which is passed to ``re.sub`` (:issue:`15055`)
 - ``Series.str.replace()`` now accepts a compiled regular expression as a pattern (:issue:`15446`)
 
pandas/core/frame.py

Lines changed: 19 additions & 0 deletions
@@ -1520,6 +1520,25 @@ def to_feather(self, fname):
         from pandas.io.feather_format import to_feather
         to_feather(self, fname)
 
+    def to_parquet(self, fname, engine, compression=None):
+        """
+        Write a DataFrame to the binary parquet format.
+
+        .. versionadded:: 0.20.0
+
+        Parameters
+        ----------
+        fname : str
+            string file path
+        engine : str
+            parquet engine; supported engines are {'pyarrow', 'fastparquet'}
+        compression : str, optional
+            compression method, one of {'gzip', 'snappy', 'brotli'}
+
+        """
+        from pandas.io.parquet import to_parquet
+        to_parquet(self, fname, engine, compression=compression)
+
     @Substitution(header='Write out column names. If a list of string is given, \
 it is assumed to be aliases for the column names')
     @Appender(fmt.docstring_to_string, indents=1)
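
The new ``DataFrame.to_parquet`` method simply delegates to ``pandas.io.parquet.to_parquet``. A minimal usage sketch (hypothetical file name; assumes pyarrow and python-snappy are installed, per the ci requirement files above):

    import pandas as pd

    df = pd.DataFrame({'a': range(3), 'b': list('abc')})

    # engine is a required argument in this signature; compression is optional
    df.to_parquet('example.parquet', 'pyarrow', compression='snappy')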

pandas/io/parquet.py

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+""" parquet compat """
+
+from warnings import catch_warnings
+from pandas import DataFrame, RangeIndex, Int64Index
+from pandas.compat import range
+
+
+def _try_import_pyarrow():
+    # since pandas is a dependency of pyarrow
+    # we need to import on first use
+
+    try:
+        import pyarrow
+    except ImportError:
+        raise ImportError("pyarrow is required for parquet support\n\n"
+                          "you can install via conda\n"
+                          "conda install pyarrow -c conda-forge\n"
+                          "\nor via pip\n"
+                          "pip install pyarrow\n")
+
+    return pyarrow
+
+
+def _try_import_fastparquet():
+    # since pandas is a dependency of fastparquet
+    # we need to import on first use
+
+    try:
+        import fastparquet
+    except ImportError:
+        raise ImportError("fastparquet is required for parquet support\n\n"
+                          "you can install via conda\n"
+                          "conda install fastparquet -c conda-forge\n"
+                          "\nor via pip\n"
+                          "pip install fastparquet")
+
+    return fastparquet
+
+
+def _validate_engine(engine):
+    if engine not in ['pyarrow', 'fastparquet']:
+        raise ValueError("engine must be one of 'pyarrow', 'fastparquet'")
+
+
+def to_parquet(df, path, engine, compression=None):
+    """
+    Write a DataFrame to the parquet format.
+
+    Parameters
+    ----------
+    df : DataFrame
+    path : string
+        File path
+    engine : str
+        parquet engine; supported engines are {'pyarrow', 'fastparquet'}
+    compression : str, optional
+        compression method, one of {'gzip', 'snappy', 'brotli'}
+    """
+
+    _validate_engine(engine)
+
+    if not isinstance(df, DataFrame):
+        raise ValueError("to_parquet only supports IO with DataFrames")
+
+    valid_types = {'string', 'unicode'}
+
+    # validate index
+    # --------------
+
+    # validate that we have only a default index
+    # raise on anything else as we don't serialize the index
+
+    if not isinstance(df.index, Int64Index):
+        raise ValueError("parquet does not support serializing {} "
+                         "for the index; you can .reset_index() "
+                         "to make the index into column(s)".format(
+                             type(df.index)))
+
+    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
+        raise ValueError("parquet does not support serializing a "
+                         "non-default index for the index; you can "
+                         ".reset_index() to make the index into column(s)")
+
+    if df.index.name is not None:
+        raise ValueError("parquet does not serialize index meta-data on a "
+                         "default index")
+
+    # validate columns
+    # ----------------
+
+    # must have valid column names (strings only)
+    if df.columns.inferred_type not in valid_types:
+        raise ValueError("parquet must have string column names")
+
+    if engine == 'pyarrow':
+        pyarrow = _try_import_pyarrow()
+        from pyarrow import parquet as pq
+
+        table = pyarrow.Table.from_pandas(df)
+        pq.write_table(table, path, compression=compression)
+
+    elif engine == 'fastparquet':
+        fastparquet = _try_import_fastparquet()
+
+        # thriftpy/protocol/compact.py:339:
+        # DeprecationWarning: tostring() is deprecated.
+        # Use tobytes() instead.
+        with catch_warnings(record=True):
+            fastparquet.write(path, df, compression=compression)
+
+
+def read_parquet(path, engine):
+    """
+    Load a parquet object from the file path
+
+    .. versionadded:: 0.20.0
+
+    Parameters
+    ----------
+    path : string
+        File path
+    engine : str
+        parquet engine; supported engines are {'pyarrow', 'fastparquet'}
+
+    Returns
+    -------
+    type of object stored in file
+    """
+
+    _validate_engine(engine)
+
+    if engine == 'pyarrow':
+        pyarrow = _try_import_pyarrow()
+        return pyarrow.parquet.read_table(path).to_pandas()
+
+    elif engine == 'fastparquet':
+        fastparquet = _try_import_fastparquet()
+        pf = fastparquet.ParquetFile(path)
+        return pf.to_pandas()

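A round-trip sketch of the module-level API added here (hypothetical path; assumes fastparquet is installed):

    import pandas as pd
    from pandas.io.parquet import read_parquet, to_parquet

    df = pd.DataFrame({'a': range(3), 'b': list('abc')})

    to_parquet(df, 'roundtrip.parquet', 'fastparquet')
    result = read_parquet('roundtrip.parquet', 'fastparquet')

    # an unrecognized engine fails fast in _validate_engine
    try:
        read_parquet('roundtrip.parquet', 'csv')
    except ValueError as err:
        print(err)  # engine must be one of 'pyarrow', 'fastparquet'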