From bd47a7dae002ed81b559152c009245e51ecf6889 Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Thu, 14 Jun 2018 15:55:34 +0100 Subject: [PATCH] Fixing documentation lists indentation --- doc/source/api.rst | 6 +- doc/source/basics.rst | 43 ++- doc/source/categorical.rst | 10 +- doc/source/comparison_with_r.rst | 10 +- doc/source/computation.rst | 46 +-- doc/source/contributing.rst | 70 ++--- doc/source/contributing_docstring.rst | 76 ++--- doc/source/developer.rst | 14 +- doc/source/dsintro.rst | 26 +- doc/source/ecosystem.rst | 16 +- doc/source/enhancingperf.rst | 42 +-- doc/source/extending.rst | 6 +- doc/source/gotchas.rst | 4 +- doc/source/groupby.rst | 54 ++-- doc/source/indexing.rst | 82 +++--- doc/source/install.rst | 14 +- doc/source/internals.rst | 38 +-- doc/source/io.rst | 404 +++++++++++++------------- doc/source/merging.rst | 72 ++--- doc/source/options.rst | 8 +- doc/source/overview.rst | 26 +- doc/source/reshaping.rst | 48 +-- doc/source/sparse.rst | 6 +- doc/source/timeseries.rst | 40 +-- doc/source/tutorials.rst | 164 +++++------ doc/source/visualization.rst | 6 +- 26 files changed, 665 insertions(+), 666 deletions(-) diff --git a/doc/source/api.rst b/doc/source/api.rst index 4faec93490fde..f2c00d5d12031 100644 --- a/doc/source/api.rst +++ b/doc/source/api.rst @@ -1200,9 +1200,9 @@ Attributes and underlying data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Axes** - * **items**: axis 0; each item corresponds to a DataFrame contained inside - * **major_axis**: axis 1; the index (rows) of each of the DataFrames - * **minor_axis**: axis 2; the columns of each of the DataFrames +* **items**: axis 0; each item corresponds to a DataFrame contained inside +* **major_axis**: axis 1; the index (rows) of each of the DataFrames +* **minor_axis**: axis 2; the columns of each of the DataFrames .. autosummary:: :toctree: generated/ diff --git a/doc/source/basics.rst b/doc/source/basics.rst index 74f1d80c6fd3d..c460b19640f46 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -50,9 +50,8 @@ Attributes and the raw ndarray(s) pandas objects have a number of attributes enabling you to access the metadata - * **shape**: gives the axis dimensions of the object, consistent with ndarray - * Axis labels - +* **shape**: gives the axis dimensions of the object, consistent with ndarray +* Axis labels * **Series**: *index* (only axis) * **DataFrame**: *index* (rows) and *columns* * **Panel**: *items*, *major_axis*, and *minor_axis* @@ -131,9 +130,9 @@ Flexible binary operations With binary operations between pandas data structures, there are two key points of interest: - * Broadcasting behavior between higher- (e.g. DataFrame) and - lower-dimensional (e.g. Series) objects. - * Missing data in computations. +* Broadcasting behavior between higher- (e.g. DataFrame) and + lower-dimensional (e.g. Series) objects. +* Missing data in computations. We will demonstrate how to manage these issues independently, though they can be handled simultaneously. @@ -462,10 +461,10 @@ produce an object of the same size. 
Generally speaking, these methods take an **axis** argument, just like *ndarray.{sum, std, ...}*, but the axis can be specified by name or integer: - - **Series**: no axis argument needed - - **DataFrame**: "index" (axis=0, default), "columns" (axis=1) - - **Panel**: "items" (axis=0), "major" (axis=1, default), "minor" - (axis=2) +* **Series**: no axis argument needed +* **DataFrame**: "index" (axis=0, default), "columns" (axis=1) +* **Panel**: "items" (axis=0), "major" (axis=1, default), "minor" + (axis=2) For example: @@ -1187,11 +1186,11 @@ It is used to implement nearly all other features relying on label-alignment functionality. To *reindex* means to conform the data to match a given set of labels along a particular axis. This accomplishes several things: - * Reorders the existing data to match a new set of labels - * Inserts missing value (NA) markers in label locations where no data for - that label existed - * If specified, **fill** data for missing labels using logic (highly relevant - to working with time series data) +* Reorders the existing data to match a new set of labels +* Inserts missing value (NA) markers in label locations where no data for + that label existed +* If specified, **fill** data for missing labels using logic (highly relevant + to working with time series data) Here is a simple example: @@ -1911,10 +1910,10 @@ the axis indexes, since they are immutable) and returns a new object. Note that **it is seldom necessary to copy objects**. For example, there are only a handful of ways to alter a DataFrame *in-place*: - * Inserting, deleting, or modifying a column. - * Assigning to the ``index`` or ``columns`` attributes. - * For homogeneous data, directly modifying the values via the ``values`` - attribute or advanced indexing. +* Inserting, deleting, or modifying a column. +* Assigning to the ``index`` or ``columns`` attributes. +* For homogeneous data, directly modifying the values via the ``values`` + attribute or advanced indexing. To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object, leaving the original object @@ -2112,14 +2111,14 @@ Because the data was transposed the original inference stored all columns as obj The following functions are available for one dimensional object arrays or scalars to perform hard conversion of objects to a specified type: -- :meth:`~pandas.to_numeric` (conversion to numeric dtypes) +* :meth:`~pandas.to_numeric` (conversion to numeric dtypes) .. ipython:: python m = ['1.1', 2, 3] pd.to_numeric(m) -- :meth:`~pandas.to_datetime` (conversion to datetime objects) +* :meth:`~pandas.to_datetime` (conversion to datetime objects) .. ipython:: python @@ -2127,7 +2126,7 @@ hard conversion of objects to a specified type: m = ['2016-07-09', datetime.datetime(2016, 3, 2)] pd.to_datetime(m) -- :meth:`~pandas.to_timedelta` (conversion to timedelta objects) +* :meth:`~pandas.to_timedelta` (conversion to timedelta objects) .. ipython:: python diff --git a/doc/source/categorical.rst b/doc/source/categorical.rst index c6827f67a390b..acab9de905540 100644 --- a/doc/source/categorical.rst +++ b/doc/source/categorical.rst @@ -542,11 +542,11 @@ Comparisons Comparing categorical data with other objects is possible in three cases: - * Comparing equality (``==`` and ``!=``) to a list-like object (list, Series, array, - ...) of the same length as the categorical data. 
- * All comparisons (``==``, ``!=``, ``>``, ``>=``, ``<``, and ``<=``) of categorical data to - another categorical Series, when ``ordered==True`` and the `categories` are the same. - * All comparisons of a categorical data to a scalar. +* Comparing equality (``==`` and ``!=``) to a list-like object (list, Series, array, + ...) of the same length as the categorical data. +* All comparisons (``==``, ``!=``, ``>``, ``>=``, ``<``, and ``<=``) of categorical data to + another categorical Series, when ``ordered==True`` and the `categories` are the same. +* All comparisons of a categorical data to a scalar. All other comparisons, especially "non-equality" comparisons of two categoricals with different categories or a categorical with any list-like object, will raise a ``TypeError``. diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst index a7586f623a160..eecacde8ad14e 100644 --- a/doc/source/comparison_with_r.rst +++ b/doc/source/comparison_with_r.rst @@ -18,11 +18,11 @@ was started to provide a more detailed look at the `R language party libraries as they relate to ``pandas``. In comparisons with R and CRAN libraries, we care about the following things: - - **Functionality / flexibility**: what can/cannot be done with each tool - - **Performance**: how fast are operations. Hard numbers/benchmarks are - preferable - - **Ease-of-use**: Is one tool easier/harder to use (you may have to be - the judge of this, given side-by-side code comparisons) +* **Functionality / flexibility**: what can/cannot be done with each tool +* **Performance**: how fast are operations. Hard numbers/benchmarks are + preferable +* **Ease-of-use**: Is one tool easier/harder to use (you may have to be + the judge of this, given side-by-side code comparisons) This page is also here to offer a bit of a translation guide for users of these R packages. diff --git a/doc/source/computation.rst b/doc/source/computation.rst index ff06c369e1897..5e7b8be5f8af0 100644 --- a/doc/source/computation.rst +++ b/doc/source/computation.rst @@ -344,20 +344,20 @@ The weights used in the window are specified by the ``win_type`` keyword. The list of recognized types are the `scipy.signal window functions `__: -- ``boxcar`` -- ``triang`` -- ``blackman`` -- ``hamming`` -- ``bartlett`` -- ``parzen`` -- ``bohman`` -- ``blackmanharris`` -- ``nuttall`` -- ``barthann`` -- ``kaiser`` (needs beta) -- ``gaussian`` (needs std) -- ``general_gaussian`` (needs power, width) -- ``slepian`` (needs width). +* ``boxcar`` +* ``triang`` +* ``blackman`` +* ``hamming`` +* ``bartlett`` +* ``parzen`` +* ``bohman`` +* ``blackmanharris`` +* ``nuttall`` +* ``barthann`` +* ``kaiser`` (needs beta) +* ``gaussian`` (needs std) +* ``general_gaussian`` (needs power, width) +* ``slepian`` (needs width). .. ipython:: python @@ -537,10 +537,10 @@ Binary Window Functions two ``Series`` or any combination of ``DataFrame/Series`` or ``DataFrame/DataFrame``. Here is the behavior in each case: -- two ``Series``: compute the statistic for the pairing. -- ``DataFrame/Series``: compute the statistics for each column of the DataFrame +* two ``Series``: compute the statistic for the pairing. +* ``DataFrame/Series``: compute the statistics for each column of the DataFrame with the passed Series, thus returning a DataFrame. -- ``DataFrame/DataFrame``: by default compute the statistic for matching column +* ``DataFrame/DataFrame``: by default compute the statistic for matching column names, returning a DataFrame. 
If the keyword argument ``pairwise=True`` is passed then computes the statistic for each pair of columns, returning a ``MultiIndexed DataFrame`` whose ``index`` are the dates in question (see :ref:`the next section @@ -741,10 +741,10 @@ Aside from not having a ``window`` parameter, these functions have the same interfaces as their ``.rolling`` counterparts. Like above, the parameters they all accept are: -- ``min_periods``: threshold of non-null data points to require. Defaults to +* ``min_periods``: threshold of non-null data points to require. Defaults to minimum needed to compute statistic. No ``NaNs`` will be output once ``min_periods`` non-null data points have been seen. -- ``center``: boolean, whether to set the labels at the center (default is False). +* ``center``: boolean, whether to set the labels at the center (default is False). .. _stats.moments.expanding.note: .. note:: @@ -903,12 +903,12 @@ of an EW moment: One must specify precisely one of **span**, **center of mass**, **half-life** and **alpha** to the EW functions: -- **Span** corresponds to what is commonly called an "N-day EW moving average". -- **Center of mass** has a more physical interpretation and can be thought of +* **Span** corresponds to what is commonly called an "N-day EW moving average". +* **Center of mass** has a more physical interpretation and can be thought of in terms of span: :math:`c = (s - 1) / 2`. -- **Half-life** is the period of time for the exponential weight to reduce to +* **Half-life** is the period of time for the exponential weight to reduce to one half. -- **Alpha** specifies the smoothing factor directly. +* **Alpha** specifies the smoothing factor directly. Here is an example for a univariate time series: diff --git a/doc/source/contributing.rst b/doc/source/contributing.rst index 6ae93ba46fa5c..ff06d024740bf 100644 --- a/doc/source/contributing.rst +++ b/doc/source/contributing.rst @@ -138,11 +138,11 @@ steps; you only need to install the compiler. For Windows developers, the following links may be helpful. -- https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/ -- https://github.com/conda/conda-recipes/wiki/Building-from-Source-on-Windows-32-bit-and-64-bit -- https://cowboyprogrammer.org/building-python-wheels-for-windows/ -- https://blog.ionelmc.ro/2014/12/21/compiling-python-extensions-on-windows/ -- https://support.enthought.com/hc/en-us/articles/204469260-Building-Python-extensions-with-Canopy +* https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/ +* https://github.com/conda/conda-recipes/wiki/Building-from-Source-on-Windows-32-bit-and-64-bit +* https://cowboyprogrammer.org/building-python-wheels-for-windows/ +* https://blog.ionelmc.ro/2014/12/21/compiling-python-extensions-on-windows/ +* https://support.enthought.com/hc/en-us/articles/204469260-Building-Python-extensions-with-Canopy Let us know if you have any difficulties by opening an issue or reaching out on `Gitter`_. 
@@ -155,11 +155,11 @@ Creating a Python Environment Now that you have a C compiler, create an isolated pandas development environment: -- Install either `Anaconda `_ or `miniconda +* Install either `Anaconda `_ or `miniconda `_ -- Make sure your conda is up to date (``conda update conda``) -- Make sure that you have :ref:`cloned the repository ` -- ``cd`` to the *pandas* source directory +* Make sure your conda is up to date (``conda update conda``) +* Make sure that you have :ref:`cloned the repository ` +* ``cd`` to the *pandas* source directory We'll now kick off a three-step process: @@ -286,7 +286,7 @@ complex changes to the documentation as well. Some other important things to know about the docs: -- The *pandas* documentation consists of two parts: the docstrings in the code +* The *pandas* documentation consists of two parts: the docstrings in the code itself and the docs in this folder ``pandas/doc/``. The docstrings provide a clear explanation of the usage of the individual @@ -294,7 +294,7 @@ Some other important things to know about the docs: overviews per topic together with some other information (what's new, installation, etc). -- The docstrings follow a pandas convention, based on the **Numpy Docstring +* The docstrings follow a pandas convention, based on the **Numpy Docstring Standard**. Follow the :ref:`pandas docstring guide ` for detailed instructions on how to write a correct docstring. @@ -303,7 +303,7 @@ Some other important things to know about the docs: contributing_docstring.rst -- The tutorials make heavy use of the `ipython directive +* The tutorials make heavy use of the `ipython directive `_ sphinx extension. This directive lets you put code in the documentation which will be run during the doc build. For example:: @@ -324,7 +324,7 @@ Some other important things to know about the docs: doc build. This approach means that code examples will always be up to date, but it does make the doc building a bit more complex. -- Our API documentation in ``doc/source/api.rst`` houses the auto-generated +* Our API documentation in ``doc/source/api.rst`` houses the auto-generated documentation from the docstrings. For classes, there are a few subtleties around controlling which methods and attributes have pages auto-generated. @@ -488,8 +488,8 @@ standard. Google provides an open source style checker called ``cpplint``, but w use a fork of it that can be found `here `__. Here are *some* of the more common ``cpplint`` issues: - - we restrict line-length to 80 characters to promote readability - - every header file must include a header guard to avoid name collisions if re-included +* we restrict line-length to 80 characters to promote readability +* every header file must include a header guard to avoid name collisions if re-included :ref:`Continuous Integration ` will run the `cpplint `_ tool @@ -536,8 +536,8 @@ Python (PEP8) There are several tools to ensure you abide by this standard. Here are *some* of the more common ``PEP8`` issues: - - we restrict line-length to 79 characters to promote readability - - passing arguments should have spaces after commas, e.g. ``foo(arg1, arg2, kw1='bar')`` +* we restrict line-length to 79 characters to promote readability +* passing arguments should have spaces after commas, e.g. ``foo(arg1, arg2, kw1='bar')`` :ref:`Continuous Integration ` will run the `flake8 `_ tool @@ -715,14 +715,14 @@ Using ``pytest`` Here is an example of a self-contained set of tests that illustrate multiple features that we like to use. 
-- functional style: tests are like ``test_*`` and *only* take arguments that are either fixtures or parameters -- ``pytest.mark`` can be used to set metadata on test functions, e.g. ``skip`` or ``xfail``. -- using ``parametrize``: allow testing of multiple cases -- to set a mark on a parameter, ``pytest.param(..., marks=...)`` syntax should be used -- ``fixture``, code for object construction, on a per-test basis -- using bare ``assert`` for scalars and truth-testing -- ``tm.assert_series_equal`` (and its counter part ``tm.assert_frame_equal``), for pandas object comparisons. -- the typical pattern of constructing an ``expected`` and comparing versus the ``result`` +* functional style: tests are like ``test_*`` and *only* take arguments that are either fixtures or parameters +* ``pytest.mark`` can be used to set metadata on test functions, e.g. ``skip`` or ``xfail``. +* using ``parametrize``: allow testing of multiple cases +* to set a mark on a parameter, ``pytest.param(..., marks=...)`` syntax should be used +* ``fixture``, code for object construction, on a per-test basis +* using bare ``assert`` for scalars and truth-testing +* ``tm.assert_series_equal`` (and its counter part ``tm.assert_frame_equal``), for pandas object comparisons. +* the typical pattern of constructing an ``expected`` and comparing versus the ``result`` We would name this file ``test_cool_feature.py`` and put in an appropriate place in the ``pandas/tests/`` structure. @@ -969,21 +969,21 @@ Finally, commit your changes to your local repository with an explanatory messag uses a convention for commit message prefixes and layout. Here are some common prefixes along with general guidelines for when to use them: - * ENH: Enhancement, new functionality - * BUG: Bug fix - * DOC: Additions/updates to documentation - * TST: Additions/updates to tests - * BLD: Updates to the build process/scripts - * PERF: Performance improvement - * CLN: Code cleanup +* ENH: Enhancement, new functionality +* BUG: Bug fix +* DOC: Additions/updates to documentation +* TST: Additions/updates to tests +* BLD: Updates to the build process/scripts +* PERF: Performance improvement +* CLN: Code cleanup The following defines how a commit message should be structured. Please reference the relevant GitHub issues in your commit message using GH1234 or #1234. Either style is fine, but the former is generally preferred: - * a subject line with `< 80` chars. - * One blank line. - * Optionally, a commit message body. +* a subject line with `< 80` chars. +* One blank line. +* Optionally, a commit message body. Now you can commit your changes in your local repository:: diff --git a/doc/source/contributing_docstring.rst b/doc/source/contributing_docstring.rst index 4dec2a23facca..afb554aeffbc3 100644 --- a/doc/source/contributing_docstring.rst +++ b/doc/source/contributing_docstring.rst @@ -68,7 +68,7 @@ As PEP-257 is quite open, and some other standards exist on top of it. In the case of pandas, the numpy docstring convention is followed. The conventions is explained in this document: -- `numpydoc docstring guide `_ +* `numpydoc docstring guide `_ (which is based in the original `Guide to NumPy/SciPy documentation `_) @@ -78,9 +78,9 @@ The standard uses reStructuredText (reST). reStructuredText is a markup language that allows encoding styles in plain text files. 
Documentation about reStructuredText can be found in: -- `Sphinx reStructuredText primer `_ -- `Quick reStructuredText reference `_ -- `Full reStructuredText specification `_ +* `Sphinx reStructuredText primer `_ +* `Quick reStructuredText reference `_ +* `Full reStructuredText specification `_ Pandas has some helpers for sharing docstrings between related classes, see :ref:`docstring.sharing`. @@ -107,12 +107,12 @@ In rare occasions reST styles like bold text or italics will be used in docstrings, but is it common to have inline code, which is presented between backticks. It is considered inline code: -- The name of a parameter -- Python code, a module, function, built-in, type, literal... (e.g. ``os``, +* The name of a parameter +* Python code, a module, function, built-in, type, literal... (e.g. ``os``, ``list``, ``numpy.abs``, ``datetime.date``, ``True``) -- A pandas class (in the form ``:class:`pandas.Series```) -- A pandas method (in the form ``:meth:`pandas.Series.sum```) -- A pandas function (in the form ``:func:`pandas.to_datetime```) +* A pandas class (in the form ``:class:`pandas.Series```) +* A pandas method (in the form ``:meth:`pandas.Series.sum```) +* A pandas function (in the form ``:func:`pandas.to_datetime```) .. note:: To display only the last component of the linked class, method or @@ -352,71 +352,71 @@ When specifying the parameter types, Python built-in data types can be used directly (the Python type is preferred to the more verbose string, integer, boolean, etc): -- int -- float -- str -- bool +* int +* float +* str +* bool For complex types, define the subtypes. For `dict` and `tuple`, as more than one type is present, we use the brackets to help read the type (curly brackets for `dict` and normal brackets for `tuple`): -- list of int -- dict of {str : int} -- tuple of (str, int, int) -- tuple of (str,) -- set of str +* list of int +* dict of {str : int} +* tuple of (str, int, int) +* tuple of (str,) +* set of str In case where there are just a set of values allowed, list them in curly brackets and separated by commas (followed by a space). If the values are ordinal and they have an order, list them in this order. Otherwise, list the default value first, if there is one: -- {0, 10, 25} -- {'simple', 'advanced'} -- {'low', 'medium', 'high'} -- {'cat', 'dog', 'bird'} +* {0, 10, 25} +* {'simple', 'advanced'} +* {'low', 'medium', 'high'} +* {'cat', 'dog', 'bird'} If the type is defined in a Python module, the module must be specified: -- datetime.date -- datetime.datetime -- decimal.Decimal +* datetime.date +* datetime.datetime +* decimal.Decimal If the type is in a package, the module must be also specified: -- numpy.ndarray -- scipy.sparse.coo_matrix +* numpy.ndarray +* scipy.sparse.coo_matrix If the type is a pandas type, also specify pandas except for Series and DataFrame: -- Series -- DataFrame -- pandas.Index -- pandas.Categorical -- pandas.SparseArray +* Series +* DataFrame +* pandas.Index +* pandas.Categorical +* pandas.SparseArray If the exact type is not relevant, but must be compatible with a numpy array, array-like can be specified. 
If Any type that can be iterated is accepted, iterable can be used: -- array-like -- iterable +* array-like +* iterable If more than one type is accepted, separate them by commas, except the last two types, that need to be separated by the word 'or': -- int or float -- float, decimal.Decimal or None -- str or list of str +* int or float +* float, decimal.Decimal or None +* str or list of str If ``None`` is one of the accepted values, it always needs to be the last in the list. For axis, the convention is to use something like: -- axis : {0 or 'index', 1 or 'columns', None}, default None +* axis : {0 or 'index', 1 or 'columns', None}, default None .. _docstring.returns: diff --git a/doc/source/developer.rst b/doc/source/developer.rst index b8bb2b2fcbe2f..f76af394abc48 100644 --- a/doc/source/developer.rst +++ b/doc/source/developer.rst @@ -81,20 +81,20 @@ The ``metadata`` field is ``None`` except for: omitted it is assumed to be nanoseconds. * ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}`` - * Here ``'type'`` is optional, and can be a nested pandas type specification - here (but not categorical) + * Here ``'type'`` is optional, and can be a nested pandas type specification + here (but not categorical) * ``unicode``: ``{'encoding': encoding}`` - * The encoding is optional, and if not present is UTF-8 + * The encoding is optional, and if not present is UTF-8 * ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of: - * ``'pickle'`` - * ``'msgpack'`` - * ``'bson'`` - * ``'json'`` + * ``'pickle'`` + * ``'msgpack'`` + * ``'bson'`` + * ``'json'`` * ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether diff --git a/doc/source/dsintro.rst b/doc/source/dsintro.rst index 4d8e7979060f4..efa52a6f7cfe2 100644 --- a/doc/source/dsintro.rst +++ b/doc/source/dsintro.rst @@ -51,9 +51,9 @@ labels are collectively referred to as the **index**. The basic method to create Here, ``data`` can be many different things: - - a Python dict - - an ndarray - - a scalar value (like 5) +* a Python dict +* an ndarray +* a scalar value (like 5) The passed **index** is a list of axis labels. Thus, this separates into a few cases depending on what **data is**: @@ -246,12 +246,12 @@ potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input: - - Dict of 1D ndarrays, lists, dicts, or Series - - 2-D numpy.ndarray - - `Structured or record - `__ ndarray - - A ``Series`` - - Another ``DataFrame`` +* Dict of 1D ndarrays, lists, dicts, or Series +* 2-D numpy.ndarray +* `Structured or record + `__ ndarray +* A ``Series`` +* Another ``DataFrame`` Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, @@ -842,10 +842,10 @@ econometric analysis of panel data. 
However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary: - - **items**: axis 0, each item corresponds to a DataFrame contained inside - - **major_axis**: axis 1, it is the **index** (rows) of each of the - DataFrames - - **minor_axis**: axis 2, it is the **columns** of each of the DataFrames +* **items**: axis 0, each item corresponds to a DataFrame contained inside +* **major_axis**: axis 1, it is the **index** (rows) of each of the + DataFrames +* **minor_axis**: axis 2, it is the **columns** of each of the DataFrames Construction of Panels works about like you would expect: diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index f683fd6892ea5..4e15f9069de67 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -159,14 +159,14 @@ See more in the `pandas-datareader docs `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/enhancingperf.rst b/doc/source/enhancingperf.rst index 979d025111df1..8f8a9fe3e50e0 100644 --- a/doc/source/enhancingperf.rst +++ b/doc/source/enhancingperf.rst @@ -461,15 +461,15 @@ Supported Syntax These operations are supported by :func:`pandas.eval`: -- Arithmetic operations except for the left shift (``<<``) and right shift +* Arithmetic operations except for the left shift (``<<``) and right shift (``>>``) operators, e.g., ``df + 2 * pi / s ** 4 % 42 - the_golden_ratio`` -- Comparison operations, including chained comparisons, e.g., ``2 < df < df2`` -- Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool`` -- ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)`` -- Attribute access, e.g., ``df.a`` -- Subscript expressions, e.g., ``df[0]`` -- Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful) -- Math functions: `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`, +* Comparison operations, including chained comparisons, e.g., ``2 < df < df2`` +* Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool`` +* ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)`` +* Attribute access, e.g., ``df.a`` +* Subscript expressions, e.g., ``df[0]`` +* Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful) +* Math functions: `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`, `sqrt`, `sinh`, `cosh`, `tanh`, `arcsin`, `arccos`, `arctan`, `arccosh`, `arcsinh`, `arctanh`, `abs` and `arctan2`. @@ -477,22 +477,22 @@ This Python syntax is **not** allowed: * Expressions - - Function calls other than math functions. - - ``is``/``is not`` operations - - ``if`` expressions - - ``lambda`` expressions - - ``list``/``set``/``dict`` comprehensions - - Literal ``dict`` and ``set`` expressions - - ``yield`` expressions - - Generator expressions - - Boolean expressions consisting of only scalar values + * Function calls other than math functions. + * ``is``/``is not`` operations + * ``if`` expressions + * ``lambda`` expressions + * ``list``/``set``/``dict`` comprehensions + * Literal ``dict`` and ``set`` expressions + * ``yield`` expressions + * Generator expressions + * Boolean expressions consisting of only scalar values * Statements - - Neither `simple `__ - nor `compound `__ - statements are allowed. This includes things like ``for``, ``while``, and - ``if``. + * Neither `simple `__ + nor `compound `__ + statements are allowed. This includes things like ``for``, ``while``, and + ``if``. 
diff --git a/doc/source/extending.rst b/doc/source/extending.rst index 431c69bc0b6b5..8018d35770924 100644 --- a/doc/source/extending.rst +++ b/doc/source/extending.rst @@ -167,9 +167,9 @@ you can retain subclasses through ``pandas`` data manipulations. There are 3 constructor properties to be defined: -- ``_constructor``: Used when a manipulation result has the same dimensions as the original. -- ``_constructor_sliced``: Used when a manipulation result has one lower dimension(s) as the original, such as ``DataFrame`` single columns slicing. -- ``_constructor_expanddim``: Used when a manipulation result has one higher dimension as the original, such as ``Series.to_frame()`` and ``DataFrame.to_panel()``. +* ``_constructor``: Used when a manipulation result has the same dimensions as the original. +* ``_constructor_sliced``: Used when a manipulation result has one lower dimension(s) as the original, such as ``DataFrame`` single columns slicing. +* ``_constructor_expanddim``: Used when a manipulation result has one higher dimension as the original, such as ``Series.to_frame()`` and ``DataFrame.to_panel()``. Following table shows how ``pandas`` data structures define constructor properties by default. diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index b7042ef390018..79e312ca12833 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -193,9 +193,9 @@ Choice of ``NA`` representation For lack of ``NA`` (missing) support from the ground up in NumPy and Python in general, we were given the difficult choice between either: -- A *masked array* solution: an array of data and an array of boolean values +* A *masked array* solution: an array of data and an array of boolean values indicating whether a value is there or is missing. -- Using a special sentinel value, bit pattern, or set of sentinel values to +* Using a special sentinel value, bit pattern, or set of sentinel values to denote ``NA`` across the dtypes. For many reasons we chose the latter. After years of production use it has diff --git a/doc/source/groupby.rst b/doc/source/groupby.rst index 1c4c3f93726a9..299fbfd12baa8 100644 --- a/doc/source/groupby.rst +++ b/doc/source/groupby.rst @@ -22,36 +22,36 @@ Group By: split-apply-combine By "group by" we are referring to a process involving one or more of the following steps: - - **Splitting** the data into groups based on some criteria. - - **Applying** a function to each group independently. - - **Combining** the results into a data structure. +* **Splitting** the data into groups based on some criteria. +* **Applying** a function to each group independently. +* **Combining** the results into a data structure. Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to one of the following: - - **Aggregation**: compute a summary statistic (or statistics) for each - group. Some examples: +* **Aggregation**: compute a summary statistic (or statistics) for each + group. Some examples: - - Compute group sums or means. - - Compute group sizes / counts. + * Compute group sums or means. + * Compute group sizes / counts. - - **Transformation**: perform some group-specific computations and return a - like-indexed object. Some examples: +* **Transformation**: perform some group-specific computations and return a + like-indexed object. Some examples: - - Standardize data (zscore) within a group. 
- - Filling NAs within groups with a value derived from each group. + * Standardize data (zscore) within a group. + * Filling NAs within groups with a value derived from each group. - - **Filtration**: discard some groups, according to a group-wise computation - that evaluates True or False. Some examples: +* **Filtration**: discard some groups, according to a group-wise computation + that evaluates True or False. Some examples: - - Discard data that belongs to groups with only a few members. - - Filter out data based on the group sum or mean. + * Discard data that belongs to groups with only a few members. + * Filter out data based on the group sum or mean. - - Some combination of the above: GroupBy will examine the results of the apply - step and try to return a sensibly combined result if it doesn't fit into - either of the above two categories. +* Some combination of the above: GroupBy will examine the results of the apply + step and try to return a sensibly combined result if it doesn't fit into + either of the above two categories. Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function @@ -88,15 +88,15 @@ object (more on what the GroupBy object is later), you may do the following: The mapping can be specified many different ways: - - A Python function, to be called on each of the axis labels. - - A list or NumPy array of the same length as the selected axis. - - A dict or ``Series``, providing a ``label -> group name`` mapping. - - For ``DataFrame`` objects, a string indicating a column to be used to group. - Of course ``df.groupby('A')`` is just syntactic sugar for - ``df.groupby(df['A'])``, but it makes life simpler. - - For ``DataFrame`` objects, a string indicating an index level to be used to - group. - - A list of any of the above things. +* A Python function, to be called on each of the axis labels. +* A list or NumPy array of the same length as the selected axis. +* A dict or ``Series``, providing a ``label -> group name`` mapping. +* For ``DataFrame`` objects, a string indicating a column to be used to group. + Of course ``df.groupby('A')`` is just syntactic sugar for + ``df.groupby(df['A'])``, but it makes life simpler. +* For ``DataFrame`` objects, a string indicating an index level to be used to + group. +* A list of any of the above things. Collectively we refer to the grouping objects as the **keys**. For example, consider the following ``DataFrame``: diff --git a/doc/source/indexing.rst b/doc/source/indexing.rst index 2b9fcf874ef22..1c63acce6e3fa 100644 --- a/doc/source/indexing.rst +++ b/doc/source/indexing.rst @@ -17,10 +17,10 @@ Indexing and Selecting Data The axis labeling information in pandas objects serves many purposes: - - Identifies data (i.e. provides *metadata*) using known indicators, - important for analysis, visualization, and interactive console display. - - Enables automatic and explicit data alignment. - - Allows intuitive getting and setting of subsets of the data set. +* Identifies data (i.e. provides *metadata*) using known indicators, + important for analysis, visualization, and interactive console display. +* Enables automatic and explicit data alignment. +* Allows intuitive getting and setting of subsets of the data set. In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. 
The primary focus will be @@ -62,37 +62,37 @@ Object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing. -- ``.loc`` is primarily label based, but may also be used with a boolean array. ``.loc`` will raise ``KeyError`` when the items are not found. Allowed inputs are: +* ``.loc`` is primarily label based, but may also be used with a boolean array. ``.loc`` will raise ``KeyError`` when the items are not found. Allowed inputs are: - - A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a - *label* of the index. This use is **not** an integer position along the - index.). - - A list or array of labels ``['a', 'b', 'c']``. - - A slice object with labels ``'a':'f'`` (Note that contrary to usual python - slices, **both** the start and the stop are included, when present in the - index! See :ref:`Slicing with labels - `.). - - A boolean array - - A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and - that returns valid output for indexing (one of the above). + * A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a + *label* of the index. This use is **not** an integer position along the + index.). + * A list or array of labels ``['a', 'b', 'c']``. + * A slice object with labels ``'a':'f'`` (Note that contrary to usual python + slices, **both** the start and the stop are included, when present in the + index! See :ref:`Slicing with labels + `.). + * A boolean array + * A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and + that returns valid output for indexing (one of the above). .. versionadded:: 0.18.1 See more at :ref:`Selection by Label `. -- ``.iloc`` is primarily integer position based (from ``0`` to +* ``.iloc`` is primarily integer position based (from ``0`` to ``length-1`` of the axis), but may also be used with a boolean array. ``.iloc`` will raise ``IndexError`` if a requested indexer is out-of-bounds, except *slice* indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy *slice* semantics). Allowed inputs are: - - An integer e.g. ``5``. - - A list or array of integers ``[4, 3, 0]``. - - A slice object with ints ``1:7``. - - A boolean array. - - A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and - that returns valid output for indexing (one of the above). + * An integer e.g. ``5``. + * A list or array of integers ``[4, 3, 0]``. + * A slice object with ints ``1:7``. + * A boolean array. + * A ``callable`` function with one argument (the calling Series, DataFrame or Panel) and + that returns valid output for indexing (one of the above). .. versionadded:: 0.18.1 @@ -100,7 +100,7 @@ of multi-axis indexing. :ref:`Advanced Indexing ` and :ref:`Advanced Hierarchical `. -- ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `. +* ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at :ref:`Selection By Callable `. Getting values from an object with multi-axes selection uses the following notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as @@ -343,14 +343,14 @@ Integers are valid labels, but they refer to the label **and not the position**. The ``.loc`` attribute is the primary access method. The following are valid inputs: -- A single label, e.g. 
``5`` or ``'a'`` (Note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the index.). -- A list or array of labels ``['a', 'b', 'c']``. -- A slice object with labels ``'a':'f'`` (Note that contrary to usual python +* A single label, e.g. ``5`` or ``'a'`` (Note that ``5`` is interpreted as a *label* of the index. This use is **not** an integer position along the index.). +* A list or array of labels ``['a', 'b', 'c']``. +* A slice object with labels ``'a':'f'`` (Note that contrary to usual python slices, **both** the start and the stop are included, when present in the index! See :ref:`Slicing with labels `.). -- A boolean array. -- A ``callable``, see :ref:`Selection By Callable `. +* A boolean array. +* A ``callable``, see :ref:`Selection By Callable `. .. ipython:: python @@ -445,11 +445,11 @@ Pandas provides a suite of methods in order to get **purely integer based indexi The ``.iloc`` attribute is the primary access method. The following are valid inputs: -- An integer e.g. ``5``. -- A list or array of integers ``[4, 3, 0]``. -- A slice object with ints ``1:7``. -- A boolean array. -- A ``callable``, see :ref:`Selection By Callable `. +* An integer e.g. ``5``. +* A list or array of integers ``[4, 3, 0]``. +* A slice object with ints ``1:7``. +* A boolean array. +* A ``callable``, see :ref:`Selection By Callable `. .. ipython:: python @@ -599,8 +599,8 @@ bit of user confusion over the years. The recommended methods of indexing are: -- ``.loc`` if you want to *label* index. -- ``.iloc`` if you want to *positionally* index. +* ``.loc`` if you want to *label* index. +* ``.iloc`` if you want to *positionally* index. .. ipython:: python @@ -1455,15 +1455,15 @@ If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: ``duplicated`` and ``drop_duplicates``. Each takes as an argument the columns to use to identify duplicated rows. -- ``duplicated`` returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated. -- ``drop_duplicates`` removes duplicate rows. +* ``duplicated`` returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated. +* ``drop_duplicates`` removes duplicate rows. By default, the first observed row of a duplicate set is considered unique, but each method has a ``keep`` parameter to specify targets to be kept. -- ``keep='first'`` (default): mark / drop duplicates except for the first occurrence. -- ``keep='last'``: mark / drop duplicates except for the last occurrence. -- ``keep=False``: mark / drop all duplicates. +* ``keep='first'`` (default): mark / drop duplicates except for the first occurrence. +* ``keep='last'``: mark / drop duplicates except for the last occurrence. +* ``keep=False``: mark / drop all duplicates. .. ipython:: python diff --git a/doc/source/install.rst b/doc/source/install.rst index e655136904920..87d1b63914635 100644 --- a/doc/source/install.rst +++ b/doc/source/install.rst @@ -261,17 +261,17 @@ Optional Dependencies * `Apache Parquet `__, either `pyarrow `__ (>= 0.4.1) or `fastparquet `__ (>= 0.0.6) for parquet-based storage. The `snappy `__ and `brotli `__ are available for compression support. * `SQLAlchemy `__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs `__. 
Some common drivers are: - * `psycopg2 `__: for PostgreSQL - * `pymysql `__: for MySQL. - * `SQLite `__: for SQLite, this is included in Python's standard library by default. + * `psycopg2 `__: for PostgreSQL + * `pymysql `__: for MySQL. + * `SQLite `__: for SQLite, this is included in Python's standard library by default. * `matplotlib `__: for plotting, Version 1.4.3 or higher. * For Excel I/O: - * `xlrd/xlwt `__: Excel reading (xlrd) and writing (xlwt) - * `openpyxl `__: openpyxl version 2.4.0 - for writing .xlsx files (xlrd >= 0.9.0) - * `XlsxWriter `__: Alternative Excel writer + * `xlrd/xlwt `__: Excel reading (xlrd) and writing (xlwt) + * `openpyxl `__: openpyxl version 2.4.0 + for writing .xlsx files (xlrd >= 0.9.0) + * `XlsxWriter `__: Alternative Excel writer * `Jinja2 `__: Template engine for conditional HTML formatting. * `s3fs `__: necessary for Amazon S3 access (s3fs >= 0.0.7). diff --git a/doc/source/internals.rst b/doc/source/internals.rst index caf5790fb24c6..fce99fc633440 100644 --- a/doc/source/internals.rst +++ b/doc/source/internals.rst @@ -24,23 +24,23 @@ Indexing In pandas there are a few objects implemented which can serve as valid containers for the axis labels: -- ``Index``: the generic "ordered set" object, an ndarray of object dtype +* ``Index``: the generic "ordered set" object, an ndarray of object dtype assuming nothing about its contents. The labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do ``O(1)`` lookups. -- ``Int64Index``: a version of ``Index`` highly optimized for 64-bit integer +* ``Int64Index``: a version of ``Index`` highly optimized for 64-bit integer data, such as time stamps -- ``Float64Index``: a version of ``Index`` highly optimized for 64-bit float data -- ``MultiIndex``: the standard hierarchical index object -- ``DatetimeIndex``: An Index object with ``Timestamp`` boxed elements (impl are the int64 values) -- ``TimedeltaIndex``: An Index object with ``Timedelta`` boxed elements (impl are the in64 values) -- ``PeriodIndex``: An Index object with Period elements +* ``Float64Index``: a version of ``Index`` highly optimized for 64-bit float data +* ``MultiIndex``: the standard hierarchical index object +* ``DatetimeIndex``: An Index object with ``Timestamp`` boxed elements (impl are the int64 values) +* ``TimedeltaIndex``: An Index object with ``Timedelta`` boxed elements (impl are the in64 values) +* ``PeriodIndex``: An Index object with Period elements There are functions that make the creation of a regular index easy: -- ``date_range``: fixed frequency date range generated from a time rule or +* ``date_range``: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Python datetime objects -- ``period_range``: fixed frequency date range generated from a time rule or +* ``period_range``: fixed frequency date range generated from a time rule or DateOffset. 
An ndarray of ``Period`` objects, representing timespans The motivation for having an ``Index`` class in the first place was to enable @@ -52,22 +52,22 @@ From an internal implementation point of view, the relevant methods that an ``Index`` must define are one or more of the following (depending on how incompatible the new object internals are with the ``Index`` functions): -- ``get_loc``: returns an "indexer" (an integer, or in some cases a +* ``get_loc``: returns an "indexer" (an integer, or in some cases a slice object) for a label -- ``slice_locs``: returns the "range" to slice between two labels -- ``get_indexer``: Computes the indexing vector for reindexing / data +* ``slice_locs``: returns the "range" to slice between two labels +* ``get_indexer``: Computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this -- ``get_indexer_non_unique``: Computes the indexing vector for reindexing / data +* ``get_indexer_non_unique``: Computes the indexing vector for reindexing / data alignment purposes when the index is non-unique. See the source / docstrings for more on this -- ``reindex``: Does any pre-conversion of the input index then calls +* ``reindex``: Does any pre-conversion of the input index then calls ``get_indexer`` -- ``union``, ``intersection``: computes the union or intersection of two +* ``union``, ``intersection``: computes the union or intersection of two Index objects -- ``insert``: Inserts a new label into an Index, yielding a new object -- ``delete``: Delete a label, yielding a new object -- ``drop``: Deletes a set of labels -- ``take``: Analogous to ndarray.take +* ``insert``: Inserts a new label into an Index, yielding a new object +* ``delete``: Delete a label, yielding a new object +* ``drop``: Deletes a set of labels +* ``take``: Analogous to ndarray.take MultiIndex ~~~~~~~~~~ diff --git a/doc/source/io.rst b/doc/source/io.rst index 658b9ff15783d..ae6c4f12f04f7 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -252,12 +252,12 @@ Datetime Handling +++++++++++++++++ parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``. - - If ``True`` -> try parsing the index. - - If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date + * If ``True`` -> try parsing the index. + * If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date column. - - If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date + * If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date column. - - If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'. + * If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'. A fast-path exists for iso8601-formatted dates. infer_datetime_format : boolean, default ``False`` If ``True`` and parse_dates is enabled for a column, attempt to infer the @@ -961,12 +961,12 @@ negative consequences if enabled. Here are some examples of datetime strings that can be guessed (All representing December 30th, 2011 at 00:00:00): -- "20111230" -- "2011/12/30" -- "20111230 00:00:00" -- "12/30/2011 00:00:00" -- "30/Dec/2011 00:00:00" -- "30/December/2011 00:00:00" +* "20111230" +* "2011/12/30" +* "20111230 00:00:00" +* "12/30/2011 00:00:00" +* "30/Dec/2011 00:00:00" +* "30/December/2011 00:00:00" Note that ``infer_datetime_format`` is sensitive to ``dayfirst``. With ``dayfirst=True``, it will guess "01/12/2011" to be December 1st. 
With @@ -1303,16 +1303,16 @@ with data files that have known and fixed column widths. The function parameters to ``read_fwf`` are largely the same as `read_csv` with two extra parameters, and a different usage of the ``delimiter`` parameter: - - ``colspecs``: A list of pairs (tuples) giving the extents of the - fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). - String value 'infer' can be used to instruct the parser to try detecting - the column specifications from the first 100 rows of the data. Default - behavior, if not specified, is to infer. - - ``widths``: A list of field widths which can be used instead of 'colspecs' - if the intervals are contiguous. - - ``delimiter``: Characters to consider as filler characters in the fixed-width file. - Can be used to specify the filler character of the fields - if it is not spaces (e.g., '~'). +* ``colspecs``: A list of pairs (tuples) giving the extents of the + fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). + String value 'infer' can be used to instruct the parser to try detecting + the column specifications from the first 100 rows of the data. Default + behavior, if not specified, is to infer. +* ``widths``: A list of field widths which can be used instead of 'colspecs' + if the intervals are contiguous. +* ``delimiter``: Characters to consider as filler characters in the fixed-width file. + Can be used to specify the filler character of the fields + if it is not spaces (e.g., '~'). .. ipython:: python :suppress: @@ -1566,9 +1566,9 @@ possible pandas uses the C parser (specified as ``engine='c'``), but may fall back to Python if C-unsupported options are specified. Currently, C-unsupported options include: -- ``sep`` other than a single character (e.g. regex separators) -- ``skipfooter`` -- ``sep=None`` with ``delim_whitespace=False`` +* ``sep`` other than a single character (e.g. regex separators) +* ``skipfooter`` +* ``sep=None`` with ``delim_whitespace=False`` Specifying any of the above options will produce a ``ParserWarning`` unless the python engine is selected explicitly using ``engine='python'``. @@ -1602,29 +1602,29 @@ The ``Series`` and ``DataFrame`` objects have an instance method ``to_csv`` whic allows storing the contents of the object as a comma-separated-values file. The function takes a number of arguments. Only the first is required. - - ``path_or_buf``: A string path to the file to write or a StringIO - - ``sep`` : Field delimiter for the output file (default ",") - - ``na_rep``: A string representation of a missing value (default '') - - ``float_format``: Format string for floating point numbers - - ``cols``: Columns to write (default None) - - ``header``: Whether to write out the column names (default True) - - ``index``: whether to write row (index) names (default True) - - ``index_label``: Column label(s) for index column(s) if desired. If None - (default), and `header` and `index` are True, then the index names are - used. (A sequence should be given if the ``DataFrame`` uses MultiIndex). - - ``mode`` : Python write mode, default 'w' - - ``encoding``: a string representing the encoding to use if the contents are - non-ASCII, for Python versions prior to 3 - - ``line_terminator``: Character sequence denoting line end (default '\\n') - - ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). 
Note that if you have set a `float_format` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric - - ``quotechar``: Character used to quote fields (default '"') - - ``doublequote``: Control quoting of ``quotechar`` in fields (default True) - - ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when - appropriate (default None) - - ``chunksize``: Number of rows to write at a time - - ``tupleize_cols``: If False (default), write as a list of tuples, otherwise - write in an expanded line format suitable for ``read_csv`` - - ``date_format``: Format string for datetime objects +* ``path_or_buf``: A string path to the file to write or a StringIO +* ``sep`` : Field delimiter for the output file (default ",") +* ``na_rep``: A string representation of a missing value (default '') +* ``float_format``: Format string for floating point numbers +* ``cols``: Columns to write (default None) +* ``header``: Whether to write out the column names (default True) +* ``index``: whether to write row (index) names (default True) +* ``index_label``: Column label(s) for index column(s) if desired. If None + (default), and `header` and `index` are True, then the index names are + used. (A sequence should be given if the ``DataFrame`` uses MultiIndex). +* ``mode`` : Python write mode, default 'w' +* ``encoding``: a string representing the encoding to use if the contents are + non-ASCII, for Python versions prior to 3 +* ``line_terminator``: Character sequence denoting line end (default '\\n') +* ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a `float_format` then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric +* ``quotechar``: Character used to quote fields (default '"') +* ``doublequote``: Control quoting of ``quotechar`` in fields (default True) +* ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when + appropriate (default None) +* ``chunksize``: Number of rows to write at a time +* ``tupleize_cols``: If False (default), write as a list of tuples, otherwise + write in an expanded line format suitable for ``read_csv`` +* ``date_format``: Format string for datetime objects Writing a formatted string ++++++++++++++++++++++++++ @@ -1634,22 +1634,22 @@ Writing a formatted string The ``DataFrame`` object has an instance method ``to_string`` which allows control over the string representation of the object. All arguments are optional: - - ``buf`` default None, for example a StringIO object - - ``columns`` default None, which columns to write - - ``col_space`` default None, minimum width of each column. - - ``na_rep`` default ``NaN``, representation of NA value - - ``formatters`` default None, a dictionary (by column) of functions each of - which takes a single argument and returns a formatted string - - ``float_format`` default None, a function which takes a single (float) - argument and returns a formatted string; to be applied to floats in the - ``DataFrame``. - - ``sparsify`` default True, set to False for a ``DataFrame`` with a hierarchical - index to print every MultiIndex key at each row. 
- - ``index_names`` default True, will print the names of the indices - - ``index`` default True, will print the index (ie, row labels) - - ``header`` default True, will print the column labels - - ``justify`` default ``left``, will print column headers left- or - right-justified +* ``buf`` default None, for example a StringIO object +* ``columns`` default None, which columns to write +* ``col_space`` default None, minimum width of each column. +* ``na_rep`` default ``NaN``, representation of NA value +* ``formatters`` default None, a dictionary (by column) of functions each of + which takes a single argument and returns a formatted string +* ``float_format`` default None, a function which takes a single (float) + argument and returns a formatted string; to be applied to floats in the + ``DataFrame``. +* ``sparsify`` default True, set to False for a ``DataFrame`` with a hierarchical + index to print every MultiIndex key at each row. +* ``index_names`` default True, will print the names of the indices +* ``index`` default True, will print the index (ie, row labels) +* ``header`` default True, will print the column labels +* ``justify`` default ``left``, will print column headers left- or + right-justified The ``Series`` object also has a ``to_string`` method, but with only the ``buf``, ``na_rep``, ``float_format`` arguments. There is also a ``length`` argument @@ -1670,17 +1670,17 @@ Writing JSON A ``Series`` or ``DataFrame`` can be converted to a valid JSON string. Use ``to_json`` with optional parameters: -- ``path_or_buf`` : the pathname or buffer to write the output +* ``path_or_buf`` : the pathname or buffer to write the output This can be ``None`` in which case a JSON string is returned -- ``orient`` : +* ``orient`` : ``Series``: - - default is ``index`` - - allowed values are {``split``, ``records``, ``index``} + * default is ``index`` + * allowed values are {``split``, ``records``, ``index``} ``DataFrame``: - - default is ``columns`` - - allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} + * default is ``columns`` + * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} The format of the JSON string @@ -1694,12 +1694,12 @@ with optional parameters: ``columns``; dict like {column -> {index -> value}} ``values``; just the values array -- ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601. -- ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10. -- ``force_ascii`` : force encoded string to be ASCII, default True. -- ``date_unit`` : The time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'. -- ``default_handler`` : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object. -- ``lines`` : If ``records`` orient, then will write each record per line as json. +* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601. +* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10. +* ``force_ascii`` : force encoded string to be ASCII, default True. +* ``date_unit`` : The time unit to encode to, governs timestamp and ISO8601 precision. 
One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'. +* ``default_handler`` : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object. +* ``lines`` : If ``records`` orient, then will write each record per line as json. Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datetime`` objects will be converted based on the ``date_format`` and ``date_unit`` parameters. @@ -1818,19 +1818,19 @@ Fallback Behavior If the JSON serializer cannot handle the container contents directly it will fall back in the following manner: -- if the dtype is unsupported (e.g. ``np.complex``) then the ``default_handler``, if provided, will be called +* if the dtype is unsupported (e.g. ``np.complex``) then the ``default_handler``, if provided, will be called for each value, otherwise an exception is raised. -- if an object is unsupported it will attempt the following: +* if an object is unsupported it will attempt the following: - * check if the object has defined a ``toDict`` method and call it. - A ``toDict`` method should return a ``dict`` which will then be JSON serialized. + * check if the object has defined a ``toDict`` method and call it. + A ``toDict`` method should return a ``dict`` which will then be JSON serialized. - * invoke the ``default_handler`` if one was provided. + * invoke the ``default_handler`` if one was provided. - * convert the object to a ``dict`` by traversing its contents. However this will often fail - with an ``OverflowError`` or give unexpected results. + * convert the object to a ``dict`` by traversing its contents. However this will often fail + with an ``OverflowError`` or give unexpected results. In general the best approach for unsupported objects or dtypes is to provide a ``default_handler``. For example: @@ -1856,20 +1856,20 @@ Reading a JSON string to pandas object can take a number of parameters. The parser will try to parse a ``DataFrame`` if ``typ`` is not supplied or is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series`` -- ``filepath_or_buffer`` : a **VALID** JSON string or file handle / StringIO. The string could be +* ``filepath_or_buffer`` : a **VALID** JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file ://localhost/path/to/table.json -- ``typ`` : type of object to recover (series or frame), default 'frame' -- ``orient`` : +* ``typ`` : type of object to recover (series or frame), default 'frame' +* ``orient`` : Series : - - default is ``index`` - - allowed values are {``split``, ``records``, ``index``} + * default is ``index`` + * allowed values are {``split``, ``records``, ``index``} DataFrame - - default is ``columns`` - - allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} + * default is ``columns`` + * allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``} The format of the JSON string @@ -1885,20 +1885,20 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series`` ``table``; adhering to the JSON `Table Schema`_ -- ``dtype`` : if True, infer dtypes, if a dict of column to dtype, then use those, if ``False``, then don't infer dtypes at all, default is True, apply only to the data. 
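To make the ``to_json`` / ``read_json`` parameters described here concrete, a small sketch of a round trip follows; the column names, ``orient`` choice and date handling are illustrative only:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'A': [1, 2],
                      'B': pd.date_range('2013-01-01', periods=2)})

   # write record-oriented JSON with ISO8601 dates
   json_str = df.to_json(orient='records', date_format='iso')

   # read it back, explicitly asking for column 'B' to be parsed as datetimes
   df2 = pd.read_json(json_str, orient='records', convert_dates=['B'])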
-- ``convert_axes`` : boolean, try to convert the axes to the proper dtypes, default is ``True`` -- ``convert_dates`` : a list of columns to parse for dates; If ``True``, then try to parse date-like columns, default is ``True``. -- ``keep_default_dates`` : boolean, default ``True``. If parsing dates, then parse the default date-like columns. -- ``numpy`` : direct decoding to NumPy arrays. default is ``False``; +* ``dtype`` : if True, infer dtypes, if a dict of column to dtype, then use those, if ``False``, then don't infer dtypes at all, default is True, apply only to the data. +* ``convert_axes`` : boolean, try to convert the axes to the proper dtypes, default is ``True`` +* ``convert_dates`` : a list of columns to parse for dates; If ``True``, then try to parse date-like columns, default is ``True``. +* ``keep_default_dates`` : boolean, default ``True``. If parsing dates, then parse the default date-like columns. +* ``numpy`` : direct decoding to NumPy arrays. default is ``False``; Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering **MUST** be the same for each term if ``numpy=True``. -- ``precise_float`` : boolean, default ``False``. Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (``False``) is to use fast but less precise builtin functionality. -- ``date_unit`` : string, the timestamp unit to detect if converting dates. Default +* ``precise_float`` : boolean, default ``False``. Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (``False``) is to use fast but less precise builtin functionality. +* ``date_unit`` : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected, if this is not desired then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively. -- ``lines`` : reads file as one json object per line. -- ``encoding`` : The encoding to use to decode py3 bytes. -- ``chunksize`` : when used in combination with ``lines=True``, return a JsonReader which reads in ``chunksize`` lines per iteration. +* ``lines`` : reads file as one json object per line. +* ``encoding`` : The encoding to use to decode py3 bytes. +* ``chunksize`` : when used in combination with ``lines=True``, return a JsonReader which reads in ``chunksize`` lines per iteration. The parser will raise one of ``ValueError/TypeError/AssertionError`` if the JSON is not parseable. @@ -2175,10 +2175,10 @@ object str A few notes on the generated table schema: -- The ``schema`` object contains a ``pandas_version`` field. This contains +* The ``schema`` object contains a ``pandas_version`` field. This contains the version of pandas' dialect of the schema, and will be incremented with each revision. -- All dates are converted to UTC when serializing. Even timezone naive values, +* All dates are converted to UTC when serializing. Even timezone naive values, which are treated as UTC with an offset of 0. .. ipython:: python @@ -2187,7 +2187,7 @@ A few notes on the generated table schema: s = pd.Series(pd.date_range('2016', periods=4)) build_table_schema(s) -- datetimes with a timezone (before serializing), include an additional field +* datetimes with a timezone (before serializing), include an additional field ``tz`` with the time zone name (e.g. ``'US/Central'``). .. 
ipython:: python @@ -2196,7 +2196,7 @@ A few notes on the generated table schema: tz='US/Central')) build_table_schema(s_tz) -- Periods are converted to timestamps before serialization, and so have the +* Periods are converted to timestamps before serialization, and so have the same behavior of being converted to UTC. In addition, periods will contain and additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``. @@ -2206,7 +2206,7 @@ A few notes on the generated table schema: periods=4)) build_table_schema(s_per) -- Categoricals use the ``any`` type and an ``enum`` constraint listing +* Categoricals use the ``any`` type and an ``enum`` constraint listing the set of possible values. Additionally, an ``ordered`` field is included: .. ipython:: python @@ -2214,7 +2214,7 @@ A few notes on the generated table schema: s_cat = pd.Series(pd.Categorical(['a', 'b', 'a'])) build_table_schema(s_cat) -- A ``primaryKey`` field, containing an array of labels, is included +* A ``primaryKey`` field, containing an array of labels, is included *if the index is unique*: .. ipython:: python @@ -2222,7 +2222,7 @@ A few notes on the generated table schema: s_dupe = pd.Series([1, 2], index=[1, 1]) build_table_schema(s_dupe) -- The ``primaryKey`` behavior is the same with MultiIndexes, but in this +* The ``primaryKey`` behavior is the same with MultiIndexes, but in this case the ``primaryKey`` is an array: .. ipython:: python @@ -2231,15 +2231,15 @@ A few notes on the generated table schema: (0, 1)])) build_table_schema(s_multi) -- The default naming roughly follows these rules: +* The default naming roughly follows these rules: - + For series, the ``object.name`` is used. If that's none, then the - name is ``values`` - + For ``DataFrames``, the stringified version of the column name is used - + For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a - fallback to ``index`` if that is None. - + For ``MultiIndex``, ``mi.names`` is used. If any level has no name, - then ``level_`` is used. + * For series, the ``object.name`` is used. If that's none, then the + name is ``values`` + * For ``DataFrames``, the stringified version of the column name is used + * For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a + fallback to ``index`` if that is None. + * For ``MultiIndex``, ``mi.names`` is used. If any level has no name, + then ``level_`` is used. .. versionadded:: 0.23.0 @@ -2601,55 +2601,55 @@ parse HTML tables in the top-level pandas io function ``read_html``. **Issues with** |lxml|_ - * Benefits +* Benefits - * |lxml|_ is very fast. + * |lxml|_ is very fast. - * |lxml|_ requires Cython to install correctly. + * |lxml|_ requires Cython to install correctly. - * Drawbacks +* Drawbacks - * |lxml|_ does *not* make any guarantees about the results of its parse - *unless* it is given |svm|_. + * |lxml|_ does *not* make any guarantees about the results of its parse + *unless* it is given |svm|_. - * In light of the above, we have chosen to allow you, the user, to use the - |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ - fails to parse + * In light of the above, we have chosen to allow you, the user, to use the + |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ + fails to parse - * It is therefore *highly recommended* that you install both - |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid - result (provided everything else is valid) even if |lxml|_ fails. 
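The backend trade-offs discussed here can be exercised explicitly through the ``flavor`` argument of ``read_html``; a hedged sketch (the URL is illustrative only, any page containing an HTML ``<table>`` will do):

.. code-block:: python

   import pandas as pd

   # illustrative URL only
   url = 'https://en.wikipedia.org/wiki/Minnesota'

   # prefer the fast lxml parser ...
   dfs_lxml = pd.read_html(url, flavor='lxml')

   # ... or force the more lenient BeautifulSoup4 + html5lib combination
   dfs_bs4 = pd.read_html(url, flavor='bs4')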
+ * It is therefore *highly recommended* that you install both + |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid + result (provided everything else is valid) even if |lxml|_ fails. **Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** - * The above issues hold here as well since |BeautifulSoup4|_ is essentially - just a wrapper around a parser backend. +* The above issues hold here as well since |BeautifulSoup4|_ is essentially + just a wrapper around a parser backend. **Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** - * Benefits +* Benefits - * |html5lib|_ is far more lenient than |lxml|_ and consequently deals - with *real-life markup* in a much saner way rather than just, e.g., - dropping an element without notifying you. + * |html5lib|_ is far more lenient than |lxml|_ and consequently deals + with *real-life markup* in a much saner way rather than just, e.g., + dropping an element without notifying you. - * |html5lib|_ *generates valid HTML5 markup from invalid markup - automatically*. This is extremely important for parsing HTML tables, - since it guarantees a valid document. However, that does NOT mean that - it is "correct", since the process of fixing markup does not have a - single definition. + * |html5lib|_ *generates valid HTML5 markup from invalid markup + automatically*. This is extremely important for parsing HTML tables, + since it guarantees a valid document. However, that does NOT mean that + it is "correct", since the process of fixing markup does not have a + single definition. - * |html5lib|_ is pure Python and requires no additional build steps beyond - its own installation. + * |html5lib|_ is pure Python and requires no additional build steps beyond + its own installation. - * Drawbacks +* Drawbacks - * The biggest drawback to using |html5lib|_ is that it is slow as - molasses. However consider the fact that many tables on the web are not - big enough for the parsing algorithm runtime to matter. It is more - likely that the bottleneck will be in the process of reading the raw - text from the URL over the web, i.e., IO (input-output). For very large - tables, this might not be true. + * The biggest drawback to using |html5lib|_ is that it is slow as + molasses. However consider the fact that many tables on the web are not + big enough for the parsing algorithm runtime to matter. It is more + likely that the bottleneck will be in the process of reading the raw + text from the URL over the web, i.e., IO (input-output). For very large + tables, this might not be true. .. |svm| replace:: **strictly valid markup** @@ -2753,13 +2753,13 @@ Specifying Sheets .. note :: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. -- The arguments ``sheet_name`` allows specifying the sheet or sheets to read. -- The default value for ``sheet_name`` is 0, indicating to read the first sheet -- Pass a string to refer to the name of a particular sheet in the workbook. -- Pass an integer to refer to the index of a sheet. Indices follow Python +* The arguments ``sheet_name`` allows specifying the sheet or sheets to read. +* The default value for ``sheet_name`` is 0, indicating to read the first sheet +* Pass a string to refer to the name of a particular sheet in the workbook. +* Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0. -- Pass a list of either strings or integers, to return a dictionary of specified sheets. 
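As a sketch of the ``sheet_name`` possibilities listed here (the file and sheet names are illustrative only):

.. code-block:: python

   import pandas as pd

   pd.read_excel('path_to_file.xls')                            # first sheet (sheet_name=0)
   pd.read_excel('path_to_file.xls', sheet_name='Sheet1')       # a sheet by name
   pd.read_excel('path_to_file.xls', sheet_name=[0, 'Sheet2'])  # dict of the listed sheets
   pd.read_excel('path_to_file.xls', sheet_name=None)           # dict of all sheets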
-- Pass a ``None`` to return a dictionary of all available sheets. +* Pass a list of either strings or integers, to return a dictionary of specified sheets. +* Pass a ``None`` to return a dictionary of all available sheets. .. code-block:: python @@ -3030,9 +3030,9 @@ files if `Xlsxwriter`_ is not available. To specify which writer you want to use, you can pass an engine keyword argument to ``to_excel`` and to ``ExcelWriter``. The built-in engines are: -- ``openpyxl``: version 2.4 or higher is required -- ``xlsxwriter`` -- ``xlwt`` +* ``openpyxl``: version 2.4 or higher is required +* ``xlsxwriter`` +* ``xlwt`` .. code-block:: python @@ -3055,8 +3055,8 @@ Style and Formatting The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the ``DataFrame``'s ``to_excel`` method. -- ``float_format`` : Format string for floating point numbers (default ``None``). -- ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). +* ``float_format`` : Format string for floating point numbers (default ``None``). +* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). @@ -3654,10 +3654,10 @@ data. A query is specified using the ``Term`` class under the hood, as a boolean expression. -- ``index`` and ``columns`` are supported indexers of a ``DataFrames``. -- ``major_axis``, ``minor_axis``, and ``items`` are supported indexers of +* ``index`` and ``columns`` are supported indexers of a ``DataFrames``. +* ``major_axis``, ``minor_axis``, and ``items`` are supported indexers of the Panel. -- if ``data_columns`` are specified, these can be used as additional indexers. +* if ``data_columns`` are specified, these can be used as additional indexers. Valid comparison operators are: @@ -3665,9 +3665,9 @@ Valid comparison operators are: Valid boolean expressions are combined with: -- ``|`` : or -- ``&`` : and -- ``(`` and ``)`` : for grouping +* ``|`` : or +* ``&`` : and +* ``(`` and ``)`` : for grouping These rules are similar to how boolean expressions are used in pandas for indexing. @@ -3680,16 +3680,16 @@ These rules are similar to how boolean expressions are used in pandas for indexi The following are valid expressions: -- ``'index >= date'`` -- ``"columns = ['A', 'D']"`` -- ``"columns in ['A', 'D']"`` -- ``'columns = A'`` -- ``'columns == A'`` -- ``"~(columns = ['A', 'B'])"`` -- ``'index > df.index[3] & string = "bar"'`` -- ``'(index > df.index[3] & index <= df.index[6]) | string = "bar"'`` -- ``"ts >= Timestamp('2012-02-01')"`` -- ``"major_axis>=20130101"`` +* ``'index >= date'`` +* ``"columns = ['A', 'D']"`` +* ``"columns in ['A', 'D']"`` +* ``'columns = A'`` +* ``'columns == A'`` +* ``"~(columns = ['A', 'B'])"`` +* ``'index > df.index[3] & string = "bar"'`` +* ``'(index > df.index[3] & index <= df.index[6]) | string = "bar"'`` +* ``"ts >= Timestamp('2012-02-01')"`` +* ``"major_axis>=20130101"`` The ``indexers`` are on the left-hand side of the sub-expression: @@ -3697,11 +3697,11 @@ The ``indexers`` are on the left-hand side of the sub-expression: The right-hand side of the sub-expression (after a comparison operator) can be: -- functions that will be evaluated, e.g. ``Timestamp('2012-02-01')`` -- strings, e.g. ``"bar"`` -- date-like, e.g. 
``20130101``, or ``"20130101"`` -- lists, e.g. ``"['A', 'B']"`` -- variables that are defined in the local names space, e.g. ``date`` +* functions that will be evaluated, e.g. ``Timestamp('2012-02-01')`` +* strings, e.g. ``"bar"`` +* date-like, e.g. ``20130101``, or ``"20130101"`` +* lists, e.g. ``"['A', 'B']"`` +* variables that are defined in the local names space, e.g. ``date`` .. note:: @@ -4080,15 +4080,15 @@ simple use case. You store panel-type data, with dates in the ``major_axis`` and ids in the ``minor_axis``. The data is then interleaved like this: -- date_1 - - id_1 - - id_2 - - . - - id_n -- date_2 - - id_1 - - . - - id_n +* date_1 + * id_1 + * id_2 + * . + * id_n +* date_2 + * id_1 + * . + * id_n It should be clear that a delete operation on the ``major_axis`` will be fairly quick, as one chunk is removed, then the following data moved. On @@ -4216,12 +4216,12 @@ Caveats need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See the (:issue:`2397`) for more information. -- If you use locks to manage write access between multiple processes, you +* If you use locks to manage write access between multiple processes, you may want to use :py:func:`~os.fsync` before releasing write locks. For convenience you can use ``store.flush(fsync=True)`` to do this for you. -- Once a ``table`` is created its items (Panel) / columns (DataFrame) +* Once a ``table`` is created its items (Panel) / columns (DataFrame) are fixed; only exactly the same columns can be appended -- Be aware that timezones (e.g., ``pytz.timezone('US/Eastern')``) +* Be aware that timezones (e.g., ``pytz.timezone('US/Eastern')``) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data @@ -4438,21 +4438,21 @@ Now you can import the ``DataFrame`` into R: Performance ''''''''''' -- ``tables`` format come with a writing performance penalty as compared to +* ``tables`` format come with a writing performance penalty as compared to ``fixed`` stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis. -- You can pass ``chunksize=`` to ``append``, specifying the +* You can pass ``chunksize=`` to ``append``, specifying the write chunksize (default is 50000). This will significantly lower your memory usage on writing. -- You can pass ``expectedrows=`` to the first ``append``, +* You can pass ``expectedrows=`` to the first ``append``, to set the TOTAL number of expected rows that ``PyTables`` will expected. This will optimize read/write performance. -- Duplicate rows can be written to tables, but are filtered out in +* Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs) -- A ``PerformanceWarning`` will be raised if you are attempting to +* A ``PerformanceWarning`` will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See `Here `__ @@ -4482,14 +4482,14 @@ dtypes, including extension dtypes such as categorical and datetime with tz. Several caveats. 
-- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible +* This is a newer library, and the format, though stable, is not guaranteed to be backward compatible to the earlier versions. -- The format will NOT write an ``Index``, or ``MultiIndex`` for the +* The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an error if a non-default one is provided. You can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to ignore it. -- Duplicate column names and non-string columns names are not supported -- Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message +* Duplicate column names and non-string columns names are not supported +* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message on an attempt at serialization. See the `Full Documentation `__. @@ -4550,10 +4550,10 @@ dtypes, including extension dtypes such as datetime with tz. Several caveats. -- Duplicate column names and non-string columns names are not supported. -- Index level names, if specified, must be strings. -- Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype. -- Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message +* Duplicate column names and non-string columns names are not supported. +* Index level names, if specified, must be strings. +* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype. +* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message on an attempt at serialization. You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``. diff --git a/doc/source/merging.rst b/doc/source/merging.rst index 45944ba56d4e7..b2cb388e3cd03 100644 --- a/doc/source/merging.rst +++ b/doc/source/merging.rst @@ -81,33 +81,33 @@ some configurable handling of "what to do with the other axes": keys=None, levels=None, names=None, verify_integrity=False, copy=True) -- ``objs`` : a sequence or mapping of Series, DataFrame, or Panel objects. If a +* ``objs`` : a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised. -- ``axis`` : {0, 1, ...}, default 0. The axis to concatenate along. -- ``join`` : {'inner', 'outer'}, default 'outer'. How to handle indexes on +* ``axis`` : {0, 1, ...}, default 0. The axis to concatenate along. +* ``join`` : {'inner', 'outer'}, default 'outer'. How to handle indexes on other axis(es). Outer for union and inner for intersection. -- ``ignore_index`` : boolean, default False. If True, do not use the index +* ``ignore_index`` : boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. -- ``join_axes`` : list of Index objects. Specific indexes to use for the other +* ``join_axes`` : list of Index objects. 
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic. -- ``keys`` : sequence, default None. Construct hierarchical index using the +* ``keys`` : sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If multiple levels passed, should contain tuples. -- ``levels`` : list of sequences, default None. Specific levels (unique values) +* ``levels`` : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys. -- ``names`` : list, default None. Names for the levels in the resulting +* ``names`` : list, default None. Names for the levels in the resulting hierarchical index. -- ``verify_integrity`` : boolean, default False. Check whether the new +* ``verify_integrity`` : boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation. -- ``copy`` : boolean, default True. If False, do not copy data unnecessarily. +* ``copy`` : boolean, default True. If False, do not copy data unnecessarily. Without a little bit of context many of these arguments don't make much sense. Let's revisit the above example. Suppose we wanted to associate specific keys @@ -156,10 +156,10 @@ When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done in the following three ways: -- Take the union of them all, ``join='outer'``. This is the default +* Take the union of them all, ``join='outer'``. This is the default option as it results in zero information loss. -- Take the intersection, ``join='inner'``. -- Use a specific index, as passed to the ``join_axes`` argument. +* Take the intersection, ``join='inner'``. +* Use a specific index, as passed to the ``join_axes`` argument. Here is an example of each of these methods. First, the default ``join='outer'`` behavior: @@ -531,52 +531,52 @@ all standard database join operations between ``DataFrame`` objects: suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) -- ``left``: A DataFrame object. -- ``right``: Another DataFrame object. -- ``on``: Column or index level names to join on. Must be found in both the left +* ``left``: A DataFrame object. +* ``right``: Another DataFrame object. +* ``on``: Column or index level names to join on. Must be found in both the left and right DataFrame objects. If not passed and ``left_index`` and ``right_index`` are ``False``, the intersection of the columns in the DataFrames will be inferred to be the join keys. -- ``left_on``: Columns or index levels from the left DataFrame to use as +* ``left_on``: Columns or index levels from the left DataFrame to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame. -- ``right_on``: Columns or index levels from the right DataFrame to use as +* ``right_on``: Columns or index levels from the right DataFrame to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame. -- ``left_index``: If ``True``, use the index (row labels) from the left +* ``left_index``: If ``True``, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame. 
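A short, hedged sketch of a few of the ``concat`` arguments described above (the frames and keys are illustrative):

.. code-block:: python

   import pandas as pd

   df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
   df2 = pd.DataFrame({'A': ['A2', 'A3'], 'C': ['C2', 'C3']}, index=[2, 3])

   # label each piece with a key, producing a hierarchical index on the rows
   keyed = pd.concat([df1, df2], keys=['x', 'y'])

   # keep only the columns common to both frames and relabel the rows 0..n-1
   inner = pd.concat([df1, df2], join='inner', ignore_index=True)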
-- ``right_index``: Same usage as ``left_index`` for the right DataFrame -- ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults +* ``right_index``: Same usage as ``left_index`` for the right DataFrame +* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults to ``inner``. See below for more detailed description of each method. -- ``sort``: Sort the result DataFrame by the join keys in lexicographical +* ``sort``: Sort the result DataFrame by the join keys in lexicographical order. Defaults to ``True``, setting to ``False`` will improve performance substantially in many cases. -- ``suffixes``: A tuple of string suffixes to apply to overlapping +* ``suffixes``: A tuple of string suffixes to apply to overlapping columns. Defaults to ``('_x', '_y')``. -- ``copy``: Always copy data (default ``True``) from the passed DataFrame +* ``copy``: Always copy data (default ``True``) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless. -- ``indicator``: Add a column to the output DataFrame called ``_merge`` +* ``indicator``: Add a column to the output DataFrame called ``_merge`` with information on the source of each row. ``_merge`` is Categorical-type and takes on a value of ``left_only`` for observations whose merge key only appears in ``'left'`` DataFrame, ``right_only`` for observations whose merge key only appears in ``'right'`` DataFrame, and ``both`` if the observation's merge key is found in both. -- ``validate`` : string, default None. +* ``validate`` : string, default None. If specified, checks if merge is of specified type. - * "one_to_one" or "1:1": checks if merge keys are unique in both - left and right datasets. - * "one_to_many" or "1:m": checks if merge keys are unique in left - dataset. - * "many_to_one" or "m:1": checks if merge keys are unique in right - dataset. - * "many_to_many" or "m:m": allowed, but does not result in checks. + * "one_to_one" or "1:1": checks if merge keys are unique in both + left and right datasets. + * "one_to_many" or "1:m": checks if merge keys are unique in left + dataset. + * "many_to_one" or "m:1": checks if merge keys are unique in right + dataset. + * "many_to_many" or "m:m": allowed, but does not result in checks. .. versionadded:: 0.21.0 @@ -605,11 +605,11 @@ terminology used to describe join operations between two SQL-table like structures (``DataFrame`` objects). There are several cases to consider which are very important to understand: -- **one-to-one** joins: for example when joining two ``DataFrame`` objects on +* **one-to-one** joins: for example when joining two ``DataFrame`` objects on their indexes (which must contain unique values). -- **many-to-one** joins: for example when joining an index (unique) to one or +* **many-to-one** joins: for example when joining an index (unique) to one or more columns in a different ``DataFrame``. -- **many-to-many** joins: joining columns on columns. +* **many-to-many** joins: joining columns on columns. .. 
note:: diff --git a/doc/source/options.rst b/doc/source/options.rst index 697cc0682e39a..cbe0264f442bc 100644 --- a/doc/source/options.rst +++ b/doc/source/options.rst @@ -31,10 +31,10 @@ You can get/set options directly as attributes of the top-level ``options`` attr The API is composed of 5 relevant functions, available directly from the ``pandas`` namespace: -- :func:`~pandas.get_option` / :func:`~pandas.set_option` - get/set the value of a single option. -- :func:`~pandas.reset_option` - reset one or more options to their default value. -- :func:`~pandas.describe_option` - print the descriptions of one or more options. -- :func:`~pandas.option_context` - execute a codeblock with a set of options +* :func:`~pandas.get_option` / :func:`~pandas.set_option` - get/set the value of a single option. +* :func:`~pandas.reset_option` - reset one or more options to their default value. +* :func:`~pandas.describe_option` - print the descriptions of one or more options. +* :func:`~pandas.option_context` - execute a codeblock with a set of options that revert to prior settings after execution. **Note:** Developers can check out `pandas/core/config.py `_ for more information. diff --git a/doc/source/overview.rst b/doc/source/overview.rst index f86b1c67e6843..6ba9501ba0b5e 100644 --- a/doc/source/overview.rst +++ b/doc/source/overview.rst @@ -12,19 +12,19 @@ programming language. :mod:`pandas` consists of the following elements: - * A set of labeled array data structures, the primary of which are - Series and DataFrame. - * Index objects enabling both simple axis indexing and multi-level / - hierarchical axis indexing. - * An integrated group by engine for aggregating and transforming data sets. - * Date range generation (date_range) and custom date offsets enabling the - implementation of customized frequencies. - * Input/Output tools: loading tabular data from flat files (CSV, delimited, - Excel 2003), and saving and loading pandas objects from the fast and - efficient PyTables/HDF5 format. - * Memory-efficient "sparse" versions of the standard data structures for storing - data that is mostly missing or mostly constant (some fixed value). - * Moving window statistics (rolling mean, rolling standard deviation, etc.). +* A set of labeled array data structures, the primary of which are + Series and DataFrame. +* Index objects enabling both simple axis indexing and multi-level / + hierarchical axis indexing. +* An integrated group by engine for aggregating and transforming data sets. +* Date range generation (date_range) and custom date offsets enabling the + implementation of customized frequencies. +* Input/Output tools: loading tabular data from flat files (CSV, delimited, + Excel 2003), and saving and loading pandas objects from the fast and + efficient PyTables/HDF5 format. +* Memory-efficient "sparse" versions of the standard data structures for storing + data that is mostly missing or mostly constant (some fixed value). +* Moving window statistics (rolling mean, rolling standard deviation, etc.). Data Structures --------------- diff --git a/doc/source/reshaping.rst b/doc/source/reshaping.rst index 250a1808e496e..88b7114cf4101 100644 --- a/doc/source/reshaping.rst +++ b/doc/source/reshaping.rst @@ -106,12 +106,12 @@ Closely related to the :meth:`~DataFrame.pivot` method are the related ``MultiIndex`` objects (see the section on :ref:`hierarchical indexing `). 
Here are essentially what these methods do: - - ``stack``: "pivot" a level of the (possibly hierarchical) column labels, - returning a ``DataFrame`` with an index with a new inner-most level of row - labels. - - ``unstack``: (inverse operation of ``stack``) "pivot" a level of the - (possibly hierarchical) row index to the column axis, producing a reshaped - ``DataFrame`` with a new inner-most level of column labels. +* ``stack``: "pivot" a level of the (possibly hierarchical) column labels, + returning a ``DataFrame`` with an index with a new inner-most level of row + labels. +* ``unstack``: (inverse operation of ``stack``) "pivot" a level of the + (possibly hierarchical) row index to the column axis, producing a reshaped + ``DataFrame`` with a new inner-most level of column labels. .. image:: _static/reshaping_unstack.png @@ -132,8 +132,8 @@ from the hierarchical indexing section: The ``stack`` function "compresses" a level in the ``DataFrame``'s columns to produce either: - - A ``Series``, in the case of a simple column Index. - - A ``DataFrame``, in the case of a ``MultiIndex`` in the columns. +* A ``Series``, in the case of a simple column Index. +* A ``DataFrame``, in the case of a ``MultiIndex`` in the columns. If the columns have a ``MultiIndex``, you can choose which level to stack. The stacked level becomes the new lowest level in a ``MultiIndex`` on the columns: @@ -351,13 +351,13 @@ strategies. It takes a number of arguments: -- ``data``: a DataFrame object. -- ``values``: a column or a list of columns to aggregate. -- ``index``: a column, Grouper, array which has the same length as data, or list of them. +* ``data``: a DataFrame object. +* ``values``: a column or a list of columns to aggregate. +* ``index``: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values. -- ``columns``: a column, Grouper, array which has the same length as data, or list of them. +* ``columns``: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values. -- ``aggfunc``: function to use for aggregation, defaulting to ``numpy.mean``. +* ``aggfunc``: function to use for aggregation, defaulting to ``numpy.mean``. Consider a data set like this: @@ -431,17 +431,17 @@ unless an array of values and an aggregation function are passed. It takes a number of arguments -- ``index``: array-like, values to group by in the rows. -- ``columns``: array-like, values to group by in the columns. -- ``values``: array-like, optional, array of values to aggregate according to +* ``index``: array-like, values to group by in the rows. +* ``columns``: array-like, values to group by in the columns. +* ``values``: array-like, optional, array of values to aggregate according to the factors. -- ``aggfunc``: function, optional, If no values array is passed, computes a +* ``aggfunc``: function, optional, If no values array is passed, computes a frequency table. -- ``rownames``: sequence, default ``None``, must match number of row arrays passed. -- ``colnames``: sequence, default ``None``, if passed, must match number of column +* ``rownames``: sequence, default ``None``, must match number of row arrays passed. +* ``colnames``: sequence, default ``None``, if passed, must match number of column arrays passed. 
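For example, using a few of the ``pivot_table`` arguments listed above (the data and aggregation choice are illustrative):

.. code-block:: python

   import numpy as np
   import pandas as pd

   df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                      'B': ['x', 'y', 'x', 'y'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

   # group the rows by A, the columns by B, and average C within each cell
   pd.pivot_table(df, values='C', index='A', columns='B', aggfunc=np.mean)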
-- ``margins``: boolean, default ``False``, Add row/column margins (subtotals) -- ``normalize``: boolean, {'all', 'index', 'columns'}, or {0,1}, default ``False``. +* ``margins``: boolean, default ``False``, Add row/column margins (subtotals) +* ``normalize``: boolean, {'all', 'index', 'columns'}, or {0,1}, default ``False``. Normalize by dividing all values by the sum of values. @@ -615,10 +615,10 @@ As with the ``Series`` version, you can pass values for the ``prefix`` and ``prefix_sep``. By default the column name is used as the prefix, and '_' as the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways: -- string: Use the same value for ``prefix`` or ``prefix_sep`` for each column +* string: Use the same value for ``prefix`` or ``prefix_sep`` for each column to be encoded. -- list: Must be the same length as the number of columns being encoded. -- dict: Mapping column name to prefix. +* list: Must be the same length as the number of columns being encoded. +* dict: Mapping column name to prefix. .. ipython:: python diff --git a/doc/source/sparse.rst b/doc/source/sparse.rst index 260d8aa32ef52..2bb99dd1822b6 100644 --- a/doc/source/sparse.rst +++ b/doc/source/sparse.rst @@ -104,9 +104,9 @@ Sparse data should have the same dtype as its dense representation. Currently, ``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original dtype, ``fill_value`` default changes: -- ``float64``: ``np.nan`` -- ``int64``: ``0`` -- ``bool``: ``False`` +* ``float64``: ``np.nan`` +* ``int64``: ``0`` +* ``bool``: ``False`` .. ipython:: python diff --git a/doc/source/timeseries.rst b/doc/source/timeseries.rst index ded54d2d355f1..ba58d65b00714 100644 --- a/doc/source/timeseries.rst +++ b/doc/source/timeseries.rst @@ -28,11 +28,11 @@ a tremendous amount of new functionality for manipulating time series data. In working with time series data, we will frequently seek to: - - generate sequences of fixed-frequency dates and time spans - - conform or convert time series to a particular frequency - - compute "relative" dates based on various non-standard time increments - (e.g. 5 business days before the last business day of the year), or "roll" - dates forward or backward +* generate sequences of fixed-frequency dates and time spans +* conform or convert time series to a particular frequency +* compute "relative" dates based on various non-standard time increments + (e.g. 5 business days before the last business day of the year), or "roll" + dates forward or backward pandas provides a relatively compact and self-contained set of tools for performing the above tasks. @@ -226,8 +226,8 @@ You can pass only the columns that you need to assemble. ``pd.to_datetime`` looks for standard designations of the datetime component in the column names, including: -- required: ``year``, ``month``, ``day`` -- optional: ``hour``, ``minute``, ``second``, ``millisecond``, ``microsecond``, ``nanosecond`` +* required: ``year``, ``month``, ``day`` +* optional: ``hour``, ``minute``, ``second``, ``millisecond``, ``microsecond``, ``nanosecond`` Invalid Data ~~~~~~~~~~~~ @@ -463,14 +463,14 @@ Indexing One of the main uses for ``DatetimeIndex`` is as an index for pandas objects. The ``DatetimeIndex`` class contains many time series related optimizations: - - A large range of dates for various offsets are pre-computed and cached - under the hood in order to make generating subsequent date ranges very fast - (just have to grab a slice). 
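A minimal sketch of assembling datetimes from the component columns named above (the frame itself is illustrative):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'year': [2015, 2016],
                      'month': [2, 3],
                      'day': [4, 5],
                      'hour': [2, 3]})

   # to_datetime recognises the standard component column names listed above
   pd.to_datetime(df)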
- - Fast shifting using the ``shift`` and ``tshift`` method on pandas objects. - - Unioning of overlapping ``DatetimeIndex`` objects with the same frequency is - very fast (important for fast data alignment). - - Quick access to date fields via properties such as ``year``, ``month``, etc. - - Regularization functions like ``snap`` and very fast ``asof`` logic. +* A large range of dates for various offsets are pre-computed and cached + under the hood in order to make generating subsequent date ranges very fast + (just have to grab a slice). +* Fast shifting using the ``shift`` and ``tshift`` method on pandas objects. +* Unioning of overlapping ``DatetimeIndex`` objects with the same frequency is + very fast (important for fast data alignment). +* Quick access to date fields via properties such as ``year``, ``month``, etc. +* Regularization functions like ``snap`` and very fast ``asof`` logic. ``DatetimeIndex`` objects have all the basic functionality of regular ``Index`` objects, and a smorgasbord of advanced time series specific methods for easy @@ -797,11 +797,11 @@ We could have done the same thing with ``DateOffset``: The key features of a ``DateOffset`` object are: -- It can be added / subtracted to/from a datetime object to obtain a +* It can be added / subtracted to/from a datetime object to obtain a shifted date. -- It can be multiplied by an integer (positive or negative) so that the +* It can be multiplied by an integer (positive or negative) so that the increment will be applied multiple times. -- It has :meth:`~pandas.DateOffset.rollforward` and +* It has :meth:`~pandas.DateOffset.rollforward` and :meth:`~pandas.DateOffset.rollback` methods for moving a date forward or backward to the next or previous "offset date". @@ -2064,9 +2064,9 @@ To supply the time zone, you can use the ``tz`` keyword to ``date_range`` and other functions. Dateutil time zone strings are distinguished from ``pytz`` time zones by starting with ``dateutil/``. -- In ``pytz`` you can find a list of common (and less common) time zones using +* In ``pytz`` you can find a list of common (and less common) time zones using ``from pytz import common_timezones, all_timezones``. -- ``dateutil`` uses the OS timezones so there isn't a fixed list available. For +* ``dateutil`` uses the OS timezones so there isn't a fixed list available. For common zones, the names are the same as ``pytz``. .. ipython:: python diff --git a/doc/source/tutorials.rst b/doc/source/tutorials.rst index 895fe595de205..381031fa128e6 100644 --- a/doc/source/tutorials.rst +++ b/doc/source/tutorials.rst @@ -28,33 +28,33 @@ repository `_. To run the examples in th clone the GitHub repository and get IPython Notebook running. See `How to use this cookbook `_. -- `A quick tour of the IPython Notebook: `_ +* `A quick tour of the IPython Notebook: `_ Shows off IPython's awesome tab completion and magic functions. -- `Chapter 1: `_ +* `Chapter 1: `_ Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong! -- `Chapter 2: `_ +* `Chapter 2: `_ It's not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns) -- `Chapter 3: `_ +* `Chapter 3: `_ Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast. -- `Chapter 4: `_ +* `Chapter 4: `_ Groupby/aggregate is seriously my favorite thing about pandas and I use it all the time. You should probably read this. 
-- `Chapter 5: `_ +* `Chapter 5: `_ Here you get to find out if it's cold in Montreal in the winter (spoiler: yes). Web scraping with pandas is fun! Here we combine dataframes. -- `Chapter 6: `_ +* `Chapter 6: `_ Strings with pandas are great. It has all these vectorized string operations and they're the best. We will turn a bunch of strings containing "Snow" into vectors of numbers in a trice. -- `Chapter 7: `_ +* `Chapter 7: `_ Cleaning up messy data is never a joy, but with pandas it's easier. -- `Chapter 8: `_ +* `Chapter 8: `_ Parsing Unix timestamps is confusing at first but it turns out to be really easy. -- `Chapter 9: `_ +* `Chapter 9: `_ Reading data from SQL databases. @@ -63,54 +63,54 @@ Lessons for new pandas users For more resources, please visit the main `repository `__. -- `01 - Lesson: `_ - - Importing libraries - - Creating data sets - - Creating data frames - - Reading from CSV - - Exporting to CSV - - Finding maximums - - Plotting data +* `01 - Lesson: `_ + * Importing libraries + * Creating data sets + * Creating data frames + * Reading from CSV + * Exporting to CSV + * Finding maximums + * Plotting data -- `02 - Lesson: `_ - - Reading from TXT - - Exporting to TXT - - Selecting top/bottom records - - Descriptive statistics - - Grouping/sorting data +* `02 - Lesson: `_ + * Reading from TXT + * Exporting to TXT + * Selecting top/bottom records + * Descriptive statistics + * Grouping/sorting data -- `03 - Lesson: `_ - - Creating functions - - Reading from EXCEL - - Exporting to EXCEL - - Outliers - - Lambda functions - - Slice and dice data +* `03 - Lesson: `_ + * Creating functions + * Reading from EXCEL + * Exporting to EXCEL + * Outliers + * Lambda functions + * Slice and dice data -- `04 - Lesson: `_ - - Adding/deleting columns - - Index operations +* `04 - Lesson: `_ + * Adding/deleting columns + * Index operations -- `05 - Lesson: `_ - - Stack/Unstack/Transpose functions +* `05 - Lesson: `_ + * Stack/Unstack/Transpose functions -- `06 - Lesson: `_ - - GroupBy function +* `06 - Lesson: `_ + * GroupBy function -- `07 - Lesson: `_ - - Ways to calculate outliers +* `07 - Lesson: `_ + * Ways to calculate outliers -- `08 - Lesson: `_ - - Read from Microsoft SQL databases +* `08 - Lesson: `_ + * Read from Microsoft SQL databases -- `09 - Lesson: `_ - - Export to CSV/EXCEL/TXT +* `09 - Lesson: `_ + * Export to CSV/EXCEL/TXT -- `10 - Lesson: `_ - - Converting between different kinds of formats +* `10 - Lesson: `_ + * Converting between different kinds of formats -- `11 - Lesson: `_ - - Combining data from various sources +* `11 - Lesson: `_ + * Combining data from various sources Practical data analysis with Python @@ -119,13 +119,13 @@ Practical data analysis with Python This `guide `_ is a comprehensive introduction to the data analysis process using the Python data ecosystem and an interesting open dataset. There are four sections covering selected topics as follows: -- `Munging Data `_ +* `Munging Data `_ -- `Aggregating Data `_ +* `Aggregating Data `_ -- `Visualizing Data `_ +* `Visualizing Data `_ -- `Time Series `_ +* `Time Series `_ .. _tutorial-exercises-new-users: @@ -134,25 +134,25 @@ Exercises for new users Practice your skills with real data sets and exercises. For more resources, please visit the main `repository `__. 
-- `01 - Getting & Knowing Your Data `_ +* `01 - Getting & Knowing Your Data `_ -- `02 - Filtering & Sorting `_ +* `02 - Filtering & Sorting `_ -- `03 - Grouping `_ +* `03 - Grouping `_ -- `04 - Apply `_ +* `04 - Apply `_ -- `05 - Merge `_ +* `05 - Merge `_ -- `06 - Stats `_ +* `06 - Stats `_ -- `07 - Visualization `_ +* `07 - Visualization `_ -- `08 - Creating Series and DataFrames `_ +* `08 - Creating Series and DataFrames `_ -- `09 - Time Series `_ +* `09 - Time Series `_ -- `10 - Deleting `_ +* `10 - Deleting `_ .. _tutorial-modern: @@ -164,29 +164,29 @@ Tutorial series written in 2016 by The source may be found in the GitHub repository `TomAugspurger/effective-pandas `_. -- `Modern Pandas `_ -- `Method Chaining `_ -- `Indexes `_ -- `Performance `_ -- `Tidy Data `_ -- `Visualization `_ -- `Timeseries `_ +* `Modern Pandas `_ +* `Method Chaining `_ +* `Indexes `_ +* `Performance `_ +* `Tidy Data `_ +* `Visualization `_ +* `Timeseries `_ Excel charts with pandas, vincent and xlsxwriter ------------------------------------------------ -- `Using Pandas and XlsxWriter to create Excel charts `_ +* `Using Pandas and XlsxWriter to create Excel charts `_ Video Tutorials --------------- -- `Pandas From The Ground Up `_ +* `Pandas From The Ground Up `_ (2015) (2:24) `GitHub repo `__ -- `Introduction Into Pandas `_ +* `Introduction Into Pandas `_ (2016) (1:28) `GitHub repo `__ -- `Pandas: .head() to .tail() `_ +* `Pandas: .head() to .tail() `_ (2016) (1:26) `GitHub repo `__ @@ -194,12 +194,12 @@ Video Tutorials Various Tutorials ----------------- -- `Wes McKinney's (pandas BDFL) blog `_ -- `Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson `_ -- `Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013 `_ -- `Financial analysis in Python, by Thomas Wiecki `_ -- `Intro to pandas data structures, by Greg Reda `_ -- `Pandas and Python: Top 10, by Manish Amde `_ -- `Pandas Tutorial, by Mikhail Semeniuk `_ -- `Pandas DataFrames Tutorial, by Karlijn Willems `_ -- `A concise tutorial with real life examples `_ +* `Wes McKinney's (pandas BDFL) blog `_ +* `Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson `_ +* `Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013 `_ +* `Financial analysis in Python, by Thomas Wiecki `_ +* `Intro to pandas data structures, by Greg Reda `_ +* `Pandas and Python: Top 10, by Manish Amde `_ +* `Pandas Tutorial, by Mikhail Semeniuk `_ +* `Pandas DataFrames Tutorial, by Karlijn Willems `_ +* `A concise tutorial with real life examples `_ diff --git a/doc/source/visualization.rst b/doc/source/visualization.rst index 17197b805e86a..569a6fb7b7a0d 100644 --- a/doc/source/visualization.rst +++ b/doc/source/visualization.rst @@ -1381,9 +1381,9 @@ Plotting with error bars is supported in :meth:`DataFrame.plot` and :meth:`Serie Horizontal and vertical error bars can be supplied to the ``xerr`` and ``yerr`` keyword arguments to :meth:`~DataFrame.plot()`. The error values can be specified using a variety of formats: -- As a :class:`DataFrame` or ``dict`` of errors with column names matching the ``columns`` attribute of the plotting :class:`DataFrame` or matching the ``name`` attribute of the :class:`Series`. -- As a ``str`` indicating which of the columns of plotting :class:`DataFrame` contain the error values. -- As raw values (``list``, ``tuple``, or ``np.ndarray``). 
Must be the same length as the plotting :class:`DataFrame`/:class:`Series`. +* As a :class:`DataFrame` or ``dict`` of errors with column names matching the ``columns`` attribute of the plotting :class:`DataFrame` or matching the ``name`` attribute of the :class:`Series`. +* As a ``str`` indicating which of the columns of plotting :class:`DataFrame` contain the error values. +* As raw values (``list``, ``tuple``, or ``np.ndarray``). Must be the same length as the plotting :class:`DataFrame`/:class:`Series`. Asymmetrical error bars are also supported, however raw error values must be provided in this case. For a ``M`` length :class:`Series`, a ``Mx2`` array should be provided indicating lower and upper (or left and right) errors. For a ``MxN`` :class:`DataFrame`, asymmetrical errors should be in a ``Mx2xN`` array.
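A small sketch using the ``dict``-of-errors format described above (the data and error magnitudes are illustrative, and matplotlib is assumed to be installed):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})

   # symmetric errors supplied as a dict keyed by column name
   errors = {'a': [0.2, 0.2, 0.2], 'b': [0.1, 0.3, 0.1]}

   ax = df.plot.bar(yerr=errors)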