diff --git a/notebooks/1_table_oriented.ipynb b/notebooks/1_table_oriented.ipynb new file mode 100644 index 0000000..432f6c6 --- /dev/null +++ b/notebooks/1_table_oriented.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas is table oriented" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to start using Pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is `pd`, so loading pandas as `pd` is assumed to be standard practice throughout the pandas documentation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pandas data table representation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/01_table_dataframe.svg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and sex (male/female) data." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr><th></th><th>Name</th><th>Age</th><th>Sex</th></tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr><th>0</th><td>Braund, Mr. Owen Harris</td><td>22</td><td>male</td></tr>\n", + " <tr><th>1</th><td>Allen, Mr. William Henry</td><td>35</td><td>male</td></tr>\n", + " <tr><th>2</th><td>Bonnell, Miss. Elizabeth</td><td>58</td><td>female</td></tr>\n", + " </tbody>\n", + "</table>
" + ], + "text/plain": [ + " Name Age Sex\n", + "0 Braund, Mr. Owen Harris 22 male\n", + "1 Allen, Mr. William Henry 35 male\n", + "2 Bonnell, Miss. Elizabeth 58 female" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({\n", + " \"Name\": [\"Braund, Mr. Owen Harris\",\n", + " \"Allen, Mr. William Henry\",\n", + " \"Bonnell, Miss. Elizabeth\"],\n", + " \"Age\": [22, 35, 58],\n", + " \"Sex\": [\"male\", \"male\", \"female\"]}\n", + " )\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.\n", + "\n", + "- The table has 3 columns, each with a column label. The column labels are `Name`, `Age` and `Sex`, respectively.\n", + "- The column `Name` consists of textual data with each value a string, the column `Age` consists of numbers and the column `Sex` consists of textual data.\n", + "\n", + "In spreadsheet software, the table representation of our data would look very similar:\n", + "\n", + "![](../schemas/01_table_spreadsheet.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Each column in a `DataFrame` is a `Series`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](../schemas/01_table_series.svg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I'm just interested in working with the data in the column `Age`" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 22\n", + "1 35\n", + "2 58\n", + "Name: Age, dtype: int64" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"Age\"]" + ] + 
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`. To select the column, use the column label in between square brackets `[]`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "If you are familiar with Python :ref:`dictionaries `, the selection of a single column is very similar to the selection of dictionary values based on the key.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can create a `Series` from scratch as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 22\n", + "1 35\n", + "2 58\n", + "Name: Age, dtype: int64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages = pd.Series([22, 35, 58], name=\"Age\")\n", + "ages" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A pandas `Series` has no column labels, as it is just a single column of a `DataFrame`. A `Series` does have row labels." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Do something with a DataFrame or Series" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I want to know the maximum Age of the passengers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can do this on the `DataFrame` by selecting the `Age` column and applying `max()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "58" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"Age\"].max()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or directly on the `Series`:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "58" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ages.max()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As illustrated by the `max()` method, you can _do_ things with a `DataFrame` or `Series`. Pandas provides a lot of functionality, each piece a _method_ you can apply to a `DataFrame` or `Series`. 
As methods are functions, do not forget to use parentheses `()`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> I'm interested in some basic statistics of the numerical data of my data table" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr><th></th><th>Age</th></tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr><th>count</th><td>3.000000</td></tr>\n", + " <tr><th>mean</th><td>38.333333</td></tr>\n", + " <tr><th>std</th><td>18.230012</td></tr>\n", + " <tr><th>min</th><td>22.000000</td></tr>\n", + " <tr><th>25%</th><td>28.500000</td></tr>\n", + " <tr><th>50%</th><td>35.000000</td></tr>\n", + " <tr><th>75%</th><td>46.500000</td></tr>\n", + " <tr><th>max</th><td>58.000000</td></tr>\n", + " </tbody>\n", + "</table>
" + ], + "text/plain": [ + " Age\n", + "count 3.000000\n", + "mean 38.333333\n", + "std 18.230012\n", + "min 22.000000\n", + "25% 28.500000\n", + "50% 35.000000\n", + "75% 46.500000\n", + "max 58.000000" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `describe` method provides a quick overview of the numerical data in a `DataFrame`. As the `Name` and `Sex` columns are textual data, these are by default not taken into account by the `describe` method. Many pandas operations return a `DataFrame` or a `Series`. The `describe` method is an example: called on a `DataFrame`, as here, it returns a `DataFrame`; called on a `Series`, it returns a `Series`.\n", + "\n", + "\n", + "__To user guide:__ check more options on `describe` in :ref:`basics.describe`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "__Note:__ This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Beyond the representation, pandas also supports the data manipulations and calculations you would do in spreadsheet software. Continue reading the next tutorials to get started!\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## REMEMBER\n", + "\n", + "- Import the package, i.e. `import pandas as pd`\n", + "- A table of data is stored as a pandas `DataFrame`\n", + "- Each column in a `DataFrame` is a `Series`\n", + "- You can do things by applying a method to a `DataFrame` or `Series`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__To user guide:__ A more extended introduction to `DataFrame` and `Series` is provided in :ref:`dsintro`." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
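The points in the REMEMBER list can be checked in one short script. This is a minimal sketch that reuses the three passengers from the notebook; the variable names `oldest`, `summary_df` and `summary_series` are illustrative and do not appear in the notebook itself.

```python
import pandas as pd

# Same three passengers as in the notebook.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Selecting a single column with square brackets gives a Series.
ages = df["Age"]
print(type(ages).__name__)  # Series

# A Series has a name and row labels, but no column labels.
print(ages.name)            # Age
print(list(ages.index))     # [0, 1, 2]

# Methods are functions, so the parentheses are required to call them.
oldest = ages.max()
print(oldest)               # 58

# describe() keeps only the numeric columns by default.
# On a DataFrame it returns a DataFrame; on a Series it returns a Series.
summary_df = df.describe()
summary_series = ages.describe()
print(type(summary_df).__name__, list(summary_df.columns))  # DataFrame ['Age']
print(type(summary_series).__name__)                         # Series
```

Writing `ages.max` without parentheses would return the method object rather than the maximum, which is the mistake the note about parentheses warns against.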