{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 6. Pandas" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What is `pandas`?\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Lot of answers, depends on your use case. \n", "- Primarily, pandas is a package built around the DataFrame\n", "- DataFrames are a more flexible, SQL-like data container" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Numpy vs Pandas\n", "\n", "When would we want to use `pandas` instead of a trusty `numpy` array?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- When we need to index columns by name\n", "- When we need to index *rows* by name \n", "- When we want to store heterogeneous data (e.g., one column of strings, next column of floats)\n", "- When we have multiple tables we want to join together\n", "- When we want to perform complex filtering operations on a single table. \n", "- When the I/O for some data on disk is frustrating using `numpy`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that the first bullet point (indexing columns of an array by name) can be accomplished via some other data containers, including `Astropy` `Table`s and `numpy` `record arrays`. \n", "\n", "*However*, many of the other bullet points are not easily accomplished with these methods. *Furthermore*, `pandas` is the industry standard in data science and other fields. Learning it is a transferrable skill." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the purposes of this introduction, we need to learn about 2 `pandas` objects: \n", "- `pd.Series`, and \n", "- `pd.DataFrame`. \n", "\n", "DataFrames are constructed out of Series objects, but as it turns out, we don't need to define them explicitly on most occasions. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's set up a data frame:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np \n", "import pandas as pd \n", "\n", "array1 = np.array([1,2,3])\n", "array2 = ['Good','Bad','Ugly']\n", "\n", "df = pd.DataFrame({'measurement':array1,'quality':array2})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, one valid initialization is feeding in a dictionary with the desired column names as keys and the desired columns as values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here's another way:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array1 = np.array([1,2,3])\n", "array2 = ['Good','Bad','Ugly']\n", "\n", "df = pd.DataFrame()\n", "df['measurement'] = array1\n", "df['quality'] = array2\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "i.e., we can create empty `DataFrames`, and dynamically set new columns the way we are used to setting the values of dictionaries for a given key. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It is worth noting that this \"dictionary-style\" indexing of DataFrame columns is not the only way. If our column names have no spaces, we can also use dot notation:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['measurement']" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.measurement" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, both forms return the same array. Notice, though, that this \"column\" isn't just an array... here we see our first pandas `Series`. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can think of a `Series` as a sort of wrapper/container around data arrays that provides extra functionality. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Good\n", "1 Bad\n", "2 Ugly\n", "Name: quality, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.quality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can see that this Series object has a `Name` (which it tells us when we print it), and it has an `index` --- that's the 0,1,2 over on the left. \n", "As a note, if we ever want to extract *just* the data values of a series, we run" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Good', 'Bad', 'Ugly'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.quality.values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "I've found that the dot-notation indexing of DataFrames can be very handy -- `.thing` is fewer characters to type than `['thing']`, and doesn't require your fingers to find the `[]` keys. \n", "\n", "Let's come back to our DataFrame, `df`. What's going on with this index on the left? " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 1\n", "quality Good\n", "Name: 0, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **index** part of a DataFrame is a way to index specific *rows*, rather than specific *columns*. As you can see above, I used the `.loc[]` syntax, which tells pandas I'm inserting a row index, rather than a \"column index\". Pandas then spits out each column and associated value for that row." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can set the index of a DataFrame to be whatever we want. For example, let's say our measurements were taken on 3 dates:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df.index = ['2021-10-07','2021-10-08','2021-10-09']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071Good
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1 Good\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the `[0,1,2]` is replaced by our date strings. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now, if I want to pull the measurement and measurement quality from `2021-10-08`, I would use" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 2\n", "quality Bad\n", "Name: 2021-10-08, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2021-10-08']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What if I know the numerical index of the row I want, but don't know the name of that row? Pandas doesn't *love* it when we are in this state, but there is a solution:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 2\n", "quality Bad\n", "Name: 2021-10-08, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `df.iloc[]` syntax allows you to index a row by true index number, rather than whatever the name of that row is. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filtering DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So now that we know how to create a simple DataFrame, how do we go about filtering it? Instead of creating external masks or using `np.where()`, `pandas` has a built in set of methods for accessing parts of a dataframe." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Sticking with our measurement df, what if I want a view of my DataFrame, but only for the rows where the `quality` is either `'Bad'` or `'Ugly'`. We can do that as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by putting our conditions inside `df.loc[]` the way we normally would in `np.where()`, we return a version of the DataFrame where only the bad and ugly measurement remains." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What if I want to perform the same filtering, but I don't want the whole DataFrame out, I just want a specific column? We can specify this as well:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-10-08 2\n", "2021-10-09 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add a comma after our conditionals, and then specify the name of the column we want." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is extensible to multiple columns; we only have two here but it would look like this:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),['measurement','quality']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "i.e., we provide a list of column names we want to index, only where the conditions are met. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's say I want to set any 'Bad' or 'Ugly' measurements to `np.nan`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071.0Good
2021-10-08NaNBad
2021-10-09NaNUgly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1.0 Good\n", "2021-10-08 NaN Bad\n", "2021-10-09 NaN Ugly" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement'] = np.nan\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Warning**. A common \"gotcha\" with DataFrames is to not use the `col_indexer` as above when setting values, but rather something like the following:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ipasha/anaconda3/envs/a330/lib/python3.8/site-packages/pandas/core/generic.py:5516: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self[name] = value\n" ] } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].measurement = np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get this error because running `.measurement` or indexing with `['measurement']` creates a copy of that slice of the dataframe for view -- meaning that if we try to set values in it, we are **not** affecting values in the underlying dataframe." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Joining DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A powerful strength of DataFrames is the ability to join/merge multiple dataframes on a given column. This behavior is inhereted from the way Databases (such as SQL) work. \n", "Let's create another DataFrame that has some anciliary data also taken on our three dates:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weatherseeing
2021-10-07sunny0.55
2021-10-08cloudy2.30
2021-10-09cloudy3.20
\n", "
" ], "text/plain": [ " weather seeing\n", "2021-10-07 sunny 0.55\n", "2021-10-08 cloudy 2.30\n", "2021-10-09 cloudy 3.20" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame()\n", "df2['weather'] = ['sunny','cloudy','cloudy']\n", "df2['seeing'] = np.array([0.55,2.3,3.2])\n", "df2.index = ['2021-10-07','2021-10-08','2021-10-09']\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can easily combine these two DataFrames:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = pd.merge(df,df2,left_index=True,right_index=True)\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When merging dataframes, we have to select which columns to merge on. Here, I've chosen a full join on the index. Let's talk about the options:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "From database parlance, we can join/merge two tables as long as there is a *unique identifer* column that is shared across both tables. In our case, the *index* is that \"column\". \n", "\n", "- A left (outer) join means that we take *all* rows of `df`, and if df2 has no entry for some rows, we'll fill it with NaNs. \n", "- A right (outer) join means we will keep *all* rows of `df2`, and if df has no entry for some rows, we'll fill it with NaNs.\n", "- A full (outer) join means we keep *all* rows of **both** dataframes, filling in NaNs where either is missing data. \n", "\n", "Alternatively, we could ask for just the rows that have matches in both dataframes, dropping all other columns. This would be an inner join." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's see this in action. I'm going to add a date, measurement, and quality to df:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071.0Good
2021-10-08NaNBad
2021-10-09NaNUgly
2021-10-105.0Good
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1.0 Good\n", "2021-10-08 NaN Bad\n", "2021-10-09 NaN Ugly\n", "2021-10-10 5.0 Good" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2021-10-10'] = [5,'Good']\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now when we go to do our merge, `df2` won't have a row for that last index. But if I use `df.join()`, I can specify an outer join:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
2021-10-105.0GoodNaNNaN
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20\n", "2021-10-10 5.0 Good NaN NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='outer')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that I get NaN for the weather and seeing, as these weren't present in `df2`. But I do keep that row overall. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='inner')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if I use an \"inner\" join, I only keep indices that were in *both* dataframes. So I lose that entire row. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='right')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, I've specified I want a \"right outer\" join. Because the \"righthand\" dataframe (`df2`) doesn't have a row for '2021-10-10', it's not part of the final df. " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
2021-10-105.0GoodNaNNaN
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20\n", "2021-10-10 5.0 Good NaN NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='left')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wheras if I use 'left', I keep all rows of `df` including the new one, filling in NaNs for df2 where it has no values. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You may have noticed that in the above examples, 'left' and 'outer' produced the same results. If so, good eye! They're the same output in this case because `df2` has no rows *not* present in `df`. But if it did, then the 'outer' join would keep both that row and '2021-10-10' from df, while 'left' would ignore the extra row in df2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a note, the join's we've been doing have *implicitly been using the index* as the join column, which is the default behavior. But if there were an actual column in the data that had the necessary properties, we could use the `pd.merge(df,df2,on='name')` syntax to join on that column instead." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Operations on Columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can carry out operations on columns of dataframes like we would with numpy arrays. For example, let's say I have a dataframe as follows:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
galaxy_namefluxdistance
0NGC 43001.100000e-1210.5
1NGC 30551.500000e-1333.7
2NGC 32354.600000e-13105.0
3NGC 65321.800000e-1222.0
\n", "
" ], "text/plain": [ " galaxy_name flux distance\n", "0 NGC 4300 1.100000e-12 10.5\n", "1 NGC 3055 1.500000e-13 33.7\n", "2 NGC 3235 4.600000e-13 105.0\n", "3 NGC 6532 1.800000e-12 22.0" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.DataFrame()\n", "df3['galaxy_name'] = ['NGC 4300','NGC 3055','NGC 3235','NGC 6532']\n", "df3['flux'] = np.array([1.1e-12,1.5e-13,4.6e-13,1.8e-12])\n", "df3['distance'] = np.array([10.5,33.7,105,22])\n", "df3" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "I can calculate the luminosity of these galaxies and make a new column as follows" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
galaxy_namefluxdistanceluminosity
0NGC 43001.100000e-1210.56.578710
1NGC 30551.500000e-1333.76.726290
2NGC 32354.600000e-13105.08.200075
3NGC 65321.800000e-1222.07.435057
\n", "
" ], "text/plain": [ " galaxy_name flux distance luminosity\n", "0 NGC 4300 1.100000e-12 10.5 6.578710\n", "1 NGC 3055 1.500000e-13 33.7 6.726290\n", "2 NGC 3235 4.600000e-13 105.0 8.200075\n", "3 NGC 6532 1.800000e-12 22.0 7.435057" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import astropy.units as u\n", "lums = (df3.flux.values*u.erg/u.s/u.cm**2* 4 * np.pi * (df3.distance.values*u.Mpc)**2).to(u.Lsun).value\n", "df3['luminosity'] = np.log10(lums)\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, I use the `.values` to be sure things play nicely with `astropy`. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dropping rows or columns \n", "\n", "So far, we've seen how to add things into dataframes and modify their values. But what if we want to drop a row or column entirely?\n", "\n", "For this, we need `df.drop()`. This command returns a version of the dataframe with the row or column dropped. As a note, the default behavior is to drop *rows*, so if you're specifying a column, you also need to add `axis=1` as follows:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weatherseeing
2021-10-07sunny0.55
2021-10-08cloudy2.30
\n", "
" ], "text/plain": [ " weather seeing\n", "2021-10-07 sunny 0.55\n", "2021-10-08 cloudy 2.30" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.drop('2021-10-09')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We specified an index there, let's try a column:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weather
2021-10-07sunny
2021-10-08cloudy
2021-10-09cloudy
\n", "
" ], "text/plain": [ " weather\n", "2021-10-07 sunny\n", "2021-10-08 cloudy\n", "2021-10-09 cloudy" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.drop('seeing',axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that `df2.drop()` doesn't overwrite df2. To do this, we'd need do do either of the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = df2.drop('seeing',axis=1)\n", "\n", "#OR \n", "\n", "df2.drop('seeing',inplace=True,axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I suggest the first, as it is easier to read and remember." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Other Common Operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here I'm going to provide a grab bag of common pandas methods that are useful to know:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can return the unique elements of a column (and the number of them) via \n", "`df['col2'].unique()`\n", "and \n", "`df['col2].nunique()`\n", "\n", "We can count how many times each value appears in a given column via \n", "`df['col2'].value_counts()` \n", "\n", "We can sort a dataframe by the values in some column via\n", "`df.sort_values('col2')`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data I/O" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A final use for `pandas` is that it was designed to work with *messy* data. When loading data from, e.g., ascii files, there are often a lot of subtleties surrounding the spacings and columns, mixes of strings and floats, etc. \n", "\n", "Traditional `numpy` functions like `loadtxt()` or `genfromtxt()` struggle with these kinds of files, but often, `pandas` has no trouble with it. At the very least, then, we can use pandas to get the data into python, where we know how to mess with it until it meets our requirements." ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('table2.dat')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameionwlfluxrmsI
0PN3m[OII]3727373.958.6517.3
1PN3mHeI38050.00.00.0
2PN3mHI38350.00.00.0
3PN3m[NeIII]38690.00.00.0
4PN3mHeI38890.00.00.0
\n", "
" ], "text/plain": [ " name ion wl flux rms I\n", "0 PN3m [OII] 3727 373.9 58.6 517.3\n", "1 PN3m HeI 3805 0.0 0.0 0.0\n", "2 PN3m HI 3835 0.0 0.0 0.0\n", "3 PN3m [NeIII] 3869 0.0 0.0 0.0\n", "4 PN3m HeI 3889 0.0 0.0 0.0" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, I've run a one line, no modification call of pandas' `read_csv()` function. With no issue, it loaded up this file, which contains column names (first row), and a variety of data types. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's try the same thing with numpy:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "could not convert string to float: 'name,ion,wl,flux,rms,I'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloadtxt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'table2.dat'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mloadtxt\u001b[0;34m(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)\u001b[0m\n\u001b[1;32m 1146\u001b[0m \u001b[0;31m# converting the data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1147\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1148\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mread_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_loadtxt_chunksize\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1149\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mX\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1150\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mread_data\u001b[0;34m(chunk_size)\u001b[0m\n\u001b[1;32m 997\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 998\u001b[0m \u001b[0;31m# Convert each value according to its column and store\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 999\u001b[0;31m \u001b[0mitems\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvals\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1000\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1001\u001b[0m \u001b[0;31m# Then pack it according to the dtype's nesting\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 997\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 998\u001b[0m \u001b[0;31m# Convert each value according to its column and store\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 999\u001b[0;31m \u001b[0mitems\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvals\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1000\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1001\u001b[0m \u001b[0;31m# Then pack it according to the dtype's nesting\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mfloatconv\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m 734\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m'0x'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 735\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfromhex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 736\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 737\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 738\u001b[0m \u001b[0mtyp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'name,ion,wl,flux,rms,I'" ] } ], "source": [ "x = np.loadtxt('table2.dat')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Yep. Numpy doesn't like it. In part, having columns with strings and also columns with floats leaves us dead in the water. We want pandas for this.\n", "\n", "There are cases where pandas is annoying about reading things in and numpy is actually easier. My advice? Know both, and pick whichever one to read in the data causes the least headache. You can always turn an array into a dataframe (and vice versa, if the data are all the same type)." 
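] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a file *is* messy, `read_csv` has keyword arguments for many of the usual headaches. Here is a hedged sketch (the filename and column names are hypothetical, so the cell is left unexecuted):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a hypothetical messy file: comment lines, no header row, '--' for missing data\n", "df_messy = pd.read_csv('messy_table.dat',\n", "                       comment='#',                  # skip lines starting with '#'\n", "                       header=None,                  # the file has no header row...\n", "                       names=['name','flux','err'],  # ...so we supply column names\n", "                       na_values=['--'])             # parse '--' entries as NaN"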
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We're going to use the following dataset in the lab, so I'd like to briefly familiarize you with it:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameionwlfluxrmsI
0PN3m[OII]3727373.958.6517.3
1PN3mHeI38050.00.00.0
2PN3mHI38350.00.00.0
3PN3m[NeIII]38690.00.00.0
4PN3mHeI38890.00.00.0
\n", "
" ], "text/plain": [ " name ion wl flux rms I\n", "0 PN3m [OII] 3727 373.9 58.6 517.3\n", "1 PN3m HeI 3805 0.0 0.0 0.0\n", "2 PN3m HI 3835 0.0 0.0 0.0\n", "3 PN3m [NeIII] 3869 0.0 0.0 0.0\n", "4 PN3m HeI 3889 0.0 0.0 0.0" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1634" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
nameionwl
HII123HI38351.06.5NaN6.56.56.56.56.5
41001.022.9NaN22.922.922.922.922.9
43401.042.8NaN42.842.842.842.842.8
48611.0100.0NaN100.0100.0100.0100.0100.0
65631.0498.4NaN498.4498.4498.4498.4498.4
.................................
PN9m[ArIII]71351.016.9NaN16.916.916.916.916.9
77511.00.0NaN0.00.00.00.00.0
[NeIII]38691.04.6NaN4.64.64.64.64.6
HeI/[ArIV]47121.00.0NaN0.00.00.00.00.0
[NeIII]/HI39681.00.0NaN0.00.00.00.00.0
\n", "

1634 rows × 8 columns

\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "name ion wl \n", "HII123 HI 3835 1.0 6.5 NaN 6.5 6.5 6.5 6.5 6.5\n", " 4100 1.0 22.9 NaN 22.9 22.9 22.9 22.9 22.9\n", " 4340 1.0 42.8 NaN 42.8 42.8 42.8 42.8 42.8\n", " 4861 1.0 100.0 NaN 100.0 100.0 100.0 100.0 100.0\n", " 6563 1.0 498.4 NaN 498.4 498.4 498.4 498.4 498.4\n", "... ... ... ... ... ... ... ... ...\n", "PN9m [ArIII] 7135 1.0 16.9 NaN 16.9 16.9 16.9 16.9 16.9\n", " 7751 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", " [NeIII] 3869 1.0 4.6 NaN 4.6 4.6 4.6 4.6 4.6\n", " HeI/[ArIV] 4712 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", " [NeIII]/HI 3968 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", "\n", "[1634 rows x 8 columns]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['name','ion','wl']).flux.describe()" ] } ], "metadata": { "kernelspec": { "display_name": "a330", "language": "python", "name": "a330" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 4 }