{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 6. Pandas" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What is `pandas`?\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Lot of answers, depends on your use case. \n", "- Primarily, pandas is a package built around the DataFrame\n", "- DataFrames are a more flexible, SQL-like data container" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Numpy vs Pandas\n", "\n", "When would we want to use `pandas` instead of a trusty `numpy` array?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- When we need to index columns by name\n", "- When we need to index *rows* by name \n", "- When we want to store heterogeneous data (e.g., one column of strings, next column of floats)\n", "- When we have multiple tables we want to join together\n", "- When we want to perform complex filtering operations on a single table. \n", "- When the I/O for some data on disk is frustrating using `numpy`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that the first bullet point (indexing columns of an array by name) can be accomplished via some other data containers, including `Astropy` `Table`s and `numpy` `record arrays`. \n", "\n", "*However*, many of the other bullet points are not easily accomplished with these methods. *Furthermore*, `pandas` is the industry standard in data science and other fields. Learning it is a transferrable skill." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the purposes of this introduction, we need to learn about 2 `pandas` objects: \n", "- `pd.Series`, and \n", "- `pd.DataFrame`. \n", "\n", "DataFrames are constructed out of Series objects, but as it turns out, we don't need to define them explicitly on most occasions. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's set up a data frame:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np \n", "import pandas as pd \n", "\n", "array1 = np.array([1,2,3])\n", "array2 = ['Good','Bad','Ugly']\n", "\n", "df = pd.DataFrame({'measurement':array1,'quality':array2})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, one valid initialization is feeding in a dictionary with the desired column names as keys and the desired columns as values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here's another way:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "array1 = np.array([1,2,3])\n", "array2 = ['Good','Bad','Ugly']\n", "\n", "df = pd.DataFrame()\n", "df['measurement'] = array1\n", "df['quality'] = array2\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "i.e., we can create empty `DataFrames`, and dynamically set new columns the way we are used to setting the values of dictionaries for a given key. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It is worth noting that this \"dictionary-style\" indexing of DataFrame columns is not the only way. If our column names have no spaces, we can also use dot notation:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['measurement']" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.measurement" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, both forms return the same array. Notice, though, that this \"column\" isn't just an array... here we see our first pandas `Series`. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can think of a `Series` as a sort of wrapper/container around data arrays that provides extra functionality. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Good\n", "1 Bad\n", "2 Ugly\n", "Name: quality, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.quality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, we can see that this Series object has a `Name` (which it tells us when we print it), and it has an `index` --- that's the 0,1,2 over on the left. \n", "As a note, if we ever want to extract *just* the data values of a series, we run" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Good', 'Bad', 'Ugly'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.quality.values" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "I've found that the dot-notation indexing of DataFrames can be very handy -- `.thing` is fewer characters to type than `['thing']`, and doesn't require your fingers to find the `[]` keys. \n", "\n", "Let's come back to our DataFrame, `df`. What's going on with this index on the left? " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
01Good
12Bad
23Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "0 1 Good\n", "1 2 Bad\n", "2 3 Ugly" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 1\n", "quality Good\n", "Name: 0, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **index** part of a DataFrame is a way to index specific *rows*, rather than specific *columns*. As you can see above, I used the `.loc[]` syntax, which tells pandas I'm inserting a row index, rather than a \"column index\". Pandas then spits out each column and associated value for that row." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can set the index of a DataFrame to be whatever we want. For example, let's say our measurements were taken on 3 dates:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df.index = ['2021-10-07','2021-10-08','2021-10-09']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071Good
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1 Good\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the `[0,1,2]` is replaced by our date strings. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now, if I want to pull the measurement and measurement quality from `2021-10-08`, I would use" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 2\n", "quality Bad\n", "Name: 2021-10-08, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2021-10-08']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What if I know the numerical index of the row I want, but don't know the name of that row? Pandas doesn't *love* it when we are in this state, but there is a solution:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "measurement 2\n", "quality Bad\n", "Name: 2021-10-08, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `df.iloc[]` syntax allows you to index a row by true index number, rather than whatever the name of that row is. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filtering DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So now that we know how to create a simple DataFrame, how do we go about filtering it? Instead of creating external masks or using `np.where()`, `pandas` has a built in set of methods for accessing parts of a dataframe." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Sticking with our measurement df, what if I want a view of my DataFrame, but only for the rows where the `quality` is either `'Bad'` or `'Ugly'`. We can do that as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that by putting our conditions inside `df.loc[]` the way we normally would in `np.where()`, we return a version of the DataFrame where only the bad and ugly measurement remains." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What if I want to perform the same filtering, but I don't want the whole DataFrame out, I just want a specific column? We can specify this as well:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-10-08 2\n", "2021-10-09 3\n", "Name: measurement, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add a comma after our conditionals, and then specify the name of the column we want." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is extensible to multiple columns; we only have two here but it would look like this:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-082Bad
2021-10-093Ugly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-08 2 Bad\n", "2021-10-09 3 Ugly" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),['measurement','quality']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "i.e., we provide a list of column names we want to index, only where the conditions are met. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's say I want to set any 'Bad' or 'Ugly' measurements to `np.nan`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071.0Good
2021-10-08NaNBad
2021-10-09NaNUgly
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1.0 Good\n", "2021-10-08 NaN Bad\n", "2021-10-09 NaN Ugly" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement'] = np.nan\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Warning**. A common \"gotcha\" with DataFrames is to not use the `col_indexer` as above when setting values, but rather something like the following:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ipasha/anaconda3/envs/a330/lib/python3.8/site-packages/pandas/core/generic.py:5516: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self[name] = value\n" ] } ], "source": [ "df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].measurement = np.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get this error because running `.measurement` or indexing with `['measurement']` creates a copy of that slice of the dataframe for view -- meaning that if we try to set values in it, we are **not** affecting values in the underlying dataframe." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Joining DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A powerful strength of DataFrames is the ability to join/merge multiple dataframes on a given column. This behavior is inhereted from the way Databases (such as SQL) work. \n", "Let's create another DataFrame that has some anciliary data also taken on our three dates:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weatherseeing
2021-10-07sunny0.55
2021-10-08cloudy2.30
2021-10-09cloudy3.20
\n", "
" ], "text/plain": [ " weather seeing\n", "2021-10-07 sunny 0.55\n", "2021-10-08 cloudy 2.30\n", "2021-10-09 cloudy 3.20" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame()\n", "df2['weather'] = ['sunny','cloudy','cloudy']\n", "df2['seeing'] = np.array([0.55,2.3,3.2])\n", "df2.index = ['2021-10-07','2021-10-08','2021-10-09']\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can easily combine these two DataFrames:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = pd.merge(df,df2,left_index=True,right_index=True)\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When merging dataframes, we have to select which columns to merge on. Here, I've chosen a full join on the index. Let's talk about the options:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "From database parlance, we can join/merge two tables as long as there is a *unique identifer* column that is shared across both tables. In our case, the *index* is that \"column\". \n", "\n", "- A left (outer) join means that we take *all* rows of `df`, and if df2 has no entry for some rows, we'll fill it with NaNs. \n", "- A right (outer) join means we will keep *all* rows of `df2`, and if df has no entry for some rows, we'll fill it with NaNs.\n", "- A full (outer) join means we keep *all* rows of **both** dataframes, filling in NaNs where either is missing data. \n", "\n", "Alternatively, we could ask for just the rows that have matches in both dataframes, dropping all other columns. This would be an inner join." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's see this in action. I'm going to add a date, measurement, and quality to df:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementquality
2021-10-071.0Good
2021-10-08NaNBad
2021-10-09NaNUgly
2021-10-105.0Good
\n", "
" ], "text/plain": [ " measurement quality\n", "2021-10-07 1.0 Good\n", "2021-10-08 NaN Bad\n", "2021-10-09 NaN Ugly\n", "2021-10-10 5.0 Good" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2021-10-10'] = [5,'Good']\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now when we go to do our merge, `df2` won't have a row for that last index. But if I use `df.join()`, I can specify an outer join:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
2021-10-105.0GoodNaNNaN
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20\n", "2021-10-10 5.0 Good NaN NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='outer')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that I get NaN for the weather and seeing, as these weren't present in `df2`. But I do keep that row overall. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='inner')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if I use an \"inner\" join, I only keep indices that were in *both* dataframes. So I lose that entire row. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='right')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, I've specified I want a \"right outer\" join. Because the \"righthand\" dataframe (`df2`) doesn't have a row for '2021-10-10', it's not part of the final df. " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
measurementqualityweatherseeing
2021-10-071.0Goodsunny0.55
2021-10-08NaNBadcloudy2.30
2021-10-09NaNUglycloudy3.20
2021-10-105.0GoodNaNNaN
\n", "
" ], "text/plain": [ " measurement quality weather seeing\n", "2021-10-07 1.0 Good sunny 0.55\n", "2021-10-08 NaN Bad cloudy 2.30\n", "2021-10-09 NaN Ugly cloudy 3.20\n", "2021-10-10 5.0 Good NaN NaN" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = df.join(df2,how='left')\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wheras if I use 'left', I keep all rows of `df` including the new one, filling in NaNs for df2 where it has no values. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You may have noticed that in the above examples, 'left' and 'outer' produced the same results. If so, good eye! They're the same output in this case because `df2` has no rows *not* present in `df`. But if it did, then the 'outer' join would keep both that row and '2021-10-10' from df, while 'left' would ignore the extra row in df2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a note, the join's we've been doing have *implicitly been using the index* as the join column, which is the default behavior. But if there were an actual column in the data that had the necessary properties, we could use the `pd.merge(df,df2,on='name')` syntax to join on that column instead." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Operations on Columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can carry out operations on columns of dataframes like we would with numpy arrays. For example, let's say I have a dataframe as follows:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
galaxy_namefluxdistance
0NGC 43001.100000e-1210.5
1NGC 30551.500000e-1333.7
2NGC 32354.600000e-13105.0
3NGC 65321.800000e-1222.0
\n", "
" ], "text/plain": [ " galaxy_name flux distance\n", "0 NGC 4300 1.100000e-12 10.5\n", "1 NGC 3055 1.500000e-13 33.7\n", "2 NGC 3235 4.600000e-13 105.0\n", "3 NGC 6532 1.800000e-12 22.0" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.DataFrame()\n", "df3['galaxy_name'] = ['NGC 4300','NGC 3055','NGC 3235','NGC 6532']\n", "df3['flux'] = np.array([1.1e-12,1.5e-13,4.6e-13,1.8e-12])\n", "df3['distance'] = np.array([10.5,33.7,105,22])\n", "df3" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "I can calculate the luminosity of these galaxies and make a new column as follows" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
galaxy_namefluxdistanceluminosity
0NGC 43001.100000e-1210.56.578710
1NGC 30551.500000e-1333.76.726290
2NGC 32354.600000e-13105.08.200075
3NGC 65321.800000e-1222.07.435057
\n", "
" ], "text/plain": [ " galaxy_name flux distance luminosity\n", "0 NGC 4300 1.100000e-12 10.5 6.578710\n", "1 NGC 3055 1.500000e-13 33.7 6.726290\n", "2 NGC 3235 4.600000e-13 105.0 8.200075\n", "3 NGC 6532 1.800000e-12 22.0 7.435057" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import astropy.units as u\n", "lums = (df3.flux.values*u.erg/u.s/u.cm**2* 4 * np.pi * (df3.distance.values*u.Mpc)**2).to(u.Lsun).value\n", "df3['luminosity'] = np.log10(lums)\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, I use the `.values` to be sure things play nicely with `astropy`. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dropping rows or columns \n", "\n", "So far, we've seen how to add things into dataframes and modify their values. But what if we want to drop a row or column entirely?\n", "\n", "For this, we need `df.drop()`. This command returns a version of the dataframe with the row or column dropped. As a note, the default behavior is to drop *rows*, so if you're specifying a column, you also need to add `axis=1` as follows:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weatherseeing
2021-10-07sunny0.55
2021-10-08cloudy2.30
\n", "
" ], "text/plain": [ " weather seeing\n", "2021-10-07 sunny 0.55\n", "2021-10-08 cloudy 2.30" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.drop('2021-10-09')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We specified an index there, let's try a column:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weather
2021-10-07sunny
2021-10-08cloudy
2021-10-09cloudy
\n", "
" ], "text/plain": [ " weather\n", "2021-10-07 sunny\n", "2021-10-08 cloudy\n", "2021-10-09 cloudy" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.drop('seeing',axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that `df2.drop()` doesn't overwrite df2. To do this, we'd need do do either of the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = df2.drop('seeing',axis=1)\n", "\n", "#OR \n", "\n", "df2.drop('seeing',inplace=True,axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I suggest the first, as it is easier to read and remember." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Other Common Operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here I'm going to provide a grab bag of common pandas methods that are useful to know:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can return the unique elements of a column (and the number of them) via \n", "`df['col2'].unique()`\n", "and \n", "`df['col2].nunique()`\n", "\n", "We can count how many times each value appears in a given column via \n", "`df['col2'].value_counts()` \n", "\n", "We can sort a dataframe by the values in some column via\n", "`df.sort_values('col2')`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data I/O" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A final use for `pandas` is that it was designed to work with *messy* data. When loading data from, e.g., ascii files, there are often a lot of subtleties surrounding the spacings and columns, mixes of strings and floats, etc. \n", "\n", "Traditional `numpy` functions like `loadtxt()` or `genfromtxt()` struggle with these kinds of files, but often, `pandas` has no trouble with it. At the very least, then, we can use pandas to get the data into python, where we know how to mess with it until it meets our requirements." ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('table2.dat')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameionwlfluxrmsI
0PN3m[OII]3727373.958.6517.3
1PN3mHeI38050.00.00.0
2PN3mHI38350.00.00.0
3PN3m[NeIII]38690.00.00.0
4PN3mHeI38890.00.00.0
\n", "
" ], "text/plain": [ " name ion wl flux rms I\n", "0 PN3m [OII] 3727 373.9 58.6 517.3\n", "1 PN3m HeI 3805 0.0 0.0 0.0\n", "2 PN3m HI 3835 0.0 0.0 0.0\n", "3 PN3m [NeIII] 3869 0.0 0.0 0.0\n", "4 PN3m HeI 3889 0.0 0.0 0.0" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, I've run a one line, no modification call of pandas' `read_csv()` function. With no issue, it loaded up this file, which contains column names (first row), and a variety of data types. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's try the same thing with numpy:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "could not convert string to float: 'name,ion,wl,flux,rms,I'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloadtxt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'table2.dat'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mloadtxt\u001b[0;34m(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)\u001b[0m\n\u001b[1;32m 1146\u001b[0m \u001b[0;31m# converting the data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1147\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1148\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mread_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_loadtxt_chunksize\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1149\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mX\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1150\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mread_data\u001b[0;34m(chunk_size)\u001b[0m\n\u001b[1;32m 997\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 998\u001b[0m \u001b[0;31m# Convert each value according to its column and store\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 999\u001b[0;31m \u001b[0mitems\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvals\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1000\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1001\u001b[0m \u001b[0;31m# Then pack it according to the dtype's nesting\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 997\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 998\u001b[0m \u001b[0;31m# Convert each value according to its column and store\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 999\u001b[0;31m \u001b[0mitems\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mconv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mval\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mzip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvals\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1000\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1001\u001b[0m \u001b[0;31m# Then pack it according to the dtype's nesting\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py\u001b[0m in \u001b[0;36mfloatconv\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m 734\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m'0x'\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 735\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfromhex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 736\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 737\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 738\u001b[0m \u001b[0mtyp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'name,ion,wl,flux,rms,I'" ] } ], "source": [ "x = np.loadtxt('table2.dat')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Yep. Numpy doesn't like it. In part, having columns with strings and also columns with floats leaves us dead in the water. We want pandas for this.\n", "\n", "There are cases where pandas is annoying about reading things in and numpy is actually easier. My advice? Know both, and pick whichever one to read in the data causes the least headache. You can always turn an array into a dataframe (and vice versa, if the data are all the same type)." 
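] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a file *is* messy, `read_csv` has keyword arguments for many of the usual headaches. Here is a hedged sketch (the filename and column names are hypothetical, so the cell is left unexecuted):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a hypothetical messy file: comment lines, no header row, '--' for missing data\n", "df_messy = pd.read_csv('messy_table.dat',\n", "                       comment='#',                  # skip lines starting with '#'\n", "                       header=None,                  # the file has no header row...\n", "                       names=['name','flux','err'],  # ...so we supply column names\n", "                       na_values=['--'])             # parse '--' entries as NaN"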
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We're going to use the following dataset in the lab, so I'd like to briefly familiarize you with it:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameionwlfluxrmsI
0PN3m[OII]3727373.958.6517.3
1PN3mHeI38050.00.00.0
2PN3mHI38350.00.00.0
3PN3m[NeIII]38690.00.00.0
4PN3mHeI38890.00.00.0
\n", "
" ], "text/plain": [ " name ion wl flux rms I\n", "0 PN3m [OII] 3727 373.9 58.6 517.3\n", "1 PN3m HeI 3805 0.0 0.0 0.0\n", "2 PN3m HI 3835 0.0 0.0 0.0\n", "3 PN3m [NeIII] 3869 0.0 0.0 0.0\n", "4 PN3m HeI 3889 0.0 0.0 0.0" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1634" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
nameionwl
HII123HI38351.06.5NaN6.56.56.56.56.5
41001.022.9NaN22.922.922.922.922.9
43401.042.8NaN42.842.842.842.842.8
48611.0100.0NaN100.0100.0100.0100.0100.0
65631.0498.4NaN498.4498.4498.4498.4498.4
.................................
PN9m[ArIII]71351.016.9NaN16.916.916.916.916.9
77511.00.0NaN0.00.00.00.00.0
[NeIII]38691.04.6NaN4.64.64.64.64.6
HeI/[ArIV]47121.00.0NaN0.00.00.00.00.0
[NeIII]/HI39681.00.0NaN0.00.00.00.00.0
\n", "

1634 rows × 8 columns

\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "name ion wl \n", "HII123 HI 3835 1.0 6.5 NaN 6.5 6.5 6.5 6.5 6.5\n", " 4100 1.0 22.9 NaN 22.9 22.9 22.9 22.9 22.9\n", " 4340 1.0 42.8 NaN 42.8 42.8 42.8 42.8 42.8\n", " 4861 1.0 100.0 NaN 100.0 100.0 100.0 100.0 100.0\n", " 6563 1.0 498.4 NaN 498.4 498.4 498.4 498.4 498.4\n", "... ... ... ... ... ... ... ... ...\n", "PN9m [ArIII] 7135 1.0 16.9 NaN 16.9 16.9 16.9 16.9 16.9\n", " 7751 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", " [NeIII] 3869 1.0 4.6 NaN 4.6 4.6 4.6 4.6 4.6\n", " HeI/[ArIV] 4712 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", " [NeIII]/HI 3968 1.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0\n", "\n", "[1634 rows x 8 columns]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['name','ion','wl']).flux.describe()" ] } ], "metadata": { "kernelspec": { "display_name": "a330", "language": "python", "name": "a330" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 4 }