6. Pandas¶
What is pandas? Lots of answers; it depends on your use case. Primarily, pandas is a package built around the DataFrame: a more flexible, SQL-like data container.
Numpy vs Pandas¶
When would we want to use pandas instead of a trusty numpy array?
- When we need to index columns by name
- When we need to index rows by name
- When we want to store heterogeneous data (e.g., one column of strings, the next column of floats)
- When we have multiple tables we want to join together
- When we want to perform complex filtering operations on a single table
- When the I/O for some data on disk is frustrating using numpy
Note that the first bullet point (indexing columns of an array by name) can be accomplished via some other data containers, including Astropy Tables and numpy record arrays. However, many of the other bullet points are not easily accomplished with those tools. Furthermore, pandas is the industry standard in data science and other fields, so learning it is a transferable skill.
Introduction to Pandas¶
For the purposes of this introduction, we need to learn about 2 pandas objects: pd.Series and pd.DataFrame. DataFrames are constructed out of Series objects but, as it turns out, we don’t need to define them explicitly on most occasions.
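If you ever do want a Series explicitly, a minimal sketch looks like the following (the values and name here are just placeholders):

import pandas as pd
s = pd.Series([10, 20, 30], name='measurement')  # a named, indexed wrapper around an array of data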
Let’s set up a DataFrame:
import numpy as np
import pandas as pd
array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']
df = pd.DataFrame({'measurement':array1,'quality':array2})
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
As you can see, one valid initialization is feeding in a dictionary with the desired column names as keys and the desired columns as values.
Here’s another way:
array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']
df = pd.DataFrame()
df['measurement'] = array1
df['quality'] = array2
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
i.e., we can create empty DataFrames and dynamically set new columns the way we are used to setting the values of dictionaries for a given key.
It is worth noting that this “dictionary-style” indexing of DataFrame columns is not the only way. If a column name is a valid Python identifier (no spaces, and no clash with an existing DataFrame attribute or method), we can also use dot notation:
df['measurement']
0 1
1 2
2 3
Name: measurement, dtype: int64
df.measurement
0 1
1 2
2 3
Name: measurement, dtype: int64
As you can see, both forms return the same array. Notice, though, that this “column” isn’t just an array… here we see our first pandas Series.
You can think of a Series as a sort of wrapper/container around data arrays that provides extra functionality.
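A quick taste of that extra functionality (these are standard Series methods; the snippet is purely illustrative):

df.measurement.mean()   # 2.0 -- summary statistics come for free
df.measurement.max()    # 3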
df.quality
0 Good
1 Bad
2 Ugly
Name: quality, dtype: object
For example, we can see that this Series object has a Name (which it tells us when we print it), and it has an index — that’s the 0, 1, 2 over on the left.
As a note, if we ever want to extract just the data values of a series, we run
df.quality.values
array(['Good', 'Bad', 'Ugly'], dtype=object)
I’ve found that the dot-notation indexing of DataFrames can be very handy – .thing is fewer characters to type than ['thing'], and doesn’t require your fingers to find the [] keys.
Let’s come back to our DataFrame, df. What’s going on with this index on the left?
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
df.loc[0]
measurement 1
quality Good
Name: 0, dtype: object
The index part of a DataFrame is a way to index specific rows, rather than specific columns. As you can see above, I used the .loc[] syntax, which tells pandas I’m inserting a row index, rather than a “column index”. Pandas then spits out each column and associated value for that row.
We can set the index of a DataFrame to be whatever we want. For example, let’s say our measurements were taken on 3 dates:
df.index = ['2021-10-07','2021-10-08','2021-10-09']
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1 | Good |
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
Now the [0,1,2] is replaced by our date strings.
Now, if I want to pull the measurement and measurement quality from 2021-10-08, I would use
df.loc['2021-10-08']
measurement 2
quality Bad
Name: 2021-10-08, dtype: object
What if I know the numerical index of the row I want, but don’t know the name of that row? Pandas doesn’t love it when we are in this state, but there is a solution:
df.iloc[1]
measurement 2
quality Bad
Name: 2021-10-08, dtype: object
The df.iloc[] syntax allows you to index a row by its true (integer) position, rather than whatever the name of that row is.
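As with numpy arrays, .iloc[] also accepts slices (a quick illustration):

df.iloc[0:2]   # the first two rows, whatever their index labels happen to be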
Filtering DataFrames¶
So now that we know how to create a simple DataFrame, how do we go about filtering it? Instead of creating external masks or using np.where(), pandas has a built-in set of methods for accessing parts of a DataFrame.
Sticking with our measurement df, what if I want a view of my DataFrame, but only for the rows where the quality is either 'Bad' or 'Ugly'? We can do that as follows:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly')]
|   | measurement | quality |
|---|---|---|
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
We can see that by putting our conditions inside df.loc[] the way we normally would in np.where(), we return a version of the DataFrame where only the bad and ugly measurements remain.
What if I want to perform the same filtering, but I don’t want the whole DataFrame out, I just want a specific column? We can specify this as well:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement']
2021-10-08 2
2021-10-09 3
Name: measurement, dtype: int64
We add a comma after our conditionals, and then specify the name of the column we want.
This is extensible to multiple columns; we only have two here but it would look like this:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),['measurement','quality']]
|   | measurement | quality |
|---|---|---|
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
i.e., we provide a list of column names we want to index, only where the conditions are met.
Let’s say I want to set any ‘Bad’ or ‘Ugly’ measurements to np.nan:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement'] = np.nan
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1.0 | Good |
| 2021-10-08 | NaN | Bad |
| 2021-10-09 | NaN | Ugly |
Warning: a common “gotcha” with DataFrames is not using the row/column indexer as above when setting values, but rather something like the following:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].measurement = np.nan
/Users/ipasha/anaconda3/envs/a330/lib/python3.8/site-packages/pandas/core/generic.py:5516: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
We get this warning because running .measurement or indexing with ['measurement'] creates a copy of that slice of the DataFrame for viewing – meaning that if we try to set values in it, we are not affecting the values in the underlying DataFrame. The fix is the single .loc[row_indexer, col_indexer] = value call shown above.
Joining DataFrames¶
A major strength of DataFrames is the ability to join/merge multiple DataFrames on a given column. This behavior is inherited from the way databases (such as SQL databases) work. Let’s create another DataFrame that has some ancillary data also taken on our three dates:
df2 = pd.DataFrame()
df2['weather'] = ['sunny','cloudy','cloudy']
df2['seeing'] = np.array([0.55,2.3,3.2])
df2.index = ['2021-10-07','2021-10-08','2021-10-09']
df2
|   | weather | seeing |
|---|---|---|
| 2021-10-07 | sunny | 0.55 |
| 2021-10-08 | cloudy | 2.30 |
| 2021-10-09 | cloudy | 3.20 |
We can easily combine these two DataFrames:
new_df = pd.merge(df,df2,left_index=True,right_index=True)
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
When merging DataFrames, we have to select which columns to merge on. Here, I’ve merged on the index of both DataFrames (since they share exactly the same index, every row finds a match and the type of join doesn’t matter yet). Let’s talk about the options:
From database parlance, we can join/merge two tables as long as there is a unique identifier column that is shared across both tables. In our case, the index is that “column”.
- A left (outer) join means that we take all rows of df, and if df2 has no entry for some rows, we fill them with NaNs.
- A right (outer) join means we keep all rows of df2, and if df has no entry for some rows, we fill them with NaNs.
- A full (outer) join means we keep all rows of both DataFrames, filling in NaNs where either is missing data.
- Alternatively, we could ask for just the rows that have matches in both DataFrames, dropping all other rows. This would be an inner join.
Let’s see this in action. I’m going to add a date, measurement, and quality to df:
df.loc['2021-10-10'] = [5,'Good']
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1.0 | Good |
| 2021-10-08 | NaN | Bad |
| 2021-10-09 | NaN | Ugly |
| 2021-10-10 | 5.0 | Good |
Now when we go to do our merge, df2 won’t have a row for that last index. But if I use df.join(), I can specify an outer join:
new_df = df.join(df2,how='outer')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
| 2021-10-10 | 5.0 | Good | NaN | NaN |
Notice that I get NaN for the weather and seeing, as these weren’t present in df2. But I do keep that row overall.
new_df = df.join(df2,how='inner')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
Alternatively, if I use an “inner” join, I only keep indices that were in both dataframes. So I lose that entire row.
new_df = df.join(df2,how='right')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
This time, I’ve specified I want a “right outer” join. Because the “righthand” DataFrame (df2) doesn’t have a row for ‘2021-10-10’, it’s not part of the final df.
new_df = df.join(df2,how='left')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
| 2021-10-10 | 5.0 | Good | NaN | NaN |
Whereas if I use ‘left’, I keep all rows of df, including the new one, filling in NaNs for df2 where it has no values.
You may have noticed that in the above examples, ‘left’ and ‘outer’ produced the same results. If so, good eye! They’re the same output in this case because df2 has no rows not present in df. But if it did, then the ‘outer’ join would keep both that row and ‘2021-10-10’ from df, while ‘left’ would ignore the extra row in df2.
As a note, the joins we’ve been doing have implicitly been using the index as the join column, which is the default behavior. But if there were an actual column in the data that had the necessary properties, we could use the pd.merge(df,df2,on='name') syntax to join on that column instead.
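As a quick sketch of that column-based merge (the two small frames and the shared 'name' column here are hypothetical):

left = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'name': ['a', 'b'], 'y': [3, 4]})
pd.merge(left, right, on='name')   # one row per matching 'name'; columns: name, x, y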
Operations on Columns¶
We can carry out operations on columns of dataframes like we would with numpy arrays. For example, let’s say I have a dataframe as follows:
df3 = pd.DataFrame()
df3['galaxy_name'] = ['NGC 4300','NGC 3055','NGC 3235','NGC 6532']
df3['flux'] = np.array([1.1e-12,1.5e-13,4.6e-13,1.8e-12])
df3['distance'] = np.array([10.5,33.7,105,22])
df3
|   | galaxy_name | flux | distance |
|---|---|---|---|
| 0 | NGC 4300 | 1.100000e-12 | 10.5 |
| 1 | NGC 3055 | 1.500000e-13 | 33.7 |
| 2 | NGC 3235 | 4.600000e-13 | 105.0 |
| 3 | NGC 6532 | 1.800000e-12 | 22.0 |
I can calculate the luminosity of these galaxies and make a new column as follows:
import astropy.units as u
lums = (df3.flux.values * u.erg/u.s/u.cm**2 * 4 * np.pi * (df3.distance.values * u.Mpc)**2).to(u.Lsun).value
df3['luminosity'] = np.log10(lums)
df3
|   | galaxy_name | flux | distance | luminosity |
|---|---|---|---|---|
| 0 | NGC 4300 | 1.100000e-12 | 10.5 | 6.578710 |
| 1 | NGC 3055 | 1.500000e-13 | 33.7 | 6.726290 |
| 2 | NGC 3235 | 4.600000e-13 | 105.0 | 8.200075 |
| 3 | NGC 6532 | 1.800000e-12 | 22.0 | 7.435057 |
In this case, I use .values to be sure things play nicely with astropy.
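As a note, numpy ufuncs generally operate on Series directly, so for unitless math we could skip .values entirely (a minimal sketch; the log_flux column name is just for illustration):

df3['log_flux'] = np.log10(df3.flux)   # elementwise on the column; result stays aligned with the index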
Dropping rows or columns¶
So far, we’ve seen how to add things into dataframes and modify their values. But what if we want to drop a row or column entirely?
For this, we need df.drop(). This command returns a version of the DataFrame with the row or column dropped. As a note, the default behavior is to drop rows, so if you’re specifying a column, you also need to add axis=1 as follows:
df2.drop('2021-10-09')
|   | weather | seeing |
|---|---|---|
| 2021-10-07 | sunny | 0.55 |
| 2021-10-08 | cloudy | 2.30 |
We specified an index there; let’s try a column:
df2.drop('seeing',axis=1)
|   | weather |
|---|---|
| 2021-10-07 | sunny |
| 2021-10-08 | cloudy |
| 2021-10-09 | cloudy |
Note that df2.drop() doesn’t overwrite df2. To do this, we’d need to do either of the following:
df2 = df2.drop('seeing',axis=1)
#OR
df2.drop('seeing',inplace=True,axis=1)
I suggest the first, as it is easier to read and remember.
Other Common Operations¶
Here I’m going to provide a grab bag of common pandas methods that are useful to know (see the quick demo after this list):

- We can return the unique elements of a column (and the number of them) via df['col2'].unique() and df['col2'].nunique()
- We can count how many times each value appears in a given column via df['col2'].value_counts()
- We can sort a DataFrame by the values in some column via df.sort_values('col2')
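Applied to our measurement DataFrame from above, these look like the following (a quick sketch; the commented results follow from the data we entered earlier):

df['quality'].unique()         # array(['Good', 'Bad', 'Ugly'], dtype=object)
df['quality'].nunique()        # 3
df['quality'].value_counts()   # Good appears twice; Bad and Ugly once each
df.sort_values('quality')      # rows reordered alphabetically by quality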
Data I/O¶
A final use for pandas is that it was designed to work with messy data. When loading data from, e.g., ASCII files, there are often a lot of subtleties surrounding the spacings and columns, mixes of strings and floats, etc.
Traditional numpy functions like loadtxt() or genfromtxt() struggle with these kinds of files, but often, pandas has no trouble with them. At the very least, then, we can use pandas to get the data into Python, where we know how to mess with it until it meets our requirements.
df = pd.read_csv('table2.dat')
df.head()
|   | name | ion | wl | flux | rms | I |
|---|---|---|---|---|---|---|
| 0 | PN3m | [OII] | 3727 | 373.9 | 58.6 | 517.3 |
| 1 | PN3m | HeI | 3805 | 0.0 | 0.0 | 0.0 |
| 2 | PN3m | HI | 3835 | 0.0 | 0.0 | 0.0 |
| 3 | PN3m | [NeIII] | 3869 | 0.0 | 0.0 | 0.0 |
| 4 | PN3m | HeI | 3889 | 0.0 | 0.0 | 0.0 |
Above, I’ve run a one-line, no-modification call of pandas’ read_csv() function. With no issue, it loaded up this file, which contains column names (first row) and a variety of data types.
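For messier files, read_csv() has a large set of keyword arguments; a few commonly useful ones are sketched below (the filename and column names are hypothetical, and the exact options you need depend on the file):

df_messy = pd.read_csv('messy.dat',
                       sep=r'\s+',          # whitespace-delimited rather than comma-separated
                       comment='#',         # skip lines starting with '#'
                       header=None,         # the file has no header row...
                       names=['a', 'b'],    # ...so supply column names ourselves
                       na_values=['-99'])   # treat sentinel values as NaN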
Let’s try the same thing with numpy:
x = np.loadtxt('table2.dat')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-72-3c744ba5b70f> in <module>
----> 1 x = np.loadtxt('table2.dat')
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)
1146 # converting the data
1147 X = None
-> 1148 for x in read_data(_loadtxt_chunksize):
1149 if X is None:
1150 X = np.array(x, dtype)
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in read_data(chunk_size)
997
998 # Convert each value according to its column and store
--> 999 items = [conv(val) for (conv, val) in zip(converters, vals)]
1000
1001 # Then pack it according to the dtype's nesting
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in <listcomp>(.0)
997
998 # Convert each value according to its column and store
--> 999 items = [conv(val) for (conv, val) in zip(converters, vals)]
1000
1001 # Then pack it according to the dtype's nesting
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in floatconv(x)
734 if '0x' in x:
735 return float.fromhex(x)
--> 736 return float(x)
737
738 typ = dtype.type
ValueError: could not convert string to float: 'name,ion,wl,flux,rms,I'
Yep. Numpy doesn’t like it. In part, having columns with strings and also columns with floats leaves us dead in the water. We want pandas for this.
There are cases where pandas is annoying about reading things in and numpy is actually easier. My advice? Know both, and pick whichever one reads in the data with the least headache. You can always turn an array into a DataFrame (and vice versa, if the data are all of the same type).
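For example (a minimal sketch of converting in both directions, using the df3 frame from earlier):

arr = df3[['flux', 'distance']].to_numpy()               # DataFrame -> 2D numpy array
back = pd.DataFrame(arr, columns=['flux', 'distance'])   # numpy array -> DataFrame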
We’re going to use the following dataset in the lab, so I’d like to briefly familiarize you with it:
df.head()
|   | name | ion | wl | flux | rms | I |
|---|---|---|---|---|---|---|
| 0 | PN3m | [OII] | 3727 | 373.9 | 58.6 | 517.3 |
| 1 | PN3m | HeI | 3805 | 0.0 | 0.0 | 0.0 |
| 2 | PN3m | HI | 3835 | 0.0 | 0.0 | 0.0 |
| 3 | PN3m | [NeIII] | 3869 | 0.0 | 0.0 | 0.0 |
| 4 | PN3m | HeI | 3889 | 0.0 | 0.0 | 0.0 |
len(df)
1634
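Finally, df.groupby() is a handy way to explore a table like this: it splits the rows into groups that share values in the given column(s), after which a method like .describe() computes summary statistics within each group: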
df.groupby(['name','ion','wl']).flux.describe()
| name | ion | wl | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| HII123 | HI | 3835 | 1.0 | 6.5 | NaN | 6.5 | 6.5 | 6.5 | 6.5 | 6.5 |
|  |  | 4100 | 1.0 | 22.9 | NaN | 22.9 | 22.9 | 22.9 | 22.9 | 22.9 |
|  |  | 4340 | 1.0 | 42.8 | NaN | 42.8 | 42.8 | 42.8 | 42.8 | 42.8 |
|  |  | 4861 | 1.0 | 100.0 | NaN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
|  |  | 6563 | 1.0 | 498.4 | NaN | 498.4 | 498.4 | 498.4 | 498.4 | 498.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PN9m | [ArIII] | 7135 | 1.0 | 16.9 | NaN | 16.9 | 16.9 | 16.9 | 16.9 | 16.9 |
|  |  | 7751 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
|  | [NeIII] | 3869 | 1.0 | 4.6 | NaN | 4.6 | 4.6 | 4.6 | 4.6 | 4.6 |
|  | HeI/[ArIV] | 4712 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
|  | [NeIII]/HI | 3968 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1634 rows × 8 columns