6. Pandas

What is pandas?

  • There are many answers; it depends on your use case.

  • Primarily, pandas is a package built around the DataFrame

  • DataFrames are more flexible, SQL-like data containers

Numpy vs Pandas

When would we want to use pandas instead of a trusty numpy array?

  • When we need to index columns by name

  • When we need to index rows by name

  • When we want to store heterogeneous data (e.g., one column of strings, next column of floats)

  • When we have multiple tables we want to join together

  • When we want to perform complex filtering operations on a single table

  • When the I/O for some data on disk is frustrating using numpy

Note that the first bullet point (indexing columns of an array by name) can be accomplished via some other data containers, including Astropy Tables and numpy record arrays.
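
For instance, here's a minimal sketch of a numpy record (structured) array, which also supports indexing columns by name (the column names and values here just mirror the example below):

import numpy as np

# each tuple is a row; the dtype assigns a name and type to each column
rec = np.array([(1, 'Good'), (2, 'Bad'), (3, 'Ugly')],
               dtype=[('measurement', 'i4'), ('quality', 'U4')])
rec['measurement']  # array([1, 2, 3], dtype=int32)
rec['quality']      # array(['Good', 'Bad', 'Ugly'], dtype='<U4')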

However, many of the other bullet points are not easily accomplished with these methods. Furthermore, pandas is the industry standard in data science and other fields, so learning it is a transferable skill.

Introduction to Pandas

For the purposes of this introduction, we need to learn about 2 pandas objects:

  • pd.Series, and

  • pd.DataFrame.

DataFrames are constructed out of Series objects, but as it turns out, we don’t need to define them explicitly on most occasions.

Let’s set up a DataFrame:

import numpy as np 
import pandas as pd 

array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']

df = pd.DataFrame({'measurement':array1,'quality':array2})
df
   measurement quality
0            1    Good
1            2     Bad
2            3    Ugly

As you can see, one valid initialization is feeding in a dictionary with the desired column names as keys and the desired columns as values.

Here’s another way:

array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']

df = pd.DataFrame()
df['measurement'] = array1
df['quality'] = array2
df
   measurement quality
0            1    Good
1            2     Bad
2            3    Ugly

i.e., we can create empty DataFrames, and dynamically set new columns the way we are used to setting the values of dictionaries for a given key.

It is worth noting that this “dictionary-style” indexing of DataFrame columns is not the only way. If our column names have no spaces (and don’t collide with existing DataFrame attributes or methods), we can also use dot notation:

df['measurement']
0    1
1    2
2    3
Name: measurement, dtype: int64
df.measurement
0    1
1    2
2    3
Name: measurement, dtype: int64

As you can see, both forms return the same result. Notice, though, that this “column” isn’t just an array… here we see our first pandas Series.

You can think of a Series as a sort of wrapper/container around data arrays that provides extra functionality.

df.quality
0    Good
1     Bad
2    Ugly
Name: quality, dtype: object

For example, we can see that this Series object has a Name (which it tells us when we print it), and it has an index — that’s the 0,1,2 over on the left. As a note, if we ever want to extract just the data values of a series, we run

df.quality.values
array(['Good', 'Bad', 'Ugly'], dtype=object)
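
Though we rarely need to, we can also construct a Series directly, which makes the name and index explicit. A minimal sketch (the data, index, and name here are made up):

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='example')
s.values  # array([10, 20, 30])
s.index   # Index(['a', 'b', 'c'], dtype='object')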

I’ve found that the dot-notation indexing of DataFrames can be very handy – .thing is fewer characters to type than ['thing'], and doesn’t require your fingers to find the [] keys.

Let’s come back to our DataFrame, df. What’s going on with this index on the left?

df
   measurement quality
0            1    Good
1            2     Bad
2            3    Ugly
df.loc[0]
measurement       1
quality        Good
Name: 0, dtype: object

The index part of a DataFrame is a way to index specific rows, rather than specific columns. As you can see above, I used the .loc[] syntax, which tells pandas I’m providing a row label rather than a column name. Pandas then spits out each column and its associated value for that row.

We can set the index of a DataFrame to be whatever we want. For example, let’s say our measurements were taken on 3 dates:

df.index = ['2021-10-07','2021-10-08','2021-10-09']
df
            measurement quality
2021-10-07            1    Good
2021-10-08            2     Bad
2021-10-09            3    Ugly

Now the [0,1,2] is replaced by our date strings.

Now, if I want to pull the measurement and measurement quality from 2021-10-08, I would use

df.loc['2021-10-08']
measurement      2
quality        Bad
Name: 2021-10-08, dtype: object

What if I know the numerical index of the row I want, but don’t know the name of that row? Pandas doesn’t love it when we are in this state, but there is a solution:

df.iloc[1]
measurement      2
quality        Bad
Name: 2021-10-08, dtype: object

The df.iloc[] syntax allows you to index a row by its integer position, rather than by whatever label that row happens to have.

Filtering DataFrames

So now that we know how to create a simple DataFrame, how do we go about filtering it? Instead of creating external masks or using np.where(), we can use pandas’ built-in methods for accessing parts of a dataframe.

Sticking with our measurement df, what if I want a view of my DataFrame containing only the rows where the quality is either 'Bad' or 'Ugly'? We can do that as follows:

df.loc[(df.quality=='Bad')|(df.quality=='Ugly')]
            measurement quality
2021-10-08            2     Bad
2021-10-09            3    Ugly

We can see that by putting our conditions inside df.loc[] the way we normally would in np.where(), we get back a version of the DataFrame where only the bad and ugly measurements remain.

What if I want to perform the same filtering, but I don’t want the whole DataFrame out, I just want a specific column? We can specify this as well:

df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement']
2021-10-08    2
2021-10-09    3
Name: measurement, dtype: int64

We add a comma after our conditionals, and then specify the name of the column we want.

This is extensible to multiple columns; we only have two here but it would look like this:

df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),['measurement','quality']]
            measurement quality
2021-10-08            2     Bad
2021-10-09            3    Ugly

i.e., we provide a list of column names we want to index, only where the conditions are met.

Let’s say I want to set any ‘Bad’ or ‘Ugly’ measurements to np.nan:

df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement'] = np.nan
df
            measurement quality
2021-10-07          1.0    Good
2021-10-08          NaN     Bad
2021-10-09          NaN    Ugly

Warning. A common “gotcha” with DataFrames is not using the row/column indexer as above when setting values, and instead writing something like the following:

df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].measurement = np.nan
/Users/ipasha/anaconda3/envs/a330/lib/python3.8/site-packages/pandas/core/generic.py:5516: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value

We get this warning because running .measurement or indexing with ['measurement'] on a filtered result may return a copy of that slice of the dataframe rather than a view, meaning that if we try to set values on it, we may not affect the values in the underlying dataframe at all.
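
If we actually want a separate table to modify, leaving df alone, the clean pattern is to take an explicit copy first. A minimal sketch:

# .copy() makes subset an independent DataFrame, so assignments to it
# no longer trigger the warning (and intentionally don't touch df)
subset = df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].copy()
subset['measurement'] = np.nan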

Joining DataFrames

A major strength of DataFrames is the ability to join/merge multiple dataframes on a given column. This behavior is inherited from the way databases (such as SQL databases) work. Let’s create another DataFrame that has some ancillary data also taken on our three dates:

df2 = pd.DataFrame()
df2['weather'] = ['sunny','cloudy','cloudy']
df2['seeing'] = np.array([0.55,2.3,3.2])
df2.index = ['2021-10-07','2021-10-08','2021-10-09']
df2
           weather  seeing
2021-10-07   sunny    0.55
2021-10-08  cloudy    2.30
2021-10-09  cloudy    3.20

We can easily combine these two DataFrames:

new_df = pd.merge(df,df2,left_index=True,right_index=True)
new_df
            measurement quality weather  seeing
2021-10-07          1.0    Good   sunny    0.55
2021-10-08          NaN     Bad  cloudy    2.30
2021-10-09          NaN    Ugly  cloudy    3.20

When merging dataframes, we have to select which columns to merge on. Here, I’ve merged on the index of both frames (pd.merge performs an inner join by default, though since the indices match exactly, the distinction doesn’t matter here). Let’s talk about the options:

In database parlance, we can join/merge two tables as long as there is a unique identifier column that is shared across both tables. In our case, the index is that “column”.

  • A left (outer) join means that we take all rows of df, and if df2 has no entry for some rows, we’ll fill it with NaNs.

  • A right (outer) join means we will keep all rows of df2, and if df has no entry for some rows, we’ll fill it with NaNs.

  • A full (outer) join means we keep all rows of both dataframes, filling in NaNs where either is missing data.

Alternatively, we could ask for just the rows that have matches in both dataframes, dropping all other rows. This would be an inner join.

Let’s see this in action. I’m going to add a date, measurement, and quality to df:

df.loc['2021-10-10'] = [5,'Good']
df
            measurement quality
2021-10-07          1.0    Good
2021-10-08          NaN     Bad
2021-10-09          NaN    Ugly
2021-10-10          5.0    Good

Now when we go to do our merge, df2 won’t have a row for that last index. But if I use df.join(), I can specify an outer join:

new_df = df.join(df2,how='outer')
new_df
            measurement quality weather  seeing
2021-10-07          1.0    Good   sunny    0.55
2021-10-08          NaN     Bad  cloudy    2.30
2021-10-09          NaN    Ugly  cloudy    3.20
2021-10-10          5.0    Good     NaN     NaN

Notice that I get NaN for the weather and seeing, as these weren’t present in df2. But I do keep that row overall.

new_df = df.join(df2,how='inner')
new_df
            measurement quality weather  seeing
2021-10-07          1.0    Good   sunny    0.55
2021-10-08          NaN     Bad  cloudy    2.30
2021-10-09          NaN    Ugly  cloudy    3.20

Alternatively, if I use an “inner” join, I only keep indices that were in both dataframes. So I lose that entire row.

new_df = df.join(df2,how='right')
new_df
            measurement quality weather  seeing
2021-10-07          1.0    Good   sunny    0.55
2021-10-08          NaN     Bad  cloudy    2.30
2021-10-09          NaN    Ugly  cloudy    3.20

This time, I’ve specified I want a “right outer” join. Because the “righthand” dataframe (df2) doesn’t have a row for ‘2021-10-10’, it’s not part of the final df.

new_df = df.join(df2,how='left')
new_df
            measurement quality weather  seeing
2021-10-07          1.0    Good   sunny    0.55
2021-10-08          NaN     Bad  cloudy    2.30
2021-10-09          NaN    Ugly  cloudy    3.20
2021-10-10          5.0    Good     NaN     NaN

Whereas if I use ‘left’, I keep all rows of df including the new one, filling in NaNs for df2 where it has no values.

You may have noticed that in the above examples, ‘left’ and ‘outer’ produced the same results. If so, good eye! They’re the same output in this case because df2 has no rows not present in df. But if it did, then the ‘outer’ join would keep both that row and ‘2021-10-10’ from df, while ‘left’ would ignore the extra row in df2.

As a note, the joins we’ve been doing have implicitly been using the index as the join column, which is the default behavior for df.join(). But if there were an actual column in the data that had the necessary properties, we could use the pd.merge(df,df2,on='name') syntax to join on that column instead.
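
As a quick sketch of that (the table and column names here are invented for illustration), merging on a shared column looks like:

# both tables share an 'obj' column, so we can merge on it
obs = pd.DataFrame({'obj': ['A', 'B'], 'flux': [1.2, 3.4]})
props = pd.DataFrame({'obj': ['A', 'B'], 'distance': [10.5, 22.0]})
combined = pd.merge(obs, props, on='obj')  # inner join on 'obj' by default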

Operations on Columns

We can carry out operations on columns of dataframes like we would with numpy arrays. For example, let’s say I have a dataframe as follows:

df3 = pd.DataFrame()
df3['galaxy_name'] = ['NGC 4300','NGC 3055','NGC 3235','NGC 6532']
df3['flux'] = np.array([1.1e-12,1.5e-13,4.6e-13,1.8e-12])
df3['distance'] = np.array([10.5,33.7,105,22])
df3
  galaxy_name          flux  distance
0    NGC 4300  1.100000e-12      10.5
1    NGC 3055  1.500000e-13      33.7
2    NGC 3235  4.600000e-13     105.0
3    NGC 6532  1.800000e-12      22.0

I can calculate the luminosity of these galaxies and make a new column as follows:

import astropy.units as u
lums = (df3.flux.values*u.erg/u.s/u.cm**2* 4 * np.pi * (df3.distance.values*u.Mpc)**2).to(u.Lsun).value
df3['luminosity'] = np.log10(lums)
df3
  galaxy_name          flux  distance  luminosity
0    NGC 4300  1.100000e-12      10.5    6.578710
1    NGC 3055  1.500000e-13      33.7    6.726290
2    NGC 3235  4.600000e-13     105.0    8.200075
3    NGC 6532  1.800000e-12      22.0    7.435057

In this case, I use .values to be sure things play nicely with astropy.

Dropping rows or columns

So far, we’ve seen how to add things into dataframes and modify their values. But what if we want to drop a row or column entirely?

For this, we need df.drop(). This command returns a version of the dataframe with the row or column dropped. As a note, the default behavior is to drop rows, so if you’re specifying a column, you also need to add axis=1 as follows:

df2.drop('2021-10-09')
           weather  seeing
2021-10-07   sunny    0.55
2021-10-08  cloudy    2.30

We specified an index there; let’s try a column:

df2.drop('seeing',axis=1)
           weather
2021-10-07   sunny
2021-10-08  cloudy
2021-10-09  cloudy

Note that df2.drop() doesn’t overwrite df2. To do this, we’d need to do either of the following:

df2 = df2.drop('seeing',axis=1)

#OR 

df2.drop('seeing',inplace=True,axis=1)

I suggest the first, as it is easier to read and remember.

Other Common Operations

Here I’m going to provide a grab bag of common pandas methods that are useful to know, with a quick demonstration after the list:

  • We can return the unique elements of a column (and the number of them) via df['col2'].unique() and df['col2'].nunique()

  • We can count how many times each value appears in a given column via df['col2'].value_counts()

  • We can sort a dataframe by the values in some column via df.sort_values('col2')
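
For example, on our measurement df from above:

df['quality'].unique()        # distinct values: array(['Good', 'Bad', 'Ugly'], ...)
df['quality'].nunique()       # 3
df['quality'].value_counts()  # Good appears twice; Bad and Ugly once each
df.sort_values('measurement') # rows reordered by measurement (NaNs placed last)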

Data I/O

A final selling point of pandas is that it was designed to work with messy data. When loading data from, e.g., ascii files, there are often a lot of subtleties surrounding the spacings and columns, mixes of strings and floats, etc.

Traditional numpy functions like loadtxt() or genfromtxt() struggle with these kinds of files, but pandas often has no trouble with them. At the very least, then, we can use pandas to get the data into python, where we know how to mess with it until it meets our requirements.

df = pd.read_csv('table2.dat')
df.head()
   name      ion    wl   flux   rms      I
0  PN3m    [OII]  3727  373.9  58.6  517.3
1  PN3m      HeI  3805    0.0   0.0    0.0
2  PN3m       HI  3835    0.0   0.0    0.0
3  PN3m  [NeIII]  3869    0.0   0.0    0.0
4  PN3m      HeI  3889    0.0   0.0    0.0

Above, I’ve run a one-line, no-modification call of pandas’ read_csv() function. With no issue, it loaded up this file, which contains column names (first row) and a variety of data types.
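
When a file is less cooperative, read_csv has keyword arguments for most common annoyances. A minimal sketch (the file name, column names, and sentinel value here are hypothetical):

df_messy = pd.read_csv('messy.dat',
                       sep=r'\s+',            # whitespace-delimited rather than commas
                       comment='#',           # skip comment lines
                       names=['a', 'b', 'c'], # supply column names if the file has none
                       na_values=['-99'])     # treat sentinel values as NaN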

Let’s try the same thing with numpy:

x = np.loadtxt('table2.dat')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-72-3c744ba5b70f> in <module>
----> 1 x = np.loadtxt('table2.dat')

~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)
   1146         # converting the data
   1147         X = None
-> 1148         for x in read_data(_loadtxt_chunksize):
   1149             if X is None:
   1150                 X = np.array(x, dtype)

~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in read_data(chunk_size)
    997 
    998             # Convert each value according to its column and store
--> 999             items = [conv(val) for (conv, val) in zip(converters, vals)]
   1000 
   1001             # Then pack it according to the dtype's nesting

~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in <listcomp>(.0)
    997 
    998             # Convert each value according to its column and store
--> 999             items = [conv(val) for (conv, val) in zip(converters, vals)]
   1000 
   1001             # Then pack it according to the dtype's nesting

~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in floatconv(x)
    734         if '0x' in x:
    735             return float.fromhex(x)
--> 736         return float(x)
    737 
    738     typ = dtype.type

ValueError: could not convert string to float: 'name,ion,wl,flux,rms,I'

Yep. Numpy doesn’t like it. In particular, having some columns with strings and others with floats leaves us dead in the water. We want pandas for this.

There are cases where pandas is annoying about reading things in and numpy is actually easier. My advice? Know both, and pick whichever one causes the least headache for reading in the data. You can always turn an array into a dataframe (and vice versa, if the data are all the same type).
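
For reference, converting between the two is a one-liner in each direction (the array and column names here are arbitrary):

arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, columns=['x', 'y'])  # array -> DataFrame
back = df_arr.to_numpy()                        # DataFrame -> array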

We’re going to use the following dataset in the lab, so I’d like to briefly familiarize you with it:

df.head()
   name      ion    wl   flux   rms      I
0  PN3m    [OII]  3727  373.9  58.6  517.3
1  PN3m      HeI  3805    0.0   0.0    0.0
2  PN3m       HI  3835    0.0   0.0    0.0
3  PN3m  [NeIII]  3869    0.0   0.0    0.0
4  PN3m      HeI  3889    0.0   0.0    0.0
len(df)
1634
df.groupby(['name','ion','wl']).flux.describe()
                            count   mean  std    min    25%    50%    75%    max
name   ion        wl
HII123 HI         3835        1.0    6.5  NaN    6.5    6.5    6.5    6.5    6.5
                  4100        1.0   22.9  NaN   22.9   22.9   22.9   22.9   22.9
                  4340        1.0   42.8  NaN   42.8   42.8   42.8   42.8   42.8
                  4861        1.0  100.0  NaN  100.0  100.0  100.0  100.0  100.0
                  6563        1.0  498.4  NaN  498.4  498.4  498.4  498.4  498.4
...                           ...    ...  ...    ...    ...    ...    ...    ...
PN9m   [ArIII]    7135        1.0   16.9  NaN   16.9   16.9   16.9   16.9   16.9
                  7751        1.0    0.0  NaN    0.0    0.0    0.0    0.0    0.0
       [NeIII]    3869        1.0    4.6  NaN    4.6    4.6    4.6    4.6    4.6
       HeI/[ArIV] 4712        1.0    0.0  NaN    0.0    0.0    0.0    0.0    0.0
       [NeIII]/HI 3968        1.0    0.0  NaN    0.0    0.0    0.0    0.0    0.0

[1634 rows x 8 columns]
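
If that groupby call looks unfamiliar: .groupby() splits the table into groups of rows sharing a key (or combination of keys), and a subsequent aggregation reduces each group to a single row. A minimal sketch using this dataset:

# mean flux for each object, averaged across all of its measured lines
df.groupby('name').flux.mean()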