6. Pandas¶
What is pandas? Lots of answers; it depends on your use case. Primarily, pandas is a package built around the DataFrame: a more flexible, SQL-like data container.
Numpy vs Pandas¶
When would we want to use pandas instead of a trusty numpy array?
- When we need to index columns by name
- When we need to index rows by name
- When we want to store heterogeneous data (e.g., one column of strings, the next column of floats)
- When we have multiple tables we want to join together
- When we want to perform complex filtering operations on a single table
- When the I/O for some data on disk is frustrating using numpy
Note that the first bullet point (indexing columns of an array by name) can be accomplished via some other data containers, including Astropy Tables and numpy record arrays. However, many of the other bullet points are not easily accomplished with those tools. Furthermore, pandas is the industry standard in data science and other fields, so learning it is a transferable skill.
Introduction to Pandas¶
For the purposes of this introduction, we need to learn about 2 pandas objects: pd.Series and pd.DataFrame. DataFrames are constructed out of Series objects but, as it turns out, we don’t need to define them explicitly on most occasions.
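If you ever do want a Series explicitly, a minimal sketch looks like the following (the values and name here are just placeholders):

import pandas as pd
s = pd.Series([10, 20, 30], name='measurement')  # a named, indexed wrapper around an array of data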
Let’s set up a DataFrame:
import numpy as np
import pandas as pd
array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']
df = pd.DataFrame({'measurement':array1,'quality':array2})
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
As you can see, one valid initialization is feeding in a dictionary with the desired column names as keys and the desired columns as values.
Here’s another way:
array1 = np.array([1,2,3])
array2 = ['Good','Bad','Ugly']
df = pd.DataFrame()
df['measurement'] = array1
df['quality'] = array2
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
i.e., we can create empty DataFrames and dynamically set new columns the way we are used to setting the values of dictionaries for a given key.
It is worth noting that this “dictionary-style” indexing of DataFrame columns is not the only way. If a column name is a valid Python identifier (no spaces, and no clash with an existing DataFrame attribute or method), we can also use dot notation:
df['measurement']
0 1
1 2
2 3
Name: measurement, dtype: int64
df.measurement
0 1
1 2
2 3
Name: measurement, dtype: int64
As you can see, both forms return the same array. Notice, though, that this “column” isn’t just an array… here we see our first pandas Series.
You can think of a Series as a sort of wrapper/container around data arrays that provides extra functionality.
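A quick taste of that extra functionality (these are standard Series methods; the snippet is purely illustrative):

df.measurement.mean()   # 2.0 -- summary statistics come for free
df.measurement.max()    # 3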
df.quality
0 Good
1 Bad
2 Ugly
Name: quality, dtype: object
For example, we can see that this Series object has a Name (which it tells us when we print it), and it has an index — that’s the 0, 1, 2 over on the left.
As a note, if we ever want to extract just the data values of a series, we run
df.quality.values
array(['Good', 'Bad', 'Ugly'], dtype=object)
I’ve found that the dot-notation indexing of DataFrames can be very handy – .thing is fewer characters to type than ['thing'], and doesn’t require your fingers to find the [] keys.
Let’s come back to our DataFrame, df. What’s going on with this index on the left?
df
|   | measurement | quality |
|---|---|---|
| 0 | 1 | Good |
| 1 | 2 | Bad |
| 2 | 3 | Ugly |
df.loc[0]
measurement 1
quality Good
Name: 0, dtype: object
The index part of a DataFrame is a way to index specific rows, rather than specific columns. As you can see above, I used the .loc[] syntax, which tells pandas I’m inserting a row index, rather than a “column index”. Pandas then spits out each column and associated value for that row.
We can set the index of a DataFrame to be whatever we want. For example, let’s say our measurements were taken on 3 dates:
df.index = ['2021-10-07','2021-10-08','2021-10-09']
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1 | Good |
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
Now the [0,1,2] is replaced by our date strings.
Now, if I want to pull the measurement and measurement quality from 2021-10-08, I would use
df.loc['2021-10-08']
measurement 2
quality Bad
Name: 2021-10-08, dtype: object
What if I know the numerical index of the row I want, but don’t know the name of that row? Pandas doesn’t love it when we are in this state, but there is a solution:
df.iloc[1]
measurement 2
quality Bad
Name: 2021-10-08, dtype: object
The df.iloc[] syntax allows you to index a row by its true (integer) position, rather than whatever the name of that row is.
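As with numpy arrays, .iloc[] also accepts slices (a quick illustration):

df.iloc[0:2]   # the first two rows, whatever their index labels happen to be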
Filtering DataFrames¶
So now that we know how to create a simple DataFrame, how do we go about filtering it? Instead of creating external masks or using np.where(), pandas has a built-in set of methods for accessing parts of a DataFrame.
Sticking with our measurement df, what if I want a view of my DataFrame, but only for the rows where the quality is either 'Bad' or 'Ugly'? We can do that as follows:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly')]
|   | measurement | quality |
|---|---|---|
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
We can see that by putting our conditions inside df.loc[] the way we normally would in np.where(), we return a version of the DataFrame where only the bad and ugly measurements remain.
What if I want to perform the same filtering, but I don’t want the whole DataFrame out, I just want a specific column? We can specify this as well:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement']
2021-10-08 2
2021-10-09 3
Name: measurement, dtype: int64
We add a comma after our conditionals, and then specify the name of the column we want.
This is extensible to multiple columns; we only have two here but it would look like this:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),['measurement','quality']]
|   | measurement | quality |
|---|---|---|
| 2021-10-08 | 2 | Bad |
| 2021-10-09 | 3 | Ugly |
i.e., we provide a list of column names we want to index, only where the conditions are met.
Let’s say I want to set any ‘Bad’ or ‘Ugly’ measurements to np.nan:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly'),'measurement'] = np.nan
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1.0 | Good |
| 2021-10-08 | NaN | Bad |
| 2021-10-09 | NaN | Ugly |
Warning: a common “gotcha” with DataFrames is not using the row/column indexer as above when setting values, but rather something like the following:
df.loc[(df.quality=='Bad')|(df.quality=='Ugly')].measurement = np.nan
/Users/ipasha/anaconda3/envs/a330/lib/python3.8/site-packages/pandas/core/generic.py:5516: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
We get this warning because running .measurement or indexing with ['measurement'] creates a copy of that slice of the DataFrame for viewing – meaning that if we try to set values in it, we are not affecting the values in the underlying DataFrame. The fix is the single .loc[row_indexer, col_indexer] = value call shown above.
Joining DataFrames¶
A major strength of DataFrames is the ability to join/merge multiple DataFrames on a given column. This behavior is inherited from the way databases (such as SQL databases) work. Let’s create another DataFrame that has some ancillary data also taken on our three dates:
df2 = pd.DataFrame()
df2['weather'] = ['sunny','cloudy','cloudy']
df2['seeing'] = np.array([0.55,2.3,3.2])
df2.index = ['2021-10-07','2021-10-08','2021-10-09']
df2
|   | weather | seeing |
|---|---|---|
| 2021-10-07 | sunny | 0.55 |
| 2021-10-08 | cloudy | 2.30 |
| 2021-10-09 | cloudy | 3.20 |
We can easily combine these two DataFrames:
new_df = pd.merge(df,df2,left_index=True,right_index=True)
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
When merging DataFrames, we have to select which columns to merge on. Here, I’ve merged on the index of both DataFrames (since they share exactly the same index, every row finds a match and the type of join doesn’t matter yet). Let’s talk about the options:
From database parlance, we can join/merge two tables as long as there is a unique identifier column that is shared across both tables. In our case, the index is that “column”.
- A left (outer) join means that we take all rows of df, and if df2 has no entry for some rows, we fill them with NaNs.
- A right (outer) join means we keep all rows of df2, and if df has no entry for some rows, we fill them with NaNs.
- A full (outer) join means we keep all rows of both DataFrames, filling in NaNs where either is missing data.
- Alternatively, we could ask for just the rows that have matches in both DataFrames, dropping all other rows. This would be an inner join.
Let’s see this in action. I’m going to add a date, measurement, and quality to df:
df.loc['2021-10-10'] = [5,'Good']
df
|   | measurement | quality |
|---|---|---|
| 2021-10-07 | 1.0 | Good |
| 2021-10-08 | NaN | Bad |
| 2021-10-09 | NaN | Ugly |
| 2021-10-10 | 5.0 | Good |
Now when we go to do our merge, df2 won’t have a row for that last index. But if I use df.join(), I can specify an outer join:
new_df = df.join(df2,how='outer')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
| 2021-10-10 | 5.0 | Good | NaN | NaN |
Notice that I get NaN for the weather and seeing, as these weren’t present in df2. But I do keep that row overall.
new_df = df.join(df2,how='inner')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
Alternatively, if I use an “inner” join, I only keep indices that were in both dataframes. So I lose that entire row.
new_df = df.join(df2,how='right')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
This time, I’ve specified I want a “right outer” join. Because the “righthand” DataFrame (df2) doesn’t have a row for ‘2021-10-10’, it’s not part of the final df.
new_df = df.join(df2,how='left')
new_df
|   | measurement | quality | weather | seeing |
|---|---|---|---|---|
| 2021-10-07 | 1.0 | Good | sunny | 0.55 |
| 2021-10-08 | NaN | Bad | cloudy | 2.30 |
| 2021-10-09 | NaN | Ugly | cloudy | 3.20 |
| 2021-10-10 | 5.0 | Good | NaN | NaN |
Whereas if I use ‘left’, I keep all rows of df, including the new one, filling in NaNs for df2 where it has no values.
You may have noticed that in the above examples, ‘left’ and ‘outer’ produced the same results. If so, good eye! They’re the same output in this case because df2 has no rows not present in df. But if it did, then the ‘outer’ join would keep both that row and ‘2021-10-10’ from df, while ‘left’ would ignore the extra row in df2.
As a note, the joins we’ve been doing have implicitly been using the index as the join column, which is the default behavior. But if there were an actual column in the data that had the necessary properties, we could use the pd.merge(df,df2,on='name') syntax to join on that column instead.
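As a quick sketch of that column-based merge (the two small frames and the shared 'name' column here are hypothetical):

left = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'name': ['a', 'b'], 'y': [3, 4]})
pd.merge(left, right, on='name')   # one row per matching 'name'; columns: name, x, y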
Operations on Columns¶
We can carry out operations on columns of dataframes like we would with numpy arrays. For example, let’s say I have a dataframe as follows:
df3 = pd.DataFrame()
df3['galaxy_name'] = ['NGC 4300','NGC 3055','NGC 3235','NGC 6532']
df3['flux'] = np.array([1.1e-12,1.5e-13,4.6e-13,1.8e-12])
df3['distance'] = np.array([10.5,33.7,105,22])
df3
|   | galaxy_name | flux | distance |
|---|---|---|---|
| 0 | NGC 4300 | 1.100000e-12 | 10.5 |
| 1 | NGC 3055 | 1.500000e-13 | 33.7 |
| 2 | NGC 3235 | 4.600000e-13 | 105.0 |
| 3 | NGC 6532 | 1.800000e-12 | 22.0 |
I can calculate the luminosity of these galaxies and make a new column as follows:
import astropy.units as u
lums = (df3.flux.values * u.erg/u.s/u.cm**2 * 4 * np.pi * (df3.distance.values * u.Mpc)**2).to(u.Lsun).value
df3['luminosity'] = np.log10(lums)
df3
|   | galaxy_name | flux | distance | luminosity |
|---|---|---|---|---|
| 0 | NGC 4300 | 1.100000e-12 | 10.5 | 6.578710 |
| 1 | NGC 3055 | 1.500000e-13 | 33.7 | 6.726290 |
| 2 | NGC 3235 | 4.600000e-13 | 105.0 | 8.200075 |
| 3 | NGC 6532 | 1.800000e-12 | 22.0 | 7.435057 |
In this case, I use .values to be sure things play nicely with astropy.
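As a note, numpy ufuncs generally operate on Series directly, so for unitless math we could skip .values entirely (a minimal sketch; the log_flux column name is just for illustration):

df3['log_flux'] = np.log10(df3.flux)   # elementwise on the column; result stays aligned with the index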
Dropping rows or columns¶
So far, we’ve seen how to add things into dataframes and modify their values. But what if we want to drop a row or column entirely?
For this, we need df.drop(). This command returns a version of the DataFrame with the row or column dropped. As a note, the default behavior is to drop rows, so if you’re specifying a column, you also need to add axis=1 as follows:
df2.drop('2021-10-09')
|   | weather | seeing |
|---|---|---|
| 2021-10-07 | sunny | 0.55 |
| 2021-10-08 | cloudy | 2.30 |
We specified an index there; let’s try a column:
df2.drop('seeing',axis=1)
|   | weather |
|---|---|
| 2021-10-07 | sunny |
| 2021-10-08 | cloudy |
| 2021-10-09 | cloudy |
Note that df2.drop() doesn’t overwrite df2. To do this, we’d need to do either of the following:
df2 = df2.drop('seeing',axis=1)
#OR
df2.drop('seeing',inplace=True,axis=1)
I suggest the first, as it is easier to read and remember.
Other Common Operations¶
Here I’m going to provide a grab bag of common pandas methods that are useful to know (see the quick demo after this list):

- We can return the unique elements of a column (and the number of them) via df['col2'].unique() and df['col2'].nunique()
- We can count how many times each value appears in a given column via df['col2'].value_counts()
- We can sort a DataFrame by the values in some column via df.sort_values('col2')
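Applied to our measurement DataFrame from above, these look like the following (a quick sketch; the commented results follow from the data we entered earlier):

df['quality'].unique()         # array(['Good', 'Bad', 'Ugly'], dtype=object)
df['quality'].nunique()        # 3
df['quality'].value_counts()   # Good appears twice; Bad and Ugly once each
df.sort_values('quality')      # rows reordered alphabetically by quality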
Data I/O¶
A final use for pandas is that it was designed to work with messy data. When loading data from, e.g., ASCII files, there are often a lot of subtleties surrounding the spacings and columns, mixes of strings and floats, etc.
Traditional numpy functions like loadtxt() or genfromtxt() struggle with these kinds of files, but often, pandas has no trouble with them. At the very least, then, we can use pandas to get the data into Python, where we know how to mess with it until it meets our requirements.
df = pd.read_csv('table2.dat')
df.head()
|   | name | ion | wl | flux | rms | I |
|---|---|---|---|---|---|---|
| 0 | PN3m | [OII] | 3727 | 373.9 | 58.6 | 517.3 |
| 1 | PN3m | HeI | 3805 | 0.0 | 0.0 | 0.0 |
| 2 | PN3m | HI | 3835 | 0.0 | 0.0 | 0.0 |
| 3 | PN3m | [NeIII] | 3869 | 0.0 | 0.0 | 0.0 |
| 4 | PN3m | HeI | 3889 | 0.0 | 0.0 | 0.0 |
Above, I’ve run a one-line, no-modification call of pandas’ read_csv() function. With no issue, it loaded up this file, which contains column names (first row) and a variety of data types.
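For messier files, read_csv() has a large set of keyword arguments; a few commonly useful ones are sketched below (the filename and column names are hypothetical, and the exact options you need depend on the file):

df_messy = pd.read_csv('messy.dat',
                       sep=r'\s+',          # whitespace-delimited rather than comma-separated
                       comment='#',         # skip lines starting with '#'
                       header=None,         # the file has no header row...
                       names=['a', 'b'],    # ...so supply column names ourselves
                       na_values=['-99'])   # treat sentinel values as NaN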
Let’s try the same thing with numpy:
x = np.loadtxt('table2.dat')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-72-3c744ba5b70f> in <module>
----> 1 x = np.loadtxt('table2.dat')
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, like)
1146 # converting the data
1147 X = None
-> 1148 for x in read_data(_loadtxt_chunksize):
1149 if X is None:
1150 X = np.array(x, dtype)
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in read_data(chunk_size)
997
998 # Convert each value according to its column and store
--> 999 items = [conv(val) for (conv, val) in zip(converters, vals)]
1000
1001 # Then pack it according to the dtype's nesting
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in <listcomp>(.0)
997
998 # Convert each value according to its column and store
--> 999 items = [conv(val) for (conv, val) in zip(converters, vals)]
1000
1001 # Then pack it according to the dtype's nesting
~/anaconda3/envs/a330/lib/python3.8/site-packages/numpy/lib/npyio.py in floatconv(x)
734 if '0x' in x:
735 return float.fromhex(x)
--> 736 return float(x)
737
738 typ = dtype.type
ValueError: could not convert string to float: 'name,ion,wl,flux,rms,I'
Yep. Numpy doesn’t like it. In part, having columns with strings and also columns with floats leaves us dead in the water. We want pandas for this.
There are cases where pandas is annoying about reading things in and numpy is actually easier. My advice? Know both, and pick whichever one reads in the data with the least headache. You can always turn an array into a DataFrame (and vice versa, if the data are all of the same type).
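For example (a minimal sketch of converting in both directions, using the df3 frame from earlier):

arr = df3[['flux', 'distance']].to_numpy()               # DataFrame -> 2D numpy array
back = pd.DataFrame(arr, columns=['flux', 'distance'])   # numpy array -> DataFrame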
We’re going to use the following dataset in the lab, so I’d like to briefly familiarize you with it:
df.head()
|   | name | ion | wl | flux | rms | I |
|---|---|---|---|---|---|---|
| 0 | PN3m | [OII] | 3727 | 373.9 | 58.6 | 517.3 |
| 1 | PN3m | HeI | 3805 | 0.0 | 0.0 | 0.0 |
| 2 | PN3m | HI | 3835 | 0.0 | 0.0 | 0.0 |
| 3 | PN3m | [NeIII] | 3869 | 0.0 | 0.0 | 0.0 |
| 4 | PN3m | HeI | 3889 | 0.0 | 0.0 | 0.0 |
len(df)
1634
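Finally, df.groupby() is a handy way to explore a table like this: it splits the rows into groups that share values in the given column(s), after which a method like .describe() computes summary statistics within each group: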
df.groupby(['name','ion','wl']).flux.describe()
| name | ion | wl | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| HII123 | HI | 3835 | 1.0 | 6.5 | NaN | 6.5 | 6.5 | 6.5 | 6.5 | 6.5 |
|  |  | 4100 | 1.0 | 22.9 | NaN | 22.9 | 22.9 | 22.9 | 22.9 | 22.9 |
|  |  | 4340 | 1.0 | 42.8 | NaN | 42.8 | 42.8 | 42.8 | 42.8 | 42.8 |
|  |  | 4861 | 1.0 | 100.0 | NaN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
|  |  | 6563 | 1.0 | 498.4 | NaN | 498.4 | 498.4 | 498.4 | 498.4 | 498.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PN9m | [ArIII] | 7135 | 1.0 | 16.9 | NaN | 16.9 | 16.9 | 16.9 | 16.9 | 16.9 |
|  |  | 7751 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
|  | [NeIII] | 3869 | 1.0 | 4.6 | NaN | 4.6 | 4.6 | 4.6 | 4.6 | 4.6 |
|  | HeI/[ArIV] | 4712 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
|  | [NeIII]/HI | 3968 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1634 rows × 8 columns