Pandas and Climate Data

Overview

Teaching: 25 min
Exercises: 0 min

Questions

How can Pandas be used with climate data?

Objectives

Apply Pandas to climate data.

Use Seaborn to analyze Pandas DataFrames

More about Pandas

Pandas is designed for analyzing tabular data. Climate model output is rarely in tabular form, but often observational data is. Such data may be in the form of plain text files, or more often comma-separated files that have the suffix .csv Note that the separator does not have to be a comma - any unique charater will work (e.g., semicolons, tabs, or spaces).

Spreadsheet apps like Microsoft Excel and Google Sheets easily read and write out files in CSV format.

Open a new Jupyter notebook. We will need to import both pandas and seaborn.

import pandas as pd
import seaborn as sns

We will use some of the monthly climate index data from NOAA, which has been placed in the pandas_data directory with the name: monthly_climate_indices.csv

Let’s open the file as a Pandas DataFrame and have a look. From a new notebook in your home directory:

file = "pandas_data/monthly_climate_indices.csv"
df = pd.read_csv(file)
df

	YEAR	MONTH	NINO1+2	ANOM1+2	NINO3	ANOM3	NINO4	ANOM4	NINO3.4	ANOM3.4	AO Index
1950	1	23.01	-1.55	23.56	-2.10	26.94	-1.38	24.55	-1.99	-0.06030
1950	2	24.32	-1.78	24.89	-1.52	26.67	-1.53	25.06	-1.69	 0.62700
1950	3	25.11	-1.38	26.36	-0.84	26.52	-1.80	25.87	-1.42	-0.00813
1950	4	23.63	-1.90	26.44	-1.14	26.90	-1.73	26.28	-1.54	 0.55500
1950	5	22.68	-1.74	25.69	-1.57	27.73	-1.18	26.18	-1.75	 0.07160
...	...	...	...	...	...	...	...	...	...	...	...
2022	5	22.77	-1.65	26.20	-1.06	28.10	-0.81	26.81	-1.12	 1.22000
2022	6	21.65	-1.48	25.81	-0.81	28.25	-0.72	26.97	-0.76	-0.08400
2022	7	20.77	-1.19	25.27	-0.53	27.90	-1.00	26.59	-0.70	 0.01780
2022	8	20.43	-0.58	24.44	-0.68	27.69	-1.10	25.87	-0.98	-0.18000
2022	9	19.70	-1.02	23.94	-0.97	27.58	-1.18	25.62	-1.09	-0.66100
rows × 11 columns

Calculate a climatology from a Pandas DataFrame

This works in the same way as we have seen for xarray - in fact, xarray uses several of the techniques from Pandas, such as Pandas’ elegant handling of missing data, and its time/caledar features.

Here, we will use the groupby method learned previously. It works very similarly for xarray and Pandas; there is a slight difference in the arguments and parameters, but both produce a groupby object that we can aggregate.

df_climo = df.groupby(['MONTH']).mean()
df_climo

    YEAR   NINO1+2   ANOM1+2   NINO3      ANOM3     NINO4     ANOM4    NINO3.4   ANOM3.4   AO Index
MONTH										
1986.0 24.277671 -0.288356 25.477534 -0.182055 28.085205 -0.233973 26.408767 -0.135890 -0.362575
1986.0 25.778630 -0.321781 26.232603 -0.173288 27.984658 -0.212329 26.614658 -0.137123 -0.318233
1986.0 26.174521 -0.313151 27.008904 -0.195205 28.087671 -0.232740 27.116712 -0.165890 -0.037440
1986.0 25.256438 -0.279315 27.298493 -0.282466 28.362603 -0.264795 27.566575 -0.251781  0.098748
1986.0 24.058082 -0.358082 26.926712 -0.325068 28.673425 -0.242466 27.681507 -0.251781  0.012021
1986.0 22.804247 -0.320000 26.335068 -0.283973 28.706986 -0.262055 27.499315 -0.228493  0.004055
1986.0 21.688767 -0.270685 25.545616 -0.259452 28.595205 -0.302466 27.070685 -0.222466 -0.111214
1986.0 20.765890 -0.240685 24.903562 -0.214110 28.459589 -0.329315 26.655068 -0.199863 -0.134370
1986.0 20.462877 -0.263699 24.710137 -0.191233 28.436712 -0.323699 26.542192 -0.176438 -0.004132
1985.5 20.751528 -0.264306 24.758889 -0.221389 28.458889 -0.301528 26.520278 -0.196389 -0.019480
1985.5 21.435833 -0.218750 24.866806 -0.235694 28.425694 -0.270833 26.504167 -0.198333 -0.075086
1985.5 22.580278 -0.230000 25.035000 -0.192361 28.304306 -0.235417 26.447361 -0.150833 -0.138510

See what we did and notice the results…

As we saw in the previous Pandas lessons with the .loc method, we specify the name of the column around which we group the data (MONTH) as a string (to give the name of the column) inside square brackets: ['MONTH'].

The aggregator mean() has calculated averages for all the other columns in the DataFrame. > For the column YEAR, this averaging is kind of meaningless - the > climatology does not have an associated year, but rather the range of years in the DataFrame. > > The columns representing anomalies for the El Niño indices are not equal to zero. Why?

This last point is an important one for scientists - you should always be on the lookout for irregularities like this. When you find one, investigate it. Does this mean we did something wrong, or is there another explanation? What could cause such a result? Let’s track it down…

Solution

Click on the link at the top of this notebook for the source of this data, then scroll down to the entry called Niño 3*.

There is a clue in the file name there, but let’s look further: click on the link for NOAA Climate Prediction Center(CPC).

Scroll down, under Sea Surface Temperature (SST) and Monthly ERSSTv5, where the four different indices are listed. What does it say about the base period?

The base period is 1991-2020. We took an avearge across all the data, which start from 1950. We should not expect the 30-year anomaly to match the anomaly calculated acros a different period, especially when there are trends. The negative values above, result from including the 41 years prior to 1991, prior to the increasing effects of global warming.

Below is an example of how we can choose only rows where the values in a specific column (here, YEAR) have a specific value or range of values. With this, we can limit the calculation of the monthly climatology to a range of years.

df_climo = df.loc[(df['YEAR'] >= 1991) & (df['YEAR'] <= 2020)].groupby(['MONTH']).mean()
df_climo

    YEAR   NINO1+2   ANOM1+2    NINO3       ANOM3       NINO4     ANOM4    NINO3.4    ANOM3.4	AO    Index
MONTH										
2005.5 24.565333 -0.000667 25.658667 -1.000000e-03 28.319000 -0.000667 26.544333  7.401487e-18  0.000233
2005.5 26.099667 -0.000333 26.405000 -2.775558e-18 28.197333  0.000333 26.752333  6.666667e-04  0.119400
2005.5 26.488000  0.000667 27.203667 -3.333333e-04 28.321000  0.000333 27.283000  1.000000e-03  0.255533
2005.5 25.536000 -0.000333 27.580667 -3.333333e-04 28.627000 -0.000667 27.818667 -3.333333e-04  0.145687
2005.5 24.415333 -0.001000 27.251333 -6.666667e-04 28.916667  0.000667 27.933333  3.333333e-04  0.059827
2005.5 23.124667 -0.000667 26.619333  3.700743e-18 28.969000  0.000333 27.726667  3.700743e-18 -0.032170
2005.5 21.959000 -0.000333 25.804667  0.000000e+00 28.898667  0.001000 27.293333  1.850372e-18 -0.146541
2005.5 21.006000 -0.001000 25.118333  3.333333e-04 28.788333 -0.000333 26.855000  0.000000e+00 -0.005027
2005.5 20.726333 -0.000333 24.901333 -3.700743e-18 28.761333  0.000333 26.718667 -1.000000e-03  0.042643
2005.5 21.022000  0.005333 24.981333  1.333333e-03 28.760667  0.000333 26.716333  3.333333e-04 -0.218530
2005.5 21.659333  0.004000 25.103000  3.333333e-04 28.697000 -0.000333 26.702667 -6.666667e-04  0.170653
2005.5 22.817333  0.007000 25.229667  2.666667e-03 28.539333 -0.000667 26.599000  6.666667e-04 -0.045037

Now the anomalies are very small - essentially zero, within the range of precision of our original data (0.01˚C, so a 30-year mean may be off by a few thirtieths times one hundredth of a degree). Interestingly, a few values are still suspiciously large (e.g., ANOM1+2 for October-December)!

Preparing Data for Seaborn

Seaborn is a Python data visualization library build atop matplotlib. It is designed to work with Pandas DataFrames, particularly to combine data from multiple columns in a single visualization.

To take full advantage of Seaborn, let’s rearrange our DataFrame. Use the melt method to combine the index columns into a single column of values, and add a new column that indicates the name of the index. We will maintain values for each month and year.

melted_df = pd.melt(df, id_vars=['YEAR','MONTH'], value_vars=['ANOM1+2','ANOM3','ANOM4','ANOM3.4','AO Index'],var_name='Index')
melted_df

      YEAR MONTH   Index   value
1950   1    ANOM1+2 -1.5500
1950   2    ANOM1+2 -1.7800
1950   3    ANOM1+2 -1.3800
1950   4    ANOM1+2 -1.9000
1950   5    ANOM1+2 -1.7400
 ...   ...  ...    ...     ...
2022   5   AO Index  1.2200
2022   6   AO Index -0.0840
2022   7   AO Index  0.0178
2022   8   AO Index -0.1800
2022   9   AO Index -0.6610
rows × 4 columns

You can see that the new DataFrame has fewer columns, but the rows are grouped by index (the old columns) by default.

We can re-sort this DataFrame so that the rows progress in order of time.

sorted_df = melted_df.sort_values(by=['YEAR', 'MONTH'],ignore_index=True)
sorted_df

      YEAR MONTH	 Index   value
1950   1    ANOM1+2 -1.5500
1950   1      ANOM3 -2.1000
1950   1      ANOM4 -1.3800
1950   1    ANOM3.4 -1.9900
1950   1   AO Index -0.0603
 ...   ...  ...     ...       ...
2022   9    ANOM1+2 -1.0200
2022   9      ANOM3 -0.9700
2022   9      ANOM4 -1.1800
2022   9    ANOM3.4 -1.0900
2022   9   AO Index -0.6610
rows × 4 columns

Plotting with Seaborn

We can plot this data in multiple ways with very succinct plotting functions. The power of Seaborn is that the data in any column can be used to determine aspects of the plot (e.g., color, line style, symbol shape, size).

Here we plot time series of the December values of each index, where the line color corresponds to the different indices. The appropriate legend is automatically generated.

sns.set_theme(style="darkgrid")
sx = sns.lineplot(x="YEAR",y="value", hue="Index",
                  data=sorted_df.query("MONTH == 12"))
sx.set_title("December values for Climate Indices") ;

Seaborn line plot

The dark theme with white grid lines is a trademark of Seaborn plots.

Not surprisingly, the various ENSO indices are highly correlated. The Acrtic Oscillation appears it might be anti-correlated with El Niño.

We can display how all these indices relate to each other with the pairplot method. We will go back to our original DataFrame for this, and extract only the anomaly indices for plotting:

columns=['ANOM1+2','ANOM3','ANOM4','ANOM3.4','AO Index']
sns.pairplot(df.query("MONTH == 12")[columns]) ;

Seaborn pair plot

Each index is scattered against every other index. Along the diagonal, a scatter plot would result in a perfect correlation and not reveal any useful new information, so instead a histogram of the probability distribution for each index is plotted automatically.

This is probably not the kind of plot you would use in a presentation or publication, but it is very good for getting a quick look at your data and how the variables might relate. We can see that the El Niño indices are quite well correlated with each other, but the suspected anticorrelation with the AO index is not apparent.

There is an option to plot kernel density instead:

sns.pairplot(df.query("MONTH == 12")[columns], kind="kde") ;

Seaborn pair plot

Key Points

Pandas DataFrames are much like spreadsheets.

Many of the concepts, and even the names of methods, are the same as for xarray.

Seaborn provides quick ways to plot the contents of DataFrames.

previous episode

Using JupyterLab & Python Programming 2

lesson home

Pandas and Climate Data

Overview

More about Pandas

Calculate a climatology from a Pandas DataFrame

See what we did and notice the results…

Solution

Preparing Data for Seaborn

Plotting with Seaborn

Key Points

previous episode

lesson home