In this blog post, we’ll explore the NOAA climate data using several interesting and interactive data visualizations.
We begin by importing the packages that we will use.
import sqlite3
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from plotly import express as px
1. Create a Database
Before we can create our database, we must create a function that will allow us to clean the data we wish to put into it.
def prepare_df(df):
df = df.set_index(keys=["ID", "Year"])
df = df.stack()
df = df.reset_index()
df = df.rename(columns = {"level_2" : "Month" , 0 : "Temp"})
df["Month"] = df["Month"].str[5:].astype(int)
df["Temp"] = df["Temp"] / 100
df["FIPS 10-4"] = df["ID"].str[0:2]
return(df)
We can now create our database.
conn = sqlite3.connect("bp1.db")
Our first step will be to read the file containing the temperature data in chunks and copy the information over to the temperatures
table in our database.
df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
df = prepare_df(df)
df.to_sql("temperatures", conn, if_exists = "append", index = False)
We add the stations
table to our database.
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
stations.to_sql("stations", conn, if_exists = "replace", index = False)
We add the countries
table to our database.
countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
We have now finished populating our database. Let us check to make sure we have done so correctly.
cursor = conn.cursor()
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
"Year" INTEGER,
"Month" INTEGER,
"Temp" REAL,
"FIPS 10-4" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL,
"STNELEV" REAL,
"NAME" TEXT
)
CREATE TABLE "countries" (
"FIPS 10-4" TEXT,
"ISO 3166" TEXT,
"Name" TEXT
)
Looks good! We finish off by closing the connection to the database.
conn.close()
2. Write a Query Function
We now want to write a query function that allows us to retrieve data from our database and returns it in the form of a dataframe. This data should be of a given month, country, and time frame.
def query_climate_database(country, year_begin, year_end, month):
conn = sqlite3.connect("bp1.db")
cmd = \
f"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S on T.id = S.id
LEFT JOIN countries C on T.[FIPS 10-4] = C.[FIPS 10-4]
WHERE T.month = {month} AND C.name = "{country}" AND (T.year >= {year_begin} AND T.year <= {year_end})
"""
df = pd.read_sql_query(cmd, conn)
conn.close()
return df
Let us test the function to make sure it works:
query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
NAME | LATITUDE | LONGITUDE | Name | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
3148 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
3149 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
3150 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
3151 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
Our output looks as we expected it to! Let’s move on.
3. Write a Geographic Scatter Function for Yearly Temperature Increases
Time to plot the data! Our goal in this section is to answer the question: “How does the average yearly change in temperature vary within a given country?”
First, we will write a function that will allow use to fit a linear regression model onto a piece of data from the set we’re working with and return the coefficient of the resulting model.
from sklearn.linear_model import LinearRegression
def coef(data_group):
x = data_group[["Year"]] # predictor data
y = data_group["Temp"] # target data
LR = LinearRegression()
LR.fit(x, y)
return LR.coef_[0]
We will now define our plotting function. This function will be given a country, a month, a range of years, a minimum number of observation years, and various arguments appropriate for a px.scatter_mapbox
. Within the function, we will use the given inputs to pull from our database and create a dataframe using the query function we defined above. We will then calculate the linear regression coefficients for each of the stations in our dataframe, and plot this on a geographic scatterplot.
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
df = query_climate_database(country, year_begin, year_end, month)
df["Years of Data"] = df.groupby(["NAME"])["Temp"].transform(len)
df = df[df["Years of Data"] >= min_obs]
coef_df = df.groupby(["NAME"]).apply(coef)
coef_df = coef_df.reset_index()
coef_df.rename(columns={0:"Estimated Yearly Increase (°C)"}, inplace=True )
df = pd.merge(df, coef_df, on = ["NAME"])
df["Temp"].round(decimals = 4)
fig = px.scatter_mapbox(df,
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated Yearly Increase (°C)",
color_continuous_midpoint = 0,
title = f"Estimates of yearly increase in temperature in Month {month} for stations in {country}, years {year_begin} - {year_end}",
**kwargs)
fig.update_layout(margin={"r":0, "t":50, "l":0, "b":0})
return fig
We test our function on Januaries in India from 1980 to 2020.
# assumes you have imported necessary packages
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
4. Create Two More Interesting Figures
Let us start by posing some questions and considering how we might answer them.
Our first question: How can we observe the effects of global warming in a given country?
One way to approach this is to take a given country and a given range of years, and calculate how the z-score for each station in that country is changing over time.
Let us start by defining a new query function, which is similar to our previous one but will no longer depend on a given month, as we want to track data for all months over all years in our range.
def query_climate_database_2(country, year_begin, year_end):
conn = sqlite3.connect("bp1.db")
cmd = \
f"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S on T.id = S.id
LEFT JOIN countries C on T.[FIPS 10-4] = C.[FIPS 10-4]
WHERE C.name = "{country}" AND (T.year >= {year_begin} AND T.year <= {year_end})
"""
df = pd.read_sql_query(cmd, conn)
conn.close()
return df
We also know that we will need to calculate a z-score, so we define a function for calculating that.
def z_score(x):
m = np.mean(x)
s = np.std(x)
return (x - m)/s
Now we can plot our data. We want to extract data from the database using our new query function, and then calculate the z-score for each station and month. In other words, we will calculate how far the temperature at a given station and month deviates from the average temperature of that station and month. We will then make a line plot that compares the date to the z-score of a station; each line will represent one station in the country.
def z_score_plot(country, year_begin, year_end, **kwargs):
df = query_climate_database_2(country, year_begin, year_end)
df["z"] = df.groupby(["NAME", "Month"])["Temp"].transform(z_score)
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))
fig = px.line(df,
x = "Date",
y = "z",
hover_name = "NAME",
color = "NAME",
title = f"Z-score vs. time in {country}, from {year_begin} - {year_end}",
**kwargs)
fig.update_layout(margin={"r":0, "t":50, "l":0, "b":0})
return fig
We test our function on the coutnry of Suriname, which is the smallest independent country in South America. It is important to note that this plotting function would likely not work well with a large country that has many stations.
fig = z_score_plot("Suriname", 1980, 2010)
fig.show()
It appears that all but one station ceased data collection by approximately 1990. From the plot, we can see that, since roughly 2000, there has been a trend of increasing z-scores. This means that as time has gone on, the temperature has deviated more and more from the overall average, which can be taken as a sign of global warming.
We now move on to a different question: What is the relationship between longitude (i.e. distance from the equator) and temperature?
Again, we define a query function that will allow us to extract relevant data from our database. In this case, we will compare the temperatures of 3 countries at various longitudes at two given months in a given year. By default, these two months will be June and December.
def query_climate_database_3(year, country_1, country_2, country_3, sum_month = 6, win_month = 12):
conn = sqlite3.connect("bp1.db")
cmd = \
f"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S on T.id = S.id
LEFT JOIN countries C on T.[FIPS 10-4] = C.[FIPS 10-4]
WHERE (T.year = {year}) AND (T.month = {sum_month} OR T.month = {win_month}) AND (C.name = "{country_1}" OR C.name = "{country_2}" OR C.name = "{country_3}")
"""
df = pd.read_sql_query(cmd, conn)
conn.close()
return df
We test our query function to make sure it works as desired:
query_climate_database_3(2010, "India", "China", "Mexico")
NAME | LATITUDE | LONGITUDE | Name | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | MOHE | 52.1330 | 122.5170 | China | 2010 | 6 | 18.73 |
1 | MOHE | 52.1330 | 122.5170 | China | 2010 | 12 | -28.72 |
2 | HUMA | 51.7170 | 126.6500 | China | 2010 | 6 | 21.71 |
3 | HUMA | 51.7170 | 126.6500 | China | 2010 | 12 | -23.55 |
4 | TULIHE | 50.4500 | 121.7000 | China | 2010 | 6 | 17.50 |
... | ... | ... | ... | ... | ... | ... | ... |
1045 | GENERAL_JUAN_N_ALVA | 16.7500 | -99.7500 | Mexico | 2010 | 12 | 24.85 |
1046 | MAZATLAN | 23.1881 | -105.0000 | Mexico | 2010 | 6 | 27.06 |
1047 | MAZATLAN | 23.1881 | -105.0000 | Mexico | 2010 | 12 | 19.90 |
1048 | TORREONCOAH | 21.5167 | -103.4333 | Mexico | 2010 | 6 | 30.50 |
1049 | TORREONCOAH | 21.5167 | -103.4333 | Mexico | 2010 | 12 | 16.35 |
1050 rows × 7 columns
Now we’re ready to plot! This time, we will create a scatter plot that shows the relationship between a station’s distance from the equator and its temperature in a given year and two given months. Since we are dealing with two months–one in summer, one in winter–we will created a faceted plot in which each plot represents one of these two months. We will also assign different colors based on the country in which the station is located in.
def longitude_vs_temperature_plot(year, country_1, country_2, country_3, sum_month = 6, win_month = 12, **kwargs):
df = query_climate_database_3(year, country_1, country_2, country_3, sum_month, win_month)
df["Distance From Equator"] = np.abs(df["LONGITUDE"])
fig = px.scatter(df,
x = "Distance From Equator",
y = "Temp",
hover_name = "NAME",
color = "Name",
title = f"Longitude vs. Temperature in various countries, in months {sum_month} and {win_month} of year {year}",
facet_col = "Month",
**kwargs)
fig.update_layout(margin={"r":0, "t":50, "l":0, "b":0})
return fig
As always, we test our function to make sure it works properly. This time, we will test it on India, China, and Mexico in 2010. We will leave the default months of June and December.
fig = longitude_vs_temperature_plot(2010, "India", "China", "Mexico",
width = 600,
height = 300,
opacity = 0.5)
fig.show()
Surprisingly, distance from the equator does not seem to have much of an impact on temperatures in the summer. In winter, however, we see that the stations that are farther from the equator tend to be colder, as expected.