USA transportation networks and COVID-19
One major factor in the spread of infectious diseases like COVID-19 is the connectivity of our transportation networks. Therefore, let’s ask the following question: to what extent does the connectivity of the airport network help explain which regions have seen the most confirmed cases of COVID-19?
I’ll focus on the United States network and analyze data at the level of US states (e.g., Washington state, California, New York state). The analysis will have three main steps.
- I’ll start by inspecting some recent COVID-19 data on the number of confirmed cases over time, to see which states are seeing the most cases.
- Next, I’ll analyze the airport network to rank the states by how central they are in the air-traffic network.
- Finally, I’ll compare the ranking of states by COVID-19 incidence with the ranking by air traffic, to see if there is any “correlation” between the two. I don’t expect perfect overlap between these rankings, but if there is substantial overlap, it would provide evidence for the role that air transportation networks play in the spread of the disease.
import sys
print(sys.version)
import pandas as pd
print(f"Pandas version: {pd.__version__}")
from matplotlib.pyplot import figure, plot, semilogy, grid, legend
%matplotlib inline
3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Pandas version: 1.0.1
Step 1: Inspecting COVID-19 incidence data by state
Researchers at Johns Hopkins University have been tallying the number of confirmed cases of COVID-19 over time. Let’s start by assembling the raw data for analysis.
Provenance of these data. JHU made these data available in this repo on GitHub. The data are stored in files, one for each day since January 22, 2020.
df0 = pd.read_csv('./data/01-22-2020.csv')
df0.head(5)
Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | |
---|---|---|---|---|---|---|
0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN |
2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN |
3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | NaN | NaN |
df1 = pd.read_csv('./data/03-11-2020.csv')
df1.head(5)
Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|
0 | Hubei | China | 2020-03-11T10:53:02 | 67773 | 3046 | 49134 | 30.9756 | 112.2707 |
1 | NaN | Italy | 2020-03-11T21:33:02 | 12462 | 827 | 1045 | 43.0000 | 12.0000 |
2 | NaN | Iran | 2020-03-11T18:52:03 | 9000 | 354 | 2959 | 32.0000 | 53.0000 |
3 | NaN | Korea, South | 2020-03-11T21:13:18 | 7755 | 60 | 288 | 36.0000 | 128.0000 |
4 | France | France | 2020-03-11T22:53:03 | 2281 | 48 | 12 | 46.2276 | 2.2137 |
df2 = pd.read_csv('./data/03-22-2020.csv')
df2.head(5)
FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36061.0 | New York City | New York | US | 3/22/20 23:45 | 40.767273 | -73.971526 | 9654 | 63 | 0 | 0 | New York City, New York, US |
1 | 36059.0 | Nassau | New York | US | 3/22/20 23:45 | 40.740665 | -73.589419 | 1900 | 4 | 0 | 0 | Nassau, New York, US |
2 | 36119.0 | Westchester | New York | US | 3/22/20 23:45 | 41.162784 | -73.757417 | 1873 | 0 | 0 | 0 | Westchester, New York, US |
3 | 36103.0 | Suffolk | New York | US | 3/22/20 23:45 | 40.883201 | -72.801217 | 1034 | 9 | 0 | 0 | Suffolk, New York, US |
4 | 36087.0 | Rockland | New York | US | 3/22/20 23:45 | 41.150279 | -74.025605 | 455 | 1 | 0 | 0 | Rockland, New York, US |
Observe that the column conventions change over time, which will make working with this data quite messy if those differences aren’t handled.
I’m only interested in the following four columns:
- “Province/State”
- “Country/Region”
- “Last Update”
- “Confirmed”
There may be missing values, which read_csv() converts by default to NaN values. I’ll deal with this situation later in the notebook.
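For instance, here is a quick way to count the missing values in the columns of interest (a small check, not part of the main pipeline; the counts naturally vary from file to file):
# How many NaNs does the January 22 snapshot have in the columns I care about?
print(df0[['Province/State', 'Country/Region', 'Confirmed']].isna().sum())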
Also notice that each dataframe has a column named “Last Update”, which contains date and time values stored as strings. Moreover, the files appear to use different formats. Later, I’ll standardize these and use pandas’ to_datetime() to convert them into Python datetime objects. That makes them easier to compare in code and to do simple arithmetic on (e.g., calculating the number of days in between). The following code cells demonstrate these features.
print(type(df1['Last Update'].loc[0])) # Confirm that these values are strings
df0['Timestamp'] = pd.to_datetime(df0['Last Update'])
df1['Timestamp'] = pd.to_datetime(df1['Last Update'])
df1.head(5)
<class 'str'>
Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude | Timestamp | |
---|---|---|---|---|---|---|---|---|---|
0 | Hubei | China | 2020-03-11T10:53:02 | 67773 | 3046 | 49134 | 30.9756 | 112.2707 | 2020-03-11 10:53:02 |
1 | NaN | Italy | 2020-03-11T21:33:02 | 12462 | 827 | 1045 | 43.0000 | 12.0000 | 2020-03-11 21:33:02 |
2 | NaN | Iran | 2020-03-11T18:52:03 | 9000 | 354 | 2959 | 32.0000 | 53.0000 | 2020-03-11 18:52:03 |
3 | NaN | Korea, South | 2020-03-11T21:13:18 | 7755 | 60 | 288 | 36.0000 | 128.0000 | 2020-03-11 21:13:18 |
4 | France | France | 2020-03-11T22:53:03 | 2281 | 48 | 12 | 46.2276 | 2.2137 | 2020-03-11 22:53:03 |
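Once the timestamps are proper datetime objects, “days between two observations” is just a subtraction. A tiny sketch using the first rows of df0 and df1, which were loaded above (output not shown):
# Difference between the first timestamps of the two snapshots, as a number of days
delta = df1['Timestamp'].iloc[0] - df0['Timestamp'].iloc[0]
print(delta, "=>", delta.days, "days")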
Here is a function to get a list of available daily data files by filename.
def get_path(filebase=""):
DATA_PATH = ""
return f"{DATA_PATH}{filebase}"
def get_covid19_daily_filenames(root=get_path("data/")):
from os import listdir
from os.path import isfile
from re import match
def covid19_filepath(filebase, root):
return f"{root}{filebase}"
def is_covid19_daily_file(filebase, root):
file_path = covid19_filepath(filebase, root)
return isfile(file_path) and match('^\d\d-\d\d-2020.csv$', filebase)
filenames = []
for b in listdir(root):
if is_covid19_daily_file(b, root):
filenames.append(covid19_filepath(b, root))
return sorted(filenames)
print(repr(get_covid19_daily_filenames()))
['data/01-22-2020.csv', 'data/01-23-2020.csv', 'data/01-24-2020.csv', 'data/01-25-2020.csv', 'data/01-26-2020.csv', 'data/01-27-2020.csv', 'data/01-28-2020.csv', 'data/01-29-2020.csv', 'data/01-30-2020.csv', 'data/01-31-2020.csv', 'data/02-01-2020.csv', 'data/02-02-2020.csv', 'data/02-03-2020.csv', 'data/02-04-2020.csv', 'data/02-05-2020.csv', 'data/02-06-2020.csv', 'data/02-07-2020.csv', 'data/02-08-2020.csv', 'data/02-09-2020.csv', 'data/02-10-2020.csv', 'data/02-11-2020.csv', 'data/02-12-2020.csv', 'data/02-13-2020.csv', 'data/02-14-2020.csv', 'data/02-15-2020.csv', 'data/02-16-2020.csv', 'data/02-17-2020.csv', 'data/02-18-2020.csv', 'data/02-19-2020.csv', 'data/02-20-2020.csv', 'data/02-21-2020.csv', 'data/02-22-2020.csv', 'data/02-23-2020.csv', 'data/02-24-2020.csv', 'data/02-25-2020.csv', 'data/02-26-2020.csv', 'data/02-27-2020.csv', 'data/02-28-2020.csv', 'data/02-29-2020.csv', 'data/03-01-2020.csv', 'data/03-02-2020.csv', 'data/03-03-2020.csv', 'data/03-04-2020.csv', 'data/03-05-2020.csv', 'data/03-06-2020.csv', 'data/03-07-2020.csv', 'data/03-08-2020.csv', 'data/03-09-2020.csv', 'data/03-10-2020.csv', 'data/03-11-2020.csv', 'data/03-12-2020.csv', 'data/03-13-2020.csv', 'data/03-14-2020.csv', 'data/03-15-2020.csv', 'data/03-16-2020.csv', 'data/03-17-2020.csv', 'data/03-18-2020.csv', 'data/03-19-2020.csv', 'data/03-20-2020.csv', 'data/03-21-2020.csv', 'data/03-22-2020.csv', 'data/03-23-2020.csv', 'data/03-24-2020.csv', 'data/03-25-2020.csv', 'data/03-26-2020.csv', 'data/03-27-2020.csv', 'data/03-28-2020.csv', 'data/03-29-2020.csv', 'data/03-30-2020.csv', 'data/03-31-2020.csv', 'data/04-01-2020.csv', 'data/04-02-2020.csv', 'data/04-03-2020.csv', 'data/04-04-2020.csv', 'data/04-05-2020.csv', 'data/04-06-2020.csv', 'data/04-07-2020.csv', 'data/04-08-2020.csv', 'data/04-09-2020.csv', 'data/04-10-2020.csv', 'data/04-11-2020.csv', 'data/04-12-2020.csv', 'data/04-13-2020.csv', 'data/04-14-2020.csv', 'data/04-15-2020.csv', 'data/04-16-2020.csv', 'data/04-17-2020.csv', 'data/04-18-2020.csv', 'data/04-19-2020.csv', 'data/04-20-2020.csv', 'data/04-21-2020.csv', 'data/04-22-2020.csv', 'data/04-23-2020.csv', 'data/04-24-2020.csv', 'data/04-25-2020.csv', 'data/04-26-2020.csv', 'data/04-27-2020.csv', 'data/04-28-2020.csv', 'data/04-29-2020.csv', 'data/04-30-2020.csv', 'data/05-01-2020.csv', 'data/05-02-2020.csv', 'data/05-03-2020.csv', 'data/05-04-2020.csv', 'data/05-05-2020.csv', 'data/05-06-2020.csv', 'data/05-07-2020.csv', 'data/05-08-2020.csv', 'data/05-09-2020.csv', 'data/05-10-2020.csv', 'data/05-11-2020.csv', 'data/05-12-2020.csv', 'data/05-13-2020.csv', 'data/05-14-2020.csv', 'data/05-15-2020.csv', 'data/05-16-2020.csv', 'data/05-17-2020.csv', 'data/05-18-2020.csv', 'data/05-19-2020.csv', 'data/05-20-2020.csv', 'data/05-21-2020.csv', 'data/05-22-2020.csv', 'data/05-23-2020.csv', 'data/05-24-2020.csv', 'data/05-25-2020.csv', 'data/05-26-2020.csv', 'data/05-27-2020.csv', 'data/05-28-2020.csv', 'data/05-29-2020.csv', 'data/05-30-2020.csv', 'data/05-31-2020.csv', 'data/06-01-2020.csv', 'data/06-02-2020.csv', 'data/06-03-2020.csv', 'data/06-04-2020.csv', 'data/06-05-2020.csv', 'data/06-06-2020.csv', 'data/06-07-2020.csv', 'data/06-08-2020.csv', 'data/06-09-2020.csv', 'data/06-10-2020.csv', 'data/06-11-2020.csv', 'data/06-12-2020.csv', 'data/06-13-2020.csv', 'data/06-14-2020.csv', 'data/06-15-2020.csv', 'data/06-16-2020.csv', 'data/06-17-2020.csv', 'data/06-18-2020.csv', 'data/06-19-2020.csv', 'data/06-20-2020.csv', 'data/06-21-2020.csv', 'data/06-22-2020.csv', 'data/06-23-2020.csv', 
'data/06-24-2020.csv', 'data/06-25-2020.csv', 'data/06-26-2020.csv', 'data/06-27-2020.csv', 'data/06-28-2020.csv', 'data/06-29-2020.csv', 'data/06-30-2020.csv', 'data/07-01-2020.csv', 'data/07-02-2020.csv', 'data/07-03-2020.csv', 'data/07-04-2020.csv', 'data/07-05-2020.csv', 'data/07-06-2020.csv', 'data/07-07-2020.csv', 'data/07-08-2020.csv', 'data/07-09-2020.csv', 'data/07-10-2020.csv', 'data/07-11-2020.csv', 'data/07-12-2020.csv', 'data/07-13-2020.csv', 'data/07-14-2020.csv', 'data/07-15-2020.csv', 'data/07-16-2020.csv', 'data/07-17-2020.csv', 'data/07-18-2020.csv', 'data/07-19-2020.csv', 'data/07-20-2020.csv', 'data/07-21-2020.csv', 'data/07-22-2020.csv', 'data/07-23-2020.csv', 'data/07-24-2020.csv', 'data/07-25-2020.csv', 'data/07-26-2020.csv', 'data/07-27-2020.csv', 'data/07-28-2020.csv', 'data/07-29-2020.csv', 'data/07-30-2020.csv', 'data/07-31-2020.csv', 'data/08-01-2020.csv', 'data/08-02-2020.csv', 'data/08-03-2020.csv', 'data/08-04-2020.csv', 'data/08-05-2020.csv', 'data/08-06-2020.csv', 'data/08-07-2020.csv', 'data/08-08-2020.csv', 'data/08-09-2020.csv', 'data/08-10-2020.csv', 'data/08-11-2020.csv', 'data/08-12-2020.csv', 'data/08-13-2020.csv', 'data/08-14-2020.csv', 'data/08-15-2020.csv', 'data/08-16-2020.csv', 'data/08-17-2020.csv', 'data/08-18-2020.csv', 'data/08-19-2020.csv', 'data/08-20-2020.csv', 'data/08-21-2020.csv', 'data/08-22-2020.csv', 'data/08-23-2020.csv', 'data/08-24-2020.csv', 'data/08-25-2020.csv', 'data/08-26-2020.csv', 'data/08-27-2020.csv', 'data/08-28-2020.csv', 'data/08-29-2020.csv', 'data/08-30-2020.csv', 'data/08-31-2020.csv', 'data/09-01-2020.csv', 'data/09-02-2020.csv', 'data/09-03-2020.csv', 'data/09-04-2020.csv', 'data/09-05-2020.csv', 'data/09-06-2020.csv', 'data/09-07-2020.csv', 'data/09-08-2020.csv', 'data/09-09-2020.csv', 'data/09-10-2020.csv', 'data/09-11-2020.csv', 'data/09-12-2020.csv', 'data/09-13-2020.csv', 'data/09-14-2020.csv', 'data/09-15-2020.csv', 'data/09-16-2020.csv', 'data/09-17-2020.csv', 'data/09-18-2020.csv', 'data/09-19-2020.csv', 'data/09-20-2020.csv', 'data/09-21-2020.csv', 'data/09-22-2020.csv', 'data/09-23-2020.csv', 'data/09-24-2020.csv', 'data/09-25-2020.csv', 'data/09-26-2020.csv', 'data/09-27-2020.csv', 'data/09-28-2020.csv', 'data/09-29-2020.csv', 'data/09-30-2020.csv', 'data/10-01-2020.csv', 'data/10-02-2020.csv', 'data/10-03-2020.csv', 'data/10-04-2020.csv', 'data/10-05-2020.csv', 'data/10-06-2020.csv', 'data/10-07-2020.csv', 'data/10-08-2020.csv', 'data/10-09-2020.csv', 'data/10-10-2020.csv', 'data/10-11-2020.csv', 'data/10-12-2020.csv', 'data/10-13-2020.csv', 'data/10-14-2020.csv', 'data/10-15-2020.csv', 'data/10-16-2020.csv', 'data/10-17-2020.csv', 'data/10-18-2020.csv', 'data/10-19-2020.csv', 'data/10-20-2020.csv', 'data/10-21-2020.csv', 'data/10-22-2020.csv', 'data/10-23-2020.csv', 'data/10-24-2020.csv', 'data/10-25-2020.csv', 'data/10-26-2020.csv', 'data/10-27-2020.csv', 'data/10-28-2020.csv', 'data/10-29-2020.csv', 'data/10-30-2020.csv', 'data/10-31-2020.csv', 'data/11-01-2020.csv', 'data/11-02-2020.csv', 'data/11-03-2020.csv', 'data/11-04-2020.csv', 'data/11-05-2020.csv', 'data/11-06-2020.csv', 'data/11-07-2020.csv', 'data/11-08-2020.csv', 'data/11-09-2020.csv', 'data/11-10-2020.csv', 'data/11-11-2020.csv', 'data/11-12-2020.csv', 'data/11-13-2020.csv', 'data/11-14-2020.csv', 'data/11-15-2020.csv', 'data/11-16-2020.csv', 'data/11-17-2020.csv', 'data/11-18-2020.csv', 'data/11-19-2020.csv', 'data/11-20-2020.csv', 'data/11-21-2020.csv', 'data/11-22-2020.csv', 'data/11-23-2020.csv', 'data/11-24-2020.csv', 
'data/11-25-2020.csv', 'data/11-26-2020.csv', 'data/11-27-2020.csv', 'data/11-28-2020.csv', 'data/11-29-2020.csv', 'data/11-30-2020.csv', 'data/12-01-2020.csv', 'data/12-02-2020.csv', 'data/12-03-2020.csv', 'data/12-04-2020.csv', 'data/12-05-2020.csv', 'data/12-06-2020.csv', 'data/12-07-2020.csv', 'data/12-08-2020.csv', 'data/12-09-2020.csv', 'data/12-10-2020.csv', 'data/12-11-2020.csv', 'data/12-12-2020.csv', 'data/12-13-2020.csv', 'data/12-14-2020.csv', 'data/12-15-2020.csv', 'data/12-16-2020.csv', 'data/12-17-2020.csv', 'data/12-18-2020.csv', 'data/12-19-2020.csv', 'data/12-20-2020.csv', 'data/12-21-2020.csv', 'data/12-22-2020.csv', 'data/12-23-2020.csv', 'data/12-24-2020.csv', 'data/12-25-2020.csv', 'data/12-26-2020.csv', 'data/12-27-2020.csv', 'data/12-28-2020.csv', 'data/12-29-2020.csv', 'data/12-30-2020.csv', 'data/12-31-2020.csv']
Data loading and cleaning
Given filenames, a list of filenames like those produced by get_covid19_daily_filenames() above, the function load_covid19_daily_data(filenames) reads all of these files and combines them into a single pandas DataFrame containing only the following columns:
- “Province/State”: Same contents as the original data frames.
- “Country/Region”: Same contents as the original data frames.
- “Confirmed”: Same contents as the original data frames.
- “Timestamp”: The values from the “Last Update” columns, converted to datetime objects as demonstrated previously.

In addition, the code also does the following:
- If there are any duplicate rows, only one of them is retained.
- In the “Confirmed” column, any missing values are replaced with 0, and the column is converted to an integer type.
def load_covid19_daily_data(filenames):
    from pandas import read_csv, concat, to_datetime
    df_list = []
    for filename in filenames:
        df = read_csv(filename).rename(columns={"Province_State": "Province/State",
                                                "Country_Region": "Country/Region",
                                                "Last_Update": "Last Update"})
        df = df[["Province/State", "Country/Region", "Confirmed", "Last Update"]]
        df["Last Update"] = to_datetime(df["Last Update"])
        df['Confirmed'] = df['Confirmed'].fillna(0).astype(int)
        df_list.append(df)
    df_combined = concat(df_list)
    df_combined.rename(columns={"Last Update": "Timestamp"}, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    return df_combined.reset_index(drop=True)
# Demo of the function:
df_covid19 = load_covid19_daily_data(get_covid19_daily_filenames())
print(f"There are {len(df_covid19)} rows in my data frame.")
print("The first five are:")
display(df_covid19.head(5))
print("A random sample of five additional rows:")
df_covid19.sample(5).sort_index()
There are 923171 rows in my data frame.
The first five are:
Province/State | Country/Region | Confirmed | Timestamp | |
---|---|---|---|---|
0 | Anhui | Mainland China | 1 | 2020-01-22 17:00:00 |
1 | Beijing | Mainland China | 14 | 2020-01-22 17:00:00 |
2 | Chongqing | Mainland China | 6 | 2020-01-22 17:00:00 |
3 | Fujian | Mainland China | 1 | 2020-01-22 17:00:00 |
4 | Gansu | Mainland China | 0 | 2020-01-22 17:00:00 |
A random sample of five additional rows:
Province/State | Country/Region | Confirmed | Timestamp | |
---|---|---|---|---|
31548 | Georgia | US | 275 | 2020-04-11 22:45:33 |
53473 | Georgia | US | 82 | 2020-04-24 03:30:50 |
220029 | Himachal Pradesh | India | 839 | 2020-06-26 04:33:43 |
291538 | Texas | US | 220 | 2020-07-18 04:34:45 |
294935 | Oklahoma | US | 13 | 2020-07-19 04:34:58 |
df_covid19 = df_covid19.groupby(["Province/State", "Country/Region", "Timestamp"], as_index=False).sum()
# ^^^ Above `.groupby()` needed because of a change in US reporting on March 22, 2020
df_covid19.sample(5)
Province/State | Country/Region | Timestamp | Confirmed | |
---|---|---|---|---|
25840 | Colorado | US | 2020-04-29 02:32:29 | 14253 |
45884 | Huila | Colombia | 2020-11-28 05:25:50 | 29357 |
5966 | Aomori | Japan | 2020-07-10 04:34:24 | 31 |
40927 | Gujarat | India | 2020-10-30 04:24:49 | 170878 |
27100 | Cundinamarca | Colombia | 2020-07-18 04:34:45 | 5358 |
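The aggregation above is needed because, starting March 22, the US rows are reported at the county level (see the df2 preview earlier), so a single state can have many rows per timestamp. A quick check that the aggregation collapsed them (a sketch; output not shown):
# After the groupby-sum, each (Province/State, Country/Region, Timestamp) key should appear exactly once
dup_keys = df_covid19.duplicated(subset=["Province/State", "Country/Region", "Timestamp"]).sum()
print(f"Duplicate keys remaining: {dup_keys}")  # expect 0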
US state-by-state data
From now on we’re going to be focusing on confirmed cases in the US.
is_us = (df_covid19["Country/Region"] == "US")
df_covid19[is_us].sample(5)
Province/State | Country/Region | Timestamp | Confirmed | |
---|---|---|---|---|
96082 | Puerto Rico | US | 2020-09-26 04:23:00 | 43587 |
129598 | Virginia | US | 2020-08-09 04:34:54 | 98041 |
71903 | Michigan | US | 2020-07-05 04:33:46 | 72200 |
81222 | New York | US | 2020-09-10 04:29:01 | 443078 |
54787 | Kansas | US | 2020-12-09 05:28:01 | 174945 |
Using Nevada as an example, here are all the rows associated with “Nevada”.
is_nevada = (df_covid19["Province/State"] == "Nevada")
df_covid19[is_us & is_nevada]
Province/State | Country/Region | Timestamp | Confirmed | |
---|---|---|---|---|
78957 | Nevada | US | 2020-03-10 02:33:04 | 4 |
78958 | Nevada | US | 2020-03-11 20:00:00 | 17 |
78959 | Nevada | US | 2020-03-11 22:53:03 | 7 |
78960 | Nevada | US | 2020-03-12 23:44:33 | 14 |
78961 | Nevada | US | 2020-03-14 22:33:03 | 45 |
... | ... | ... | ... | ... |
79248 | Nevada | US | 2020-12-28 05:22:06 | 217509 |
79249 | Nevada | US | 2020-12-29 05:22:37 | 218377 |
79250 | Nevada | US | 2020-12-30 05:22:34 | 220124 |
79251 | Nevada | US | 2020-12-31 05:22:49 | 222595 |
79252 | Nevada | US | 2021-01-01 05:23:07 | 224731 |
296 rows × 4 columns
Given these data, we can order by timestamp and plot confirmed cases over time.
df_covid19[is_us & is_nevada] \
.sort_values(by="Timestamp") \
.plot(x="Timestamp", y="Confirmed", figsize=(16, 9), style='*--')
grid()
STATE_NAMES = pd.read_csv(get_path('data/us_states.csv'))
print(f"There are {len(STATE_NAMES)} US states. The first and last three, along with their two-letter postal code abbreviations, are as follows (in alphabetical order):")
display(STATE_NAMES.head(3))
print("...")
display(STATE_NAMES.tail(3))
There are 50 US states. The first and last three, along with their two-letter postal code abbreviations, are as follows (in alphabetical order):
Name | Abbrv | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
...
Name | Abbrv | |
---|---|---|
47 | West Virginia | WV |
48 | Wisconsin | WI |
49 | Wyoming | WY |
The function below, get_us_states__2(df), does the following:
- takes as input a data frame structured like the combined COVID-19 data frame (df_covid19), having the columns “Province/State”, “Country/Region”, “Confirmed”, and “Timestamp”;
- returns a data frame containing only those rows of df that are from the United States and whose “Province/State” field is exactly the name of one of the US states.

The data frame returned by the function has only these three columns:
- “Confirmed”: The number of confirmed cases, taken from the input df.
- “Timestamp”: The timestamp, taken from the input df.
- “ST”: The two-letter abbreviation of the state’s name.
def get_us_states__2(df):
    is_us = df['Country/Region'] == 'US'
    names = STATE_NAMES["Name"]
    is_us_state = is_us & df['Province/State'].isin(names)
    abbrvs = STATE_NAMES["Abbrv"]
    name2abbrv = {name: st for name, st in zip(names, abbrvs)}
    df_us = df[is_us_state].copy()
    df_us['ST'] = df_us['Province/State'].map(name2abbrv)
    del df_us["Province/State"]
    del df_us["Country/Region"]
    return df_us
df_covid19_us = get_us_states__2(df_covid19)
df_covid19_us.sample(5).sort_values(by=["ST", "Timestamp"])
Timestamp | Confirmed | ST | |
---|---|---|---|
47202 | 2020-03-30 22:52:00 | 4959 | IL |
71886 | 2020-06-18 04:33:18 | 66088 | MI |
73276 | 2020-10-20 04:24:22 | 110592 | MS |
89795 | 2020-04-16 23:30:51 | 1691 | OR |
128226 | 2020-06-22 04:33:20 | 1159 | VT |
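As a sanity check on the filtering (a small sketch; output not shown), the result should contain only two-letter codes of actual states; rows for territories like Puerto Rico, which appear in the US data above, should be gone:
# The filtered frame should contain at most the 50 state abbreviations and nothing else
print(df_covid19_us['ST'].nunique(), "distinct state codes")
assert set(df_covid19_us['ST'].unique()) <= set(STATE_NAMES['Abbrv'])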
Ranking by confirmed cases
The function rank_states_by_cases(df) takes a data frame like df_covid19_us and returns a Python list of states in decreasing order of the maximum number of confirmed cases in that state.
def rank_states_by_cases(df):
    return df.groupby("ST").max().sort_values(by="Confirmed", ascending=False).index.tolist()
covid19_rankings = rank_states_by_cases(df_covid19_us)
print(f"Computed ranking:\n==> {repr(covid19_rankings)}\n")
Computed ranking:
==> ['CA', 'TX', 'FL', 'NY', 'IL', 'OH', 'GA', 'PA', 'TN', 'NC', 'NJ', 'MI', 'WI', 'AZ', 'IN', 'MN', 'MO', 'MA', 'AL', 'VA', 'CO', 'LA', 'SC', 'OK', 'IA', 'MD', 'UT', 'KY', 'WA', 'AR', 'NV', 'KS', 'MS', 'CT', 'NE', 'NM', 'ID', 'OR', 'SD', 'ND', 'RI', 'WV', 'MT', 'DE', 'AK', 'WY', 'NH', 'ME', 'HI', 'VT']
df_covid19_us.head()
Timestamp | Confirmed | ST | |
---|---|---|---|
1340 | 2020-03-11 20:00:00 | 5 | AL |
1341 | 2020-03-14 16:53:03 | 6 | AL |
1342 | 2020-03-15 18:20:19 | 12 | AL |
1343 | 2020-03-16 22:33:03 | 29 | AL |
1344 | 2020-03-17 23:13:10 | 39 | AL |
Visualization
Let’s plot the top 10 states by number of confirmed cases. The y-axis uses a logarithmic scale in this plot.
def viz_by_state(col, df, states, figsize=(16, 9), logy=False):
    from matplotlib.pyplot import figure, plot, semilogy, legend, grid
    figure(figsize=figsize)
    plotter = plot if not logy else semilogy
    for s in states:
        df0 = df[df["ST"] == s].sort_values(by="Timestamp")
        plotter(df0["Timestamp"], df0[col], "o:")
    legend(states)
    grid()
TOP_K = 10
viz_by_state("Confirmed", df_covid19_us, covid19_rankings[:TOP_K], logy=True)
Observe that this data is irregularly sampled and noisy. For instance, the updates do not occur every day in every state, and there are spikes due to reporting errors. Below I will try to smooth out the data before plotting it, to help discern the overall trends better.
Filling in missing values
First I’m going to impute the missing daily values, so that there is at least one value per day. To see the issue more clearly, consider the data for the state of Nevada:
df_covid19_us[df_covid19_us["ST"] == "NV"].sort_values(by="Timestamp")
Timestamp | Confirmed | ST | |
---|---|---|---|
78957 | 2020-03-10 02:33:04 | 4 | NV |
78958 | 2020-03-11 20:00:00 | 17 | NV |
78959 | 2020-03-11 22:53:03 | 7 | NV |
78960 | 2020-03-12 23:44:33 | 14 | NV |
78961 | 2020-03-14 22:33:03 | 45 | NV |
... | ... | ... | ... |
79248 | 2020-12-28 05:22:06 | 217509 | NV |
79249 | 2020-12-29 05:22:37 | 218377 | NV |
79250 | 2020-12-30 05:22:34 | 220124 | NV |
79251 | 2020-12-31 05:22:49 | 222595 | NV |
79252 | 2021-01-01 05:23:07 | 224731 | NV |
296 rows × 3 columns
There are two observations on March 11 and none on March 13. I would like one value per day for every state. So I’m going to resample the values using pandas’ built-in resample(), a standard cleaning step when dealing with irregularly sampled time-series data. The function below implements it; the results are stored in a data frame called df_us_daily.
def resample_daily(df):
    # This implementation is a bit weird, due to a known issue: https://github.com/pandas-dev/pandas/issues/28313
    df_r = df.sort_values(by=["ST", "Timestamp"]) \
             .set_index("Timestamp") \
             .groupby("ST", group_keys=False) \
             .resample("1D", closed="right") \
             .ffill() \
             .reset_index()
    return df_r.sort_values(by=["ST", "Timestamp"]).reset_index(drop=True)
df_us_daily = resample_daily(df_covid19_us)
df_us_daily[df_us_daily["ST"] == "NV"]
Timestamp | Confirmed | ST | |
---|---|---|---|
9668 | 2020-03-11 | 4 | NV |
9669 | 2020-03-12 | 7 | NV |
9670 | 2020-03-13 | 14 | NV |
9671 | 2020-03-14 | 14 | NV |
9672 | 2020-03-15 | 45 | NV |
... | ... | ... | ... |
9961 | 2020-12-29 | 217509 | NV |
9962 | 2020-12-30 | 218377 | NV |
9963 | 2020-12-31 | 220124 | NV |
9964 | 2021-01-01 | 222595 | NV |
9965 | 2021-01-02 | 224731 | NV |
298 rows × 3 columns
Observe how there are now samples on every consecutive day beginning on March 11.
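A quick way to verify that the resampling produced one row per calendar day (a sketch for a single state; output not shown):
# Consecutive timestamps for Nevada should now differ by exactly one day
nv = df_us_daily[df_us_daily["ST"] == "NV"].sort_values(by="Timestamp")
gaps = nv["Timestamp"].diff().dropna()
print(gaps.value_counts())  # expect a single entry: 1 day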
Windowed daily averages
Armed with regularly sampled data, next I would like to smooth out the data using windowed daily averages.
The function daily_windowed_avg(df, days) takes in a data frame df like df_us_daily, which resample_daily() computed. It then calculates the windowed daily average using windows of size days. The function then returns a copy of df with a new column named Avg containing this average. For days with no defined average, the function simply omits those days from the output.
def daily_window_one_df(df, days):
    # Windowed average for a single state's frame (not used in the demo below).
    df_new = df.sort_values(by="Timestamp").copy()
    df_new["Sums"] = df_new["Confirmed"]
    for k in range(1, days):
        df_new["Sums"].iloc[k:] += df_new["Confirmed"].iloc[:-k].values
    df_new["Sums"] /= days
    df_new.rename(columns={"Sums": "Avg"}, inplace=True)
    return df_new.iloc[days-1:]

def daily_windowed_avg(df, days):
    df_avg = df.sort_values(by="Timestamp") \
               .set_index("Timestamp") \
               .groupby("ST") \
               .rolling(days) \
               .mean() \
               .reset_index() \
               .rename(columns={"Confirmed": "Avg"}) \
               .dropna()
    return df_avg.merge(df, on=["ST", "Timestamp"])
# Demo of the function:
print('=== Two states: "AK" and "GA" ===')
is_ak_ga_before = df_us_daily["ST"].isin(["AK", "GA"])
display(df_us_daily[is_ak_ga_before])
print('=== My results (days=3) ===')
df_us_daily_avg = daily_windowed_avg(df_us_daily, 3)
is_ak_ga_after = df_us_daily_avg["ST"].isin(["AK", "GA"])
display(df_us_daily_avg[is_ak_ga_after])
=== Two states: "AK" and "GA" ===
Timestamp | Confirmed | ST | |
---|---|---|---|
0 | 2020-03-11 | 0 | AK |
1 | 2020-03-12 | 1 | AK |
2 | 2020-03-13 | 1 | AK |
3 | 2020-03-14 | 1 | AK |
4 | 2020-03-15 | 1 | AK |
... | ... | ... | ... |
3062 | 2020-12-29 | 632299 | GA |
3063 | 2020-12-30 | 630981 | GA |
3064 | 2020-12-31 | 640062 | GA |
3065 | 2021-01-01 | 651221 | GA |
3066 | 2021-01-02 | 665781 | GA |
596 rows × 3 columns
=== My results (days=3) ===
ST | Timestamp | Avg | Confirmed | |
---|---|---|---|---|
0 | AK | 2020-03-13 | 0.666667 | 1 |
1 | AK | 2020-03-14 | 1.000000 | 1 |
2 | AK | 2020-03-15 | 1.000000 | 1 |
3 | AK | 2020-03-16 | 1.000000 | 1 |
4 | AK | 2020-03-17 | 1.000000 | 1 |
... | ... | ... | ... | ... |
3042 | GA | 2020-12-29 | 627727.666667 | 632299 |
3043 | GA | 2020-12-30 | 630489.000000 | 630981 |
3044 | GA | 2020-12-31 | 634447.333333 | 640062 |
3045 | GA | 2021-01-01 | 640754.666667 | 651221 |
3046 | GA | 2021-01-02 | 652354.666667 | 665781 |
592 rows × 4 columns
As a spot check of the windowed average, the “AK” value on 2020-03-13 in the table above is (0 + 1 + 1)/3 ≈ 0.667, the mean of the first three daily counts. Here is a visualization of the daily averages, which should appear smoother, making the trends a little clearer.
viz_by_state("Avg", df_us_daily_avg, covid19_rankings[:TOP_K], logy=True)
Step 2: Flights analysis
In this step, I’ll apply a Markov chain-based model to rank states by their connectivity in the airport network, and then see how well that ranking correlates with the state-by-state numbers of confirmed COVID-19 cases.
First, I’m going to load flights data from 2020 into a DataFrame called flights.
Sources: As it happens, the US Bureau of Transportation Statistics collects data on all flights originating or arriving in the United States. The dataset is available here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
def load_flights(infile=get_path('data/764404810_T_ONTIME_REPORTING.csv')):
    keep_cols = ["FL_DATE", "ORIGIN_STATE_ABR", "DEST_STATE_ABR", "OP_UNIQUE_CARRIER", "OP_CARRIER_FL_NUM"]
    flights = pd.read_csv(infile)[keep_cols]
    us_sts = set(STATE_NAMES["Abbrv"])
    origin_is_state = flights['ORIGIN_STATE_ABR'].isin(us_sts)
    dest_is_state = flights['DEST_STATE_ABR'].isin(us_sts)
    return flights.loc[origin_is_state & dest_is_state].copy()
flights = load_flights()
print(f"There are {len(flights):,} direct flight segments in the `flights` data frame.")
print("Here are the first few:")
flights.head()
There are 600,349 direct flight segments in the `flights` data frame.
Here are the first few:
FL_DATE | ORIGIN_STATE_ABR | DEST_STATE_ABR | OP_UNIQUE_CARRIER | OP_CARRIER_FL_NUM | |
---|---|---|---|---|---|
0 | 2020-01-01 | CA | CA | WN | 5888 |
1 | 2020-01-01 | CA | CA | WN | 6276 |
2 | 2020-01-01 | CA | CA | WN | 4598 |
3 | 2020-01-01 | CA | CA | WN | 4761 |
4 | 2020-01-01 | CA | CA | WN | 5162 |
Here I’m going to define the outdegree of a state (e.g., Nevada, California) to be the total number of direct flight segments departing that state, including segments that stay within the same state. I’m going to use a pandas group-by-count aggregation to compute these outdegrees. The following function produces a data frame named outdegrees with two columns, the origin state (“Origin”) and the outdegree value (“Outdegree”), sorted in descending order of outdegree.
def calc_outdegrees(flights):
    outdegrees = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']] \
        .groupby(['ORIGIN_STATE_ABR']) \
        .count() \
        .reset_index() \
        .rename(columns={'ORIGIN_STATE_ABR': 'Origin',
                         'DEST_STATE_ABR': 'Outdegree'}) \
        .sort_values(by='Outdegree', ascending=False) \
        .reset_index(drop=True)
    return outdegrees
# Demo:
outdegrees = calc_outdegrees(flights)
print(f"There are {len(outdegrees)} states with a non-zero outdegree.")
print("Here are the first ten:")
outdegrees.head(10)
There are 49 states with a non-zero outdegree.
Here are the first ten:
Origin | Outdegree | |
---|---|---|
0 | CA | 66748 |
1 | TX | 64966 |
2 | FL | 52241 |
3 | GA | 33947 |
4 | IL | 33250 |
5 | NY | 30721 |
6 | NC | 28349 |
7 | CO | 23859 |
8 | VA | 21582 |
9 | AZ | 17921 |
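Since every flight segment in flights has exactly one origin (and both endpoints were already restricted to the 50 states when loading), the outdegrees should sum to the total number of segments. A quick check (a sketch; output not shown):
# Every retained flight segment is counted once in its origin state's outdegree
print(outdegrees['Outdegree'].sum(), "==", len(flights))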
State transition probabilities
To run the ranking analysis, I’m going to need to construct a probability transition matrix. For the state-to-state analysis, I wish to estimate the probability of going from state i to state j. Let’s define that probability to be the number of direct flight segments from state i to state j divided by the outdegree of state i.
The calc_state_trans_probs(flights, outdegrees) function below computes these state-to-state transition probabilities. The function accepts two data frames like flights and outdegrees as defined above and returns a new data frame with exactly these columns:
- “Origin”: The origin state, i.e., state i, as a two-letter abbreviation.
- “Dest”: The destination state, i.e., state j, as a two-letter abbreviation.
- “Count”: The number of direct flight segments from state i to state j.
- “TransProb”: The transition probability of going from state i to state j, i.e., the count divided by the outdegree.
def calc_state_trans_probs(flights, outdegrees):
    probs = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']] \
        .groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']) \
        .count() \
        .reset_index() \
        .rename(columns={'ORIGIN_STATE_ABR': 'Origin',
                         'DEST_STATE_ABR': 'Dest',
                         'FL_DATE': 'Count'}) \
        .merge(outdegrees, on='Origin', how='inner')
    probs['TransProb'] = probs['Count'] / probs['Outdegree']
    del probs['Outdegree']
    return probs
probs = calc_state_trans_probs(flights, outdegrees)
print(f"There are {len(probs)} state-to-state transition probabilities in my result.")
print("Here are ten with the largest transition probabilities:")
display(probs.sort_values(by="TransProb", ascending=False).head(10))
There are 1200 state-to-state transition probabilities in my result.
Here are ten with the largest transition probabilities:
Origin | Dest | Count | TransProb | |
---|---|---|---|---|
0 | AK | AK | 1872 | 0.631579 |
263 | HI | HI | 6064 | 0.601707 |
677 | ND | MN | 693 | 0.444516 |
618 | MS | TX | 587 | 0.438714 |
878 | OR | CA | 2717 | 0.437238 |
872 | OK | TX | 1457 | 0.432216 |
1190 | WY | CO | 368 | 0.428904 |
34 | AR | TX | 1058 | 0.426785 |
753 | NM | TX | 825 | 0.375854 |
425 | LA | TX | 2509 | 0.361527 |
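Because each origin’s transition probabilities are its flight counts divided by that origin’s total outdegree, they should sum to 1 for every origin. A quick check (a sketch; output not shown):
# Transition probabilities out of each origin should form a probability distribution
row_sums = probs.groupby('Origin')['TransProb'].sum()
print(row_sums.min(), row_sums.max())  # both should be (numerically) 1.0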
The next code cell runs the PageRank-style algorithm on the state-to-state airport network and produces a ranking.
def eval_markov_chain(P, x0, t_max):
    x = x0
    for t in range(t_max):
        x = P.T.dot(x)
    return x

def rank_states_by_air_network(probs, t_max=100):
    from numpy import array, zeros, ones, argsort, arange
    from scipy.sparse import coo_matrix
    from pandas import DataFrame

    # Create transition matrix
    unique_origins = set(probs['Origin'])
    unique_dests = set(probs['Dest'])
    unique_states = array(sorted(unique_origins | unique_dests))
    state_ids = {st: i for i, st in enumerate(unique_states)}
    num_states = max(state_ids.values()) + 1
    s2s = probs.copy()
    s2s['OriginID'] = s2s['Origin'].map(state_ids)
    s2s['DestID'] = s2s['Dest'].map(state_ids)
    P = coo_matrix((s2s['TransProb'], (s2s['OriginID'], s2s['DestID'])),
                   shape=(num_states, num_states))

    # Run ranking algorithm
    x0 = zeros(num_states)
    x0[state_ids['WA']] = 1.0  # First state to report confirmed COVID-19 cases
    x = eval_markov_chain(P, x0, t_max)

    # Produce a results table of rank-ordered states
    ranks = argsort(-x)
    df_ranks = DataFrame({'Rank': arange(1, len(ranks)+1),
                          'State': unique_states[ranks],
                          'x(t)': x[ranks]})
    df_ranks['ID'] = df_ranks['State'].map(state_ids)
    return df_ranks
print("Running the ranking algorithm...")
airnet_rankings = rank_states_by_air_network(probs)
print(f"==> Here are the top-{TOP_K} states:")
display(airnet_rankings.head(TOP_K))
Running the ranking algorithm...
==> Here are the top-10 states:
Rank | State | x(t) | ID | |
---|---|---|---|---|
0 | 1 | CA | 0.111133 | 4 |
1 | 2 | TX | 0.108272 | 41 |
2 | 3 | FL | 0.086968 | 7 |
3 | 4 | GA | 0.056547 | 8 |
4 | 5 | IL | 0.055431 | 12 |
5 | 6 | NY | 0.051167 | 32 |
6 | 7 | NC | 0.047241 | 25 |
7 | 8 | CO | 0.039737 | 5 |
8 | 9 | VA | 0.035947 | 43 |
9 | 10 | AZ | 0.029857 | 3 |
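As a quick sanity check on the Markov-chain interpretation (a sketch; output not shown): if every state appearing in the matrix has at least one outgoing segment, then P is row-stochastic and x(t) remains a probability distribution, so its entries should sum to approximately 1.
# If the transition matrix is row-stochastic, probability mass is conserved across iterations
print(f"Number of ranked states: {len(airnet_rankings)}")
print(f"Sum of x(t): {airnet_rankings['x(t)'].sum():.6f}")  # expect a value close to 1.0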
We now have a ranking of states by number of confirmed COVID-19 cases, as well as a separate ranking of states by air-network connectivity. To compare them, I’ll use a measure called rank-biased overlap (RBO). Very roughly speaking, it measures how much the top portions of two rankings overlap, with the most weight given to the highest-ranked entries, so a value closer to 1 means the top entries of the two rankings are more similar.
Note that I say “top few” above because RBO is parameterized by a “patience” parameter, which is related to how many of the top entries the reader will inspect before stopping. The reason for this parameter originates in the motivation for RBO, which was to measure the similarity between search engine results. The code I am using to calculate RBO uses this implementation.
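Before calling that implementation, here is a minimal sketch of the core quantity, just to build intuition. It is a truncated, non-extrapolated RBO: at each depth d, compute the fractional overlap of the two top-d prefixes, then take a p^(d-1)-weighted sum, normalized by (1 - p). The actual package used below also handles ties and extrapolates beyond the observed depth, so its numbers will differ somewhat from this toy version.
def rbo_sketch(list1, list2, p=0.95):
    # Truncated (non-extrapolated) rank-biased overlap of two ranked lists.
    # Agreement at depth d is |top-d of list1 ∩ top-d of list2| / d;
    # the score is the (1-p)-normalized, p^(d-1)-weighted sum of those agreements.
    depth = min(len(list1), len(list2))
    score = 0.0
    for d in range(1, depth + 1):
        agreement = len(set(list1[:d]) & set(list2[:d])) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score
For example, rbo_sketch(covid19_rankings, list(airnet_rankings['State'])) gives a number in the same general range as, but not identical to, the package’s result below.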
from rbo import rbo
compare_rankings = rbo(covid19_rankings, # ranking by confirmed COVID-19 cases
airnet_rankings['State'].values, # ranking by air-network connectivity
0.95) # "patience" parameter
print(f"Raw RBO result: {compare_rankings}\n\n==> RBO score is {compare_rankings.ext:.3}")
Raw RBO result: RBO(min=0.8121261555176027, res=0.02058258852563008, ext=0.8325850950430913)
==> RBO score is 0.833
We see an RBO score of 0.833, which suggests that the connectivity of the airport network may help explain the number of confirmed COVID-19 cases we are seeing in each state.
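A simpler, complementary way to see the same thing is to count how many states the two top-10 lists share; from the rankings printed above, 7 of the 10 states coincide (CA, TX, FL, NY, IL, GA, NC). A quick sketch to compute this directly:
# Raw overlap of the two top-10 lists (order ignored)
top10_overlap = set(covid19_rankings[:TOP_K]) & set(airnet_rankings['State'].head(TOP_K))
print(len(top10_overlap), "of the top", TOP_K, "states coincide:", sorted(top10_overlap))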