COVID-19 AND AIRPORTS

USA transportation networks and COVID-19

One major factor in the spread of infectious diseases like COVID-19 is the connectivity of our transportation networks. Therefore, let’s ask the following question in this problem: to what extent does the connectivity of the airport network help explain in which regions we have seen the most confirmed cases of COVID-19?

I’ll focus on the United States network and analyze data at the level of US states (e.g., Washington state, California, New York state). The analysis will have three main steps.

  1. I’ll start by inspecting some recent COVID-19 data on the number of confirmed cases over time, to see which states are seeing the most cases.
  2. Next, I’ll analyze the airport network to rank the states by their likelihood of seeing air traffic.
  3. Finally, I’ll compare the state ranking by incidence of COVID-19 with those by airport traffic, to see if there is any “correlation” between the two. I don’t expect perfect overlap in these rankings, but if there is substantial overlap, it would provide evidence for the role that air transportation networks play in the spread of the disease.

import sys
print(sys.version)

import pandas as pd
print(f"Pandas version: {pd.__version__}")

from matplotlib.pyplot import figure, plot, semilogy, grid, legend
%matplotlib inline
3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Pandas version: 1.0.1

Step 1: Inspecting COVID-19 incidence data by state

Researchers at Johns Hopkins University have been tallying the number of confirmed cases of COVID-19 over time. Let’s start by assembling the raw data for analysis.

Provenance of these data. JHU made these data available in this repo on GitHub. The data are stored in files, one for each day since January 22, 2020.

df0 = pd.read_csv('./data/01-22-2020.csv')
df0.head(5)

Province/State Country/Region Last Update Confirmed Deaths Recovered
0 Anhui Mainland China 1/22/2020 17:00 1.0 NaN NaN
1 Beijing Mainland China 1/22/2020 17:00 14.0 NaN NaN
2 Chongqing Mainland China 1/22/2020 17:00 6.0 NaN NaN
3 Fujian Mainland China 1/22/2020 17:00 1.0 NaN NaN
4 Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN
df1 = pd.read_csv('./data/03-11-2020.csv')
df1.head(5)

Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude
0 Hubei China 2020-03-11T10:53:02 67773 3046 49134 30.9756 112.2707
1 NaN Italy 2020-03-11T21:33:02 12462 827 1045 43.0000 12.0000
2 NaN Iran 2020-03-11T18:52:03 9000 354 2959 32.0000 53.0000
3 NaN Korea, South 2020-03-11T21:13:18 7755 60 288 36.0000 128.0000
4 France France 2020-03-11T22:53:03 2281 48 12 46.2276 2.2137
df2 = pd.read_csv('./data/03-22-2020.csv')
df2.head(5)

FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 36061.0 New York City New York US 3/22/20 23:45 40.767273 -73.971526 9654 63 0 0 New York City, New York, US
1 36059.0 Nassau New York US 3/22/20 23:45 40.740665 -73.589419 1900 4 0 0 Nassau, New York, US
2 36119.0 Westchester New York US 3/22/20 23:45 41.162784 -73.757417 1873 0 0 0 Westchester, New York, US
3 36103.0 Suffolk New York US 3/22/20 23:45 40.883201 -72.801217 1034 9 0 0 Suffolk, New York, US
4 36087.0 Rockland New York US 3/22/20 23:45 41.150279 -74.025605 455 1 0 0 Rockland, New York, US

Observe that the column conventions change over time, which will make working with these data quite messy if the differences aren’t reconciled.

I’m only interested in the following four columns:

  • “Province/State”
  • “Country/Region”
  • “Last Update”
  • “Confirmed”

There may be missing values, which read_csv() converts by default to NaN values. I’ll deal with this situation later in the notebook.

Also notice that each dataframe has a column named “Last Update”, which contains date and time values stored as strings. Moreover, the files appear to use different formats. Later, I’ll standardize these and use pandas’s to_datetime() to convert them into Python datetime objects. That makes them easier to compare (in code) and to do simple arithmetic on (e.g., calculate the number of days in between). The following code cells demonstrate these features.

print(type(df1['Last Update'].loc[0])) # Confirm that these values are strings

df0['Timestamp'] = pd.to_datetime(df0['Last Update'])
df1['Timestamp'] = pd.to_datetime(df1['Last Update'])
df1.head(5)
<class 'str'>

Province/State Country/Region Last Update Confirmed Deaths Recovered Latitude Longitude Timestamp
0 Hubei China 2020-03-11T10:53:02 67773 3046 49134 30.9756 112.2707 2020-03-11 10:53:02
1 NaN Italy 2020-03-11T21:33:02 12462 827 1045 43.0000 12.0000 2020-03-11 21:33:02
2 NaN Iran 2020-03-11T18:52:03 9000 354 2959 32.0000 53.0000 2020-03-11 18:52:03
3 NaN Korea, South 2020-03-11T21:13:18 7755 60 288 36.0000 128.0000 2020-03-11 21:13:18
4 France France 2020-03-11T22:53:03 2281 48 12 46.2276 2.2137 2020-03-11 22:53:03
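
Since the “Timestamp” values are now true datetime objects, comparisons and date arithmetic work directly. Here is a small illustrative check (an aside, not part of the main pipeline) of the day-counting mentioned above:

t_first = df0['Timestamp'].min()   # earliest timestamp in the Jan 22 file
t_later = df1['Timestamp'].max()   # latest timestamp in the Mar 11 file
print(t_later > t_first)           # datetime comparison works in code
print((t_later - t_first).days)    # number of days in between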

Here is a function to get a list of available daily data files by filename.

def get_path(filebase=""):
    DATA_PATH = ""
    return f"{DATA_PATH}{filebase}"

def get_covid19_daily_filenames(root=get_path("data/")):
    from os import listdir
    from os.path import isfile
    from re import match
    
    def covid19_filepath(filebase, root):
        return f"{root}{filebase}"
    
    def is_covid19_daily_file(filebase, root):
        file_path = covid19_filepath(filebase, root)
        return isfile(file_path) and match(r'^\d\d-\d\d-2020\.csv$', filebase)
    
    filenames = []
    for b in listdir(root):
        if is_covid19_daily_file(b, root):
            filenames.append(covid19_filepath(b, root))
    return sorted(filenames)

print(repr(get_covid19_daily_filenames()))
['data/01-22-2020.csv', 'data/01-23-2020.csv', 'data/01-24-2020.csv', 'data/01-25-2020.csv', 'data/01-26-2020.csv', 'data/01-27-2020.csv', 'data/01-28-2020.csv', 'data/01-29-2020.csv', 'data/01-30-2020.csv', 'data/01-31-2020.csv', 'data/02-01-2020.csv', 'data/02-02-2020.csv', 'data/02-03-2020.csv', 'data/02-04-2020.csv', 'data/02-05-2020.csv', 'data/02-06-2020.csv', 'data/02-07-2020.csv', 'data/02-08-2020.csv', 'data/02-09-2020.csv', 'data/02-10-2020.csv', 'data/02-11-2020.csv', 'data/02-12-2020.csv', 'data/02-13-2020.csv', 'data/02-14-2020.csv', 'data/02-15-2020.csv', 'data/02-16-2020.csv', 'data/02-17-2020.csv', 'data/02-18-2020.csv', 'data/02-19-2020.csv', 'data/02-20-2020.csv', 'data/02-21-2020.csv', 'data/02-22-2020.csv', 'data/02-23-2020.csv', 'data/02-24-2020.csv', 'data/02-25-2020.csv', 'data/02-26-2020.csv', 'data/02-27-2020.csv', 'data/02-28-2020.csv', 'data/02-29-2020.csv', 'data/03-01-2020.csv', 'data/03-02-2020.csv', 'data/03-03-2020.csv', 'data/03-04-2020.csv', 'data/03-05-2020.csv', 'data/03-06-2020.csv', 'data/03-07-2020.csv', 'data/03-08-2020.csv', 'data/03-09-2020.csv', 'data/03-10-2020.csv', 'data/03-11-2020.csv', 'data/03-12-2020.csv', 'data/03-13-2020.csv', 'data/03-14-2020.csv', 'data/03-15-2020.csv', 'data/03-16-2020.csv', 'data/03-17-2020.csv', 'data/03-18-2020.csv', 'data/03-19-2020.csv', 'data/03-20-2020.csv', 'data/03-21-2020.csv', 'data/03-22-2020.csv', 'data/03-23-2020.csv', 'data/03-24-2020.csv', 'data/03-25-2020.csv', 'data/03-26-2020.csv', 'data/03-27-2020.csv', 'data/03-28-2020.csv', 'data/03-29-2020.csv', 'data/03-30-2020.csv', 'data/03-31-2020.csv', 'data/04-01-2020.csv', 'data/04-02-2020.csv', 'data/04-03-2020.csv', 'data/04-04-2020.csv', 'data/04-05-2020.csv', 'data/04-06-2020.csv', 'data/04-07-2020.csv', 'data/04-08-2020.csv', 'data/04-09-2020.csv', 'data/04-10-2020.csv', 'data/04-11-2020.csv', 'data/04-12-2020.csv', 'data/04-13-2020.csv', 'data/04-14-2020.csv', 'data/04-15-2020.csv', 'data/04-16-2020.csv', 'data/04-17-2020.csv', 'data/04-18-2020.csv', 'data/04-19-2020.csv', 'data/04-20-2020.csv', 'data/04-21-2020.csv', 'data/04-22-2020.csv', 'data/04-23-2020.csv', 'data/04-24-2020.csv', 'data/04-25-2020.csv', 'data/04-26-2020.csv', 'data/04-27-2020.csv', 'data/04-28-2020.csv', 'data/04-29-2020.csv', 'data/04-30-2020.csv', 'data/05-01-2020.csv', 'data/05-02-2020.csv', 'data/05-03-2020.csv', 'data/05-04-2020.csv', 'data/05-05-2020.csv', 'data/05-06-2020.csv', 'data/05-07-2020.csv', 'data/05-08-2020.csv', 'data/05-09-2020.csv', 'data/05-10-2020.csv', 'data/05-11-2020.csv', 'data/05-12-2020.csv', 'data/05-13-2020.csv', 'data/05-14-2020.csv', 'data/05-15-2020.csv', 'data/05-16-2020.csv', 'data/05-17-2020.csv', 'data/05-18-2020.csv', 'data/05-19-2020.csv', 'data/05-20-2020.csv', 'data/05-21-2020.csv', 'data/05-22-2020.csv', 'data/05-23-2020.csv', 'data/05-24-2020.csv', 'data/05-25-2020.csv', 'data/05-26-2020.csv', 'data/05-27-2020.csv', 'data/05-28-2020.csv', 'data/05-29-2020.csv', 'data/05-30-2020.csv', 'data/05-31-2020.csv', 'data/06-01-2020.csv', 'data/06-02-2020.csv', 'data/06-03-2020.csv', 'data/06-04-2020.csv', 'data/06-05-2020.csv', 'data/06-06-2020.csv', 'data/06-07-2020.csv', 'data/06-08-2020.csv', 'data/06-09-2020.csv', 'data/06-10-2020.csv', 'data/06-11-2020.csv', 'data/06-12-2020.csv', 'data/06-13-2020.csv', 'data/06-14-2020.csv', 'data/06-15-2020.csv', 'data/06-16-2020.csv', 'data/06-17-2020.csv', 'data/06-18-2020.csv', 'data/06-19-2020.csv', 'data/06-20-2020.csv', 'data/06-21-2020.csv', 'data/06-22-2020.csv', 'data/06-23-2020.csv', 
'data/06-24-2020.csv', 'data/06-25-2020.csv', 'data/06-26-2020.csv', 'data/06-27-2020.csv', 'data/06-28-2020.csv', 'data/06-29-2020.csv', 'data/06-30-2020.csv', 'data/07-01-2020.csv', 'data/07-02-2020.csv', 'data/07-03-2020.csv', 'data/07-04-2020.csv', 'data/07-05-2020.csv', 'data/07-06-2020.csv', 'data/07-07-2020.csv', 'data/07-08-2020.csv', 'data/07-09-2020.csv', 'data/07-10-2020.csv', 'data/07-11-2020.csv', 'data/07-12-2020.csv', 'data/07-13-2020.csv', 'data/07-14-2020.csv', 'data/07-15-2020.csv', 'data/07-16-2020.csv', 'data/07-17-2020.csv', 'data/07-18-2020.csv', 'data/07-19-2020.csv', 'data/07-20-2020.csv', 'data/07-21-2020.csv', 'data/07-22-2020.csv', 'data/07-23-2020.csv', 'data/07-24-2020.csv', 'data/07-25-2020.csv', 'data/07-26-2020.csv', 'data/07-27-2020.csv', 'data/07-28-2020.csv', 'data/07-29-2020.csv', 'data/07-30-2020.csv', 'data/07-31-2020.csv', 'data/08-01-2020.csv', 'data/08-02-2020.csv', 'data/08-03-2020.csv', 'data/08-04-2020.csv', 'data/08-05-2020.csv', 'data/08-06-2020.csv', 'data/08-07-2020.csv', 'data/08-08-2020.csv', 'data/08-09-2020.csv', 'data/08-10-2020.csv', 'data/08-11-2020.csv', 'data/08-12-2020.csv', 'data/08-13-2020.csv', 'data/08-14-2020.csv', 'data/08-15-2020.csv', 'data/08-16-2020.csv', 'data/08-17-2020.csv', 'data/08-18-2020.csv', 'data/08-19-2020.csv', 'data/08-20-2020.csv', 'data/08-21-2020.csv', 'data/08-22-2020.csv', 'data/08-23-2020.csv', 'data/08-24-2020.csv', 'data/08-25-2020.csv', 'data/08-26-2020.csv', 'data/08-27-2020.csv', 'data/08-28-2020.csv', 'data/08-29-2020.csv', 'data/08-30-2020.csv', 'data/08-31-2020.csv', 'data/09-01-2020.csv', 'data/09-02-2020.csv', 'data/09-03-2020.csv', 'data/09-04-2020.csv', 'data/09-05-2020.csv', 'data/09-06-2020.csv', 'data/09-07-2020.csv', 'data/09-08-2020.csv', 'data/09-09-2020.csv', 'data/09-10-2020.csv', 'data/09-11-2020.csv', 'data/09-12-2020.csv', 'data/09-13-2020.csv', 'data/09-14-2020.csv', 'data/09-15-2020.csv', 'data/09-16-2020.csv', 'data/09-17-2020.csv', 'data/09-18-2020.csv', 'data/09-19-2020.csv', 'data/09-20-2020.csv', 'data/09-21-2020.csv', 'data/09-22-2020.csv', 'data/09-23-2020.csv', 'data/09-24-2020.csv', 'data/09-25-2020.csv', 'data/09-26-2020.csv', 'data/09-27-2020.csv', 'data/09-28-2020.csv', 'data/09-29-2020.csv', 'data/09-30-2020.csv', 'data/10-01-2020.csv', 'data/10-02-2020.csv', 'data/10-03-2020.csv', 'data/10-04-2020.csv', 'data/10-05-2020.csv', 'data/10-06-2020.csv', 'data/10-07-2020.csv', 'data/10-08-2020.csv', 'data/10-09-2020.csv', 'data/10-10-2020.csv', 'data/10-11-2020.csv', 'data/10-12-2020.csv', 'data/10-13-2020.csv', 'data/10-14-2020.csv', 'data/10-15-2020.csv', 'data/10-16-2020.csv', 'data/10-17-2020.csv', 'data/10-18-2020.csv', 'data/10-19-2020.csv', 'data/10-20-2020.csv', 'data/10-21-2020.csv', 'data/10-22-2020.csv', 'data/10-23-2020.csv', 'data/10-24-2020.csv', 'data/10-25-2020.csv', 'data/10-26-2020.csv', 'data/10-27-2020.csv', 'data/10-28-2020.csv', 'data/10-29-2020.csv', 'data/10-30-2020.csv', 'data/10-31-2020.csv', 'data/11-01-2020.csv', 'data/11-02-2020.csv', 'data/11-03-2020.csv', 'data/11-04-2020.csv', 'data/11-05-2020.csv', 'data/11-06-2020.csv', 'data/11-07-2020.csv', 'data/11-08-2020.csv', 'data/11-09-2020.csv', 'data/11-10-2020.csv', 'data/11-11-2020.csv', 'data/11-12-2020.csv', 'data/11-13-2020.csv', 'data/11-14-2020.csv', 'data/11-15-2020.csv', 'data/11-16-2020.csv', 'data/11-17-2020.csv', 'data/11-18-2020.csv', 'data/11-19-2020.csv', 'data/11-20-2020.csv', 'data/11-21-2020.csv', 'data/11-22-2020.csv', 'data/11-23-2020.csv', 'data/11-24-2020.csv', 
'data/11-25-2020.csv', 'data/11-26-2020.csv', 'data/11-27-2020.csv', 'data/11-28-2020.csv', 'data/11-29-2020.csv', 'data/11-30-2020.csv', 'data/12-01-2020.csv', 'data/12-02-2020.csv', 'data/12-03-2020.csv', 'data/12-04-2020.csv', 'data/12-05-2020.csv', 'data/12-06-2020.csv', 'data/12-07-2020.csv', 'data/12-08-2020.csv', 'data/12-09-2020.csv', 'data/12-10-2020.csv', 'data/12-11-2020.csv', 'data/12-12-2020.csv', 'data/12-13-2020.csv', 'data/12-14-2020.csv', 'data/12-15-2020.csv', 'data/12-16-2020.csv', 'data/12-17-2020.csv', 'data/12-18-2020.csv', 'data/12-19-2020.csv', 'data/12-20-2020.csv', 'data/12-21-2020.csv', 'data/12-22-2020.csv', 'data/12-23-2020.csv', 'data/12-24-2020.csv', 'data/12-25-2020.csv', 'data/12-26-2020.csv', 'data/12-27-2020.csv', 'data/12-28-2020.csv', 'data/12-29-2020.csv', 'data/12-30-2020.csv', 'data/12-31-2020.csv']

Data loading and cleaning

Given filenames, a list of filenames like the one generated by get_covid19_daily_filenames() above, the function load_covid19_daily_data(filenames) reads all of these files and combines them into a single pandas DataFrame containing only the following columns:

  • “Province/State”: Same contents as the original data frames.
  • “Country/Region”: Same contents as the original data frames.
  • “Confirmed”: Same contents as the original data frames.
  • “Timestamp”: The values from the “Last Update” columns, converted to datetime objects as demonstrated previously.

In addition, the code also does the following:

  1. If there are any duplicate rows, only one of them is retained.
  2. In the “Confirmed” column, any missing values are replaced with 0, and the column is converted to an integer type.
def load_covid19_daily_data(filenames):
    from pandas import read_csv, concat, to_datetime
    df_list = []
    for filename in filenames:
        # Normalize the newer column-name convention (e.g., "Province_State") to the original names
        df = read_csv(filename).rename(columns={"Province_State": "Province/State",
                                                "Country_Region": "Country/Region",
                                                "Last_Update": "Last Update"})
        df = df[["Province/State", "Country/Region", "Confirmed", "Last Update"]]
        df["Last Update"] = to_datetime(df["Last Update"])  # handles both timestamp formats
        df['Confirmed'] = df['Confirmed'].fillna(0).astype(int)  # missing counts -> 0, as integers
        df_list.append(df)
    df_combined = concat(df_list)
    df_combined.rename(columns={"Last Update": "Timestamp"}, inplace=True)
    df_combined.drop_duplicates(inplace=True)
    return df_combined.reset_index(drop=True)

# Demo of the function:
df_covid19 = load_covid19_daily_data(get_covid19_daily_filenames())

print(f"There are {len(df_covid19)} rows in my data frame.")
print("The first five are:")
display(df_covid19.head(5))

print("A random sample of five additional rows:")
df_covid19.sample(5).sort_index()
There are 923171 rows in my data frame.
The first five are:

Province/State Country/Region Confirmed Timestamp
0 Anhui Mainland China 1 2020-01-22 17:00:00
1 Beijing Mainland China 14 2020-01-22 17:00:00
2 Chongqing Mainland China 6 2020-01-22 17:00:00
3 Fujian Mainland China 1 2020-01-22 17:00:00
4 Gansu Mainland China 0 2020-01-22 17:00:00
A random sample of five additional rows:

Province/State Country/Region Confirmed Timestamp
31548 Georgia US 275 2020-04-11 22:45:33
53473 Georgia US 82 2020-04-24 03:30:50
220029 Himachal Pradesh India 839 2020-06-26 04:33:43
291538 Texas US 220 2020-07-18 04:34:45
294935 Oklahoma US 13 2020-07-19 04:34:58
df_covid19 = df_covid19.groupby(["Province/State", "Country/Region", "Timestamp"], as_index=False).sum()
             # ^^^ Above `.groupby()` needed because of a change in US reporting on March 22, 2020
df_covid19.sample(5)

Province/State Country/Region Timestamp Confirmed
25840 Colorado US 2020-04-29 02:32:29 14253
45884 Huila Colombia 2020-11-28 05:25:50 29357
5966 Aomori Japan 2020-07-10 04:34:24 31
40927 Gujarat India 2020-10-30 04:24:49 170878
27100 Cundinamarca Colombia 2020-07-18 04:34:45 5358

US state-by-state data

From now on, we’ll focus on confirmed cases in the US.

is_us = (df_covid19["Country/Region"] == "US")
df_covid19[is_us].sample(5)

Province/State Country/Region Timestamp Confirmed
96082 Puerto Rico US 2020-09-26 04:23:00 43587
129598 Virginia US 2020-08-09 04:34:54 98041
71903 Michigan US 2020-07-05 04:33:46 72200
81222 New York US 2020-09-10 04:29:01 443078
54787 Kansas US 2020-12-09 05:28:01 174945

Using Nevada as an example, here are all the rows associated with “Nevada”.

is_nevada = (df_covid19["Province/State"] == "Nevada")
df_covid19[is_us & is_nevada]

Province/State Country/Region Timestamp Confirmed
78957 Nevada US 2020-03-10 02:33:04 4
78958 Nevada US 2020-03-11 20:00:00 17
78959 Nevada US 2020-03-11 22:53:03 7
78960 Nevada US 2020-03-12 23:44:33 14
78961 Nevada US 2020-03-14 22:33:03 45
... ... ... ... ...
79248 Nevada US 2020-12-28 05:22:06 217509
79249 Nevada US 2020-12-29 05:22:37 218377
79250 Nevada US 2020-12-30 05:22:34 220124
79251 Nevada US 2020-12-31 05:22:49 222595
79252 Nevada US 2021-01-01 05:23:07 224731

296 rows × 4 columns

Given these data, we can order by timestamp and plot confirmed cases over time.

df_covid19[is_us & is_nevada] \
    .sort_values(by="Timestamp") \
    .plot(x="Timestamp", y="Confirmed", figsize=(16, 9), style='*--')
grid()

[Figure: confirmed COVID-19 cases over time in Nevada]

STATE_NAMES = pd.read_csv(get_path('data/us_states.csv'))
print(f"There are {len(STATE_NAMES)} US states. The first and last three, along with their two-letter postal code abbreviations, are as follows (in alphabetical order):")
display(STATE_NAMES.head(3))
print("...")
display(STATE_NAMES.tail(3))
There are 50 US states. The first and last three, along with their two-letter postal code abbreviations, are as follows (in alphabetical order):

Name Abbrv
0 Alabama AL
1 Alaska AK
2 Arizona AZ
...

Name Abbrv
47 West Virginia WV
48 Wisconsin WI
49 Wyoming WY

The function below, get_us_states(df), does the following:

  • takes as input a data frame structured like the combined COVID-19 data frame (df_covid19), having the columns “Province/State”, “Country/Region”, “Confirmed”, and “Timestamp”;
  • and returns a data frame containing only those rows of df that are from the United States and whose “Province/State” field is exactly the name of one of the US states.

The data frame returned by the function has only these three columns:

  • “Confirmed”: The number of confirmed cases, taken from the input df.
  • “Timestamp”: The timestamp, taken from the input df.
  • “ST”: The two-letter abbreviation of the state’s name.

def get_us_states(df):
    is_us = df['Country/Region'] == 'US'
    names = STATE_NAMES["Name"]
    is_us_state = is_us & df['Province/State'].isin(names)
    abbrvs = STATE_NAMES["Abbrv"]
    name2abbrv = {name: st for name, st in zip(names, abbrvs)}
    df_us = df[is_us_state].copy()
    df_us['ST'] = df_us['Province/State'].map(name2abbrv)
    del df_us["Province/State"]
    del df_us["Country/Region"]
    return df_us
df_covid19_us = get_us_states(df_covid19)
df_covid19_us.sample(5).sort_values(by=["ST", "Timestamp"])

Timestamp Confirmed ST
47202 2020-03-30 22:52:00 4959 IL
71886 2020-06-18 04:33:18 66088 MI
73276 2020-10-20 04:24:22 110592 MS
89795 2020-04-16 23:30:51 1691 OR
128226 2020-06-22 04:33:20 1159 VT

Ranking by confirmed cases

The function rank_states_by_cases(df) takes a data frame like df_covid19_us and returns a Python list of states in decreasing order of the maximum number of confirmed cases in that state.

def rank_states_by_cases(df):
    return df.groupby("ST").max().sort_values(by="Confirmed", ascending=False).index.tolist()

covid19_rankings = rank_states_by_cases(df_covid19_us)

print(f"Computed ranking:\n==> {repr(covid19_rankings)}\n")
Computed ranking:
==> ['CA', 'TX', 'FL', 'NY', 'IL', 'OH', 'GA', 'PA', 'TN', 'NC', 'NJ', 'MI', 'WI', 'AZ', 'IN', 'MN', 'MO', 'MA', 'AL', 'VA', 'CO', 'LA', 'SC', 'OK', 'IA', 'MD', 'UT', 'KY', 'WA', 'AR', 'NV', 'KS', 'MS', 'CT', 'NE', 'NM', 'ID', 'OR', 'SD', 'ND', 'RI', 'WV', 'MT', 'DE', 'AK', 'WY', 'NH', 'ME', 'HI', 'VT']
df_covid19_us.head()

Timestamp Confirmed ST
1340 2020-03-11 20:00:00 5 AL
1341 2020-03-14 16:53:03 6 AL
1342 2020-03-15 18:20:19 12 AL
1343 2020-03-16 22:33:03 29 AL
1344 2020-03-17 23:13:10 39 AL

Visualization

Let’s plot the top 10 states by number of confirmed cases. The y-axis uses a logarithmic scale in this plot.

def viz_by_state(col, df, states, figsize=(16, 9), logy=False):
    from matplotlib.pyplot import figure, plot, semilogy, legend, grid
    figure(figsize=figsize)
    plotter = plot if not logy else semilogy
    for s in states:
        df0 = df[df["ST"] == s].sort_values(by="Timestamp")
        plotter(df0["Timestamp"], df0[col], "o:")
    legend(states)
    grid()
    
TOP_K = 10
viz_by_state("Confirmed", df_covid19_us, covid19_rankings[:TOP_K], logy=True)

[Figure: confirmed cases over time for the top 10 states, log scale]

Observe that this data is irregularly sampled and noisy. For instance, the updates do not occur every day in every state, and there are spikes due to reporting errors. Below I will try to smooth out the data before plotting it, to help discern the overall trends better.

Filling-in missing values

First I’m going to impute the missing daily values, so that there is at least one value per day. To see the issue more clearly, consider the data for the state of Nevada:

df_covid19_us[df_covid19_us["ST"] == "NV"].sort_values(by="Timestamp")

Timestamp Confirmed ST
78957 2020-03-10 02:33:04 4 NV
78958 2020-03-11 20:00:00 17 NV
78959 2020-03-11 22:53:03 7 NV
78960 2020-03-12 23:44:33 14 NV
78961 2020-03-14 22:33:03 45 NV
... ... ... ...
79248 2020-12-28 05:22:06 217509 NV
79249 2020-12-29 05:22:37 218377 NV
79250 2020-12-30 05:22:34 220124 NV
79251 2020-12-31 05:22:49 222595 NV
79252 2021-01-01 05:23:07 224731 NV

296 rows × 3 columns

There are two observations on March 11 and no observations on March 13. I would like one value per day for every state, so I’m going to resample the values using pandas’s built-in resample(), a standard cleaning step for irregularly sampled time-series data. The function below implements this, storing the results in a data frame called df_us_daily.

def resample_daily(df):
    # This implementation is a bit weird, due to a known issue: https://github.com/pandas-dev/pandas/issues/28313
    df_r = df.sort_values(by=["ST", "Timestamp"]) \
             .set_index("Timestamp") \
             .groupby("ST", group_keys=False) \
             .resample("1D", closed="right") \
             .ffill() \
             .reset_index()
    return df_r.sort_values(by=["ST", "Timestamp"]).reset_index(drop=True)
    
df_us_daily = resample_daily(df_covid19_us)
df_us_daily[df_us_daily["ST"] == "NV"]

Timestamp Confirmed ST
9668 2020-03-11 4 NV
9669 2020-03-12 7 NV
9670 2020-03-13 14 NV
9671 2020-03-14 14 NV
9672 2020-03-15 45 NV
... ... ... ...
9961 2020-12-29 217509 NV
9962 2020-12-30 218377 NV
9963 2020-12-31 220124 NV
9964 2021-01-01 222595 NV
9965 2021-01-02 224731 NV

298 rows × 3 columns

Observe how there are now samples on every consecutive day beginning on March 11.
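
As a quick sanity check of that claim (an illustrative aside, not part of the main pipeline), one can verify that consecutive timestamps within each state are now exactly one day apart:

gaps = df_us_daily.sort_values(by=["ST", "Timestamp"]) \
                  .groupby("ST")["Timestamp"] \
                  .diff() \
                  .dropna()
assert (gaps == pd.Timedelta(days=1)).all(), "found a gap larger than one day"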

Windowed daily averages

Armed with regularly sampled data, next I would like to smooth out the data using windowed daily averages.

The function daily_windowed_avg(df, days) takes in a data frame df like df_us_daily, which resample_daily() computed. It then calculates the windowed daily average using windows of size days. The function then returns a copy of df with a new column named Avg containing this average. For days with no defined average, the function simply omits those days from the output.

def daily_window_one_df(df, days):
    # Manual rolling-sum implementation for a single state's rows.
    # (An alternative to the `rolling()`-based version in daily_windowed_avg() below.)
    df_new = df.sort_values(by="Timestamp").copy()
    confirmed = df_new["Confirmed"].to_numpy()
    sums = confirmed.astype(float).copy()
    for k in range(1, days):
        sums[k:] += confirmed[:-k]          # add the value from k days earlier
    df_new["Avg"] = sums / days
    return df_new.iloc[days-1:]             # drop days with no fully-defined window

def daily_windowed_avg(df, days):
    df_avg = df.sort_values(by="Timestamp") \
               .set_index("Timestamp") \
               .groupby("ST") \
               .rolling(days) \
               .mean() \
               .reset_index() \
               .rename(columns={"Confirmed": "Avg"}) \
               .dropna()
    return df_avg.merge(df, on=["ST", "Timestamp"])
# Demo of the function:
print('=== Two states: "AK" and "GA" ===')
is_ak_ga_before = df_us_daily["ST"].isin(["AK", "GA"])
display(df_us_daily[is_ak_ga_before])

print('=== My results (days=3) ===')
df_us_daily_avg = daily_windowed_avg(df_us_daily, 3)
is_ak_ga_after = df_us_daily_avg["ST"].isin(["AK", "GA"])
display(df_us_daily_avg[is_ak_ga_after])
=== Two states: "AK" and "GA" ===

Timestamp Confirmed ST
0 2020-03-11 0 AK
1 2020-03-12 1 AK
2 2020-03-13 1 AK
3 2020-03-14 1 AK
4 2020-03-15 1 AK
... ... ... ...
3062 2020-12-29 632299 GA
3063 2020-12-30 630981 GA
3064 2020-12-31 640062 GA
3065 2021-01-01 651221 GA
3066 2021-01-02 665781 GA

596 rows × 3 columns

=== My results (days=3) ===

ST Timestamp Avg Confirmed
0 AK 2020-03-13 0.666667 1
1 AK 2020-03-14 1.000000 1
2 AK 2020-03-15 1.000000 1
3 AK 2020-03-16 1.000000 1
4 AK 2020-03-17 1.000000 1
... ... ... ... ...
3042 GA 2020-12-29 627727.666667 632299
3043 GA 2020-12-30 630489.000000 630981
3044 GA 2020-12-31 634447.333333 640062
3045 GA 2021-01-01 640754.666667 651221
3046 GA 2021-01-02 652354.666667 665781

592 rows × 4 columns

Here is a visualization of the daily averages, which should appear smoother and make the trends a little clearer.

viz_by_state("Avg", df_us_daily_avg, covid19_rankings[:TOP_K], logy=True)

[Figure: 3-day windowed average of confirmed cases for the top 10 states, log scale]

Step 2: Flights analysis

In the remaining steps of the analysis, I’ll apply a Markov chain-based (PageRank-style) model to rank states by their connectivity in the airport network, and then see how well that ranking correlates with the state-by-state numbers of confirmed COVID-19 cases.
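
Before building the real transition matrix, here is a minimal sketch of the ranking idea on a toy two-state chain (purely illustrative; the numbers are made up, and the actual matrix is constructed from flight counts below). Repeatedly applying x ← P^T x drives x toward the chain’s steady-state distribution, and states that accumulate more probability mass rank higher.

import numpy as np

P_toy = np.array([[0.9, 0.1],    # toy transition matrix; each row sums to 1
                  [0.5, 0.5]])
x = np.array([1.0, 0.0])         # start with all probability mass in state 0
for _ in range(50):
    x = P_toy.T @ x              # one step of the chain
print(x)                         # approaches the steady state, ~[0.833, 0.167]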

First, I’m going to load flights data from 2020 into a DataFrame called flights.

Sources: As it happens, the US Bureau of Transportation Statistics collects data on all flights originating or arriving in the United States. The dataset is available here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

def load_flights(infile=get_path('data/764404810_T_ONTIME_REPORTING.csv')):
    keep_cols = ["FL_DATE", "ORIGIN_STATE_ABR", "DEST_STATE_ABR", "OP_UNIQUE_CARRIER", "OP_CARRIER_FL_NUM"]
    flights = pd.read_csv(infile)[keep_cols]
    us_sts = set(STATE_NAMES["Abbrv"])
    origin_is_state = flights['ORIGIN_STATE_ABR'].isin(us_sts)
    dest_is_state = flights['DEST_STATE_ABR'].isin(us_sts)
    return flights.loc[origin_is_state & dest_is_state].copy()

flights = load_flights()
print(f"There are {len(flights):,} direct flight segments in the `flights` data frame.")
print("Here are the first few:")
flights.head()
There are 600,349 direct flight segments in the `flights` data frame.
Here are the first few:

FL_DATE ORIGIN_STATE_ABR DEST_STATE_ABR OP_UNIQUE_CARRIER OP_CARRIER_FL_NUM
0 2020-01-01 CA CA WN 5888
1 2020-01-01 CA CA WN 6276
2 2020-01-01 CA CA WN 4598
3 2020-01-01 CA CA WN 4761
4 2020-01-01 CA CA WN 5162

Here I’m going to define the outdegree of a state (e.g., Nevada, California) as the total number of direct flight segments originating in that state (including intra-state segments). I’m going to use pandas group-by-count aggregation to compute these outdegrees. The following function produces a data frame named outdegrees with two columns, the origin state (“Origin”) and its outdegree value (“Outdegree”), sorted in descending order of outdegree.

def calc_outdegrees(flights):
    outdegrees = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']] \
                 .groupby(['ORIGIN_STATE_ABR']) \
                 .count() \
                 .reset_index() \
                 .rename(columns={'ORIGIN_STATE_ABR': 'Origin',
                                  'DEST_STATE_ABR': 'Outdegree'}) \
                 .sort_values(by='Outdegree', ascending=False) \
                 .reset_index(drop=True)
    return outdegrees

# Demo:
outdegrees = calc_outdegrees(flights)
print(f"There are {len(outdegrees)} states with a non-zero outdegree.")
print("Here are the first ten:")
outdegrees.head(10)
There are 49 states with a non-zero outdegree.
Here are the first ten:

Origin Outdegree
0 CA 66748
1 TX 64966
2 FL 52241
3 GA 33947
4 IL 33250
5 NY 30721
6 NC 28349
7 CO 23859
8 VA 21582
9 AZ 17921

State transition probabilities

To run the ranking analysis, I’m going to need to construct a probability transition matrix. For the state-to-state analysis, I wish to estimate the probability of going from state i to state j. Let’s define that probability to be the number of direct flight segments from state i to state j divided by the outdegree of state i.

The calc_state_trans_probs(flights, outdegrees) function below computes these state-to-state transition probabilities. The function accepts two data frames like flights and outdegrees as defined above and returns a new data frame with exactly these columns:

  • “Origin”: The origin state, i.e., state i , as a two-letter abbreviation.
  • “Dest”: The destination state, i.e., state j , as a two-letter abbreviation.
  • “Count”: The number of direct flight segments from state i to state j .
  • “TransProb”: The transition probability of going from state i to state j , i.e., the count divided by the outdegree.
def calc_state_trans_probs(flights, outdegrees):
    probs = flights[['ORIGIN_STATE_ABR', 'DEST_STATE_ABR', 'FL_DATE']] \
            .groupby(['ORIGIN_STATE_ABR', 'DEST_STATE_ABR']) \
            .count() \
            .reset_index() \
            .rename(columns={'ORIGIN_STATE_ABR': 'Origin',
                             'DEST_STATE_ABR': 'Dest',
                             'FL_DATE': 'Count'}) \
            .merge(outdegrees, on='Origin', how='inner')
    probs['TransProb'] = probs['Count'] / probs['Outdegree']
    del probs['Outdegree']
    return probs
probs = calc_state_trans_probs(flights, outdegrees)
print(f"There are {len(probs)} state-to-state transition probabilities in my result.")
print("Here are ten with the largest transition probabilities:")
display(probs.sort_values(by="TransProb", ascending=False).head(10))
There are 1200 state-to-state transition probabilities in my result.
Here are ten with the largest transition probabilities:

Origin Dest Count TransProb
0 AK AK 1872 0.631579
263 HI HI 6064 0.601707
677 ND MN 693 0.444516
618 MS TX 587 0.438714
878 OR CA 2717 0.437238
872 OK TX 1457 0.432216
1190 WY CO 368 0.428904
34 AR TX 1058 0.426785
753 NM TX 825 0.375854
425 LA TX 2509 0.361527
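
As a quick, illustrative sanity check, the outgoing transition probabilities from each origin state should sum to (approximately) 1:

row_sums = probs.groupby("Origin")["TransProb"].sum()
print(row_sums.min(), row_sums.max())   # both should be ~1.0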

The next code cell runs the PageRank-style algorithm on the state-to-state airport network and produces a ranking.

def eval_markov_chain(P, x0, t_max):
    # Power iteration: repeatedly apply x <- P^T x, where P is the row-stochastic
    # transition matrix and x is a probability distribution over states.
    x = x0
    for t in range(t_max):
        x = P.T.dot(x)
    return x

def rank_states_by_air_network(probs, t_max=100):
    from numpy import array, zeros, ones, argsort, arange
    from scipy.sparse import coo_matrix
    from pandas import DataFrame

    # Create transition matrix
    unique_origins = set(probs['Origin'])
    unique_dests = set(probs['Dest'])
    unique_states = array(sorted(unique_origins | unique_dests))
    state_ids = {st: i for i, st in enumerate(unique_states)}
    num_states = max(state_ids.values()) + 1
    
    s2s = probs.copy()
    s2s['OriginID'] = s2s['Origin'].map(state_ids)
    s2s['DestID'] = s2s['Dest'].map(state_ids)
    
    P = coo_matrix((s2s['TransProb'], (s2s['OriginID'], s2s['DestID'])),
                   shape=(num_states, num_states))

    # Run ranking algorithm
    x0 = zeros(num_states)
    x0[state_ids['WA']] = 1.0 # First state to report confirmed COVID-19 cases

    x = eval_markov_chain(P, x0, t_max)

    # Produce a results table of rank-ordered states
    ranks = argsort(-x)
    df_ranks = DataFrame({'Rank': arange(1, len(ranks)+1),
                          'State': unique_states[ranks],
                          'x(t)': x[ranks]})
    df_ranks['ID'] = df_ranks['State'].map(state_ids)
        
    return df_ranks

print("Running the ranking algorithm...")
airnet_rankings = rank_states_by_air_network(probs)

print(f"==> Here are the top-{TOP_K} states:")
display(airnet_rankings.head(TOP_K))
Running the ranking algorithm...
==> Here are the top-10 states:

Rank State x(t) ID
0 1 CA 0.111133 4
1 2 TX 0.108272 41
2 3 FL 0.086968 7
3 4 GA 0.056547 8
4 5 IL 0.055431 12
5 6 NY 0.051167 32
6 7 NC 0.047241 25
7 8 CO 0.039737 5
8 9 VA 0.035947 43
9 10 AZ 0.029857 3

We now have a ranking of states by number of confirmed COVID-19 cases, as well as a separate ranking of states by air-network connectivity. To compare them, I’ll use a measure called rank-biased overlap (RBO). Very roughly speaking, this measure is an estimate of the probability that a reader comparing the top few entries of two rankings tends to encounter the same items, so a value closer to 1 means the top entries of the two rankings are more similar.

Note that I say “top few” above because RBO is parameterized by a “patience” parameter, which is related to how many of the top entries the reader will inspect before stopping. The reason for this parameter originates in the motivation for RBO, which was to measure the similarity between search engine results. The code I am using to calculate RBO uses this implementation.
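
To make the “top few” intuition concrete, here is a small illustrative calculation (not the RBO formula itself): the fraction of states shared by the top-d entries of the two rankings, at a few depths d.

airnet_list = list(airnet_rankings['State'].values)
for d in (5, 10, 20):
    shared = set(covid19_rankings[:d]) & set(airnet_list[:d])
    print(f"top-{d} agreement: {len(shared) / d:.2f}")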

from rbo import rbo

compare_rankings = rbo(covid19_rankings, # ranking by confirmed COVID-19 cases
                           airnet_rankings['State'].values, # ranking by air-network connectivity
                           0.95) # "patience" parameter
print(f"Raw RBO result: {compare_rankings}\n\n==> RBO score is {compare_rankings.ext:.3}")
Raw RBO result: RBO(min=0.8121261555176027, res=0.02058258852563008, ext=0.8325850950430913)

==> RBO score is 0.833

We see an RBO score of 0.833, which suggests that the connectivity of the airport network may help explain the number of confirmed COVID-19 cases we are seeing in each state.
