
Climate Migration

Please visit https://mhberro.github.io/ to interact with the Climate Migration tool. Drag the “Select Year” slider to change the prediction year and click on the shaded regions to display the climate index for that region.

Problem Statement

Migration due to climate change is already occurring in the U.S. and around the world. It can be driven by many factors, such as recurrent flooding, rising temperature and humidity, and repeated fire outbreaks. As climate change continues, many cities will become uninhabitable. The goal of this project is to create an interactive visualization of the effects of climate change on the U.S. over the next 50 years and to recommend better areas to move to based on predicted climate values.

Data

I retrieved the climate data by scraping the National Oceanic and Atmospheric Administration (NOAA) API. The NOAA database has daily climate information dating back to the 1950s for different zip codes. I initially planned to scrape all 42,000 zip codes in the U.S., but the NOAA API has a strict limit on requests per day, and it would take roughly 25,500 days to scrape every zip code. Instead, I used a subset of about 1,000 zip codes, keyed on the first three digits (the zip code prefix), to cover all the regions in the U.S. Climate data is regional and doesn't change drastically within small areas, so this gives a good approximation of the climate for all zip codes from a smaller (but still quite large), more manageable subset. The full dataset is over 70 million records, stored in MySQL on an EC2 instance in AWS.
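For illustration, here is a minimal sketch of the kind of request involved, using NOAA's public CDO v2 web API; the token, zip code, and date range below are placeholders, not the actual scraper's values.

import requests

NOAA_TOKEN = 'YOUR_NOAA_TOKEN'  # placeholder; free token from the NOAA CDO site
BASE_URL = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data'

def fetch_daily_weather(zip_code, start, end):
    """Fetch daily GHCND records (PRCP, SNOW, TMAX, ...) for one zip code."""
    params = {
        'datasetid': 'GHCND',            # daily summaries dataset
        'locationid': f'ZIP:{zip_code}', # e.g. 'ZIP:00601'
        'startdate': start,              # 'YYYY-MM-DD'
        'enddate': end,
        'limit': 1000,                   # max records per request
    }
    resp = requests.get(BASE_URL, headers={'token': NOAA_TOKEN}, params=params)
    resp.raise_for_status()
    return resp.json().get('results', [])

# e.g. fetch_daily_weather('00601', '1950-01-01', '1950-12-31')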

Below is the model used to make wildfire predictions for the Climate Migration tool. It is one of several models that feed the overall climate index prediction for the Climate Migration app. The weather data was obtained by scraping the NOAA website.

import pandas as pd
import zipfile
import numpy as np
import random

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn import preprocessing, metrics
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

Content

1. Clean data

    1. Clean weather data
    2. Clean disaster data and extract fire data
    3. Join weather and fire data

2. Train model

    1. Train test split
    2. Scale and downsample
    3. MLP classifier
    4. Random forest classifier

3. Predict future fire

    1. Load predicted weather data
    2. Scale predicted weather data
    3. Predict fire

Clean data

1) Clean weather data

zp = zipfile.ZipFile('cleaned_data/all_zip_codes_all.zip')
weather_df = pd.read_csv(zp.open("all_zip_codes.csv"), converters={'zip': lambda x: str(x)}) #include leading zeros
weather_df.head()
#zip is same as zipcode, except that zip includes leading zeros

date zipcode zip PRCP SNOW SNWD TMAX TMIN TOBS
0 1950-01-01 601 00601 18.0 0.0 0.0 183.0 128.0 156.0
1 1950-01-02 601 00601 18.0 0.0 0.0 194.0 122.0 167.0
2 1950-01-03 601 00601 0.0 0.0 0.0 211.0 133.0 183.0
3 1950-01-04 601 00601 0.0 0.0 0.0 200.0 144.0 167.0
4 1950-01-05 601 00601 0.0 0.0 0.0 217.0 139.0 167.0
weather_df.tail()

date zipcode zip PRCP SNOW SNWD TMAX TMIN TOBS
10955519 2020-10-22 39701 39701 0.0 0.0 0.0 272.0 183.0 233.0
10955520 2020-10-23 39701 39701 0.0 0.0 0.0 289.0 206.0 222.0
10955521 2020-10-26 39701 39701 0.0 0.0 0.0 206.0 133.0 172.0
10955522 2020-10-27 39701 39701 0.0 0.0 0.0 256.0 156.0 161.0
10955523 2020-10-28 39701 39701 0.0 0.0 0.0 250.0 161.0 228.0
weather_df.columns
Index(['date', 'zipcode', 'zip', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN',
       'TOBS'],
      dtype='object')
len(weather_df)
10955524
weather_df['zipcode'].nunique()
652
weather_df['zipcode'].unique()[:100]
array([  601,   703,   901,  1201,  1301,  1420,  1501,  1602,  1701,
        1801,  2019,  2108,  2301,  2420,  2601,  3031,  3101,  3301,
        3431,  3561,  3740,  3801,  3901,  4001,  4101,  4210,  4330,
        4401,  4530,  4605,  4730,  4841,  4901,  5001,  5101,  5201,
        5301,  5401,  5602,  5701,  5819,  6001,  6103,  6226,  6320,
        6401,  6604,  6801,  7001,  7102,  7302,  7501,  7701,  7901,
        8201,  8302,  8401,  8701,  8801,  8901, 10801, 10901, 11001,
       11101, 11501, 11701, 89001, 89101, 89301, 89701, 89801, 90401,
       90501, 90601, 90701, 90802, 91001, 91101, 91201, 91301, 91401,
       91501, 91601, 91701, 91901, 92101, 92201, 92301, 92501, 92602,
       92701, 93001, 93101, 93401, 93501, 93601, 93901, 94002, 94102,
       94301])
weather_df['zip'].nunique()
652
weather_df['zip'].unique()[:100]
array(['00601', '00703', '00901', '01201', '01301', '01420', '01501',
       '01602', '01701', '01801', '02019', '02108', '02301', '02420',
       '02601', '03031', '03101', '03301', '03431', '03561', '03740',
       '03801', '03901', '04001', '04101', '04210', '04330', '04401',
       '04530', '04605', '04730', '04841', '04901', '05001', '05101',
       '05201', '05301', '05401', '05602', '05701', '05819', '06001',
       '06103', '06226', '06320', '06401', '06604', '06801', '07001',
       '07102', '07302', '07501', '07701', '07901', '08201', '08302',
       '08401', '08701', '08801', '08901', '10801', '10901', '11001',
       '11101', '11501', '11701', '89001', '89101', '89301', '89701',
       '89801', '90401', '90501', '90601', '90701', '90802', '91001',
       '91101', '91201', '91301', '91401', '91501', '91601', '91701',
       '91901', '92101', '92201', '92301', '92501', '92602', '92701',
       '93001', '93101', '93401', '93501', '93601', '93901', '94002',
       '94102', '94301'], dtype=object)
weather_df.dtypes
date        object
zipcode      int64
zip         object
PRCP       float64
SNOW       float64
SNWD       float64
TMAX       float64
TMIN       float64
TOBS       float64
dtype: object
# map zip code to fips code for weather data
zip_fips_df = pd.read_csv("cleaned_data/ZIP-COUNTY-FIPS_2017-06.csv")
zip_fips_df['ZIP'].unique()
array([36003, 36006, 36067, ...,   820,   830,   802])
weather_fips_df = weather_df.merge(zip_fips_df, how = 'left', left_on = 'zipcode', right_on='ZIP')
#get weather data with fips code
weather_fips_df.head()

date zipcode zip PRCP SNOW SNWD TMAX TMIN TOBS ZIP COUNTYNAME STATE FIPS code
0 1950-01-01 601 00601 18.0 0.0 0.0 183.0 128.0 156.0 601 Adjuntas Municipio PR 72001
1 1950-01-01 601 00601 18.0 0.0 0.0 183.0 128.0 156.0 601 Ponce Municipio PR 72113
2 1950-01-02 601 00601 18.0 0.0 0.0 194.0 122.0 167.0 601 Adjuntas Municipio PR 72001
3 1950-01-02 601 00601 18.0 0.0 0.0 194.0 122.0 167.0 601 Ponce Municipio PR 72113
4 1950-01-03 601 00601 0.0 0.0 0.0 211.0 133.0 183.0 601 Adjuntas Municipio PR 72001
len(weather_fips_df)
16380869

2) Clean disaster data and extract fire data

# data description: https://www.fema.gov/openfema-data-page/disaster-declarations-summaries-v2
disaster_df = pd.read_csv("cleaned_data/DisasterDeclarationsSummaries_1.1.csv")
disaster_df.columns
Index(['femaDeclarationString', 'disasterNumber', 'state', 'declarationType',
       'declarationDate', 'fyDeclared', 'incidentType', 'declarationTitle',
       'ihProgramDeclared', 'iaProgramDeclared', 'paProgramDeclared',
       'hmProgramDeclared', 'incidentBeginDate', 'incidentEndDate',
       'disasterCloseoutDate', 'fipsStateCode', 'helperState',
       'fipsCountyCode', 'helperCounty', 'fipscode', 'placeCode',
       'designatedArea', 'declarationRequestNumber', 'hash', 'lastRefresh',
       'id'],
      dtype='object')
disaster_df.head()

femaDeclarationString disasterNumber state declarationType declarationDate fyDeclared incidentType declarationTitle ihProgramDeclared iaProgramDeclared ... helperState fipsCountyCode helperCounty fipscode placeCode designatedArea declarationRequestNumber hash lastRefresh id
0 DR-1-GA 1 GA DR 1953-05-02T04:00:00.000Z 1953 Tornado TORNADO 0 1 ... 13 0 0 13000 0 Statewide 53013 2f28952448e0a666d367ca3f854c81ec 2020-10-05T14:21:20.694Z 5f7b2be031a8c6681cfb4342
1 DR-2-TX 2 TX DR 1953-05-15T04:00:00.000Z 1953 Tornado TORNADO & HEAVY RAINFALL 0 1 ... 48 0 0 48000 0 Statewide 53003 c5a1a4a1030d6730d9c562cdbe7c830f 2020-10-05T14:21:20.696Z 5f7b2be031a8c6681cfb4345
2 DR-5-MT 5 MT DR 1953-06-06T04:00:00.000Z 1953 Flood FLOODS 0 1 ... 30 0 0 30000 0 Statewide 53006 59c5483387ca13c6a3c1bc692f4860e1 2020-10-05T14:21:20.698Z 5f7b2be031a8c6681cfb4348
3 DR-7-MA 7 MA DR 1953-06-11T04:00:00.000Z 1953 Tornado TORNADO 0 1 ... 25 0 0 25000 0 Statewide 53009 6bab17e16984fc75f61a8445df3e95d9 2020-10-05T14:21:20.699Z 5f7b2be031a8c6681cfb434b
4 DR-8-IA 8 IA DR 1953-06-11T04:00:00.000Z 1953 Flood FLOOD 0 1 ... 19 0 0 19000 0 Statewide 53008 e258e9dd25fac73939f59c8ffb5308f5 2020-10-05T14:21:20.700Z 5f7b2be031a8c6681cfb434e

5 rows × 26 columns

disaster_df['declarationDate']
0        1953-05-02T04:00:00.000Z
1        1953-05-15T04:00:00.000Z
2        1953-06-06T04:00:00.000Z
3        1953-06-11T04:00:00.000Z
4        1953-06-11T04:00:00.000Z
                   ...           
60218    2020-10-08T17:30:00.000Z
60219    2020-10-08T17:30:00.000Z
60220    2020-10-08T17:30:00.000Z
60221    2020-10-08T17:30:00.000Z
60222    2020-09-20T16:40:00.000Z
Name: declarationDate, Length: 60223, dtype: object
disaster_df['incidentType'].nunique(), disaster_df['declarationTitle'].nunique()
(23, 2128)
disaster_df['incidentType'].unique()
array(['Tornado', 'Flood', 'Fire', 'Other', 'Earthquake', 'Hurricane',
       'Volcano', 'Severe Storm(s)', 'Toxic Substances', 'Typhoon',
       'Dam/Levee Break', 'Drought', 'Snow', 'Severe Ice Storm',
       'Freezing', 'Coastal Storm', 'Fishing Losses', 'Mud/Landslide',
       'Human Cause', 'Terrorist', 'Tsunami', 'Chemical', 'Biological'],
      dtype=object)
disaster_df[disaster_df['incidentType'] == "Fire"]['declarationTitle'].unique() 
#use "incidentType"
array(['FOREST FIRE', 'FIRES', 'FIRE (LOS ANGELES COUNTY)', ...,
       'ZOGG FIRE', 'SQF FIRE COMPLEX', 'SOUTH OBENCHAIN FIRE'],
      dtype=object)
disaster_df['fire'] = disaster_df['incidentType'].apply(lambda x:1 if x == 'Fire' else 0)
len(disaster_df[~(disaster_df['declarationDate'] == disaster_df['incidentBeginDate'])])
disaster_df[~(disaster_df['declarationDate'] == disaster_df['incidentBeginDate'])].head()

femaDeclarationString disasterNumber state declarationType declarationDate fyDeclared incidentType declarationTitle ihProgramDeclared iaProgramDeclared ... fipsCountyCode helperCounty fipscode placeCode designatedArea declarationRequestNumber hash lastRefresh id fire
4974 DR-546-MA 546 MA DR 1978-02-10T05:00:00.000Z 1978 Flood COASTAL STORMS, FLOOD, ICE & SNOW 0 1 ... 1 1 25001 99001 Barnstable (County) 78044 6b54739846a781dd5d3ac3228c12438b 2020-10-05T14:21:28.411Z 5f7b2be831a8c6681cfb9850 0
4975 DR-546-MA 546 MA DR 1978-02-10T05:00:00.000Z 1978 Flood COASTAL STORMS, FLOOD, ICE & SNOW 0 1 ... 7 7 25007 99007 Dukes (County) 78044 7d9e79eacde69ff26f56c6abfc41bcbe 2020-10-05T14:21:28.412Z 5f7b2be831a8c6681cfb9852 0
4976 DR-546-MA 546 MA DR 1978-02-10T05:00:00.000Z 1978 Flood COASTAL STORMS, FLOOD, ICE & SNOW 0 1 ... 5 5 25005 99005 Bristol (County)(in (P)MSA 1120,1200,2480,5400... 78044 44a18d427378b578ca5082f23649f90c 2020-10-05T14:21:28.411Z 5f7b2be831a8c6681cfb9857 0
4977 DR-546-MA 546 MA DR 1978-02-10T05:00:00.000Z 1978 Flood COASTAL STORMS, FLOOD, ICE & SNOW 0 1 ... 9 9 25009 99009 Essex (County)(in PMSA 1120,4160,7090) 78044 69e5446f5f1aa650c9dd62b132584a9d 2020-10-05T14:21:28.414Z 5f7b2be831a8c6681cfb985f 0
4979 DR-546-MA 546 MA DR 1978-02-10T05:00:00.000Z 1978 Flood COASTAL STORMS, FLOOD, ICE & SNOW 0 1 ... 19 19 25019 99019 Nantucket (County) 78044 d63a74daea9f57409505b4aa19b3a090 2020-10-05T14:21:28.416Z 5f7b2be831a8c6681cfb986b 0

5 rows × 27 columns

disaster_df.dtypes
femaDeclarationString       object
disasterNumber               int64
state                       object
declarationType             object
declarationDate             object
fyDeclared                   int64
incidentType                object
declarationTitle            object
ihProgramDeclared            int64
iaProgramDeclared            int64
paProgramDeclared            int64
hmProgramDeclared            int64
incidentBeginDate           object
incidentEndDate             object
disasterCloseoutDate        object
fipsStateCode                int64
helperState                  int64
fipsCountyCode               int64
helperCounty                 int64
fipscode                     int64
placeCode                    int64
designatedArea              object
declarationRequestNumber     int64
hash                        object
lastRefresh                 object
id                          object
fire                         int64
dtype: object
#truncate start date from timestamp to only include yyyy-mm-dd
disaster_df['start_date'] = disaster_df['incidentBeginDate'].apply(lambda x:x[:10])
#disaster_df['end_date'] = disaster_df['incidentEndDate'].apply(lambda x:x[:10]) #the data is not clean
sum(disaster_df['incidentBeginDate'] == disaster_df['incidentEndDate'])
#incidentBeginDate is not always the same as incidentEndDate, so we need to generate one row for each date from the begin date to the end date
8398
disaster_df['incidentEndDate'].head()
0    1953-05-02T04:00:00.000Z
1    1953-05-15T04:00:00.000Z
2    1953-06-06T04:00:00.000Z
3    1953-06-11T04:00:00.000Z
4    1953-06-11T04:00:00.000Z
Name: incidentEndDate, dtype: object
#clean up end date from timestamp to only include yyyy-mm-dd;
#fall back to the start date when incidentEndDate is missing
end_date = []
for i in disaster_df.iterrows():
    try:
        end_date.append(i[1]['incidentEndDate'][:10])
    except TypeError: #incidentEndDate is NaN (a float), so slicing fails
        end_date.append(i[1]['start_date'])
        
disaster_df['end_date'] = end_date
disaster_df.head()

femaDeclarationString disasterNumber state declarationType declarationDate fyDeclared incidentType declarationTitle ihProgramDeclared iaProgramDeclared ... fipscode placeCode designatedArea declarationRequestNumber hash lastRefresh id fire start_date end_date
0 DR-1-GA 1 GA DR 1953-05-02T04:00:00.000Z 1953 Tornado TORNADO 0 1 ... 13000 0 Statewide 53013 2f28952448e0a666d367ca3f854c81ec 2020-10-05T14:21:20.694Z 5f7b2be031a8c6681cfb4342 0 1953-05-02 1953-05-02
1 DR-2-TX 2 TX DR 1953-05-15T04:00:00.000Z 1953 Tornado TORNADO & HEAVY RAINFALL 0 1 ... 48000 0 Statewide 53003 c5a1a4a1030d6730d9c562cdbe7c830f 2020-10-05T14:21:20.696Z 5f7b2be031a8c6681cfb4345 0 1953-05-15 1953-05-15
2 DR-5-MT 5 MT DR 1953-06-06T04:00:00.000Z 1953 Flood FLOODS 0 1 ... 30000 0 Statewide 53006 59c5483387ca13c6a3c1bc692f4860e1 2020-10-05T14:21:20.698Z 5f7b2be031a8c6681cfb4348 0 1953-06-06 1953-06-06
3 DR-7-MA 7 MA DR 1953-06-11T04:00:00.000Z 1953 Tornado TORNADO 0 1 ... 25000 0 Statewide 53009 6bab17e16984fc75f61a8445df3e95d9 2020-10-05T14:21:20.699Z 5f7b2be031a8c6681cfb434b 0 1953-06-11 1953-06-11
4 DR-8-IA 8 IA DR 1953-06-11T04:00:00.000Z 1953 Flood FLOOD 0 1 ... 19000 0 Statewide 53008 e258e9dd25fac73939f59c8ffb5308f5 2020-10-05T14:21:20.700Z 5f7b2be031a8c6681cfb434e 0 1953-06-11 1953-06-11

5 rows × 29 columns

sum(disaster_df['fire']), sum(disaster_df['incidentType'] == "Fire")
(3461, 3461)
len(disaster_df)
60223
#how many unique fips codes
zip_fips_df["FIPS code"].nunique()
3223
zip_fips_df["ZIP"].nunique() 
39456
disaster_df["fipscode"].nunique()
3319
#get all the cols we need
cols = ['fipscode', 'start_date', 'end_date', 'fire']
fire = disaster_df[cols][disaster_df['fire'] == 1]
fire.head(3)

fipscode start_date end_date fire
9 33000 1953-07-02 1953-07-02 1
63 6000 1956-12-29 1956-12-29 1
107 16000 1960-07-22 1960-07-22 1
fire[fire['start_date'] != fire['end_date']][:3]

fipscode start_date end_date fire
6501 6053 1985-06-26 1985-07-19 1
6502 6079 1985-06-26 1985-07-19 1
6503 6085 1985-06-26 1985-07-19 1
len(fire)
3461
#save cleaned fire data with date and fipscode
import csv

with open('cleaned_data/cleaned_fire_data.csv', 'w', newline='') as outputfile: #'w' so a re-run doesn't append duplicates
    writer = csv.writer(outputfile, delimiter=',')
    header = ['fipscode', 'date', 'fire']
    writer.writerow(header)

    for i in fire.iterrows():
        if i[1]['start_date'] == i[1]['end_date']:
            val = [i[1]['fipscode'], i[1]['start_date'], i[1]['fire']]
            writer.writerow(val)
        else:
            #write one row per day for multi-day fires
            date_range = list(pd.date_range(start=i[1]['start_date'], end=i[1]['end_date']))
            for d in date_range:
                val = [i[1]['fipscode'], str(d)[:10], i[1]['fire']]
                writer.writerow(val)
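As a sketch, the same begin-to-end expansion can be done without the explicit CSV loop, using pd.date_range plus DataFrame.explode (pandas 0.25+); this is an untested alternative, not the code that produced the file above.

fire_expanded = fire.copy()
#build a list of dates covering each fire's duration
fire_expanded['date'] = fire_expanded.apply(
    lambda r: list(pd.date_range(r['start_date'], r['end_date']).strftime('%Y-%m-%d')),
    axis=1)
#explode the list so each date becomes its own row
fire_expanded = (fire_expanded
                 .explode('date')[['fipscode', 'date', 'fire']]
                 .reset_index(drop=True))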
#examine the data
clean_fire_df = pd.read_csv("cleaned_data/cleaned_fire_data.csv")
clean_fire_df.columns = ['fipscode', 'date', 'fire']
clean_fire_df.head()

fipscode date fire
0 6000 1956-12-29 1
1 16000 1960-07-22 1
2 6000 1961-11-16 1
3 16009 1967-08-30 1
4 16017 1967-08-30 1
clean_fire_df.dtypes
fipscode     int64
date        object
fire         int64
dtype: object

3) Join weather data and fire data

complete_df = weather_fips_df.merge(clean_fire_df, how = 'left', left_on = ['FIPS code', 'date'], right_on=['fipscode', 'date'])
# explore the joined data
len(complete_df) #a fips code and date can match multiple fire records, so the join adds a few rows
16385692
complete_df[~complete_df['fipscode'].isnull()][:5]

date zipcode zip PRCP SNOW SNWD TMAX TMIN TOBS ZIP COUNTYNAME STATE FIPS code fipscode fire
118197 1999-12-03 1420 01420 0.0 0.0 0.0 100.0 11.0 94.0 1420 Worcester County MA 25027 25027.0 1.0
118198 1999-12-04 1420 01420 10.0 0.0 0.0 111.0 28.0 94.0 1420 Worcester County MA 25027 25027.0 1.0
118199 1999-12-05 1420 01420 0.0 0.0 0.0 156.0 -11.0 94.0 1420 Worcester County MA 25027 25027.0 1.0
118200 1999-12-06 1420 01420 107.0 0.0 0.0 128.0 67.0 94.0 1420 Worcester County MA 25027 25027.0 1.0
118201 1999-12-07 1420 01420 122.0 0.0 0.0 111.0 28.0 94.0 1420 Worcester County MA 25027 25027.0 1.0
#fill all non-matching labels with zero
complete_df['fire'] = complete_df['fire'].fillna(0)
# the percentage of positive examples is around 0.3%; the data is highly imbalanced

len(complete_df[complete_df['fire'] == 1])/len(complete_df)
0.002887641242127583
complete_df.columns
Index(['date', 'zipcode', 'zip', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN',
       'TOBS', 'ZIP', 'COUNTYNAME', 'STATE', 'FIPS code', 'fipscode', 'fire'],
      dtype='object')
#save clean data ready for model training
keep_cols = ['date', 'zip', 'FIPS code', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS', 'fire']
complete_df[keep_cols].to_csv("cleaned_data/model_data_all.csv", index=False)

Train model

1) Train test split

# load ready-for-model data
df = pd.read_csv("cleaned_data/model_data_all.csv")
df.head(3)

date zip FIPS code PRCP SNOW SNWD TMAX TMIN TOBS fire
0 1950-01-01 601 72001 18.0 0.0 0.0 183.0 128.0 156.0 0.0
1 1950-01-01 601 72113 18.0 0.0 0.0 183.0 128.0 156.0 0.0
2 1950-01-02 601 72001 18.0 0.0 0.0 194.0 122.0 167.0 0.0
len(df)
16385692
X = df[['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
y = df['fire']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

2) Scale and downsample

# scale training data
scaler = preprocessing.StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
#downsample the zero (no-fire) labels by 99%: keep every positive and ~1% of negatives
mask = y_train.apply(lambda c: c > 0 or random.random() > 0.99)
X_train_bal = scaled_X_train[mask]
y_train_bal = y_train[mask]
len(X_train), len(X_train_bal), len(y_train[y_train == 1]), len(y_train_bal[y_train_bal ==1]), len(y_train[y_train == 0]), len(y_train_bal[y_train_bal ==0])
(14747122, 189151, 42655, 42655, 14704467, 146496)
#the ratio of positives to negatives after downsampling
len(y_train_bal[y_train_bal ==1])/len(y_train_bal[y_train_bal ==0])
0.2911683595456531
# scale test data 
scaled_X_test = scaler.transform(X_test)
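As an aside, an alternative to randomly discarding negatives is to keep the full training set and let the model reweight classes. A minimal sketch, not used in this project (class_weight is supported by RandomForestClassifier but not by MLPClassifier):

#'balanced' weights each class inversely to its frequency,
#so the rare fire days count more per sample during training
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=0)
# rf_weighted.fit(scaler.transform(X_train), y_train)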

3) MLP classifier

clf_bal = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

clf_bal.fit(X_train_bal, y_train_bal)

/Users/mo-b/anaconda3/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py:470: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(5, 2), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='lbfgs',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
# accuracy
clf_bal.score(scaled_X_test, y_test)
0.9735952690455703
clf_bal.classes_
array([0., 1.])
# auc score
prob_pred = clf_bal.predict_proba(scaled_X_test)[:, 1] # only get the predicted probability for class 1
auc_score = roc_auc_score(y_test, prob_pred)
auc_score
0.7337490009777365
y_pred = clf_bal.predict(scaled_X_test)
confusion_matrix(y_test, y_pred)

# tn, fp
# fn, tp
array([[1594599,   39310],
       [   3956,     705]])
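Accuracy alone is misleading at this level of imbalance, so it helps to read precision and recall off the confusion matrix. A small sketch using the values above:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)  # 705 / (705 + 39310) ≈ 0.018
recall = tp / (tp + fn)     # 705 / (705 + 3956)  ≈ 0.151
print(precision, recall)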
#plot roc curve
# metrics.plot_roc_curve(clf_bal, scaled_X_test, y_test)
# plt.title("Fire prediction - ROC curve for MLP Classifier")
# plt.show() 
# fine tune model
parameters = []
auc_scores = []
hidden_layer_sizes = [(5,2), (5,5), (10,5)] #size of each hidden layer
solvers = ['adam','sgd', 'lbfgs']
alphas = [0.001, 0.0001, 0.00001] #regularization term

for hidden_layer_size in hidden_layer_sizes:
    for solver in solvers:
        for alpha in alphas:
            mlp = MLPClassifier(alpha=alpha, hidden_layer_sizes=hidden_layer_size, random_state=1, solver=solver)
            mlp.fit(X_train_bal, y_train_bal)
            prob_pred = mlp.predict_proba(scaled_X_test)[:, 1]
            auc_score = roc_auc_score(y_test, prob_pred)
            parameters.append({"hidden_layer_size":hidden_layer_size, "solver":solver, "alpha":alpha})
            auc_scores.append(auc_score)
            
#get the parameters with the best auc score
max_idx = np.argmax(auc_scores)
auc_scores[max_idx], parameters[max_idx]
(0.7594437575487328,
 {'hidden_layer_size': (10, 5), 'solver': 'adam', 'alpha': 1e-05})
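For reference, the same sweep could be written with scikit-learn's GridSearchCV, which adds cross-validation and parallelism. A sketch, not what was run here; note it selects on cross-validated AUC over the training set rather than the held-out test set:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'hidden_layer_sizes': [(5, 2), (5, 5), (10, 5)],
    'solver': ['adam', 'sgd', 'lbfgs'],
    'alpha': [1e-3, 1e-4, 1e-5],
}
search = GridSearchCV(MLPClassifier(random_state=1), param_grid,
                      scoring='roc_auc', cv=3, n_jobs=-1)
search.fit(X_train_bal, y_train_bal)
search.best_score_, search.best_params_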
#train model with best parameters
best_mlp = MLPClassifier(alpha=1e-05, hidden_layer_sizes=(10, 5), random_state=1, solver='adam')
best_mlp.fit(X_train_bal, y_train_bal)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10, 5), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
#roc for best mlp classifier for testing data
metrics.plot_roc_curve(best_mlp, scaled_X_test, y_test)
plt.title("Fire prediction - ROC curve for MLP Classifier")
plt.show() 

[Figure: ROC curve for MLP classifier, test data]

#roc for best mlp classifier for training data
metrics.plot_roc_curve(best_mlp, scaled_X_train, y_train)
plt.title("Fire prediction - ROC curve for MLP Classifier with training data")
plt.show() 

[Figure: ROC curve for MLP classifier, training data]

4) Random forest classifier

rf = RandomForestClassifier(max_depth=10, random_state=0)
rf.fit(X_train_bal, y_train_bal)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
# auc score
prob_pred = rf.predict_proba(scaled_X_test)[:, 1] # only get the predicted probability for class 1
auc_score = roc_auc_score(y_test, prob_pred)
auc_score
0.797966470622066
# fine tune model
rf_parameters = []
rf_auc_scores = []
n_estimators = [10, 50, 100]
criterions = ['gini','entropy']
max_depths = [10, 50, 100]
for criterion in criterions:
    for n_estimator in n_estimators:
        for max_depth in max_depths:
            rfm = RandomForestClassifier(max_depth=max_depth,n_estimators =n_estimator,criterion=criterion, random_state=1)
            rfm.fit(X_train_bal, y_train_bal)
            prob_pred = rfm.predict_proba(scaled_X_test)[:, 1]
            auc_score = roc_auc_score(y_test, prob_pred)
            rf_parameters.append({"n_estimator":n_estimator, "criterion":criterion, "max_depth":max_depth})
            rf_auc_scores.append(auc_score)
#get best parameters for random forest classifier
max_idx = np.argmax(rf_auc_scores)
rf_auc_scores[max_idx], rf_parameters[max_idx]
(0.8416267396854684,
 {'n_estimator': 100, 'criterion': 'entropy', 'max_depth': 50})
#train model with best parameters
best_rf = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=50)
best_rf.fit(X_train_bal, y_train_bal)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=50, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
best_rf.score(scaled_X_test, y_test)
0.9057342682949157
best_rf.score(scaled_X_train, y_train)
0.9066968456624961
#plot roc curve
metrics.plot_roc_curve(best_rf, scaled_X_test, y_test)
plt.title("Fire prediction - ROC curve for random forest Classifier")
plt.show()

[Figure: ROC curve for random forest classifier, test data]

#plot roc curve
metrics.plot_roc_curve(best_rf, scaled_X_train, y_train)
plt.title("Fire prediction - ROC curve for random forest Classifier with training data")
plt.show()

[Figure: ROC curve for random forest classifier, training data]

  • Based on the AUC scores, the random forest model (0.84) outperforms the MLP classifier (0.76). We will use the fine-tuned random forest model to predict the probability of fire events from the forecast NOAA weather data.
best_rf.feature_importances_
array([0.08900428, 0.0059088 , 0.02556477, 0.30069328, 0.278647  ,
       0.30018187])
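The importances come back in the column order of X; a small helper (feature names taken from the feature list used above) pairs them up for readability. Temperature features (TMAX, TMIN, TOBS) dominate.

for name, imp in zip(['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS'],
                     best_rf.feature_importances_):
    print(f"{name}: {imp:.3f}")  # e.g. TMAX: 0.301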

Predict future fire

1) Load predicted weather data

pred_weather_df = pd.read_csv("cleaned_data/NOAA_weather_forecast_predictions.csv")
pred_weather_df.head()

zipcode year PRCP SNOW SNWD TMAX TMIN TOBS
0 16601 2030 36.704178 6.832971 5.034125 155.709286 57.087498 79.935546
1 16601 2040 38.987804 8.066879 -0.581523 157.545249 60.140693 78.845724
2 16601 2050 41.276487 9.303487 -6.195984 159.381684 63.195606 77.752849
3 16601 2060 43.571408 10.541236 -11.812570 161.216156 66.252145 76.652903
4 16601 2070 45.849895 11.775522 -17.431140 163.057915 69.306966 75.563086

2) Scale predicted weather data

weather_forecast = pred_weather_df[['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
scaled_weather_forecast = scaler.transform(weather_forecast)

3) Predict fire

fire_prob_pred = best_rf.predict_proba(scaled_weather_forecast)[:, 1]
pred_weather_df['Fire'] = fire_prob_pred
pred_weather_df.head()

zipcode year PRCP SNOW SNWD TMAX TMIN TOBS Fire
0 16601 2030 36.704178 6.832971 5.034125 155.709286 57.087498 79.935546 0.22
1 16601 2040 38.987804 8.066879 -0.581523 157.545249 60.140693 78.845724 0.00
2 16601 2050 41.276487 9.303487 -6.195984 159.381684 63.195606 77.752849 0.01
3 16601 2060 43.571408 10.541236 -11.812570 161.216156 66.252145 76.652903 0.00
4 16601 2070 45.849895 11.775522 -17.431140 163.057915 69.306966 75.563086 0.05
#save fire prediction to csv file
pred_weather_df.to_csv('cleaned_data/fire_prediction_w_noaa_weather_forecast.csv', index=False)
len(pred_weather_df)
3260
check = pd.read_csv('cleaned_data/fire_prediction_w_noaa_weather_forecast.csv')
check.head()

zipcode year PRCP SNOW SNWD TMAX TMIN TOBS Fire
0 16601 2030 36.704178 6.832971 5.034125 155.709286 57.087498 79.935546 0.22
1 16601 2040 38.987804 8.066879 -0.581523 157.545249 60.140693 78.845724 0.00
2 16601 2050 41.276487 9.303487 -6.195984 159.381684 63.195606 77.752849 0.01
3 16601 2060 43.571408 10.541236 -11.812570 161.216156 66.252145 76.652903 0.00
4 16601 2070 45.849895 11.775522 -17.431140 163.057915 69.306966 75.563086 0.05