Please visit https://mhberro.github.io/ to interact with the Climate Migration tool. Drag the “Select Year” slider to change the prediction year and click on the shaded regions to display the climate index for that region.
Problem Statement
Migration due to climate change is already occurring in the U.S. and around the world, driven by factors such as recurrent flooding, rising temperature and humidity, and repeated wildfire outbreaks. As these patterns continue, many cities will become uninhabitable. The goal of this project is to create an interactive visualization of the effects of climate change on the U.S. over the next 50 years and to recommend better areas to move to based on predicted climate values.
Data
I retrieved the climate data by scraping the National Oceanic and Atmospheric Administration (NOAA) API. The NOAA database has daily climate records dating back to the 1950s for individual zip codes. I initially planned to scrape all 42,000 U.S. zip codes, but the NOAA API enforces a strict daily request limit, and covering every zip code would have taken roughly 25,500 days. Instead, I used a subset of about 1,000 zip codes, keyed on the first three digits (the zip code prefix), to cover every region of the U.S. Climate data is regional and does not change drastically over short distances, so this smaller (but still quite large) subset gives a good approximation of the climate for all zip codes while staying manageable. The result is over 70 million records, stored in MySQL on an AWS EC2 instance.
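For reference, below is a minimal sketch of how daily records for one representative zip code could be pulled from NOAA's Climate Data Online (CDO) v2 API. The token placeholder, date range, and the fetch_daily_weather helper are illustrative assumptions, not the exact scraper used for this project.
import requests

NOAA_TOKEN = "YOUR_CDO_API_TOKEN"  # free token issued by NOAA CDO
BASE_URL = "https://www.ncei.noaa.gov/cdo-web/api/v2/data"

def fetch_daily_weather(zip_code, start, end):
    """Fetch daily GHCND records (PRCP, SNOW, SNWD, TMAX, TMIN, TOBS) for one zip code."""
    params = {
        "datasetid": "GHCND",                  # daily summaries dataset
        "locationid": "ZIP:" + zip_code,
        "datatypeid": ["PRCP", "SNOW", "SNWD", "TMAX", "TMIN", "TOBS"],
        "startdate": start,
        "enddate": end,
        "limit": 1000,                         # CDO caps each response at 1000 records
    }
    resp = requests.get(BASE_URL, params=params, headers={"token": NOAA_TOKEN})
    resp.raise_for_status()
    return resp.json().get("results", [])

# Example: one month of daily records for the first representative zip prefix
records = fetch_daily_weather("00601", "1950-01-01", "1950-01-31")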
Below is the model used to make wildfire predictions for the Climate Migration tool. This model is one of several models that feed into the overall climate index prediction for the Climate Migration app. The data was obtained by scraping the National Oceanic and Atmospheric Administration (NOAA) website.
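As a rough illustration of how the per-hazard models could be combined, the sketch below averages per-hazard probabilities into a single 0-100 index per region and year. The hazard names and equal weights are hypothetical placeholders, not the app's actual formula.
import pandas as pd

def climate_index(hazards, weights=None):
    """hazards: DataFrame with one row per (zipcode, year) and one column per
    hazard probability in [0, 1]. Returns an index scaled to 0-100 (higher = worse)."""
    cols = ["Fire", "Flood", "Hurricane", "Drought"]          # hypothetical hazard set
    weights = weights or {c: 1.0 / len(cols) for c in cols}   # equal weights by default
    return sum(hazards[c] * w for c, w in weights.items()) * 100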
import pandas as pd
import zipfile
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn import preprocessing, metrics
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
Content
1. Clean data
   - Clean weather data
   - Clean disaster data and extract fire data
   - Join weather and fire data
2. Train model
   - Train test split
   - Scale and downsample
   - MLP classifier
   - Random forest classifier
3. Predict future fire
   - Load predicted weather data
   - Scale predicted weather data
   - Predict fire
Clean data
1) Clean weather data
zp = zipfile.ZipFile('cleaned_data/all_zip_codes_all.zip')
weather_df = pd.read_csv(zp.open("all_zip_codes.csv"), converters={'zip': lambda x: str(x)}) #include leading zeros
weather_df.head()
#zip is the same as zipcode, except that zip keeps the leading zeros
  | date       | zipcode | zip   | PRCP | SNOW | SNWD | TMAX  | TMIN  | TOBS
0 | 1950-01-01 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 183.0 | 128.0 | 156.0
1 | 1950-01-02 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 194.0 | 122.0 | 167.0
2 | 1950-01-03 | 601     | 00601 | 0.0  | 0.0  | 0.0  | 211.0 | 133.0 | 183.0
3 | 1950-01-04 | 601     | 00601 | 0.0  | 0.0  | 0.0  | 200.0 | 144.0 | 167.0
4 | 1950-01-05 | 601     | 00601 | 0.0  | 0.0  | 0.0  | 217.0 | 139.0 | 167.0
         | date       | zipcode | zip   | PRCP | SNOW | SNWD | TMAX  | TMIN  | TOBS
10955519 | 2020-10-22 | 39701   | 39701 | 0.0  | 0.0  | 0.0  | 272.0 | 183.0 | 233.0
10955520 | 2020-10-23 | 39701   | 39701 | 0.0  | 0.0  | 0.0  | 289.0 | 206.0 | 222.0
10955521 | 2020-10-26 | 39701   | 39701 | 0.0  | 0.0  | 0.0  | 206.0 | 133.0 | 172.0
10955522 | 2020-10-27 | 39701   | 39701 | 0.0  | 0.0  | 0.0  | 256.0 | 156.0 | 161.0
10955523 | 2020-10-28 | 39701   | 39701 | 0.0  | 0.0  | 0.0  | 250.0 | 161.0 | 228.0
Index(['date', 'zipcode', 'zip', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN',
'TOBS'],
dtype='object')
10955524
weather_df['zipcode'].nunique()
652
weather_df['zipcode'].unique()[:100]
array([ 601, 703, 901, 1201, 1301, 1420, 1501, 1602, 1701,
1801, 2019, 2108, 2301, 2420, 2601, 3031, 3101, 3301,
3431, 3561, 3740, 3801, 3901, 4001, 4101, 4210, 4330,
4401, 4530, 4605, 4730, 4841, 4901, 5001, 5101, 5201,
5301, 5401, 5602, 5701, 5819, 6001, 6103, 6226, 6320,
6401, 6604, 6801, 7001, 7102, 7302, 7501, 7701, 7901,
8201, 8302, 8401, 8701, 8801, 8901, 10801, 10901, 11001,
11101, 11501, 11701, 89001, 89101, 89301, 89701, 89801, 90401,
90501, 90601, 90701, 90802, 91001, 91101, 91201, 91301, 91401,
91501, 91601, 91701, 91901, 92101, 92201, 92301, 92501, 92602,
92701, 93001, 93101, 93401, 93501, 93601, 93901, 94002, 94102,
94301])
weather_df['zip'].nunique()
652
weather_df['zip'].unique()[:100]
array(['00601', '00703', '00901', '01201', '01301', '01420', '01501',
'01602', '01701', '01801', '02019', '02108', '02301', '02420',
'02601', '03031', '03101', '03301', '03431', '03561', '03740',
'03801', '03901', '04001', '04101', '04210', '04330', '04401',
'04530', '04605', '04730', '04841', '04901', '05001', '05101',
'05201', '05301', '05401', '05602', '05701', '05819', '06001',
'06103', '06226', '06320', '06401', '06604', '06801', '07001',
'07102', '07302', '07501', '07701', '07901', '08201', '08302',
'08401', '08701', '08801', '08901', '10801', '10901', '11001',
'11101', '11501', '11701', '89001', '89101', '89301', '89701',
'89801', '90401', '90501', '90601', '90701', '90802', '91001',
'91101', '91201', '91301', '91401', '91501', '91601', '91701',
'91901', '92101', '92201', '92301', '92501', '92602', '92701',
'93001', '93101', '93401', '93501', '93601', '93901', '94002',
'94102', '94301'], dtype=object)
date object
zipcode int64
zip object
PRCP float64
SNOW float64
SNWD float64
TMAX float64
TMIN float64
TOBS float64
dtype: object
# map zip code to fips code for weather data
zip_fips_df = pd.read_csv("cleaned_data/ZIP-COUNTY-FIPS_2017-06.csv")
zip_fips_df['ZIP'].unique()
array([36003, 36006, 36067, ..., 820, 830, 802])
weather_fips_df = weather_df.merge(zip_fips_df, how = 'left', left_on = 'zipcode', right_on='ZIP')
#get weather data with fips code
weather_fips_df.head()
  | date       | zipcode | zip   | PRCP | SNOW | SNWD | TMAX  | TMIN  | TOBS  | ZIP | COUNTYNAME         | STATE | FIPS code
0 | 1950-01-01 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 183.0 | 128.0 | 156.0 | 601 | Adjuntas Municipio | PR    | 72001
1 | 1950-01-01 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 183.0 | 128.0 | 156.0 | 601 | Ponce Municipio    | PR    | 72113
2 | 1950-01-02 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 194.0 | 122.0 | 167.0 | 601 | Adjuntas Municipio | PR    | 72001
3 | 1950-01-02 | 601     | 00601 | 18.0 | 0.0  | 0.0  | 194.0 | 122.0 | 167.0 | 601 | Ponce Municipio    | PR    | 72113
4 | 1950-01-03 | 601     | 00601 | 0.0  | 0.0  | 0.0  | 211.0 | 133.0 | 183.0 | 601 | Adjuntas Municipio | PR    | 72001
16380869
2) Clean disaster data and extract fire data
# data description: https://www.fema.gov/openfema-data-page/disaster-declarations-summaries-v2
disaster_df = pd.read_csv("cleaned_data/DisasterDeclarationsSummaries_1.1.csv")
Index(['femaDeclarationString', 'disasterNumber', 'state', 'declarationType',
'declarationDate', 'fyDeclared', 'incidentType', 'declarationTitle',
'ihProgramDeclared', 'iaProgramDeclared', 'paProgramDeclared',
'hmProgramDeclared', 'incidentBeginDate', 'incidentEndDate',
'disasterCloseoutDate', 'fipsStateCode', 'helperState',
'fipsCountyCode', 'helperCounty', 'fipscode', 'placeCode',
'designatedArea', 'declarationRequestNumber', 'hash', 'lastRefresh',
'id'],
dtype='object')
  | femaDeclarationString | disasterNumber | state | declarationType | declarationDate | fyDeclared | incidentType | declarationTitle | ihProgramDeclared | iaProgramDeclared | ... | helperState | fipsCountyCode | helperCounty | fipscode | placeCode | designatedArea | declarationRequestNumber | hash | lastRefresh | id
0 | DR-1-GA | 1 | GA | DR | 1953-05-02T04:00:00.000Z | 1953 | Tornado | TORNADO | 0 | 1 | ... | 13 | 0 | 0 | 13000 | 0 | Statewide | 53013 | 2f28952448e0a666d367ca3f854c81ec | 2020-10-05T14:21:20.694Z | 5f7b2be031a8c6681cfb4342
1 | DR-2-TX | 2 | TX | DR | 1953-05-15T04:00:00.000Z | 1953 | Tornado | TORNADO & HEAVY RAINFALL | 0 | 1 | ... | 48 | 0 | 0 | 48000 | 0 | Statewide | 53003 | c5a1a4a1030d6730d9c562cdbe7c830f | 2020-10-05T14:21:20.696Z | 5f7b2be031a8c6681cfb4345
2 | DR-5-MT | 5 | MT | DR | 1953-06-06T04:00:00.000Z | 1953 | Flood | FLOODS | 0 | 1 | ... | 30 | 0 | 0 | 30000 | 0 | Statewide | 53006 | 59c5483387ca13c6a3c1bc692f4860e1 | 2020-10-05T14:21:20.698Z | 5f7b2be031a8c6681cfb4348
3 | DR-7-MA | 7 | MA | DR | 1953-06-11T04:00:00.000Z | 1953 | Tornado | TORNADO | 0 | 1 | ... | 25 | 0 | 0 | 25000 | 0 | Statewide | 53009 | 6bab17e16984fc75f61a8445df3e95d9 | 2020-10-05T14:21:20.699Z | 5f7b2be031a8c6681cfb434b
4 | DR-8-IA | 8 | IA | DR | 1953-06-11T04:00:00.000Z | 1953 | Flood | FLOOD | 0 | 1 | ... | 19 | 0 | 0 | 19000 | 0 | Statewide | 53008 | e258e9dd25fac73939f59c8ffb5308f5 | 2020-10-05T14:21:20.700Z | 5f7b2be031a8c6681cfb434e
5 rows × 26 columns
disaster_df['declarationDate']
0 1953-05-02T04:00:00.000Z
1 1953-05-15T04:00:00.000Z
2 1953-06-06T04:00:00.000Z
3 1953-06-11T04:00:00.000Z
4 1953-06-11T04:00:00.000Z
...
60218 2020-10-08T17:30:00.000Z
60219 2020-10-08T17:30:00.000Z
60220 2020-10-08T17:30:00.000Z
60221 2020-10-08T17:30:00.000Z
60222 2020-09-20T16:40:00.000Z
Name: declarationDate, Length: 60223, dtype: object
disaster_df['incidentType'].nunique(), disaster_df['declarationTitle'].nunique()
(23, 2128)
disaster_df['incidentType'].unique()
array(['Tornado', 'Flood', 'Fire', 'Other', 'Earthquake', 'Hurricane',
'Volcano', 'Severe Storm(s)', 'Toxic Substances', 'Typhoon',
'Dam/Levee Break', 'Drought', 'Snow', 'Severe Ice Storm',
'Freezing', 'Coastal Storm', 'Fishing Losses', 'Mud/Landslide',
'Human Cause', 'Terrorist', 'Tsunami', 'Chemical', 'Biological'],
dtype=object)
disaster_df[disaster_df['incidentType'] == "Fire"]['declarationTitle'].unique()
#use "incidentType"
array(['FOREST FIRE', 'FIRES', 'FIRE (LOS ANGELES COUNTY)', ...,
'ZOGG FIRE', 'SQF FIRE COMPLEX', 'SOUTH OBENCHAIN FIRE'],
dtype=object)
disaster_df['fire'] = disaster_df['incidentType'].apply(lambda x:1 if x == 'Fire' else 0)
len(disaster_df[~(disaster_df['declarationDate'] == disaster_df['incidentBeginDate'])])
disaster_df[~(disaster_df['declarationDate'] == disaster_df['incidentBeginDate'])].head()
     | femaDeclarationString | disasterNumber | state | declarationType | declarationDate | fyDeclared | incidentType | declarationTitle | ihProgramDeclared | iaProgramDeclared | ... | fipsCountyCode | helperCounty | fipscode | placeCode | designatedArea | declarationRequestNumber | hash | lastRefresh | id | fire
4974 | DR-546-MA | 546 | MA | DR | 1978-02-10T05:00:00.000Z | 1978 | Flood | COASTAL STORMS, FLOOD, ICE & SNOW | 0 | 1 | ... | 1 | 1 | 25001 | 99001 | Barnstable (County) | 78044 | 6b54739846a781dd5d3ac3228c12438b | 2020-10-05T14:21:28.411Z | 5f7b2be831a8c6681cfb9850 | 0
4975 | DR-546-MA | 546 | MA | DR | 1978-02-10T05:00:00.000Z | 1978 | Flood | COASTAL STORMS, FLOOD, ICE & SNOW | 0 | 1 | ... | 7 | 7 | 25007 | 99007 | Dukes (County) | 78044 | 7d9e79eacde69ff26f56c6abfc41bcbe | 2020-10-05T14:21:28.412Z | 5f7b2be831a8c6681cfb9852 | 0
4976 | DR-546-MA | 546 | MA | DR | 1978-02-10T05:00:00.000Z | 1978 | Flood | COASTAL STORMS, FLOOD, ICE & SNOW | 0 | 1 | ... | 5 | 5 | 25005 | 99005 | Bristol (County)(in (P)MSA 1120,1200,2480,5400... | 78044 | 44a18d427378b578ca5082f23649f90c | 2020-10-05T14:21:28.411Z | 5f7b2be831a8c6681cfb9857 | 0
4977 | DR-546-MA | 546 | MA | DR | 1978-02-10T05:00:00.000Z | 1978 | Flood | COASTAL STORMS, FLOOD, ICE & SNOW | 0 | 1 | ... | 9 | 9 | 25009 | 99009 | Essex (County)(in PMSA 1120,4160,7090) | 78044 | 69e5446f5f1aa650c9dd62b132584a9d | 2020-10-05T14:21:28.414Z | 5f7b2be831a8c6681cfb985f | 0
4979 | DR-546-MA | 546 | MA | DR | 1978-02-10T05:00:00.000Z | 1978 | Flood | COASTAL STORMS, FLOOD, ICE & SNOW | 0 | 1 | ... | 19 | 19 | 25019 | 99019 | Nantucket (County) | 78044 | d63a74daea9f57409505b4aa19b3a090 | 2020-10-05T14:21:28.416Z | 5f7b2be831a8c6681cfb986b | 0
5 rows × 27 columns
femaDeclarationString object
disasterNumber int64
state object
declarationType object
declarationDate object
fyDeclared int64
incidentType object
declarationTitle object
ihProgramDeclared int64
iaProgramDeclared int64
paProgramDeclared int64
hmProgramDeclared int64
incidentBeginDate object
incidentEndDate object
disasterCloseoutDate object
fipsStateCode int64
helperState int64
fipsCountyCode int64
helperCounty int64
fipscode int64
placeCode int64
designatedArea object
declarationRequestNumber int64
hash object
lastRefresh object
id object
fire int64
dtype: object
#truncate start date from timestamp to only include yyyy-mm-dd
disaster_df['start_date'] = disaster_df['incidentBeginDate'].apply(lambda x:x[:10])
#disaster_df['end_date'] = disaster_df['incidentEndDate'].apply(lambda x:x[:10]) #the data is not clean
sum(disaster_df['incidentBeginDate'] == disaster_df['incidentEndDate'])
#incidentBeginDate is not always the same as incidentEndDate, so we need to generate one row for each date from the begin date to the end date
8398
disaster_df['incidentEndDate'].head()
0 1953-05-02T04:00:00.000Z
1 1953-05-15T04:00:00.000Z
2 1953-06-06T04:00:00.000Z
3 1953-06-11T04:00:00.000Z
4 1953-06-11T04:00:00.000Z
Name: incidentEndDate, dtype: object
#clean up end date from timestamp to only include yyyy-mm-dd
end_date = []
for i in disaster_df.iterrows():
    try:
        end_date.append(i[1]['incidentEndDate'][:10])
    except:
        # incidentEndDate is missing (NaN), so fall back to the start date
        end_date.append(i[1]['start_date'])
disaster_df['end_date'] = end_date
  | femaDeclarationString | disasterNumber | state | declarationType | declarationDate | fyDeclared | incidentType | declarationTitle | ihProgramDeclared | iaProgramDeclared | ... | fipscode | placeCode | designatedArea | declarationRequestNumber | hash | lastRefresh | id | fire | start_date | end_date
0 | DR-1-GA | 1 | GA | DR | 1953-05-02T04:00:00.000Z | 1953 | Tornado | TORNADO | 0 | 1 | ... | 13000 | 0 | Statewide | 53013 | 2f28952448e0a666d367ca3f854c81ec | 2020-10-05T14:21:20.694Z | 5f7b2be031a8c6681cfb4342 | 0 | 1953-05-02 | 1953-05-02
1 | DR-2-TX | 2 | TX | DR | 1953-05-15T04:00:00.000Z | 1953 | Tornado | TORNADO & HEAVY RAINFALL | 0 | 1 | ... | 48000 | 0 | Statewide | 53003 | c5a1a4a1030d6730d9c562cdbe7c830f | 2020-10-05T14:21:20.696Z | 5f7b2be031a8c6681cfb4345 | 0 | 1953-05-15 | 1953-05-15
2 | DR-5-MT | 5 | MT | DR | 1953-06-06T04:00:00.000Z | 1953 | Flood | FLOODS | 0 | 1 | ... | 30000 | 0 | Statewide | 53006 | 59c5483387ca13c6a3c1bc692f4860e1 | 2020-10-05T14:21:20.698Z | 5f7b2be031a8c6681cfb4348 | 0 | 1953-06-06 | 1953-06-06
3 | DR-7-MA | 7 | MA | DR | 1953-06-11T04:00:00.000Z | 1953 | Tornado | TORNADO | 0 | 1 | ... | 25000 | 0 | Statewide | 53009 | 6bab17e16984fc75f61a8445df3e95d9 | 2020-10-05T14:21:20.699Z | 5f7b2be031a8c6681cfb434b | 0 | 1953-06-11 | 1953-06-11
4 | DR-8-IA | 8 | IA | DR | 1953-06-11T04:00:00.000Z | 1953 | Flood | FLOOD | 0 | 1 | ... | 19000 | 0 | Statewide | 53008 | e258e9dd25fac73939f59c8ffb5308f5 | 2020-10-05T14:21:20.700Z | 5f7b2be031a8c6681cfb434e | 0 | 1953-06-11 | 1953-06-11
5 rows × 29 columns
sum(disaster_df['fire']), sum(disaster_df['incidentType'] == "Fire")
(3461, 3461)
60223
#how many unique fips codes
zip_fips_df["FIPS code"].nunique()
3223
zip_fips_df["ZIP"].nunique()
39456
disaster_df["fipscode"].nunique()
3319
#get all the cols we need
cols = ['fipscode', 'start_date', 'end_date', 'fire']
fire = disaster_df[cols][disaster_df['fire'] == 1]
fire.head(3)
    | fipscode | start_date | end_date   | fire
9   | 33000    | 1953-07-02 | 1953-07-02 | 1
63  | 6000     | 1956-12-29 | 1956-12-29 | 1
107 | 16000    | 1960-07-22 | 1960-07-22 | 1
fire[fire['start_date'] != fire['end_date']][:3]
     | fipscode | start_date | end_date   | fire
6501 | 6053     | 1985-06-26 | 1985-07-19 | 1
6502 | 6079     | 1985-06-26 | 1985-07-19 | 1
6503 | 6085     | 1985-06-26 | 1985-07-19 | 1
3461
#save cleaned fire data with date and fipscode, expanding multi-day fires into one row per day
import csv
with open('cleaned_data/cleaned_fire_data.csv', 'a') as outputfile:
    writer = csv.writer(outputfile, delimiter=',')
    header = ['fipscode', 'date', 'fire']
    writer.writerow(header)
    for i in fire.iterrows():
        if i[1]['start_date'] == i[1]['end_date']:
            # single-day fire: write one row
            val = [i[1]['fipscode'], i[1]['start_date'], i[1]['fire']]
            writer.writerow(val)
        else:
            # multi-day fire: write one row per day between the start and end dates
            date_range = list(pd.date_range(start=i[1]['start_date'], end=i[1]['end_date']))
            for d in date_range:
                val = [i[1]['fipscode'], str(d)[:10], i[1]['fire']]
                writer.writerow(val)
#examine the data
clean_fire_df = pd.read_csv("cleaned_fire_data.csv")
clean_fire_df.columns = ['fipscode', 'date', 'fire']
clean_fire_df.head()
  | fipscode | date       | fire
0 | 6000     | 1956-12-29 | 1
1 | 16000    | 1960-07-22 | 1
2 | 6000     | 1961-11-16 | 1
3 | 16009    | 1967-08-30 | 1
4 | 16017    | 1967-08-30 | 1
fipscode int64
date object
fire int64
dtype: object
3) Join weather data and fire data
complete_df = weather_fips_df.merge(clean_fire_df, how = 'left', left_on = ['FIPS code', 'date'], right_on=['fipscode', 'date'])
# explore the joined data
len(complete_df) #a zip code can map to multiple counties (FIPS codes), so the join produces more rows than weather_df
16385692
complete_df[~complete_df['fipscode'].isnull()][:5]
       | date       | zipcode | zip   | PRCP  | SNOW | SNWD | TMAX  | TMIN  | TOBS | ZIP  | COUNTYNAME       | STATE | FIPS code | fipscode | fire
118197 | 1999-12-03 | 1420    | 01420 | 0.0   | 0.0  | 0.0  | 100.0 | 11.0  | 94.0 | 1420 | Worcester County | MA    | 25027     | 25027.0  | 1.0
118198 | 1999-12-04 | 1420    | 01420 | 10.0  | 0.0  | 0.0  | 111.0 | 28.0  | 94.0 | 1420 | Worcester County | MA    | 25027     | 25027.0  | 1.0
118199 | 1999-12-05 | 1420    | 01420 | 0.0   | 0.0  | 0.0  | 156.0 | -11.0 | 94.0 | 1420 | Worcester County | MA    | 25027     | 25027.0  | 1.0
118200 | 1999-12-06 | 1420    | 01420 | 107.0 | 0.0  | 0.0  | 128.0 | 67.0  | 94.0 | 1420 | Worcester County | MA    | 25027     | 25027.0  | 1.0
118201 | 1999-12-07 | 1420    | 01420 | 122.0 | 0.0  | 0.0  | 111.0 | 28.0  | 94.0 | 1420 | Worcester County | MA    | 25027     | 25027.0  | 1.0
#fill all non-matching labels with zero
complete_df['fire'] = complete_df['fire'].fillna(0)
# positive (fire) records are only about 0.3% of the data, so the classes are highly imbalanced
len(complete_df[complete_df['fire'] == 1])/len(complete_df)
0.002887641242127583
Index(['date', 'zipcode', 'zip', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN',
'TOBS', 'ZIP', 'COUNTYNAME', 'STATE', 'FIPS code', 'fipscode', 'fire'],
dtype='object')
#save clean data ready for model training
keep_cols = ['date', 'zip', 'FIPS code', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS', 'fire']
complete_df[keep_cols].to_csv("cleaned_data/model_data_all.csv", index=False)
Train model
1) Train test split
# load ready-for-model data
df = pd.read_csv("cleaned_data/model_data_all.csv")
  | date       | zip | FIPS code | PRCP | SNOW | SNWD | TMAX  | TMIN  | TOBS  | fire
0 | 1950-01-01 | 601 | 72001     | 18.0 | 0.0  | 0.0  | 183.0 | 128.0 | 156.0 | 0.0
1 | 1950-01-01 | 601 | 72113     | 18.0 | 0.0  | 0.0  | 183.0 | 128.0 | 156.0 | 0.0
2 | 1950-01-02 | 601 | 72001     | 18.0 | 0.0  | 0.0  | 194.0 | 122.0 | 167.0 | 0.0
16385692
X = df[['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
y = df['fire']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
2) Scale and downsample
# scale training data
scaler = preprocessing.StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
#downsample the negative (no-fire) labels by ~99%, keeping all positive labels
mask = y_train.apply(lambda c: c>0 or random.random() > 0.99)
X_train_bal = scaled_X_train[mask]
y_train_bal = y_train[mask]
len(X_train), len(X_train_bal), len(y_train[y_train == 1]), len(y_train_bal[y_train_bal ==1]), len(y_train[y_train == 0]), len(y_train_bal[y_train_bal ==0])
(14747122, 189151, 42655, 42655, 14704467, 146496)
#the ratio of positive vs. negative labels after downsampling
len(y_train_bal[y_train_bal ==1])/len(y_train_bal[y_train_bal ==0])
0.2911683595456531
# scale test data
scaled_X_test = scaler.transform(X_test)
3) MLP classifier
clf_bal = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf_bal.fit(X_train_bal, y_train_bal)
/Users/mo-b/anaconda3/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py:470: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(5, 2), learning_rate='constant',
learning_rate_init=0.001, max_fun=15000, max_iter=200,
momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
power_t=0.5, random_state=1, shuffle=True, solver='lbfgs',
tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
# accuracy
clf_bal.score(scaled_X_test, y_test)
0.9735952690455703
# auc score
prob_pred = clf_bal.predict_proba(scaled_X_test)[:, 1] # only get the predicted probability for class 1
auc_score = roc_auc_score(y_test, prob_pred)
auc_score
0.7337490009777365
y_pred = clf_bal.predict(scaled_X_test)
confusion_matrix(y_test, y_pred)
# tn, fp
# fn, tp
array([[1594599, 39310],
[ 3956, 705]])
#plot roc curve
# metrics.plot_roc_curve(clf_bal, scaled_X_test, y_test)
# plt.title("Fire prediction - ROC curve for MLP Classifier")
# plt.show()
# fine tune model
parameters = []
auc_scores = []
hidden_layer_sizes = [(5,2), (5,5), (10,5)] #size of each hidden layer
solvers = ['adam','sgd', 'lbfgs']
alphas = [0.001, 0.0001, 0.00001] #regularization term
for hidden_layer_size in hidden_layer_sizes:
    for solver in solvers:
        for alpha in alphas:
            mlp = MLPClassifier(alpha=alpha, hidden_layer_sizes=hidden_layer_size, random_state=1, solver=solver)
            mlp.fit(X_train_bal, y_train_bal)
            prob_pred = mlp.predict_proba(scaled_X_test)[:, 1]
            auc_score = roc_auc_score(y_test, prob_pred)
            parameters.append({"hidden_layer_size":hidden_layer_size, "solver":solver, "alpha":alpha})
            auc_scores.append(auc_score)
#get the parameters with the best auc score
max_idx = np.argmax(auc_scores)
auc_scores[max_idx], parameters[max_idx]
(0.7594437575487328,
{'hidden_layer_size': (10, 5), 'solver': 'adam', 'alpha': 1e-05})
#train model with best parameters
best_mlp = MLPClassifier(alpha=1e-05, hidden_layer_sizes=(10, 5), random_state=1, solver='adam')
best_mlp.fit(X_train_bal, y_train_bal)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(10, 5), learning_rate='constant',
learning_rate_init=0.001, max_fun=15000, max_iter=200,
momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
power_t=0.5, random_state=1, shuffle=True, solver='adam',
tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
#roc for best mlp classifier for testing data
metrics.plot_roc_curve(best_mlp, scaled_X_test, y_test)
plt.title("Fire prediction - ROC curve for MLP Classifier")
plt.show()
#roc for best mlp classifier for training data
metrics.plot_roc_curve(best_mlp, scaled_X_train, y_train)
plt.title("Fire prediction - ROC curve for MLP Classifier with training data")
plt.show()
4) Random forest classifier
rf = RandomForestClassifier(max_depth=10, random_state=0)
rf.fit(X_train_bal, y_train_bal)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=10, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=0, verbose=0,
warm_start=False)
# auc score
prob_pred = rf.predict_proba(scaled_X_test)[:, 1] # only get the predicted probability for class 1
auc_score = roc_auc_score(y_test, prob_pred)
auc_score
0.797966470622066
# fine tune model
rf_parameters = []
rf_auc_scores = []
n_estimators = [10, 50, 100]
criterions = ['gini','entropy']
max_depths = [10, 50, 100]
for criterion in criterions:
    for n_estimator in n_estimators:
        for max_depth in max_depths:
            rfm = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimator, criterion=criterion, random_state=1)
            rfm.fit(X_train_bal, y_train_bal)
            prob_pred = rfm.predict_proba(scaled_X_test)[:, 1]
            auc_score = roc_auc_score(y_test, prob_pred)
            rf_parameters.append({"n_estimator":n_estimator, "criterion":criterion, "max_depth":max_depth})
            rf_auc_scores.append(auc_score)
#get best parameters for random forest classifier
max_idx = np.argmax(rf_auc_scores)
rf_auc_scores[max_idx], rf_parameters[max_idx]
(0.8416267396854684,
{'n_estimator': 100, 'criterion': 'entropy', 'max_depth': 50})
#train model with best parameters
best_rf = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=50 )
best_rf.fit(X_train_bal, y_train_bal)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='entropy', max_depth=50, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
best_rf.score(scaled_X_test, y_test)
0.9057342682949157
best_rf.score(scaled_X_train, y_train)
0.9066968456624961
#plot roc curve
metrics.plot_roc_curve(best_rf, scaled_X_test, y_test)
plt.title("Fire prediction - ROC curve for random forest Classifier")
plt.show()
#plot roc curve
metrics.plot_roc_curve(best_rf, scaled_X_train, y_train)
plt.title("Fire prediction - ROC curve for random forest Classifier with training data")
plt.show()
- Based on the AUC scores, the random forest model (0.84) outperforms the MLP classifier (0.76). We will use the fine-tuned random forest model to predict the probability of fire events from the forecast NOAA weather data.
#feature importances, in the same order as the input features ['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']
best_rf.feature_importances_
array([0.08900428, 0.0059088 , 0.02556477, 0.30069328, 0.278647 ,
0.30018187])
Predict future fire
1) Load predicted weather data
pred_weather_df = pd.read_csv("cleaned_data/NOAA_weather_forecast_predictions.csv")
  | zipcode | year | PRCP      | SNOW      | SNWD       | TMAX       | TMIN      | TOBS
0 | 16601   | 2030 | 36.704178 | 6.832971  | 5.034125   | 155.709286 | 57.087498 | 79.935546
1 | 16601   | 2040 | 38.987804 | 8.066879  | -0.581523  | 157.545249 | 60.140693 | 78.845724
2 | 16601   | 2050 | 41.276487 | 9.303487  | -6.195984  | 159.381684 | 63.195606 | 77.752849
3 | 16601   | 2060 | 43.571408 | 10.541236 | -11.812570 | 161.216156 | 66.252145 | 76.652903
4 | 16601   | 2070 | 45.849895 | 11.775522 | -17.431140 | 163.057915 | 69.306966 | 75.563086
2) Scale predicted weather data
weather_forecast = pred_weather_df[['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
scaled_weather_forecast = scaler.transform(weather_forecast)
3) Predict fire
fire_prob_pred = best_rf.predict_proba(scaled_weather_forecast)[:, 1]
pred_weather_df['Fire'] = fire_prob_pred
  | zipcode | year | PRCP      | SNOW      | SNWD       | TMAX       | TMIN      | TOBS      | Fire
0 | 16601   | 2030 | 36.704178 | 6.832971  | 5.034125   | 155.709286 | 57.087498 | 79.935546 | 0.22
1 | 16601   | 2040 | 38.987804 | 8.066879  | -0.581523  | 157.545249 | 60.140693 | 78.845724 | 0.00
2 | 16601   | 2050 | 41.276487 | 9.303487  | -6.195984  | 159.381684 | 63.195606 | 77.752849 | 0.01
3 | 16601   | 2060 | 43.571408 | 10.541236 | -11.812570 | 161.216156 | 66.252145 | 76.652903 | 0.00
4 | 16601   | 2070 | 45.849895 | 11.775522 | -17.431140 | 163.057915 | 69.306966 | 75.563086 | 0.05
#save fire prediction to csv file
pred_weather_df.to_csv('cleaned_data/fire_prediction_w_noaa_weather_forecast.csv', index=False)
3260
check = pd.read_csv('cleaned_data/fire_prediction_w_noaa_weather_forecast.csv')
  | zipcode | year | PRCP      | SNOW      | SNWD       | TMAX       | TMIN      | TOBS      | Fire
0 | 16601   | 2030 | 36.704178 | 6.832971  | 5.034125   | 155.709286 | 57.087498 | 79.935546 | 0.22
1 | 16601   | 2040 | 38.987804 | 8.066879  | -0.581523  | 157.545249 | 60.140693 | 78.845724 | 0.00
2 | 16601   | 2050 | 41.276487 | 9.303487  | -6.195984  | 159.381684 | 63.195606 | 77.752849 | 0.01
3 | 16601   | 2060 | 43.571408 | 10.541236 | -11.812570 | 161.216156 | 66.252145 | 76.652903 | 0.00
4 | 16601   | 2070 | 45.849895 | 11.775522 | -17.431140 | 163.057915 | 69.306966 | 75.563086 | 0.05