Affordable Housing Model

Problem Statement

Purchasing a new home always begins as an exciting endeavor. There’s a certain thrill a home buyer experiences when deciding on the home layout, amenities, and location. In the U.S. some individuals, indeed, develop an addiction to house hunting. Since the 2008 housing collapse, home purchases have steadily risen from 4.12 million units sold in 2008 to 6.12 million in 2021. Furthermore, the yearly home purchasing rate is climbing faster as people are looking to relocate from apartments in expensive areas like San Francisco and Los Angeles to homes in less expensive cities. But, regardless of their preferences and buying power, every home purchaser will be forced to confront a less exciting notion – can I afford this? This realization is like a cold shower to their dreams. Now instead of looking at colorful backsplashes, homebuyers must consider interest rates, closing costs, HOA fees, down payments, repairs, property taxes, home insurance, mortgage insurance, appliance insurance, and a long list of other expenses that drastically inflate the cost of the home.

A home is by far the biggest purchase most people make. With all the unexpected costs associated with a home, people oftentimes find themselves blindsided and overwhelmed. That dream has quickly become a nightmare as home evictions and foreclosures are affecting millions of families. As of 2021, 11 million families in the U.S. are behind on their rent or mortgage payments. To put that into context, there haven’t been this many families this far behind since the Great Depression. This phenomenon is particularly troubling as it has substantial and far-reaching consequences not only on an individual family level but on a societal and economical level. For a large percentage of these families, the best recourse is to sadly vacate their homes and find a new home they can afford. However, they’re again in a position of purchasing/renting a home without understanding all the costs associated with this new home – thus perpetuating this crippling cycle.

Several mortgage calculators exist to assist homebuyers; however, few tools provide a pricing prediction model that reflects the full spectrum of costs associated with the home such as transportation costs. Traditional home appraisals are based on market value – how much money comparable properties within a certain proximity to the target home sold for. However, they don’t represent the true value of the home. My goal is to build a model that will accurately predict affordable homes to purchase or rent, according to costs not typically captured by traditional tools. Home buyers will be able to look at the prediction results and gain a more accurate understanding of how much a home purchase will impact their finances. Additionally, the model will analyze locations throughout the U.S. as more people are looking to relocate outside of their current metro area. Nationwide, 25% of homebuyers searched to move to a different metro area between Jan ‘23 - Mar ‘23. With this tool people can find the home of their dreams that they can truly afford.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KernelDensity
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import plotly.express as px

Data Source

The data I’ll use to build my prediction model is the Location Affordability Index (LAI), which comes from the United States Department of Housing and Urban Development (HUD):

https://hudgis-hud.opendata.arcgis.com/datasets/location-affordability-index-v-3/explore?location=20.578684%2C0.315617%2C2.75&showTable=true

The LAI was first launched in November 2013. Its goal is to provide a robust and standardized household housing and transportation cost estimate for the majority of the densely populated parts of the U.S. The data is generated from several federal sources and is produced from a mixture of statistical modeling and data analysis. The auto ownership, housing costs, and transit usage are modeled using simultaneous equation modeling (SEM) to capture the interrelationship of these factors. This dataset contains 73,763 records with 437 features per record.

k = 3
response1 = 'median_gross_rent'
response2 = 'median_smoc_mortgage'

raw_data = pd.read_csv("Location_Affordability_Index_v.3.csv")
data = raw_data.select_dtypes(include=np.number)
data = data.set_index("OBJECTID", drop = True)
data = data.dropna(axis = 1, thresh = .9 * len(data))
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())
data = data.dropna(axis = 0)
C:\Users\mo.berro\AppData\Local\Temp\ipykernel_12820\1877644915.py:5: DtypeWarning: Columns (48) have mixed types. Specify dtype option on import or set low_memory=False.
  raw_data = pd.read_csv("Location_Affordability_Index_v.3.csv")

Methodology

I will be using the Median Gross Rent and Median Mortgage as the response variables for my models. Since the intention is to predict these 2 variables accurately, I’ll initially need to observe their distribution. An early exploratory analysis using PCA to 2 dimensions and plotting the points shows that there are clusters that can be found in the data.

sns.kdeplot(data=data[['median_smoc_mortgage']])
sns.kdeplot(data=data[['median_gross_rent']])
fig = px.density_heatmap(data, x="median_smoc_mortgage", y="median_gross_rent", nbinsx  = 25, nbinsy  = 45)
fig.show()

png

png

As you can see above in the KDE Plots of the response variables, both have a fairly normal distribution. This means that using a linear regression model’s Normality assumption should hold, which is the final model I plan on creating and testing. Furthermore, in the 2D KDE Plot we can see the relationship between the two response variables is fairly strong, meaning the classification model should be able to group these datapoints into similar buckets. Because of the correlation between the two potential response variables, I will choose which ever one yields a higher calssification accuracy.

print(data.shape)
(73763, 437)

Above we can see that we have a little over 70,000 rows of data and 437 columns total. This is a lot of data and potentially contatins a lot of correlation. Because of this, I will clean the data by deleting all columns with greater than 90% nulls, filling all nulls with the mean of the column, scaling the data, and removing columns with greater than 80% correlation to an already existing one.

Below we are going to tune and eveluate the parameters and performance of 4 different classification algorithms. KNN, Logistic Regression, GMM, and Random Forest. Because of the structure of the data and the distribution of the KDE plots, my hypothesis before this analysis is that GMM will be the best classifier. As mentioned earlier, I am going to run all of the parameter tuning and accuracy evaluation for both Mdeian Mortgage and Rent response variables.

def remove_outliers_quantile(df, feature_series, iqr_multiple = 1.5):
    """
    Removes outlier in data based on quantile and multiple. Need to move to another file soon.
    """
    percentile_25 = feature_series.quantile(.25)
    percentile_75 = feature_series.quantile(.75)
    iqr = percentile_75 - percentile_25
    upper_limit = percentile_75 + iqr_multiple * iqr
    lower_limit = percentile_25 - iqr_multiple * iqr
    output = df.loc[(feature_series > lower_limit) & (feature_series < upper_limit)]
    if len(output) > 0:
        return output
    return df

data_no = data.copy()
remove_outliers = True
if remove_outliers:
    for col in data_no:
#         print(col, len(data_no))
        if len(col) == 0:
            break
        data_no = remove_outliers_quantile(data_no, data_no[col], 6)
print(data_no.shape)
(56361, 437)


X = data.drop(columns = [response1, response2])
y1 = data[response1]
y2 = data[response2]
y1_bins, bins = pd.qcut(y1, q = k, retbins = True, labels = np.arange(k))
y2_bins, bins = pd.qcut(y2, q = k, retbins = True, labels = np.arange(k))
drop_corr = False

if drop_corr:
    threshold = .9
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
    omit = []
    to_drop = [column for column in upper.columns if any(upper[column] > threshold) and column not in omit]
    X=X.drop(to_drop, axis=1)  


Below we will figure out the optimal number of principal compnents to use.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=10, whiten = True)
pca.fit(X_scaled.T)
X_pca = pca.components_.T
x = np.arange(len(pca.explained_variance_))
print(x)
plt.plot(x, pca.explained_variance_)
plt.show()
[0 1 2 3 4 5 6 7 8 9]

png

Using the elbow method from the plot in Figure 2, I decided using 3 components would be best. Hence, for the clustering model, I grouped data into 3 classes/bins based on the median mortgage and median gross rent in a given region.

This will help control for other factors and reduce noise in the final model. For instance, in the highest mortgage group,Ie would expect the median household income to be higher thanin thethe lower 2 groups. Since this would be a potential variable in future analysiI we could end up with a model that would be fairly accurate for that top group, but much less accurate for the lower groups, and vice versa. As such, grouping by this factor will ensure the final model overall will better fit the data.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=3, whiten = True)
pca.fit(X_scaled.T)
X_pca = pca.components_.T
y1_bins.value_counts()
y1_bins_temp = y1_bins
y1_bins_temp.name = 'bins'
combined = pd.concat([y1_bins_temp, y1], axis = 1)
print(combined)
print(combined.groupby('bins').mean())
print(combined.groupby('bins').std())

y2_bins.value_counts()
y2_bins_temp = y2_bins
y2_bins_temp.name = 'bins'
combined = pd.concat([y2_bins_temp, y2], axis = 1)
print(combined)
print(combined.groupby('bins').mean())
print(combined.groupby('bins').std())
         bins  median_gross_rent
OBJECTID                        
1           1         879.000000
2           0         738.000000
3           0         622.000000
4           0         533.000000
5           1         927.000000
...       ...                ...
73759       1        1029.443988
73760       1        1029.443988
73761       1        1029.443988
73762       1        1029.443988
73763       1        1029.443988

[73763 rows x 2 columns]
      median_gross_rent
bins                   
0            648.601047
1            934.612413
2           1505.719370
      median_gross_rent
bins                   
0             97.030643
1             86.699584
2            415.807965
         bins  median_smoc_mortgage
OBJECTID                           
1           1           1237.000000
2           1           1258.000000
3           0            737.000000
4           0            834.000000
5           0            860.000000
...       ...                   ...
73759       1           1564.982588
73760       1           1564.982588
73761       1           1564.982588
73762       1           1564.982588
73763       1           1564.982588

[73763 rows x 2 columns]
      median_smoc_mortgage
bins                      
0               976.909723
1              1400.666301
2              2318.579884
      median_smoc_mortgage
bins                      
0               126.084188
1               139.904376
2               583.642096

Above we did PCA on the dataset, extracting the top two principal compnents. This is in order to both visualize and improve the accuracy of the classification algorithms. By reduscing the dimensions, we are able to project the data into a lower dimension and capture the structure and clusters in the data more accurately.

print("Actual Boundary Plot for Rent")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=y1_bins, cmap='viridis')
plt.show()

print("Actual Boundary Plot for Mortgage")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=y2_bins, cmap='viridis')
plt.show()
Actual Boundary Plot for Rent

png

Actual Boundary Plot for Mortgage

png

As we can see above, by using 3 principal components, it looks as if the first plot for Rent will have better clusters. But we will see mathmatically soon.

#Need to tune number of neighbors of KNN

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_pca, y1_bins)
knn_predictions = knn.predict(X_pca)

logReg = LogisticRegression(multi_class='auto')
logReg.fit(X_pca, y1_bins)
logReg_predictions = logReg.predict(X_pca)

clf_GB = GaussianNB()
clf_GB.fit(X_pca, y1_bins)
clf_GB_predictions = clf_GB.predict(X_pca)

rf = RandomForestClassifier()
rf.fit(X_pca, y1_bins)
rf_predictions = clf_GB.predict(X_pca)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning:

Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
print("KNN Boundary Plot")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=knn_predictions, cmap='viridis')
plt.show()
print("Logistic Regression Boundary Plot")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=logReg_predictions, cmap='viridis')
print(len(np.unique(logReg_predictions)))
plt.show()
print("Naive Bayes Boundary Plot")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=clf_GB_predictions, cmap='viridis')
plt.show()
print("Random Forest Boundary Plot")
plt.scatter(X_pca[:,0], X_pca[:,1],  c=rf_predictions, cmap='viridis')
plt.show()
KNN Boundary Plot

png

Logistic Regression Boundary Plot
3

png

Naive Bayes Boundary Plot

png

Random Forest Boundary Plot

png

def get_correct(labels, predictions):
    correct_array = [1 for idx in range(len(X_test)) if predictions[idx] == labels[idx]]
    return sum(correct_array)
               
def get_accuracy(labels, predictions):
    correct_labels = get_correct(labels, predictions)
    return correct_labels/len(predictions)

def prediction_format(n):
    return str(np.round(n * 100, 4)) + '%'


X_acc = X_pca
X_train, X_test, y_train, y_test = train_test_split(X_acc, y1_bins, test_size=0.2)
labels = list(y_test)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

logReg = LogisticRegression()
logReg.fit(X_train, y_train)
logReg_predictions = logReg.predict(X_test)

clf_GB = GaussianNB()
clf_GB.fit(X_train, y_train)
clf_GB_predictions = clf_GB.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning:

Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
knn_accuracy = prediction_format(get_accuracy(labels, knn_predictions))
logReg_accuracy = prediction_format(get_accuracy(labels, logReg_predictions))
clf_GB_accuracy = prediction_format(get_accuracy(labels, clf_GB_predictions))
rf_accuracy = prediction_format(get_accuracy(labels, rf_predictions))
print("For Response Variable of Rent")
print(f"""KNN Accuracy : {knn_accuracy} \nLogistic Regression Accuracy: {logReg_accuracy} 
      \nNaive Bayes Accuracy: {clf_GB_accuracy} \n Random Forest Accuracy {rf_accuracy}""")
For Response Variable of Rent
KNN Accuracy : 68.8335% 
Logistic Regression Accuracy: 66.8745% 
      
Naive Bayes Accuracy: 66.0679% 
 Random Forest Accuracy 68.0201%
X_acc = X_pca
X_train, X_test, y_train, y_test = train_test_split(X_acc, y2_bins, test_size=0.2)
labels = list(y_test)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

logReg = LogisticRegression()
logReg.fit(X_train, y_train)
logReg_predictions = logReg.predict(X_test)

clf_GB = GaussianNB()
clf_GB.fit(X_train, y_train)
clf_GB_predictions = clf_GB.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
knn_accuracy = prediction_format(get_accuracy(labels, knn_predictions))
logReg_accuracy = prediction_format(get_accuracy(labels, logReg_predictions))
clf_GB_accuracy = prediction_format(get_accuracy(labels, clf_GB_predictions))
rf_accuracy = prediction_format(get_accuracy(labels, rf_predictions))

print("For Response Variable of Mortgage")
print(f"""KNN Accuracy : {knn_accuracy} \nLogistic Regression Accuracy: {logReg_accuracy} 
      \nNaive Bayes Accuracy: {clf_GB_accuracy} \n Random Forest Accuracy {rf_accuracy}""")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning:

Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.



For Response Variable of Mortgage
KNN Accuracy : 72.7784% 
Logistic Regression Accuracy: 68.0878% 
      
Naive Bayes Accuracy: 69.5858% 
 Random Forest Accuracy 71.8837%

As we can see using a train size of 80% of the data and test size of 20%. The most accurate model was predicitng the mortgage value bin using the KNN algorithm. Because of this we are going to observe the elbow method and further improve the algorithm.

error_rate = []
neigbors = []
# Will take some time
for i in range(1,200, 10):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
    neigbors.append(i)


plt.figure(figsize=(10,6))
plt.plot(neigbors,error_rate,color='blue', linestyle= 'dashed', marker='o',
 markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Text(0, 0.5, 'Error Rate')

png

As you can see above the optimal number of neighbors is around 77 when the marginal returns start to become negative. Below we will run the algotihm again and look at the accuracy.

from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=77)
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)
knn_accuracy = prediction_format(get_accuracy(labels, knn_predictions))
print(f"Knn Accuracy Neighbors = 10 : {knn_accuracy}")
confusion_matrix(y_test,knn_predictions)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning:

Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.



Knn Accuracy Neighbors = 10 : 73.5444%





array([[3963,  898,   60],
       [1168, 2992,  784],
       [ 108,  885, 3895]], dtype=int64)
rent_0 = [i for i in range(len(y1_bins)) if y1_bins.values[i] == 0]
rent_1 = [i for i in range(len(y1_bins)) if y1_bins.values[i] == 1]
rent_2 = [i for i in range(len(y1_bins)) if y1_bins.values[i] == 2]

mort_0 = [i for i in range(len(y2_bins)) if y2_bins.values[i] == 0]
mort_1 = [i for i in range(len(y2_bins)) if y2_bins.values[i] == 1]
mort_2 = [i for i in range(len(y2_bins)) if y2_bins.values[i] == 2]

rent_0_x = X_scaled[[rent_0],:][0]
rent_1_x = X_scaled[[rent_1],:][0]
rent_2_x = X_scaled[[rent_2],:][0]

rent_0_y = y1[[x+1 for x in rent_0]]
rent_1_y = y1[[x+1 for x in rent_1]]
rent_2_y = y1[[x+1 for x in rent_2]]

mort_0_x = X_scaled[[mort_0],:][0]
mort_1_x = X_scaled[[mort_1],:][0]
mort_2_x = X_scaled[[mort_2],:][0]

mort_0_y = y2[[x+1 for x in mort_0]]
mort_1_y = y2[[x+1 for x in mort_1]]
mort_2_y = y2[[x+1 for x in mort_2]]



rent_1_x.shape
(24547, 435)
X_rent_0_train, X_rent_0_test, y_rent_0_train, y_rent_0_test = train_test_split(rent_0_x, rent_0_y, test_size=0.2)
labels_rent_0 = list(y_rent_0_test)
from sklearn.linear_model import LassoCV, RidgeCV, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

alphas = {'alpha': [1,10,25, 50,100, 250, 500, 1000, 2500, 5000, 10000]}
mse = make_scorer(mean_squared_error, greater_is_better=False)
clf_ridge = Ridge()
clf_lasso = Lasso()

gridsearch_ridge = GridSearchCV(clf_ridge, alphas, cv=10, scoring=mse)
gridsearch_ridge.fit(X_train, y_train)
# print("mean_test_score :", gridsearch_ridge.cv_results_['mean_test_score'])
# for item in gridsearch_ridge.cv_results_:
#     print( "\t%s %s %s" % ('\tGRIDSCORES\t',  "R" , item))

best_clf = gridsearch_ridge.best_estimator_
best_clf.fit(X_train,y_train)
predict = best_clf.predict(X_test)
corr_matrix = np.corrcoef(y_test, predict)
corr = corr_matrix[0,1]
R_sq = corr**2

print(R_sq)
# importance_ridge = np.abs(best_clf.coef_)
0.5813973231151033
from pandas import read_csv
# from sklearn.linear_model import LassoCV, RidgeCV, Ridge

X_train_rent_mo, X_test_rent_mo, y_train_rent_mo, y_test_rent_mo = train_test_split(X, y1, test_size=0.2)
X_train_mort_mo, X_test_mort_mo, y_train_mort_mo, y_test_mort_mo = train_test_split(X, y2, test_size=0.2)

# load the dataset
model = Ridge(alpha=.1)
model.fit(X_train_rent_mo, y_train_rent_mo)
predict = model.predict(X_test_rent_mo)

corr_matrix = np.corrcoef(y_test_rent_mo, predict)
corr = corr_matrix[0,1]
rent_R_sq = corr**2

model_mort = Ridge(alpha=.1)
model_mort.fit(X_train_mort_mo, y_train_mort_mo)
predict_mort = model_mort.predict(X_test_mort_mo)

corr_matrix_mort = np.corrcoef(y_test_mort_mo, predict_mort)
corr_mort = corr_matrix_mort[0,1]
mort_R_sq = corr_mort**2

pd.options.display.float_format = '{:20,.2f}'.format

# print("data point 1 is:\n", X_test_mort_mo[X_test_mort_mo['GEOID'] == 48113013606.00])
print("data point 1 is:\n", X_test_mort_mo.iloc[0])
print("mortgage prediction is:\n", predict_mort[0])
print("mortgage actual is:\n", y_test_mort_mo.iloc[0])

print("rent R_sq:", rent_R_sq)
print("mortgage R_sq:", mort_R_sq)
data point 1 is:
 GEOID                25,001,014,600.00
STATE                            25.00
COUNTY                            1.00
TRACT                        14,600.00
CNTY_FIPS                    25,001.00
                          ...         
hh8_pctile_all                   71.25
hh8_pctile_own                   63.84
hh8_pctile_rent                  92.29
SHAPE_Length                      0.20
SHAPE_Area                        0.00
Name: 31191, Length: 435, dtype: float64
mortgage prediction is:
 1679.1039642485057
mortgage actual is:
 1688.0
rent R_sq: 0.8401476053473638
mortgage R_sq: 0.9167923725805177
model = Ridge(alpha=.1)
model.fit(X_scaled, y_scaled)
predict = model.predict(X)

corr_matrix = np.corrcoef(y_scaled, predict)
corr = corr_matrix[0,1]
R_sq = corr**2
 
print(R_sq)

rent_0_model = LassoCV(cv=10, random_state=42).fit(X_rent_0_train, y_rent_0_train)
rent_0_model = LassoCV(cv=10, random_state=42).fit(X_rent_0_train, y_rent_0_train)
rent_1_model = LassoCV(cv=10, random_state=42).fit(rent_1_x, rent_1_y)
rent_2_model = LassoCV(cv=10, random_state=42).fit(rent_2_x, rent_2_y)

mort_0_model = LassoCV(cv=10, random_state=42).fit(mort_0_x, mort_0_y)
mort_1_model = LassoCV(cv=10, random_state=42).fit(mort_1_x, mort_1_y)
mort_2_model = LassoCV(cv=10, random_state=42).fit(mort_2_x, mort_2_y)
lasso_predictions = rent_0_model.predict(X_rent_0_test)
rent_0_y = np.array(y_rent_0_test)


corr_matrix = np.corrcoef(rent_0_y, lasso_predictions)
corr = corr_matrix[0,1]
R_sq = corr**2
 
print(R_sq)

# lasso_accuracy = prediction_format(get_accuracy(rent_0_y, lasso_predictions))
# print(f"Knn Accuracy Neighbors = 10 : {lasso_accuracy}")
# confusion_matrix(y_test,lasso_predictions)
0.4353159897588665
labels = data.columns
labels = np.delete(labels, np.where(labels == 'median_smoc_mortgage'))
labels = np.delete(labels, np.where(labels == 'median_gross_rent'))
print(labels.shape)
(435,)
mort_0_coef = np.column_stack((labels,mort_0_model.coef_))
mort_1_coef = np.column_stack((labels,mort_1_model.coef_))
mort_2_coef = np.column_stack((labels,mort_2_model.coef_))
rent_0_coef = np.column_stack((labels,rent_0_model.coef_))
rent_1_coef = np.column_stack((labels,rent_1_model.coef_))
rent_2_coef = np.column_stack((labels,rent_2_model.coef_))
ind = []

for i in range(0,len(mort_0_coef)):
    if mort_0_coef[i,1] > 0.0:
        ind.append(i)
        
mort_0_coef = mort_0_coef[ind,:]
mort_0_coef
        
array([['GEOID', 4.342395032877434],
       ['COUNTY', 0.19163855171714017],
       ['TRACT', 1.915720516573811],
       ['owner_occupied_hu', 3.9010156566655114],
       ['pct_renters', 0.9759289212728534],
       ['pct_transit_j2w', 16.801745685351438],
       ['avg_h_cost', 483.10025961958985],
       ['autos_per_hh_renters', 2.6193724107531065],
       ['autos_per_hh_owner', 18.32358328022762],
       ['avg_hh_size_renters', 0.006226173972520718],
       ['avg_hh_size_owners', 6.244454577937837],
       ['median_rooms_per_owner_hu', 22.274103219450602],
       ['area_income_owner_frac', 1.8647653550308763],
       ['block_density', 7.016480602849526],
       ['avg_block_acres', 0.02264915317569273],
       ['retail_gravity', 6.787118262438466],
       ['hh1_model_autos_per_hh_owners', 44.87663318368796],
       ['hh1_alpha', 0.8004556717800442],
       ['hh1_beta', 0.09373416831011712],
       ['hh1_transit_cost_renters', 5.909863103060637],
       ['hh1_pctile_all', 5.480612743945632],
       ['hh2_model_autos_per_hh_owners', 6.724416041135141],
       ['hh2_alpha', 0.012718908759593524],
       ['hh2_h_renters', 19.984155168374418],
       ['hh2_ht_renters', 0.0002357629508448666],
       ['hh2_pctile_all', 10.61308251028594],
       ['hh3_control_hh_income', 16.389977407919197],
       ['hh3_model_autos_per_hh_owners', 18.52382781110702],
       ['hh3_model_vmt_per_hh_renters', 0.40985446056415736],
       ['hh3_income_bin', 10.596264830785321],
       ['hh3_vmt_cost_renters', 0.7808956216807588],
       ['hh3_h_renters', 6.736699447982402],
       ['hh3_pctile_all', 14.633838017943209],
       ['hh4_control_hh_income', 0.3507352871405541],
       ['hh4_model_autos_per_hh_owners', 3.0751160724637105],
       ['hh4_income_bin', 1.2211752234982678],
       ['hh4_h_renters', 5.880256074488133],
       ['hh4_pctile_all', 6.025288286118755],
       ['hh5_control_hh_income_frac', 0.34420991111668114],
       ['hh5_control_hh_income', 1.0081561470108613],
       ['hh5_model_vmt_per_hh_renters', 11.06235011038174],
       ['hh5_income_bin', 1.9381190984826295],
       ['hh5_vmt_cost_renters', 12.129134170792748],
       ['hh5_h_renters', 5.20216775894603],
       ['hh5_pctile_all', 7.107837179641571],
       ['hh6_control_hh_income', 3.5730236499347403],
       ['hh6_income_bin', 0.07365569258878046],
       ['hh6_h_renters', 7.5183945139603985],
       ['hh6_pctile_all', 10.447098816908174],
       ['hh7_control_hh_income_frac', 0.24434185689388713],
       ['hh7_control_hh_income', 1.1399035470397183],
       ['hh7_income_bin', 1.815065036259146],
       ['hh7_pctile_all', 0.20676082541379026],
       ['hh8_control_hh_income', 0.5454678807045896],
       ['hh8_model_pct_transit_commute_1', 32.218615461928124],
       ['hh8_beta', 0.36222189425289225],
       ['hh8_transit_cost_renters', 0.5557159953376791],
       ['hh8_pctile_rent', 1.5304707373366022],
       ['SHAPE_Length', 1.730891068519614]], dtype=object)
ind = []

for i in range(0,len(mort_1_coef)):
    if mort_1_coef[i,1] > 0.0:
        ind.append(i)
        
mort_1_coef = mort_1_coef[ind,:]
mort_1_coef
array([['GEOID', 6.250565140989124],
       ['STATE', 1.135850126865088],
       ['TRACT', 6.3923680948013155],
       ['CNTY_FIPS', 1.1915778153842793],
       ['renter_occupied_hu', 3.5854740094062407],
       ['pct_renter_occupied_hu', 41.74791641420234],
       ['pct_transit_j2w', 12.978064681379637],
       ['avg_h_cost', 444.1380862147298],
       ['autos_per_hh_renters', 0.9367460176195294],
       ['autos_per_hh_owner', 19.805208224171405],
       ['commuters_per_hh_renters', 4.03169687964537],
       ['avg_hh_size', 3.623094081550447],
       ['median_hh_income', 2.2072394703203244],
       ['median_rooms_per_owner_hu', 23.573337655968345],
       ['area_income_owner_frac', 28.79144008464533],
       ['block_density', 14.734337329928122],
       ['avg_block_acres', 0.8880935297384656],
       ['area_stfid', 0.2342796697916641],
       ['hh1_auto_own_cost_renters', 2.319057167509444],
       ['hh1_auto_own_cost', 3.260511697029176],
       ['hh1_pctile_all', 2.3592663503192868],
       ['hh1_pctile_own', 0.08823393267670537],
       ['hh2_fixes', 0.3937550892461732],
       ['hh2_model_autos_per_hh_owners', 13.847629647089988],
       ['hh2_auto_own_cost_owners', 0.48446465092744845],
       ['hh2_h_renters', 6.770055017834036],
       ['hh2_pctile_own', 2.2139007565659563],
       ['hh3_control_hh_income_frac', 0.15757990101768204],
       ['hh3_fixes', 1.8440022380540473],
       ['hh3_model_autos_per_hh_owners', 3.171533952272114],
       ['hh3_pctile_all', 2.819423122040537],
       ['hh3_pctile_rent', 1.443635813025792],
       ['hh4_control_hh_income_frac', 1.7746546346681755],
       ['hh4_fixes', 0.5627993092408384],
       ['hh4_model_h_cost_renters', 3.2531171485774935],
       ['hh4_model_pct_transit_commute_1', 15.170926277174532],
       ['hh4_transit_cost_renters', 0.46968122724407996],
       ['hh4_transit_trips_renters', 2.0028072713707403e-08],
       ['hh5_fixes', 0.1645359388276737],
       ['hh5_model_vmt_per_hh_renters', 27.376988254653984],
       ['hh5_income_bin', 3.9831899515990585],
       ['hh5_auto_own_cost_owners', 9.013181632907429],
       ['hh5_auto_own_cost_renters', 4.174568500394612],
       ['hh5_vmt_cost_renters', 0.4413695050155019],
       ['hh5_t_cost_renters', 8.230613651131529],
       ['hh5_h', 14.926922345320207],
       ['hh5_pctile_all', 6.694619046764159],
       ['hh5_pctile_rent', 2.857004871390125],
       ['hh6_fixes', 0.026920521123497645],
       ['hh6_model_h_cost_renters', 7.0874793304970325],
       ['hh6_model_pct_transit_commute_1', 9.3558496776132],
       ['hh6_alpha', 0.7725468392149517],
       ['hh6_pctile_all', 11.430273562256145],
       ['hh7_fixes', 0.15282501729306272],
       ['hh7_model_h_cost_renters', 2.6981465121918373],
       ['hh7_income_bin', 1.3142993055813221],
       ['hh7_auto_own_cost', 0.4647861679261522],
       ['hh8_control_hh_income_frac', 1.9891365214937597],
       ['hh8_model_h_cost_renters', 0.23831837148087914],
       ['hh8_model_pct_transit_commute_1', 4.274998686676373],
       ['hh8_pctile_rent', 5.033471607819599],
       ['SHAPE_Length', 0.4492620560340255],
       ['SHAPE_Area', 1.1080951536643915]], dtype=object)
ind = []

for i in range(0,len(mort_2_coef)):
    if mort_2_coef[i,1] > 0.0:
        ind.append(i)
        
mort_2_coef = mort_2_coef[ind,:]
mort_2_coef
array([['pct_renter_occupied_hu', 176.65980205747218],
       ['pct_transit_j2w', 22.659621221582427],
       ['avg_h_cost', 412.8582461800739],
       ['autos_per_hh_owner', 31.82253563700712],
       ['commuters_per_hh_renters', 3.945596385272496],
       ['avg_hh_size_owners', 61.03757013631235],
       ['avg_hh_size', 0.9869182185505022],
       ['median_hh_income', 52.510009267244975],
       ['median_rooms_per_owner_hu', 54.13735414775618],
       ['area_income_owner_frac', 129.11356921657625],
       ['block_density', 22.22191303085079],
       ['retail_gravity', 87.33801418169598],
       ['area_stfid', 19.540157680554344],
       ['hh1_fixes', 61.509192395240035],
       ['hh1_model_h_cost_renters', 76.68142572418903],
       ['hh1_model_pct_transit_commute_2', 98.50405010501073],
       ['hh1_alpha', 10.932363268410313],
       ['hh1_income_bin', 19.51608220742767],
       ['hh1_transit_trips_owners', 1.1651648006727664],
       ['hh1_h_owners', 57.44443949439223],
       ['hh1_auto_own_cost', 30.80906511600763],
       ['hh1_pctile_all', 8.788855224004578],
       ['hh1_pctile_own', 1.2878323093933557],
       ['hh1_pctile_rent', 13.477391669825677],
       ['hh2_control_hh_income_frac', 51.140621956975686],
       ['hh2_fixes', 4.125582329404459],
       ['hh2_transit_trips_owners', 3.5133277706738655],
       ['hh2_transit_cost', 58.99398898530165],
       ['hh2_pctile_all', 2.931261600833152],
       ['hh2_pctile_own', 18.97319777514435],
       ['hh2_pctile_rent', 0.5048678605314753],
       ['hh3_fixes', 17.742977617261165],
       ['hh3_transit_cost', 17.048388342812633],
       ['hh3_pctile_all', 66.70079526695018],
       ['hh4_control_hh_income_frac', 18.978543098176612],
       ['hh4_fixes', 2.274080745635461],
       ['hh4_h_owners', 26.99631197873531],
       ['hh4_transit_cost', 26.87987950015642],
       ['hh4_pctile_own', 30.375823755450405],
       ['hh5_control_hh_income_frac', 4.771414634606718],
       ['hh5_fixes', 24.658028717447397],
       ['hh5_model_vmt_per_hh', 40.90373520796679],
       ['hh5_income_bin', 0.5384067217388464],
       ['hh5_h_owners', 2.736974735388769],
       ['hh5_vmt_cost', 54.32994221098927],
       ['hh6_control_hh_income_frac', 5.682550748572394],
       ['hh6_fixes', 5.079558033835491],
       ['hh6_h_owners', 3.6525669187781884],
       ['hh7_control_hh_income_frac', 11.32066540293657],
       ['hh7_fixes', 0.13336721244073654],
       ['hh7_income_bin', 0.7994782334262507],
       ['hh7_auto_own_cost_owners', 33.340779182744164],
       ['hh7_h_owners', 24.613400249107922],
       ['hh7_auto_own_cost', 15.622061675702895],
       ['hh8_control_hh_income_frac', 5.46614483428138],
       ['hh8_fixes', 0.0025519981693439856],
       ['hh8_h_owners', 23.03304731597945],
       ['SHAPE_Length', 10.008102719639377]], dtype=object)
ind = []

for i in range(0,len(rent_0_coef)):
    if rent_0_coef[i,1] > 0.0:
        ind.append(i)
        
rent_0_coef = rent_0_coef[ind,:]
rent_0_coef
array([['households', 6.3992697639052825],
       ['renter_occupied_hu', 8.963711420604396],
       ['pct_renter_occupied_hu', 40.624155905562574],
       ['pct_transit_j2w', 1.8374951923895353],
       ['avg_h_cost', 104.48043921382131],
       ['autos_per_hh_renters', 7.569797880835349],
       ['commuters_per_hh_renters', 0.9324673645543244],
       ['commuters_per_hh', 13.340692134826531],
       ['avg_hh_size_renters', 14.233925536598838],
       ['median_rooms_per_renter_hu', 32.96779906469971],
       ['median_rooms_per_owner_hu', 3.8441748067933155],
       ['pct_hu_1_detached', 7.243497481608775],
       ['gross_hh_density', 32.93549841904835],
       ['area_income_frac', 81.2034259947272],
       ['area_median_hh_income', 35.80128596279289],
       ['block_density', 0.04421901250198968],
       ['avg_block_acres', 0.3225231277888082],
       ['retail_density_simple', 9.214602066668204],
       ['job_gravity', 41.1635981382515],
       ['area_stfid', 0.8426657618451407],
       ['hh1_control_hh_income', 0.724786674706888],
       ['hh1_model_autos_per_hh_owners', 1.2715045467741122],
       ['hh1_model_autos_per_hh_renters', 3.900159607803279],
       ['hh1_model_autos_per_hh', 6.066811097901843],
       ['hh1_h_renters', 3.7030342357133392],
       ['hh2_model_vmt_per_hh_owners', 11.774379115013572],
       ['hh2_model_pct_transit_commute_1', 6.33552426043548],
       ['hh2_model_h_cost', 24.653799214997044],
       ['hh2_alpha', 0.7421570679136885],
       ['hh2_transit_cost_owners', 0.11640258754369821],
       ['hh2_transit_cost_renters', 3.8255112544884375],
       ['hh2_h', 48.8567534898955],
       ['hh2_ht', 15.294975069823307],
       ['hh2_pctile_own', 3.990124932404667],
       ['hh2_pctile_rent', 0.1317587853468248],
       ['hh3_control_hh_income', 17.149882674911797],
       ['hh3_model_vmt_per_hh_owners', 9.693006634304435],
       ['hh3_alpha', 3.6206589128453786],
       ['hh3_income_bin', 10.266943033589836],
       ['hh3_transit_cost_owners', 8.173549017849668],
       ['hh3_ht_owners', 0.11030494782534046],
       ['hh3_h_renters', 13.338736808658608],
       ['hh3_h', 9.625571458755632],
       ['hh3_pctile_own', 3.218696621312924],
       ['hh4_model_vmt_per_hh_owners', 0.4660155966313314],
       ['hh4_model_pct_transit_commute_1', 55.72937697320102],
       ['hh4_alpha', 0.013553764155267397],
       ['hh4_income_bin', 0.005850415899724274],
       ['hh4_vmt_cost_owners', 5.5136046559762235],
       ['hh4_h_renters', 1.1728088189608663],
       ['hh5_control_hh_income_frac', 0.4901229400576104],
       ['hh5_model_pct_transit_commuters', 33.74297251955882],
       ['hh5_model_pct_transit_commute_1', 9.027597015736637],
       ['hh5_h_renters', 0.0010952907602872432],
       ['hh6_control_hh_income_frac', 0.15023931150917028],
       ['hh6_control_hh_income', 0.9051657706558329],
       ['hh6_model_pct_transit_commute_1', 13.992662445118812],
       ['hh6_income_bin', 13.782538136479706],
       ['hh6_transit_cost_renters', 7.141216249590956],
       ['hh6_pctile_own', 1.211736987730039],
       ['hh7_control_hh_income_frac', 0.012138440863709968],
       ['hh7_control_hh_income', 0.7043485197483663],
       ['hh7_income_bin', 2.362233368409586],
       ['hh8_control_hh_income_frac', 0.4850360059221901],
       ['hh8_control_hh_income', 1.3957344612103373],
       ['hh8_pctile_all', 14.418957952591265],
       ['SHAPE_Area', 3.2143253661294127]], dtype=object)
ind = []

for i in range(0,len(rent_1_coef)):
    if rent_1_coef[i,1] > 0.0:
        ind.append(i)
        
rent_1_coef = rent_1_coef[ind,:]
rent_1_coef
array([['COUNTY', 0.05524057451308355],
       ['TRACT', 2.6592308417154884],
       ['CNTY_FIPS', 2.3447559082543576],
       ['households', 2.175551894641939],
       ['renter_occupied_hu', 2.2613255313250553],
       ['pct_renter_occupied_hu', 4.202413681985223],
       ['avg_h_cost', 78.11314796499232],
       ['autos_per_hh_renters', 11.885115140534094],
       ['commuters_per_hh_owners', 2.12763161721547],
       ['avg_hh_size_renters', 8.332336603131049],
       ['avg_hh_size', 8.148891619170104],
       ['area_income_renter_frac', 6.743609161836111],
       ['median_hh_income', 6.296070372996913],
       ['median_rooms_per_renter_hu', 17.169481560532745],
       ['pct_hu_1_detached', 27.864459650361205],
       ['gross_hh_density', 5.5998140853542075],
       ['area_income_frac', 6.2076952824408895],
       ['area_median_hh_income', 18.805780611757374],
       ['avg_block_acres', 3.450543183705727],
       ['job_density_simple', 2.6792268787913724],
       ['retail_density_simple', 0.9647915427872197],
       ['median_commute', 4.385582964058425],
       ['area_stfid', 0.9711916372869209],
       ['hh1_control_hh_income', 4.1410709582541],
       ['hh1_fixes', 0.13955567337712574],
       ['hh1_alpha', 1.6020690867254104],
       ['hh1_income_bin', 3.9505147268808125],
       ['hh1_transit_cost_renters', 10.636207906031816],
       ['hh1_transit_trips_renters', 6.373504144880811e-08],
       ['hh1_pctile_own', 1.1198643125031147],
       ['hh1_pctile_rent', 2.471817838469886],
       ['hh2_model_pct_transit_commuters', 1.9939967372175373],
       ['hh2_model_h_cost', 35.43487133368318],
       ['hh2_transit_cost_owners', 0.19204120489601453],
       ['hh2_auto_own_cost_renters', 4.138206756413459],
       ['hh2_h', 7.903249398909929],
       ['hh2_ht', 15.653338363048324],
       ['hh2_pctile_all', 0.823102094272021],
       ['hh2_pctile_own', 2.4395450841239086],
       ['hh2_pctile_rent', 2.7823722520855885],
       ['hh3_fixes', 0.006310522017100858],
       ['hh3_model_vmt_per_hh', 20.991652878298083],
       ['hh3_transit_cost_owners', 3.420535267836989],
       ['hh3_t_renters', 17.664125380774404],
       ['hh3_t', 11.671559107749554],
       ['hh3_ht', 0.3444215017488101],
       ['hh4_alpha', 1.107395384886942],
       ['hh4_income_bin', 0.6366122333743773],
       ['hh4_transit_cost_owners', 6.018393279868836],
       ['hh4_pctile_all', 0.03448219060778989],
       ['hh5_model_vmt_per_hh', 16.452106265535882],
       ['hh5_vmt_cost', 15.129517386781924],
       ['hh5_t', 3.5959518353747053],
       ['hh6_pctile_own', 0.5219529182322586],
       ['hh7_transit_cost_renters', 2.962062414959052],
       ['hh7_transit_trips', 0.2358858090697327],
       ['hh8_fixes', 0.015981486987857823],
       ['hh8_alpha', 0.06555299567059127],
       ['hh8_transit_trips', 1.2369301427726507],
       ['hh8_pctile_all', 3.2486985029714663]], dtype=object)
ind = []

for i in range(0,len(rent_2_coef)):
    if rent_2_coef[i,1] > 0.0:
        ind.append(i)
        
rent_2_coef = rent_2_coef[ind,:]
rent_2_coef
array([['COUNTY', 12.71438946523027],
       ['owner_occupied_hu', 4.033926213250657],
       ['pct_renter_occupied_hu', 13.902947765385317],
       ['avg_h_cost', 250.31355866564547],
       ['autos_per_hh_renters', 16.83845196791879],
       ['commuters_per_hh_owners', 18.443280558193827],
       ['commuters_per_hh', 15.040505789941756],
       ['avg_hh_size_renters', 63.24671642112163],
       ['avg_hh_size_owners', 7.909559415927303],
       ['area_income_renter_frac', 48.03396836679268],
       ['median_rooms_per_renter_hu', 67.46719738870159],
       ['pct_hu_1_detached', 49.49658161985653],
       ['gross_hh_density', 29.44293657453713],
       ['retail_density_simple', 2.3163072488921626],
       ['job_gravity', 67.04668694755095],
       ['median_commute', 4.894370539171933],
       ['area_stfid', 9.572818124984364],
       ['hh1_fixes', 6.220486087276407],
       ['hh1_model_h_cost_renters', 73.25660890499707],
       ['hh1_beta', 1.222275888627313],
       ['hh1_income_bin', 11.178464127700066],
       ['hh1_vmt_cost_owners', 17.369038605194234],
       ['hh1_vmt_cost_renters', 2.3976851718873333],
       ['hh2_model_vmt_per_hh_owners', 3.3515291133146685],
       ['hh2_model_h_cost', 39.490516334512144],
       ['hh2_beta', 2.276829898812266],
       ['hh2_vmt_cost_owners', 0.3110955617618067],
       ['hh2_h', 19.55333636488516],
       ['hh2_ht', 0.6080527928531865],
       ['hh2_pctile_all', 16.371410875042056],
       ['hh2_pctile_rent', 5.647928816573068],
       ['hh3_model_h_cost_renters', 2.8030995060116086],
       ['hh3_income_bin', 16.72276148856934],
       ['hh4_model_h_cost_renters', 13.29949429289613],
       ['hh4_income_bin', 5.684196346000215],
       ['hh5_model_h_cost_renters', 5.388759799985472],
       ['hh5_model_vmt_per_hh', 28.47295698655181],
       ['hh5_vmt_cost', 27.34196458427194],
       ['hh6_fixes', 17.429392565896055],
       ['hh6_beta', 1.1644108630151506],
       ['hh7_fixes', 1.2420040611694876],
       ['hh7_model_h_cost_renters', 1.2312373662738203],
       ['hh8_fixes', 16.05417789856697],
       ['hh8_model_h_cost_renters', 0.18141101397399054],
       ['hh8_vmt_cost_owners', 62.631258796930254],
       ['hh8_transit_cost_owners', 2.3339679080669287],
       ['hh8_auto_own_cost_renters', 16.698857897719325],
       ['hh8_vmt_cost_renters', 25.253262973823606],
       ['hh8_t_cost_renters', 1.319626682916903],
       ['hh8_transit_cost', 50.82201400070386],
       ['hh8_pctile_all', 7.696143713227803]], dtype=object)
Mo Berro
Machine Learning Engineer

My research interests include distributed systems, mobile computing and programmable matter.