Problem Statement
American Football is well-known to be a team sport, requiring the skills of a wide variety of players and careful coordination of all teammates. The ultimate goal of the team is to win as many games as possible, hopefully becoming the top team in all of college football. But how do the stats associated with each player contribute to the odds of reaching the top 25 by the end of the year?
In this project, I dive into how different players’ stats predict the ranking a team will have at the end of the season. Can I forecast a top 25 finish with just the quarterback’s rating or perhaps a few offensive statistics? Or, as is the case on the field, will I need the full team’s effort to accurately predict the outcome for the season?
This problem is one of interest and significance for a number of reasons. For one, it was an intellectual curiosity for me, simply to see whether I could predict a top 25 finish. There is additionally a market for this among gamblers - being able to model the likelihood of a team performing well could be a lucrative opportunity in Vegas. Finally, with all the upheaval of 2020 and the very real possibility that the 2020 football season will be a very unusual one, it was nice to see how predicted outcomes match actual outcomes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot
Data Source
The metrics used in this analysis are the result of significant screen scraping of espn.com using C#, with the data stored in my local MSSQL instance. I pulled statistics for 5 key players from all NCAA FBS division football teams across the 2018 and 2019 seasons, along with each team's conference and whether it finished in the AP Top 25 that year.
The player stats used were for the players with the most:
- passing yards
- receiving yards
- rushing yards
- tackles
- interceptions
This yielded a data set of approximately 130 teams (teams do occasionally come into or move out of the division) and 44 predictors with which to predict the season’s final outcome.
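For illustration, the short sketch below shows how such an extract might move from the local MSSQL instance into pandas for analysis. The connection string, table name, and query here are hypothetical placeholders, not the actual schema used by the scraper.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table names; the actual scrape was done in C#
# and stored to a local MSSQL instance as described above.
engine = create_engine("mssql+pyodbc://localhost/ncaaf?driver=ODBC+Driver+17+for+SQL+Server")
leaders2018 = pd.read_sql("SELECT * FROM team_leaders WHERE season = 2018", engine)
leaders2018.to_csv("2018.csv", index=False)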
Content
1. Clean data
   - Clean ESPN NCAAF Team Leader data
   - Group teams into respective conferences
2. Exploratory Data Analysis
   - Conference top 25 vs non-top 25 breakdown
   - Feature distribution and significance
   - Feature correlation
   - Exploratory data analysis summary
3. Create and train models
Clean data
1) Clean ESPN NCAAF Team Leader data
conferencedata = pd.read_csv('NCAAF Team Leaders_2018.csv')
rawdata2018 = pd.read_csv('2018.csv')
rawdata2018 = pd.concat([rawdata2018, conferencedata['conference']], axis=1)
display(rawdata2018.head())
# Chi-square test of independence between conference and a top 25 finish
csq=chi2_contingency(pd.crosstab(rawdata2018['AP_top_25'], rawdata2018['conference_categorical']))
print("Relationship between top 25 and conference P-value: ",csq[1])
#print(ncaaf['FumblesTouchdowns'].sum())
#ncaaf = ncaaf.drop(columns=['FumblesRecovered', 'FumblesTouchdowns', 'fumblesrecovered2', 'fumblestouchdowns2'])
# Keep only numeric predictors: drop team, player, espn id, and conference columns, plus the index column 'r'
cols = [c for c in rawdata2018.columns if c.lower()[:4] not in ('team', 'play', 'espn', 'conf') and c != 'r']
data2018 = rawdata2018[cols]
data2018 = data2018.rename(columns = {data2018.columns[43]: "y"})
display(data2018.head()) #all numerical
print(data2018.columns)
print(data2018['y'].describe())
print(data2018['QBRating'].describe())
| | team | Completions | Attempts | PassingYards | CompletionPercentage | AverageCompletion | LongestCompletion | QBTouchdowns | Interceptions | Sacks | ... | sacksyardslost2 | passesdefended2 | interceptions2 | interceptionyards2 | longestinterception2 | interceptiontouchdowns2 | fumblesforced2 | conference_categorical | AP_top_25 | conference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Air Force | 48 | 78 | 844 | 61.537998 | 10.821 | 69 | 4 | 3 | 5 | ... | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 9 | 0 | Mountain West Conference |
| 1 | Akron | 178 | 342 | 2329 | 52.047001 | 6.810 | 56 | 15 | 8 | 31 | ... | 0 | 3 | 4 | 149 | 147 | 2 | 1 | 8 | 0 | Mid-American Conference |
| 2 | Alabama | 245 | 355 | 3966 | 69.014000 | 11.172 | 81 | 43 | 6 | 13 | ... | 0 | 5 | 3 | 71 | 38 | 1 | 1 | 11 | 1 | Southeastern Conference |
| 3 | Appalachian State | 159 | 254 | 2039 | 62.598000 | 8.028 | 90 | 21 | 6 | 14 | ... | 0 | 5 | 4 | 113 | 64 | 1 | 0 | 12 | 0 | Sun Belt Conference |
| 4 | Arizona | 170 | 302 | 2530 | 56.291000 | 8.377 | 75 | 26 | 8 | 14 | ... | 0 | 3 | 3 | 63 | 62 | 1 | 0 | 10 | 0 | Pac-12 Conference |

5 rows × 47 columns
Relationship between top 25 and conference P-value: 0.061018655720283685
| | Completions | Attempts | PassingYards | CompletionPercentage | AverageCompletion | LongestCompletion | QBTouchdowns | Interceptions | Sacks | SackYardsLost | ... | totaltackles2 | sacks2 | sacksyardslost2 | passesdefended2 | interceptions2 | interceptionyards2 | longestinterception2 | interceptiontouchdowns2 | fumblesforced2 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 78 | 844 | 61.537998 | 10.821 | 69 | 4 | 3 | 5 | -32 | ... | 104 | 1 | 5 | 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| 1 | 178 | 342 | 2329 | 52.047001 | 6.810 | 56 | 15 | 8 | 31 | -199 | ... | 75 | 0 | 0 | 3 | 4 | 149 | 147 | 2 | 1 | 0 |
| 2 | 245 | 355 | 3966 | 69.014000 | 11.172 | 81 | 43 | 6 | 13 | -110 | ... | 60 | 0 | 0 | 5 | 3 | 71 | 38 | 1 | 1 | 1 |
| 3 | 159 | 254 | 2039 | 62.598000 | 8.028 | 90 | 21 | 6 | 14 | -78 | ... | 51 | 0 | 0 | 5 | 4 | 113 | 64 | 1 | 0 | 0 |
| 4 | 170 | 302 | 2530 | 56.291000 | 8.377 | 75 | 26 | 8 | 14 | -108 | ... | 38 | 0 | 0 | 3 | 3 | 63 | 62 | 1 | 0 | 0 |

5 rows × 44 columns
Index(['Completions', 'Attempts', 'PassingYards', 'CompletionPercentage',
'AverageCompletion', 'LongestCompletion', 'QBTouchdowns',
'Interceptions', 'Sacks', 'SackYardsLost', 'QBRating', 'Receptions',
'ReceivingYards', 'AverageReceivingYards', 'LongestReception',
'ReceivingTouchdowns', 'RushingAttempts', 'RushingYards',
'AverageRushingYards', 'LongestRush', 'RushingTouchdowns',
'SoloTackles', 'AssistedTackles', 'TotalTackles', 'Sacks.1',
'SacksYardsLost', 'PassesDefended', 'Interceptions.1',
'InterceptionYards', 'LongestInterception', 'InterceptionTouchdowns',
'FumblesForced', 'solotackles2', 'assistedtackles2', 'totaltackles2',
'sacks2', 'sacksyardslost2', 'passesdefended2', 'interceptions2',
'interceptionyards2', 'longestinterception2', 'interceptiontouchdowns2',
'fumblesforced2', 'y'],
dtype='object')
count 130.000000
mean 0.192308
std 0.395638
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: y, dtype: float64
count 130.000000
mean 136.015916
std 19.700496
min 76.365967
25% 122.536299
50% 136.583679
75% 147.605255
max 199.450623
Name: QBRating, dtype: float64
Exploratory Data Analysis
1) Conference top 25 vs non 25 breakdown
# Tally top 25 and non-top 25 teams within each conference
conf = pd.concat([data2018['y'], rawdata2018['conference']], axis=1).groupby('conference').sum()
conf['non25'] = rawdata2018.groupby('conference')['AP_top_25'].count() - conf['y']
conf = conf.rename(columns={"y": "top25"})
#display(conf)
conf.plot.bar(stacked=True)
The bar plot shows how each conference compares with the others in its number of top 25 teams. Of particular note, Clemson, which consistently makes the playoffs, belongs to a conference with only one other top 25 team.
2) Feature distribution and significance
#sns.distplot(data['CompletionPercentage'])
sns.distplot(data2018['QBRating'])
print(shapiro(data2018['QBRating']))
qqplot(data2018['QBRating'], line='s')
plt.show()
print(shapiro(data2018['AverageCompletion']))
qqplot(data2018['AverageCompletion'], line='s')
plt.show()
print(shapiro(data2018['QBTouchdowns']))
qqplot(data2018['QBTouchdowns'], line='s')
plt.show()
data2018.plot.scatter(x='CompletionPercentage', y='y');
data2018.plot.scatter(x='QBRating', y='CompletionPercentage');
data2018.plot.scatter(x='QBRating', y='PassingYards');
data2018.plot.scatter(x='QBRating', y='y');
The Shapiro-Wilk results below are for QBRating, AverageCompletion, and QBTouchdowns, in that order. For QBRating, the p-value (0.107) is slightly above the alpha threshold of .1, so we fail to reject the null hypothesis that the data are drawn from a Gaussian distribution. For AverageCompletion and QBTouchdowns, the p-values are well below .1, meaning those results would be unlikely under the null hypothesis, and normality is rejected for those two features.
(0.9831368327140808, 0.10715162754058838)
(0.9665281772613525, 0.002681687707081437)
(0.9542575478553772, 0.0002441542746964842)
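To make the three tests above easier to scan, the same check can be wrapped in a small helper that runs Shapiro-Wilk over any list of features and states the conclusion at alpha = 0.1. This is a minimal sketch assuming the data2018 frame constructed earlier; normality_report is a name introduced here for illustration.

from scipy.stats import shapiro

def normality_report(df, columns, alpha=0.1):
    # Shapiro-Wilk null hypothesis: the sample is drawn from a Gaussian distribution
    for col in columns:
        w, p = shapiro(df[col])
        verdict = "consistent with normality" if p > alpha else "normality rejected"
        print(f"{col}: W={w:.4f}, p={p:.4f} ({verdict})")

normality_report(data2018, ['QBRating', 'AverageCompletion', 'QBTouchdowns'])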
3) Feature correlation
corrmat2018 = data2018.corr()
#display(corrmat)
#average passing yards (46.4%) and qbrating (48.8%) are the most correlated with whether or not making it to top25
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(corrmat2018, vmax=.8, square=True);
Below is a heat map showing the correlation of the predictors with each other and with the target field. While much of the broad heat map is true red (indicating correlation near 0), there is a distinct pattern of squares indicating significant correlation between fields. These blocks correspond to a single player's statistics, which are highly correlated with one another. Additionally, the top left contains a larger square, where the quarterback's and primary receiver's statistics show high levels of correlation. Finally, there are two diagonal bands above and below the main diagonal in the bottom right of the chart. These are high correlations in the same stats between the two defensive stat leaders evaluated; in some cases, these could be the same player, leading to high correlation.
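The block structure can also be confirmed numerically by ranking predictor pairs by absolute correlation. A quick sketch, assuming the corrmat2018 matrix computed above:

# Keep only the upper triangle so each pair appears once, then rank by absolute correlation
mask = np.triu(np.ones(corrmat2018.shape, dtype=bool), k=1)
top_pairs = corrmat2018.where(mask).stack().abs().sort_values(ascending=False)
print(top_pairs.head(10))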
pairs2018 = corrmat2018['y'].abs().sort_values(ascending=False)
pairs2018 = pairs2018[pairs2018!=1]
print(pairs2018[0:5])
print(pairs2018)
Here we can see the five features with the highest correlation to a top 25 finish. This is evidence that having a very good passing game is important to a team's ranking.
QBTouchdowns 0.439608
QBRating 0.420781
PassingYards 0.408534
AverageCompletion 0.360998
Completions 0.344052
Name: y, dtype: float64
QBTouchdowns 0.439608
QBRating 0.420781
PassingYards 0.408534
AverageCompletion 0.360998
Completions 0.344052
RushingTouchdowns 0.325614
CompletionPercentage 0.315978
Attempts 0.297411
interceptions2 0.231004
RushingYards 0.203099
ReceivingTouchdowns 0.183028
PassesDefended 0.182274
Interceptions.1 0.175058
RushingAttempts 0.173930
passesdefended2 0.156667
AverageRushingYards 0.144478
interceptionyards2 0.140950
LongestCompletion 0.139173
LongestRush 0.123929
FumblesForced 0.117996
sacks2 0.117728
longestinterception2 0.117477
Interceptions 0.102940
LongestReception 0.099535
ReceivingYards 0.093496
solotackles2 0.087416
sacksyardslost2 0.075888
Sacks.1 0.075836
interceptiontouchdowns2 0.074948
totaltackles2 0.072841
Receptions 0.064869
InterceptionTouchdowns 0.058030
AssistedTackles 0.054361
SacksYardsLost 0.049912
assistedtackles2 0.042727
fumblesforced2 0.039301
LongestInterception 0.033558
TotalTackles 0.030793
AverageReceivingYards 0.023466
InterceptionYards 0.017590
SackYardsLost 0.005622
Sacks 0.003528
SoloTackles 0.001159
Name: y, dtype: float64
sns.set()
cols = pairs2018[0:5].index
sns.pairplot(data2018[cols], height = 2.5)
plt.show();
box = pd.concat([data2018['QBRating'], data2018['y']], axis=1)
box = box.rename(columns={"y": "top 25"})
box.loc[(box["top 25"] == 0),"top 25"]='non-top 25'
box.loc[(box["top 25"] == 1),"top 25"]='top 25'
fig = sns.catplot(x="top 25", y='QBRating', kind='box', data=box)
4) Exploratory data analysis summary
It quickly became evident that QBRating, itself an aggregated score based on other factors, is a nice summary field and quite predictive. In the box plot of QBRating by top 25 status, we see that the 25th percentile QB rating for a top 25 team is still higher than the 50th percentile for a non-top 25 team, showing the differentiating strength of this field.
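That quartile claim can be checked directly from the data; a short sketch, again assuming data2018 from above:

# Compare QBRating quartiles for non-top 25 (y=0) vs. top 25 (y=1) teams
quartiles = data2018.groupby('y')['QBRating'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)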
box = pd.concat([data2018['QBRating'], rawdata2018['conference']], axis=1)
display(box)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='QBRating', y="conference", data=box)
| | QBRating | conference |
|---|---|---|
| 0 | 161.665482 | Mountain West Conference |
| 1 | 119.046684 | Mid-American Conference |
| 2 | 199.450623 | Southeastern Conference |
| 3 | 152.592667 | Sun Belt Conference |
| 4 | 149.770386 | Pac-12 Conference |
| ... | ... | ... |
| 125 | 175.485870 | Big 12 Conference |
| 126 | 114.257156 | Conference USA |
| 127 | 146.653625 | Mid-American Conference |
| 128 | 132.480041 | Big Ten Conference |
| 129 | 96.965309 | Mountain West Conference |

130 rows × 2 columns
To understand the combined strength of QB rating and conference, I created a combined box plot. When the rating is compared across conferences, we see the wide variation in skill level within some conferences. The Big 12 and Big Ten, for example, span nearly 100 rating points, while the Pac-12 and Sun Belt conferences are quite compact.
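The spread observation can be quantified with a per-conference range; a small sketch reusing the box frame built above:

# Range (max minus min) of QB ratings within each conference
spread = box.groupby('conference')['QBRating'].agg(lambda s: s.max() - s.min())
print(spread.sort_values(ascending=False))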
rawdata2019 = pd.read_csv('2019.csv')
display(rawdata2019.head())
# Repeat the chi-square test of independence for the 2019 season
csq=chi2_contingency(pd.crosstab(rawdata2019['AP_top_25'], rawdata2019['conference_categorical']))
print("Relationship between top 25 and conference P-value: ",csq[1])
#print(ncaaf['FumblesTouchdowns'].sum())
#ncaaf = ncaaf.drop(columns=['FumblesRecovered', 'FumblesTouchdowns', 'fumblesrecovered2', 'fumblestouchdowns2'])
# Same column filter as for 2018: keep only the numeric predictors
cols = [c for c in rawdata2019.columns if c.lower()[:4] not in ('team', 'play', 'espn', 'conf') and c != 'r']
data2019 = rawdata2019[cols]
data2019 = data2019.rename(columns = {data2019.columns[43]: "y"})
display(data2019.head()) #all numerical
print(data2019.columns)
print(data2019['y'].describe())
print(data2019['QBRating'].describe())
| | team | Completions | Attempts | PassingYards | CompletionPercentage | AverageCompletion | LongestCompletion | QBTouchdowns | Interceptions | Sacks | ... | sacks2 | sacksyardslost2 | passesdefended2 | interceptions2 | interceptionyards2 | longestinterception2 | interceptiontouchdowns2 | fumblesforced2 | conference_categorical | AP_top_25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Air Force | 56 | 111 | 1316 | 50.450001 | 11.856 | 81 | 13 | 6 | 4 | ... | 0 | 0 | 4 | 3 | 99 | 92 | 1 | 0 | 9 | 1 |
| 1 | Akron | 150 | 279 | 1822 | 53.763000 | 6.530 | 87 | 11 | 6 | 43 | ... | 1 | 8 | 2 | 1 | 64 | 64 | 1 | 0 | 8 | 0 |
| 2 | Alabama | 180 | 252 | 2840 | 71.429001 | 11.270 | 85 | 33 | 3 | 10 | ... | 0 | 0 | 3 | 4 | 54 | 36 | 0 | 0 | 11 | 1 |
| 3 | Appalachian State | 225 | 359 | 2718 | 62.674000 | 7.571 | 73 | 28 | 6 | 18 | ... | 0 | 0 | 8 | 5 | 54 | 30 | 2 | 1 | 12 | 1 |
| 4 | Arizona | 160 | 266 | 1954 | 60.150002 | 7.346 | 75 | 14 | 11 | 19 | ... | 0 | 0 | 7 | 4 | 29 | 14 | 0 | 0 | 10 | 0 |

5 rows × 46 columns
Relationship between top 25 and conference P-value: 0.07510413466966305
| | Completions | Attempts | PassingYards | CompletionPercentage | AverageCompletion | LongestCompletion | QBTouchdowns | Interceptions | Sacks | SackYardsLost | ... | totaltackles2 | sacks2 | sacksyardslost2 | passesdefended2 | interceptions2 | interceptionyards2 | longestinterception2 | interceptiontouchdowns2 | fumblesforced2 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 111 | 1316 | 50.450001 | 11.856 | 81 | 13 | 6 | 4 | -22 | ... | 18 | 0 | 0 | 4 | 3 | 99 | 92 | 1 | 0 | 1 |
| 1 | 150 | 279 | 1822 | 53.763000 | 6.530 | 87 | 11 | 6 | 43 | -219 | ... | 138 | 1 | 8 | 2 | 1 | 64 | 64 | 1 | 0 | 0 |
| 2 | 180 | 252 | 2840 | 71.429001 | 11.270 | 85 | 33 | 3 | 10 | -63 | ... | 59 | 0 | 0 | 3 | 4 | 54 | 36 | 0 | 0 | 1 |
| 3 | 225 | 359 | 2718 | 62.674000 | 7.571 | 73 | 28 | 6 | 18 | -107 | ... | 45 | 0 | 0 | 8 | 5 | 54 | 30 | 2 | 1 | 1 |
| 4 | 160 | 266 | 1954 | 60.150002 | 7.346 | 75 | 14 | 11 | 19 | -120 | ... | 47 | 0 | 0 | 7 | 4 | 29 | 14 | 0 | 0 | 0 |

5 rows × 44 columns
Index(['Completions', 'Attempts', 'PassingYards', 'CompletionPercentage',
'AverageCompletion', 'LongestCompletion', 'QBTouchdowns',
'Interceptions', 'Sacks', 'SackYardsLost', 'QBRating', 'Receptions',
'ReceivingYards', 'AverageReceivingYards', 'LongestReception',
'ReceivingTouchdowns', 'RushingAttempts', 'RushingYards',
'AverageRushingYards', 'LongestRush', 'RushingTouchdowns',
'SoloTackles', 'AssistedTackles', 'TotalTackles', 'Sacks.1',
'SacksYardsLost', 'PassesDefended', 'Interceptions.1',
'InterceptionYards', 'LongestInterception', 'InterceptionTouchdowns',
'FumblesForced', 'solotackles2', 'assistedtackles2', 'totaltackles2',
'sacks2', 'sacksyardslost2', 'passesdefended2', 'interceptions2',
'interceptionyards2', 'longestinterception2', 'interceptiontouchdowns2',
'fumblesforced2', 'y'],
dtype='object')
count 131.000000
mean 0.190840
std 0.394471
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: y, dtype: float64
count 131.000000
mean 138.697629
std 21.322280
min 82.885071
25% 125.215012
50% 138.550446
75% 148.743362
max 206.931274
Name: QBRating, dtype: float64
Create and train models
I applied a variety of techniques to the data set to determine which was the best differentiator of the 0/1 target, inclusion in the AP Top 25. Modeling techniques attempted include:
- adaboost
- decision tree
- k-nearest neighbors
- logistic regression
- naive Bayes algorithm
- neural network
- random forest
- support vector machine
I set the data up such that the 2018 season was the training data set and the 2019 season served as the out-of-time validation. Predictive accuracy shown below is on the 2019 predictions, based on models built from the prior season. One point I would like to make clear: I am examining correlation, not implying causation, in this analysis. The statistics credited to any given player are not achieved by that player alone; a quarterback cannot put up large passing yardage numbers without an effective offensive line blocking or a wide receiver who can get open and make the play. As such, I am seeking to understand the ability of these statistics to predict the team's ranking, not suggesting that these metrics, and therefore the team outcomes, would be possible absent the remaining team members.
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate, train_test_split
from sklearn import preprocessing
np.random.seed(1)
x_train = data2018.iloc[:,:-1]
#x_data = preprocessing.normalize(x_data, norm='l2')
# Note: the two seasons are scaled independently here for simplicity; fitting a
# StandardScaler on the 2018 data and reusing it for 2019 would be the stricter approach.
x_train = preprocessing.scale(x_train)
y_train = data2018.iloc[:,-1]
x_test = data2019.iloc[:,:-1]
x_test = preprocessing.scale(x_test)
y_test = data2019.iloc[:,-1]
#x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2, shuffle = True)
neuralnet = MLPClassifier(hidden_layer_sizes=(4,2), learning_rate_init=0.01, max_iter=100000).fit(x_train, y_train)
print(confusion_matrix(y_test, neuralnet.predict(x_test)))
print("neural networks accuracy on test:",neuralnet.score(x_test, y_test))
from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train, y_train)
print("pre-tune svm accuracy on training:",svm.score(x_train, y_train))
print("pre-tune svm accuracy on test:",svm.score(x_test, y_test))
from sklearn.linear_model import LogisticRegression
logitmodel = LogisticRegression(penalty='elasticnet',solver='saga',l1_ratio = 0.5, max_iter=1000000).fit(x_train,y_train)
logitpred = logitmodel.predict(x_test)
print(logitpred)
logitacc = accuracy_score(y_test,logitpred)
print("Logistic Regression Accuracy:",logitacc)
confusion_matrix(y_test,logitpred)
from sklearn.neighbors import KNeighborsClassifier
knnmodel = KNeighborsClassifier().fit(x_train,y_train)
knnpred = knnmodel.predict(x_test)
knnacc = accuracy_score(y_test,knnpred)
print("KNN Accuracy:",knnacc)
confusion_matrix(y_test,knnpred)
from sklearn.naive_bayes import GaussianNB
nbmodel = GaussianNB(var_smoothing=10**(-3)).fit(x_train, y_train)
nbpred = nbmodel.predict(x_test)
nbacc = accuracy_score(y_test,nbpred)
print("Naive Bayes Accuracy:",nbacc)
confusion_matrix(y_test,nbpred)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
# Tune tree depth via 5-fold cross-validation on the 2018 data, scored by ROC AUC
dtmodeltuned = GridSearchCV(estimator=dt,
                            param_grid={'max_depth': np.arange(1,31)},
                            scoring='roc_auc',
                            cv=5)
dtmodeltuned.fit(x_train, y_train)
y_pred_dt_test = dtmodeltuned.predict(x_test)
print(dtmodeltuned.best_params_)
print("Decision Tree Accuracy:",accuracy_score(y_pred_dt_test, y_test))
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
#rfmodel = rf.fit(x_train, y_train)
rfmodeltuned = GridSearchCV(estimator=rf, param_grid={'max_depth':np.arange(1,31)},
cv=5,
scoring='roc_auc')
rfmodeltuned.fit(x_train, y_train)
#y_pred_rf = model_rf.predict(x_train)
#print(accuracy_score(y_pred_rf, y_train))
y_pred_rf_test = rfmodeltuned.predict(x_test)
print(rfmodeltuned.best_params_)
print("Random Forest Accuracy:",accuracy_score(y_pred_rf_test, y_test))
display(pd.DataFrame(rfmodeltuned.cv_results_).sort_values(by='rank_test_score',ascending=True).head(10))
#importances = rfmodeltuned.feature_importances_
#print(importances)
#indices = np.argsort(importances)[::-1]
#print("important features:",data.columns[indices])
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators=10000)
adaboost = adaboost.fit(x_train, y_train)
print("AdaBoost Accuracy:",adaboost.score(x_test, y_test))
#from sklearn.cluster import KMeans
#kmeans = KMeans(n_clusters=2, random_state=0).fit(x_data)
[[94 12]
[13 12]]
neural networks accuracy on test: 0.8091603053435115
pre-tune svm accuracy on training: 0.9461538461538461
pre-tune svm accuracy on test: 0.816793893129771
[0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0]
Logistic Regression Accuracy: 0.816793893129771
KNN Accuracy: 0.8549618320610687
Naive Bayes Accuracy: 0.7862595419847328
{'max_depth': 3}
Decision Tree Accuracy: 0.8091603053435115
{'max_depth': 8}
Random Forest Accuracy: 0.8320610687022901
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.335908 | 0.107007 | 0.036530 | 0.027093 | 8 | {'max_depth': 8} | 0.895238 | 0.733333 | 0.776190 | 0.704762 | 0.961905 | 0.814286 | 0.098331 | 1 |
| 11 | 0.190920 | 0.015072 | 0.013918 | 0.003101 | 12 | {'max_depth': 12} | 0.866667 | 0.761905 | 0.771429 | 0.714286 | 0.957143 | 0.814286 | 0.086871 | 1 |
| 26 | 0.213712 | 0.012762 | 0.015596 | 0.002735 | 27 | {'max_depth': 27} | 0.861905 | 0.709524 | 0.809524 | 0.757143 | 0.923810 | 0.812381 | 0.075509 | 3 |
| 18 | 0.184676 | 0.006516 | 0.013240 | 0.001120 | 19 | {'max_depth': 19} | 0.923810 | 0.752381 | 0.752381 | 0.657143 | 0.971429 | 0.811429 | 0.117479 | 4 |
| 4 | 0.565932 | 0.270572 | 0.077565 | 0.061481 | 5 | {'max_depth': 5} | 0.885714 | 0.714286 | 0.809524 | 0.695238 | 0.952381 | 0.811429 | 0.098312 | 5 |
| 10 | 0.188528 | 0.023570 | 0.013572 | 0.002284 | 11 | {'max_depth': 11} | 0.919048 | 0.757143 | 0.780952 | 0.652381 | 0.942857 | 0.810476 | 0.107724 | 6 |
| 22 | 0.204820 | 0.019892 | 0.013236 | 0.001621 | 23 | {'max_depth': 23} | 0.914286 | 0.761905 | 0.700000 | 0.695238 | 0.952381 | 0.804762 | 0.108254 | 7 |
| 28 | 0.212919 | 0.044503 | 0.015496 | 0.005727 | 29 | {'max_depth': 29} | 0.876190 | 0.738095 | 0.790476 | 0.652381 | 0.966667 | 0.804762 | 0.108797 | 8 |
| 12 | 0.219141 | 0.038183 | 0.015095 | 0.003886 | 13 | {'max_depth': 13} | 0.861905 | 0.761905 | 0.785714 | 0.652381 | 0.961905 | 0.804762 | 0.103323 | 8 |
| 20 | 0.197889 | 0.010379 | 0.013528 | 0.001187 | 21 | {'max_depth': 21} | 0.885714 | 0.771429 | 0.776190 | 0.633333 | 0.952381 | 0.803810 | 0.109229 | 10 |
AdaBoost Accuracy: 0.8396946564885496
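For a side-by-side view, the fitted models above can be scored in one loop. This is a convenience sketch reusing the model objects and the 2019 test arrays already defined; the grid-search objects expose the same predict interface as the plain estimators.

# Collect out-of-time (2019) accuracy for every fitted model in one table
fitted = {
    'neural network': neuralnet,
    'svm': svm,
    'logistic regression': logitmodel,
    'k-nearest neighbors': knnmodel,
    'naive Bayes': nbmodel,
    'decision tree': dtmodeltuned,
    'random forest': rfmodeltuned,
    'adaboost': adaboost,
}
summary = pd.Series({name: accuracy_score(y_test, m.predict(x_test)) for name, m in fitted.items()})
print(summary.sort_values(ascending=False))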
Evaluation and Final Results
When the model of the 2018 season is used to predict the top 25 finishers of 2019, we find that the various techniques tested achieved roughly 79-86% accuracy on the test sample. For context, because only about 19% of teams finish in the top 25, a naive model that always predicts "not top 25" would already score about 81%, so these results are reasonable but not particularly strong. This shows that the individual stats are insufficient to generate the predictive power I would like.
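That baseline is a one-line check, assuming the y_test series from above:

# Accuracy of always predicting "not top 25" on the 2019 season
print("majority-class baseline:", 1 - y_test.mean())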
While the existing data is relatively unbiased (it contains all teams from the 2018 season, though I have not tested whether the season itself could be biased in a meaningful way), it is unclear if the data itself is sufficient to generate the results I desire. If I were to further improve the model, there are a number of things I could try to build a more robust training sample:
- Increase the sample size: increasing the number of seasons used in training would likely result in a more robust model.
- Increase the breadth of data: including other team-level statistics, such as strength of schedule, win-loss records, and perhaps information on the coaching staff, could strengthen the model.
- Increase the depth of data: while I currently have statistics on 5 players per team, pulling stats on the rest of the roster would likely give the model additional leverage.
In conclusion, I was able to construct several reasonable but not incredibly powerful predictors of the top 25 finishers for a football season. Most techniques performed similarly, with k-nearest neighbors posting the strongest out-of-time validation at 85.5% accuracy. Several data expansion recommendations would likely improve the strength of the model and could be considered for future testing.