American Football Analytics

Problem Statement

American football is well-known to be a team sport, requiring the skills of a wide variety of players and careful coordination among all teammates. The ultimate goal of the team is to win as many games as possible, ideally becoming the top team in all of college football. But how do the stats associated with each player contribute to the odds of reaching the top 25 by the end of the year?

In this project, I dive into how different players’ stats predict the ranking a team will have at the end of the season. Can I forecast a top 25 finish with just the quarterback’s rating or perhaps a few offensive statistics? Or, as is the case on the field, will I need the full team’s effort to accurately predict the outcome for the season?

This problem is interesting and significant for a number of reasons. For one, it was an intellectual curiosity for me: could I predict a top 25 finish at all? There is additionally a market for this among gamblers - being able to model the likelihood of a team performing well could be a lucrative opportunity in Vegas. Finally, with all the upheaval of 2020 and the very real possibility that the 2020 football season will be a very unusual one, it was nice to see how predicted outcomes match actual outcomes.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from scipy.stats import shapiro
from statsmodels.graphics.gofplots import qqplot

Data Source

The metrics used in this analysis are the result of significant screen scraping from espn.com using C#, with the results stored in my local MSSQL instance. I pulled statistics for 5 key players from every NCAA FBS division football team across the 2018 and 2019 seasons, along with each team's conference and whether it finished in the AP Top 25 that year.

The player stats used were for the players with the most:

  • passing yards
  • receiving yards
  • rushing yards
  • tackles
  • interceptions

This yielded a data set of approximately 130 teams (teams do occasionally come into or move out of the division) and 44 predictors with which to predict the season’s final outcome.
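As a rough sketch of the retrieval step (the connection string and table name below are hypothetical stand-ins for my local setup, not the actual schema):

import pyodbc

# Hypothetical: pull the scraped 2018 stats from the local MSSQL instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ncaaf;Trusted_Connection=yes;"
)
leaders2018 = pd.read_sql("SELECT * FROM TeamLeaders WHERE season = 2018", conn)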

Content

1. Clean data

    1. Clean ESPN NCAAF Team Leader data
    2. Group teams into respective conferences

2. Exploratory Data Analysis

    1. Conference top 25 vs non 25 breakdown
    2. Feature distribution and significance
    3. Feature correlation
    4. Exploratory data analysis summary

3. Create and train models


Clean data

1) Clean ESPN NCAAF Team Leader data

# Load the 2018 scraped stats and attach each team's conference
conferencedata = pd.read_csv('NCAAF Team Leaders_2018.csv')
rawdata2018 = pd.read_csv('2018.csv')
rawdata2018 = pd.concat([rawdata2018, conferencedata['conference']], axis=1)
display(rawdata2018.head())

# Chi-square test of independence between conference and a top 25 finish
csq = chi2_contingency(pd.crosstab(rawdata2018['AP_top_25'], rawdata2018['conference_categorical']))
print("Relationship between top 25 and conference P-value: ", csq[1])

# Drop identifier columns (team, player, espn id, conference) to leave only numeric predictors
cols = [c for c in rawdata2018.columns if c.lower()[:4] != 'team' and c.lower()[:4] != 'play' and c.lower()[:4] != 'espn' and c.lower()[:4] != 'conf' and c != 'r']
data2018 = rawdata2018[cols]
data2018 = data2018.rename(columns = {data2018.columns[43]: "y"})  # AP_top_25 becomes the target y
display(data2018.head()) #all numerical
print(data2018.columns)
print(data2018['y'].describe())
print(data2018['QBRating'].describe())

team Completions Attempts PassingYards CompletionPercentage AverageCompletion LongestCompletion QBTouchdowns Interceptions Sacks ... sacksyardslost2 passesdefended2 interceptions2 interceptionyards2 longestinterception2 interceptiontouchdowns2 fumblesforced2 conference_categorical AP_top_25 conference
0 Air Force 48 78 844 61.537998 10.821 69 4 3 5 ... 5 1 3 0 0 0 0 9 0 Mountain West Conference
1 Akron 178 342 2329 52.047001 6.810 56 15 8 31 ... 0 3 4 149 147 2 1 8 0 Mid-American Conference
2 Alabama 245 355 3966 69.014000 11.172 81 43 6 13 ... 0 5 3 71 38 1 1 11 1 Southeastern Conference
3 Appalachian State 159 254 2039 62.598000 8.028 90 21 6 14 ... 0 5 4 113 64 1 0 12 0 Sun Belt Conference
4 Arizona 170 302 2530 56.291000 8.377 75 26 8 14 ... 0 3 3 63 62 1 0 10 0 Pac-12 Conference

5 rows × 47 columns

Relationship between top 25 and conference P-value:  0.061018655720283685

At the alpha threshold of .1 used throughout this analysis, this p-value indicates a statistically significant association between a team's conference and a top 25 finish.

Completions Attempts PassingYards CompletionPercentage AverageCompletion LongestCompletion QBTouchdowns Interceptions Sacks SackYardsLost ... totaltackles2 sacks2 sacksyardslost2 passesdefended2 interceptions2 interceptionyards2 longestinterception2 interceptiontouchdowns2 fumblesforced2 y
0 48 78 844 61.537998 10.821 69 4 3 5 -32 ... 104 1 5 1 3 0 0 0 0 0
1 178 342 2329 52.047001 6.810 56 15 8 31 -199 ... 75 0 0 3 4 149 147 2 1 0
2 245 355 3966 69.014000 11.172 81 43 6 13 -110 ... 60 0 0 5 3 71 38 1 1 1
3 159 254 2039 62.598000 8.028 90 21 6 14 -78 ... 51 0 0 5 4 113 64 1 0 0
4 170 302 2530 56.291000 8.377 75 26 8 14 -108 ... 38 0 0 3 3 63 62 1 0 0

5 rows × 44 columns

Index(['Completions', 'Attempts', 'PassingYards', 'CompletionPercentage',
       'AverageCompletion', 'LongestCompletion', 'QBTouchdowns',
       'Interceptions', 'Sacks', 'SackYardsLost', 'QBRating', 'Receptions',
       'ReceivingYards', 'AverageReceivingYards', 'LongestReception',
       'ReceivingTouchdowns', 'RushingAttempts', 'RushingYards',
       'AverageRushingYards', 'LongestRush', 'RushingTouchdowns',
       'SoloTackles', 'AssistedTackles', 'TotalTackles', 'Sacks.1',
       'SacksYardsLost', 'PassesDefended', 'Interceptions.1',
       'InterceptionYards', 'LongestInterception', 'InterceptionTouchdowns',
       'FumblesForced', 'solotackles2', 'assistedtackles2', 'totaltackles2',
       'sacks2', 'sacksyardslost2', 'passesdefended2', 'interceptions2',
       'interceptionyards2', 'longestinterception2', 'interceptiontouchdowns2',
       'fumblesforced2', 'y'],
      dtype='object')
count    130.000000
mean       0.192308
std        0.395638
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: y, dtype: float64
count    130.000000
mean     136.015916
std       19.700496
min       76.365967
25%      122.536299
50%      136.583679
75%      147.605255
max      199.450623
Name: QBRating, dtype: float64

Exploratory Data Analysis

1) Conference top 25 vs non 25 breakdown

# Count top 25 and non-top 25 teams per conference
conf = pd.concat([data2018['y'], rawdata2018['conference']], axis=1).groupby('conference').sum()
conf['non25'] = rawdata2018.groupby('conference')['AP_top_25'].count() - conf['y']
conf = conf.rename(columns={"y": "top25"})
conf.plot.bar(stacked=True)

The bar plot below shows how each conference compares to the others in its number of top 25 teams. Of particular note, Clemson, which consistently makes the playoffs, belongs to a conference with only one other top 25 team.


[Figure: stacked bar chart of top 25 vs non-top 25 teams by conference]
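To put numbers behind the bars, the conferences can also be ranked by their top 25 rate (a quick follow-up of mine, not in the original notebook):

# Share of each conference's teams that finished in the top 25
conf['top25_rate'] = conf['top25'] / (conf['top25'] + conf['non25'])
print(conf.sort_values('top25_rate', ascending=False))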

2) Feature distribution and significance

# Distribution of QBRating, plus Shapiro-Wilk normality tests and QQ plots
sns.distplot(data2018['QBRating'])
print(shapiro(data2018['QBRating']))
qqplot(data2018['QBRating'], line='s')
plt.show()
print(shapiro(data2018['AverageCompletion']))
qqplot(data2018['AverageCompletion'], line='s')
plt.show()
print(shapiro(data2018['QBTouchdowns']))
qqplot(data2018['QBTouchdowns'], line='s')
plt.show()
data2018.plot.scatter(x='CompletionPercentage', y='y');
data2018.plot.scatter(x='QBRating', y='CompletionPercentage');
data2018.plot.scatter(x='QBRating', y='PassingYards');
data2018.plot.scatter(x='QBRating', y='y');

The Shapiro-Wilk W-statistics for QBRating, AverageCompletion, and QBTouchdowns are all high, but the p-values lead to different conclusions. For QBRating, the p-value (0.107) is just above the alpha threshold of .1, so the null hypothesis that the data is drawn from a Gaussian distribution cannot be rejected. For AverageCompletion and QBTouchdowns, the p-values are well below .1, so for those features normality is rejected.

(0.9831368327140808, 0.10715162754058838)

[Figure: distribution plot of QBRating]

[Figure: QQ plot of QBRating]

(0.9665281772613525, 0.002681687707081437)

[Figure: QQ plot of AverageCompletion]

(0.9542575478553772, 0.0002441542746964842)

[Figure: QQ plot of QBTouchdowns]
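To make the accept/reject decisions above explicit, here is a small helper of mine (not in the original notebook), using the same alpha of 0.1:

# Apply the Shapiro-Wilk decision rule to each tested feature
for col in ['QBRating', 'AverageCompletion', 'QBTouchdowns']:
    stat, p = shapiro(data2018[col])
    verdict = 'cannot reject normality' if p > 0.1 else 'reject normality'
    print(f"{col}: W={stat:.3f}, p={p:.4f} -> {verdict}")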

3) Feature correlation

[Figure: scatter plot of CompletionPercentage vs y]

[Figure: scatter plot of QBRating vs CompletionPercentage]

[Figure: scatter plot of QBRating vs PassingYards]

[Figure: scatter plot of QBRating vs y]

corrmat2018 = data2018.corr()
#display(corrmat2018)
# QBTouchdowns (44.0%) and QBRating (42.1%) are the most correlated with a top 25 finish
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(corrmat2018, vmax=.8, square=True);

Below is a heat map showing the correlation of the predictors with one another and with the target field. While much of the map is true red (indicating correlation near 0), there is a distinct pattern of squares indicating significant correlation between fields. These blocks generally correspond to a single player's statistics, which are highly correlated with each other. Additionally, the top left contains a larger square, where the quarterback's and primary receiver's statistics show high levels of correlation. Finally, there are two diagonal patterns above and below the main diagonal in the bottom right of the chart. These are high correlations in the same stats between the two defensive stat leaders evaluated; in some cases these could be the same player, leading to high correlation.

[Figure: correlation heat map of predictors and target]
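The block structure can be verified numerically by listing the strongest pairwise correlations (a sketch of mine, not in the original notebook):

# Keep only the upper triangle so each feature pair is counted once
mask = np.triu(np.ones(corrmat2018.shape, dtype=bool), k=1)
top_pairs = corrmat2018.where(mask).stack().abs().sort_values(ascending=False)
print(top_pairs.head(10))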

pairs2018 = corrmat2018['y'].abs().sort_values(ascending=False)
pairs2018 = pairs2018[pairs2018!=1]
print(pairs2018[0:5])
print(pairs2018)

Here we can see the top five features with the highest correlation to a top 25 finish. This is evidence that having a very good passing game is important to a team's ranking.

QBTouchdowns         0.439608
QBRating             0.420781
PassingYards         0.408534
AverageCompletion    0.360998
Completions          0.344052

Name: y, dtype: float64
QBTouchdowns               0.439608
QBRating                   0.420781
PassingYards               0.408534
AverageCompletion          0.360998
Completions                0.344052
RushingTouchdowns          0.325614
CompletionPercentage       0.315978
Attempts                   0.297411
interceptions2             0.231004
RushingYards               0.203099
ReceivingTouchdowns        0.183028
PassesDefended             0.182274
Interceptions.1            0.175058
RushingAttempts            0.173930
passesdefended2            0.156667
AverageRushingYards        0.144478
interceptionyards2         0.140950
LongestCompletion          0.139173
LongestRush                0.123929
FumblesForced              0.117996
sacks2                     0.117728
longestinterception2       0.117477
Interceptions              0.102940
LongestReception           0.099535
ReceivingYards             0.093496
solotackles2               0.087416
sacksyardslost2            0.075888
Sacks.1                    0.075836
interceptiontouchdowns2    0.074948
totaltackles2              0.072841
Receptions                 0.064869
InterceptionTouchdowns     0.058030
AssistedTackles            0.054361
SacksYardsLost             0.049912
assistedtackles2           0.042727
fumblesforced2             0.039301
LongestInterception        0.033558
TotalTackles               0.030793
AverageReceivingYards      0.023466
InterceptionYards          0.017590
SackYardsLost              0.005622
Sacks                      0.003528
SoloTackles                0.001159
Name: y, dtype: float64
# Pair plot of the five features most correlated with the target
sns.set()
cols = pairs2018[0:5].index
sns.pairplot(data2018[cols], height = 2.5)
plt.show();

[Figure: pair plot of the five features most correlated with the target]

box = pd.concat([data2018['QBRating'], data2018['y']], axis=1)
box = box.rename(columns={"y": "top 25"})
box.loc[(box["top 25"] == 0),"top 25"]='non-top 25'
box.loc[(box["top 25"] == 1),"top 25"]='top 25'
fig = sns.catplot(x="top 25", y='QBRating', kind='box', data=box)

4) Exploratory data analysis summary

It quickly became evident that QBRating, itself an aggregated score based on other factors, is a nice summary field and quite predictive. In the box plot below, we see that the 25th percentile QB rating for a top 25 team was still higher than the 50th percentile for a non-top 25 team, showing the differentiation strength of this field.

[Figure: box plot of QBRating for top 25 vs non-top 25 teams]
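That claim is easy to check directly (a quick verification of mine, not in the original notebook):

# 25th percentile of QBRating among top 25 teams vs the median among the rest
print(data2018.loc[data2018['y'] == 1, 'QBRating'].quantile(0.25))
print(data2018.loc[data2018['y'] == 0, 'QBRating'].quantile(0.50))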

box = pd.concat([data2018['QBRating'], rawdata2018['conference']], axis=1)
display(box)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='QBRating', y="conference", data=box)

QBRating conference
0 161.665482 Mountain West Conference
1 119.046684 Mid-American Conference
2 199.450623 Southeastern Conference
3 152.592667 Sun Belt Conference
4 149.770386 Pac-12 Conference
... ... ...
125 175.485870 Big 12 Conference
126 114.257156 Conference USA
127 146.653625 Mid-American Conference
128 132.480041 Big Ten Conference
129 96.965309 Mountain West Conference

130 rows × 2 columns

To understand the combined strength of the QB rating and conference, I plotted QBRating as a box plot per conference. It shows the wide variation in skill level within some conferences: the Big 12 and Big Ten, for example, have nearly 100-point ranges in the rating, while the Pac-12 and Sun Belt Conferences are quite compact.

[Figure: box plot of QBRating by conference]
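The per-conference spread can also be tabulated (a sketch of mine, not in the original notebook):

# Range of leading-QB ratings within each conference
spread = box.groupby('conference')['QBRating'].agg(['min', 'max'])
spread['range'] = spread['max'] - spread['min']
print(spread.sort_values('range', ascending=False))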

rawdata2019 = pd.read_csv('2019.csv')
display(rawdata2019.head())

csq=chi2_contingency(pd.crosstab(rawdata2019['AP_top_25'], rawdata2019['conference_categorical']))
print("Relationship between top 25 and conference P-value: ",csq[1])

# Repeat the 2018 cleaning steps for the 2019 season
cols = [c for c in rawdata2019.columns if c.lower()[:4] != 'team' and c.lower()[:4] != 'play' and c.lower()[:4] != 'espn' and c.lower()[:4] != 'conf' and c != 'r']
data2019 = rawdata2019[cols]
data2019 = data2019.rename(columns = {data2019.columns[43]: "y"})  # AP_top_25 becomes the target y
display(data2019.head()) #all numerical
print(data2019.columns)
print(data2019['y'].describe())
print(data2019['QBRating'].describe())

team Completions Attempts PassingYards CompletionPercentage AverageCompletion LongestCompletion QBTouchdowns Interceptions Sacks ... sacks2 sacksyardslost2 passesdefended2 interceptions2 interceptionyards2 longestinterception2 interceptiontouchdowns2 fumblesforced2 conference_categorical AP_top_25
0 Air Force 56 111 1316 50.450001 11.856 81 13 6 4 ... 0 0 4 3 99 92 1 0 9 1
1 Akron 150 279 1822 53.763000 6.530 87 11 6 43 ... 1 8 2 1 64 64 1 0 8 0
2 Alabama 180 252 2840 71.429001 11.270 85 33 3 10 ... 0 0 3 4 54 36 0 0 11 1
3 Appalachian State 225 359 2718 62.674000 7.571 73 28 6 18 ... 0 0 8 5 54 30 2 1 12 1
4 Arizona 160 266 1954 60.150002 7.346 75 14 11 19 ... 0 0 7 4 29 14 0 0 10 0

5 rows × 46 columns

Relationship between top 25 and conference P-value:  0.07510413466966305

As in 2018, the p-value falls below the alpha threshold of .1, so the association between conference and a top 25 finish is again statistically significant.

Completions Attempts PassingYards CompletionPercentage AverageCompletion LongestCompletion QBTouchdowns Interceptions Sacks SackYardsLost ... totaltackles2 sacks2 sacksyardslost2 passesdefended2 interceptions2 interceptionyards2 longestinterception2 interceptiontouchdowns2 fumblesforced2 y
0 56 111 1316 50.450001 11.856 81 13 6 4 -22 ... 18 0 0 4 3 99 92 1 0 1
1 150 279 1822 53.763000 6.530 87 11 6 43 -219 ... 138 1 8 2 1 64 64 1 0 0
2 180 252 2840 71.429001 11.270 85 33 3 10 -63 ... 59 0 0 3 4 54 36 0 0 1
3 225 359 2718 62.674000 7.571 73 28 6 18 -107 ... 45 0 0 8 5 54 30 2 1 1
4 160 266 1954 60.150002 7.346 75 14 11 19 -120 ... 47 0 0 7 4 29 14 0 0 0

5 rows × 44 columns

Index(['Completions', 'Attempts', 'PassingYards', 'CompletionPercentage',
       'AverageCompletion', 'LongestCompletion', 'QBTouchdowns',
       'Interceptions', 'Sacks', 'SackYardsLost', 'QBRating', 'Receptions',
       'ReceivingYards', 'AverageReceivingYards', 'LongestReception',
       'ReceivingTouchdowns', 'RushingAttempts', 'RushingYards',
       'AverageRushingYards', 'LongestRush', 'RushingTouchdowns',
       'SoloTackles', 'AssistedTackles', 'TotalTackles', 'Sacks.1',
       'SacksYardsLost', 'PassesDefended', 'Interceptions.1',
       'InterceptionYards', 'LongestInterception', 'InterceptionTouchdowns',
       'FumblesForced', 'solotackles2', 'assistedtackles2', 'totaltackles2',
       'sacks2', 'sacksyardslost2', 'passesdefended2', 'interceptions2',
       'interceptionyards2', 'longestinterception2', 'interceptiontouchdowns2',
       'fumblesforced2', 'y'],
      dtype='object')
count    131.000000
mean       0.190840
std        0.394471
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: y, dtype: float64
count    131.000000
mean     138.697629
std       21.322280
min       82.885071
25%      125.215012
50%      138.550446
75%      148.743362
max      206.931274
Name: QBRating, dtype: float64

Create and train models

I applied a variety of techniques to the data set to determine which was the best differentiator of the 0/1 target, inclusion in the AP Top 25. Modeling techniques attempted include:

  • adaboost
  • decision tree
  • k-nearest neighbors
  • logistic regression
  • naive Bayes algorithm
  • neural network
  • random forest
  • support vector machine

I set the data up such that the 2018 season was the training set and 2019 was the out-of-time validation set. Predictive accuracy shown below is on the 2019 predictions, based on models trained on the prior season. One point that I would like to make clear - I am examining correlation, not implying causation, in this analysis. The statistics credited to any given player are not achieved by that player alone; a quarterback cannot put up large passing yardage numbers without an effective offensive line blocking or a wide receiver who can get open and make the play. As such, I am seeking to understand the ability of these statistics to predict the team's ranking, not suggesting that these metrics, or the team outcomes, would be possible without the rest of the roster.

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate, train_test_split
from sklearn import preprocessing

np.random.seed(1)

# 2018 is the training set; 2019 is the out-of-time validation set.
# Note: preprocessing.scale standardizes each season independently; fitting a
# StandardScaler on 2018 and reusing it on 2019 would avoid using any 2019
# statistics at transform time.
x_train = data2018.iloc[:,:-1]
x_train = preprocessing.scale(x_train)
y_train = data2018.iloc[:,-1]

x_test = data2019.iloc[:,:-1]
x_test = preprocessing.scale(x_test)
y_test = data2019.iloc[:,-1]

#x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2, shuffle = True)

neuralnet = MLPClassifier(hidden_layer_sizes=(4,2), learning_rate_init=0.01, max_iter=100000).fit(x_train, y_train)
print(confusion_matrix(y_test, neuralnet.predict(x_test)))
print("neural networks accuracy on test:",neuralnet.score(x_test, y_test))

from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train, y_train)
print("pre-tune svm accuracy on training:",svm.score(x_train, y_train))
print("pre-tune svm accuracy on test:",svm.score(x_test, y_test))

from sklearn.linear_model import LogisticRegression

logitmodel = LogisticRegression(penalty='elasticnet',solver='saga',l1_ratio = 0.5, max_iter=1000000).fit(x_train,y_train)
logitpred = logitmodel.predict(x_test)
print(logitpred)
logitacc = accuracy_score(y_test,logitpred)
print("Logistic Regression Accuracy:",logitacc)
confusion_matrix(y_test,logitpred)

from sklearn.neighbors import KNeighborsClassifier

knnmodel = KNeighborsClassifier().fit(x_train,y_train)
knnpred = knnmodel.predict(x_test)
knnacc = accuracy_score(y_test,knnpred)
print("KNN Accuracy:",knnacc)
confusion_matrix(y_test,knnpred)

from sklearn.naive_bayes import GaussianNB

nbmodel = GaussianNB(var_smoothing=10**(-3)).fit(x_train, y_train)
nbpred = nbmodel.predict(x_test)
nbacc = accuracy_score(y_test,nbpred)
print("Naive Bayes Accuracy:",nbacc)
confusion_matrix(y_test,nbpred)

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dtmodeltuned = GridSearchCV(estimator=dt,
                     param_grid={'max_depth': np.arange(1,31)},
                     scoring='roc_auc',
                     cv=5)
dtmodeltuned.fit(x_train, y_train)
y_pred_dt_test = dtmodeltuned.predict(x_test)
print(dtmodeltuned.best_params_)
print("Decision Tree Accuracy:",accuracy_score(y_pred_dt_test, y_test))

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
#rfmodel = rf.fit(x_train, y_train)

rfmodeltuned = GridSearchCV(estimator=rf, param_grid={'max_depth':np.arange(1,31)},
                     cv=5,
                     scoring='roc_auc')
rfmodeltuned.fit(x_train, y_train)

#y_pred_rf = model_rf.predict(x_train)
#print(accuracy_score(y_pred_rf, y_train))
y_pred_rf_test = rfmodeltuned.predict(x_test)
print(rfmodeltuned.best_params_)
print("Random Forest Accuracy:",accuracy_score(y_pred_rf_test, y_test))
display(pd.DataFrame(rfmodeltuned.cv_results_).sort_values(by='rank_test_score',ascending=True).head(10))
#importances = rfmodeltuned.best_estimator_.feature_importances_
#print(importances)
#indices = np.argsort(importances)[::-1]
#print("important features:", data2018.columns[:-1][indices])

from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators=10000)
adaboost = adaboost.fit(x_train, y_train)
print("AdaBoost Accuracy:",adaboost.score(x_test, y_test))

[[94 12]
 [13 12]]
neural networks accuracy on test: 0.8091603053435115
pre-tune svm accuracy on training: 0.9461538461538461
pre-tune svm accuracy on test: 0.816793893129771
[0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0]
Logistic Regression Accuracy: 0.816793893129771
KNN Accuracy: 0.8549618320610687
Naive Bayes Accuracy: 0.7862595419847328
{'max_depth': 3}
Decision Tree Accuracy: 0.8091603053435115
{'max_depth': 8}
Random Forest Accuracy: 0.8320610687022901

mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
7 0.335908 0.107007 0.036530 0.027093 8 {'max_depth': 8} 0.895238 0.733333 0.776190 0.704762 0.961905 0.814286 0.098331 1
11 0.190920 0.015072 0.013918 0.003101 12 {'max_depth': 12} 0.866667 0.761905 0.771429 0.714286 0.957143 0.814286 0.086871 1
26 0.213712 0.012762 0.015596 0.002735 27 {'max_depth': 27} 0.861905 0.709524 0.809524 0.757143 0.923810 0.812381 0.075509 3
18 0.184676 0.006516 0.013240 0.001120 19 {'max_depth': 19} 0.923810 0.752381 0.752381 0.657143 0.971429 0.811429 0.117479 4
4 0.565932 0.270572 0.077565 0.061481 5 {'max_depth': 5} 0.885714 0.714286 0.809524 0.695238 0.952381 0.811429 0.098312 5
10 0.188528 0.023570 0.013572 0.002284 11 {'max_depth': 11} 0.919048 0.757143 0.780952 0.652381 0.942857 0.810476 0.107724 6
22 0.204820 0.019892 0.013236 0.001621 23 {'max_depth': 23} 0.914286 0.761905 0.700000 0.695238 0.952381 0.804762 0.108254 7
28 0.212919 0.044503 0.015496 0.005727 29 {'max_depth': 29} 0.876190 0.738095 0.790476 0.652381 0.966667 0.804762 0.108797 8
12 0.219141 0.038183 0.015095 0.003886 13 {'max_depth': 13} 0.861905 0.761905 0.785714 0.652381 0.961905 0.804762 0.103323 8
20 0.197889 0.010379 0.013528 0.001187 21 {'max_depth': 21} 0.885714 0.771429 0.776190 0.633333 0.952381 0.803810 0.109229 10
AdaBoost Accuracy: 0.8396946564885496

Evaluation and Final Results

When the models trained on the 2018 season are used to predict the top 25 finishers of 2019, the various techniques tested achieve roughly 79-86% accuracy on the test sample. These results are reasonable, but not particularly strong; since only about 19% of teams finish in the top 25, always predicting a non-top 25 finish would already score about 81%. This shows that the individual stats are insufficient to generate the predictive power I would like.
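Collecting the test accuracies reported above into a single table (my summary of the outputs above, not part of the original notebook):

# Test accuracies copied from the model outputs above
results = pd.DataFrame({
    'model': ['KNN', 'AdaBoost', 'random forest', 'SVM', 'logistic regression',
              'neural network', 'decision tree', 'naive Bayes'],
    'accuracy': [0.855, 0.840, 0.832, 0.817, 0.817, 0.809, 0.809, 0.786],
})
print(results)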


While the existing data is relatively unbiased (it contains all teams from the 2018 season, though I have not tested whether the season itself could be biased in a meaningful way), it is unclear if the data alone is sufficient to generate the results I desire. If I were to further improve the model, there are a number of things I could try to build a more robust training sample:

  • Increase sample size: It’s likely that increasing the number of seasons used in the training would result in a more robust model being created.

  • Increase the breadth of data: By including other team-level statistics, I could build a stronger model. This might include strength of schedule statistics, win-loss records, and perhaps some information on coaching staff, for example.

  • Increase the depth of data: While I currently have statistics on 5 players within each team, I could certainly pull some stats on the rest of the team and likely create additional leverage for the model.

In conclusion, I was able to construct several reasonable but not especially powerful predictors of the top 25 finishers for a football season. Most techniques performed similarly, with k-nearest neighbors having the strongest out-of-time validation at 85.5% accuracy. Several data expansion recommendations would likely improve the strength of the model and could be considered for future testing.
