COGS 118A Final - NCAA Basketball Game Predictor

Abstract

In this project we attempted to create a prediction model capable of producing an accurate March Madness bracket. To this end, we sought to predict the outcome of individual matches rather than whole brackets, since brackets are built from individual matches. We tested models using both Gradient Boosting Decision Tree and Logistic Regression classifiers trained on NCAA regular season data from 2003 to 2021. The models were trained and tested on the regular season data, and accuracy was measured as the percentage of correctly classified wins in the test set. Logistic regression had the highest accuracy in preliminary testing, so we used it for our final model. Our final model predicts match outcomes with high accuracy on the data we used, but we are unsure how well it would generalize to predicting entire March Madness brackets, since the vast majority of our training data comes from regular season games.

Background

The March Madness Division I basketball tournament is an extremely popular sporting event that acts as the playoffs for collegiate men's basketball. The format of the tournament is a single elimination, 68 team bracket played over the course of 3 weekends through March and April. The tournament began in 1939, initially hosting only 8 teams. As the Division I tournament grew in popularity, the field kept expanding, reaching 64 teams in 1985 and its current 68 team layout in 2011, and the odds of predicting a "perfect bracket" (a bracket that correctly predicts the winner of every game) became absurdly small [1]. Despite the 1 in 9.2 quintillion odds, every year millions of people take a shot at filling out their own bracket, and it has become a fun tradition for March Madness enthusiasts and inexperienced basketball fans alike.
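
The 9.2 quintillion figure follows from treating every matchup as a coin flip: the 64-team main bracket has 63 games, and guessing each one at random gives odds of 1 in 2^63. A quick back-of-the-envelope check (a hypothetical snippet, not part of our analysis code):

# 63 games in a 64-team single-elimination bracket, each picked at random
odds = 2 ** 63
print(f"1 in {odds:,}")  # 1 in 9,223,372,036,854,775,808, i.e. about 9.2 quintillion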

Of course, many have also taken a shot at creating algorithms to maximize their chances of predicting a perfect bracket. The most important aspects of such predictors seem to be the methodology used to estimate win probabilities, as well as the statistics chosen as features. One algorithm, made by Kaito Goto, won two Solo Silver Medals at the Kaggle March Madness competition by utilizing seeding, form, efficiency, and blocks per foul in a Gradient Boosting Machine (a sequential ensemble of regression models) [2]. The interesting thing about Goto's model is its usage of stats. One would infer that mainstream statistics such as ranking, points, assists, rebounds, etc. would be used for prediction, but the stats chosen for this model receive far less attention. Looking at Goto's model, it might be a good idea to broaden our parameter search to include more obscure statistics, as Goto did, in order to increase the likelihood of finding a combination of parameters with higher predictive power.

Another machine learning algorithm, made by Adrian Pierce and Lotan Weininger, finished in the top ten percent of the Google Machine Learning competition. In their approach, the statistics utilized were the Pomeroy ranking (team rankings created by Ken Pomeroy, a leading college basketball statistician), offensive rating, defensive rating, net rating, tempo, possession time per game, and adjusted Pomeroy ranking (adjusted for non-linearity). For the method, they chose a logistic regression model to fit their data. Along with the pattern of different stat combinations being used, another interesting point Pierce and Weininger elaborated on was their model fine-tuning. They described doing extensive fine-tuning for the specific bracket they were predicting, which suggests that a model fine-tuned to a single year is more feasible and accurate than a general March Madness prediction algorithm. Intuitively, this makes sense, as a variety of factors shape the features and structure of each year's unique bracket [3]. Going forward, our approach would likely be to create a prediction algorithm tuned to the most recent year rather than a general predictor.

Problem Statement

The problem we are solving is predicting the outcomes of basketball games with a high level of accuracy. We aim to create a classification model that predicts the outcome of any basketball game given the appropriate data. The intention of our model is not to generate brackets themselves, but rather to generate individual game predictions with high enough accuracy that the model can be used for downstream purposes such as bracket prediction.

Data

Kaggle.com has a large amount of historical NCAA men's basketball data (available at https://www.kaggle.com/c/mens-march-mania-2022/data). The competition provides decades of regular season and March Madness data spread across multiple datasets with thousands of rows each. For example, the file entitled MRegularSeasonCompactResults.csv contains roughly 170,000 rows and 8 columns. Another dataset, MNCAATourneyDetailedResults.csv, contains 34 variables, including points, assists, field goals made, rebounds, and more for both the winning and losing teams. It will be critical to train our model on the differences between a winning team's stats and a losing team's stats in order for it to classify teams effectively. The vast majority of the variables in these datasets are numeric, which will be useful since we can expect them to be easier to analyze than categorical variables. One of the main issues we will face with this dataset is deciding what we should and should not include in our model. We may choose to pull data from a few different files, in which case we will have to consider how to combine them. From a brief skim of the datasets, it appears that there are very few missing values, so cleaning should not be a huge issue.

As of now, the most important variables in the data for our model are score, field goals, assists, turnovers, steals, blocks, and personal fouls. These are the specific variables our model will use to predict whether a given team wins any given match.

There were no missing values in the initial datasets, so no cleaning was required there. Various datasets were created for the purpose of EDA, but the main dataset that was used in our preliminary results contained only the important variables mentioned above and whether that team won or lost.

Below is our data wrangling and EDA, mainly consisting of data consolidation and merging to gather key stats in unified locations.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# https://www.kaggle.com/c/mens-march-mania-2022/data
data_path = 'mens-march-mania-2022/MDataFiles_Stage1/'
data = pd.read_csv(data_path + 'MNCAATourneyDetailedResults.csv')
teams_data = pd.read_csv(data_path + 'MTeams.csv')
season_data = pd.read_csv(data_path + 'MRegularSeasonDetailedResults.csv')
seed_data = pd.read_csv(data_path + 'MNCAATourneySeeds.csv')
ordinal_data = pd.read_csv(data_path + 'MMasseyOrdinals.csv')
# want to see all columns
pd.set_option('display.max_columns', None)
data.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
0 2003 134 1421 92 1411 84 N 1 32 69 11 29 17 26 14 30 17 12 5 3 22 29 67 12 31 14 31 17 28 16 15 5 0 22
1 2003 136 1112 80 1436 51 N 0 31 66 7 23 11 14 11 36 22 16 10 7 8 20 64 4 16 7 7 8 26 12 17 10 3 15
2 2003 136 1113 84 1272 71 N 0 31 59 6 14 16 22 10 27 18 9 7 4 19 25 69 7 28 14 21 20 22 11 12 2 5 18
3 2003 136 1141 79 1166 73 N 0 29 53 3 7 18 25 11 20 15 18 13 1 19 27 60 7 17 12 17 14 17 20 21 6 6 21
4 2003 136 1143 76 1301 74 N 1 27 64 7 20 15 23 18 20 17 13 8 2 14 25 56 9 21 15 20 10 26 16 14 5 8 19
season_data.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
0 2003 10 1104 68 1328 62 N 0 27 58 3 14 11 18 14 24 13 23 7 1 22 22 53 2 10 16 22 10 22 8 18 9 2 20
1 2003 10 1272 70 1393 63 N 0 26 62 8 20 10 19 15 28 16 13 4 4 18 24 67 6 24 9 20 20 25 7 12 8 6 16
2 2003 11 1266 73 1437 61 N 0 24 58 8 18 17 29 17 26 15 10 5 2 25 22 73 3 26 14 23 31 22 9 12 2 5 23
3 2003 11 1296 56 1457 50 N 0 18 38 3 9 17 31 6 19 11 12 14 2 18 18 49 6 22 8 15 17 20 9 19 4 3 23
4 2003 11 1400 77 1208 71 N 0 30 61 6 14 11 13 17 22 12 14 4 4 20 24 62 6 16 17 27 21 15 12 10 7 1 14
# checking for missing values
data.isna().sum()
# no nan values here!
Season     0
DayNum     0
WTeamID    0
WScore     0
LTeamID    0
LScore     0
WLoc       0
NumOT      0
WFGM       0
WFGA       0
WFGM3      0
WFGA3      0
WFTM       0
WFTA       0
WOR        0
WDR        0
WAst       0
WTO        0
WStl       0
WBlk       0
WPF        0
LFGM       0
LFGA       0
LFGM3      0
LFGA3      0
LFTM       0
LFTA       0
LOR        0
LDR        0
LAst       0
LTO        0
LStl       0
LBlk       0
LPF        0
dtype: int64
# checking for missing values
season_data.isna().sum()
# no nan values here!
Season     0
DayNum     0
WTeamID    0
WScore     0
LTeamID    0
LScore     0
WLoc       0
NumOT      0
WFGM       0
WFGA       0
WFGM3      0
WFGA3      0
WFTM       0
WFTA       0
WOR        0
WDR        0
WAst       0
WTO        0
WStl       0
WBlk       0
WPF        0
LFGM       0
LFGA       0
LFGM3      0
LFGA3      0
LFTM       0
LFTA       0
LOR        0
LDR        0
LAst       0
LTO        0
LStl       0
LBlk       0
LPF        0
dtype: int64
# data stats
data.describe()

Season DayNum WTeamID WScore LTeamID LScore NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
count 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000 1181.000000
mean 2011.650296 139.109229 1293.542760 75.234547 1294.587638 63.707028 0.071126 26.344623 55.462320 7.022862 18.298899 15.522439 21.328535 10.523285 25.861135 14.259949 11.447925 6.392887 3.944962 16.324301 22.911939 57.712108 6.254022 20.502964 11.629128 16.617273 10.970364 21.368332 11.409822 12.171041 5.707875 2.939881 18.853514
std 5.274224 4.234734 102.884842 10.724755 105.945614 10.305693 0.294045 4.751158 7.225518 2.965677 5.679989 6.059217 7.681727 3.978822 5.002307 4.341092 3.812563 2.977842 2.526038 3.860358 4.104239 7.300343 2.761999 5.826098 5.114416 6.609636 4.215384 4.463864 3.641061 3.920983 2.670336 2.045817 4.232007
min 2003.000000 134.000000 1101.000000 47.000000 1101.000000 29.000000 0.000000 13.000000 34.000000 0.000000 4.000000 0.000000 1.000000 0.000000 13.000000 3.000000 2.000000 0.000000 0.000000 5.000000 11.000000 37.000000 0.000000 5.000000 0.000000 2.000000 1.000000 8.000000 2.000000 3.000000 0.000000 0.000000 7.000000
25% 2007.000000 136.000000 1211.000000 68.000000 1210.000000 57.000000 0.000000 23.000000 51.000000 5.000000 14.000000 11.000000 16.000000 8.000000 22.000000 11.000000 9.000000 4.000000 2.000000 14.000000 20.000000 53.000000 4.000000 16.000000 8.000000 12.000000 8.000000 18.000000 9.000000 9.000000 4.000000 1.000000 16.000000
50% 2012.000000 137.000000 1277.000000 75.000000 1295.000000 63.000000 0.000000 26.000000 55.000000 7.000000 18.000000 15.000000 21.000000 10.000000 26.000000 14.000000 11.000000 6.000000 4.000000 16.000000 23.000000 58.000000 6.000000 20.000000 11.000000 16.000000 11.000000 21.000000 11.000000 12.000000 6.000000 3.000000 19.000000
75% 2016.000000 139.000000 1393.000000 82.000000 1393.000000 71.000000 0.000000 29.000000 60.000000 9.000000 22.000000 19.000000 26.000000 13.000000 29.000000 17.000000 14.000000 8.000000 5.000000 19.000000 26.000000 63.000000 8.000000 24.000000 15.000000 21.000000 14.000000 24.000000 14.000000 15.000000 7.000000 4.000000 22.000000
max 2021.000000 154.000000 1463.000000 121.000000 1463.000000 105.000000 2.000000 44.000000 84.000000 18.000000 41.000000 38.000000 48.000000 26.000000 43.000000 29.000000 28.000000 20.000000 15.000000 30.000000 36.000000 85.000000 18.000000 42.000000 31.000000 39.000000 29.000000 42.000000 23.000000 27.000000 19.000000 13.000000 33.000000
# season_data stats
season_data.describe()

Season DayNum WTeamID WScore LTeamID LScore NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
count 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000 100423.000000
mean 2012.489350 70.513169 1287.600341 75.549396 1282.113341 63.548978 0.068311 26.228991 55.458122 7.218227 18.927069 15.873196 22.297083 10.669667 25.756829 14.712695 12.874202 6.983868 3.799120 16.880814 22.665146 56.574281 6.128994 19.849148 12.089611 17.841909 10.824004 21.714886 11.415582 14.196698 6.000040 2.864951 19.076546
std 5.564818 35.408043 104.947734 11.059565 104.123614 10.894962 0.305483 4.688120 7.526434 3.095434 5.927677 6.245465 8.119103 4.142087 4.910566 4.411885 4.006917 3.107018 2.441147 4.950588 4.364286 7.627384 2.795085 6.029684 5.356261 7.123095 4.221449 4.544578 3.728140 4.385896 2.753956 2.026886 5.483882
min 2003.000000 0.000000 1101.000000 34.000000 1101.000000 20.000000 0.000000 10.000000 27.000000 0.000000 1.000000 0.000000 0.000000 0.000000 5.000000 1.000000 1.000000 0.000000 0.000000 0.000000 6.000000 26.000000 0.000000 1.000000 0.000000 0.000000 0.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2008.000000 40.000000 1198.000000 68.000000 1191.000000 56.000000 0.000000 23.000000 50.000000 5.000000 15.000000 11.000000 17.000000 8.000000 22.000000 12.000000 10.000000 5.000000 2.000000 14.000000 20.000000 51.000000 4.000000 16.000000 8.000000 13.000000 8.000000 19.000000 9.000000 11.000000 4.000000 1.000000 16.000000
50% 2013.000000 73.000000 1286.000000 75.000000 1281.000000 63.000000 0.000000 26.000000 55.000000 7.000000 19.000000 15.000000 22.000000 10.000000 26.000000 14.000000 13.000000 7.000000 3.000000 17.000000 23.000000 56.000000 6.000000 19.000000 12.000000 17.000000 10.000000 22.000000 11.000000 14.000000 6.000000 3.000000 19.000000
75% 2017.000000 101.000000 1380.000000 83.000000 1373.000000 71.000000 0.000000 29.000000 60.000000 9.000000 23.000000 20.000000 28.000000 13.000000 29.000000 17.000000 15.000000 9.000000 5.000000 20.000000 25.000000 61.000000 8.000000 24.000000 15.000000 22.000000 13.000000 25.000000 14.000000 17.000000 8.000000 4.000000 22.000000
max 2022.000000 132.000000 1472.000000 144.000000 1472.000000 140.000000 6.000000 57.000000 103.000000 26.000000 56.000000 48.000000 67.000000 38.000000 54.000000 41.000000 33.000000 26.000000 21.000000 41.000000 47.000000 106.000000 22.000000 59.000000 42.000000 61.000000 36.000000 49.000000 31.000000 41.000000 22.000000 18.000000 45.000000
# add team names to team id's
data['WTeamName'] = [teams_data[teams_data['TeamID'] == x].iloc[0,1] for x in data['WTeamID']]
data['LTeamName'] = [teams_data[teams_data['TeamID'] == x].iloc[0,1] for x in data['LTeamID']]
season_data['WTeamName'] = [teams_data[teams_data['TeamID'] == x].iloc[0,1] for x in season_data['WTeamID']]
season_data['LTeamName'] = [teams_data[teams_data['TeamID'] == x].iloc[0,1] for x in season_data['LTeamID']]
# add seeds/rankings to each winning/losing team row
data['WSeed'] = [int(seed_data[(seed_data['Season']==data['Season'][x]) & (seed_data['TeamID']==data['WTeamID'][x])].iloc[0,1][1:3]) for x in range(len(data))]
data['LSeed'] = [int(seed_data[(seed_data['Season']==data['Season'][x]) & (seed_data['TeamID']==data['LTeamID'][x])].iloc[0,1][1:3]) for x in range(len(data))]
data['WOrdinalRank'] = [ordinal_data[(ordinal_data['Season']==data['Season'][x]) & (ordinal_data['TeamID']==data['WTeamID'][x])].iloc[0,4] for x in range(len(data))]
data['LOrdinalRank'] = [ordinal_data[(ordinal_data['Season']==data['Season'][x]) & (ordinal_data['TeamID']==data['LTeamID'][x])].iloc[0,4] for x in range(len(data))]
# calculate number of wins and losses for each team
team_wins = season_data.groupby(['Season', 'WTeamID']).count()
team_wins = team_wins.reset_index()[['Season', 'WTeamID', 'WScore']].rename(columns = {'WTeamID': 'TeamID', 'WScore': 'NumWins'})
team_losses = season_data.groupby(['Season', 'LTeamID']).count()
team_losses = team_losses.reset_index()[['Season', 'LTeamID', 'LScore']].rename(columns = {'LTeamID': 'TeamID', 'LScore': 'NumLosses'})
#create a dataframe containing the information on wins and losses
data_features_w = season_data.groupby(['Season', 'WTeamID']).count().reset_index()[['Season', 'WTeamID']].rename(columns={"WTeamID": "TeamID"})
data_features_l = season_data.groupby(['Season', 'LTeamID']).count().reset_index()[['Season', 'LTeamID']].rename(columns={"LTeamID": "TeamID"})
data_features = pd.concat([data_features_w, data_features_l], axis=0).drop_duplicates().sort_values(['Season', 'TeamID']).reset_index(drop=True)
data_features = data_features.merge(team_wins, on=['Season', 'TeamID'], how='left')
data_features = data_features.merge(team_losses, on=['Season', 'TeamID'], how='left')
data_features.fillna(0, inplace=True)
data_features['WinPct'] = data_features['NumWins'] / (data_features['NumWins'] + data_features['NumLosses'])
data.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF WTeamName LTeamName WSeed LSeed WOrdinalRank LOrdinalRank
0 2003 134 1421 92 1411 84 N 1 32 69 11 29 17 26 14 30 17 12 5 3 22 29 67 12 31 14 31 17 28 16 15 5 0 22 UNC Asheville TX Southern 16 16 218 231
1 2003 136 1112 80 1436 51 N 0 31 66 7 23 11 14 11 36 22 16 10 7 8 20 64 4 16 7 7 8 26 12 17 10 3 15 Arizona Vermont 1 16 26 192
2 2003 136 1113 84 1272 71 N 0 31 59 6 14 16 22 10 27 18 9 7 4 19 25 69 7 28 14 21 20 22 11 12 2 5 18 Arizona St Memphis 10 7 89 67
3 2003 136 1141 79 1166 73 N 0 29 53 3 7 18 25 11 20 15 18 13 1 19 27 60 7 17 12 17 14 17 20 21 6 6 21 C Michigan Creighton 11 6 63 3
4 2003 136 1143 76 1301 74 N 1 27 64 7 20 15 23 18 20 17 13 8 2 14 25 56 9 21 15 20 10 26 16 14 5 8 19 California NC State 8 9 116 29
season_data.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF WTeamName LTeamName
0 2003 10 1104 68 1328 62 N 0 27 58 3 14 11 18 14 24 13 23 7 1 22 22 53 2 10 16 22 10 22 8 18 9 2 20 Alabama Oklahoma
1 2003 10 1272 70 1393 63 N 0 26 62 8 20 10 19 15 28 16 13 4 4 18 24 67 6 24 9 20 20 25 7 12 8 6 16 Memphis Syracuse
2 2003 11 1266 73 1437 61 N 0 24 58 8 18 17 29 17 26 15 10 5 2 25 22 73 3 26 14 23 31 22 9 12 2 5 23 Marquette Villanova
3 2003 11 1296 56 1457 50 N 0 18 38 3 9 17 31 6 19 11 12 14 2 18 18 49 6 22 8 15 17 20 9 19 4 3 23 N Illinois Winthrop
4 2003 11 1400 77 1208 71 N 0 30 61 6 14 11 13 17 22 12 14 4 4 20 24 62 6 16 17 27 21 15 12 10 7 1 14 Texas Georgia
data_features.head()

Season TeamID NumWins NumLosses WinPct
0 2003 1102 12.0 16.0 0.428571
1 2003 1103 13.0 14.0 0.481481
2 2003 1104 17.0 11.0 0.607143
3 2003 1105 7.0 19.0 0.269231
4 2003 1106 13.0 15.0 0.464286

From the previews of the cleaned datasets, we can see all the variables that we have to work with as well as the win percentage of each team. The datasets have data for both teams in each match of each season with no missing values.

EDA

Below is the EDA that was performed on the data. The main goal of the analysis was to plot the differentials of each key stat and check whether they are roughly normally distributed, as a sanity check on the data. The tournament data was analyzed separately from the regular season data to ensure that both datasets are reliable.

EDA of Tournament Data Stat Differentials

# histogram of point differentials
sns.histplot(data['WScore']-data['LScore'])
plt.axvline(x=(data['WScore']-data['LScore']).mean(),
            color='red', ls='--')
[figure: histogram of tournament point differentials, mean marked in red]

# histogram of 3 point make differentials
sns.histplot(data['WFGM3']-data['LFGM3'])
plt.axvline(x=(data['WFGM3']-data['LFGM3']).mean(),
            color='red', ls='--')
[figure: histogram of tournament 3 point make differentials, mean marked in red]

# histogram of free throw make differentials
sns.histplot(data['WFTM']-data['LFTM'])
plt.axvline(x=(data['WFTM']-data['LFTM']).mean(),
            color='red', ls='--')
[figure: histogram of tournament free throw make differentials, mean marked in red]

# histogram of offensive rebound differentials
sns.histplot(data['WOR']-data['LOR'])
plt.axvline(x=(data['WOR']-data['LOR']).mean(),
            color='red', ls='--')

[figure: histogram of tournament offensive rebound differentials, mean marked in red]

# histogram of defensive rebound differentials
sns.histplot(data['WDR']-data['LDR'])
plt.axvline(x=(data['WDR']-data['LDR']).mean(),
            color='red', ls='--')

[figure: histogram of tournament defensive rebound differentials, mean marked in red]

# histogram of personal foul differentials
sns.histplot(data['WPF']-data['LPF'])
plt.axvline(x=(data['WPF']-data['LPF']).mean(),
            color='red', ls='--')
[figure: histogram of tournament personal foul differentials, mean marked in red]

# histogram of turnover differentials
sns.histplot(data['WTO']-data['LTO'])
plt.axvline(x=(data['WTO']-data['LTO']).mean(),
            color='red', ls='--')
[figure: histogram of tournament turnover differentials, mean marked in red]

# histogram of assist differentials
sns.histplot(data['WAst']-data['LAst'])
plt.axvline(x=(data['WAst']-data['LAst']).mean(),
            color='red', ls='--')
[figure: histogram of tournament assist differentials, mean marked in red]

# histogram of steal differentials
sns.histplot(data['WStl']-data['LStl'])
plt.axvline(x=(data['WStl']-data['LStl']).mean(),
            color='red', ls='--')
[figure: histogram of tournament steal differentials, mean marked in red]

EDA of Season Data Stat Differentials

# histogram of point differentials
sns.histplot(season_data['WScore']-season_data['LScore'], bins=20)
plt.axvline(x=(season_data['WScore']-season_data['LScore']).mean(),
            color='red', ls='--')

[figure: histogram of regular season point differentials, mean marked in red]

# histogram of 3 point make differentials
sns.histplot(season_data['WFGM3']-season_data['LFGM3'], bins=20)
plt.axvline(x=(season_data['WFGM3']-season_data['LFGM3']).mean(),
            color='red', ls='--')

[figure: histogram of regular season 3 point make differentials, mean marked in red]

# histogram of free throw make differentials
sns.histplot(season_data['WFTM']-season_data['LFTM'], bins=20)
plt.axvline(x=(season_data['WFTM']-season_data['LFTM']).mean(),
            color='red', ls='--')

[figure: histogram of regular season free throw make differentials, mean marked in red]

# histogram of offensive rebound differentials
sns.histplot(season_data['WOR']-season_data['LOR'], bins=20)
plt.axvline(x=(season_data['WOR']-season_data['LOR']).mean(),
            color='red', ls='--')

[figure: histogram of regular season offensive rebound differentials, mean marked in red]

# histogram of defensive rebound differentials
sns.histplot(season_data['WDR']-season_data['LDR'], bins=20)
plt.axvline(x=(season_data['WDR']-season_data['LDR']).mean(),
            color='red', ls='--')

[figure: histogram of regular season defensive rebound differentials, mean marked in red]

# histogram of personal foul differentials
sns.histplot(season_data['WPF']-season_data['LPF'], bins=20)
plt.axvline(x=(season_data['WPF']-season_data['LPF']).mean(),
            color='red', ls='--')

[figure: histogram of regular season personal foul differentials, mean marked in red]

# histogram of turnover differentials
sns.histplot(season_data['WTO']-season_data['LTO'], bins=20)
plt.axvline(x=(season_data['WTO']-season_data['LTO']).mean(),
            color='red', ls='--')

[figure: histogram of regular season turnover differentials, mean marked in red]

# histogram of assist differentials
sns.histplot(season_data['WAst']-season_data['LAst'], bins=20)
plt.axvline(x=(season_data['WAst']-season_data['LAst']).mean(),
            color='red', ls='--')

[figure: histogram of regular season assist differentials, mean marked in red]

# histogram of steal differentials
sns.histplot(season_data['WStl']-season_data['LStl'], bins=20)
plt.axvline(x=(season_data['WStl']-season_data['LStl']).mean(),
            color='red', ls='--')

[figure: histogram of regular season steal differentials, mean marked in red]

From the EDA we can see that most of the stat differentials in both the tournament and regular season data are roughly normally distributed. This suggests the data is well-behaved and that we have enough observations of each variable to properly train and test our model.
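
As a sanity check beyond eyeballing the histograms, a formal normality test could be run on each differential. Below is a minimal sketch using scipy (which we do not otherwise use in this notebook, so treat it as an optional extra); note that with this many observations the test is very sensitive to small departures from normality.

from scipy import stats

# D'Agostino-Pearson normality test on tournament point differentials
stat, p = stats.normaltest(data['WScore'] - data['LScore'])
print(f"statistic={stat:.2f}, p-value={p:.4f}")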

Proposed Solution

With the above mentioned data, we propose to build a classification model using NCAA regular season and March Madness data with the aim of accurately predicting individual game outcomes. We experimented with both logistic regression and gradient boosted decision tree classifiers and determined that logistic regression provided the highest accuracy. We then trained and tuned a logistic regression classifier by comparing each team's statistics in an individual match to the outcome of that match. As we developed the model, variables with suspiciously high correlation to match outcome (such as final score) were removed from the analysis so the model would not simply learn trivial relationships. Once the model was trained, we tested it on the held-out data that wasn't used for training: the model was given a team's statistics for a given match and tasked with predicting the outcome of that match. After classifying all the test data, we calculated accuracy, recall, and precision from the counts of true positives, false positives, true negatives, and false negatives. These three metrics were used to measure the success of the model.
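
As a condensed sketch of this pipeline (the full, tuned version appears in the Results section; X and y here stand for the prepared per-team game stats and win labels described below):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score

# split, fit, predict, and score a baseline logistic regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      precision_score(y_test, y_pred))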

Evaluation Metrics

The main evaluation metrics that we used to determine the success of our model were accuracy, recall, and precision. These three statistics each capture a different aspect of how well the model classified the data, and together they show how well the model performed overall.

Accuracy was determined by dividing the number of correct predictions by the total number of predictions. This metric gave us a general idea of how well the model performed when it came to correctly predicting both match wins and match losses.

Recall was determined by dividing the number of correctly predicted match wins by the total number of matches that were wins (true positives + false negatives). This metric helped us to determine what percentage of match wins were correctly classified as wins compared to match wins that were misclassified as losses.

Precision was calculated by dividing the number of correctly predicted match wins by the total number of predicted wins. This metric helped us to determine what percentage of the model’s predicted wins were actual wins.
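
In code, these three definitions map directly onto the confusion matrix counts that we compute later with confusion_matrix(...).ravel(); a minimal sketch of the formulas:

# tp/tn/fp/fn = true/false positives/negatives from the confusion matrix
def accuracy(tp, tn, fp, fn):
    # correct predictions over all predictions
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    # correctly predicted wins over all actual wins
    return tp / (tp + fn)

def precision(tp, fp):
    # correctly predicted wins over all predicted wins
    return tp / (tp + fp)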

Results

Preliminary Model Testing and Selection

This section will cover the preliminary testing that we did with both the logistic regression and gradient boosted decision tree classifiers to determine which would be the best for our model. These two classification algorithms were chosen because our research into preexisting bracket predictors indicated that logistic regression and gradient boosted decision trees typically have the highest accuracy.

Logistic Regression

Starting with logistic regression, the data for each match was split by winning team and losing team: one dataframe contained all of the win data and another contained all of the loss data, so that win and loss data was not mixed and was easier to work with.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
# https://www.kaggle.com/c/mens-march-mania-2022/data

# regular season data
data_regszn = pd.read_csv(data_path + 'MRegularSeasonDetailedResults.csv')

# tournament data
data_tournament = pd.read_csv(data_path + 'MNCAATourneyDetailedResults.csv')
data_regszn.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
0 2003 10 1104 68 1328 62 N 0 27 58 3 14 11 18 14 24 13 23 7 1 22 22 53 2 10 16 22 10 22 8 18 9 2 20
1 2003 10 1272 70 1393 63 N 0 26 62 8 20 10 19 15 28 16 13 4 4 18 24 67 6 24 9 20 20 25 7 12 8 6 16
2 2003 11 1266 73 1437 61 N 0 24 58 8 18 17 29 17 26 15 10 5 2 25 22 73 3 26 14 23 31 22 9 12 2 5 23
3 2003 11 1296 56 1457 50 N 0 18 38 3 9 17 31 6 19 11 12 14 2 18 18 49 6 22 8 15 17 20 9 19 4 3 23
4 2003 11 1400 77 1208 71 N 0 30 61 6 14 11 13 17 22 12 14 4 4 20 24 62 6 16 17 27 21 15 12 10 7 1 14
data_tournament.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA WFGM3 WFGA3 WFTM WFTA WOR WDR WAst WTO WStl WBlk WPF LFGM LFGA LFGM3 LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
0 2003 134 1421 92 1411 84 N 1 32 69 11 29 17 26 14 30 17 12 5 3 22 29 67 12 31 14 31 17 28 16 15 5 0 22
1 2003 136 1112 80 1436 51 N 0 31 66 7 23 11 14 11 36 22 16 10 7 8 20 64 4 16 7 7 8 26 12 17 10 3 15
2 2003 136 1113 84 1272 71 N 0 31 59 6 14 16 22 10 27 18 9 7 4 19 25 69 7 28 14 21 20 22 11 12 2 5 18
3 2003 136 1141 79 1166 73 N 0 29 53 3 7 18 25 11 20 15 18 13 1 19 27 60 7 17 12 17 14 17 20 21 6 6 21
4 2003 136 1143 76 1301 74 N 1 27 64 7 20 15 23 18 20 17 13 8 2 14 25 56 9 21 15 20 10 26 16 14 5 8 19
combined_data = pd.concat([data_regszn, data_tournament])
# extract all game stats of the winning team
# year, daynum, teamID, and numOT ignored
# .copy() avoids pandas SettingWithCopyWarning when we add columns below
wins_data = combined_data[['WScore', 'WLoc', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 
                  'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF']].copy()

wins_data.rename(columns = {'WScore':'Score', 
                            'WLoc':'Loc',
                            'WFGM':'FGM',
                            'WFGA':'FGA',
                            'WFGM3':'FGM3',
                            'WFGA3':'FGA3',
                            'WFTM':'FTM',
                            'WFTA':'FTA',
                            'WAst':'Ast',
                            'WOR':'OR',
                            'WDR':'DR',
                            'WTO':'TO',
                            'WStl':'Stl',
                            'WBlk':'Blk',
                            'WPF':'PF'
                           }, inplace = True)

wins_data['Won'] = 1

wins_data['Loc'] = wins_data['Loc'].map({'H':1, 'N':0, 'A':-1})
# do the same thing for losses
# there is no LLoc, only WLoc, so reverse it when mapping
losses_data = combined_data[['LScore', 'WLoc', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 
                  'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']].copy()

losses_data.rename(columns = {'LScore':'Score', 
                            'WLoc':'Loc',
                            'LFGM':'FGM',
                            'LFGA':'FGA',
                            'LFGM3':'FGM3',
                            'LFGA3':'FGA3',
                            'LFTM':'FTM',
                            'LFTA':'FTA',
                            'LAst':'Ast',
                            'LOR':'OR',
                            'LDR':'DR',
                            'LTO':'TO',
                            'LStl':'Stl',
                            'LBlk':'Blk',
                            'LPF':'PF'
                           }, inplace = True)

losses_data['Won'] = 0

losses_data['Loc'] = losses_data['Loc'].map({'H':-1, 'N':0, 'A':1})

After isolating the win and loss data, it was moved into a single dataframe for ease of access, as in the cell below. This dataframe differs from the original because each team in each match is represented as a separate row, rather than having both the winning and losing team's data in the same row. With each row now containing only one team's performance and outcome in an individual match, the data is represented in a way the logistic regression classifier can process. The data was then shuffled to prevent its order from having an effect on the results.

# combine wins and losses dataframes
data = pd.concat([wins_data, losses_data])

# shuffle data
data = data.sample(frac = 1)

data.head()

Score Loc FGM FGA FGM3 FGA3 FTM FTA OR DR Ast TO Stl Blk PF Won
37899 72 -1 27 61 7 17 11 18 11 19 12 20 9 1 22 0
83564 54 -1 24 58 3 11 3 9 8 20 15 10 3 0 10 0
29654 41 1 10 44 1 9 20 27 11 20 4 22 4 3 13 0
2012 83 -1 31 73 12 34 9 12 10 22 18 16 6 2 26 0
94628 49 -1 17 58 3 18 12 25 15 18 11 17 6 3 0 0

Controlling for variable correlation

Before inputting the data into the model, each variable's correlation to match wins was calculated and displayed in a heatmap. Variables that correlate too strongly with winning (most obviously Score, which essentially determines the outcome) would let the model learn a trivial relationship, so the most highly correlated variables were removed from the analysis.

# sorts columns by correlation to wins
ix = data.corr().sort_values('Won', ascending=False).index
data = data.loc[:, ix]
data.columns
Index(['Won', 'Score', 'DR', 'Ast', 'FGM', 'FTM', 'Loc', 'FTA', 'Blk', 'FGM3',
       'Stl', 'OR', 'FGA', 'FGA3', 'TO', 'PF'],
      dtype='object')
# heatmap to examine the correlation between team stats and wins
plt.figure(figsize=(10,8))
cor = data[['Won', 'Score', 'DR', 'Ast', 'FGM', 'FTM', 'Loc', 'FTA', 'Blk', 'FGM3',
           'Stl', 'OR', 'FGA', 'FGA3', 'TO', 'PF'
                ]].corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.title('Correlation Between Team Stats and Wins')
plt.show()

[figure: heatmap of correlation between team stats and wins]

Finding the optimal number of parameters

Now that the data is fully prepared, a model was created using a train-test split of the data. To check how many variables should be included in the analysis, the accuracy, precision, and recall of a logistic regression model were calculated first with all the variables. Then one variable was removed and the evaluation metrics were calculated again. This process continued until only one variable remained, and the metrics were plotted for each number of variables. Through this analysis, it was determined that using all of the variables provided the best accuracy for the model.

# want to find optimal parameters

# removed highly correlated params
# .copy() so dropping columns in the loop below doesn't trigger a warning
X = data[['DR', 'Ast', 'FGM', 'Loc', 'FTA', 'Blk',
           'Stl', 'OR', 'FGA', 'FGA3', 'TO', 'PF']].copy()
y = data['Won']

accuracy = []
recall = []
precision = []

for i in range(len(X.columns)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
    clf = LogisticRegression(solver='saga', max_iter=1000).fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    accuracy.append((tp + tn) / (tp + tn + fp + fn))
    recall.append(tp / (tp + fn))
    precision.append(tp / (tp + fp))
    
    # drop the variable at the end of the correlation-sorted list before the next iteration
    X.drop(columns=X.columns[-1], inplace=True)
df = pd.DataFrame({'Accuracy': accuracy,
                   'Recall': recall,
                   'Precision': precision},
                 index=['12', '11', '10', '9',
                       '8', '7', '6', '5', '4',
                       '3', '2', '1'])

df.plot(kind='bar', stacked=True, color=['red', 'skyblue', 'green'], figsize=(10,6))
plt.xlabel('Number of Vars')
plt.ylabel('Accuracy + Recall + Precision')
plt.title('Accuracy, Recall, and Precision for Predictor Combinations')
[figure: stacked bar chart of accuracy, recall, and precision by number of variables]

In the above plot we can see that the full number of variables (12) provided the highest total of accuracy, recall and precision, so we continued our analysis with all 12 variables.

Finding optimal regularization term and C value

Next we sought to find the optimal regularization term and C value for the logistic regression solver and classifier.

To find the optimal regularization term, the elasticnet penalty was used and l1_ratio values ranging from 0.0 (equivalent to l2 term) to 1.0 (equivalent to l1 term) were tested in steps of 0.1. The evaluation metrics for each l1_ratio value were stored and the metric values were plotted.
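
Concretely, per the scikit-learn documentation, the elastic net penalty blends the two terms as roughly l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2, which is why l1_ratio = 0.0 reduces to a pure l2 (ridge) penalty and l1_ratio = 1.0 to a pure l1 (lasso) penalty.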

# finding optimal regularization term
accuracy = []
recall = []
precision = []

X = data[['DR', 'Ast', 'FGM', 'Loc', 'FTA', 'Blk',
           'Stl', 'OR', 'FGA', 'FGA3', 'TO', 'PF']]

# 0 = l2, 0.5 = half/half, 1.0 = l1
l1rs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

for l1r in l1rs:
    clf = LogisticRegression(solver='saga', max_iter=1000, penalty = 'elasticnet', l1_ratio = l1r).fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    accuracy.append((tp + tn) / (tp + tn + fp + fn))
    recall.append(tp / (tp + fn))
    precision.append(tp / (tp + fp))
reg_df = pd.DataFrame({'Accuracy': accuracy,
                   'Recall': recall,
                   'Precision': precision},
                 index=l1rs)

reg_df.plot(kind='bar', stacked=True, color=['red', 'skyblue', 'green'], figsize=(10,6))
plt.xlabel('l1_ratio Value')
plt.ylabel('Accuracy + Recall + Precision')
plt.title('Accuracy, Recall, and Precision for Penalty Terms')

reg_df

l1_ratio Accuracy Recall Precision
0.0 0.868474 0.868378 0.868378
0.1 0.868489 0.868408 0.868382
0.2 0.868489 0.868408 0.868382
0.3 0.868459 0.868378 0.868352
0.4 0.868474 0.868408 0.868356
0.5 0.868534 0.868497 0.868394
0.6 0.868534 0.868467 0.868416
0.7 0.868489 0.868408 0.868382
0.8 0.868459 0.868378 0.868352
0.9 0.868489 0.868408 0.868382
1.0 0.868489 0.868408 0.868382

[figure: stacked bar chart of accuracy, recall, and precision by l1_ratio value]

From the results, it seemed as though no combination of the l1 and l2 regularization terms differed significantly from the others. Since the choice mattered little, we went with an l1_ratio value of 0.3 moving forward.

Next, we sought to find an optimal C value for the logistic regression classifier. C values ranging from 0.01 to 100 were tested with no fixed step size in between, and each value was evaluated on the same validation set.

# finding optimal C
accuracy = []
recall = []
precision = []

Cs = [0.01, 0.1, 0.5, 1.0, 5, 10, 50, 100]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

for c in Cs:
    # fit on the training set, then evaluate each C on the held-out validation set
    clf = LogisticRegression(solver='saga', penalty = 'elasticnet', l1_ratio = 0.3, max_iter=1000, C = c).fit(X_train, y_train)
    
    y_pred = clf.predict(X_val)
    
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    
    accuracy.append((tp + tn) / (tp + tn + fp + fn))
    recall.append(tp / (tp + fn))
    precision.append(tp / (tp + fp))
c_df = pd.DataFrame({'Accuracy': accuracy,
                   'Recall': recall,
                   'Precision': precision},
                 index=Cs)

c_df.plot(kind='bar', stacked=True, color=['red', 'skyblue', 'green'], figsize=(10,6))
plt.xlabel('C Value')
plt.ylabel('Accuracy + Recall + Precision')
plt.title('Accuracy, Recall, and Precision for C Values')

c_df

C Accuracy Recall Precision
0.01 0.870468 0.868204 0.871846
0.10 0.870292 0.868087 0.871625
0.50 0.870116 0.867851 0.871492
1.00 0.870116 0.867851 0.871492
5.00 0.870116 0.867851 0.871492
10.00 0.870145 0.867910 0.871499
50.00 0.870116 0.867851 0.871492
100.00 0.870116 0.867851 0.871492

[figure: stacked bar chart of accuracy, recall, and precision by C value]

The differences in performance between the C values were also not significant, so we decided to use C = 5 moving forward.

Gradient Boosted Decision Tree

We also attempted to fit multiple gradient boosting models to the training data, using scikit-learn's prebuilt GradientBoostingClassifier as a baseline, as well as XGBoost (extreme gradient boosting) and LightGBM (light gradient boosting machine) in hopes of increasing efficiency and accuracy.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
import xgboost
import lightgbm as lgb
# sklearn implementation
gbc = GradientBoostingClassifier()
gbc_params = {
              'max_features': [None],
              'loss': ['log_loss'],  # 'deviance' was deprecated in scikit-learn 1.1
              'n_estimators': [150],
              'max_depth': [3],
              'min_samples_leaf': [220],
              'min_samples_split': [2],
              'learning_rate': [0.1],
              'criterion': ['friedman_mse'],
              'min_weight_fraction_leaf': [0],
              'subsample': [1],
              'max_leaf_nodes': [16],
              'min_impurity_decrease': [0.2],
             }
gbc_grid = GridSearchCV(gbc, param_grid = gbc_params, cv=5, verbose=1, n_jobs =-1)
gbc_grid.fit(X_train, y_train)
gbc_grid.best_score_
Fitting 5 folds for each of 1 candidates, totalling 5 fits

0.8582131125213153
# XGBoost implementation
xgb = xgboost.XGBClassifier()
xgb_params = {
              'max_depth': [5],
              'learning_rate': [0.05],
              'n_estimators': [300],
              'gamma': [.65],
              'min_child_weight': [3],
              'max_delta_step': [2],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'colsample_bylevel': [0.8],
              'reg_alpha': [0.1],
              'reg_lambda': [0.2],
              'scale_pos_weight' : [1],
              'base_score' : [0.5],
             }
grid = GridSearchCV(xgb, param_grid = xgb_params, cv=5, n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
grid.best_score_
Fitting 5 folds for each of 1 candidates, totalling 5 fits

0.865714761623319
# LightGBM implementation
lgbm = lgb.LGBMClassifier()
lgbm_params = {
               'num_boost_round': [70],
               'learning_rate': [.05],
               'num_leaves': [25],
               'num_threads': [4],
               'max_depth': [8],
               'min_data_in_leaf': [10],
               'feature_fraction': [1.0],
               'feature_fraction_seed': [95],
               'bagging_freq': [0],
               'bagging_seed': [95],
               'lambda_l1': [0.0],
               'lambda_l2': [0.0],
               'min_split_gain': [0],
             }
lgbm_grid = GridSearchCV(lgbm, param_grid = lgbm_params, cv=5, verbose=1)
lgbm_grid.fit(X_train, y_train)
lgbm_grid.best_score_
Fitting 5 folds for each of 1 candidates, totalling 5 fits

0.8410651058580324

Looking at the three models, all have roughly the same score, varying by about two percentage points. Even when changing parameter values (in some cases to extreme outliers), the accuracy does not shift much. Since the best scores of the other models do not outperform the average score of our Logistic Regression models, we decided to use Logistic Regression for our final prediction model.
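
For reference, below is a minimal sketch of how a single GBDT run like the ones above could be set up. The parameter values mirror the settings reported in the LightGBM warning logs our runs emitted (num_iterations=70, min_data_in_leaf=10, no L1/L2 regularization); the feature matrix X, labels y, and the train/test split are assumptions standing in for our actual pipeline rather than the exact code we ran.

# Hypothetical reconstruction of a single LightGBM run; parameter values
# come from the training logs, but the data handling here is assumed.
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=43)

params = {
    'objective': 'binary',     # win/loss classification
    'num_iterations': 70,      # boosting rounds, per the logs
    'min_data_in_leaf': 10,
    'feature_fraction': 1.0,
    'lambda_l1': 0.0,
    'lambda_l2': 0.0,
    'num_threads': 4,
    'verbose': -1,             # suppress the parameter-alias warnings
}

booster = lgb.train(params, lgb.Dataset(X_train, label=y_train))
# predict() returns win probabilities; threshold at 0.5 for accuracy
gbdt_acc = ((booster.predict(X_test) > 0.5) == y_test).mean()
print(gbdt_acc)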

Finalizing our model and tuning hyperparameters

In order to estimate the final performance of our model, we ran a KFold cross-validator with 10 splits on the model with the regularization penalty and C value determined in the previous section. Accuracy, recall, and precision were computed for each fold and then averaged across all ten folds. The average value of each evaluation metric can be seen below.

# estimating final performance using 10-fold cross-validation
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

features = ['DR', 'Ast', 'FGM', 'Loc', 'FTA', 'Blk',
            'Stl', 'OR', 'FGA', 'FGA3', 'TO', 'PF']

kf = KFold(n_splits=10, shuffle=True, random_state=43)
X = data[features]
y = data['Won']

accuracy = []
recall = []
precision = []
coef_vals = np.zeros(len(features))  # running sum of coefficients across folds

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # elastic-net logistic regression with the C and l1_ratio chosen earlier
    clf = LogisticRegression(solver='saga', max_iter=1000, C=5,
                             penalty='elasticnet', l1_ratio=0.3).fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    # derive each metric directly from the fold's confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    accuracy.append((tp + tn) / (tp + tn + fp + fn))
    recall.append(tp / (tp + fn))
    precision.append(tp / (tp + fp))

    # accumulate this fold's coefficients for the feature-importance table below
    coef_vals += clf.coef_[0]

print("Average accuracy:")
print(np.mean(accuracy))
print("Average recall:")
print(np.mean(recall))
print("Average precision:")
print(np.mean(precision))
Average accuracy:
0.8692374338229681
Average recall:
0.867077360903928
Average precision:
0.8708284754761404

From the KFold results, we also summed each predictor's learned coefficient across the ten folds, which gives a measure of how strongly each feature contributes to the model's predictions. The coefficient weights, sorted by absolute magnitude, can be seen below.

# rank the features by the absolute magnitude of their summed coefficients
import pandas as pd

coef_df = pd.DataFrame({'Coef': features, 'Weight': coef_vals})
coef_df.reindex(coef_df.Weight.abs().sort_values(ascending=False).index)

    Coef    Weight
2    FGM  4.753021
8    FGA -4.218257
6    Stl  3.967719
0     DR  3.642041
7     OR  3.370213
10    TO -3.355280
3    Loc  2.278813
11    PF -1.066218
1    Ast  0.818430
5    Blk  0.693089
4    FTA  0.509233
9   FGA3  0.479119
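
One caveat worth noting: raw logistic-regression coefficients are only directly comparable across features when the features share a similar scale. A quick way to sanity-check the ranking above is to refit on standardized inputs, as in the sketch below. The StandardScaler pipeline here is our assumed addition for illustration, not part of the analysis above.

# Hypothetical sanity check: refit on z-scored features so coefficient
# magnitudes are comparable across predictors (assumes X, y, and features
# from the cross-validation cell above).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaled_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000, C=5,
                       penalty='elasticnet', l1_ratio=0.3)
).fit(X, y)

# coefficients of the fitted classifier, ranked by absolute magnitude
scaled_weights = scaled_clf[-1].coef_[0]
print(pd.Series(scaled_weights, index=features)
        .sort_values(key=abs, ascending=False))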

Discussion

Interpreting the result

Our final logistic regression model achieved 86.9% accuracy, 86.7% recall, and 87.1% precision over a substantial dataset, a sign that it can effectively tackle the problem of predicting NCAA games. We accomplished this using 12 predictor variables. The predictors most relevant to determining a win/loss were field goals made (FGM), field goals attempted (FGA), steals (Stl), and defensive rebounds (DR), while the least relevant included free throws attempted (FTA) and three-point field goals attempted (FGA3). We compared our logistic regression model to gradient-boosted decision trees and found the trees effective but slightly less accurate, at roughly 85%. Although logistic regression worked well for our dataset, there are generalization concerns tied to the training data that we explore in the next section. Additionally, even after tweaking hyperparameters and experimenting with different predictor combinations, our accuracy hovered around 86.9% and never rose meaningfully above it. It's unclear why this value seemed to be our ceiling, and rigorous testing would be needed to determine whether other model families hit the same limit, but the inherent variability of an NCAA game makes it unlikely that any model could do significantly better. For example, player injuries, bad shooting nights, or the mental toll of a high-pressure environment can all swing the outcome on any given night, limiting the power of these models.

Limitations

The dataset we used was limited in its capacity to predict March Madness games in two key ways. First, the majority of the data was from regular season games, not March Madness games. Since March Madness games are higher stakes, and therefore higher pressure, it may be difficult to generalize trends discovered in the regular season to the tournament. Second, although March Madness games are played on neutral courts (neither team's home court), some courts are closer to one team's school, making it more convenient for that team's fans to show up. This could very well have an impact on player morale, but our model fails to capture it due to the shortcomings of the dataset. If we had a dataset that included each team's distance from its school to the arena, as well as more tournament data overall, our model would likely be more accurate and generalizable; see the illustrative sketch below.
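
Purely as an illustration of how the travel-distance feature could be derived if the data existed: given campus and venue coordinates, the great-circle (haversine) distance is straightforward to compute. The lat/lon columns referenced here are hypothetical and do not exist in our dataset.

# Illustrative only: building a travel-distance feature from hypothetical
# campus and venue coordinate columns (not present in our data).
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    # great-circle distance between two points on Earth, in miles
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ~3958.8 mi

# e.g. data['TravelDist'] = haversine_miles(data['SchoolLat'], data['SchoolLon'],
#                                           data['VenueLat'], data['VenueLon'])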

Ethics & Privacy

Since our model uses data from nationally televised basketball games, privacy is unlikely to be a major concern given the immense popularity of the March Madness tournament. In terms of ethical implications, there is always a possibility that our model will be biased toward choosing certain types of teams over others. It is important that we stay cognizant of this and of the impact it could have on collegiate athletes' careers. For example, if our model found that teams with taller players were more likely to win the championship, colleges that use the model might start giving more scholarships to taller players and fewer to shorter players. This would limit opportunities for short players, even if they are skilled and could contribute to a winning team. We must ensure that our model is as accurate as possible in order to minimize the chance of unfairly affecting players' opportunities. One way to do this is to focus only on team stats, which would decrease the chance of creating a model that overfits to individual players' attributes. However, if our model is used by collegiate recruiters, there is no way to fully prevent them from giving scholarships to players with skills that the model favors (for instance, if the model favors team three point shooting, then recruiters might recruit more three point specialists).

Conclusion

In selecting a predictor model aimed at producing an accurate March Madness bracket, our final logistic regression model outperformed gradient-boosted trees by a slight margin. However, the lack of variation in accuracy across both models, even under deliberately extreme parameter settings, points to a limitation of our data, which consists mostly of regular season D1 NCAA games. It appears that for this dataset the choice of model matters little: after parameter tuning, even outlier parameter settings yielded very similar accuracies (roughly 1-3% variation).

However, using logistic regression we were able to identify the relative importance of each predictor variable in determining wins and losses. These weightings could prove useful to build upon in future research, potentially yielding gains in accuracy, recall, and/or precision.

Footnotes

1.^: Wilco, D. (4 Jan 2022) March Madness History - The ultimate guide. NCAA.com https://www.ncaa.com/news/basketball-men/article/2021-03-14/march-madness-history-ultimate-guide
2.^: Goto, K. (7 Apr 2021) Predicting March Madness Using Machine Learning. towardsdatascience.com https://towardsdatascience.com/kaggle-march-madness-silver-medal-for-two-consecutive-years-6207ff63b86c
3.^: Pierce, A., Weininger, L. (21 Apr 2019) How We Predicted March Madness Using Machine Learning. medium.com https://lotanweininger.medium.com/march-madness-machine-learning-2dbacc948874