Projects
Academia
The Correlation Between Alcohol Consumption and University Reputation
COGS 108 moved on from the earlier, broader Python-based courses into a project-based class dedicated to data science analysis. Using linear regression through the statsmodels API (OLS) and other Python packages covered in the course, we aimed to identify whether a correlation existed between the rate of alcohol consumption and a university’s reputation score. Although the analysis would benefit from more data and from more specific data, the project was a valuable introduction to the process and power of regression analysis.
To access the notebook click here
The Makings of a Great Scooby-Doo Show
The COGS 137 course was an introduction to R and its utility in data science and statistics. For our final project, my group took a fun approach: analyzing which factors help predict a higher rating for a Scooby-Doo episode or movie. To do this, we applied forward selection to a linear regression model, looking for the combination of factors that gave the highest adjusted R² value. We ended up doing the forward selection by hand, but in the future I plan to define a function that carries out the process so that more factors can be included in the analysis.
To access the Rmarkdown document click here
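The forward selection described above could be automated along the following lines. This is only a sketch: it is written in Python with NumPy rather than R (the project itself was done in R, by hand), the data is synthetic, and the factor names are hypothetical placeholders. At each step it adds the predictor that raises adjusted R² the most, and it stops when no remaining predictor improves the score.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept via least squares and return adjusted R^2."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y, names):
    """Greedily add the predictor that most improves adjusted R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # Score every candidate model that adds one more predictor.
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves adjusted R^2
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected], best_score

# Synthetic example (assumption): only the first two factors affect the rating.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = 7.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.3, size=150)
names = ["run_time", "monster_count", "year", "snack_count"]  # hypothetical factor names

chosen, score = forward_select(X, y, names)
print(chosen, round(score, 3))
```

Because adjusted R² penalizes each extra predictor only lightly, a pure-noise factor can still be selected now and then; a stricter stopping rule would reduce that.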
Chasing the Perfect Bracket
In the COGS 118A course, we practiced algorithm derivations and their implementation in Python code and packages. For my group’s final project, we explored building a model that would use various statistics to predict the outcomes of March Madness (collegiate basketball playoff) games. To begin this daunting task, we filtered the data down to the statistics we believed would matter most in determining game outcomes, and then fit both a logistic regression model and various gradient boosted tree methods. After scoring each type of model, we compared the two to see which performed better. In the future, it would be beneficial to explore other machine learning models that could be better tuned to this kind of prediction problem, and to wrangle the data more comprehensively.
To access the notebook click here
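A comparison like the one described above might look roughly like this with scikit-learn. The data here is generated by make_classification as a stand-in for game statistics, and GradientBoostingClassifier is just one possible gradient boosted tree implementation; the libraries, features, and scoring used in the project itself may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-game statistics (assumption); label 1 means "first team wins".
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

models = {
    # Scaling helps logistic regression converge; tree ensembles do not need it.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score both models with the same 5-fold cross-validation so the comparison is fair.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

Evaluating both models on identical folds keeps the comparison about the models and not about a lucky train/test split.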
Music with Machine Learning
COGS 118B focused on the mathematical derivation and Python implementation of unsupervised machine learning models. Our group aimed to categorize audio files into their respective genres based on either their audio (WAV file) or spectrogram properties. To do this, we used a hand-written Principal Component Analysis to reduce the dimensionality of the data and spectrogram images, followed by K-Means clustering to attempt to group each song into a genre. Looking at the results and our write-up, it’s obvious that these methods did not produce accurate results, but the complexities and methods of unsupervised models became much clearer. Hopefully this project can be improved further in the future.
To access the notebook click here
To access the writeup click here
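For readers unfamiliar with the pipeline described above, here is a small NumPy-only sketch of PCA followed by K-Means. It is not the project's code: the PCA here uses an SVD of the centered data (the project's hand-written version may have been derived differently), the K-Means loop is a basic implementation written for this example, and the "songs" are synthetic feature vectors drawn around three made-up genre centers.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD of the centered data."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Basic K-Means: alternate nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids), centroids

# Synthetic stand-in for flattened spectrogram features (assumption): 3 "genres", 20 dimensions.
rng = np.random.default_rng(2)
centers = rng.normal(0.0, 5.0, size=(3, 20))
songs = np.vstack([c + rng.normal(0.0, 1.0, size=(50, 20)) for c in centers])

reduced = pca(songs, n_components=2)
labels, centroids = kmeans(reduced, k=3)
print(reduced.shape, np.bincount(labels, minlength=3))
```

K-Means depends on its random starting centroids, so in practice it is usually run several times with different seeds and the run with the lowest within-cluster distance is kept.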