
Who Will Be the Champion? Use This Machine Learning Model to Predict the Winner of PGA Tournaments & Get Better Yourself.

In 2014 I visited the factories of all the golf simulator manufacturers to decide which simulator would be a game changer in the UAE market. I settled on a manufacturer from Canada for two main reasons:


  1. They were already incorporating AI into the coaching realm using the positioning of images and image capture with high speed cameras to analyze a swing. No more "Well, we think this is what's happening in your swing" and then having a different opinion from another coach.

  2. They collected more data in every swing than any other provider and even after 10+ years of use, the other suppliers are just now reaching that level.


With these two capabilities, I knew I could build the ultimate improvement facility. One that can provide tangible results and prove it using numbers. You get better, the number improves, you don't, it goes down. Simple and understandable, proven and without bias.


But this wasn't enough!


So, I decided to learn how to write code in Python and enrolled in a year-long Data Science program at the University of Texas. Now I can build machine learning models, which use statistical algorithms either to predict numbers (if this variable changes, what happens to score?) or to make classifications (given this historical data set, a person with these characteristics will win or will not).


Perfect for Golf for two reasons:


  1. I can use the swing characteristics of all golfers to measure and provide an equation which says: If you change your path from out to in by 3 degrees to in to out by 3 degrees, your scores will drop by x. I can predict how much better you will get by using the data collected from all my customers.

  2. I can study and classify professional golfers and tell which characteristics are statistically significant in predicting a winner. Meaning, I know how to coach a person to have the best chance to win a tournament.


This article is about that journey. I'll try to keep the technical jargon to a minimum.


Problem Statement: Predict whether this golfer will win a tournament based upon a PGA data set of over 36864 row entries across the last 30 years, and 37 column variables.


The variables include many categorical features and many key numerical features which are taken from the PGA Tour website and fed into a Python Machine Learning Environment.


Objective: To run a number of Machine Learning algorithms on the data set, tune the best model to predict the winner on the training data, and then expose that model to unseen test data. We will then measure the Accuracy, Recall, Precision and F1 Score of the model as our performance evaluation metrics.


I will tune the model to provide the highest recall through its learning and output a series of features which will instruct me how to coach winners.


To get an idea of what recall means in this context: recall is the number of True Positives (the model predicts that this player will win the tournament, and the player does win) divided by the sum of True Positives and False Negatives (the model predicts that this player will not win the tournament, but the player actually wins; in other words, the model incorrectly predicts a loss). Combined with a high accuracy level, high recall lets us surmise that our model will do a good job of predicting a winner, and that the model will outline the true variables needed to excel at the sport (where we should focus).
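To make the recall definition above concrete, here is a minimal sketch that computes accuracy, recall, precision and F1 from the four confusion-matrix counts. The function name and the counts are made up for illustration; they are not results from my model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts.

    tp: predicted win, player won      fp: predicted win, player did not win
    fn: predicted no win, player won   tn: predicted no win, player did not win
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: of all actual winners, how many did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: of all predicted winners, how many actually won?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=9, fp=3, fn=1, tn=87)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these example counts, the model catches 9 of the 10 actual winners, so recall is 0.9, even though 3 of its 12 "winner" calls were wrong.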


Data Pre-processing: A key component of Machine Learning is preparing the data so the model can digest the numbers and learn the key relationships between the columns (variables) and the outcome (win/loss). To do so, a Data Scientist must clean, massage and work with the data to remove biases and noise that would otherwise push the model toward incorrect predictions. Skip this step and, from a numeric perspective, we are leading our model statistically and mathematically astray.


We removed players who were 'Cut', 'Withdrawn' or 'Disqualified' from the dataset, as these rows clearly do not contribute to predicting a winner, and we removed categorical variables which have no ability to predict. The removed columns include player name, player ID number, course played, and all other non-numerical, non-predictive variables such as 'purse value.'
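The clean-up described above might look roughly like this in pandas. This is only a sketch: the column names, the finish codes ("CUT", "WD", "DQ") and the tiny table are all assumptions for illustration, not the actual PGA data set.

```python
import pandas as pd

# Tiny made-up table; every column name and value here is hypothetical
df = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "player_id": [1, 2, 3, 4],
    "course": ["X", "X", "Y", "Y"],
    "purse": [8.0, 8.0, 9.0, 9.0],
    "finish": ["1", "CUT", "T12", "WD"],
    "sg_app": [1.8, -0.9, 0.3, -0.2],
    "sg_putt": [1.1, -0.4, 0.2, 0.0],
})

# Drop players who did not complete the tournament
non_finishers = ["CUT", "WD", "DQ"]
df = df[~df["finish"].isin(non_finishers)].copy()

# Build the target variable: 1 if the player won, otherwise 0
df["won"] = (df["finish"] == "1").astype(int)

# Remove identifiers and other non-predictive columns
df = df.drop(columns=["player_name", "player_id", "course", "purse", "finish"])

print(df)
```

What is left is a purely numerical table: the strokes gained columns plus the win/loss target.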


We then started looking at some correlations to simplify our model and to start to understand if the data will be useful for prediction of a winner or not. The below chart is a correlation matrix which shows the numerical columns & their correlation to producing a winner.
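For readers who want to see what sits behind a correlation matrix, here is a small sketch of the Pearson correlation between each numerical column and the win/loss target. The feature names and numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 1 = won, 0 = did not win
won = [1, 0, 0, 0, 1, 0]
features = {
    "sg_app": [2.1, 0.3, -0.5, 0.1, 1.7, -0.8],
    "sg_ott": [0.4, 0.6, -0.2, 0.1, 0.5, -0.3],
}

# Correlation of each feature with the target
correlations = {name: pearson(values, won) for name, values in features.items()}

# Rank features by the strength (absolute value) of their correlation
for name in sorted(correlations, key=lambda k: abs(correlations[k]), reverse=True):
    print(f"{name}: {correlations[name]:+.2f}")
```

A value near +1 or -1 means the column moves closely with winning; a value near 0 means it tells us little on its own.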




We then had a look at the distribution of strokes gained for each predictor variable:





Strokes Gained off the Tee (The Big Dog): We see a fairly normal distribution, but the strokes-lost tail (strokes lost against the field) is slightly longer than for the other variables. Meaning, players lose more strokes to the field here than elsewhere.





Strokes Gained Approach: We see a fairly normal distribution here too. Visually, however, the standard deviation looks higher than for Off the Tee, which suggests this variable could have more influence on the prediction than Strokes Gained off the Tee. That's an important clue.






Strokes Gained around the Green (chipping & bunkers): This has a similar spread to Off the Tee and about half the spread of the Approach.




Strokes Gained Putting: To no surprise, we can see the largest spread of strokes in putting. Meaning, the largest swings up and down are made on the greens, and we should expect this variable to be a key predictor in determining a winner. No shocks, but how much more important is this variable than the others? We're about to find out.


Oversampling/Under-sampling of Data between Target Variable (Won or Not):


When dealing with large data sets in classification, we would expect most of the historical data to contain non-winners, with only a few rows listed as winning. This presents an imbalanced data problem that we need to handle. In this case we run our models on oversampled data, which means we randomly create additional rows classified as winners, so that the winners and non-winners in the entire dataset are split 50/50. This helps our model remove bias towards the non-winners.


We also ran our models on under-sampled data, whereby we simply remove non-winners until their number matches the number of winners in the dataset. We then compared the best models and tuned them using a gradient descent algorithm to find the best results of our model prediction on the test data.
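Here is a minimal sketch of both balancing approaches. One assumption to flag: this version oversamples by duplicating existing winner rows at random, which is the simplest method; other techniques generate synthetic rows instead. The function names and the toy rows are hypothetical, and the sketch assumes both classes are present in the data.

```python
import random


def split_by_class(rows, label_key):
    """Return (minority, majority) lists of rows for a 0/1 label."""
    winners = [r for r in rows if r[label_key] == 1]
    others = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([winners, others], key=len)
    return minority, majority


def random_oversample(rows, label_key, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    extra = [dict(rng.choice(minority))
             for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(rows, label_key, seed=0):
    """Keep a random subset of majority rows equal in size to the minority."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    return rng.sample(majority, len(minority)) + minority


# Hypothetical toy data: 2 winners, 8 non-winners
rows = ([{"sg_app": 1.5 + i, "won": 1} for i in range(2)]
        + [{"sg_app": -0.5 * i, "won": 0} for i in range(8)])

over = random_oversample(rows, "won")
under = random_undersample(rows, "won")
print(len(over), sum(r["won"] for r in over))    # 16 rows, 8 winners
print(len(under), sum(r["won"] for r in under))  # 4 rows, 2 winners
```

A common practice is to resample only the training portion of the data, so the test data still reflects the real-world mix of winners and non-winners.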





Time to run our algorithms. We validate each model on subsamples of the training data a number of times (cross-validation) and average the results to estimate how the finalized model will perform on the test data. The chart below outlines the results of the models run. For brevity, we only show the oversampled data results.
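The idea of validating on subsamples and averaging can be sketched as k-fold cross-validation. To keep it self-contained, the "model" below is a toy threshold rule rather than any of the algorithms I actually ran, and the data is made up.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy model: midpoint between the mean feature value of each class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def cross_validated_recall(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score recall on the held-out fold, then average."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train],
                                  [ys[i] for i in train])
        preds = [1 if xs[i] > threshold else 0 for i in held_out]
        scores.append(recall([ys[i] for i in held_out], preds))
    return mean(scores), scores


# Hypothetical balanced data: 10 winners and 10 non-winners
xs = [1.0 + 0.1 * i for i in range(10)] + [-1.0 + 0.1 * i for i in range(10)]
ys = [1] * 10 + [0] * 10

avg_recall, fold_scores = cross_validated_recall(xs, ys, k=5)
print(f"mean recall over {len(fold_scores)} folds: {avg_recall:.2f}")
```

Each row is held out exactly once, so the averaged score is an estimate of how the model behaves on data it has not seen during training.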




We can see a big performance variation between all the different types of models, but we will select the random forest model as our best model for prediction. Below are the recall results from the model ensemble:


Our model predicts the winner with a recall of 92% on the training data. On unseen data (meaning, if we entered the data into our model after the conclusion of each round of golf and ran a prediction), we would expect the random forest to be correct about 86% of the time. We are still capturing some noise in the data, which is why the result on our training dataset (92%) is a bit higher than on our validation dataset (86%).


We'll Tune that Random Forest Model and we'll come back and see how we do at the end to get our important variables from the model:
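For the curious, here is one common way to tune a random forest for recall with scikit-learn. To be clear about the assumptions: this sketch uses a randomized hyperparameter search, synthetic data from `make_classification`, and a parameter grid I picked for illustration. It is not the exact tuning procedure or data behind the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 20% "winners"); not PGA data
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of random parameter combinations to try
    scoring="recall",   # optimise for recall, as described in this article
    cv=3,               # 3-fold cross-validation on the training data
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_recall = search.best_score_
test_recall = recall_score(y_test, search.predict(X_test))
importances = search.best_estimator_.feature_importances_

print("best parameters:", best_params)
print(f"cross-validated recall: {cv_recall:.2f}")
print(f"test recall: {test_recall:.2f}")
print("feature importances:", importances.round(3))
```

The `feature_importances_` values are the impurity-based importances that a feature importance chart is typically drawn from.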



After running a number of options, we have improved our model recall on the training data slightly. Let's see how we do on our cross validated data sub sets:




Again, a nice little bump to our performance. We're now predicting with a high accuracy and recall, which means we should expect this model in the real world to do its job nicely and predict winners before they win with an accuracy of 90% and a specificity of 88%.


Time to check it on the real test data. Let's apply our model by feeding it the unseen test data. We can see below our performance on the testing data is excellent. We are still predicting the winners at around 90% but our recall or true positive rate is at 96%, which is a very good result.


We can trust our model to output proper features that dictate a winner's performance and work our students towards matching or bettering these strokes gained numbers either in the Grint app or through tournament results calculated manually. This gives us some great direction and confirms a long held truth I see from watching golf, which is that the best iron players are usually the winners!




Conclusion:

Well, we're not going to contest that the best putters usually win, but really when it all comes down to a correlation with the winning position, putting doesn't tell the entire story:


  • We can see that strokes gained on the approach has a direct impact on the result and naturally would have a direct effect on the strokes gained putting.

  • The feature importance chart above shows the way that the random forest is making a determination. The random forest model uses an aggregated average impurity rating to determine how to split each decision tree and make a prediction. As a simple explanation of impurity, the decision tree will look to measure how much each variable (or feature) helps in organizing our data. The word “impurity” means how mixed up or messy things are. If a variable (or feature) helps separate the data into cleaner and more organized groups, it gets a higher “Mean Decrease Impurity” score.

  • Knowing how our decision tree works will help us understand how our model predicted the winner of the tournament. From the above we can see that strokes gained on the approach shot has a higher mean decrease in impurity, which means that the variable has a greater influence on reducing the impurity, or the messiness, of the data. The black line is the standard deviation of the variable. Here we can see that the standard deviation of the two most important features is nearly the same.

  • It's pretty simple logic: you hit it closer to the pin, you will gain more strokes on putts. You hit it further away on the green for your birdie chance, you can bet your strokes gained putting numbers will suffer. When we think about our own games and apply make percentages to certain distances, it is easy to understand that we would make more putts from 3 feet than we would from 12 feet.

  • Your approach shots into the green are now the king. You still need to 'putt for dough' but it all depends on what distance we leave ourselves for that elusive birdie putt.

  • So, in short, if you want to get better really fast, and shave shots from the score, you'd actually be much better off working hard on your approach shots and your pitch shots.

  • Start by working on hitting it close or actually hitting it on the green as opposed to trying to work on getting your short game in order and neglecting how you got near the green.
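For anyone who wants to see the "impurity" idea from the bullets above in numbers, here is a small sketch using Gini impurity, which is one common impurity measure (I am assuming it here for illustration). The feature values and split thresholds are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 is pure, 0.5 is a 50/50 mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1 - p) ** 2


def impurity_decrease(values, labels, threshold):
    """How much a split at `threshold` reduces impurity versus no split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical data: 1 = won, 0 = did not win
won = [1, 1, 1, 0, 0, 0]
sg_app = [1.9, 1.4, 1.1, 0.2, -0.3, -0.8]   # separates winners cleanly
sg_arg = [0.4, -0.2, 0.1, 0.3, -0.1, 0.0]   # winners and others are mixed

clean = impurity_decrease(sg_app, won, threshold=0.5)
messy = impurity_decrease(sg_arg, won, threshold=0.05)
print(f"approach split decrease: {clean:.3f}")
print(f"around-the-green split decrease: {messy:.3f}")
```

In this toy example the approach split sorts winners from non-winners perfectly, so it removes all of the impurity, while the around-the-green split barely helps. A random forest averages decreases like these across all of its trees to score each feature.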
