Pitcher Prognosis: Using Machine Learning to Predict Baseball Injuries
Carl Wivagg, an Insight Health Data Fellow (Winter 2017), obtained his Ph.D. in experimental pathology from Harvard University. He studied antibiotic resistance in bacteria using genomics and machine learning tools. He is now a data scientist at Amazon Alexa.
In the multibillion dollar world of sports entertainment, we often think of injuries as being chance events. I set out to see whether the statistical richness of baseball could be mined to identify players at risk of injury. Some baseball pitchers are paid on the order of a million dollars per game, so the consequences of an injury and a subsequent trip to the Disabled List are immense. Although professional players are placed under a high level of medical scrutiny, I reasoned that the information encoded in performance statistics might add a useful leading indicator of injury risk to the medical toolbox.
Baseball is the data scientist’s dream sport, because nearly every aspect of the game is discrete and quantifiable. Even time, which in other sports goes according to the clock, in baseball is defined by innings, outs, and pitches. Even with all this quantification, it was first necessary for me to properly formulate the question. I chose a classical binary classification format: for each player in each game, I would label that game according to whether it preceded an injury for that player (1) or not (0). Then, I would aggregate the player’s statistics from preceding games and use those as features. The idea is that a coach, a medical support staff member, or even a player him- or herself could then enter the accumulated statistics on a given day (the “intervention point”) into my model and see the likelihood that playing on that day would precede an injury.
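As a minimal sketch of this labeling scheme, consider one pitcher's game log and Disabled List transactions (all dates here are invented for illustration): a game gets label 1 if a DL move falls between it and the player's next appearance.

```python
from datetime import date

# Hypothetical game dates for one pitcher.
games = [date(2016, 4, 3), date(2016, 4, 8), date(2016, 4, 13), date(2016, 4, 19)]

# Hypothetical Disabled List transaction dates for the same pitcher.
dl_stints = [date(2016, 4, 21)]

def label_games(games, dl_stints):
    """Label each game 1 if it is the last appearance before a DL stint, else 0."""
    labels = []
    for g in games:
        # The next appearance after this game (or "never" if it was the last game).
        nxt = min((h for h in games if h > g), default=date.max)
        # This game precedes an injury if some DL move falls before the next appearance.
        labels.append(int(any(g < d <= nxt for d in dl_stints)))
    return labels

print(label_games(games, dl_stints))  # → [0, 0, 0, 1]
```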
Baseball fans will note from the statistics I have chosen in the example that I am focusing on starting pitchers. Other players have different statistics and would constitute an entirely separate machine learning problem. Pitchers are the most impactful choice for a first analysis anyway, both because they are often the most valuable players on a team and because the demanding nature of their task makes them highly susceptible to injury.
Having formulated a suitable question, the next step is data. Major League Baseball statistics are readily available, but records of injury events are harder to come by. Ultimately, I chose a list containing several thousand injury events from mlb.com’s transaction history. Each disabling injury results in a player being moved to the Disabled List, which is a transaction. Unfortunately, players being traded or moving up from the minor leagues are also transactions, so I used regex processing to generate a mostly clean list of about a thousand pitcher movements to the Disabled List. Spot checking revealed no irregularities; every event that passed through the regex filters was indeed injury-related.
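A sketch of the kind of regex filtering involved (the transaction strings below are invented, in the general style of mlb.com's transaction history; the actual patterns used were more elaborate):

```python
import re

# Hypothetical transaction-log entries.
transactions = [
    "Boston Red Sox placed RHP John Doe on the 15-day disabled list. Right shoulder strain.",
    "New York Yankees recalled LHP Sam Roe from Scranton/Wilkes-Barre.",
    "Chicago Cubs traded RHP Jim Poe to Miami Marlins.",
    "Seattle Mariners placed LHP Al Moe on the 60-day disabled list. Elbow inflammation.",
]

# Keep only moves of pitchers (RHP/LHP) onto the Disabled List,
# discarding trades, recalls, and other non-injury transactions.
dl_pattern = re.compile(r"placed (?:RHP|LHP) .* on the \d+-day disabled list", re.IGNORECASE)

injury_events = [t for t in transactions if dl_pattern.search(t)]
for event in injury_events:
    print(event)
```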
Exploratory Data Analysis
It is usually a good idea to explore the data a bit. In my case, the well-structured nature of baseball and prior familiarity with the dataset had assured me that my data were relatively clean, so the most urgent question confronting me was whether game statistics in fact contained any predictive information at all in relation to injuries. I started with one of the simplest statistics of all: a player’s age at the time of the game preceding his injury (or non-injury). It seemed intuitive to me that older players would be more susceptible to injury; although in many careers, the early forties are a highly productive time, the extreme physical demands of baseball mean that few players can continue to perform at the professional level that long. Since injury is a failure mode associated with physical stresses, older players should have more injuries. Indeed, that is exactly what I saw (see figure below): relative to the “not injured” events, “injured” events are right-skewed. The effect size is modest, but because of the large number of events, statistically significant at p < 0.0001.
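A comparison like this can be tested without distributional assumptions using a simple permutation test on the difference in mean ages. The sketch below uses synthetic ages (the real dataset is not reproduced here), with the injured group drawn slightly older by construction:

```python
import random
import statistics

random.seed(0)

# Synthetic ages for illustration only: injured games drawn ~1 year older on average.
not_injured = [random.gauss(28.0, 4.0) for _ in range(2000)]
injured = [random.gauss(29.0, 4.0) for _ in range(300)]

observed = statistics.mean(injured) - statistics.mean(not_injured)

# Permutation test: shuffle the pooled ages and see how often a random
# split produces a difference at least as large as the observed one.
pooled = injured + not_injured
n_perm = 2000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:len(injured)]) - statistics.mean(pooled[len(injured):])
    if diff >= observed:
        count += 1

p_value = (count + 1) / (n_perm + 1)
print(f"observed difference: {observed:.2f} years, p ~ {p_value:.4f}")
```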
Many other statistics have similar correlations with injury. One of the most predictive and interesting is “Innings Pitched”, which is related to how long a pitcher spends in a game and how many pitches they throw.
Surprisingly, a high number of innings pitched is a predictor that a player is relatively safe from injury; I had expected that, to the contrary, a high number of innings pitched would constitute overwork and lead to eventual breakdown and injury. One possible explanation for this counterintuitive finding is that nagging undetected conditions that might eventually result in injuries impede performance in the games preceding an injury; poor performance in turn leads to the coach benching the player. It is not the case that injury causes a lower number of innings pitched in that game, because the aggregation window for the features is separated by a full played game from transfer to the Disabled List (see initial image).
To hone the predictive power of my features, first I generated new features by applying different aggregation windows: for each player, I created separate features for each performance metric for one game preceding the intervention point, for the average of seven games preceding the intervention point, and for the player’s entire career. I also created separate features for the percent deviation of each single game value and seven-game average value from the career total.
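A sketch of this windowing in pandas (column names and values are invented, and the single statistic shown stands in for the full set; I shift by one game here so each row's features use only prior games, though the exact offset in the real pipeline may differ):

```python
import pandas as pd

# Hypothetical per-game log for one pitcher.
games = pd.DataFrame({
    "player": ["A"] * 5,
    "innings_pitched": [6.0, 7.0, 5.0, 6.0, 2.0],
})

g = games.groupby("player")["innings_pitched"]

# One-game, seven-game, and career aggregation windows, each ending
# just before the intervention point.
games["ip_last_game"] = g.shift(1)
games["ip_7game_avg"] = g.transform(lambda s: s.shift(1).rolling(7, min_periods=1).mean())
games["ip_career_avg"] = g.transform(lambda s: s.shift(1).expanding().mean())

# Percent deviation of the recent window from the career baseline.
games["ip_7game_pct_dev"] = (
    100 * (games["ip_7game_avg"] - games["ip_career_avg"]) / games["ip_career_avg"]
)
print(games)
```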
Second, it was necessary to decorrelate the features. It should come as no surprise that many of the statistics a player generates during a game are highly mutually correlated. For instance, a player with a high number of innings pitched will also have a high number of pitches, a high number of outs, and higher numbers of the various particular types of outs. To avoid the cardinal machine learning sin of fitting a multicollinear set of features, I normalized each feature to an appropriate reference feature. For instance, I divided the number of groundball outs by the number of outs, and the number of hits by the number of batters faced. This method not only reduced multicollinearity; it also gave me more meaningful features. “Fraction of groundball outs” contains more information about a pitcher’s style than the total number of groundball outs, which could be high either because the pitcher was in the game for a long time or because they frequently throw groundball outs.
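The normalization itself is straightforward; a minimal sketch with hypothetical count names (the real feature set used many more ratios):

```python
# Hypothetical raw single-game counts for one start.
raw = {"outs": 18, "groundball_outs": 8, "flyball_outs": 6,
       "batters_faced": 25, "hits": 5}

def ratio_features(raw):
    """Replace correlated raw counts with rate features normalized to a reference count."""
    outs = max(raw["outs"], 1)            # guard against division by zero
    faced = max(raw["batters_faced"], 1)
    return {
        "groundball_out_frac": raw["groundball_outs"] / outs,
        "flyball_out_frac": raw["flyball_outs"] / outs,
        "hit_rate": raw["hits"] / faced,
    }

print(ratio_features(raw))
```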
Additionally, I had one more aspect of pitchers’ performance that I wanted to account for: pitching style. The pages of baseball literature and commentary are filled with accounts of power pitchers, knuckleballers, sinkerballers, and more. For a relatively casual baseball fan like myself, it is difficult to draw consistent, distinct categories of pitching style from expert commentary or from the statistical data that I had already collected. And so I turned to natural language processing.
I located a reasonably complete database of the pitching styles of current pitchers and used standardized techniques to treat the descriptions as bags of words, lemmatize, and vectorize them. Having no strong preconceptions about how many pitching styles there might be, and given the limited time available, I turned to an extremely simple technique: K-means analysis.
I projected the term frequency vectors I had created, which had a dimensionality on the order of the total number of terms present, onto a two-dimensional space using multidimensional scaling, which is meant to preserve the approximate relation of each of the pitcher descriptions to all of the others. Initially, I separated the descriptions into two means to see if there was any obvious topical difference between the terms associated with one of the means compared to the other. I did not see any, so I added in a third mean.
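The whole pipeline can be sketched in a few lines of scikit-learn. The descriptions below are invented stand-ins for the pitching-style database, and I use TF-IDF weighting here as one reasonable choice of term frequency vectorization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

# Hypothetical scouting-style descriptions standing in for the real corpus.
descriptions = [
    "power pitcher who generates whiffs and swings and misses with a high fastball",
    "sinkerballer who induces weak groundballs and more groundballs with a heavy sinker",
    "fly ball pitcher whose riding fastball produces flyballs and popups",
    "strikeout artist racking up whiffs per swing with a sharp slider",
    "groundball machine pounding the bottom of the zone for groundball outs",
    "extreme flyball profile with flyballs hit to deep parts of the park",
]

# Unigrams through trigrams, as in the analysis described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(descriptions)

# Cluster the descriptions into three candidate pitching styles.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to two dimensions for plotting; MDS tries to preserve
# the pairwise distances between descriptions.
coords = MDS(n_components=2, random_state=0).fit_transform(X.toarray())
print(labels)
print(coords.shape)
```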
Now, the means started to make sense. First, the observant reader will note that the third mean, colored purple, drew descriptions almost entirely from the population of descriptions that had previously been in the second mean. This indicates that the first two means were well separated in multidimensional space: if they had been close together or overlapping, we might have expected a third mean to draw points more equally from both initial means. Thus, we can be confident that we are separating the populations in a meaningful way.
Second, there is now a somewhat intuitive meaning to the terms associated with the three populations. The first has “flyballs”, the second has “groundballs”, and the third has “whiffs/swing”, a baseball term for a swing-and-a-miss, which usually leads to a strikeout. Thus, we have means associated with the three different types of out in baseball. These term associations were robust through several of the top terms associated with each mean: particularly for the groundball mean, the top three terms all contained the word “groundballs”. A single word can appear in more than one top term because I counted bigrams (pairs of words occurring together) and trigrams as well as single words when building the term frequency vectors.
Given the tight timeline, I chose a random forest as a quick preliminary method for building a base model. It offers several advantages for the problem I was trying to solve: 1) it doesn’t require labor-intensive feature scaling; 2) it is robust to outliers; 3) it is sensitive to interactions between variables.
I optimized the random forest hyperparameters to maximize the area under an ROC curve, which has two characteristics that make it better than accuracy score for this sort of situation: 1) the value of this metric is still meaningful with greatly imbalanced datasets － and there are many more games preceding noninjuries in baseball than games preceding injuries － and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa. The area under an ROC curve metric does not require me to know in advance where I will set the threshold for identifying players at risk of injury.
The hyperparameters I focused on were the number of features each decision tree could choose from at each step in its creation and the maximum depth of those trees, or the total number of features that could be used in the classification of a single point. I used a grid search to explore all possible combinations of low integer values for these two hyperparameters, settling on an optimum value of three to four features for each. I also optimized the number of decision trees in my random forest; although I saw little increase in performance beyond 300 trees, I settled on 1,000 because compute time was not limiting and having redundancy within the forest would not be expected to harm model performance. This range agreed with various sources of expert advice.
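A sketch of this grid search in scikit-learn, on synthetic data standing in for the real feature matrix (the hyperparameter ranges mirror the ones described above, but I use 100 trees here rather than 1,000 to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix: imbalanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (rng.random(400) < 0.1).astype(int)  # ~10% "injury-preceding" games
X[y == 1] += 0.5                         # give the positive class some signal

param_grid = {"max_features": [2, 3, 4, 5], "max_depth": [2, 3, 4, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    scoring="roc_auc",   # robust to class imbalance, threshold-free
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```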
My ultimate area under the ROC curve for a withheld test set was 0.72: substantially below a perfect 1.00, but also substantially above the 0.50 that would be expected from random guessing. Considering that I did not know at the outset whether athletes’ statistical performance metrics would contain any predictive information about injury at all, I was very happy with this result!
A final, and critical, step in any machine learning project is to prepare the model and findings for presentation or deployment in a way that is useful and meaningful to the intended audience. In this case, I wished the project to be persuasive and usable by all athletes, both those with PhDs in mathematics and those struggling to complete high school. Having provided a nice statistical metric conveying my model’s performance, I thought it would be useful to audiences of all backgrounds to have a graphical representation of the model’s performance. I chose to investigate whether the model generated an uptick in injury risk scores for individual players in the games leading up to a stint on the Disabled List. I picked four random players and calculated their injury scores for each game in the season they got injured.
All four indeed displayed high injury scores leading up to the injury; above is presented the one with the sharpest uptick. Two other players had similar but mildly noisier trends, while the fourth player had a consistently high injury score.
A Web Application for Public Interaction with the Model
More than arguing for the model’s validity, I wished to offer baseball enthusiasts and professionals of all stripes an easy way to use and understand the model. I therefore deployed my model on an AWS instance using Flask and Green Unicorn, making it available to the public at http://www.baseballinjurypredict.tech/ .
In this screen capture, after a player or coach has entered values for each of the features that the model uses, the model outputs an assessment of the player’s risk of injury. The “injury score” output by the random forest is notionally the probability that a particular set of feature values indicates an impending injury, or more precisely the average of that probability across all of the decision trees in the forest. Depending on how one handles the class imbalance in the injury prediction problem, however, this interpretation is not exact.
Extract Insightful Information from the Model
The final stage of a machine learning problem is to produce a clear, useful, and interpretable result. To avoid forcing baseball players and coaches to deal with the intricacies of random forest output, the web application I designed compares the injury score for a given player’s input to all of the scores in the database used for the modeling and outputs the player’s injury score percentile, which should be readily understandable to many people.
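The percentile lookup itself is a one-liner over the stored scores; a minimal sketch (the score values below are invented):

```python
import bisect

def injury_score_percentile(score, historical_scores):
    """Percent of historical injury scores strictly below the given score."""
    ordered = sorted(historical_scores)
    return 100.0 * bisect.bisect_left(ordered, score) / len(ordered)

# Hypothetical database of injury scores from the modeling set.
history = [0.05, 0.10, 0.15, 0.20, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
print(injury_score_percentile(0.65, history))  # → 60.0
```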
Some users may distrust what seems like a data science black box, and to provide more persuasive analysis or explanation, I also use nearest neighbors analysis to identify games similar to the user’s entered values. The application presents an equal number of games that resulted in injuries and did not, with the idea that the user can evaluate by eye how similar his own feature values are to those in each class; moreover, the nearest neighbors analysis offers some insight into which of the features may be driving the random forest’s output in this particular case.
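One way to implement this class-balanced lookup is to run a separate nearest-neighbor search within each class; the sketch below uses synthetic stand-ins for the real feature matrix and labels:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-ins for the historical feature matrix and injury labels.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 5))
injured = rng.random(200) < 0.2

def similar_games(query, features, injured, n_per_class=3):
    """Return indices of the closest injured and non-injured historical games."""
    picks = {}
    for label in (True, False):
        mask = injured == label
        # Fit a neighbor search on just this class, then map the
        # within-class indices back to positions in the full matrix.
        nn = NearestNeighbors(n_neighbors=n_per_class).fit(features[mask])
        _, idx = nn.kneighbors(query.reshape(1, -1))
        picks[label] = np.flatnonzero(mask)[idx[0]]
    return picks

query = rng.normal(size=5)
neighbors = similar_games(query, features, injured)
print(neighbors)
```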
My modeling of pitcher injury risk raises several interesting questions. In particular, the anticorrelation between innings pitched and injury risk, which persists over all aggregation windows, is an intriguing finding that the expertise of professional athletes or coaches might be able to clarify. More importantly, it offers a proof-of-concept for the possibility of rationally weighing a player’s age and his recent and long-term performance characteristics to assess injury risk. Such modeling can help not just professional players, but also youth and amateur athletes across the globe without access to the same level of medical scrutiny as the professionals.