Injury Analysis Based on Machine Learning in NBA Data

It is well known that injuries have a strong influence on NBA matches and can reverse the expected result between two teams with a wide disparity in strength. In this article, in order to reduce the uncertainty about injury risk in an upcoming match, we propose a pipeline that gathers data at the player level, including fundamental statistics and performance in the preceding matches, and data at the team level, including basic information and the status of the opponent team in the match we predict on. Constrained by limited and extremely unbalanced data, our model showed limited power for injury prediction in general, but it gave reasonable results for injuries to a team's star players. We also analyze the contribution of each factor to our prediction, which shows that a player's own performance matters most for injury. Principal Component Analysis is also applied to reduce the dimension of our data and to show the correlation between features.


Introduction
With the close of the 2019 NBA Finals, the Golden State Warriors were defeated by the Toronto Raptors 2:4, which signaled the breakdown of a dynasty. During the series, injuries coming one after another tore the Warriors apart, leaving them unable to withstand the Raptors. In addition, injuries may also harm the assets of team owners. Therefore, there is great demand for a classifier that monitors and analyzes players' injury risk in real time for the upcoming match if they are on the court. With such a system, the coach can arrange the rotation of athletes flexibly.
Previous studies mainly focused on analyzing individual factors related to player injury. In the work of Petty DH, 481 youth baseball pitchers were interviewed, and participants who pitched more than 100 innings in a year were found to be 3.5 times more likely to be injured. In the work of Croisier J L, the rate of muscle injury was found to be related to untreated strength imbalances in athletes. These excellent works are illuminating; however, most of them are confined to a specific aspect and cannot give a complete picture of what leads to injury.
In our work, we build a pipeline that gathers data on diverse aspects, trains a prediction model, and analyzes the contribution of different features in order to filter out the most significant ones. Our data can be divided into four parts: the fundamental data of the player, the fundamental data of the team, the information about the opponent in the next match, and the player's performance in the matches during the week before the match of interest.
We then use Random Forest, a machine learning method, to measure the importance of these factors. It turns out that the average points in the matches before the next match, the average minutes and the total number of games the player participated in, age, and weight are significant features. Surprisingly, the status of the upcoming opponent counts less, as does the number of matches played in the preceding week. We then use Principal Component Analysis (PCA) to reduce the dimension of our model, which shows that many variables are highly correlated. Finally, we run some prediction trials; although the model does not perform very well, it still shows some effectiveness. We hope our model can serve as a reference for coaches and managers in the Association to keep their players from injury.

Background
The relationship between injuries and training status has been widely studied in sports. In a 10-year follow-up study, researchers interviewed 481 youth baseball pitchers (aged 9 to 14 years) annually. Fisher exact tests were used to investigate the risk of injury for pitching more than 100 innings in at least 1 calendar year, starting curveballs before age 13 years, and playing catcher for at least 3 years. They report that participants who pitched more than 100 innings in a year were 3.5 times more likely to be injured [3].
In soccer, researchers used a standardized concentric and eccentric isokinetic assessment to identify players with strength imbalances, and found that the rate of muscle injury was significantly higher in players with untreated strength imbalances than in players showing no imbalance in preseason [4]. In the NBA, a descriptive epidemiological study found no correlations between injury rate and player demographics, including age, height, weight, and NBA experience [5].
Apart from studies based on traditional statistical methods in prospective designs, Alessio Rossi and colleagues use machine learning to provide effective injury forecasting in soccer from GPS data. Their classifier can detect 80% of the injuries with about 50% precision, and it offers a good trade-off between accuracy and interpretability [6]. Since that paper shows that some aggregate workload information, such as the distance in meters covered during a training session, is essential, it gave us the idea of applying machine learning with more immediate information, such as a player's most recent on-court statistics.

Method
The Random Forest classifier is an ensemble machine learning method (two other well-known ensemble algorithms are boosting and bagging [7]) widely applied in classification. It consists of a combination of tree classifiers (another kind of machine learning method, such as C4.5, in which every variable can be regarded as a node in a tree [8]), where each classifier is generated using a random vector sampled independently from the input vector, and each tree casts a unit vote for the most popular class to classify an input vector. The random forest classifier used for this study grows each tree using randomly selected features and uses the Gini index as the attribute selection measure, which measures the impurity of an attribute with respect to the classes. For a given training set T, selecting one case at random and saying that it belongs to some class C_i, the Gini index can be written as:

$$\sum\sum_{j \neq i} \left( f(C_i, T) / |T| \right) \left( f(C_j, T) / |T| \right)$$

where f(C_i, T)/|T| is the probability that the selected case belongs to class C_i.
Each time, a tree is grown to the maximum depth on new training data using a combination of features. These fully grown trees are not pruned. As the number of trees increases, the generalization error always converges even without pruning the trees, and overfitting is not a problem because of the Strong Law of Large Numbers. The number of features used at each node to generate a tree and the number of trees to be grown are the two user-defined parameters required to generate a random forest classifier. At each node, only the selected features are searched for the best split. Thus, the random forest classifier consists of N trees, where N is the number of trees to be grown, which can be any value defined by the user. To classify a new dataset, each case of the dataset is passed down each of the N trees. The forest chooses the class that receives the most of the N votes for that case.
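The Gini index and the unit-vote rule described above can be illustrated with a minimal sketch. It uses the standard identity that the sum of p_i p_j over all pairs of different classes equals 1 minus the sum of p_i squared; the function names (gini_impurity, split_gini, forest_vote) are hypothetical helpers chosen for illustration and are not taken from the paper's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_i p_i^2 (the sum over i != j of p_i * p_j)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def forest_vote(tree_predictions):
    """Return the class that receives the most unit votes from the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

A node whose cases all belong to one class has impurity 0, and a tree prefers the split with the lowest size-weighted impurity among the randomly selected features.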
We choose random forest as our main method for this analysis for two reasons. First, it is a relatively stable machine learning method with some resistance to unbalanced data. Second, its models remain interpretable, because the algorithm derives from CART, another method known for its interpretability. We can use this interpretability to do some factor analysis of our features.

Original Data
To gather the data needed for this analysis, we collect injury data from pro
Some basic introduction of the data is as follows:
1) We only adopt the data of Season 2016 for analysis.
2) The threshold for players we pay attention to is that they must have an av-
We can tell that the data are completely unbalanced: there are only 27 injury cases among 13,975 cases in total. We could not rely on them for accurate prediction even if we neglected the latent off-court variables that play a significant role in players' injuries. However, we can still do some factor analysis with them.
A sample of the data is shown below; the meaning of each feature name is introduced in the following section.
2) The summary of the team (self and the opponent).
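To make the degree of imbalance concrete, the sketch below computes the injury rate from the two counts reported above (27 injuries among 13,975 cases) together with inverse-frequency class weights. The weighting rule n_samples / (n_classes * class_count) is a common heuristic, the same one scikit-learn applies for class_weight="balanced"; it is shown here as an assumption about how such data could be handled, not as a step the paper reports using.

```python
def class_balance(n_total, n_positive):
    """Return the positive rate and inverse-frequency weights for a binary problem."""
    n_negative = n_total - n_positive
    positive_rate = n_positive / n_total
    # Heuristic: weight = n_samples / (n_classes * class_count), with n_classes = 2.
    weights = {
        0: n_total / (2 * n_negative),  # not injured
        1: n_total / (2 * n_positive),  # injured
    }
    return positive_rate, weights


# Counts taken from the training data described in the text.
rate, weights = class_balance(n_total=13975, n_positive=27)
print(f"injury rate: {rate:.4%}")
print(f"class weights: {weights}")
```

Under this heuristic an injured case would be weighted several hundred times more heavily than a non-injured one, which illustrates how little signal the positive class carries.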

Data Processing and Data Introduction
The raw data are not suitable for feeding directly into the model. We should first transform them into features useful for our research. The preprocessing in this section mainly relies on a sliding window with a moving weighted average and on missing value imputation.

Players' Performance in Matches
Since we pay attention to factors that arise just before the match, we consider the matches that the player played shortly before the match of interest.
For convenience, we assume that all the matches in the preceding week are important, and we adopt a method from time series analysis called the moving weighted average [11]. In this method, we assume that the weights of the performances in each match decay exponentially: the closer a match is to the focused match, the larger its weight. We slide this window through all the matches that a player played in the season to produce our processed data, as shown in Figure 2. Besides that, we extract other useful information from the rolling data flow. For example, we can obtain whether there are back to back matches within the seven days, how many days the player was off the court, and how many home and away matches the player attended, respectively. The detailed introduction is shown in Table 1. There are also some advanced indexes that combine fundamental data into a single index summarizing a specific behavior. For example, USG% measures the percentage of team plays used by a player while he was on the floor. The detailed introduction of these features is given in Table 2.
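The sliding-window features described above can be sketched as follows. The weighting form exp(-decay * days_before) and the decay value 0.3 are assumptions chosen for illustration, since the text does not give the exact weights; the function names and the (date, value) input layout are likewise hypothetical.

```python
from datetime import date
from math import exp


def weighted_recent_average(games, target_day, window_days=7, decay=0.3):
    """Exponentially weighted mean of a statistic over the days before target_day.

    games is a list of (date, value) pairs. Only games played 1..window_days
    days before target_day are used; more recent games get larger weights.
    Returns None when the player has no game in the window.
    """
    total, weight_sum = 0.0, 0.0
    for day, value in games:
        gap = (target_day - day).days
        if 1 <= gap <= window_days:
            w = exp(-decay * gap)  # assumed weighting, not the paper's exact one
            total += w * value
            weight_sum += w
    return total / weight_sum if weight_sum else None


def has_back_to_back(games, target_day, window_days=7):
    """True if two games in the window were played on consecutive days."""
    days = sorted(d for d, _ in games if 1 <= (target_day - d).days <= window_days)
    return any((b - a).days == 1 for a, b in zip(days, days[1:]))


# Hypothetical points scored by one player before a match on 7 November 2016.
games = [(date(2016, 11, 1), 10.0), (date(2016, 11, 5), 30.0), (date(2016, 11, 6), 20.0)]
target = date(2016, 11, 7)
print(weighted_recent_average(games, target), has_back_to_back(games, target))
```

Setting decay to 0 recovers the plain average over the window, which makes it easy to check how much the recency weighting changes a feature.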

The Self and Opponent Team's State
Besides the player's state, we should also consider the team level, which includes not only the team the player belongs to but also the team he will face in the next match, since both could contribute to the player's injury in some cases. For example, if the team plays at a fast pace, some players may get hurt during rapid transitions. If the opponent is strong defensively, the player may also get hurt more easily. Therefore, for the player's own team and the opponent team, we consider different indexes, many of which are assessed by ESPN [12]. All the variables are listed in the Appendix (Table A1).

The Variable Importance
We used the permutation test in random forest to evaluate the importance of all the features in our model. The result is shown in Figure 3. It is not surprising that a player's performance in recent days is very important for the risk of injury. In fact, the most recent indexes are more important than the other three parts, which are general descriptions. We also summarize the average importance of each part in Figure 4. The overall information about a player is influential too, but the team they will face in the next match is the least significant.
From this chart, we can tell that:
1) When a player attempts more shots, especially 3-point shots, they are more likely to get hurt.
2) The more time players spend on the court, the greater their chance of getting hurt.
3) Interestingly, among the player's own information, DWS, an advanced index indicating the number of wins contributed by a player through his defense, is associated with the chance of injury. We can speculate that fierce defense may cause damage to the defending player himself.
4) Weight and height are not as important as expected.
5) The frequency of the schedule is not equally important. Even back to back games may not lead to a high risk of injury.
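The idea behind permutation importance can be shown with a small model-agnostic sketch: shuffle one feature column at a time and record how much the accuracy drops. This is an illustration only; the random forest variant used in the paper operates on the forest itself, and the predictor, feature layout, and data below are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the predictor matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical rows: [average minutes played, opponent rank].
rows = [[36, 3], [34, 20], [38, 11], [33, 7], [12, 5], [18, 25], [9, 14], [22, 1]]


def toy_model(row):
    """Stand-in predictor that looks only at minutes played."""
    return int(row[0] > 30)


labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

A feature the model never uses, such as the opponent rank in this toy setup, receives an importance of exactly zero, while shuffling a feature the model relies on lowers the accuracy.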

Factor Analysis
Our model contains too many variables, and many of them are highly correlated. For instance, a player who plays more minutes clearly has more chances to attempt field goals and score more points. So, when we use PCA or similar techniques to reduce the number of variables, we can present our model more concisely.
The variables can be divided into 10 more significant dimensions, as shown in Figure 5. Some variables that contribute to the first dimension are exactly the ones that are important in influencing player injury, as shown in Figure 6. Points_Avg, Mp_AVG, FGA_AVG … are all important variables in our Random Forest model, and they are highly correlated, as we predicted. However, opp_Rk and opp_SRS are two variables essential to the third dimension, but they contribute least in the random forest model.
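The effect of correlated variables on the principal components can be reproduced with a small sketch on synthetic data. The three columns only borrow the roles of Mp_AVG, Points_Avg, and opp_Rk; the values are randomly generated for illustration and are not the paper's data. PCA is computed here from the singular value decomposition of the standardized matrix.

```python
import numpy as np


def pca_explained_variance(X):
    """Return (explained variance ratios, principal axes) of standardized data."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    variance = s ** 2  # proportional to the eigenvalues of the correlation matrix
    return variance / variance.sum(), vt


rng = np.random.default_rng(0)
minutes = rng.normal(30, 5, size=200)                    # stands in for Mp_AVG
points = 0.6 * minutes + rng.normal(0, 1, size=200)      # tracks minutes closely
opp_rank = rng.integers(1, 31, size=200).astype(float)   # unrelated to the player
ratio, components = pca_explained_variance(np.column_stack([minutes, points, opp_rank]))
print(ratio)
print(components[0])
```

Because minutes and points move together, the first component absorbs most of their shared variance, while the unrelated opponent variable loads mainly on a separate component.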

Prediction
We collected the corresponding NBA data from 2017 as our testing set, which contained 10,199 items, of which only 15 were injuries on the court; it is therefore even more unbalanced than the training set. As expected, our model did not perform well on the test dataset, since the data are so unbalanced after cleaning. However, even in this situation, the model still predicts some injury events correctly, and most of them involve star players, who deserve more attention than others, as can be seen in the Appendix (Table A2). This suggests that this method of analyzing on-court injuries has some practical value.
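A short sketch shows why plain accuracy is misleading on a test set this unbalanced. Only the class counts (15 injuries among 10,199 items) come from the text; the baseline that always predicts "not injured" is a hypothetical reference point, not the paper's model.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = injured."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Class counts from the 2017 test set; predictions are a trivial baseline.
y_true = [1] * 15 + [0] * (10199 - 15)
always_healthy = [0] * len(y_true)
acc, prec, rec = confusion_metrics(y_true, always_healthy)
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f}")
```

The trivial baseline is right on more than 99.8% of items yet finds no injuries, so recall and precision on the injured class are more informative measures here.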
W. W. Wu Journal of Data Analysis and Information Processing

Conclusions & Discussion
In our work, we use a machine learning method, specifically Random Forest, to build a model that analyzes the latent factors correlated with player injuries in the NBA. We find that injury is highly correlated with the player's performance in the most recent matches he has played. Players who play more actively and attempt more shots are more exposed to injury risk in the next match, which is something coaches can pay attention to. Besides that, we also rule out some factors that seem to be related to injuries, such as weight and height. More counterintuitively, an intensive schedule does not seem to result in injury either: even when a player has back to back games, his chance of injury remains constant.
We also reduce the number of variables in our model. We find that some of the most important variables are highly correlated with each other, which means we do not need to consider all of them simultaneously but can combine them into new factors, which makes the model more concise without losing interpretability.
As we have said, the dataset is very unbalanced, and injury events are rare compared with the non-injury items. Others can therefore try to use data from more past seasons to train a more sensible and accurate model for predicting future injury events. From another perspective, there may exist other factors we neglected that count more in player injuries. Since getting hurt is such a rare event, we can hardly collect as many factors as we would need for an ideal result; however, the more we know, the better we can predict. Such a method can also be used in other fields, especially in one-to-one sports such as tennis and badminton. In these sports there are fewer variables to consider, so we can speculate that this method could be applied with better results.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.