An Empirical Study of Downstream Analysis Effects of Model Pre-Processing Choices

This study uses an empirical analysis to quantify the downstream analysis effects of data pre-processing 
choices. Bootstrap data simulation is used to measure the bias-variance 
decomposition of an empirical risk function, mean square error (MSE). Results 
of the risk function decomposition are used to measure the effects of model 
development choices on model bias, 
variance, and irreducible error. Measurements of bias and variance are then 
applied as diagnostic procedures for model pre-processing and development. Best 
performing model-normalization-data structure combinations were found to 
illustrate the downstream analysis effects of these model development choices. In additions, 
results found from simulations were verified and expanded to include additional 
data characteristics (imbalanced, sparse) by testing on benchmark datasets 
available from the UCI Machine Learning Library. Normalization results on 
benchmark data were consistent with those found using simulations, while also 
illustrating that more complex and/or non-linear models provide better 
performance on datasets with additional complexities. Finally, applying the 
findings from simulation experiments to previously tested applications led to 
equivalent or improved results with less model development overhead and processing 
time.


Introduction
Introduction Popularized in the work by David Holpert and William Macready, the No Free Lunch (NFL) Theorem states that no single machine learning algorithm is better than all the others on all problems [1]. Other researchers have tried multiple models to find one that works best for a problem. In fact, studies by Carp [2] [3] illustrate effects on research findings in functional MRI (fMRI) studies due to variations in analytic strategy, with increased model flexibility leading to higher rates of false positive results. Wagenmakers, et al. [4] point out that many studies in psychology do not commit to an analysis method before seeing the data, with some researchers fine-tuning their analysis to the data, proposing that researchers "preregister their studies and indicate in advance the analyses they intend to conduct" in order to be considered as "confirmatory" research, rather than as "exploratory". A study published in May 2020 expands on Carp's findings, noting that fMRI analyses conducted on the same data by seventy different laboratories produced a wide range of results [5]. This particular study highlighted the fact that fMRI analysis requires several stages of pre-processing and analysis to determine which areas of the brain show activity. They found that the choice of pre-processing pipeline led to widely varied results. Among the seventy study teams, no two teams selected the same pipeline. Figure 1 illustrates the potential implications of varying pipeline choices in neuroimaging.
Perhaps the most illustrative lack of research consistency is the study by Silberzahn, et al. [6] which recruited 29 independent research teams with 61 analysts to address the question, "Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?" The research teams represented 13 countries, a variety of disciplines, and a range of expertise and academic degrees. Using the same dataset and research question, the 29 teams utilized 29 unique analytical modeling approaches resulting in 21 unique combinations of covariates, 20 teams with significant positive results, and odds ratios ranging from 0.89 to 2.93, as in Table 1. To say the least, analytic choices, even if justifiable and statistically valid, have a downstream effect on model results. The statistical model development framework can be generally divided into Figure 1. Researchers process neuroimaging data using a wide variety of pipelines, which can produce varying results. Making different choices for each step leads to a different end point-the red dots represent how activation moves throughout the brain depending on which pipeline is used [5]. three phases: data discovery, variable preparation, and modeling. Within each of these phases there are steps in the model development that encompass a wide range of data management, data mining, and data analysis techniques, including data ingestion, sample selection, data cleaning and imputation, feature reduction, feature engineering, normalization, model development, and model validation.
Analyzing the downstream effects of modeling approaches within each of these steps will allow for statistically motivated modeling choices in the future. Quantifying the analysis effects of these strategies provides a diagnostic illustration of where researchers can expect to find improvements in their model results.
This study quantifies the downstream analysis effects of data pre-processing choices by utilizing a decomposition of the loss functions, measuring effects on model bias, variance, and irreducible error/random noise. In this way, measurements of bias and variance can be efficiently applied as diagnostic procedures for model pre-processing and development. Applying bias-variance decomposition to a variety of data distributions and model types can lead towards an improved understanding of quantitative variations within model development methods as well as comparing results consistently between methods. Understanding of statistical bias and variance can be used to diagnose problems with machine learning bias and develop methods for reducing bias and variance in algorithms. For example, this bias-variance trade-off does not always behave as expected under distributional assumptions. Even with the availability of more advanced models, such as neural networks, simple models still often perform well, or even better than more complex models, in experiments [7]. Generally, while more complex models result in decreased bias, they tend to increase variance and, therefore, do not generalize well to new data [8]. However, it has been found that ensemble models, although complex, often outperform single models and this seems contradictory to the trade-off between simplicity and accuracy. In this case, decomposition of bias-variance for ensembles led to the understanding that while increased complexity for a single model often increases variance, averaging multiple models will often (but not always) lead to decreased variance [9]. The goal, therefore, of understanding the effects on bias-variance decomposition is to quantify the downstream analysis effects of a selection of model development choices. Using these results as diagnostic procedures can lead to improved model development performance and consistent, reproducible results across data types and domains.

Quantifying Bias-Variance Trade-Off
We measure the effects of various normalization methods on the bias-variance decomposition of the risk function by directly simulating the definition of the decomposition under varying conditions. We use information found from these simulations to quantify the downstream analysis implications for the predictive models of interest. Since "an important goal in algorithm design is to minimize statistical bias and variance and thereby minimize error [10]," we use our findings to propose pre-processing and algorithm design choices that best minimize common design effects on bias and variance. For example, "any change that increases the representational power of an algorithm can reduce its statistical bias. Any change that expands the set of available alternatives for an algorithm or makes them depend on a smaller fraction of the training data can increase the variance of the algorithm [10]." The result of such a study is to formulate a theory of bias and variance reduction and predict when either or both will succeed in practice.

Data Normalization
In this context we consider normalization to include data scaling techniques such as normalization and standardization. Data can be scaled so that features measured on different scales can be compared. Probability distributions of features can also be adjusted to be in alignment with each other. Another method is to shift and scale the features (standardization), which removes the units of measure. The primary goal of normalization is to scale each data point in a way that gives equal weight to the features to be used in developing a model. We consider four within-feature normalization methods including standardization, min-max, max absolute value (maxAbs), and quantile transformation, and one between-feature normalization, quantile normalization.

Z-Score, "Standardization"
Standardization is a method that shifts and scales the data to be centered around 0 with a standard deviation of 1: Characteristics of this method include:  Assumes data is normally distributed within each feature.  Centers distribution around 0, with standard deviation 1.  If data has outliers, scales most of the data to a small interval.  Does not produce normalized features with the exact same scale.
Even if the data has outliers, Z-score normalization will scale most of the non-outlier data to be in a similar range between all features, assuming the data is normally distributed, as in Figure 2.

Min-Max Normalization
For each feature, the minimum value of that feature is transformed to a 0, the maximum value is transformed to a 1, and every other value lies between 0 and 1: Advantages of this method include:  Scales data between 0 and 1; guarantees all features have exact same scale.  Preserves shape of original distribution.  Preserves 0 entries in sparse data.  Least disruptive to information in original data.
However, this method does not reduce the importance of outliers, so skewed results can still exist after normalizing if outliers exist, as in Figure 3.

Max Absolute Value Normalization
This method scales feature by its maximum absolute value so that the maximum absolute value of each feature will be 1. This sets the distribution of each feature Characteristics of this method include:  Good for data with positive and negative values.  Preserves 0 entries in sparse data.  Similar sensitivity to outliers as in min-max normalization.

Quantile Transformation
Quantile transformation transforms each feature independently to follow a uniform or normal distribution. This is a non-linear transformation that uses the estimated value of the cumulative distribution function (CDF) to map a feature original values to a uniform or normal distribution: 1) Calculate empirical ranks, using percentile function.
2) Modify the ranking through interpolation.
3) Map to a Normal distribution by inverting the CDF, and clipping bounds at the extreme values so they don't go to infinity.
Characteristics of this method include:  Tends to spread out the most frequent values of a given feature.  Smooths out unusual distributions.  Less sensitive to outliers as other scaling methods.  Distorts the linear correlations between variables measured at the same scale, but variables measured at different scales are more directly comparable.  For a Normal transformation, the median of the feature becomes the mean, centered at 0.

Quantile Normalization
Quantile normalization is a method most notably used in genetics to normalize within samples, rather than within features as in the previously described methods. In genetic sequencing, data is often normalized based on the assumption of consistent within and between sample distributions, with observed variation around these distributions assumed to be the result of technical noise. Samples are normalized to the same distribution as each other or to a reference gene sample ( [12]). 1) Given n arrays of length p, form X of dimension p n × where each array is a column; 2) Sort each column of X to give sort X ; 3) Take the means across rows of sort X and assign this mean to each element in the row to get sort X ′ ; 4) Get normalized X by rearranging each column of sort X ′ to have the same ordering as original X.
Characteristics of this method include:  Makes 2 or more distributions identical in statistical properties.  Does not preserve original data distributions.

Model Characteristics
In order to test the effects of various pre-processing methods on select data structures, the data structures and normalizations are considered under a selection of commonly used modeling techniques. The considered models include generalized linear models (GLM), decision tree, random forest, support vector machine (SVM), gradient boosting, and neural network. The selected models represent a range of simple to complex, parametric and non-parametric, global and local, stochastic methods. Each method has characteristics that may require normalization for optimal results or lead to unintended effects if incorrect normalization is used. For example, while GLM are fit using maximum likelihood estimation (MLE) which provides statistically optimal properties of the estimators, scaling data by normalization and standardization is still important because variables with a large difference in ranges can result in an ill-conditioned design matrix and difficulty reaching model convergence, resulting in slower processing times and unstable parameter estimates.

Generalized Linear Models
Historically, Generalized Linear Models (GLM) are an extension of simple linear regression models with continuous targets and continuous and/or categorical features. The form of such a model is expressed as where i x is the data in feature i, and β are the coefficent parameters to be estimated as part of the linear function. In simple linear regression the assumption is that y is normally distributed, and the errors are normally distributed as x β , the linear combination of data and estimated coefficient parameters [13]. A summary of common GLM is found in Table 2.
Generalized Linear Models are comprised of three main components: Random, ) and their linear combination. The Link Function specifies the link between the random distribution of the target variable and the systematic features. Assumptions of GLMs include:  Data are independently distributed.  Errors are independent, but do not need to be normally distributed (i.e. Logistic Regression).  Dependent variable does not need to be normally distributed (expect in linear regression) but are distributed within the exponential family.  Assumes a linear relationship between the link function transformed target and the explanatory features.  Uses Maximum Likelihood Estimation (MLE) to estimate the parameters, so it relies on large sample properties and regularity conditions (1st and 2nd derivatives must exist). GLMs use Maximum Likelihood Estimation (MLE) to estimate the model parameters. In each of the distributions considered above (i.e. Linear, Logistic, Poisson, etc.), the distribution depends on one or more unknown parameters, θ . The value of these parameters, θ , is estimated using observed data x. The function of θ that results from plugging in observed data x is known as the Likelihood Function: This function is the product of the values of the parameters, given each sample of data, and is denoted simply as ( ) L θ . The log-likelihood is often used for computational convenience. The goal in GLMs is to maximize the likelihood of a parameter estimate given the observed data. The value of θ that maximizes this function is known as θ , the maximum-likelihood estimate (MLE). The maximum of the function is found by taking the derivatives with respects to the parameter(s) θ .
In this work, three generalized linear models are considered for three distinct target data types: linear regression, logistic regression, and Poisson regression.

Linear Regression
Linear regression is used for data with a continuous target which is a linear combination of the explanatory features, as in since linear regression is modeling the mean response directly.

Logistic Regression
When there is a binary target (i.e. 0 and 1) binary logistic regression models the log odds of probability of "success" (target = 1). The random component, Y, has a binomial distribution,

( )
Binomial , n π , where π is the probability of success. The systematic component, X, can be continuous, categorical, or a combination of both, and is also linear in the parameters as in linear regression. However, in this case, the link function is the Logit link, cally, the logit link models the log odds of the mean response, π .

Poisson Regression
When the target of interest is an expected count (i.e. counts of disease, number of homes sold in a day, etc.), we extend the generalized linear model to use a log-linear or Poisson regression model. This models the expected count as a function of the explanatory predictors,

Advantages of GLM
 Do not need to transform target variable to have normal distribution.  Models fit using MLE which provides statistically optimal properties of the estimators.  Model can be easily explained and parameters can be interpreted in the context of the prediction problem.  Easily implemented in most software.

Disadvantages of GLM
 Still has to be a linear function of the parameters; the link function serves only to connect the nonlinear target distribution to a linear function.  Target responses must be independent.

Decision Tree
A decision tree is a non-parametric classification technique that learns decision rules from features, using locally optimized, recursive partitioning. The algorithm assigns each sample in a dataset into a predicted class based on each samples' feature attributes. The algorithm uses information gain (7) to find the best features for classifying the data, where p and n are the proportion of 0 and 1 values of a binary outcomes for the i-th target class. Then, for each value defined for the decision values of the best feature (the feature and splitting value that best splits the predicted 0 and 1 outcomes), the algorithm repeats the process with additional, next-best predictive features. This process continues until the leaves of the tree are pure (samples at each node belong to the same class) or a pre-defined stopping criteria is reached [14]. In this way, decision tree is also a feature importance algorithm, where the data will be split on the most important, predictive features first.

Disadvantages
 Since this is a locally optimized, greedy algorithm, it is not guaranteed that a global optimum will be reached.  Decision tree is very sensitive to changes in data. Small changes in data (i.e. adding samples) can lead to large structural changes in the tree, i.e. high variance.  This is a more complex model and often requires more training time.  Without regularization (early stopping, pruning, max nodes, etc.), there is high risk of overfitting.

Random Forest
Random forest is a method that uses ensemble learning to address some of the disadvantages of the decision tree model. Ensemble learning combines results from multiple models to make more accurate predictions than any one single model, by reducing variance. Random forest uses an ensemble learning technique known as bootstrap aggregation, aka bagging. Bagging uses random sampling with replacement to build individual models on subsets of the available data and then aggregate the results into one prediction. The repeated sampling leads to an algorithm that is known to reduce variance, as in one of the main disadvantages of the decision tree model. Random forest combines many decision trees into one model by running the individual decision tree models in parallel and then outputting the prediction that is the mode of target classes for a classification problem or the mean prediction for a regression problem [15]. The structure of a random forest model is shown in Figure 4.
Advantages  Much like decision tree, gives estimates of most important features.  Known for high accuracy, low bias.  Decreased variance in comparison to decision tree.  Can handle large datasets with high dimensionality.  Since it identifies most important features, can be used as a feature reduction method.  Robust to missing data.  Use of bootstrap sampling allows for successful application when data is limited.

Disadvantages
 When classifying categorical data, biased in favor of features with more levels.  Will overfit data if regularization not used, such as limiting number of features that can be split at each node.  More difficult to interpret than single decision tree model.

Support Vector Machines (SVM)
SVM with Gaussian kernel is a parametric model that represents instances of data as points in space and then builds a model to assign new instances to one category or another. Each data point is represented as an n-dimensional vector, then SVM constructs an n-1-dimensional separating hyperplane to discriminate 2 classes, with maximized distance between the hyperplane and data points on each side. SVM aims to find the best hyperplane for separation of both classes [16]. Data are represented as: where i y is either 1 or −1, indicating to which class i x belongs. Each i x is p-dimensional vector representing all of the characteristic values (features) of where  ω is the normal vector to the hyperplane and b is the offset of the hyperplane from the origin. If the data points are linearly separable, the hard margin can be represented as and (11) Figure 5 shows a maximum margin separation for linearly separable data. The samples that fall on the margin are known as the support vectors.  The SVM algorithm assumes that data is in a standard range (usually between 0 to 1, or −1 to 1), so it is recommended to scale features before using the algorithm. In fact, when using the Gaussian kernel, if data is normalized between 0 and 1, then the dot product between the feature vectors and the separating hyperplane is the cosine similarity [18]. Disadvantages  Since this model has to calculate the distance between every training point to create a separating hyperplane, it is computationally expensive as the size of the data set increases.  Noisy data with overlapping target classes are difficult to separate; Kernel functions can be added to transform the data into higher level feature space for improved separation but this adds model complexity.  Does not directly provide parameter coefficients so it is difficult to interpret.

Gradient Boosting
Gradient boosting is another form of ensemble learning, this time utilizing a technique known as boosting. In a boosting algorithm, predictions are not made in parallel as in the bagging method of random forest. In this case, subsequent prediction models learn from the mistakes of previous models. Observations have an unequal probability of appearing in the subsequent models, with high error observations appearing in the most models. This is contrary to the random forest model where observations are selected for each model via bootstrapping (random selection with replacement) and have equal probability of appearing in each model. Visual comparison of single, bagging, and boosting models is show in Figure 6.
In gradient boosting, an ensemble of weak models, often decision trees, are used to improve the model based off of hard to predict samples. The algorithm leverages patterns in model residuals, such as those from using MSE loss, to build subsequent models from the weak predictions. For example, in a simple linear regression there is the assumption that the sum of the residuals is 0, i.e. spread randomly with no pattern around zero. However, assuming there is some pattern in the residuals for a base model, such as a decision tree, gradient boosting builds sequential models off of these residual patterns until there is no longer a pattern, i.e. average residual is zero or constant. The sequential model predictions are then weighted into a combined prediction. The intuitive idea behind gradient boosting is to combine several weak models, with each additional weak model improving the MSE of the overall model. Advantages of bagging and boosting ensemble techniques are illustrated in Figure 7.
Advantages  Focus on difficult to classify cases makes it robust to imbalanced datasets.  MSE is commonly used loss function, but gradient boosting can be optimized on many objective functions so it can be extended to many different problem spaces.    Requires more hyperparameter tuning, and training time to avoid overfitting compared with random forest.
 Sensitive to overfitting if data is noisy, i.e. many hard to classify cases to use in the sequential models.  Longer training requirements due to sequential nature of algorithm (as opposed to parallel model development in random forest).

Neural Network
In this study, the effects of normalization on various data types are also tested using a multi-layer perception (MLP), also known as the simple form of a neural network. Neural networks are models that learn non-linear function approximations by feeding a set of input features into an output. Although the input and output layers are similar to the linear approximations of generalized linear models, neural networks differ in that there is one or more non-linear hidden layers, as in Figure 8 with one hidden layer.
The first layer, the input layer, contains a set of neurons  and βs in a GLM. The combined inputs are then transformed by a non-linear activation, such as a tan function. From the last hidden layer, the input layer then applies an activation function to transform the values into outputs, such as the sigmoid function for a binary classification problem. The weights within each layer of the neural network are learned through a process of backpropagation and gradient descent. This process uses derivatives with respect to each parameter to find the optimal value of the selected loss function. Even though neural network uses non-linear transformations in the hidden layers, the network still uses linear combinations of the features and weights to learn the optimal parameters, as in the GLM, linear-based methods.
Advantages  Can learn complex, non-linear models.  Works well with "big data"; feeding neural networks more data leads to improved training and results.  Ability to detect all possible interactions between predictor variable.

Disadvantages
 MLP with hidden layers have a non-convex loss function where there exists more than one local minimum; different random weight initializations can lead to different validation accuracy.  Sensitive to feature scaling, due to above disadvantage.  Requires a lot of tuning (number of hidden neurons, layers, iterations), and regularization to prevent overfitting.  Requires a lot of data for best training and results.  Difficult, computationally expensive to train.  "Black box" algorithm is difficult or not possible to interpret.

Model Summary
A global model is one in which there is a single predictive formula for the entire data space. It is expected that a linear transformation of data in a linear-based global model (linear regression, logistic regression, Poisson regression, linear SVM) will result in the model parameters (i.e. weights in a neural network, coefficients in regression) adjusting to reach the optimal value of the risk function, such as using the MLE in the GLM class of models. As a result, we expect that choice of normalization method should not affect the risk function value as long as the feature space is a convex function, but it can affect the values and stability of the feature coefficients. In this case, even though normalization may not affect the estimated total average error, it may have effects on the estimated average bias and variance due to model instability.
For non-linear, locally recursive models such as decision tree, random forest, and gradient boosting regression, it is also expected that within-feature global normalization will have little effect on risk function value. These tree-based models optimize by finding the best split-point within each individual feature by the percentage of labels correctly classified using that feature. Since these models are local, recursive models, as long as the ordering within the features is preserved, normalization of the data should not affect the loss function value. However, although we're using a decision tree-based learning model for the gradient boosting regression, this type of sequential boosting model relies on minimizing the MSE for the global model through subsequent predictions on the individual model residuals. Because of this, it is suspected that the gradient boosting model will exhibit patterns in bias-variance decomposition similar to the linear models. However, since we are using the default hyper-parameters in the gradient boosting model for consistent simulation conditions, it is possible that the bias-variance decomposition results will suffer from over-fitting and have a longer training time.

Simulation Methods
In order to approximate the bias-variance decomposition we need to approx- by simulating many variants of the training data sets. We can do this via bootstrap sampling. We take a synthetic input dataset D and create variants of D from 1 , , T D D  of size n (Algorithm 1).
x y example we have many predictions and can estimate: This process was then repeated on additional simulated datasets including rank-based data, categorical data, mixed data, and Poisson data. Default hyperparameters were used for all tested models and data structures, with no additional hyperparameter tuning to allow for consistent comparison between methods. MSE risk decomposition was then performed on several benchmark datasets to assess results on various data characteristics including sparse data, wide data (more features than samples), and imbalanced data. These benchmark datasets were from the UCI Machine Learning Library [20] and are listed in Table   3. The results of bias-variance decomposition are then used to inform model develop on several existing study applications. Selected applications include the historical data from the NCAA Men's Basketball Tournament (historical data used due to cancellation of NCAA tournament in 2020; comparing to model results from last years competition), and a credit risk model [21].

Simulation Results
Results are divided into three sections describing results from 1) bias-variance decomposition simulation, 2) bias-variance decomposition on benchmark datasets, and 3) application of findings from Sections 1 and 2 to existing NCAA data and credit risk data. The first section on simulation results is divided by target data type (binary, continuous, Poisson). Results across data structures and models is relatively consistent so summaries were provided to avoid repetition. Complete results of bias-variance decomposition under various data structures, Open Journal of Statistics  Table 4.

Binary Target
Over all models applied to bivariate normal data with binary target, SVM using within-feature normalization methods (risk = 0.290), and logistic regression with raw data or quantile normalization (risk = 0.290) have best, similar risk function results (Table A1). While SVM has the consistently best results and is normalization-agnostic ( Figure B1(d)), logistic regression ( Figure B1(a)) with raw data or quantile normalization has similar performance with a faster run time (approximately 5 seconds vs 18 seconds as in Table 4). If an analyst would like to use decision tree, random forest, or neural network models instead, it is recommended to use raw data or quantile normalization for best risk function results.
For models applied to rank-based data with binary target, logistic regression using raw data or quantile normalization (risk = 0.444), and SVM with raw data or quantile normalization (risk = 0.437) have best, similar risk function results (Table A4). While SVM has the consistently best results (14d), logistic regression with raw data or quantile normalization ( Figure B4(a)) has similar performance with a faster run time (approximately 6 seconds vs 39 seconds as in Table   4). If an analyst would like to use decision tree (best risk = 0.489), random forest (best risk = 0.445), gradient boosting (best risk = 0.465), or neural network (best risk = 0.475) models instead, it is recommended to use raw data or quantile normalization for best risk function results.
Over all normalization methods and models applied to categorical data with binary target, risk function estimates ranged from 0.473 to 0.502, indicating that this data is somewhat model-and normalization-agnostic (Table A7). SVM using z-standardized data, and logistic regression with all methods except z-standardization resulted in best risk function value of 0.473 ( Figure B7(d)). However, while SVM and logistic regression have similar results, logistic regression ( Figure B7(a)) is more than 3 times faster (7 seconds vs. 21 seconds average processing time as in Table 4). Since the simulated dataset consists of all categorical data, the features are first converted to [0, 1] coded dummy features, effectively "normalizing" the data between 0 and 1, so additional normalization methods are not expected to have an effect on the downstream analysis. For mixed data types with a binary target, although there are slight deviations between normalization and model performance, risk function values do not vary much between all methods, with a range between 0.479 and 0.507 (Table A10). A decision tree model using raw data leads to the best results ( Figure B10(b)), and gradient boosting using raw or quantile normalized data leads to the worst results ( Figure B10(e)). However, considering the consistency of performance across normalization methods and models, it is recommended to make selections based on additional criteria, such as processing resources, model interpretation, or another performance measure such as specificity and sensitivity.

Continuous Target
Over all models applied to bivariate normal data with continuous target, neural network using raw data (risk = 0.342), and linear regression with raw data (risk = 0.338) have best, similar risk function results (Table A2). Quantile normalization method for both models has similar results with MSE risk of 0.344 for linear regression ( Figure B2(a)) and 0.356 for neural network ( Figure B2(f)). However, while neural network and linear regression have similar results, linear regression is approximately 20 times faster (1.2 seconds vs. 26 seconds average processing time as in Table 4). SVM has the most consistent results; even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements.
When applied to rank-based data with continuous target, neural network using raw and quantile normalized data (risk = 0.175), and linear regression with raw and quantile normalized data (risk = 0.174) have best, similar risk function results (Table A5). However, while neural network and linear regression have similar results, linear regression is more than 17 times faster (0.8 seconds vs. 14 seconds average processing time as in Table 4). SVM has the most consistent results ( Figure B5(d)); even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements. Note, however, that outside of the best-performing linear regression and neural network models, all other methods and models perform significantly worse due to exploding estimates of average bias.
While normalization generally does not improve or worsen results as compared with raw data, it is not recommended to use z-standardization for categorical data with a continuous target, as the simulation results indicate increased risk function values (Table A8). In particular, if using linear regression ( Figure  B8(a)), z-standardization and quantile transformation should be avoided as these methods used with this model lead to significant explosion in the risk function value. Outside of z-standardization for any tested model, and quantile transformation for linear regression, simulation results indicate that this type of data is both normalization-and model-agnostic. In this case, normalization and model selection can be based off of additional criteria, such as processing requirements or model transparency.
Over all models applied to mixed data with continuous target, neural network using raw data (risk = 0.366) and quantile normalized data (risk = 0.425), and linear regression using raw data (risk = 0.367) and quantile normalized data (risk = 0.363) have best, similar risk function results (Table A11). However, while neural network ( Figure B11(f)) and linear regression ( Figure B11(a)) have similar results, linear regression is approximately 41 times faster (1.6 seconds vs. 66 seconds average processing time as in Table 4). SVM has the most consistent results; even though the within-feature normalization methods all perform worse than raw or quantile normalized data, SVM within-feature normalization results perform better than the same normalization in all other tested models. If normalization and scaling of data is required, as with features measured on highly divergent scales, it is recommended for an analyst to test the SVM model, keeping in mind increased processing requirements and potential for increased bias.

Poisson Target
Over all models applied to bivariate normal data with Poisson target, random forest using within-feature normalization methods (risk = 1.125), and gradient boosting using within-feature normalization methods (risk = 1.167) have best, similar risk function results (Table A3). Poisson regression using raw or quantile normalized data also has strong results with risk function value of 1.5. Although Poisson regression ( Figure B3(a)) results are not as strong as those found using random forest ( Figure B3(c)) and gradient boosting ( Figure B3 Table 4. For rank-based data with Poisson target, SVM using raw and quantile normalized data (risk = 1.281), and Poisson regression with raw and quantile normalized data (risk = 1.280) have best, similar risk function results (Table A6).
However, while SVM and Poisson regression have similar results, Poisson regression is more than 6 times faster (6 seconds vs. 36.6 seconds average processing time), as seen in Table 4. Although Poisson regression has the best results for this data, use of the within-feature normalization methods result in unstable, exploding risk function values and should be avoided ( Figure B6(a)).
Within-feature normalizations should also be avoided if using a neural network on this data for the same reason. If normalization and scaling of data is required, random forest model has the best results for the within-feature methods, although it has the longest processing time.
Although there are slight deviations between normalization and model performance, risk function values do not vary much between all methods when considering categorical data with Poisson target, with a range between 1.499 and 1.783 (Table A9). A decision tree model using min-max, maxAbs, quantile transformation, or quantile normalization lead to the best results ( Figure B9(b)), and SVM using z-standardization leads to the worst result ( Figure B9(d)). However, considering the consistency of performance across normalization methods and models, it is recommended to make selections based on additional criteria, such as client requested models.
Over all models applied to mixed data with Poisson target, gradient boosting ( Figure B12(e)) using raw and quantile normalized data (risk = 1.746), and random forest ( Figure B12(c)) using raw data (risk = 1.820) have best results (Table A12). Generally, the best performing risk function values do not vary much between all tested models, with a range between 1.476 for the gradient boosting model and 2.087 for the decision tree model ( Table 4). The similarity in best performing model results indicates that this data type is somewhat model-agnostic, although normalization methods should be selected carefully if required for analysis. For example, it is not recommended to use any of the tested within-feature normalizations if a neural network is used due to significant increases in bias and variance found in the simulation results.

Benchmark Data Results
Benchmark datasets were selected from the UCI Machine Learning Library to cover data types similar to those covered in the simulations.  Figures B13-B15). The traditional within-feature normalization methods (z-standardization, min-max, maxAbs, quantile transformation) result in risk function values that are the same or worse than using raw data or quantile normalization. For the wine quality data, using raw or quantile normalized data in logistic regression, linear SVM, or neural network results in best risk function performance, while quantile normalization with logistic regression was best for the breast cancer data. For the congressional voting records data, logistic regression and neural network with raw and quantile normalized data were also found to be the best method-model combinations, consistent with simulated data results. However, it is interesting to note that z-standardization in both of these models resulted in the worst risk function performance among all other method-model combinations applied to this dataset, due to both increased bias and variance.
For the binary target data with mixed data type features (arrhythmia, abalone), raw data and quantile normalization also lead to the best risk function performance. However, the arrhythmia dataset is a relatively more complex dataset in comparison to the others tested. It has missing data, many features in comparison to few instances (i.e. 279 features vs. 452 instances), and imbalanced target data. As a result, logistic regression is not well suited for describing these complex relationships, and has worse risk function performance; decision tree and gradient boosting regression with raw data or quantile normalization have best results ( Figure B17). In contrast, while the abalone dataset is also an imbalanced dataset, it has less complex data structures with only 8 features and over 4000 instances. In this case, logistic regression, linear support vector machine, and neural network are well-suited for the less complex data, and result in improved risk function values due to decreased variance in comparison to the more complex models ( Figure B16).
For assessing results on continuous target data, the forest fires dataset (numeric features), solar flare dataset (categorical features), and auto MPG dataset (mixed type features) were considered. Once again, in all cases, the within-feature normalization performed the same or worse than using raw data, with the between-feature quantile normalization process being the only method that resulted in same or some improvement to risk function values. For both numeric and categorical only datasets (forest fires and solar flare, respectively), the linear-based logistic regression and neural network models with raw data or quantile normalization resulted in best performance (see Figure B18 and Figure  B19), with all other normalization-model combinations resulting in both increased bias and variance. In the case of the more complex mixed feature type auto MPG dataset, gradient boosting regression also has improved performance, but logistic regression has similar performance and is a faster algorithm ( Figure  B20).

Applications
Findings of the empirical study and benchmark data were applied to existing studies, using best-performing normalization-model combinations in the model development process.

NCAA Tournament Data
For the 2019 NCAA Men's Basketball Tournament Bracket prediction problem, data was used from over 100,000 NCAA regular season games, with the goal to take information about two teams as input, and output a probability of team 1 winning a game. Motivated by the popular Kaggle competition, models were developed to minimize log-loss between predicted win probabilities and actual game outcomes, as in: This loss function has high penalty for models that are both confident and wrong ( [7]). Model development involved:  Readily available game statistics, provided by Kaggle.  Commonly used external ratings systems (Massey Ratings).  No additional feature engineering.
 No domain knowledge.
The analysis value-added in comparison to previous research and public models found on Kaggle was to focus on the comparison of various normalization techniques for model development. In particular, the use of outside domain knowledge (public health, genetics) to apply a technique from one domain (genetic research) to an unrelated domain (sports data) proved advantageous but was not, initially, statistically motivated. However, through simulation of bias-variance decomposition and findings from application to benchmark data, it is expected that improved loss function performance for this type of data (ranked data, balanced target, non-missing data) can be achieved by using a linear-based model with raw or quantile normalized data. Rather than iterating through many normalization-model combinations (which took over 12 hours of computation time when building the model for the 2019 tournament), logistic and linear SVM models with raw and quantile normalized data were trained on NCAA regular season data from 2014-2017 and tested on the 2018 tournament. The same features were used as in the 2019 model, with only the model and normalization selection process updated based on the model development framework findings. From the simulation results, logistic regression and SVM for both raw and quantile normalization provided similar results, although SVM outperformed logistic somewhat due to decreased variance, although it has increased bias. In the updated NCAA application on 2018 data, logistic regression with raw data outperformed the other tested models, with a log-loss score of 0.569 (Figure 9). This log-loss score, in comparison to other Kaggle submissions in the 2018 tournament, would have ranked 23rd out of 933 teams (98 percentile) and required only the original data supplied by Kaggle and no additional feature engineering or model tuning. In addition, the entire model development process and testing took less than 30 minutes. Three out of the four models developed correctly predicted the Final Four including the tournament Champion, Villanova. This is compared with a 2019 bracket that, while it correctly predicted the tournament winner, only predicted two of the Final Four teams, and scored in the 90th percentile of Kaggle Log-loss scores.

Credit Risk Data
Previous research by Rudd and Priestley (rudd2017comparison) compare the use of logistic regression and decision trees for prediction of commercial credit risk. The dataset, provided by Equifax, included over 11 million records and over 300 features, and involved extensive data preprocesing including imputation, feature reduction, and transformation. The effects of normalization were not considered at the time. Based on findings from simulations and benchmark results, it was found that gradient boosting regression with raw and quantile normalization should also be considered for this type of data. Running the analysis again, this time including gradient boosting regression, found best results for gradient boosting with raw data (AUC = 0.96). A drawback, however, of this result in the context of credit risk analysis is that gradient boosting is much more  difficult to explain than the logistic regression and decision tree models, and can be problematic in a heavily regulated industry where model interpretability is required ( Figure 10).

Conclusions and Suggestions
In this study, simulation was used to investigate the effects of normalization on downstream analysis results. Normalization methods were investigated by utilizing a decomposition of the empirical risk functions, measuring effects on model bias, variance, and irreducible error. Estimates of bias and variance were then used as diagnostic procedures for data pre-processing and model development. We used our findings to propose model development and algorithm design choices that best minimize common design effects on bias and variance. Open Journal of Statistics Mean square error (MSE) was considered as the evaluation metric, and the effects of a selection of normalization methods were measured on the empirical risk function. Normalization techniques were selected that represent both data invariant as well as data variant normalization strategies. For example, techniques such as z-score standardization (transforms data to have a mean of zero and a standard deviation of 1) and feature scaling (rescaling data to have values between 0 and 1) change the spread and position of data points (all by consistent factors) but do not change the distribution shape of the data, whereas techniques such as quantile normalization, commonly used in genetic differential expression analysis, affect the measures of spread, position, and shape. Through simulation of various data structures and bootstrap sampling of the bias-variance decomposition, best performing model-normalization-data structure combinations were found to illustrate the downstream analysis effects of these model development choices. For example, it was found that for rank-based data with binary target, quantile normalization performed better than the data invariant methods with similar or improved performance over raw data due to decreased variance in the loss function value. In addition, results found from simulations were verified and expanded to include additional data characteristics (imbalanced, sparse) by testing on benchmark datasets available from the UCI Machine Learning Library. Normalization results on benchmark data were consistent with those found using simulations, while also illustrating that more complex and/or non-linear models provide better performance on datasets with additional complexities, such as wide data (large feature to instance ratio) as in the arrhythmia dataset. Finally, applying the findings from simulation experiments to previous applications led to equivalent or improved results with less model development overhead and processing time. Applying the model framework to the 2018 NCAA Men's Basketball data resulted in a log-loss score that would have been ranked 23 out of 933 teams (98th percentile) and only required 30 minutes of model overhead, as opposed to a 2019 model that required over 12 hours of processing and resulted in a 90th percentile log-loss score.

Limitations
While this work provides a statistical illustration of the downstream effects of

Future Research
In this research, an empirical study was used to illustrate the downstream analysis effects of model development choices. Further theoretical work is recommended to connect the empirical findings theoretical properties. For example, since the generalized linear models were most often found to have best risk function results regardless of data structure or selected normalization, additional theoretical study can enhance the explanation of these results. If we assume that MLE is unbiased in GLM then estimates of bias should resolve towards zero. In this case, bias-variance decomposition of the loss will be completely defined by variance. As an extension, the Cramer-Rao lower bound property of MLE (the lower bound of the variance of the estimator) suggests that no other model will achieve a better result of the bias-variance loss decomposition. This potential explanation requires theoretical understanding of at least two questions: 1) Are generalized linear models, in fact, unbiased? and 2) Does the Cramer-Rao lower bound theorem apply to variance of the prediction and not just the parameters? In addition, estimates of predictive risk are only one way to assess performance of a statistical procedure and downstream analysis effects. In order to provide further justification connecting the empirical evidence with the proposed framework, it is recommended to consider additional measures of performance such as coverage of predictive intervals, class probabilities cutoff points, gain and lift charts, etc. The diversity of analytic choices and resulting modeling pipelines leads to a wide range of potential future research required to truly quantify the complete effects of a unified model development framework. Open Journal of Statistics   Target   Table A4. Bias-variance decomposition results for ranked data with binary target.