On Minimizing the Standard Error of the Slope in Simple Linear Regression

A common homework problem in texts covering calculus-based simple linear regression is to find a set of values of the independent variable which minimizes the standard error of the estimated slope. All discussions of this problem that the authors have heard, as well as all texts with which the authors are familiar and which include this problem, provide no solution, a partial solution, an outline of a solution without theoretical proof, or a solution that is incorrect. Going back to first principles, we provide the complete correct solution to this problem.


Introduction
A homework question, occurring in several oft-cited, best-selling introductory texts covering calculus-based simple linear regression, goes something like this: Suppose we are to collect data and fit a straight-line simple linear regression, $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the errors are assumed to have mean zero, unknown variance $\sigma^2$, and to be uncorrelated with one another. Further suppose that in this designed experiment, the region of interest for $x$ is $A \le x \le B$, $A < B$, and that the primary goal is to make the standard error of the estimate of the slope as small as possible. For a given sample size $n$, at what values of the independent variable should the observations be taken? That is, how should $x_1, x_2, \ldots, x_n$ be chosen so as to minimize the standard error of the estimate of $\beta_1$?

From [1], which does not include the above noted problem, and virtually any other text covering simple linear regression, we know the following: the estimate of the slope is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

with standard deviation

$$\sqrt{\frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$

The estimated standard deviation, or standard error, is found by replacing $\sigma^2$ by its estimate $\hat{\sigma}^2$. Here $\sigma^2$ is an unknown constant, and its estimator cannot be formed until data are collected.
Thus, in the case of either the theoretical standard deviation or the estimated standard error, the numerator under the radical is unknown and not under the control of the experimenter in the question. Consequently, the minimization of the standard deviation or the standard error is achieved by maximizing the quantity $SXX = \sum_{i=1}^{n} (x_i - \bar{x})^2$, the corrected sum of squares of the $x$'s.
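To make the dependence on the design concrete, the following minimal sketch computes the corrected sum of squares and the theoretical standard deviation of the slope, $\sqrt{\sigma^2 / SXX}$, for two candidate designs. The function names, the interval $[0, 1]$, the sample size $n = 4$, and the value $\sigma = 1$ are assumptions chosen purely for illustration.

```python
import math


def sxx(xs):
    """Corrected sum of squares of the design points."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def slope_sd(xs, sigma):
    """Theoretical standard deviation of the slope estimate: sigma / sqrt(SXX)."""
    return sigma / math.sqrt(sxx(xs))


# Hypothetical designs on [0, 1] with n = 4 and an assumed sigma = 1.
spread = [0.0, 1 / 3, 2 / 3, 1.0]   # equally spaced points
endpoints = [0.0, 0.0, 1.0, 1.0]    # half at each end of the interval

print(slope_sd(spread, 1.0), slope_sd(endpoints, 1.0))
```

Because $\sigma$ enters only as a multiplicative constant, the ranking of the two designs is the same for any value of $\sigma$.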
Many texts which include this problem provide no solution. Every discussion that the authors have heard, and every solutions manual that the authors have seen, suggests, without proof, that in order to maximize SXX when $n$ is even, half of the observations should be taken at $A$ and half at $B$. Many texts that include a solution ignore the possibility that $n$ is odd, even though no condition on $n$ was provided in the question. When a solution is provided for $n$ odd, every solution we have seen suggests, without proof, that $(n - 1)/2$ observations should be taken at each of $A$ and $B$, with the remaining single observation being taken halfway between these values, at $(A + B)/2$. That this solution is incorrect can be seen with a simple example where $n = 3$. The result using the "usual" solution outlined above is to take $x_1 = A$, $x_2 = B$, and $x_3 = (A + B)/2$, which gives $SXX = (B - A)^2/2$. Alternatively, if we take $x_1 = x_2 = A$ and $x_3 = B$, we have $SXX = 2(B - A)^2/3$, which is larger than the value obtained using the "usual" solution, showing that the usual solution is not correct. We suppose that the desire for symmetry led to the belief in the incorrect solution; however, symmetry has been neither mentioned nor required for the problem under discussion.
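The $n = 3$ comparison can be checked with exact rational arithmetic. This is a small sketch only; the helper name `sxx` and the choice $A = 0$, $B = 1$ are assumptions made for illustration.

```python
from fractions import Fraction


def sxx(xs):
    """Corrected sum of squares of the design points."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


# Assumed interval endpoints for the illustration.
A, B = Fraction(0), Fraction(1)

usual = sxx([A, B, (A + B) / 2])   # one point at each end, one in the middle
lopsided = sxx([A, A, B])          # two points at A, one at B

print(usual, lopsided)  # 1/2 2/3
```

With $B - A = 1$, the two values are $1/2$ and $2/3$, so the asymmetric design gives the larger corrected sum of squares.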
In the sequel we show that for $n$ even, the "usual" solution of choosing half of the observations to be taken at $A$ and the other half to be taken at $B$ is correct. For $n$ odd we show that in order to minimize the standard error, $(n - 1)/2$ observations should be taken at one end of the interval (either at $A$ or at $B$) and the remaining $(n + 1)/2$ observations should be taken at the other end of the interval. An example of this result is given in Figure 1. Throughout we will assume that the sample size $n$ is a given constant.

The Objective Function; Sum of Squares
Our goal is to find the set of $x_i$ which maximizes $SXX = \sum_{i=1}^{n} (x_i - \bar{x})^2$. Since the $x_i$ are continuous variables (not in the statistical sense but rather in the algebraic sense) on the interval $[A, B]$, we may use techniques of calculus in order to find the values that maximize this function (see, e.g., [2]). We have

$$\frac{\partial\, SXX}{\partial x_i} = 2(x_i - \bar{x}).$$

Setting this equal to zero, we have $x_i = \bar{x}\ \forall i$ as stationary points. Of course, our variables exist on a closed interval, so we must also investigate the endpoints. As a result, it must be true that $x_i \in \{A, \bar{x}, B\}$ for every $i$. If $x_i = \bar{x}\ \forall i$, then $SXX = 0$, which is the smallest possible value of SXX; i.e., choosing $x_i = \bar{x}\ \forall i$ leads to a minimum rather than a maximum. The same is true if observations are taken either all at $A$ or all at $B$. We would then say it is obvious that at least one observation must be taken at $A$ and at least one observation must be taken at $B$, but authors saying "it is obvious that..." is what led to this note in the first place. Consider the case where some observations are taken at $x = A$ and the rest at $x = \bar{x}$, distinct from $A$; this is a contradiction, as the mean would then not be at $\bar{x}$. Similarly, it is impossible to have some observations at $B$ and the rest at $\bar{x}$. Accordingly, at least one observation must be taken at each of $A$ and $B$.
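The stationarity condition can be verified by differentiating SXX directly. The following is a sketch of that step; the summation index $j$ and the Kronecker delta $\delta_{ij}$ are notation introduced here for the illustration.

```latex
\begin{align*}
\frac{\partial\, SXX}{\partial x_i}
  &= \sum_{j=1}^{n} 2\,(x_j - \bar{x})\left(\delta_{ij} - \frac{1}{n}\right)
     && \text{since } \frac{\partial \bar{x}}{\partial x_i} = \frac{1}{n} \\
  &= 2\,(x_i - \bar{x}) - \frac{2}{n}\sum_{j=1}^{n} (x_j - \bar{x}) \\
  &= 2\,(x_i - \bar{x})
     && \text{since } \sum_{j=1}^{n} (x_j - \bar{x}) = 0 .
\end{align*}
```

The partial derivative with respect to $x_i$ therefore vanishes exactly when $x_i = \bar{x}$.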
Let $n_1$ be the number of observations taken at $A$, $n_2$ be the number of observations taken at $\bar{x}$, and $n_3$ be the number of observations taken at $B$. From the argument in the previous paragraph we have $n_1 \ge 1$, $n_2 \ge 0$, $n_3 \ge 1$, all integers, and $n_1 + n_2 + n_3 = n$, a given constant. Then

$$\bar{x} = \frac{n_1 A + n_2 \bar{x} + n_3 B}{n_1 + n_2 + n_3},$$

the simplification of which leads to

$$\bar{x} = \frac{n_1 A + n_3 B}{n_1 + n_3}.$$

Consequently, substituting these values, we have

$$SXX = n_1 (A - \bar{x})^2 + n_3 (B - \bar{x})^2 = (B - A)^2 \, \frac{n_1 n_3}{n_1 + n_3}.$$

The quantity $(B - A)^2$ is an arbitrary positive constant. Some texts give as their example $A = -1$ and $B = 1$, some give $A = 0$ and $B = 1$, and still other books use other choices for these given constants. The choice of $A$ and $B$, as seen in the final formula for SXX, has no bearing on the values of $n_1$, $n_2$ and $n_3$ which maximize SXX. Thus we shall simply attempt to find parameters $n_1$, $n_2$ and $n_3$ that maximize

$$\frac{n_1 n_3}{n_1 + n_3}$$

with the constraints imposed previously that $n_1$, $n_2$ and $n_3$ are non-negative integers, $n_1, n_3 \ge 1$, and $n_1 + n_2 + n_3 = n$, a known/given constant.
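As a numerical sanity check on the closed form $SXX = (B - A)^2\, n_1 n_3 / (n_1 + n_3)$, the sketch below builds a design with $n_1$ points at $A$, $n_3$ points at $B$, and $n_2$ points at the resulting mean, and compares the directly computed SXX with the formula. The function names, the interval $[-1, 1]$, and the particular triples are assumptions made for illustration.

```python
from fractions import Fraction


def sxx(xs):
    """Corrected sum of squares of the design points."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def design(n1, n2, n3, A, B):
    """n1 points at A, n3 points at B, and n2 points at the resulting mean."""
    centre = (n1 * A + n3 * B) / (n1 + n3)
    return [A] * n1 + [centre] * n2 + [B] * n3


def closed_form(n1, n3, A, B):
    """(B - A)^2 * n1 * n3 / (n1 + n3); n2 does not appear."""
    return (B - A) ** 2 * Fraction(n1 * n3, n1 + n3)


# Assumed endpoints and a few arbitrary allocations.
A, B = Fraction(-1), Fraction(1)
for n1, n2, n3 in [(1, 0, 1), (2, 1, 3), (4, 2, 1)]:
    assert sxx(design(n1, n2, n3, A, B)) == closed_form(n1, n3, A, B)
```

Exact fractions are used so that the comparison is an equality rather than a floating-point tolerance check.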

Optimization
The function with constraints given in the previous paragraph may be maximized in any number of ways. Possibilities considered by the authors include the following: taking the variables of interest to be continuous and maximizing the function through the use of calculus, hoping for integer values which would then be the optimal solution [3]; using integer programming [4]; and other possibilities. However, it seems that a simple algebraic manipulation may be the most elegant solution. Let

$$g(n_1, n_2, n_3) = \frac{n_1 n_3}{n_1 + n_3}.$$

We now show that $n_2$ must be zero. Assume that $(n_1, n_2, n_3)$ with $n_2 > 0$ is an ordered triple which meets the constraints and which maximizes $g$. Then the ordered triple $(n_1 + n_2, 0, n_3)$ also satisfies the constraints, and furthermore

$$g(n_1 + n_2, 0, n_3) - g(n_1, n_2, n_3) = \frac{n_2 n_3^2}{(n_1 + n_2 + n_3)(n_1 + n_3)} > 0,$$

which is a contradiction to the assumption that $(n_1, n_2, n_3)$ maximizes $g$; hence $n_2 = 0$.

Now one of our constraints reduces to $n_1 + n_3 = n$, and we are maximizing

$$g(n_1, 0, n - n_1) = \frac{n_1 (n - n_1)}{n}.$$

This last is simply a parabola in $n_1$, which we need to maximize over $n_1 \in \{1, 2, \ldots, n - 1\}$. To find the maximum, treat the parabola as a function of a continuous variable $z$. The maximum occurs when $\frac{d}{dz}\left[\frac{z(n - z)}{n}\right] = \frac{n - 2z}{n} = 0$, that is, at $z = n/2$. As $n_1$ is integer valued, for $n$ even this implies that $n_1 = n/2$ gives the maximum value, while for $n$ odd either of the two points surrounding $n/2$, namely $(n - 1)/2$ and $(n + 1)/2$, gives the same maximum value. Figure 2 graphically demonstrates this result. The contradiction in the previous paragraph gives $n_2 = 0$, and this together with the original constraint that $n_1 + n_2 + n_3 = n$, a known/given constant, gives the value of $n_3 = n - n_1$.
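The result can also be confirmed by exhaustive search for small $n$. The sketch below enumerates every admissible triple and returns those that maximize the objective; it is a brute-force check rather than the algebraic argument given in the text, and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def g(n1, n2, n3):
    """Objective n1*n3/(n1+n3); n2 does not appear but uses up sample size."""
    return Fraction(n1 * n3, n1 + n3)


def best_triples(n):
    """All (n1, n2, n3) with n1, n3 >= 1, n2 >= 0, summing to n, that maximize g."""
    triples = [(n1, n - n1 - n3, n3)
               for n1 in range(1, n)
               for n3 in range(1, n - n1 + 1)]
    top = max(g(*t) for t in triples)
    return [t for t in triples if g(*t) == top]


print(best_triples(6))  # [(3, 0, 3)]
print(best_triples(7))  # [(3, 0, 4), (4, 0, 3)]
```

For $n = 6$ the unique maximizer splits the sample evenly between the endpoints, while for $n = 7$ the two maximizers place three observations at one end and four at the other, with none in the interior.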

Conclusions
For the common homework problem appearing in approximately half of the texts covering calculus-based simple linear regression with which the authors are familiar, and which was posed at the beginning of this paper, we have shown that if $n$ is even, the oft-given solution of taking half of the observations at each end of the interval is correct. However, for odd $n$ we have shown that the only previously given solution, which places one point in the center of the interval and half of the remaining points at each end of the interval, is incorrect, and that the correct solution is to choose nearly half of the points, either $(n - 1)/2$ or $(n + 1)/2$, at one end of the interval and the remaining points at the opposite end of the interval. We part with the common caveat that this oft-given textbook problem is of little use in most realistic applications unless it is known that the true relationship among the data is linear, as the solution affords us no opportunity to check this assumption with the observed data. However, the authors would submit that there is a difference between being "useless in practical situations" and "understanding something fundamental about simple linear regression". We believe that it is important for a student to understand the theory underlying simple linear regression, and this importance is supported by the inclusion of the problem in a large number of highly cited and best-selling texts. Unfortunately, many of these texts provide no solution, and others provide a partial or an incorrect solution. No texts with which we are familiar, nor their solutions manuals, provide a complete and correct solution. This common textbook problem affords the student the opportunity to understand what drives the variance of the parameter estimate, and as such deserves a correct solution.

Figure 1. The graphs above represent $n$ (an odd number) data points collected according to two plans for minimizing the standard error of the slope in simple linear regression. The figure on the left represents the common but incorrect solution whereby one observation is taken in the middle of the interval. In the graph to the right, the numbers of observations taken at the two ends of the interval differ by one. Although lacking symmetry, this is the correct solution for minimizing the standard error of the slope.

Figure 2. When $n$ is even, the maximum of the objective function occurs at $n/2$. When $n$ is odd, the maximum value occurs at $(n - 1)/2$ and $(n + 1)/2$.