Influence of the Fitted Straight Line for Confidence Bands Algorithm in Q-Q Plots
1. Introduction
Normal probability plots, and Normal Q-Q Plots in particular, are used to determine whether a set of observations comes from a normal distribution. If it does, the plotted points on the graph should lie approximately on a straight line.
A Normal Q-Q Plot compares the empirical quantiles of the sample data, i.e., the ordered sample data $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, with the corresponding quantiles of a theoretical distribution, i.e., the normal distribution, $\Phi^{-1}(p_i)$. Therefore, the plotted points on the graph are the pairs $(\Phi^{-1}(p_i), x_{(i)})$, $i = 1, \ldots, n$, where Φ is the standard normal cumulative distribution function and $p_i$ are the plotting positions. In the literature, several definitions of plotting positions are available [1] [2].
In this paper, we use the definition proposed by Yu and Huang [3]:
$$p_i = \frac{i - 0.326}{n + 0.348}, \quad i = 1, \ldots, n \qquad (1)$$
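As an illustration, the plotted pairs can be computed with a few lines of Python. This is a minimal sketch rather than the authors' R code: it takes Equation (1) to be the Yu and Huang position $p_i = (i - 0.326)/(n + 0.348)$, and the function names `plotting_positions` and `qq_points` are hypothetical.

```python
from statistics import NormalDist


def plotting_positions(n):
    """Plotting positions p_i = (i - 0.326) / (n + 0.348), i = 1..n (assumed form of Eq. 1)."""
    return [(i - 0.326) / (n + 0.348) for i in range(1, n + 1)]


def qq_points(sample):
    """Return the pairs (z_i, x_(i)) drawn on a Normal Q-Q Plot."""
    xs = sorted(sample)                      # ordered sample = empirical quantiles
    std_normal = NormalDist()                # N(0, 1)
    zs = [std_normal.inv_cdf(p) for p in plotting_positions(len(xs))]
    return list(zip(zs, xs))


# Small made-up sample, used only to show the call.
points = qq_points([2.1, -0.4, 0.7, 1.3, -1.8])
```

These positions are symmetric, $p_i + p_{n+1-i} = 1$, so the theoretical quantiles $z_i$ are symmetric about zero.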
A straight line can be drawn on a Normal Q-Q Plot to help judge whether the plotted points are linear and, hence, whether the hypothesis of normality holds. Several different lines can be used for this purpose [4].
The main problem of this graphical technique is that the conclusion may depend on who reads the graph, which is why it is often called an “informal technique”. To avoid this problem, confidence bands or acceptance regions [5] are used to determine whether or not a data set has a normal distribution, so that the conclusion is the same regardless of the observer. Some confidence bands can only be constructed once a straight line has been fitted on the Normal Q-Q Plot.
Therefore, the plotting positions, the fitted straight line and the confidence bands are key elements in a Normal Q-Q Plot. Because these three elements can be combined in many ways, it is necessary to analyze how the choice of combination affects the final conclusion. In this study, we focus on five types of straight lines and on the confidence bands based on the exact distribution of the order statistics [5].
Here, we focus on the normal distribution. However, the study can be extended to any distribution of interest.
This paper is organized as follows: in Section 2, we explain the five straight lines that we have used in this study. Section 3 presents the confidence bands based on the exact distribution of the order statistics. In Section 4, two examples illustrate the performance provided. Finally, in the last section, the conclusions of this study are presented.
2. Fitted Straight Lines in a Q-Q Plot
In this section, we review some of the straight lines that can be fitted in a Q-Q Plot [4]. We use them in our study to assess their influence on the confidence bands.
1) Straight line that passes through the first and third quartiles. This procedure consists of locating a point on the graph corresponding to the first quartile and another corresponding to the third quartile and joining these two points.
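2) The least-squares line. The straight line, in our case, will take the form:
$$x = \mu + \sigma z \qquad (2)$$
and the estimates of µ and σ are obtained by the unweighted least squares method. The solution in the case of the normal distribution is the following:
$$\hat{\sigma} = \frac{\sum_{i=1}^{n} (z_i - \bar{z})(x_{(i)} - \bar{x})}{\sum_{i=1}^{n} (z_i - \bar{z})^2}, \qquad \hat{\mu} = \bar{x} - \hat{\sigma}\,\bar{z} \qquad (3)$$
and the fitted straight line is $x = \hat{\mu} + \hat{\sigma} z$, where $x_{(i)}$ are the ordered observations, $z_i = \Phi^{-1}(p_i)$ are the N (0, 1) quantiles in the plotting positions $p_i$, and $\bar{x}$ and $\bar{z}$ are the corresponding means.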
3) Straight line with slope equal to the quasi-standard deviation s and constant equal to the average of the data set. This method consists of fitting to the plotted points the straight line $x = \bar{x} + s z$, where $\bar{x}$ is the average of the observations.
4) Theil-Sen’s line [6]. The slopes of the lines passing through all possible pairs of points are calculated, and the median of these slopes is taken as the estimate of the slope. For the constant, the n constants of the lines that pass through each of the points with the previously estimated slope are calculated. The estimated constant of the straight line is the median of the n constants obtained.
5) Tukey’s line [7]. This method consists of dividing the set of observations into three parts of approximately equal size, calculating the medians of each part, and determining the straight line from the three medians. The steps to obtain Tukey’s line, of general expression $x = a + bz$, are the following:
a) The observations $(z_i, x_{(i)})$, $i = 1, \ldots, n$, are divided into three groups with an approximately equal number of elements according to the variable z.
b) For each group the median is calculated, giving the following points:
$$(z_L, x_L), \quad (z_M, x_M), \quad (z_R, x_R) \qquad (4)$$
where $z_L$ is the median of the left group, $z_M$ is the median of the central group and $z_R$ is the median of the right group of the observations of z, and similarly $x_L$, $x_M$ and $x_R$ for the observations of x.
c) The slope of Tukey’s line is calculated by the following expression:
$$b = \frac{x_R - x_L}{z_R - z_L} \qquad (5)$$
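d) The constant of Tukey’s line is calculated by the following expression, where b is the slope and $(z_L, x_L)$, $(z_M, x_M)$, $(z_R, x_R)$ are the three group medians:
$$a = \frac{1}{3}\left[(x_L + x_M + x_R) - b\,(z_L + z_M + z_R)\right] \qquad (6)$$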
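The five fits described in this section can be sketched in Python as follows. This is an illustrative sketch, not the authors' R code, and all function names are hypothetical. It makes three assumptions: the quartile line pairs R-style (type 7) sample quartiles with the N (0, 1) quartiles, Tukey's outer groups each contain ⌊n/3⌋ points, and Tukey's constant is the average of the three median-based intercepts, which is one common convention. Each function takes the theoretical quantiles `zs` and the ordered observations `xs`, and returns the pair (constant, slope).

```python
from itertools import combinations
from statistics import NormalDist, mean, median, quantiles, stdev


def quartile_line(zs, xs):
    """Line through the first and third quartile points (zs unused, kept for a uniform signature)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")   # assumed quartile definition
    z1, z3 = NormalDist().inv_cdf(0.25), NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    return q1 - slope * z1, slope


def least_squares_line(zs, xs):
    """Unweighted least-squares fit of x on z."""
    zbar, xbar = mean(zs), mean(xs)
    sxz = sum((z - zbar) * (x - xbar) for z, x in zip(zs, xs))
    szz = sum((z - zbar) ** 2 for z in zs)
    slope = sxz / szz
    return xbar - slope * zbar, slope


def mean_sd_line(zs, xs):
    """Constant = sample mean, slope = quasi-standard deviation (n - 1 denominator)."""
    return mean(xs), stdev(xs)


def theil_sen_line(zs, xs):
    """Median of pairwise slopes, then median of the n implied constants."""
    pts = list(zip(zs, xs))
    slope = median((x2 - x1) / (z2 - z1)
                   for (z1, x1), (z2, x2) in combinations(pts, 2) if z2 != z1)
    return median(x - slope * z for z, x in pts), slope


def tukey_line(zs, xs):
    """Three-group resistant line; needs n >= 3 and zs in increasing order."""
    n = len(zs)
    k = n // 3                                           # assumed size of the outer groups
    groups = (slice(0, k), slice(k, n - k), slice(n - k, n))
    zl, zm, zr = (median(zs[g]) for g in groups)
    xl, xm, xr = (median(xs[g]) for g in groups)
    slope = (xr - xl) / (zr - zl)
    return ((xl + xm + xr) - slope * (zl + zm + zr)) / 3, slope
```

On points that lie exactly on a line, the least-squares, Theil-Sen and Tukey fits all recover that line. The five fits generally differ on real data, which is what makes the choice of line matter for the confidence bands.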
3. Confidence Bands Based on the Exact Distribution of the Order Statistics
The procedure to obtain the confidence bands based on the exact distribution of the order statistics is [5]:
Step 1 Fix the significance level α.
Step 2 Draw a Normal Q-Q Plot and fit a straight line. The fitted straight line provides an estimate of the parameters µ and σ of the normal distribution.
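Step 3 Determine, for each i, $i = 1, \ldots, n$, the values $q_i^{L}$ and $q_i^{U}$ as the quantiles of order $\alpha/2$ and $1 - \alpha/2$ of a $\mathrm{Beta}(i, n - i + 1)$ distribution.
Step 4 Determine the values $x_i^{L} = \Phi^{-1}(q_i^{L})$ and $x_i^{U} = \Phi^{-1}(q_i^{U})$, for each i, as the value of $\Phi^{-1}$ at the quantiles calculated in the previous step. Here Φ is the distribution function of a normal distribution with parameters µ and σ, where µ and σ are the values obtained in Step 2.
Step 5 Plot, for each i, vertically, an interval centered on the corresponding point of the fitted straight line with the lower end of the band as the point $(z_i, x_i^{L})$ and the upper end as the point $(z_i, x_i^{U})$, where $z_i$ is the N (0, 1) quantile at the i-th plotting position.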
Step 6 Join the points calculated in the preceding step to obtain a band.
Step 7 Reject the hypothesis of normality if at least 100α% of the observations fall outside the confidence bands.
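The steps above can be sketched in Python using only the standard library. This is a hedged illustration rather than the implementation of [5]. It assumes the order-statistic distribution in Step 3 is Beta(i, n − i + 1) with quantile orders α/2 and 1 − α/2, and it reads Step 7 as "the proportion of points outside the bands is at least α". The function names are hypothetical. For integer parameters, the Beta distribution function equals a binomial tail probability, which is inverted here by bisection.

```python
from math import comb
from statistics import NormalDist


def beta_cdf_int(p, i, n):
    """CDF at p of Beta(i, n - i + 1), computed as P(Binomial(n, p) >= i)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, n + 1))


def beta_quantile_int(q, i, n, tol=1e-12):
    """Quantile of order q of Beta(i, n - i + 1), by bisection on the increasing CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, i, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_bands(n, mu, sigma, alpha=0.05):
    """Steps 3 and 4: (lower, upper) band ends for i = 1..n, given the fitted mu and sigma."""
    fitted = NormalDist(mu, sigma)           # normal distribution from the fitted line (Step 2)
    bands = []
    for i in range(1, n + 1):
        q_low = beta_quantile_int(alpha / 2, i, n)
        q_up = beta_quantile_int(1 - alpha / 2, i, n)
        bands.append((fitted.inv_cdf(q_low), fitted.inv_cdf(q_up)))
    return bands


def reject_normality(sorted_sample, bands, alpha=0.05):
    """Step 7 (assumed reading): reject if the proportion of points outside is >= alpha."""
    outside = sum(not (low <= x <= up) for x, (low, up) in zip(sorted_sample, bands))
    return outside / len(sorted_sample) >= alpha


# Bands for n = 5 around a hypothetical fitted line with constant 0 and slope 1.
bands = confidence_bands(n=5, mu=0.0, sigma=1.0)
```

Because `mu` and `sigma` come from the fitted line, a different line shifts and rescales every band end. The binomial sum is exact but suited only to moderate sample sizes.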
4. Examples
In this section, we show two examples of how to construct a Normal Q-Q Plot with confidence bands, the first with simulated data and the second with real data. The examples have been made using R [8].
4.1. Example 1
Table 1 shows a simulated sample of size 30 from a Cauchy distribution.
Figure 1 shows a Normal Q-Q Plot constructed from the above observations. The plotting position considered, $p_i$, is that of Yu and Huang [3]. However, any other plotting position could be used to construct the Normal Q-Q Plot. The plot also represents the confidence bands based on the exact distribution of the order statistics. To obtain these confidence bands, we have considered a straight line that passes through the first and third quartiles and the least-squares line. It can be observed that the hypothesis of normality of observations is rejected according to the confidence bands obtained by considering the straight line that passes through the first and third quartiles, but it is not rejected according to that obtained by the least-squares line, although the data come from a Cauchy distribution.
Table 1. Simulated sample of a Cauchy distribution.
Figure 1. Normal Q-Q Plot with confidence bands using simulated data.
4.2. Example 2
The data set shown in Table 2 comes from Bickel and Doksum [9] and lists the elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay.
Following the same procedure as in the previous example, we have obtained Figure 2.
In Figure 2, we can observe that the hypothesis of normality is rejected according to the confidence bands obtained by considering the straight line that passes through the first and third quartiles (there are 5 points outside the confidence bands, more than α = 5% of the data). In contrast, it is not rejected according to the bands obtained from the least-squares line (there are 3 points outside the proposed confidence bands, fewer than α = 5% of the data).
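The arithmetic behind this comparison, with n = 66 observations and α = 0.05, can be written out as:

$$\frac{5}{66} \approx 0.076 > 0.05, \qquad \frac{3}{66} \approx 0.045 < 0.05.$$

A difference of two points outside the bands is therefore enough to move the sample across the rejection threshold.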
Table 2. Data set from Bickel and Doksum.
Figure 2. Normal Q-Q Plot with confidence bands using real data.
5. Conclusions
The aim of this work has been to analyze how the choice of straight line drawn on a Normal Q-Q Plot influences the detection of non-normality in a set of observations. The confidence bands represented in a Q-Q Plot depend on the fitted straight line, so if we change the straight line, the confidence bands also change, and the conclusion may be different.
There are three elements that can vary in a Normal Q-Q Plot: plotting positions, confidence bands and straight lines. We have focused on the plotting positions proposed by Yu and Huang [3]. In [5], out of the three graphic techniques compared, the best method proves to be the confidence bands based on the exact distribution of the order statistics, so in this study we have used such confidence bands. Therefore, we have fixed these two elements and compared the graphics obtained with five types of straight lines. The final conclusion is that the choice of straight line used to construct the confidence bands in a Normal Q-Q Plot can change the decision about whether or not the data come from a normal distribution. Therefore, special care must be taken when choosing the line for a Normal Q-Q Plot.