Global warming is driven largely by rising carbon dioxide (CO2) emissions and the associated increase in atmospheric temperature. The continued accumulation of CO2 in the atmosphere is a major part of the climate change problem. This study aims to develop a data-driven statistical model, using real data on Africa’s fossil-fuel CO2 emissions, to identify the significant attributable variables and interactions that produce carbon dioxide emissions. We consider five attributable variables in our statistical modeling: Liquid fuels (Li), Solid fuels (So), Gas fuels (Ga), Gas flares (Gf) and Cement production (Ce). The resulting statistical model contains the different fossil-fuel emission sources and their interactions, which are specified and ranked by the percentage of their annual contribution to carbon dioxide in the atmosphere. Because multicollinearity exists among the risk factors, the proposed model is compared with different penalization methods, and it gives excellent results according to the root mean square error (RMSE) statistic. The results of the proposed model are also compared with previous results for other regions of the world.
Global warming is not a phenomenon that could happen; it is a phenomenon that is happening. We are witnessing the effects of climate change in Arctic ice levels, which are the lowest since scientists began recording them. The circulation of radiation that warms the earth is referred to as the greenhouse effect, and the gases involved are called greenhouse gases; they mainly include carbon dioxide, methane, water vapor, and chlorofluorocarbons.
Recently, the UN’s Intergovernmental Panel on Climate Change (IPCC, 2018) [
The warmest year on record since 1850 is 2016 with a central estimate of 1.15˚C above the same baseline [
Indeed, carbon dioxide is released into the atmosphere from both natural sources and human activities, such as the burning of fossil fuels for energy. Besides, in 2017 Li et al. [
This is consistent with the IPCC report (April 2007) [
According to the United Nations Fact Sheet on Climate Change [
Regions facing an inadequate supply of water, especially in North Africa, would see climate change further threaten sustainable development because of the demand for water. On the other hand, African countries affected by HIV/AIDS, poverty, political instability, internal/civil wars, and setbacks in policy making and economic reform may lack the funds and resources to tackle these expected significant climate change problems.
Usually, atmospheric CO2 emissions from fossil fuel combustion and industrial operations are divided into seven sources [
In the present study, the real yearly CO2 emissions data for each of the fossil fuels for the African continent were obtained from the Carbon Dioxide Information Analysis Center (CDIAC); these annual data cover the years 1963 to 2014. All emission estimates are expressed in metric tons of carbon (MT). In developing the statistical model, the response variable is the CO2 in the atmosphere; hence, we develop an analytical model that contains the significant contributable variables and important interactions, along with higher-order contributions if applicable.
The proposed model relies on several assumptions, such as linearity, the absence of multicollinearity, and normality of the errors. The carbon dioxide dataset shows that the attributable variables are highly correlated; thus, the parameters are challenging to interpret. When independent variables are highly correlated, the parameter estimates become very unstable and the model tends to over-fit. Therefore, we apply different penalized regression methods: Ridge Regression (L2) [
The proposed statistical model is useful in predicting the CO2 in the atmosphere given the values of the significant attributable variables. Also, we rank the attributable variables according to their percentage contribution to CO2 emissions in the atmosphere. The validation and quality of the proposed analytical model have been statistically evaluated using R square ($R^2$), adjusted R square ($R^2_{adj}$), the root mean square error (RMSE) statistic, and residual analysis. Finally, its usefulness has been illustrated by utilizing different combinations of various attributable variables.
To our knowledge, no such statistical model has been developed under the proposed logical structure for Africa. Also, we wanted to rank the explanatory variables according to their CO2 contributions in the atmosphere and compare them with those of the United States [
The CO2 emission data were obtained from the Carbon Dioxide Information Analysis Center (CDIAC), located at Oak Ridge National Laboratory (a division of the US Department of Energy). The plot of the yearly CO2 emissions in the atmosphere is shown in
The African CO2 emissions show an increasing pattern over the years 1964 to 1988. However, the years from 1990 to 2005 show nonstationary behavior in CO2 emissions as a function of time. The period 2002 to 2008 shows a noticeable increase in CO2 emissions, before a slight decrease in the years 2010 to 2013. This was probably due to the socio-economic and political crises that Africa was experiencing during these periods.
In developing the statistical model for CO2 emissions as a function of the attributable variables, one of the underlying assumptions is that the response variable should follow the Gaussian probability distribution. The mid-values of CO2 in the atmosphere fall reasonably close to a straight line, but the ends are somewhat skewed, as can be seen from the QQ plot in
The collinearity assumption of the model is shown in
However, a schematic diagram [
A statistical model describes the relationship of the response variable (i.e., the quantity whose behavior we are trying to model) with the attributable variables. We proceed to develop the statistical model, which gives CO2 in the atmosphere as a function of the five attributable variables and all possible interactions, as previously presented. One general form of a model with all possible interactions and an additive error structure could, in this particular case, be expressed as follows:
$$\mathrm{CO_2} = \beta_0 + \sum_i \alpha_i x_i + \sum_j \gamma_j k_j + \varepsilon_i, \qquad (1)$$
where $\beta_0$ is the intercept of the model, $\alpha_i$ is the coefficient of the $i$th individual attributable variable $x_i$, $\gamma_j$ is the coefficient of the $j$th interaction term $k_j$, and $\varepsilon_i$ denotes the random disturbance or residual error of the model.
One of the underlying assumptions in constructing the above model is that the response variable should follow the Gaussian probability distribution. As we illustrated above, the dependent variable, CO2 emission, does not follow the Gaussian probability distribution. Therefore, we must utilize the Johnson Transformation [
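$$z = \gamma + \delta \ln\left(\frac{x - \epsilon}{\lambda + \epsilon - x}\right),$$
and
$$\mathrm{TCO_2} = -4.4174 + 1.4857 \ln\left(\frac{\mathrm{CO_2} - (-0.9845)}{0.3444 - \mathrm{CO_2}}\right). \qquad (2)$$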
Hence, TCO2 represents the new response variable after the Johnson Transformation has been applied. Again, we check the normality condition on the TCO2 data, which then follows the normal probability distribution as is clearly seen by
In order to develop our statistical model, we begin with the full statistical model, which included all five attributable variables as previously defined and ten possible interactions between each pair. Thus, initially, we start structuring our model with fifteen total terms that include the primary contribution of attributable variables and all possible interactions.
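The count of fifteen terms (five main effects plus ten pairwise interactions) can be checked with a short sketch. The short variable names follow the abbreviations used in the text; the helper names `full_model_terms` and `design_row` and the sample values are illustrative assumptions, not part of the original analysis.

```python
from itertools import combinations

# Abbreviations of the five attributable variables used in the text.
MAIN_EFFECTS = ["Li", "So", "Ga", "Gf", "Ce"]


def full_model_terms(names):
    """Return the main effects followed by all pairwise interaction labels."""
    interactions = [f"{a}.{b}" for a, b in combinations(names, 2)]
    return list(names) + interactions


def design_row(values):
    """Expand one observation (dict: name -> value) into the 15 model columns."""
    row = [values[name] for name in MAIN_EFFECTS]
    row += [values[a] * values[b] for a, b in combinations(MAIN_EFFECTS, 2)]
    return row


terms = full_model_terms(MAIN_EFFECTS)
print(len(terms))   # 15 = 5 main effects + 10 pairwise interactions
print(terms[5:])    # the ten interaction labels

# Hypothetical values, used only to show the expansion of one observation.
example = {"Li": 2.0, "So": 3.0, "Ga": 5.0, "Gf": 7.0, "Ce": 11.0}
print(design_row(example))
```

Backward elimination would then start from these fifteen columns and drop non-significant terms one at a time.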
Since we started with the full statistical model (fifteen terms), as we mentioned above, we shall apply the backward elimination process to determine the significant contributions of both the individual attributable variables and interactions. Moreover, backward elimination is considered one of the best traditional methods in the case of having a small set of features to tackle overfitting and perform feature selection [
The estimation process of our statistical analysis has shown that four of the five risk factors and seven interaction terms contribute significantly. Thus, the best proposed statistical model, with all significant attributable variables and interactions, for accurately estimating CO2 emissions in the atmosphere in Africa is given by Equation (3) below.
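$$\begin{aligned}
\widehat{\mathrm{TCO_2}} = {} & -6.658 \times 10^{0} + 1.800 \times 10^{-4}\,\mathrm{So} + 1.145 \times 10^{-4}\,\mathrm{Li} - 8.821 \times 10^{-6}\,\mathrm{Ga} \\
& - 1.949 \times 10^{-4}\,\mathrm{Ce} + 1.259 \times 10^{-4}\,\mathrm{Gf} - 1.276 \times 10^{-9}\,\mathrm{So \cdot Li} - 4.896 \times 10^{-9}\,\mathrm{So \cdot Ga} \\
& + 1.849 \times 10^{-8}\,\mathrm{So \cdot Ce} - 6.158 \times 10^{-9}\,\mathrm{So \cdot Gf} + 5.616 \times 10^{-9}\,\mathrm{Li \cdot Ga} \\
& - 1.137 \times 10^{-8}\,\mathrm{Li \cdot Ce} + 5.723 \times 10^{-8}\,\mathrm{Ce \cdot Gf} \qquad (3)
\end{aligned}$$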
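As a minimal sketch of how Equation (3) is evaluated, the snippet below stores the published coefficients in a dictionary keyed by model term and sums the contributions. The function name `predict_tco2` and the input values are hypothetical; real inputs would be the annual emission figures in the units of the CDIAC data.

```python
# Coefficients copied from Equation (3); each key lists the variables in a term.
INTERCEPT = -6.658
COEF = {
    ("So",): 1.800e-4,
    ("Li",): 1.145e-4,
    ("Ga",): -8.821e-6,
    ("Ce",): -1.949e-4,
    ("Gf",): 1.259e-4,
    ("So", "Li"): -1.276e-9,
    ("So", "Ga"): -4.896e-9,
    ("So", "Ce"): 1.849e-8,
    ("So", "Gf"): -6.158e-9,
    ("Li", "Ga"): 5.616e-9,
    ("Li", "Ce"): -1.137e-8,
    ("Ce", "Gf"): 5.723e-8,
}


def predict_tco2(x):
    """Evaluate the transformed response of Equation (3) for one observation.

    `x` maps the abbreviations So, Li, Ga, Ce, Gf to their values.
    """
    total = INTERCEPT
    for term, beta in COEF.items():
        value = 1.0
        for name in term:          # one factor for a main effect, two for an interaction
            value *= x[name]
        total += beta * value
    return total


# Hypothetical inputs, chosen only to exercise the function.
sample = {"So": 1000.0, "Li": 0.0, "Ga": 0.0, "Ce": 0.0, "Gf": 0.0}
print(predict_tco2(sample))
```

The result is on the Johnson-transformed scale and still has to be mapped back to CO2.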
The TCO2 estimate obtained from Equation (3) is based on the Johnson transformation of the data; thus, we apply the inverse transformation to Equation (3) to estimate the desired, actual CO2 emissions in the atmosphere as follows:
$$\widehat{\mathrm{CO_2}} = \frac{-0.05 + e^{0.673 \times \widehat{\mathrm{TCO_2}}}}{0.051 + e^{0.673 \times \widehat{\mathrm{TCO_2}}}}. \qquad (4)$$
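The transformation and its inverse can be sketched as follows, assuming the bounded Johnson form given above with the parameter values read from Equation (2). The function names are illustrative, and the inverse here is the exact algebraic inverse of Equation (2) rather than an expression built from rounded constants.

```python
import math

# Parameters read from Equation (2): z = GAMMA + DELTA * ln((x - EPS) / (UPPER - x))
GAMMA = -4.4174
DELTA = 1.4857
EPS = -0.9845     # lower bound of the variable being transformed
UPPER = 0.3444    # upper bound, i.e. lambda + epsilon in the general form


def johnson_sb(x):
    """Forward transformation; x must lie strictly between EPS and UPPER."""
    if not EPS < x < UPPER:
        raise ValueError("x is outside the support of the transformation")
    return GAMMA + DELTA * math.log((x - EPS) / (UPPER - x))


def johnson_sb_inverse(z):
    """Solve z = johnson_sb(x) for x."""
    u = math.exp((z - GAMMA) / DELTA)
    return (EPS + UPPER * u) / (1.0 + u)


# Round trip on a few illustrative values inside the support.
for x in (-0.9, -0.3, 0.0, 0.3):
    z = johnson_sb(x)
    print(x, round(z, 4), round(johnson_sb_inverse(z), 6))
```

Note that 1/DELTA is approximately 0.673, the exponent multiplier that appears in Equation (4).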
The proposed model will help scientists understand how the typical value of the carbon dioxide emissions in the atmosphere in Africa changes when any one of the five attributable variables is varied while the other attributable variables are held fixed. The same holds for the significant interactions. Most commonly, the model will estimate the conditional expectation of the carbon dioxide emissions given the attributable variables. Furthermore, we illustrate the percentage that the attributable variables and the interactions contribute to CO2 in the atmosphere by
To assess the quality of the proposed statistical model, we use both the coefficient of determination, $R^2$, and the adjusted $R^2$, which are the key criteria for evaluating the model fit.
The regression sum of squares (SSR) is a measure of the variation that is explained by the proposed model. The sum of squared errors (SSE), also called the residual sum of squares, is the variation that is left unexplained. The total sum of squares (SST) is proportional to the sample variance and equals the sum of SSR and SSE. The coefficient of determination $R^2$ is defined as the proportion of the total response variation that is explained by the proposed model, and it measures how well the fitted model approximates the real data points. Thus, $R^2$ is given by
$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
However, $R^2$ itself does not account for the number of variables in the model, and it never decreases when more terms are added. The adjusted $R^2$ corrects for the degrees of freedom of the model and therefore takes the number of parameters into account. The adjusted $R^2$ is
$$R^2_{adj} = 1 - \frac{\mathrm{SSE}/df_{error}}{\mathrm{SST}/df_{total}}.$$
For our final statistical model, the R squared is 0.9728 and the adjusted R squared is 0.9644. Both values are very high (more than 90%) and very close to each other. That is, the developed statistical model explains 97.28% of the variation in the response variable, which indicates a very high-quality model. In other words, the risk factors that we included in the model, along with the relevant interactions, explain about 97% of the Africa CO2 emissions (metric tons per capita) in the atmosphere. These results show that the high value of R squared is not due to the number of predictors but to the good quality of the proposed statistical model.
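The relation between the two reported values can be checked numerically. The sketch below assumes 52 annual observations (1963 to 2014 inclusive) and 12 model terms besides the intercept, as in Equation (3); both counts are inferred from the text rather than stated alongside the reported statistics.

```python
def r_squared(sse, sst):
    """Coefficient of determination from the error and total sums of squares."""
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n_obs, n_terms):
    """Adjusted R^2 = 1 - (SSE/df_error) / (SST/df_total), rewritten in terms of R^2."""
    df_error = n_obs - n_terms - 1   # the intercept uses one degree of freedom
    df_total = n_obs - 1
    return 1.0 - (1.0 - r2) * df_total / df_error


# Assumed counts: 52 yearly observations and 12 terms in the final model.
adj = adjusted_r_squared(0.9728, n_obs=52, n_terms=12)
print(round(adj, 4))  # 0.9644
```

Under these assumptions the adjusted value agrees with the reported 0.9644.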
The table below ranks the significant attributable variables and interactions with respect to their contribution to CO2 in the atmosphere; that is, the terms are listed by their percentage of contribution. As we expected, Li, one of the risk factors arising from fossil-fuel emissions, ranks number one.
Rank | Variables |
---|---|
1 | Liquid Fuels |
2 | Solid Fuels |
3 | Liquid Fuels ∩ Solid Fuels |
4 | Solid Fuels ∩ Gas Flares |
5 | Cement ∩ Gas Flares |
6 | Cement |
7 | Gas Fuels |
8 | Solid Fuels ∩ Cement |
9 | Solid Fuels ∩ Gas Fuels |
10 | Liquid Fuels ∩ Gas Fuels |
11 | Liquid Fuels ∩ Cement |
12 | Gas Flares |
Again the percentage of their contributions is shown in
The presence of collinearity inflates the standard errors of the estimated coefficients, and it makes some attributable variables appear statistically insignificant when they should be significant and stable. In developing the proposed statistical model for CO2 emissions, the ordinary least squares (OLS) method has been used to obtain approximate estimates of the coefficients of the contributable variables.
To address the multicollinearity problem, regularization methods are used. These methods add a penalty, controlled by the regularization parameters λ and α, on the regression coefficients of the individual attributable variables, so that the model generalizes beyond the training data and over-fitting is prevented. The starting point is a cost function of the form
$$\mathrm{CO_2} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2.$$
Hence, we can group the penalized models into three categories. The Ridge regression method adds the squared magnitude of the coefficients as a penalty term to the loss function:
$$\mathrm{CO_2} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (5)$$
The Lasso regression method adds the absolute value of the coefficients as a penalty term to the loss function:
$$\mathrm{CO_2} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (6)$$
and the Elastic Net regression method, which is a mix of the Ridge and Lasso techniques, is defined by
$$\mathrm{CO_2} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \left[ (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right]. \qquad (7)$$
In Equations (5)-(7), the three models have the same structure as our proposed model in Equation (1); only the coefficient estimates differ, because of the randomness in choosing the training data set. They also include two tuned hyper-parameters, lambda = 0.0001 and alpha = 1 (penalty terms), which give the smallest RMSE, as shown in
Technique | RMSE |
---|---|
Proposed Model | 0.261 |
Ridge Model | 0.484 |
Lasso Model | 0.307 |
Elastic Net Model | 0.307 |
We utilize two methods to perform the model validation. The first method is to use the proposed model to calculate the predicted value of CO2 for each individual observation and then calculate the residuals.
Thus, the residual analysis of the complete model is used to attest to the quality of the developed statistical model; each residual is the observed annual CO2 emission in the atmosphere (response) minus the model estimate of the CO2 emission.
The residual analysis also justifies the model assumptions of normality and constant error variance. For the developed statistical model, the mean residual is equal to zero, which indicates that the predictions from our statistical model are very good; the variance of the residuals is 0.03, the standard deviation is 0.16, and the standard error of the residuals is 0.19. These are very good statistics that support the high quality of the model. The results are shown in Q-Q plot in
From the Q-Q plot, we can clearly see that the residuals are approximately normally distributed within the 95% confidence interval, and the scatter plot illustrates an approximately zero mean and no clear pattern or trend in the residuals.
The second method is repeated cross-validation. The basic idea is to use 10-fold cross-validation and repeat it five times, with the folds split differently in each repetition. In 10-fold cross-validation, the data set is divided into ten subsets of approximately equal size. Each subset is taken in turn as the testing set, and the remaining (10-1) subsets are taken as the training set for the proposed model.
After each repetition of the cross-validation, the model assessment metric is computed. The root mean square error (RMSE) is selected as the cost function, which is given by:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}. \qquad (8)$$
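A minimal sketch of the resampling scheme and of Equation (8) is given below. It only generates the train/test index splits for five repetitions of 10-fold cross-validation and computes the RMSE; the function names, the seed, and the choice of 52 observations (the years 1963 to 2014) are assumptions for illustration, and the model fitting step is left out.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error as in Equation (8)."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def repeated_kfold(n_obs, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each repetition reshuffles the indices."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        # Deal the shuffled indices into n_folds nearly equal subsets.
        folds = [idx[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


splits = list(repeated_kfold(52))
print(len(splits))                              # 50 = 10 folds x 5 repetitions
print(len(splits[0][0]), len(splits[0][1]))     # training and testing sizes of one split
print(rmse([0.0, 0.0], [1.0, 1.0]))             # 1.0
```

With 52 observations the folds cannot be exactly equal, so each testing set holds five or six years.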
We construct our model using only the training set; the constructed model has the same structure as our proposed model, and only the weights of the attributable variables differ. To enhance the reliability of the training results, we use this model to predict the CO2 value using the testing sets of the attributable variables. We repeated this procedure for each regularization technique to verify which one can be considered to improve the prediction, and then compared it with our proposed model on the basis of the RMSE. The results are shown in
We compare the statistical models in terms of the root mean square error (RMSE) of the prediction of CO2. The proposed nonlinear statistical model performed better than the other models, with the smallest RMSE of 0.261. Also, since the hyper-parameter α in Equation (7), tuned by cross-validation, equals one, the RMSE is the same for the Lasso and Elastic Net methods. Thus, the proposed statistical model is of very high quality for predicting CO2 in the atmosphere.
• Ranking of the Contributing Variables—Africa
We use the $R^2$ criterion to rank the attributable variables, along with the significant interactions, with respect to their percentage contribution to CO2 emissions in the atmosphere.
The risk variable with the biggest contribution to the CO2 emissions in Africa is Liquid Fuels, which contributes about 13% of the CO2 emissions. The next largest contribution is Solid Fuels, with about 11%. Note that ranks 3, 4, and 5 are the interactions Li ∩ So, So ∩ Gf, and Ce ∩ Gf, respectively. Summing the contributions of all twelve ranked terms, we find that they account for 97.5% of CO2 emissions in Africa.
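The 97.5% total can be reproduced by adding the percentages reported in the Africa ranking table. In the sketch below the values are copied from that table, and interaction terms are written with a dot (for example `So.Gf`) purely as a labeling convention.

```python
# Percentage contributions copied from the Africa ranking table.
AFRICA_CONTRIBUTION = {
    "Li": 12.8,
    "So": 11.3,
    "Li.So": 10.8,
    "So.Gf": 9.8,
    "Ce.Gf": 8.6,
    "Ce": 8.1,
    "Ga": 7.1,
    "So.Ce": 6.6,
    "So.Ga": 6.5,
    "Li.Ga": 6.3,
    "Li.Ce": 5.8,
    "Gf": 3.8,
}

total = sum(AFRICA_CONTRIBUTION.values())
n_interactions = sum(1 for term in AFRICA_CONTRIBUTION if "." in term)

print(round(total, 1))   # 97.5
print(n_interactions)    # 7

# The terms are already listed in descending order of contribution.
ranked = sorted(AFRICA_CONTRIBUTION, key=AFRICA_CONTRIBUTION.get, reverse=True)
print(ranked[:3])        # ['Li', 'So', 'Li.So']
```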
• Ranking of the Contributing Variables—United States
Xu and Tsokos [
Rank | Variables | Contribution (%) |
---|---|---|
1 | Liquid Fuels (Li) | 12.8 |
2 | Solid Fuels (So) | 11.3 |
3 | Li ∩ So | 10.8 |
4 | So ∩ Gf | 9.8 |
5 | Ce ∩ Gf | 8.6 |
6 | Cement (Ce) | 8.1 |
7 | Gas Fuels (Ga) | 7.1 |
8 | So ∩ Ce | 6.6 |
9 | So ∩ Ga | 6.5 |
10 | Li ∩ Ga | 6.3 |
11 | Li ∩ Ce | 5.8 |
12 | Gas Flares (Gf) | 3.8 |
Rank | Variables | Contribution (%) |
---|---|---|
1 | Liquid Fuels (Li) | 17.59 |
2 | Li ∩ Ce | 16.36 |
3 | Ce ∩ Bu | 15.73 |
4 | Bunker Fuels (Bu) | 15.06 |
5 | Cement | 10.77 |
6 | Gas Flares (Gf) | 8.95 |
7 | Gas Fuels (Ga) | 6.82 |
8 | Ga ∩ Gf | 5.43 |
9 | Li ∩ Ga | 2.25 |
10 | Li ∩ Bu | 0.02 |
• Ranking of the Contributing Variables—European Union
In 2013, Teodorescu and Tsokos [
• Ranking of the Contributing Variables—South Korea
Similarly, in 2015, Kim and Tsokos [
• Ranking of the Contributing Variables—Middle East
Recently, Habadi and Tsokos [
Rank | Variables | Contribution (%) |
---|---|---|
1 | Gas Fuels (Ga) | 48.72 |
2 | Li ∩ Bu | 12.41 |
3 | Li² | 11.79 |
4 | Bu² | 7.78 |
5 | Gas Flares (Gf) | 6.66 |
6 | Li ∩ Gf | 5.06 |
7 | Li ∩ Bu | 4.71 |
8 | Liquid Fuels (Li) | 2.86 |
Rank | Variables | Contribution (%) |
---|---|---|
1 | Liquid Fuels (Li) | 75.37 |
2 | Solid Fuels (So) | 18.61 |
3 | So ∩ Bu | 2.008 |
4 | Ga ∩ Bu | 1.534 |
5 | Li ∩ Bu | 0.912 |
6 | Bunker Fuels (Bu) | 0.47 |
7 | Gas Fuels (Ga) | 0.224 |
8 | Li ∩ So | 0.207 |
9 | Li ∩ Ga | 0.062 |
10 | Li ∩ So ∩ Bu | 0.004 |
Rank | Variables | Contribution (%) |
---|---|---|
1 | Cement (Ce) | 15.28 |
2 | Gas Fuels (Ga) | 14.7 |
3 | Li ∩ So | 13.66 |
4 | Ga ∩ So | 13.47 |
5 | Ga ∩ Ce | 12.56 |
6 | Liquid Fuels (Li) | 10.63 |
7 | So ∩ Gf | 9.65 |
8 | Gf ∩ Ce | 7.9 |
• Global Comparison: USA, EU, S. Korea, ME and Africa
Liquid fuels are the number one attributable variable for the emission of CO2 in the atmosphere in Africa, the US, and South Korea, whereas they rank last in the EU and 6th in the Middle East.
Moreover, Gas fuels rank as the number one attributable variable in the EU; however, they rank 7th in Africa, the US, and South Korea, with contributions of 7.1%, 6.82%, and 0.224%, respectively, while in the Middle East they rank second, with a contribution of only 14.7%.
Similarly, Cement is ranked as the number one attributable variable in the Middle East; however, it is 6th in Africa, with a contribution of 8.1%, and 5th in the US, with a contribution of 10.77%.
It is also interesting to note that Africa has seven significant contributing interactions of the risk factors, while the US and South Korea have five, the Middle East has four, and the EU has only three contributing interactions to CO2 emissions.
In the present study we investigated the fossil-fuel risk factors that contribute to the spread of the most common air pollutant, namely carbon dioxide, in the atmosphere in Africa. Previous data obtained from the Carbon Dioxide Information Analysis Center (CDIAC) show that five attributable variables contribute to the emission of carbon dioxide into the atmosphere in Africa. These attributable variables are Liquid fuels (Li), Solid fuels (So), Gas fuels (Ga), Gas flares (Gf) and Cement production (Ce), in addition to seven interactions among them.
Rank | USA | EU | S. Korea | ME | Africa |
---|---|---|---|---|---|
1 | Li | Ga | Li | Ce | Li |
2 | Li ∩ Ce | Li ∩ Bu | So | Ga | So |
3 | Ce ∩ Bu | Li² | So ∩ Bu | Li ∩ So | Li ∩ So |
4 | Bu | Bu² | Ga ∩ Bu | Ga ∩ So | So ∩ Gf |
5 | Ce | Gf | Li ∩ Bu | Ga ∩ Ce | Ce ∩ Gf |
6 | Gf | Li ∩ Gf | Bu | Li | Ce |
7 | Ga | Li ∩ Bu | Ga | So ∩ Gf | Ga |
8 | Ga ∩ Gf | Li | Li ∩ So | Gf ∩ Ce | So ∩ Ce |
9 | Li ∩ Ga | - | Li ∩ Ga | - | So ∩ Ga |
10 | Li ∩ Bu | - | Li ∩ So ∩ Bu | - | Li ∩ Ga |
11 | - | - | - | - | Li ∩ Ce |
12 | - | - | - | - | Gf |
In our study, we built a data-driven statistical model in which we discovered that all five attributable variables contribute significantly to the emission of carbon dioxide in the atmosphere, along with seven significant interactions that were not previously known to be among the factors that significantly drive the emission of carbon dioxide in the atmosphere of the African continent.
The identification of the significance of the five attributable variables and the seven interactions was based on a well-structured statistical data analysis. The data we obtained did not follow the Gaussian probability distribution. We therefore used the Johnson transformation to transform the response variable (i.e., carbon dioxide) to make it Gaussian, so that we could proceed with the statistical modeling.
Multicollinearity was present among the risk factors. However, our model was compared with different penalization techniques and provided very good results according to the RMSE statistic. In statistical modeling, specifically in regression modeling, the parameter coefficients and p-values are affected by multicollinearity. However, this does not affect our predictions, their precision, or the goodness of fit of our model. We do not have to be concerned about the severity of multicollinearity in our model if our main aim is to make predictions [
The proposed model has high predictive accuracy, which is supported by the high values of $R^2$ and adjusted $R^2$. Furthermore, we ranked the attributable variables in descending order by their percentage contribution to the emission of CO2 in the atmosphere. Liquid fuels were ranked as the highest contributor to the emission of CO2 in Africa, at 12.8%, whereas Gas flares are the smallest contributor, at 3.8%. Interestingly, countries like the United States and South Korea also have Liquid fuels as the leading cause of CO2 in the air [
The usefulness of the proposed model in the subject area can be summarized as follows. First, we can obtain excellent predictions of CO2 emissions in the atmosphere given the values of the attributable variables. Second, we identify the significant individual attributable variables. Third, we identify the significant interactions that exist in the model. Fourth, we rank the individual attributable variables and interactions by their percentage contribution to the response, namely CO2 emissions in the atmosphere.
Furthermore, with this proposed model one can proceed to perform response surface analysis, that is, to determine with a high degree of accuracy the values of the attributable variables that keep CO2 in the atmosphere from rising above a critical value.
Thus, we want to obtain the values of those attributable variables for which the specified value of CO2 in the atmosphere is not exceeded, and we want to be at least 95% certain that these values keep CO2 in the atmosphere within the acceptable level.
In addition, a single world policy for global warming is not appropriate, because the five regions of the world that have been studied respond differently with respect to CO2. Our findings suggest that it would be a waste of time and resources to manage the world's increasing global warming through uniform global policies. Our study indicates that well-structured regional policies, rather than global environmental policies, are needed to address the world problem of global warming.
Finally, our proposed statistical model is highly useful for decision making and strategic planning on controlling the air pollutant CO2 in the atmosphere in Africa.
The authors wish to express their appreciation to T. J. Blasing, Carbon Dioxide Information Analysis Center, Environmental Sciences Division, Oak Ridge National Laboratory, for supplying the source of the data and for his helpful suggestions. We also wish to thank the Faculty of Public Health, the University of Benghazi, for funding the research, along with the support provided by Prof. Chris P. Tsokos.
The authors declare no conflicts of interest regarding the publication of this paper.
Abu Sheha, M.A. and Tsokos, C.P. (2019) Statistical Modeling of Emission Factors of Fossil Fuels Contributing to Atmospheric Carbon Dioxide in Africa. Atmospheric and Climate Sciences, 9, 438-455. https://doi.org/10.4236/acs.2019.93030