Statistical Modeling of Emission Factors of Fossil Fuels Contributing to Atmospheric Carbon Dioxide in Africa

Global warming is closely tied to the increase in atmospheric carbon dioxide (CO2) emissions and the accompanying rise in atmospheric temperature. The continued accumulation of CO2 in the atmosphere is a major part of the climate change problem. This study develops a data-driven statistical model, using real fossil-fuel CO2 emissions data for Africa, to identify the significant attributable variables and the interactions among them that produce carbon dioxide emissions. We consider five attributable variables: Liquid fuels (Li), Solid fuels (So), Gas fuels (Ga), Gas flares (Gf), and Cement production (Ce). The individual fossil-fuel emissions and their interactions retained in the model are ranked by the percentage of their annual contribution to carbon dioxide in the atmosphere. Because multicollinearity exists among the risk factors, the proposed model is compared with several penalized regression methods, and it performs very well according to the root mean square error (RMSE) statistic. The results of the proposed model are compared with previous results for other countries and regions of the world.

temperatures are already measuring about 1.0˚C which means the planet is two-thirds of the way there.
The warmest year on record since 1850 is 2016, with a central estimate of 1.15˚C above the same baseline [2]. Scientists around the globe have gathered extensive evidence that the earth is rapidly warming. As the concentration of carbon dioxide (CO2) in the earth's atmosphere increases, so does the temperature; the two are directly connected. The latest report prepared by the UK's Met Office Hadley Centre (Richard Betts, 2019) [2] pointed out that in 2019 the average CO2 concentration in the earth's atmosphere was expected to increase by 2.8 ppm to reach 411 ppm, and that this would be the most significant rise in the concentration of atmospheric carbon dioxide in 62 years of records.
Indeed, carbon dioxide is released into the atmosphere from both natural sources and human activities, such as the burning of fossil fuels for energy. In 2017, Li et al. [3] found that economic growth, resident population growth, and energy intensity enhancement were the major significant growth factors of carbon emissions in Beijing. In the same vein, the IPCC report (April 2007) [4] stated that "Africa was not acting quickly enough to stem the dire economic and environmental consequences of greenhouse gas emissions". According to the same report, South Africa was ranked in 2008 as the 13th largest carbon dioxide emitter among all the countries in the world, based on the record of fossil-fuel consumption and cement production, with 119 million metric tons of carbon emitted as CO2. Thus, South Africa is considered the largest CO2-emitting country on the continent of Africa.
According to the United Nations Fact Sheet on Climate Change [5], Africa is the continent most vulnerable to the impacts of climate change. Most vulnerable are the Seychelles islands, Cape Verde, and Mauritius, as well as large African deltas such as the Niger Delta, the Nile Delta in Egypt, and the Kalahari and Okavango deltas in Botswana. Most of the continent is already experiencing temperature increases of approximately 0.7˚C, with predictions that temperatures will rise further; in addition, Africa is facing a wide range of impacts, including increased drought and floods. Climate change has already aggravated conditions in parts of Africa. For example, the total available water in the large basins of Senegal, Lake Chad, and Niger has decreased by 40 to 60 percent, and many climate models project declining precipitation in the already-dry regions of Southern Africa [5].
Regions that face an inadequate supply of water, especially in North Africa, will find sustainable development further threatened by climate change because of the demand for water. Moreover, African countries that are affected by HIV/AIDS, poverty, political instability, internal or civil wars, and drawbacks in policy making and economic reforms may lack the funds and resources to tackle these expected significant climate change problems. CO2 emitted into the atmosphere from fossil fuel combustion and industrial operations is usually divided into seven sources [6], based on the chemical form of the fossil fuels: Solid fuels (So), which include wood, charcoal, coal, and others; Liquid fuels (Li), such as the gasoline that we regularly use to create mechanical energy; Gas fuels (Ga), gases consisting essentially of methane; Gas flares (Gf), the vertical stacks on oil wells or at natural gas well completion activities; Cement production (Ce); oxidation of non-fuel hydrocarbons (Hy); and fuel from bunkers (Bu), used for shipping and air transportation. These seven emission sources, together with their interactions, are the candidate attributable variables for atmospheric CO2 in our statistical modeling. Because information on bunkers (Bu) and oxidation of non-fuel hydrocarbons (Hy) is not available in the Africa database, our model uses five attributable variables in this study.
In the present study, the real yearly CO2 emissions data for each of the fossil fuels for the African continent were obtained from the Carbon Dioxide Information Analysis Center (CDIAC); the annual data cover the years 1963 to 2014. All emission estimates are expressed in metric tons of carbon (MT). In developing the statistical model, the response variable is the CO2 in the atmosphere; hence, we develop an analytical model that contains the significant attributable variables and important interactions, along with higher-order contributions if applicable.
The proposed model relies on several assumptions, namely linearity, the absence of severe multicollinearity, and normality of the errors. The carbon dioxide dataset shows that the attributable variables are highly correlated; thus, the parameters are challenging to interpret. When independent variables are highly correlated, the parameter estimates become very unstable and the model is prone to over-fitting. We therefore also apply three penalized regression methods: Ridge Regression (L2) [7], Lasso Regression (L1) [8], and Elastic Net (EN) [9]. These methods are widely used to address over-fitting. To our knowledge, no such statistical model has been developed under the proposed logical structure for Africa. We also rank the explanatory variables according to their CO2 contributions in the atmosphere and compare them with those of the United States [10] [11], the European Union [12], South Korea [13], and the Middle East [14]. Finding an appropriate statistical model for predicting carbon emissions is therefore imperative.

The Data
The CO2 emission data were obtained from the Carbon Dioxide Information Analysis Center (CDIAC), located at Oak Ridge National Laboratory (US Department of Energy). The plot of the yearly CO2 emissions in the atmosphere is shown in Figure 1.
The African CO2 emissions show an increasing pattern over the years 1964 to 1988. However, the years from 1990 to 2005 show nonstationary behavior in CO2 emissions as a function of time. The period 2002 to 2008 shows a noticeable increase in CO2 emissions, followed by a slight decrease in the years 2010 to 2013. This was probably due to the socio-economic and political crises that Africa was experiencing during these periods.
In developing the statistical model for CO2 emissions as a function of the attributable variables, one of the underlying assumptions is that the response variable should follow the Gaussian probability distribution. In the QQ plot in Figure 2, the mid-range values of CO2 in the atmosphere lie reasonably close to a straight line, but the ends are somewhat skewed. The Shapiro-Wilk goodness-of-fit test (p-value = 8.952e−05) likewise indicates that the data do not follow the normal probability distribution. Thus, both the QQ plot and the test support the conclusion that the atmospheric CO2 data do not follow the Gaussian probability distribution.
The collinearity among the variables is shown in Figure 3, where negative correlations are displayed in red and positive correlations in blue. Color intensity is proportional to the correlation coefficient of each pair. The variables Gas-Fuels (Ga), Solid-Fuels (So), Liquid-Fuels (Li), Bunker-Fuels (Bu), and Cement (Ce) are highly positively correlated, so at this point we consider regularization techniques such as Ridge Regression (L2), Lasso Regression (L1), and Elastic Net penalties to address over-fitting. At the same time, there are enough statistically significant linear relationships between CO2 and Africa's fossil-fuel CO2 emissions to build a high-quality multiple regression model. A schematic diagram [15] of the relationship between the attributable variables and carbon dioxide in the atmosphere is shown in Figure 4.
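The pairwise correlations summarized in a plot such as Figure 3 can be computed along the lines of the following minimal sketch. The function names and the toy numbers are assumptions made for illustration; they are not the CDIAC data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Map each pair of column names to the correlation of the two columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Toy values (made up): two series that rise together and one that falls.
toy = {"Li": [1.0, 2.0, 3.0, 4.0], "So": [2.0, 4.0, 6.0, 8.0], "Gf": [4.0, 3.0, 2.0, 1.0]}
corr = correlation_matrix(toy)
```

Entries close to +1 or −1 between predictors are the signal of collinearity that motivates the penalized methods.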

Statistical Modeling
A statistical model describes the relationship between the response variable (i.e., the quantity we are trying to model) and the attributable variables. We proceed to develop the statistical model, which gives CO2 in the atmosphere as a function of the five attributable variables and all possible interactions, as previously presented. One of the simplest forms of a model with all possible interactions and an additive error structure can, in this particular case, be expressed as Equation (1), where $\gamma_j$ is the coefficient of the $j$th interaction term $k_j$, and $\varepsilon_i$ denotes the random disturbance or residual error of the model.
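For orientation, a model with main effects, pairwise interactions, and an additive error is commonly written in the general form sketched below. Only $\gamma_j$, $k_j$, and the error term are named in the text; the intercept $\beta_0$, the main-effect coefficients $\alpha_i$, and the variable symbols $x_i$ are notation assumed here for illustration, and the observation index on the error term is suppressed.

```latex
\begin{equation*}
CO_2 = \beta_0 + \sum_{i=1}^{5} \alpha_i x_i + \sum_{j=1}^{10} \gamma_j k_j + \varepsilon
\end{equation*}
```

In this sketch the $x_i$ stand for the five attributable variables (Li, So, Ga, Gf, Ce), and each $k_j$ is taken, under the usual convention, to be the product of one pair of them.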
One of the underlying assumptions in constructing the above model is that the response variable should follow the Gaussian probability distribution. As illustrated above, the dependent variable, CO2 emission, does not follow the Gaussian probability distribution. Therefore, we apply the Johnson transformation [16] to the carbon dioxide data so that the transformed data follow the normal probability distribution, which results in Equation (2). Here, TCO2 denotes the new response variable after the Johnson transformation has been applied. We again check the normality condition on the TCO2 data, which now follow the normal probability distribution, as is clearly seen in Figure 5. We then proceed to estimate the coefficients (weights) of the attributable variables for the transformed CO2 data in Equation (2).
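As a rough illustration of how a Johnson transformation and its anti-transformation work, the sketch below uses the unbounded S_U member of the Johnson family. This choice is an assumption: the text does not state which family or which fitted parameters were used, and the parameter values shown are hypothetical.

```python
import math

def johnson_su_forward(x, gamma, delta, xi, lam):
    """Map a raw value x to a transformed score z (Johnson S_U family)."""
    return gamma + delta * math.asinh((x - xi) / lam)

def johnson_su_inverse(z, gamma, delta, xi, lam):
    """Anti-transformation: recover the raw-scale value from a score z."""
    return xi + lam * math.sinh((z - gamma) / delta)

# Hypothetical parameters, for illustration only (not fitted to the CDIAC data).
params = dict(gamma=-0.5, delta=1.2, xi=100.0, lam=50.0)
raw = 180.0
z = johnson_su_forward(raw, **params)    # value on the transformed scale
back = johnson_su_inverse(z, **params)   # returns to the raw scale
```

The inverse function plays the role of the anti-transformation: a prediction made on the transformed scale is mapped back to the original emission scale.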
To develop our statistical model, we begin with the full statistical model, which includes all five attributable variables as previously defined and the ten possible pairwise interactions among them. Thus, we initially structure our model with fifteen terms in total: the main contributions of the attributable variables and all possible pairwise interactions.
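The count of fifteen terms (five main effects plus ten pairwise interactions) can be checked with a short sketch. The term labels and the use of products for interactions are assumptions made for illustration.

```python
from itertools import combinations

variables = ["Li", "So", "Ga", "Gf", "Ce"]

def full_model_terms(names):
    """Return the main-effect labels followed by all pairwise interaction labels."""
    main = list(names)
    pairs = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return main + pairs

def design_row(values):
    """Build one row of predictors: raw values, then pairwise products."""
    names = list(values)
    row = [values[n] for n in names]
    row += [values[a] * values[b] for a, b in combinations(names, 2)]
    return row

terms = full_model_terms(variables)  # 5 main effects + C(5, 2) = 10 interactions
```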
Starting from the full statistical model (fifteen terms), we apply the backward elimination process to determine the significant contributions of both the individual attributable variables and the interactions. Backward elimination is considered one of the best traditional methods for tackling over-fitting and performing feature selection when the set of features is small [17].
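The backward elimination loop can be sketched as follows. The text does not state the elimination criterion, so a p-value threshold of 0.05 is assumed here, and `p_values` is a hypothetical callback that refits the model on the remaining terms and returns one p-value per term; any regression routine could supply it.

```python
def backward_eliminate(terms, p_values, threshold=0.05):
    """Drop the least significant term, refit, and repeat until all terms pass.

    p_values: hypothetical callable taking the list of remaining terms and
    returning a dict {term: p-value} for the model refitted on those terms.
    """
    kept = list(terms)
    while kept:
        pv = p_values(kept)
        worst = max(kept, key=lambda t: pv[t])
        if pv[worst] <= threshold:
            break  # every remaining term is significant at the threshold
        kept.remove(worst)
    return kept

# Made-up p-values that stay fixed between refits, purely to show the loop.
toy = {"Li": 0.001, "So": 0.01, "Ga": 0.4, "Li*So": 0.03, "Ga*Ce": 0.2}
selected = backward_eliminate(list(toy), lambda ts: {t: toy[t] for t in ts})
```

In a real analysis the p-values change after every refit, which is why the callback is invoked inside the loop rather than once up front.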
The estimation process has shown that four of the five risk factors and seven interaction terms contribute significantly. Thus, the best proposed statistical model, with all significant attributable variables and interactions, that accurately estimates CO2 emissions in the atmosphere in Africa is given by Equation (3).
The TCO2 estimate obtained from Equation (3) is based on the Johnson transformation of the data; thus, we apply the anti-transformation to Equation (3) to estimate the actual CO2 emissions in the atmosphere. The values of R squared and adjusted R squared are high and very close to each other. That is, the developed statistical model explains 97.28% of the variation in the response variable, which indicates a very high-quality model. Similarly, the risk factors included in the model, along with the relevant interactions, account for 97% of the Africa CO2 emissions (metric tons per capita) in the atmosphere. These results show that the high value of R squared is not due to the number of predictors but to the good quality of the proposed statistical model.
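The reasoning behind comparing R squared with adjusted R squared can be made concrete with the standard definitions, sketched below. The function names and toy numbers are assumptions for illustration and do not reproduce the fitted model.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R squared for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Toy observations and predictions (made up).
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
adj = adjusted_r_squared(0.9, n=11, p=5)
```

Because the adjusted value drops when predictors are added without improving the fit, a small gap between the two statistics suggests the fit is not an artifact of the number of terms.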
In Table 1, we rank the individual attributable variables and interactions according to the percentage of their contributions, which is also shown in Figure 6.

Penalized Regression Models
The presence of collinearity inflates the standard errors of the estimated coefficients and can make some attributable variables statistically insignificant when they should be significant and stable. In developing the proposed statistical model for CO2 emissions, the ordinary least squares (OLS) method was used to estimate the coefficients of the attributable variables.
To address the multicollinearity problem, regularization methods are used. These methods add a penalty, controlled by two small regularization parameters $\lambda$ and $\alpha$, on the regression coefficients of the individual attributable variables, so that the model generalizes beyond the training data and over-fitting is prevented. This can be expressed through the penalized cost functions in Equations (5)-(7). The Lasso regression method, for example, adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function. In Equations (5)-(7), the three models have the same structure as our proposed model in Equation (1); only the coefficient estimates differ, because of the randomness in choosing the training data set. They also include two optimal hyper-parameters, $\lambda = 0.0001$ and $\alpha = 1$ (penalty term), which give the smallest RMSE, as shown in Table 2.
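The three penalties can be viewed as special cases of one cost function, as in the sketch below. It assumes the common parametrization in which `alpha` mixes the L1 and L2 penalties and `lam` scales them; the exact form and scaling used in Equations (5)-(7) may differ, and the intercept is omitted for brevity.

```python
def penalized_loss(y, X, beta, lam, alpha):
    """Squared-error loss plus an elastic-net penalty (assumed parametrization).

    alpha = 1 gives the Lasso (L1) penalty, alpha = 0 the Ridge (L2) penalty,
    and values in between give the Elastic Net.
    """
    n = len(y)
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    l1 = sum(abs(b) for b in beta)      # Lasso part of the penalty
    l2 = sum(b * b for b in beta)       # Ridge part of the penalty
    return rss / (2 * n) + lam * (alpha * l1 + (1 - alpha) / 2 * l2)

# Toy data (made up) with a perfect fit, so only the penalty remains.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [1.0, 2.0]
lasso_cost = penalized_loss(y, X, beta, lam=0.1, alpha=1.0)
ridge_cost = penalized_loss(y, X, beta, lam=0.1, alpha=0.0)
```

Under this parametrization, a tuned value of alpha equal to one makes the Elastic Net cost identical to the Lasso cost.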

Validation of the Proposed Models
We use two methods to validate the model. The first is to use the proposed model to calculate the predicted value for each individual observation. The second is repeated cross-validation: we use 10-fold cross-validation and repeat it five times, with the folds split differently in each repetition. In 10-fold cross-validation, the data set is divided into ten equal subsets. Each subset is taken in turn as the testing set, and the remaining nine (10 − 1) subsets are used as the training set for the proposed model.
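The splitting scheme described above (10 folds, repeated five times with a fresh shuffle each time) can be sketched as follows. The function name, the seed, and the sample size of 52 (one observation per year from 1963 to 2014) are assumptions made for illustration.

```python
import random

def repeated_kfold(n, k=10, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` reshuffled rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                        # a different split in each repetition
        folds = [idx[i::k] for i in range(k)]   # k folds of near-equal size
        for f in range(k):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield train, test

splits = list(repeated_kfold(52))  # 5 repetitions x 10 folds
```

Each observation appears in exactly one testing fold per repetition, so every repetition yields a full set of out-of-sample predictions.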
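After each repetition of the cross-validation, the model assessment metric is computed. The root mean square error (RMSE) is selected as the cost function, which is given by $\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ and $\hat{y}_i$ are the observed and predicted values of the response and $n$ is the number of observations in the testing set. The resulting RMSE values are reported in Table 2.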
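The RMSE computation and its averaging over folds can be sketched in a few lines. The helper names and the numbers are illustrative assumptions, not values from Table 2.

```python
import math

def rmse(y, y_hat):
    """Root mean square error between observed and predicted values."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n)

def mean_cv_rmse(fold_rmses):
    """Average the per-fold RMSE values from cross-validation."""
    return sum(fold_rmses) / len(fold_rmses)

# Toy check (made up): errors of 3 and 4 on two observations.
example = rmse([0.0, 0.0], [3.0, 4.0])
overall = mean_cv_rmse([0.2, 0.4])
```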
We compare the statistical models in terms of the root mean square error (RMSE) of the prediction of CO2. The proposed nonlinear statistical model performed better than the other models, with the smallest RMSE of 0.261. Also, because the hyper-parameter α tuned by cross-validation in Equation (7) equals one, the RMSE is the same for the Lasso and Elastic Net methods. Thus, the proposed statistical model is of very high quality for predicting CO2 in the atmosphere.
Table 3 below shows the rankings of these risk factors along with their percentage of the overall contribution. The risk factor with the largest contribution to CO2 emissions in Africa is Liquid-Fuels, which contributes 13% of the CO2 emissions. The next largest is Solid-Fuels, with an 11% contribution. Ranks 3, 4, and 5 are the interactions Li ∩ So, So ∩ Gf, and Ce ∩ Gf, respectively. Summing over all of these risk factors, we find that they contribute 97.5% of the CO2 emissions in Africa.

Ranking of the Contributing Variables-United States
Xu and Tsokos [10] [11] structured a nonlinear statistical model that identified the significant risk factors, along with the significant interactions, that contribute to CO2 in the atmosphere in the continental United States. The ranks of the contributing variables, with their rates of CO2 contribution in the atmosphere, are listed in Table 4. These variables and interactions contribute 98.98% of the emissions in the United States.
In 2013, Teodorescu and Tsokos [12] developed a data-driven nonlinear statistical model using CO2 emissions data for the European Union (EU) countries.

Results and Discussion
Teodorescu and Tsokos [12] found that Gas-Fuels contribute 48.72% of the overall CO2 emissions in the European Union.

Conclusions
In the present study, we investigated the fossil-fuel risk factors that contribute to the prevalence of the most common air pollutant, carbon dioxide, in the atmosphere in Africa. Data obtained from the Carbon Dioxide Information Analysis Center (CDIAC) show that there are five attributable variables that contribute to the emission of carbon dioxide into the atmosphere in Africa.
These attributable variables are Liquid fuels (Li), Solid fuels (So), Gas fuels (Ga), Gas flares (Gf), and Cement production (Ce), in addition to seven interactions among them.
In our study, we built a data-driven statistical model in which we found that all five attributable variables contribute significantly to the emission of carbon dioxide in the atmosphere, along with seven significant interactions that were not previously known to be among the factors that significantly drive carbon dioxide emissions in the atmosphere of the African continent.
The identification of the significance of the five attributable variables and the seven interactions was based on a well-structured statistical data analysis. The data we obtained did not follow the Gaussian probability distribution. We therefore used the Johnson transformation to transform the response variable (i.e., carbon dioxide) to make it Gaussian, so that we could proceed with the statistical modeling.
Multicollinearity was present among the risk factors. However, our model was compared with different penalization techniques and provided very good results according to the RMSE statistic. In regression modeling, the coefficient estimates and p-values are affected by multicollinearity. However, multicollinearity does not affect the predictions, their precision, or the goodness of fit of the model. We do not have to be concerned about the severity of multicollinearity in our model if our main aim is to make predictions [18].
The proposed model has high predictive accuracy, which is supported by the high values of R² and adjusted R². Furthermore, we ranked the attributable variables and their interactions according to the percentage of their contribution to CO2 in the atmosphere. Moreover, with the proposed model one can proceed to perform response surface analysis, that is, to determine with a high degree of accuracy the values of the attributable variables that keep the CO2 in the atmosphere from rising above a critical value.
Thus, we want to obtain the values of those attributable variables for which the specified value of CO2 in the atmosphere is not exceeded. In particular, we want to be at least 95% certain of the values of the attributable variables that keep CO2 in the atmosphere within the minimum acceptable level.
In addition, a single world policy for global warming is not appropriate, because the five regions of the world that we have studied appear to respond differently with respect to CO2. Our findings suggest that it would be a waste of time and resources to manage increasing global warming through uniform global policies. Our study indicates that well-structured regional policies, rather than global environmental policies, are needed to address the world problem of global warming. Finally, our proposed statistical model is highly useful for decision making and strategic planning on controlling the air pollutant CO2 in the atmosphere in Africa.