Analysis of Global Warming Using Machine Learning

Climate change is a controversial topic of debate, especially in the US, where many do not believe in anthropogenic climate change. Because its consequences are predicted to be dire, such as mass ocean extinction and frequent extreme weather events, it is important to learn what causes the warming in order to better combat it. In this study, the first challenge is how to construct reliable statistical models from massive climate data spanning 800,000 years and accurately capture the relationship between temperature and potential factors such as the concentrations of carbon dioxide (CO2), nitrous oxide (N2O), and methane (CH4). We compared the performance of several mainstream machine learning algorithms on our data, including linear regression, lasso, support vector regression, and random forest, to build a state-of-the-art model that verifies the warming of the Earth and identifies the factors contributing to global warming. We found that random forest outperforms the other algorithms, creating accurate climate models that use features including the concentrations of different greenhouse gases to forecast global atmospheric temperature. The second challenge, identifying factor importance, can be met by the feature importance measure of the ensemble tree-based random forest algorithm. We found that CO2 is the largest contributor to temperature change, followed by CH4, then by N2O. All three had some impact, though, meaning their release into the atmosphere should be controlled to help restrain temperature increase and to help prevent climate change's potential ramifications.


Introduction
The general scientific consensus is that the Earth is warming. Over the past century, the temperature has already climbed 0.5˚C [1]. The "warming of the climate is unequivocal", with the last decade being the warmest decade since 1850 [2]. However, there is still debate over whether global warming is actually occurring and, if it is, whether it is anthropogenic. A majority, about 51%, of people in the US do not believe in anthropogenic climate change, with 31% saying the warming is natural and 20% saying the warming is not occurring [3].
With the US being the only country out of 196 in the UN to not sign the Paris Agreement, a commitment to combat climate change, it is clear that there are still many people in the world who deny that climate change is caused by humans or is occurring at all. One argument against anthropogenic global warming attributes the warming to increased solar activity in the past few years. This point is moot: the Sun goes through an eleven-year cycle of solar activity, yet the Earth has been warming continuously for the past decade, which a cyclical driver does not explain. As others have said, solar activity has no correlation with global temperature [2]. Of course, other conjectures against climate change exist.
According to the latest report of the Intergovernmental Panel on Climate Change (IPCC), the main driving force of global warming is the increase in the concentration of carbon dioxide in the atmosphere [2]. The vast majority of the carbon dioxide recently added to the air comes from burning fossil fuels, that is, from human activity [2].
Increased CO2 concentration in the air raises the temperature on the Earth through what is known as the greenhouse effect, in which the atmosphere traps heat released by the Sun [4].
So why should we worry about increased CO2 concentration and global warming? The consequences of global warming can be catastrophic. The increased CO2 concentration in the air will also lead to an increase in the CO2 absorbed by the ocean, which means the ocean will become more acidic. The pH of the ocean has already decreased by 0.1 [2]. If the ocean becomes too acidic, the results will be dire, as many organisms will be unable to adapt to the acidity, resulting in a significant loss of coral reefs and other underwater organisms [5].
In addition, many studies outline dire consequences of global warming. According to the IPCC report, the number of hurricanes, as well as their intensity, will increase due to the warming ocean water, putting coastal states at risk [2]. In general, extreme weather events, such as droughts and floods, will occur more often with global warming. In addition, a warmer temperature means less ice in glaciers and the polar ice caps, which will result in a significant rise in sea level [2]. The current projection is 50 cm to 100 cm by 2100, leaving many cities underwater [2]. Climate change will also cause mass migrations, both within countries and across borders, since more people will lose their homes to extreme weather [2]. Food security will also become an issue for many countries [2]. Although increased CO2 can help with crop production, many other factors, such as lack of water due to droughts and changing temperature, will offset the benefits and cause a decrease in output [2]. Increased CO2 concentration in the atmosphere can also decrease the nutritional value of crops.
In order to determine the next step to help mitigate climate change, the main factors that drive climate change should be investigated to learn how significant each factor is. This study will focus on these factors, as well as many other factors that have the potential to cause differences in global temperature. A previous study of temperature over the past 1000 years considered solar activity, volcanic activity, and greenhouse gas (GHG) concentration [6]. In that study, greenhouse gases were found to predict the temperature more closely than the other two factors did [6].
Meanwhile, machine learning has been applied more and more widely to environmental protection problems and has achieved promising results. Chen et al. [7] explored the application of a double parallel feedforward neural network to estimating suspended sediment loads to assist water resources management.
Olyaie et al. [8] compared the performance of different neural networks on the suspended sediment load of a river system. Artificial neural networks were also studied for evaluating the energy consumption and environmental life cycle of incineration and landfill systems in [9]. Taormina [10] combined a neural network with base flow separation and binary-coded swarm optimization to forecast river quantities.
The theory of variable fuzzy sets and the fuzzy binary comparison method have been investigated for assessing water quality in [11]. These works demonstrate the applicability of machine learning techniques to environmental issues.
In this paper, our first aim is to validate global warming based on the collected public data. Afterward, machine learning algorithms are employed to investigate the effects that different factors have on the global temperature. Then, we analyze the plots generated from the algorithms and draw conclusions from them.
The paper proceeds as follows: Section 2 describes the dataset. Section 3 describes the machine learning algorithms and how the data was used with them. Section 4 presents the results from the machine learning analysis of the data. Section 5 summarizes the results and discusses how the findings from this paper can be used in future projects.

Data Collection
Data from the past 800,000 years are compiled from a variety of public data sources: CO2 in parts per million (PPM) [12], N2O in parts per billion (PPB) [13], CH4 in PPB [14], the year, and the temperature difference from the average temperature of the last 100 years [15]. More accurate temperature data over the past 100 years is obtained from Lawrence Berkeley National Lab. N2O, CH4, and CO2 are used because they are all greenhouse gases that help cause climate change [16] [17] [18].

Data Preprocessing
The data collected over the 800,000 years are not aligned with each other. For example, there may be a CO2 concentration and a corresponding temperature for the year 1900, but no corresponding N2O and CH4 concentrations for that year. To prepare the data for machine learning, we use linear interpolation to align the data, since many machine learning algorithms cannot handle missing data points effectively.
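The alignment step can be illustrated with a minimal sketch. It assumes NumPy's `np.interp` as the interpolation routine; the helper `align_series`, the common year grid, and all numeric values below are hypothetical and are not taken from the datasets used in this study.

```python
import numpy as np

def align_series(target_years, years, values):
    """Linearly interpolate the record (years, values) onto target_years."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(years)  # np.interp requires increasing x-coordinates
    return np.interp(target_years, years[order], values[order])

# Hypothetical, made-up records sampled at different years.
co2_years, co2_ppm = [1900, 1950, 2000], [296.0, 311.0, 369.0]
ch4_years, ch4_ppb = [1910, 1990], [900.0, 1700.0]

# Common grid restricted to the years covered by both records.
grid = np.array([1910, 1950, 1990])
co2_on_grid = align_series(grid, co2_years, co2_ppm)
ch4_on_grid = align_series(grid, ch4_years, ch4_ppb)

# One row per year: year, CO2 (ppm), CH4 (ppb).
aligned = np.column_stack([grid, co2_on_grid, ch4_on_grid])
print(aligned)
```

Note that `np.interp` holds the boundary value constant outside the sampled range, so in this sketch the grid is kept inside the overlap of the records rather than extrapolating.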

Temperature Increase Analysis
The global temperature change over the past 100 years is visualized based on the public data provided by Lawrence Berkeley National Lab. The trend of global warming can be observed in the plotted average global temperature over the past 70 years. The coefficient of determination (R²) between global temperature and time is also computed, which further validates statistically the increase of global temperature over time.
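As an illustration of the R² computation, the sketch below fits a least-squares line of temperature against year and reports its coefficient of determination. The temperature series is synthetic (a small linear trend plus noise) and merely stands in for the real record; the function name `r_squared` is ours.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()  # centering keeps the fit well conditioned
    slope, intercept = np.polyfit(x_centered, y, 1)
    residuals = y - (slope * x_centered + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for a temperature record: linear trend plus noise.
rng = np.random.default_rng(0)
years = np.arange(1950, 2020)
temps = 0.015 * (years - 1950) + rng.normal(0.0, 0.1, size=years.size)

r2 = r_squared(years, temps)
print(f"R^2 between year and temperature: {r2:.3f}")
```

An R² close to 1 means that a straight line in time explains most of the variation in the series; the same function can be applied to temperature against CO2 concentration.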

Factor Analysis
To investigate the possible factors that contribute to the global temperature increase, we need to conduct factor analysis on potential factors such as CO2 concentration. Many research works have shown that there is a strong relationship between temperature and CO2. Common techniques for analyzing potential factors include visual inspection and statistical correlation. In this work, we first visualize the variations of temperature and CO2, and we then compute R² to validate statistically the correlation observed.

Applying Machine Learning Algorithms
Machine learning is a collection of statistical methods that analyze trends, find relationships, and develop models to make predictions from data sets. The machine learning algorithms we explore for this global warming study are random forest, support vector regression (SVR), lasso, and linear regression.

Random Forest
Random forest is an algorithm that uses decision trees as building blocks to construct more powerful prediction models. The algorithm builds an ensemble of a chosen number of trees. When building these decision trees, each split considers only a random subset of the predictors, smaller than the full set. Because the predictors available at each split are restricted, strong predictors do not drown out weaker ones, and the final result (the average of the results of the individual decision trees) over many decorrelated trees has lower variance. The averaged result is also typically more accurate than if all predictors were considered at every split: because a strong predictor is not always available, the trees are decorrelated from one another, making the average less variable and thus more reliable.
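A minimal sketch of this procedure is given below. It assumes scikit-learn's `RandomForestRegressor` as the implementation, which is our choice for illustration, and uses the hyperparameter values reported later in this paper (300 trees, at most 2 features per split, at least 1 sample per leaf). The data are synthetic, with three columns standing in for the three gas concentrations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: three features playing the role of CO2, CH4, N2O.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0.0, 0.1, 400)

forest = RandomForestRegressor(
    n_estimators=300,    # number of trees in the ensemble
    max_features=2,      # predictors considered at each split (out of 3)
    min_samples_leaf=1,  # minimum number of samples required at a leaf node
    random_state=0,
)
forest.fit(X, y)

# The forest prediction is the average of the individual tree predictions.
x_new = np.array([[0.5, 0.5, 0.5]])
tree_preds = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
forest_pred = forest.predict(x_new)[0]

# Normalized importance of each feature in the fitted forest.
importances = forest.feature_importances_
print(forest_pred, tree_preds.mean(), importances)
```

Setting `max_features` below the total number of features is what produces the decorrelation described above; with `max_features=3` the ensemble would reduce to bagged trees.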

Support Vector Regression
Support vector machines (SVMs) are algorithms that use hyperplanes (the generalization of a line or plane to higher dimensions) to classify data. Essentially, the algorithm tries to separate the different classes of data using the hyperplane that has the largest margin between the groups in a multi-dimensional space. If a data point violates the margin, a penalty is incurred, which influences which hyperplane is chosen as optimal. SVMs can use different kernels, which implicitly map the data into a high-dimensional space in which the hyperplane is found. Support vector regression (SVR) extends these principles from classification to regression.
Like other regression methods, SVR has a loss function, but the loss increases only when the absolute residual is greater than a certain constant.
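The loss described above is commonly called the epsilon-insensitive loss. The sketch below computes it per sample; the function name and the value of `epsilon` are illustrative assumptions, since the constant used in this study is not stated.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |residual| - epsilon)."""
    residuals = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Residuals inside the epsilon tube cost nothing; larger ones grow linearly.
    return np.maximum(0.0, residuals - epsilon)

losses = epsilon_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.3, 0.5], epsilon=0.1)
print(losses)
```

The first prediction is within 0.1 of its target and incurs no loss, while the other two are penalized only for the part of the residual that exceeds 0.1.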

Lasso
Lasso, or least absolute shrinkage and selection operator, is a regression algorithm that uses shrinkage, in which coefficient estimates are shrunk toward a central point, here zero. The algorithm uses L1 regularization, which adds a penalty based on the sum of the absolute values of the coefficients, and shrinks some coefficients exactly to zero if they play no role. This helps prevent the model from overfitting and yields a more general model. At the same time, lasso tries to minimize the residual sum of squares.
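To make the shrinkage concrete, the sketch below solves a small lasso problem by cyclic coordinate descent with soft thresholding. This is one standard way to fit lasso, not necessarily the solver used in this study; the objective scaling, the value of `alpha`, the omission of an intercept, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it would cross zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Synthetic data in which the third feature plays no role.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.1, 200)

w = lasso_coordinate_descent(X, y, alpha=0.1)
print(w)
```

The coefficients of the two informative features are pulled slightly toward zero, and the coefficient of the irrelevant feature is set exactly to zero, which is the selection behavior described above.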

Results and Analysis
With the results, many conclusions can be drawn, since random forest reports a numerical importance for each feature. The analysis is conducted multiple times and the results are averaged to obtain as accurate a result as possible.

Results

Temperature Change Over Time
Data about the temperature and the CO2 concentration over the past 70 years were plotted on a graph (Figure 1). The plot shows that the temperature has warmed over the past few decades. In addition, the concentration of carbon dioxide tracks the temperature, suggesting that the two are related and that CO2 may be a large cause of the warming of the Earth.
To further verify the relationship between carbon dioxide and temperature, the CO2 concentration over the past 800,000 years was plotted together with the difference in temperature from the average of the past 100 years (Figure 2). Inspection of this plot shows that the two are closely related, which suggests that the concentration of CO2 heavily influences the temperature of the Earth. Whenever the concentration of CO2 rises, the temperature rises, and vice versa. Since the increase in CO2 concentration has been attributed to humans, and CO2 concentration and temperature appear to be related, it can be inferred that humans caused the rise in temperature through an increase in CO2.

Applying Machine Learning Algorithms
The data collected over the past 800,000 years was randomly split into two equal samples, one for training and one for testing. We further employed 8-fold cross validation.
Here, we provide the key hyperparameters used for the different machine learning algorithms. These hyperparameters were selected from the ranges we provided by using 8-fold cross validation. The selected hyperparameters for random forest are 300 for the number of trees, 2 for the maximum number of features, and 1 for the minimum number of samples required at a leaf node. For SVR, we use 2.0 for the penalty C of the error term and a radial basis function kernel.
Figure 8. Plot of predicted temperature using random forest and the actual temperature from the testing set vs. the concentration of CH4.
Figure 9. Plot of predicted temperature using SVR and the actual temperature from the training set vs. the concentration of CO2.
Figure 10. Plot of predicted temperature using SVR and the actual temperature from the testing set vs. the concentration of CO2.
Other score criteria such as mean absolute error can also be used, but they provide no better-fitting models for our problem, so we use mean squared error (MSE) to quantify the accuracy of each model. The training and testing MSE results for the compared algorithms are shown in Table 1. It is clear that random forest creates the most accurate models. From inspection of the plots, random forest is also visually more accurate, and it creates the most accurate model for predicting the temperature differences based on the concentrations of N2O, CO2, and CH4. We see that random forest runs efficiently on this dataset and has an effective method to estimate missing data. Thus, to build a more accurate model to predict the temperature with a larger set of features, random forest would be the best option out of these four algorithms. The accuracy of the algorithm also allows for an accurate feature importance chart in Table 2. As visible from the feature importance chart in Figure 15, CO2 is the most significant feature in temperature change, with an importance of 0.6598, followed by methane, with an importance of 0.2795; the least significant is N2O, at 0.0607. Through machine learning, the claim set forth by the IPCC and other studies, that CO2 is the biggest contributor to temperature change, is confirmed.
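The evaluation procedure described in this section (an even random split, 8-fold cross validation for hyperparameter selection, and MSE on both halves) can be sketched as follows. The sketch assumes scikit-learn and uses SVR with a radial basis function kernel as the example model; the data are synthetic and the search range for C is hypothetical, since the actual ranges are not listed here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data; columns play the role of CO2, CH4 and N2O.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0.0, 0.1, 400)

# Random split into two equal halves: one for training, one for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 8-fold cross validation on the training half selects C from a small grid.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.5, 2.0, 8.0]},  # hypothetical search range
    cv=8,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# The refitted best model is then scored on both halves.
train_mse = mean_squared_error(y_train, search.predict(X_train))
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, train_mse, test_mse)
```

Comparing the training and testing MSE in this way gives a simple check on overfitting: a testing error far above the training error suggests the model has fit noise in the training half.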
The chart also shows that the effects of CH4 and N2O are present and still affect the temperature of the Earth.
Carbon dioxide is a major factor in determining the temperature of the air.
This means that the carbon dioxide that humans (and not nature) are putting into the air contributes substantially to the changes in temperature [14]. Both methane and N2O have considerable impacts as well. In fact, there is little methane in the air compared to CO2 (about 1.82 PPM for CH4 vs. about 399 PPM for CO2) [19], yet the effect of methane is still large and should not be underestimated, because a unit of methane has a much greater greenhouse effect than a unit of CO2. Even if there is not much of a gas in the air, it can still change the temperature. Thus, attention should be paid to all three of these gases.
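The figures quoted above can be checked with a few lines of arithmetic. The snippet uses only the importances and concentrations reported in the text. The two ratios are shown side by side purely for illustration: feature importance is a statistical property of the fitted model and is not a physical measure of warming per unit of gas.

```python
# Feature importances as reported in the text for the random forest model.
importances = {"CO2": 0.6598, "CH4": 0.2795, "N2O": 0.0607}

total = sum(importances.values())
ranking = sorted(importances, key=importances.get, reverse=True)

# Concentrations quoted in the text, both in parts per million.
co2_ppm, ch4_ppm = 399.0, 1.82
concentration_ratio = co2_ppm / ch4_ppm
importance_ratio = importances["CO2"] / importances["CH4"]

print(f"total importance = {total:.4f}")
print(f"ranking = {ranking}")
print(f"CO2/CH4 concentration ratio = {concentration_ratio:.0f}")
print(f"CO2/CH4 importance ratio = {importance_ratio:.2f}")
```

The importances sum to 1, as expected for normalized values. CO2 is roughly 219 times more abundant than CH4 by these figures, while its importance in the model is only about 2.4 times larger.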
Our results indicate that these three factors contribute to global warming when their concentrations increase. As the IPCC noted, the effects of global warming can be catastrophic [2]. Further, we can combine our constructed model with predictions of greenhouse gas releases to forecast the temperature in the future, which can contribute to the control and prediction of global warming. Now that we have verified the effects of greenhouse gases on the global temperature, the next step is to figure out how to limit the concentrations of these gases in the atmosphere in order to slow the global temperature rise. In addition, with more data about the concentrations of different greenhouse gases going back thousands of years, the models can be strengthened and become more accurate, and can also determine the extent to which other gases affect the temperature.

Conclusions
As evident from the first part of the results, there is an upward trend in temperature, which correlates with the upward trend in CO2 concentration. From the correlation analysis between the concentration of CO2 and the temperature, we further show that the increase in CO2 concentration drives the temperature rise. Afterward, we compared different machine learning algorithms in predicting the temperature using the concentrations of three gases: CO2, CH4, and N2O. It is apparent that random forest is by far the most accurate of the algorithms tested. With more features and more data to train on, it is expected to become even more accurate and to become a useful model for temperature change. This means that, given predictions of the future outputs of CO2, CH4, N2O, and any other features that the algorithm is trained with, random forest can be used to predict the temperature.
The feature importance data gathered from random forest also tells an important story. In our study, we show that CO2 dominates the global temperature changes, but it is important to note that the unit of CH4 and N2O is ppb while the unit of CO2 is ppm, which indicates that the effect of CH4 and N2O should not be underestimated.
In the current work, only three factors are considered as contributors to temperature change; further factors, such as atmospheric circulation, currents, and biodiversity, can be considered in future work. We compared four machine learning algorithms that have been proven to provide satisfactory performance in many cases. However, other machine learning algorithms, especially ensemble-based algorithms such as xgboost, as well as neural networks, can also be investigated in search of better models in future work.

Figure 1. Plot of CO2 ppm and average temperature since 1950.
The 8-fold cross validation was applied during the training process to search for suitable hyperparameters and to prevent the models from overfitting. Then, three different machine learning algorithms were compared: random forest, lasso, and support vector regression. With each algorithm, the parameters were tuned to fit the data and generate accurate training results. The visual results are shown in Figures 3-14.

Figure 2. Plot of CO2 ppm and temperature difference from the average of the last 100 years over the previous 800,000 years.

Figure 3. Plot of predicted temperature using random forest and the actual temperature from the training set vs. the concentration of CO2.

Figure 4. Plot of predicted temperature using random forest and the actual temperature from the testing set vs. the concentration of CO2.

Figure 5. Plot of predicted temperature using random forest and the actual temperature from the training set vs. the concentration of N2O.

Figure 6. Plot of predicted temperature using random forest and the actual temperature from the testing set vs. the concentration of N2O.

Figure 7. Plot of predicted temperature using random forest and the actual temperature from the training set vs. the concentration of CH4.

Figure 11. Plot of predicted temperature using SVR and the actual temperature from the training set vs. the concentration of N2O.

Figure 12. Plot of predicted temperature using SVR and the actual temperature from the testing set vs. the concentration of N2O.

Figure 13. Plot of predicted temperature using SVR and the actual temperature from the training set vs. the concentration of CH4.

Figure 14. Plot of predicted temperature using SVR and the actual temperature from the testing set vs. the concentration of CH4.

Figure 15. Importance of each feature, as determined by random forest.

Table 1. Mean squared error of each model on the training and testing data.

Table 2. Importance of each feature, as determined by random forest.