Homogeneity of Monthly Mean Air Temperature of the United Republic of Tanzania with HOMER

Long-term climate datasets are widely used in a
variety of climate analyses. These datasets, however, are often adversely
affected by inhomogeneities caused by, for example, relocation of
meteorological stations, changes in the land-use cover surrounding the weather
stations, replacement of meteorological stations, changes of shelters, changes
of instrumentation due to failure or damage, and changes in observation
hours. If these inhomogeneities are not detected and adjusted properly, the
results of climate analyses using these data can be erroneous. In this paper,
monthly mean air temperatures of the United Republic of Tanzania are
homogenized for the first time using the HOMER software package. HOMER is one of the
most recent homogenization packages and exhibited the best results in the
comparative analysis performed within the COST Action ES0601 (HOME). Monthly
mean minimum (TN) and maximum (TX) air temperatures from 1974 to
2012 were used in the analysis. These datasets were obtained from the Tanzania
Meteorological Agency (TMA). The analysis reveals a larger
number of artificial break points in the TX (12 breaks) than in the TN (5 breaks) time
series. The homogenization process was assessed by comparing results obtained
with correlation analysis and principal component analysis (PCA) of homogenized
and non-homogenized datasets. The Mann-Kendall non-parametric test was used to
estimate the existence, magnitude and statistical significance of potential
trends in the homogenized and non-homogenized time series. Correlation analysis
reveals stronger correlation in homogenized TX than in homogenized TN, relative to the
non-homogenized time series. Results from PCA suggest that the explained variances of the
principal components are higher in homogenized TX than in homogenized TN, relative to the
non-homogenized time series. The Mann-Kendall non-parametric test reveals that the
number of statistically significant trends increases more with
homogenized TX (96%) than with homogenized TN (67%), relative to the non-homogenized datasets.


Introduction
The study of climate change and variability in the United Republic of Tanzania (URT) depends on existing long-term observational climate datasets. The value of these datasets, however, strongly depends on their homogeneity [1]. A homogeneous climate time series is defined as one whose variability is caused only by changes in weather and climate [2]. Unfortunately, long instrumental records are rarely, if ever, homogeneous. The inhomogeneity in these datasets is due, for example, to relocation of meteorological stations, changes in the land-use cover surrounding the weather stations, replacement of meteorological stations, changes of shelters, changes of instrumentation due to failure or damage, and changes in observation hours [2][3][4]. Most of these changes cause sudden shifts (change-points) in the series of local climate data, while some others (particularly urban development) result in gradually increasing biases from the real macroclimatic characteristics [5,6]. All of these inhomogeneities can bias a time series and lead to misinterpretations of the studied climate [5].
More recently, a comprehensive analysis to assess different homogenization techniques for climate series was included in the scientific programme of the COST Action HOME ES0601: Advances in Homogenization Methods of Climate Series: an Integrated Approach (HOME). The objective of HOME was to develop a general method for homogenizing climate and environmental datasets. This task started in 2007 and was accomplished in 2011 with the release of two new homogenization software packages: HOMER (for monthly data) and HOM/SPLIDHOM (for daily data) [18]. The aim of this paper is to use the HOMER software package to homogenize monthly mean minimum (TN) and maximum (TX) air temperature datasets of the URT, as part of the process of constructing reliable long-term datasets from original climate observations.

Area of Study
The domain of study is the URT, which is located in East Africa between longitudes 29˚E and 41˚E and latitudes 1˚S and 12˚S. The country covers an area of 945,000 km², of which 884,000 km² is land mass and 61,000 km² is lakes, rivers and seashore. The URT has several physical features that contribute to high local variability in its climate, including topography ranging from sea level to 1600 m in the west, Mount Kilimanjaro at 5895 m altitude in the north-eastern highlands, Lake Victoria in the north, Lake Nyasa and the River Ruvuma in the south, and Lake Tanganyika in the west. Much of the country lies above 1000 m altitude, with many areas above 1500 m in the centre and north. It also has a complex seasonality associated with the Indian Ocean [19][20][21][22]. The URT is relatively sparsely covered with weather stations, which are unevenly distributed and located in both low- and high-altitude areas. Most of the meteorological station networks, which mainly comprise classical weather stations that have been collecting data since the 1900s, are managed by the Tanzania Meteorological Agency (TMA).

Data Description
Monthly mean minimum (TN) and maximum (TX) air temperatures from 1974 to 2012 were used in the analysis. These datasets were obtained from TMA. Table 1 gives the geographic information of the meteorological stations used in this study.

Methodological Procedures
The HOMER software was used to detect and correct the inhomogeneities in the TN and TX datasets. It is one of the most recent homogenization packages and exhibited the best results in the comparative analysis performed within the COST Action ES0601 (HOME) [19]. HOMER comprises additional functions for fast quality control of the data, including functions from the CLIMATOL R package that allow the user to examine the station density, correlograms, histograms, box plots, and cluster analysis [2]. For the detection of inhomogeneities in the datasets, HOMER combines three detection algorithms (pairwise-univariate detection, joint detection and ACMANT-bivariate detection) and corrects the datasets using ANOVA [1]. ACMANT is used to detect the most likely month of a change point (break). If the precise month of change is not known, the default is to validate the break at the end of the year, since detection is mainly performed on annual indices [6].

Missing Data Correction and Outlier Detection
The models used in HOMER for imputation of missing data and for outlier correction are presented in [6]. In these models, missing data are filled in using ANOVA, and outliers are detected by pairwise comparison between the candidate time series and its best-correlated neighbour time series. This is performed by visual inspection of the plots of the difference between the candidate and best neighbour time series (Figures 1-3). After a correction step, ACMANT bivariate detection confirms the selected changes in the climate data series.
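The pairwise comparison described above rests on difference series between a candidate and a neighbour. The following minimal sketch, which is not HOMER code, illustrates the idea in Python. The function names, the use of NaN for missing months, and the robust threshold `k` are assumptions introduced here for illustration only; in this study outliers were identified by visual inspection, not by an automatic rule.

```python
import math
from statistics import median


def difference_series(candidate, neighbour):
    """Candidate minus neighbour, value by value.

    Missing values are assumed to be coded as NaN, so the difference
    is NaN wherever either series has a gap.
    """
    return [c - n for c, n in zip(candidate, neighbour)]


def flag_outliers(diff, k=4.0):
    """Flag points lying more than k robust standard deviations from the median.

    The robust scale is 1.4826 * MAD (median absolute deviation).
    The threshold k is a hypothetical choice, not a HOMER setting.
    """
    valid = [d for d in diff if not math.isnan(d)]
    med = median(valid)
    mad = median(abs(d - med) for d in valid)
    scale = 1.4826 * mad
    if scale == 0.0:
        # No spread in the difference series: nothing can be flagged.
        return [False] * len(diff)
    return [(not math.isnan(d)) and abs(d - med) > k * scale for d in diff]


# Illustrative values only (not TMA observations).
candidate = [20.1, 20.3, 20.2, 26.0, 20.4, math.nan, 20.2]
neighbour = [19.9, 20.2, 20.0, 20.1, 20.3, 20.0, 20.1]
diff = difference_series(candidate, neighbour)
flags = flag_outliers(diff)
```

Because the regional climate signal is largely shared by neighbouring stations, it cancels in the difference series, so a suspicious value in the candidate stands out more clearly than in the raw series.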

Development of Reference Time Series and Homogeneity Test
Reference time series must encompass the same climatic signal as the candidate series and can be developed using several techniques. For example, [22] developed a reference series that did not vary with time for a 19-station network, using the arithmetic mean of the other 18 stations in the network for each candidate. After the homogeneity test was run on all the stations, a new reference series was created as before, but excluding those stations with inhomogeneities. Like [22], [23] runs homogeneity tests and then uses the homogenized data to develop reference time series, which are used to rerun the homogeneity tests.
Another technique is described in [5], where reference series are created based on correlation coefficients between stations. In this study, the reference time series was created as the weighted average of the non-homogenized datasets of all 24 stations in the network. The homogeneity test was then run to assess the quality of homogenization by comparing both the non-homogenized data (hereafter NH) and the homogenized data (hereafter HH) with the reference series, using the following methods.

Correlation Analysis
Correlation analysis was applied to annual time series to compute: 1) the correlation matrix between the time series of non-homogenized and homogenized datasets; and 2) the Spearman Correlation Coefficient (SCC) between the non-homogenized and homogenized time series. Correlation analysis was also performed between the non-homogenized datasets and the reference time series, and between the homogenized datasets and the reference series, with the objective of assessing the quality of the corrected dataset and any potential improvement in the similarity between the time series of non-homogenized and homogenized data.

Mann-Kendall Test
[24] first suggested using the test for significance of Kendall's tau, with time as the independent variable, as a test for trend. The Mann-Kendall test can be stated most generally as a test for whether Y (dependent variable) values tend to increase or decrease with time (monotonic change). In this study, the Mann-Kendall non-parametric test is used to estimate the existence, magnitude and statistical significance of potential trends in the NH and HH time series, in order to assess the impacts of homogenization.

Principal Component Analysis (PCA)
Principal component analysis is an efficient way of compressing geophysical data in both space and time, as well as of separating noise from meaningful data. It enables fields of highly correlated data to be represented adequately by a small number of orthogonal functions and the corresponding orthogonal time coefficients, which account for much of the variance in their spatial and temporal variability. PCA techniques are used to extract from a covariance matrix robust structures that explain the largest variance of the original matrix and at the same time are uncorrelated. The original data are split into orthogonal spatial patterns (eigenvectors) and corresponding time series coefficients (principal components). An eigenvector pattern that accounts for a large fraction of the variance (eigenvalue) is considered to be physically meaningful. [27] has provided a lucid outline of the mathematical procedure necessary to define the functions and their time coefficients. The PCA method is capable of extracting the principal components (PCs) of patterns in a time series; each of the PCs is orthogonal to the others. The first PC (PC1) is the most dominant pattern and explains most of the variance; PC2 is the second most dominant PC, followed by PC3, etc. This characteristic of PCA was used in this study to assess the homogenization results. The Kaiser criterion of retaining factors with eigenvalues greater than or equal to one was used to determine the number of significant PCs [28].
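As an illustration of the procedure described above, the sketch below computes eigenvalues, explained variance fractions and the number of components retained under the Kaiser criterion. It assumes NumPy is available, that the input has no missing values, and that each station series is standardized first, so that the covariance matrix of the standardized data equals the correlation matrix (the setting in which an eigenvalue threshold of one is usually applied). The function name and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np


def pca_explained_variance(data):
    """PCA of a (years x stations) array with no missing values.

    Returns eigenvalues (descending), explained variance fractions,
    the number of PCs kept by the Kaiser criterion, and the PC scores.
    """
    x = np.asarray(data, dtype=float)
    # Standardize each station series (assumption: PCA on the correlation matrix).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.cov(z, rowvar=False)
    # eigh is appropriate for symmetric matrices; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Kaiser criterion: keep components with eigenvalue >= 1.
    n_keep = int(np.sum(eigvals >= 1.0))
    # Time coefficients (principal components) of each spatial pattern.
    scores = z @ eigvecs
    return eigvals, explained, n_keep, scores


# Synthetic example: 39 years, 5 stations sharing one common signal plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=39)
synthetic = common[:, None] + 0.1 * rng.normal(size=(39, 5))
eigvals, explained, n_keep, scores = pca_explained_variance(synthetic)
```

When stations share a strong common signal, as in the synthetic example, PC1 carries almost all of the variance. This is the sense in which a higher explained variance after homogenization indicates more coherent station series.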

Results and Discussion
Results from PCA on the non-homogenized and homogenized datasets suggest the following: 1) the explained variance of the principal components of the homogenized datasets is higher than that of the non-homogenized datasets for both maximum and minimum temperature; 2) the explained variance of the principal components of the homogenized datasets is higher for TX (62%) than for TN (53%); 3) the explained variances of principal components 2 to 5 of the non-homogenized datasets tend to be higher for TN than for TX (Tables 2-5).
The temporal location and size of the breaks detected in TN and TX are indicated in Table 6. The number of detected breaks is larger in TX (12 breaks) than in TN (5 breaks).
Correlation matrices between the non-homogenized time series of TX and TN air temperature, as well as between the homogenized time series of TX and TN, were computed. Results indicate that the Spearman Correlation Coefficient (SCC) values obtained for the homogenized time series are higher than those for the non-homogenized time series. In general, SCC values between the homogenized TX and TN time series are higher than those obtained for the non-homogenized time series.
The SCC values obtained between the reference series and the non-homogenized and homogenized time series (Figures 4 and 5) reveal: 1) higher SCC between the reference annual series and the homogenized time series than between the reference annual series and the non-homogenized time series at most stations, for both maximum and minimum temperature; 2) for TX, higher SCC values between the reference series and the homogenized time series, except at 2 weather stations (Tabora and Mlingano), where the values are lower; 3) for TN, higher SCC values between the reference series and the homogenized time series, except at 4 weather stations (Songea, Moshi, Kibaha and Igeri), where the values are lower.
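For readers who wish to reproduce this kind of comparison, the SCC between a station series and the reference series can be sketched as follows. This is a minimal pure-Python illustration, assuming complete annual series of equal length; the function names are hypothetical, and tied values receive average ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Computing `spearman(station, reference)` once with the NH series and once with the HH series of the same station gives the pair of values compared in Figures 4 and 5.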
Results from the Mann-Kendall non-parametric test for trend reveal that the number of significant trends is higher for the homogenized than for the non-homogenized datasets. This is more evident for TX than for TN; for example, the percentage of significant trends for homogenized maximum and minimum air temperature is 96% and 67%, respectively.
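The trend test used here can be outlined with the standard Mann-Kendall statistic S, its tie-corrected variance, and a normal approximation for the two-sided p-value. The sketch below is an illustration under stated assumptions, not the code used in the study: the function names and the 5% significance level are hypothetical choices, and Sen's slope is shown as one common estimator of trend magnitude, which the text does not specify.

```python
import math
from collections import Counter
from statistics import median


def mann_kendall(series, alpha=0.05):
    """Return (S, Z, two-sided p-value, significant?) for a complete series."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the later-minus-earlier difference
    # Variance of S with the usual correction for groups of tied values.
    ties = Counter(series).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in ties)) / 18.0
    # Continuity-corrected standard normal statistic.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p_value, p_value < alpha


def sen_slope(series):
    """Median of all pairwise slopes (change per time step)."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)
```

Applying `mann_kendall` to each station's NH and HH annual series and counting the significant results gives percentages of the kind reported above.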

Conclusion and Recommendations
The TMA meteorological station network analysed here includes stations located on the islands of Zanzibar and Pemba, stations near the coastline, and inland stations. The SCC values obtained between the reference series and the homogenized time series are higher than those obtained between the reference series and the non-homogenized time series. Trend analysis performed on the TX and TN time series reveals an increase in the number of statistically significant trends with homogenized TX (96%) and TN (67%) relative to the non-homogenized time series.
Results from PCA reveal that homogenization leads to an increase in the similarities in the spatial and temporal variability of TX and TN. This behaviour is more evident for TX than for TN. In this study, the explained variance of the PCA is higher for the homogenized than for the non-homogenized datasets, independently of the different climates in the region. It should also be noted that, to the best of our knowledge, this study is the first effort to homogenize climate datasets of the URT using HOMER. Finally, the presented results show that the homogenized datasets are more reliable long-term datasets than the non-homogenized datasets.

Figure 1. Screen capture of HOMER outputs: Mtwara series compared to its neighbours. Pairwise comparisons are sorted according to increasing values of the noise standard deviation (upper left corner of each plot). The neighbours are sorted based on their cross-correlation with Mtwara. The top panel is the difference time series of Mtwara and Dar es Salaam, which has a standard deviation of 0.14˚C. The second panel is the difference between Mtwara and Kibaha (0.20˚C). The third panel is the difference between Mtwara and Songea (0.27˚C).

Figure 2. Screen capture of HOMER outputs: Raw data series of Mwanza with outlier and missing values.

Figure 3. Screen capture of HOMER outputs: Corrected data series of Mwanza.

The Mann-Kendall test is suggested for trend analysis by the WMO [25] and has been used in many published works on climate change and climate variability [26].

Figure 4. Spearman correlation coefficient (SCC) between annual reference series and time series of non-homogenized (NH) and homogenized with HOMER (HH) of maximum air temperature for the period 1974-2012.

The station network also includes stations at high altitudes, such as the Kilimanjaro airport station. Furthermore, these weather stations are located in different climatic zones, which may reduce the quality of homogenization. Results indicate that the number of significant trends increases more with homogenized than with non-homogenized datasets. Larger numbers of breaks were detected in TX than in TN.

Figure 5. Spearman correlation coefficient (SCC) between annual reference series and time series of non-homogenized (NH) and homogenized with HOMER (HH) of minimum air temperature for the period 1974-2012.