A Hybrid K-Means-GRA-SVR Model Based on Feature Selection for Day-Ahead Prediction of Photovoltaic Power Generation

To ensure that the large-scale application of photovoltaic (PV) power generation does not affect the stability of the grid, accurate PV power generation forecasting is essential. This paper proposes a short-term PV power generation forecast method that combines K-means++, grey relational analysis (GRA) and support vector regression (SVR) based on feature selection (Hybrid Kmeans-GRA-SVR, HKGSVR). The historical power data are clustered through the multi-index K-means++ algorithm and divided into ideal and non-ideal weather. The GRA algorithm is used to match the similar days and the nearest neighbor similar day of the prediction day. Appropriate input features are then selected for each weather type to train the SVR model. Under ideal weather, the average values of MAE, RMSE and R² were 0.8101 kW, 0.9608 kW and 99.66%, respectively, and the method reduced the average training time by 77.27% compared with the standard SVR model. Under non-ideal weather, the average values of MAE, RMSE and R² were 1.8337 kW, 2.1379 kW and 98.47%, respectively, and the method reduced the average training time by 98.07% compared with the standard SVR model. The experimental results show that the prediction accuracy of the proposed model is significantly improved compared with the other five models, which verifies the effectiveness of the method.


Introduction
In the face of limited fossil energy and the need to adjust the energy structure, the exploration of renewable energy power generation technology is of great significance [1]. A study shows that the earth receives about 1.8 × 10¹¹ MW of power from solar radiation [2]. Photovoltaic power generation is one of the most promising solar power technologies [3]. Photovoltaic energy is clean, widely distributed and abundant, and has become the best substitute for industrial and residential power generation [4]. According to the 2020 report of the International Renewable Energy Agency, the global cost of photovoltaic power generation has dropped by more than 70% in the past 8 years, and the global installed capacity has reached 578.553 GW [5].
However, due to the chaotic nature of the weather system, the production of photovoltaic energy is highly random, volatile and intermittent, which may lead to grid power and voltage imbalances, and also greatly increase the difficulty of large-scale photovoltaic energy applications [6] [7]. In order to improve the power system's ability to consume photovoltaic energy, many solutions have been proposed, including energy storage optimization [8], demand response strategy [9] [10], power flow optimization [11], stand-alone microgrid [12], and PV power forecasting [13]. Considering economy and feasibility comprehensively, photovoltaic power generation forecast is one of the most promising solutions to the impact of large-scale photovoltaic energy application on the grid [14] [15].
Current photovoltaic power generation forecasting technologies follow three main directions: physical methods, time series statistical methods and ensemble methods [14]. [16] proposed a partial function linear regression model to forecast the day-ahead photovoltaic power generation. The regression method requires little computation, but its prediction accuracy is relatively low. [17] proposed an ANN model based on an extreme learning machine algorithm to predict photovoltaic power generation. Artificial neural networks (ANNs) can handle nonlinear problems and have excellent self-learning ability, so they achieve high prediction accuracy. However, the multi-layer network structure of an ANN greatly increases the complexity of the model, so training and optimizing it consumes considerable computing resources and time. In [18] [19] [20], the support vector machine (SVM) is used for short-term photovoltaic output forecasting. SVM can also handle nonlinear problems, has excellent learning ability and does not rely heavily on prior knowledge. It trains quickly, resists overfitting and generalizes well.
The ensemble method overcomes the limitations of a single model by mixing different models with unique functions, thereby improving the prediction performance [21]. For the prediction of photovoltaic power generation, an ensemble method that mixes various effective methods is more effective and accurate [22]. For example, the hybrid GA-SVM model [20] performed better than the SVM model. In [23], a hybrid Kmeans-GRA-Elman model was proposed, and its performance was better than that of the BP neural network, Elman, GRA-BPNN and GRA-Elman models.
Photovoltaic power generation has obvious seasonal and weather characteristics [24]. Weather conditions can be roughly divided into two types: the ideal weather type (sunny day) and non-ideal weather types [25]. For ideal weather, the prediction accuracy of many prediction methods is high enough [26]. It can be seen from [27] [28] that the prediction accuracy of these methods for non-ideal weather was much lower than that for ideal weather. In order to improve the prediction performance under non-ideal weather, similarity algorithms have been used in many studies to extract output features under similar weather. For example, [29] proposed a prediction method based on similar days and an improved BP neural network. The similarity algorithm can effectively extract the output characteristics of different weather types. Moreover, compared with directly using a large amount of historical data to train the model, the use of similar days not only saves considerable computing resources but also improves the prediction accuracy of the model. However, if the time interval between the similar day and the forecast day is too long, the characteristics of the photovoltaic array (surface cleanliness, module aging, conversion efficiency, etc.) may have changed considerably, which causes a large error between the predicted result and the actual value [25].
This paper proposes a short-term photovoltaic power generation forecast method using the combination of K-means++, GRA and SVR based on feature selection. The main contributions are as follows:
1) A novel day-ahead PV power forecasting method that utilizes SVR, clustering and similarity algorithms is proposed.
2) Historical power data are clustered through multi-index K-means++ to obtain the power generation modes of different weather types, which overcomes the limitation of categorizing directly by weather tags. According to its average power, each cluster is classified as an ideal weather cluster or a non-ideal weather cluster.
3) The GRA algorithm is used to match the nearest neighbor similar day, and the error caused by a long time interval between the similar day and the forecast day is reduced by using the information of the nearest neighbor similar day.
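
The flow of the K-means++ algorithm is as follows:
Step 1: Randomly select one sample from the dataset $X$ as the first cluster center.
Step 2: For each sample $x$, calculate the probability that it is selected as the next cluster center:
$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
where $D(x)$ represents the distance between the sample and the nearest cluster center. Then use the roulette method to select the next cluster center.
Step 3: Repeat step 2 until K cluster centers are selected.
Step 4: For each sample $x_i$ in the dataset, calculate its distance to the K cluster centers, and assign it to the class of the nearest cluster center.
Step 5: For each cluster $c_i$, recalculate its cluster center $C_i$:
$$C_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$
Step 6: Repeat steps 4 and 5 until the positions of the cluster centers no longer change.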
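The K-means++ steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeanspp_init`, `kmeans`), the fixed random seed and the handling of empty clusters are all choices made here for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeanspp_init(samples, k, seed=0):
    """Pick k initial cluster centers with K-means++ seeding (Steps 1-3)."""
    rng = random.Random(seed)
    # Step 1: the first center is drawn uniformly at random.
    centers = [rng.choice(samples)]
    while len(centers) < k:
        # D(x)^2: squared distance from each sample to its nearest center.
        d2 = [min(squared_distance(x, c) for c in centers) for x in samples]
        # Step 2: roulette selection with probability D(x)^2 / sum of D(x)^2.
        centers.append(rng.choices(samples, weights=d2, k=1)[0])
    return centers


def kmeans(samples, k, seed=0, max_iter=100):
    """Run the assignment/update loop (Steps 4-6) from a K-means++ start."""
    centers = kmeanspp_init(samples, k, seed)
    for _ in range(max_iter):
        # Step 4: assign each sample to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in samples
        ]
        # Step 5: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 6: stop when the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

In the setting of this paper, each sample would be one day's hourly power curve, and the loop would be run separately for each season.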
In this part, the historical power data are clustered directly, season by season, to obtain the different power generation modes caused by the diversity of weather. Moreover, equipment aging and the array's own parameters vary under different weather, and these changes are difficult to measure accurately; the historical power data, however, already reflect them. After clustering the historical power data, the centroid value of each cluster is calculated from the minimum, average and maximum of global horizontal irradiance (GHI), diffuse horizontal irradiance (DHI), relative humidity (RH) and temperature (T), giving 12 meteorological factor eigenvalues.
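As a small illustration of the 12 meteorological factor eigenvalues, the sketch below computes the minimum, average and maximum of each of the four variables for one day. The dictionary layout and the function name `meteorological_eigenvalues` are assumptions made for this example; the paper does not specify a data structure.

```python
from statistics import mean


def meteorological_eigenvalues(day):
    """Return [min, mean, max] of GHI, DHI, RH and T for one day (12 values).

    `day` is assumed to map each variable name to its hourly series.
    """
    features = []
    for name in ("GHI", "DHI", "RH", "T"):
        series = day[name]
        features.extend([min(series), mean(series), max(series)])
    return features
```

Averaging these 12-value vectors over the days in a cluster would give one way to obtain the cluster's centroid value.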

Grey Relational Analysis Algorithm
The basic idea of the grey relational analysis algorithm is to judge the degree of correlation by comparing the geometric similarity between the reference sequence and several comparison sequences. Generally, the more consistent the change tendencies of the reference sequence and a comparison sequence are, the higher the degree of correlation between the two. The flow of the GRA algorithm is as follows:
Journal of Computer and Communications
Step 1: Determine the reference sequence $y$ and the comparison sequences $x_i$:
$$y = \big(y(1), y(2), \ldots, y(n)\big)$$
$$x_i = \big(x_i(1), x_i(2), \ldots, x_i(n)\big), \quad i = 1, 2, \ldots, m$$
where $n$ and $m$ represent the dimension of the eigenvalues and the number of comparison sequences, respectively.
Step 2: Non-dimensionalization of variables:
$$D'_j(k) = \frac{D_j(k) - D_{av}(k)}{D_{max}(k) - D_{min}(k)}$$
where $D_j(k)$ is the k-th eigenvalue of sequence $j$, with $j$ running over the reference sequence and all comparison sequences, and $D_{av}(k)$, $D_{min}(k)$ and $D_{max}(k)$ are the average, minimum and maximum values of column $k$. Non-dimensionalization solves the problem that columns with different dimensions cannot be compared directly.
Step 3: Calculate the correlation coefficient between the reference sequence and each comparison sequence at each point $k$:
$$\xi_i(k) = \frac{\min_i \min_k |y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}{|y(k) - x_i(k)| + \rho \max_i \max_k |y(k) - x_i(k)|}$$
where $\rho \in (0, 1)$ is the resolution coefficient.
Step 4: Calculate the correlation degree $r_i$ as the average of the correlation coefficients at all moments (that is, at each point of the curve):
$$r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k)$$
Step 5: Sort the correlation degrees.
After determining the cluster to which the prediction day belongs, the correlation between the prediction day and each sample in the cluster is calculated by GRA based on the 12 meteorological factor eigenvalues. Dates whose correlation degree is greater than the threshold (an appropriate correlation value that takes into account both the similarity and the number of samples) are regarded as similar days. Based on the GRA global matching results, the nearest neighbor similar day is set as follows: for ideal weather, it is the sample with the highest correlation in the 7 days before the forecast date; for non-ideal weather, it is the sample with the highest correlation in the 30 days before the forecast date.
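The GRA flow can be sketched as below. This is an illustrative version under assumptions that are not confirmed by the text: the non-dimensionalization subtracts the column mean and divides by the column range, the correlation coefficient takes the commonly used grey relational form, and the resolution coefficient `rho` defaults to 0.5. The function name `gra_degrees` is hypothetical.

```python
def gra_degrees(reference, comparisons, rho=0.5):
    """Return the grey relational degree of each comparison sequence."""
    rows = [reference] + comparisons
    n = len(reference)
    # Assumed non-dimensionalization: (value - column mean) / column range.
    norm = [[0.0] * n for _ in rows]
    for k in range(n):
        col = [row[k] for row in rows]
        span = max(col) - min(col)
        avg = sum(col) / len(col)
        for j, row in enumerate(rows):
            norm[j][k] = (row[k] - avg) / span if span else 0.0
    y, xs = norm[0], norm[1:]
    # Absolute differences between the reference and each comparison sequence.
    diffs = [[abs(y[k] - x[k]) for k in range(n)] for x in xs]
    d_min = min(min(d) for d in diffs)
    d_max = max(max(d) for d in diffs)
    if d_max == 0:
        # Every comparison sequence coincides with the reference.
        return [1.0] * len(xs)
    degrees = []
    for d in diffs:
        # Correlation coefficient at each point, then its average.
        coeffs = [(d_min + rho * d_max) / (dk + rho * d_max) for dk in d]
        degrees.append(sum(coeffs) / n)
    return degrees
```

A comparison sequence identical to the reference receives a degree of 1, and the degree falls as the two curves diverge, so sorting the returned list ranks candidate days by similarity.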

Support Vector Regression
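Based on the structural risk minimization theory, the support vector machine constructs a hyperplane in the feature space, thereby overcoming the local optimum problem and requiring fewer training samples [30]. When the data type is complex, support vector regression is used. For a set of data $\{(X_i, Y_i)\}_{i=1}^{N}$, $X_i$ is the input variable of the sample and $Y_i$ is the target value. The support vector machine equation based on Vapnik theory is as follows [31]:
$$f(x) = \omega^{T} \Phi(x) + b$$
where $\omega$ is a vector of weight coefficients, $\Phi(x)$ is the nonlinear mapping function and $b$ denotes a bias constant. $\omega$ and $b$ can be obtained by the following formula:
$$\min_{\omega, b, \xi, \xi^*} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} Y_i - \omega^{T} \Phi(X_i) - b \le \varepsilon + \xi_i \\ \omega^{T} \Phi(X_i) + b - Y_i \le \varepsilon + \xi_i^* \\ \xi_i, \ \xi_i^* \ge 0 \end{cases} \tag{8}$$
where $\xi_i$ and $\xi_i^*$ are slack variables, $C$ denotes the penalty parameter and $\varepsilon$ is the parameter of the insensitive loss function. By introducing Lagrangian multipliers and optimality constraints, (8) can be transformed into:
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(X_i, x) + b$$
where $\alpha_i$ and $\alpha_i^*$ are the Lagrangian multipliers and $K(X_i, x) = \Phi(X_i)^{T} \Phi(x)$ is the kernel function. In this paper, the radial basis function (RBF) kernel is applied to construct the SVR model. The RBF kernel is presented as:
$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$
where $\gamma$ is the kernel parameter.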
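To make the roles of γ and ε concrete, the following sketch evaluates the RBF kernel, the ε-insensitive loss and a kernel-expansion prediction for given dual coefficients. It does not train an SVR; the function names and the idea of passing in precomputed coefficients (the differences of the Lagrangian multipliers) are assumptions for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Kernel expansion: sum of dual_coef_i * K(X_i, x), plus the bias b.

    dual_coefs are assumed to hold (alpha_i - alpha_i^*) from a solved model.
    """
    return sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, dual_coefs)
    ) + b
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a narrower neighbourhood of inputs.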

The HKGSVR Model Workflow
The flow chart of the hybrid K-means-GRA-SVR (HKGSVR) model is shown in Figure 1, and the workflow is as follows:
Step 1: Obtain historical photovoltaic output power and meteorological factor data, and deal with the missing and abnormal data in the data set.
Step 2: Use the multi-index K-means++ algorithm to cluster historical photovoltaic power data by season, and calculate the 12 meteorological factor eigenvalues as the central value of each cluster. According to its average power, each cluster is classified as an ideal weather cluster or a non-ideal weather cluster.
Step 3: Calculate the Euclidean distance, Pearson correlation coefficient and GRA correlation between the 12 meteorological eigenvalues of the forecast day and the centroid value of each cluster to determine the cluster to which the forecast day belongs.
Step 4: Calculate the correlation between the forecast day and each sample in the matched cluster through GRA to obtain the similar days (as the training set) and the nearest neighbor similar day (as the validation set), and normalize them.
Step 5: Select appropriate input features and use the similar days to train the SVR. Determine the C and γ of the SVR through grid search and cross-validation, and use the nearest neighbor similar day to test.
Step 6: Use the trained model to predict the power of the forecast day.

Clustering Evaluation Metrics
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient is an example of such an evaluation; the score is higher when clusters are dense and well separated. The Silhouette Coefficient $S(i)$ is defined as follows [32]:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where $a(i)$ is the mean distance between a sample and all other points in the same cluster, and $b(i)$ is the mean distance between a sample and all points in the next nearest cluster. The average of the Silhouette Coefficients of all points is the total Silhouette Coefficient of the clustering result. The Davies-Bouldin index (DBI) is defined as follows [33]:
$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \ne i} \frac{S_i + S_j}{M_{ij}}$$
where $S_i$ is the average distance from the points in cluster $i$ to the cluster centroid, and $M_{ij}$ is the distance between the centroids of clusters $i$ and $j$. SSE is also an effective metric, that is, the sum of squared errors of the distance between the centroid of each cluster and the points in the cluster. SSE is defined as follows:
$$SSE = \sum_{i=1}^{K} \sum_{x \in c_i} \|x - C_i\|^2$$
where $c_i$ is the i-th cluster and $C_i$ is its centroid.
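Two of these metrics are simple enough to sketch directly. The code below computes the mean Silhouette Coefficient and the SSE for clusters given as lists of points; scoring a singleton cluster as 0 is a convention assumed here, and the function names are illustrative.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def silhouette(clusters):
    """Mean Silhouette Coefficient; `clusters` is a list of point lists."""
    scores = []
    for i, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            others = cluster[:idx] + cluster[idx + 1:]
            if not others:
                scores.append(0.0)  # assumed convention for singletons
                continue
            # a: mean distance to the other points of the same cluster.
            a = sum(dist(p, q) for q in others) / len(others)
            # b: smallest mean distance to the points of any other cluster.
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        total += sum(dist(p, centroid) ** 2 for p in cluster)
    return total
```

SSE always decreases as K grows, which is why it is read through the point where its decline slows, whereas the Silhouette Coefficient can be compared directly across values of K.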

Metrics of Photovoltaic Power Forecasting Techniques
In order to evaluate the performance of the proposed HKGSVR method for photovoltaic power generation forecasting, the root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R²) were calculated. MAE reflects the average magnitude of the difference between the predicted value and the true value. RMSE also measures the deviation between the predicted value and the actual value, but because each error is squared before averaging, it is more sensitive to outliers (that is, if the predicted value of a point is very different from the true value, the RMSE of the curve will be very large). R² tests how well the predicted values fit the true values, and is generally used to evaluate the prediction performance of the model. They are defined as follows [14].
1) The RMSE is defined as:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (P_{ai} - P_{fi})^2}$$
where $P_{ai}$ and $P_{fi}$ are the actual and predicted values at hour $i$, and $N$ refers to the number of hours a sample contains.
2) The MAE is expressed as:
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |P_{ai} - P_{fi}|$$
3) The R² is given as:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (P_{ai} - P_{fi})^2}{\sum_{i=1}^{N} (P_{ai} - \bar{P}_a)^2}$$
where $\bar{P}_a$ is the mean of the actual values.
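The three forecasting metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions of RMSE, MAE and R²; the function names and the sample values in the usage comment are made up for illustration.

```python
import math


def rmse(actual, forecast):
    """Root mean square error over paired hourly values."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)


def mae(actual, forecast):
    """Mean absolute error over paired hourly values."""
    n = len(actual)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / n


def r2(actual, forecast):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

For example, with actual values `[1, 2, 3, 4]` and forecasts `[1, 2, 3, 6]`, the single error of 2 gives an MAE of 0.5 but an RMSE of 1.0, which shows the stronger effect of one outlier on RMSE.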

Data
In this paper, the general datasets on the DKASC (Desert Knowledge Australia Solar Center) website are used for the experiments. The photovoltaic array is composed of 22 polycrystalline silicon photovoltaic panels, each with a rated power of 265 W, for a total rated power of 5.83 kW. The photovoltaic array is located at the Desert Knowledge Precinct in Alice Springs, a town in the Northern Territory that enjoys one of the country's highest solar resources in an arid desert environment. The configuration information of the photovoltaic array is shown in Table 1. Meteorological data (global horizontal irradiance, diffuse horizontal irradiance, relative humidity and temperature) and historical power data of the PV array from March 1, 2018 to February 29, 2020 were used in the experiment. The experiment uses data at 1-hour intervals from 7:00 to 18:00 every day.

Number of Clusters and Weather Division
In order to obtain the appropriate number of clusters for each season, SSE, DBI and the Silhouette Coefficient (S) are used for evaluation. Taking autumn as an example, the experimental results are shown in Figure 2. It can be seen from Figure 2 that SSE decreases as the number of clusters increases, and the downward trend begins to slow down when the number of clusters is 3. DBI performs best when K is 3. When K is 2 and 3, S is 0.71 and 0.64, respectively; as K increases further, S drops sharply. The value of K is therefore chosen between 2 and 3. When K = 2, the blue cluster and the red cluster merge into one cluster. However, the curves in the blue cluster are mostly smooth arcs, while those in the red cluster are mostly polylines. Therefore, K is chosen to be 3. The blue cluster is selected as the ideal weather cluster (most of its curves are smooth and its average power is larger), and the green and red clusters are non-ideal weather clusters (most of their curves are polylines and their average power is small: 123.09 kW for the green cluster and 301.90 kW for the red cluster).
The evaluation of clustering results in each season is shown in Table 2. In order to prevent local optima or other abnormal situations, 100 rounds of experiments were carried out. Considering all indicators and clustering results comprehensively, the number of clusters in spring is 3, the number of clusters in summer is 2, the number of clusters in autumn is 3, and the number of clusters in winter is 3.
The clustering results of each season are divided into ideal weather clusters and non-ideal weather clusters by comparing the average power and geometric shape (arc or polyline) of each cluster. The average power of each cluster in the four seasons is shown in Table 3. The curves in ideal weather clusters are mostly smooth arcs, and the average power is relatively large. The curves in non-ideal weather clusters are mostly polylines, and the average power is small. Therefore, spring cluster 1, summer cluster 1, autumn cluster 1 and winter clusters 2 and 3 are classified as ideal weather clusters, and the rest are non-ideal weather clusters.

Selection of Similar Day Threshold and Nearest Similar Day
The similar days are obtained by calculating the GRA correlation between the forecast day and the samples in the matching cluster. In order to improve the prediction accuracy while reducing the computational cost and speeding up the training of the model, it is necessary to select an appropriate correlation threshold. A higher correlation threshold can improve the prediction accuracy, but it also leaves fewer training samples, and too few training samples may cause overfitting. After comprehensive consideration, the similar day correlation threshold of each forecast day, the nearest neighbor similar day and its correlation degree are shown in Table 4 and Table 5. It can be seen that the nearest neighbor similar days of ideal weather are mostly adjacent days, while the time intervals of the nearest neighbor similar days of non-ideal weather are relatively long.
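The selection rule can be sketched as follows, assuming the GRA correlations with the forecast day have already been computed. The function name `select_similar_days`, the dictionary input and the exact window boundaries (the `window_days` calendar days immediately before the forecast day) are assumptions for illustration; the threshold itself is chosen per forecast day as described above.

```python
from datetime import date, timedelta


def select_similar_days(forecast_day, correlations, threshold, window_days):
    """Return (similar_days, nearest_neighbor_similar_day).

    `correlations` maps each historical date to its GRA correlation degree
    with the forecast day. `window_days` would be 7 for ideal weather and
    30 for non-ideal weather.
    """
    # Similar days: every date whose correlation exceeds the threshold.
    similar = sorted(d for d, r in correlations.items() if r > threshold)
    # Nearest neighbor similar day: best match inside the recent window.
    start = forecast_day - timedelta(days=window_days)
    recent = {d: r for d, r in correlations.items() if start <= d < forecast_day}
    nearest = max(recent, key=recent.get) if recent else None
    return similar, nearest
```

Raising `threshold` shrinks the training set, which is the trade-off between similarity and sample count discussed in this section.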

Design of SVR Model
This part explores the optimal C and γ of the SVR, which are usually related to the characteristics of power generation in different seasons. Grid search and cross-validation are used to find the optimal values of C and γ for the SVR.
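The search procedure can be sketched independently of any particular SVR library. In the version below, the model is supplied as a callback named `fit_predict` (a hypothetical name) that trains on one part of the data and returns predictions for the held-out part; contiguous folds and RMSE as the selection score are also assumptions, since the text does not state them.

```python
import itertools
import math


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit_predict, X, y, grid, k=3):
    """Return (best_params, best_rmse) over all combinations in `grid`.

    `grid` maps a parameter name (e.g. "C", "gamma") to candidate values.
    `fit_predict(X_train, y_train, X_test, **params)` returns predictions.
    """
    names = sorted(grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        squared_errors = []
        for fold in k_fold_indices(len(X), k):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train],
                [y[i] for i in train],
                [X[i] for i in fold],
                **params,
            )
            squared_errors += [(p - y[i]) ** 2 for p, i in zip(preds, fold)]
        # Cross-validated RMSE pooled over all held-out samples.
        score = math.sqrt(sum(squared_errors) / len(squared_errors))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Here `fit_predict` would wrap whatever RBF-kernel SVR implementation is in use, trained on the similar days. The cost grows with the product of the candidate counts for C and γ, and with the number of training samples per fit.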
This experiment uses PyCharm (Python 3.6) to train and optimize the SVR model on a Windows 7 personal computer with an Intel Core i5-3230 CPU at 2.60 GHz and 4 GB of RAM. It can be seen from Table 6 and Table 7 that the training optimization time of the ideal weather model for each season is 2.0342, 1.9506, 2.3272 and 0.6826 s, with an average of 1.74865 s, and that of the non-ideal weather model for each season is 0.2490, 0.2400, 0.2240 and 0.2060 s, with an average of 0.22975 s. The number of matched similar days has a considerable impact on the training optimization time. Comparing Table 6 and Table 7 shows that, because the data complexity of non-ideal weather is higher than that of ideal weather, the C of non-ideal weather is generally larger than that of ideal weather, and the γ of non-ideal weather is generally smaller than that of ideal weather.

Feature Selection
In order to select the appropriate input features, GRA and Pearson correlation analysis are performed between the power generation and various meteorological factors. The historical power and meteorological data for the year from March 1, 2018 to February 28, 2019 are used for analysis. The result is shown in Figure 3. The definition of the Pearson correlation coefficient is as follows:
$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$
where $X$ and $Y$ are the meteorological factor and the photovoltaic output power respectively, $\bar{X}$ and $\bar{Y}$ are their mean values, and $N$ is the number of sampling points per day.
It can be seen from Figure 3 that the Pearson correlation coefficients between photovoltaic power generation and T, RH, GHI and DHI are 0.35, −0.41, 0.97 and 0.35, respectively. GHI has the greatest impact on photovoltaic output, and there is a negative correlation between relative humidity and photovoltaic power. The GRA correlations between photovoltaic power generation and T, RH, GHI and DHI are 0.68, 0.60, 0.88 and 0.67, respectively. Under GRA as well, GHI has the largest impact on photovoltaic output.
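For reference, the Pearson correlation coefficient used in this analysis can be computed as in the sketch below, which follows the standard definition. The function name is illustrative, and the input would be one meteorological series and the matching power series.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

The coefficient is signed, which is how the negative relationship between relative humidity and power shows up, whereas GRA correlation degrees are always positive.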
Based on the above analysis, the paper proposes 10 feature combinations. The prefix N represents the nearest neighbor similar day, P, G, and M respectively represent Power, GHI and meteorological factor eigenvalues. For example, NG_MG represents the nearest neighbor day GHI and predicted day meteorological factor eigenvalues and GHI.
For ideal weather, because its prediction accuracy is already high, the main consideration in selecting input features is to choose a feature combination that is easier to obtain and requires less data accuracy while ensuring sufficient prediction accuracy.
It can be seen from Tables 8-10 that the MAE of the NG_MG feature combination in each season is 1.3733, 2.0817, 1.6475 and 2.2323 kW, respectively. In terms of comprehensive performance, the NG_MG feature combination has higher prediction accuracy and robustness, so NG_MG is selected as the input feature for non-ideal weather.
Under ideal weather, the feature combination is NP_M, and the average value of R² is 0.9966.
From the spring forecast results in Figure 5, it can be seen that HKGSVR has the highest degree of fit, HKGLSTM is the second, and the SVR trend is more consistent with the predicted day. From the summer forecast results in Figure 6, it can be seen that each point of HKGSVR has a high degree of fit, and HKGLSTM performs well except for one point that has a lower degree of fit. In the autumn forecast results in Figure 7, HKGSVR still performs best, and the trends of SVR and HKGARIMA are more consistent with the forecast day. The winter forecast results in Figure 8 show that both HKGSVR and SVR perform better, while HKGLSTM and HKGBP perform poorly because of the smaller amount of training data.

Forecasting Results and Discussion
According to the MAE value of each model given in Figure 9, the average MAE of HKGSVR is 1.8337 kW, which is the minimum among all models. Figure 10 shows that the average RMSE of the HKGSVR model is 2.1379 kW, which is the best value among all models. Relative to the five compared models, the proposed model improves the average RMSE by 24.13%, 52.12%, 59.87%, 61.37% and 52.66%, respectively.
Comparing the R² values of the models shown in Table 11 shows that the average R² of the proposed model is 0.9847, which is better than that of the other models. Under non-ideal weather, the average training optimization time of the HKGSVR model is 0.2225 s, which is 98.07% less than that of the standard SVR (11.5389 s).
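As a quick arithmetic check of the reported reduction, using only the two average training optimization times quoted above:

$$\frac{11.5389 - 0.2225}{11.5389} = \frac{11.3164}{11.5389} \approx 0.9807$$

that is, a reduction of about 98.07% relative to the standard SVR.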

Conclusion
A hybrid day-ahead photovoltaic power generation prediction model based on K-means++, GRA and SVR is proposed. The historical power data are clustered by multi-index K-means++, and divided into ideal weather clusters and non-ideal weather clusters according to the average power of each cluster. The model chooses an appropriate feature combination for each weather type, since the feature combination has a considerable impact on model performance. It also uses GRA to match the similar days and the nearest neighbor similar day of the prediction day to improve the prediction accuracy and reduce the training optimization time of the model. Compared with the standard SVR model under ideal weather, the HKGSVR model not only improves the prediction accuracy but also greatly shortens the training time. Under non-ideal weather, the average MAE, RMSE and R² of the proposed model are 1.8337 kW, 2.1379 kW and 98.47%, respectively, which is better than the other five models, and the training time is 0.2225 s, which is 98.07% less than that of the standard SVR. When more accurate weather forecast information is available and less training data is used, the HKGSVR model has higher forecast accuracy. In general, HKGSVR has higher accuracy, shorter training time and better generalization performance. Therefore, the model can be used to predict the daily power generation of photovoltaic power plants. However, in terms of model structure optimization, this paper uses grid search, so there is room for improvement in the optimization speed and search range, which will be the direction of further research.