
The continuous increase in solar penetration poses challenges to the smooth operation of the power grid. To keep photovoltaic (PV) power generation from disturbing grid operation, accurate PV power prediction is required, and short-term forecasting is essential for scheduling daily power generation plans. In this paper, a short-term PV power forecasting method based on K-means++, grey relational analysis (GRA) and support vector regression (SVR) (Hybrid Kmeans-GRA-SVR, HKGSVR) is proposed. The historical power data are clustered by a multi-index K-means++ algorithm. The similar days and the nearest neighbor similar day of the prediction day are then selected by the GRA algorithm and used to train the SVR, yielding an accurate PV power prediction model. Under ideal weather, the average values of MAE, RMSE, and R^{2} were 0.8101 kW, 0.9608 kW, and 99.66%, respectively. The average computation time was 1.7487 s, considerably shorter than the 7.6924 s of the plain SVR model. Thus, the numerical results verify the effectiveness of the proposed model for short-term PV power prediction.

In recent years, although breakthroughs have been made in the exploitation of shale gas and deep-sea combustible ice has also progressed, the fact that fossil energy reserves have limits has not changed. The development and utilization of renewable energy technologies is still of great significance [

However, the chaotic nature of the weather system makes the production of photovoltaic power generation highly random, volatile and intermittent, which greatly increases the difficulty of large-scale application of PV power generation [

In order to improve the power system’s ability to consume photovoltaic energy, many solutions have been proposed, including energy storage optimization [

The current photovoltaic power generation forecasting technologies have three main directions: physical methods, statistical-time series methods and ensemble methods [

Satellite Images and Sky Images make predictions by tracking and predicting the trajectory of the cloud, but are limited by image resolution and processing algorithms [

The ensemble method solves the limitations of a single model by mixing together different models with unique functions, thereby improving the prediction performance [

Under ideal weather condition, a hybrid Kmeans-GRA-SVR model is proposed in this paper. The main contributions of this paper include:

1) A novel short-term PV power forecasting method that utilizes SVR, clustering and similarity algorithms was proposed.

2) In order to increase the operation speed and reduce the operation cost, a multi-index clustering algorithm is used to cluster historical power data to obtain ideal weather and non-ideal weather.

3) In order to improve the prediction accuracy of photovoltaic power generation under ideal weather conditions, clustering algorithm and GRA algorithm are used to obtain similar days in the same cluster as the forecasting day.

4) The GRA algorithm is used to obtain the nearest neighbor similar days, which mitigates the loss of prediction accuracy caused by a large time interval between the similar days and the forecasting day.

The remainder of this paper is organized as follows. Section 2 describes the hybrid Kmeans-GRA-SVR model. Section 3 illustrates clustering and model evaluation metrics. Section 4 introduces the experiments and result analysis. Finally, conclusions are given in Section 5.

The K-means++ clustering algorithm is an improved version of the K-means algorithm that spreads the K initial cluster centers farther apart from each other. In this work, it is selected as the clustering method due to its higher efficiency and improved robustness compared with others (e.g., standard K-means, K-medoids, Gaussian mixture models, etc.) [

Step 1: Randomly select a sample as the first cluster center c_{1};

Step 2: Calculate the probability of each sample being selected as the next cluster center:

$$P(x) = \frac{D(x)^2}{\sum_{x \in X} D(x)^2} \tag{1}$$

where, D(x) represents the distance between the sample and the nearest cluster center.

Then use the roulette method to select the next cluster center;

Step 3: Repeat step 2 until K cluster centers are selected;

Step 4: For each sample x_{i} in the datasets, calculate its distance to K cluster centers, and then put it into the class corresponding to the smallest distance cluster center;

Step 5: For each cluster, recalculate its cluster center c_{i}:

$$c_i = \frac{1}{|c_i|} \sum_{x \in c_i} x \tag{2}$$

Step 6: Repeat steps 4 and 5 until the position of the cluster center does not change.
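Steps 1 to 6 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this paper: the function names (`sq_dist`, `kmeans_pp_init`, `kmeans_pp`), the `seed` and `max_iter` arguments, and the handling of empty clusters are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_init(samples, k, rng):
    """Steps 1-3: choose k initial centers with probability proportional to D(x)^2."""
    centers = [rng.choice(samples)]  # Step 1: first center picked uniformly at random
    while len(centers) < k:
        # D(x)^2: squared distance from each sample to its nearest chosen center
        weights = [min(sq_dist(x, c) for c in centers) for x in samples]
        # Step 2: roulette-wheel selection with the probabilities of Eq. (1)
        centers.append(rng.choices(samples, weights=weights, k=1)[0])
    return centers


def kmeans_pp(samples, k, seed=0, max_iter=100):
    """Steps 4-6: assignment/update iterations started from the k-means++ centers."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(samples, k, rng)
    labels = []
    for _ in range(max_iter):
        # Step 4: assign every sample to its nearest center
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in samples]
        # Step 5: recompute each center as the mean of its members, Eq. (2)
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # assumption: keep the center of an empty cluster
        # Step 6: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans_pp(samples, 2)` on a list of daily power curves (one list of hourly values per day) returns the two cluster centers and one label per day.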

In this part, the K-means++ clustering method is applied directly to the historical power data of each season. Historical power data are chosen for clustering because equipment aging and the equipment's own performance indicators vary with weather conditions; these changes are difficult to measure precisely, but they are already reflected in the historical power data. After clustering the historical power data, the centroid value of each cluster is calculated from the minimum, average and maximum of global horizontal irradiance (GHI), diffuse horizontal irradiance (DHI), relative humidity (RH) and temperature (T). The Euclidean distance between the 12 meteorological feature values of the forecast day and each cluster centroid is then calculated to determine the cluster to which the forecast day belongs.
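The assignment of a forecast day to a cluster can be illustrated with a short sketch. The helper names `daily_features` and `nearest_cluster` are hypothetical, and the centroids are assumed to be given as 12-element lists in the same feature order; whether the features are rescaled before the distance is computed is not stated here, so no scaling is applied.

```python
import math


def daily_features(ghi, dhi, rh, temp):
    """Return the 12 features: (min, mean, max) of GHI, DHI, RH and T for one day."""
    feats = []
    for series in (ghi, dhi, rh, temp):
        feats += [min(series), sum(series) / len(series), max(series)]
    return feats


def nearest_cluster(features, centroids):
    """Index of the centroid with the smallest Euclidean distance to `features`."""
    return min(range(len(centroids)), key=lambda j: math.dist(features, centroids[j]))
```

Because GHI, RH and T live on very different numeric ranges, an unscaled Euclidean distance is dominated by the irradiance terms; a practical implementation may want to normalize the features first.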

Grey relational analysis refers to the quantitative description and comparison method of the development and change of a system. The basic idea is to judge the correlation degree by comparing the geometrical similarity between the reference data column and several data columns. It reflects the degree of correlation between the curves. Generally, the more consistent the change tendency of the reference sequence and the comparison sequence, the higher the degree of correlation between the two variables. The flow of the GRA algorithm is as follows [

Step 1: Determine the reference sequence y that reflects the characteristics of the system behavior and the comparison sequence x_{i} that affects the system behavior:

$$y = \{ y(k) \mid k = 1, 2, \cdots, n \} \tag{3}$$

$$x_i = \{ x_i(k) \mid k = 1, 2, \cdots, n \}, \quad i = 1, 2, \cdots, m \tag{4}$$

where n and m represent the dimension of the eigenvalues and the number of comparison sequences, respectively.

Step 2: Non-dimensionalization of variables:

$$d_j^*(k) = \frac{D_j(k) - D_{av}(k)}{D_{\max}(k) - D_{\min}(k)}, \quad k = 1, 2, \cdots, n; \; i = 0, 1, 2, \cdots, m; \; j = 1, 2, \cdots, m+1 \tag{5}$$

where D_{j}(k) is the k-th value of the j-th sequence, with j running over the reference sequence and the m comparison sequences (m + 1 sequences in total), and D_{av}(k), D_{min}(k) and D_{max}(k) are the average, minimum and maximum values of each column.

Non-dimensionalization is used to solve the problem that the columns cannot be compared due to the different dimensions.

Step 3: Calculate correlation coefficient ξ_{i}(k):

$$\xi_i(k) = \frac{\min_i \min_k | y(k) - x_i(k) | + \rho \max_i \max_k | y(k) - x_i(k) |}{| y(k) - x_i(k) | + \rho \max_i \max_k | y(k) - x_i(k) |} \tag{6}$$

where ρ is the resolution coefficient; here, ρ is 0.5.

Step 4: Calculate correlation degree.

Calculate the average value of the correlation coefficient at each moment (that is, each point in the curve) r_{i}:

$$r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k), \quad k = 1, 2, \cdots, n \tag{7}$$

Step 5: Sort correlation degree.
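The five GRA steps can be sketched as below. This is an illustrative reading of Eqs. (5) to (7), not the authors' code: the function names are hypothetical, the normalization is applied column by column over the reference and comparison sequences together, and the treatment of constant columns and of identical sequences is an assumption added to avoid division by zero.

```python
def normalize_columns(rows):
    """Eq. (5): (value - column mean) / (column max - column min), per column."""
    out_cols = []
    for col in zip(*rows):
        span = max(col) - min(col)
        mean = sum(col) / len(col)
        # assumption: a constant column carries no information and is mapped to zeros
        out_cols.append([(v - mean) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def grey_relational_degrees(reference, comparisons, rho=0.5):
    """Eqs. (6)-(7): relational degree r_i of every comparison sequence."""
    rows = normalize_columns([reference] + comparisons)
    y, xs = rows[0], rows[1:]
    diffs = [[abs(yk - xk) for yk, xk in zip(y, x)] for x in xs]
    d_min = min(min(d) for d in diffs)  # min over i and k
    d_max = max(max(d) for d in diffs)  # max over i and k
    if d_max == 0:
        return [1.0] * len(xs)  # assumption: all sequences coincide with the reference
    degrees = []
    for d in diffs:
        xi = [(d_min + rho * d_max) / (dk + rho * d_max) for dk in d]  # Eq. (6)
        degrees.append(sum(xi) / len(xi))                              # Eq. (7)
    return degrees


def rank_by_degree(degrees):
    """Step 5: indices of the comparison sequences, most similar first."""
    return sorted(range(len(degrees)), key=lambda i: degrees[i], reverse=True)
```

In the setting of this paper, `reference` would hold the 12 meteorological feature values of the forecast day and each entry of `comparisons` the same 12 values for one historical day in the cluster.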

After determining the cluster to which the prediction day belongs, the correlation between the prediction day and each sample in the cluster is calculated by GRA based on the 12 meteorological feature values. Days whose correlation degree exceeds a threshold (a correlation value chosen to balance similarity against the number of available samples) are regarded as similar days. For ideal weather, the sample with the highest correlation among the 7 days before the forecast date is taken as the nearest neighbor similar day.
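The selection rule described above might be written as follows. The function name `select_similar_days`, its arguments, and the dictionary layout are assumptions for this sketch; the relational degrees are taken as already computed by GRA.

```python
from datetime import date, timedelta


def select_similar_days(degrees_by_day, forecast_day, threshold=0.85, window_days=7):
    """Return (similar days above the threshold, nearest neighbor similar day).

    degrees_by_day maps a historical date to its GRA relational degree
    with respect to the forecast day.
    """
    similar = sorted(d for d, r in degrees_by_day.items() if r > threshold)
    # candidates for the nearest neighbor: the `window_days` days before the forecast day
    recent = {
        d: r
        for d, r in degrees_by_day.items()
        if timedelta(0) < forecast_day - d <= timedelta(days=window_days)
    }
    nearest = max(recent, key=recent.get) if recent else None
    return similar, nearest
```

The rule only looks backwards in time, so the nearest neighbor similar day is always one of the seven days preceding the forecast day, while the similar days may come from anywhere in the cluster's history.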

SVM obtains the ability to linearly analyze the nonlinear characteristics of the sample by mapping low-dimensional data to high-dimensional space. Based on the structural risk minimization theory, SVM constructs the optimal classification surface in the feature space, thereby overcoming the local optimal problem and requiring fewer training samples. When the data type is complex, SVR can be used, which was first developed by Vapnik et al. [

$$f(x) = \omega^{T} \phi(x) + b \tag{8}$$

where ω is a vector of weight coefficients, ϕ ( x ) is the nonlinear mapping function (mapping x to a high-dimensional feature space), and b denotes a bias constant. In addition, b and ω can be obtained by the following formula:

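$$\text{minimize:} \quad \frac{1}{2} \| \omega \|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \tag{9}$$

subject to:

$$\begin{cases} y_i - \langle \omega, \phi(x_i) \rangle - b \le \varepsilon + \xi_i \\ \langle \omega, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i \ge 0, \; \xi_i^* \ge 0 \end{cases} \tag{10}$$

where $\xi_i$ and $\xi_i^*$ are slack variables, C denotes the penalty parameter, and ε is the tolerance of the ε-insensitive loss function.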
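To make the role of the slack variables concrete, the sketch below evaluates the constraints of Eq. (10) and the objective of Eq. (9) for given predictions. The function names are hypothetical, and the code only evaluates the objective; it does not solve the optimization problem.

```python
def slack_variables(y, f_x, epsilon):
    """Smallest (xi, xi_star) that satisfy the constraints of Eq. (10) for one sample."""
    xi = max(0.0, y - f_x - epsilon)       # prediction too low by more than epsilon
    xi_star = max(0.0, f_x - y - epsilon)  # prediction too high by more than epsilon
    return xi, xi_star


def svr_objective(weights, targets, predictions, C, epsilon):
    """Value of Eq. (9) for a weight vector and the predictions f(x_i) it produces."""
    penalty = sum(sum(slack_variables(y, f, epsilon)) for y, f in zip(targets, predictions))
    return 0.5 * sum(w * w for w in weights) + C * penalty
```

Errors smaller than ε cost nothing, which is why a larger ε yields a flatter, sparser model, while a larger C penalizes the remaining errors more heavily.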

In this paper, the radial basis function (RBF) kernel is applied to construct the SVR model. The RBF kernel is presented as:

$$K(x_i, x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right) \tag{11}$$

where γ is the kernel parameter.
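As a small worked example of the RBF kernel, the following sketch computes the kernel value for two samples and the kernel (Gram) matrix of a sample set. The function names are chosen for illustration only.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def gram_matrix(samples, gamma):
    """Symmetric matrix of pairwise kernel values K(x_i, x_j)."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]
```

A small γ makes the kernel decay slowly, so distant samples still influence each other; a large γ makes the model respond only to very close neighbors.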

The working process of the hybrid Kmeans-GRA-SVR model is shown in

Step 1: The historical power and meteorological factors data are used as training and test data, and the missing and abnormal data in the data set are processed.

Step 2: The historical photovoltaic power data of the four seasons are clustered separately through the K-means++ algorithm, and the minimum, average and maximum values of GHI, DHI, RH, and T are regarded as the central value of each cluster.

Step 3: According to the 12 meteorological factor eigenvalues, the Euclidean distance between the forecast day and each cluster center is calculated to determine the cluster category to which the forecast day belongs.

Step 4: The correlation between the prediction day and each sample in the cluster is calculated by GRA to obtain similar days and nearest neighbor similar days. We normalize the data of similar days and nearest neighbor similar days as training set and validation set.

Step 5: After determining the C and γ of SVR through grid search and cross-validation, the SVR is trained to obtain a prediction model and predict the output power on the prediction day.

The structure of the prediction model under ideal weather conditions is shown in

If the ground truth labels are not known, the clustering must be evaluated using the model itself. The Silhouette Coefficient is one such measure; its score is higher when clusters are dense and well separated. The Silhouette Coefficient S(i) of a sample i is defined as follows:

$$S(i) = \frac{b(i) - a(i)}{\max\{ a(i), b(i) \}} \tag{12}$$

where a(i) is the mean distance between the sample and all other points in the same cluster, and b(i) is the mean distance between the sample and all points in the next nearest cluster. The total Silhouette Coefficient of the clustering result is the average of S(i) over all samples.
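A direct, unoptimized computation of the total Silhouette Coefficient might look like this. The function name and the convention of scoring singleton clusters as 0 are assumptions; at least two clusters are required.

```python
import math


def silhouette_score(samples, labels):
    """Mean of S(i) = (b - a) / max(a, b) over all samples (needs >= 2 clusters)."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        # a: mean distance to the other members (the distance to itself is 0)
        a = sum(math.dist(x, o) for o in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(x, o) for o in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

The pairwise distance computation is quadratic in the number of samples, which is acceptable for the few hundred daily curves per season considered here.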

Davies-Bouldin index is defined as follows:

$$\mathrm{DBI} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\bar{S}_i + \bar{S}_j}{\| \omega_i - \omega_j \|_2} \right) \tag{13}$$

where n is the number of clusters, $\bar{S}_i$ is the average distance from the points in cluster i to its centroid, and $\| \omega_i - \omega_j \|_2$ is the distance between the centroids of clusters i and j. The Davies-Bouldin index is lower when the clusters are better separated.
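The Davies-Bouldin index can be computed along the same lines. This sketch uses a hypothetical function name and assumes that cluster centroids are the per-dimension means of the cluster members and that no two centroids coincide.

```python
import math


def davies_bouldin_index(samples, labels):
    """Average over clusters of the worst (S_i + S_j) / ||w_i - w_j|| ratio."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    centroids = {
        lab: [sum(col) / len(pts) for col in zip(*pts)] for lab, pts in clusters.items()
    }
    # S_i: mean distance from the members of cluster i to its centroid
    scatter = {
        lab: sum(math.dist(x, centroids[lab]) for x in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    keys = list(clusters)
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)
```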

Sum of squared errors (SSE) is also an effective metric. It is the sum of the squared distances between each point and the centroid of the cluster to which it belongs. SSE is defined as follows:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in c_i} \mathrm{dist}(x, c_i)^2 \tag{14}$$
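A minimal sketch of the SSE computation follows; the function name is hypothetical and the centroid of each cluster is taken to be the mean of its members.

```python
def sse(samples, labels):
    """Sum of squared distances from every sample to the centroid of its cluster."""
    clusters = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    total = 0.0
    for pts in clusters.values():
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        total += sum(sum((xi - ci) ** 2 for xi, ci in zip(x, centroid)) for x in pts)
    return total
```

SSE can only decrease as the number of clusters grows, so it is typically read for an "elbow" and combined with other indices rather than minimized on its own.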

In order to evaluate the performance of the proposed HKGSVR method for photovoltaic power generation forecasting, the root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R^{2}) were calculated. They are defined as follows.

1) The RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( P_{fi} - P_{ai} \right)^2} \tag{15}$$

where P_{ai} and P_{fi} are the actual and predicted values at hour i, and N is the number of hours a sample contains.

2) The MAE is expressed as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| P_{fi} - P_{ai} \right| \tag{16}$$

3) The R^{2} is given as:

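$$R^2 = \frac{\left( N \sum_{i=1}^{N} P_{fi} P_{ai} - \sum_{i=1}^{N} P_{fi} \sum_{i=1}^{N} P_{ai} \right)^2}{\left( N \sum_{i=1}^{N} P_{fi}^2 - \left( \sum_{i=1}^{N} P_{fi} \right)^2 \right) \left( N \sum_{i=1}^{N} P_{ai}^2 - \left( \sum_{i=1}^{N} P_{ai} \right)^2 \right)} \tag{17}$$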
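The three forecasting metrics can be written compactly as below. The function names are chosen for this sketch; `pred` and `actual` are equal-length lists of hourly predicted and actual power values.

```python
import math


def rmse(pred, actual):
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def r2(pred, actual):
    """R^2 in the squared-correlation form given above."""
    n = len(pred)
    s_p, s_a = sum(pred), sum(actual)
    s_pa = sum(p * a for p, a in zip(pred, actual))
    s_pp = sum(p * p for p in pred)
    s_aa = sum(a * a for a in actual)
    numerator = (n * s_pa - s_p * s_a) ** 2
    denominator = (n * s_pp - s_p ** 2) * (n * s_aa - s_a ** 2)
    return numerator / denominator
```

Note that this form of R^{2} is the squared Pearson correlation between forecast and measurement, so a forecast that is off by a constant factor still scores 1; RMSE and MAE are needed to expose such a bias.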

In this paper, the general datasets on the DKASC (Desert Knowledge Australia Solar Center) website are used for related experiments. The photovoltaic array is composed of 22 polycrystalline silicon photovoltaic panels with a rated power of 265 W, whose total rated power is 5.83 kW. The photovoltaic array is located at the Desert Knowledge Precinct in Alice Springs (a town in the Northern Territory that enjoys one of the country’s highest solar resources in an arid desert environment). The geographic location, physical object and configuration information of the photovoltaic array are shown in

Item | Information |
---|---|
Array Rating | 5.83 kW |
Panel Rating | 265 W |
Number of Panels | 22 |
Panel Type | HSL 60S |
Array Area | 36.74 m^{2} |
Inverter Size/Type | SMA SMC 6000 A |
Array Tilt/Azimuth | Tilt = 20, Azimuth = 0 (Solar North) |
Installation Completed | Sat, 2 Jul 2016 |

data with an interval of 1 hour from 7:00 to 18:00 every day.

In order to obtain the appropriate number of clusters for each season, SSE, DBI and Silhouette Coefficient (S) are used for evaluation. Taking summer as an example, the experimental results are shown in

After comprehensively considering each evaluation index and cluster observation results, the number of clusters in summer clustering is set to 2, and the result is shown in

Metrics | Spring | Summer | Autumn | Winter |
---|---|---|---|---|
2 cluster SSE | 44,578.4063 | 20,278.6656 | 26,813.2643 | 4,218.8007 |
3 cluster SSE | 32,045.1222 | 14,902.9881 | 15,383.4952 | 2,942.0887 |
4 cluster SSE | 24,464.0061 | 11,971.8954 | 11,657.9236 | 2,472.9994 |
2 cluster DBI | 0.6723 | 0.9740 | 0.6410 | 0.7048 |
3 cluster DBI | 0.7846 | 0.9440 | 0.7252 | 1.0029 |
4 cluster DBI | 1.0187 | 1.0247 | 0.7773 | 0.8355 |
2 cluster S | 0.6383 | 0.6156 | 0.7119 | 0.4861 |
3 cluster S | 0.5893 | 0.5866 | 0.6390 | 0.5074 |
4 cluster S | 0.5231 | 0.5705 | 0.4073 | 0.5092 |

optima or other abnormal situations, 100 rounds of experiments were carried out. Finally, the number of clusters in spring is 3, the number of clusters in summer is 2, the number of clusters in autumn is 3, and the number of clusters in winter is 3.

The selection of the threshold for similar days is not only related to the season, but also to the weather conditions on the forecast day. For ideal weather, it is observed that when the correlation degree is greater than 0.85, the power curve of similar days in each season is in a relatively ideal state (as shown in

Under ideal weather conditions, the forecast days of each season, the correlation threshold of similar days, and the selection of nearest neighbor similar days are shown in

Item | Spring | Summer | Autumn | Winter |
---|---|---|---|---|
Forecasting day | September 15, 2019 | February 09, 2020 | April 18, 2019 | August 17, 2019 |
Similar days threshold | 0.90 | 0.88 | 0.92 | 0.86 |
Nearest neighbor similar day | September 14, 2019 | February 08, 2020 | April 15, 2019 | August 16, 2019 |
Correlation degree | 0.9295 | 0.9452 | 0.9913 | 0.9579 |

This part explores the optimal C and γ of the SVR, which are usually related to the characteristics of power generation in different seasons. Grid search and cross-validation are used to find the optimal values of C and γ. The optimal SVR structure for each season under ideal weather is shown in

Structure | Spring | Summer | Autumn | Winter |
---|---|---|---|---|
Input form | (564, 2) | (408, 2) | (612, 2) | (156, 2) |
C | 100,000.00 | 100,000.00 | 100,000.00 | 1.00 |
γ | 0.0001 | 0.0001 | 0.002 | 1.00 |
Time (s) | 2.0342 | 1.9506 | 2.3272 | 0.6826 |

The results show that the average R^{2} is 0.9966. The RMSE values for spring, summer, autumn and winter are 1.4521, 1.4661, 0.7120, and 0.2132 kW, respectively, and the average value of RMSE is 0.9608 kW. The model has high prediction accuracy for ideal weather.

As shown in

Metrics | Spring | Summer | Autumn | Winter | Average | Std |
---|---|---|---|---|---|---|
MAE (kW) | 1.1763 | 1.3753 | 0.5269 | 0.1618 | 0.8101 | 0.5640 |
RMSE (kW) | 1.4521 | 1.4661 | 0.7120 | 0.2132 | 0.9608 | 0.6103 |
R^{2} | 0.9945 | 0.9932 | 0.9987 | 0.9999 | 0.9966 | 0.0032 |
Time (s) | 2.0342 | 1.9506 | 2.3272 | 0.6826 | 1.7487 | 0.7288 |

Metrics | Spring | Summer | Autumn | Winter | Average | Std |
---|---|---|---|---|---|---|
MAE (kW) | 2.1586 | 1.8429 | 2.8045 | 1.9542 | 2.1901 | 0.4300 |
RMSE (kW) | 2.4877 | 2.3955 | 3.2315 | 2.2143 | 2.5822 | 0.4475 |
R^{2} | 0.9837 | 0.9819 | 0.9737 | 0.9873 | 0.9817 | 0.0058 |
Time (s) | 8.1325 | 7.4614 | 7.5504 | 7.6254 | 7.6924 | 0.3009 |

For spring, summer, autumn and winter, the MAE improvement of the proposed model with respect to the SVR model is 45.51%, 25.37%, 81.21% and 91.72%, respectively. The corresponding RMSE improvement relative to the SVR model is 41.63%, 38.80%, 77.97% and 90.37%, respectively. The average R^{2} of the proposed model (0.9966) is also higher than that of the SVR model (0.9817).
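The improvement percentages can be reproduced from the per-season MAE and RMSE values in the two results tables, assuming the improvement is defined as the relative error reduction (baseline minus proposed, divided by baseline). The function and variable names below are illustrative.

```python
def improvement_pct(baseline, proposed):
    """Relative error reduction of the proposed model, in percent."""
    return (baseline - proposed) / baseline * 100.0


seasons = ["Spring", "Summer", "Autumn", "Winter"]
# per-season values copied from the two results tables
hkgsvr_mae = [1.1763, 1.3753, 0.5269, 0.1618]
svr_mae = [2.1586, 1.8429, 2.8045, 1.9542]
hkgsvr_rmse = [1.4521, 1.4661, 0.7120, 0.2132]
svr_rmse = [2.4877, 2.3955, 3.2315, 2.2143]

mae_gain = [improvement_pct(b, p) for b, p in zip(svr_mae, hkgsvr_mae)]
rmse_gain = [improvement_pct(b, p) for b, p in zip(svr_rmse, hkgsvr_rmse)]

for name, m, r in zip(seasons, mae_gain, rmse_gain):
    print(f"{name}: MAE improvement {m:.2f}%, RMSE improvement {r:.2f}%")
```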

A hybrid day-ahead photovoltaic power generation prediction model (HKGSVR) based on K-means++, GRA and SVR was proposed. Both historical power data and weather data were used to train the model. Moreover, samples of similar days and nearest neighbor similar days were used to train the prediction model. The average values of MAE, RMSE, and R^{2} were 0.8101 kW, 0.9608 kW, and 99.66%, respectively. The average computation time was 1.7487 s, which was significantly better than the SVR model. Thus, the demonstrated numerical results verify the effectiveness of the proposed model for short-term PV power prediction.

The authors declare no conflicts of interest regarding the publication of this paper.

Lin, J.M. and Li, H.M. (2020) A Short-Term PV Power Forecasting Method Using a Hybrid Kmeans-GRA-SVR Model under Ideal Weather Condition. Journal of Computer and Communications, 8, 102-119. https://doi.org/10.4236/jcc.2020.811008