
Collaborative filtering is the most widely used recommendation algorithm in major e-commerce recommendation systems today. To address problems of traditional collaborative filtering such as poor adaptability and cold start, this paper proposes improvements and constructs a hybrid collaborative filtering model with good scalability. It also optimizes the parameter selection of the model with a genetic algorithm and gives reference pseudocode, providing new ideas and methods for the study of parameter combination optimization in hybrid collaborative filtering algorithms.

User-based and item-based collaborative filtering are the two classic collaborative filtering methods, but traditional collaborative filtering suffers from problems such as cold start and data sparsity, and a single filtering algorithm cannot give the best recommendation. Taking advantage of a parallel, distributed cloud platform, multiple algorithms can be executed simultaneously on multiple nodes [

A well-established recommendation system usually consists of three parts: a recording module, which records the user’s historical behavior; an analysis module; and an algorithm module. Because the quality of the recommendation results depends closely on the performance of the recommendation algorithm, the algorithm is the core of the whole system and has become an active research direction. Recommendations differ according to the needs of different users [

However, fusing recommendation algorithms on a cloud platform has shortcomings in parameter selection and combination optimization. Among the proposed algorithms, whether the K-Nearest Neighbor (KNN) [

The improved hybrid collaborative filtering algorithm described above uses multiple weights and thresholds. If the exhaustive method is used to select the optimal combination, the efficiency of the algorithm will be greatly reduced. This paper will use the Genetic Algorithm [

To perform mathematical operations on user features and item features, encoding is necessary. This paper selects the MovieLens dataset [

The user’s age is divided into four groups: 0 - 16 for children, 17 - 39 for young people, 40 - 60 for middle-aged people, and 60 or older for the elderly; these groups are coded as the integers 1, 2, 3 and 4 respectively. User gender is coded as 1 for men and 0 for women. In the MovieLens dataset, the item attributes of a movie consist of 19 genre tags such as romance and suspense. A movie can belong to several genres, and the encoding rule of this paper is: if the target movie belongs to a genre, that genre is represented by 1; otherwise it is represented by 0.
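The encoding rules above can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function names are hypothetical, the genre list is a three-tag stand-in for the 19 tags, the gender code `"M"` is assumed to mark men, and age 60 is assigned to the middle-aged group because the stated ranges overlap at 60.

```python
# Illustrative subset of the 19 genre tags (assumption: real code would list all 19).
GENRES = ["Action", "Romance", "Thriller"]


def encode_age(age):
    """Map an age to the group codes 1-4 described in the text."""
    if age <= 16:
        return 1  # children
    if age <= 39:
        return 2  # young people
    if age <= 60:
        return 3  # middle-aged (age 60 placed here by assumption)
    return 4      # elderly


def encode_gender(gender):
    """1 for men, 0 for women (assumes 'M' marks men in the raw data)."""
    return 1 if gender == "M" else 0


def encode_genres(movie_genres, all_genres=GENRES):
    """One 0/1 flag per genre tag: 1 if the movie belongs to the genre."""
    return [1 if genre in movie_genres else 0 for genre in all_genres]


user_vector = [encode_age(25), encode_gender("F")]
movie_vector = encode_genres({"Romance", "Thriller"})
```

The resulting numeric vectors are what the feature-based distance formulas later in the paper operate on.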

In this paper, the Mean Absolute Error (MAE) [

To establish a scoring matrix suitable for the collaborative filtering algorithm, the initial data needs to be pre-processed. The pre-processing result is the user-item scoring matrix shown below, in which $r_{ij}$ is the score of user $U_i$ for item $I_j$.

User / Item | $I_1$ | $I_2$ | … | $I_j$ | … | $I_n$ |
---|---|---|---|---|---|---|
$U_1$ | $r_{11}$ | $r_{12}$ | … | $r_{1j}$ | … | $r_{1n}$ |
$U_2$ | $r_{21}$ | $r_{22}$ | … | $r_{2j}$ | … | $r_{2n}$ |
… | … | … | … | … | … | … |
$U_i$ | $r_{i1}$ | $r_{i2}$ | … | $r_{ij}$ | … | $r_{in}$ |
… | … | … | … | … | … | … |
$U_m$ | $r_{m1}$ | $r_{m2}$ | … | $r_{mj}$ | … | $r_{mn}$ |
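As a rough sketch of this pre-processing step, the scoring matrix can be built from (user, item, score) records. The function name and the use of `None` for unrated entries are assumptions made for illustration; indices are 0-based here.

```python
def build_rating_matrix(ratings, n_users, n_items):
    """Return an n_users x n_items matrix; None marks an item the user has not rated."""
    matrix = [[None] * n_items for _ in range(n_users)]
    for user, item, score in ratings:
        matrix[user][item] = score
    return matrix


# Hypothetical records: (user index, item index, score)
records = [(0, 0, 5), (0, 2, 3), (1, 1, 4)]
R = build_rating_matrix(records, n_users=2, n_items=3)
```

Most entries of a real matrix stay empty, which is the data sparsity problem the paper refers to.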

In the improved hybrid collaborative filtering algorithm, user score similarity and item score similarity are calculated by Pearson Correlation Coefficient (PCC) [

1) User Rating Similarity.

The rating similarity between user $i$ and user $j$ is given by Equation (1).

$$\mathrm{simUser}_1(i,j) = \frac{\sum_{s \in S}(r_{i,s}-\bar{r}_i)(r_{j,s}-\bar{r}_j)}{\sqrt{\sum_{s \in S}(r_{i,s}-\bar{r}_i)^2}\,\sqrt{\sum_{s \in S}(r_{j,s}-\bar{r}_j)^2}} \tag{1}$$

In Equation (1), $\bar{r}_i$ and $\bar{r}_j$ are the mean scores of user $i$ and user $j$, $S$ is the set of items scored by both users, and $r_{i,s}$ and $r_{j,s}$ are the scores given to the common item $s$ by user $i$ and user $j$.
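A minimal sketch of Equation (1) follows. It assumes each user's ratings are held in a dictionary from item to score, that the user means are taken over all items the user has rated, and that the similarity is 0 when there are no common items or a denominator vanishes; none of these conventions is stated in the paper.

```python
import math


def sim_user1(ratings_i, ratings_j):
    """Pearson correlation of two users over the items both have rated."""
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0  # assumption: no overlap means no evidence of similarity
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    numerator = sum((ratings_i[s] - mean_i) * (ratings_j[s] - mean_j) for s in common)
    norm_i = math.sqrt(sum((ratings_i[s] - mean_i) ** 2 for s in common))
    norm_j = math.sqrt(sum((ratings_j[s] - mean_j) ** 2 for s in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: flat ratings give an undefined correlation
    return numerator / (norm_i * norm_j)


alice = {"m1": 5, "m2": 3, "m3": 1}
bob = {"m1": 4, "m2": 3, "m3": 2}
carol = {"m1": 1, "m2": 3, "m3": 5}
```

Here `alice` and `bob` rank the movies identically, while `carol` ranks them in the opposite order.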

The similarity of the user feature information is calculated by Equation (2) and Equation (3).

$$\mathrm{Dis}(i,j) = \sqrt{\sum_{k=1}^{p}(i_k - j_k)^2} \tag{2}$$

$$\mathrm{simUser}_2(i,j) = \frac{1}{1+\mathrm{Dis}(i,j)} \tag{3}$$

In Equation (2), $p$ is the number of user feature attributes, $i_k$ is the value of the $k$-th feature of user $i$, and $j_k$ is the value of the $k$-th feature of user $j$. The distance between the feature vectors of user $i$ and user $j$ is calculated first. The Euclidean Metric (EM) [

Finally, the two similarities are combined by weighting, as in Equation (4), to obtain the comprehensive similarity between user $i$ and user $j$.

$$\mathrm{simUser}(i,j) = w_1 \times \mathrm{simUser}_1(i,j) + (1-w_1) \times \mathrm{simUser}_2(i,j) \tag{4}$$

In Equation (4), $w_1$ is the weight.
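The feature-based similarity and the weighted combination of Equations (2) to (4) can be sketched as below. The function names are hypothetical, and the feature vectors are assumed to be the encoded age and gender values described earlier.

```python
import math


def dis(features_a, features_b):
    """Euclidean distance between two encoded feature vectors, Equation (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))


def sim_features(features_a, features_b):
    """Feature similarity in (0, 1], Equation (3): identical vectors give 1."""
    return 1.0 / (1.0 + dis(features_a, features_b))


def combine(rating_sim, feature_sim, weight):
    """Weighted mix of rating and feature similarity, Equation (4)."""
    return weight * rating_sim + (1.0 - weight) * feature_sim


# Two users encoded as [age group, gender]; the values are illustrative.
feature_sim = sim_features([2, 1], [2, 1])
mixed = combine(rating_sim=0.8, feature_sim=0.5, weight=0.6)
```

Because the feature term does not depend on rating history, a new user with no ratings still receives a nonzero similarity, which is how the combination eases the cold start problem.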

2) Item Score Similarity.

The rating similarity between item $m$ and item $n$ is given by Equation (5).

$$\mathrm{simItem}_1(m,n) = \frac{\sum_{i \in I}(r_{i,m}-\bar{r}_m)(r_{i,n}-\bar{r}_n)}{\sqrt{\sum_{i \in I}(r_{i,m}-\bar{r}_m)^2}\,\sqrt{\sum_{i \in I}(r_{i,n}-\bar{r}_n)^2}} \tag{5}$$

In Equation (5), $r_{i,m}$ and $r_{i,n}$ are the scores of user $i$ for item $m$ and item $n$, $\bar{r}_m$ and $\bar{r}_n$ are the average scores of items $m$ and $n$, and $I$ is the set of users who have evaluated both item $m$ and item $n$.

The similarity of the item feature information is calculated by Equation (6) and Equation (7).

$$\mathrm{Dis}(m,n) = \sqrt{\sum_{k=1}^{q}(m_k - n_k)^2} \tag{6}$$

$$\mathrm{simItem}_2(m,n) = \frac{1}{1+\mathrm{Dis}(m,n)} \tag{7}$$

In Equation (6), $q$ is the number of item features, $m_k$ is the code value of the $k$-th feature of item $m$, and $n_k$ is the code value of the $k$-th feature of item $n$. The Euclidean distance between item $m$ and item $n$ is calculated from their encoded feature values, and the feature similarity of the two items is then obtained from Equation (7).

Finally, the two similarities are combined by weighting, as in Equation (8), to obtain the comprehensive similarity between item $m$ and item $n$.

$$\mathrm{simItem}(m,n) = w_2 \times \mathrm{simItem}_1(m,n) + (1-w_2) \times \mathrm{simItem}_2(m,n) \tag{8}$$

In Equation (8), $w_2$ is the weight.
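The item-side calculation of Equations (5) to (8) mirrors the user side but reads the ratings column-wise. The sketch below assumes ratings are stored as a dictionary from user to an item-score dictionary, that item means are taken over all users who rated the item, and that the rating similarity falls back to 0 when it is undefined; the names are hypothetical.

```python
import math


def sim_item1(ratings, m, n):
    """Pearson correlation of items m and n over users who rated both, Equation (5)."""
    scores_m = {u: r[m] for u, r in ratings.items() if m in r}
    scores_n = {u: r[n] for u, r in ratings.items() if n in r}
    common = set(scores_m) & set(scores_n)
    if not common:
        return 0.0
    mean_m = sum(scores_m.values()) / len(scores_m)
    mean_n = sum(scores_n.values()) / len(scores_n)
    numerator = sum((scores_m[u] - mean_m) * (scores_n[u] - mean_n) for u in common)
    norm_m = math.sqrt(sum((scores_m[u] - mean_m) ** 2 for u in common))
    norm_n = math.sqrt(sum((scores_n[u] - mean_n) ** 2 for u in common))
    return numerator / (norm_m * norm_n) if norm_m and norm_n else 0.0


def sim_item(ratings, features, m, n, w2):
    """Comprehensive item similarity, Equations (6) to (8)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(features[m], features[n])))
    feature_sim = 1.0 / (1.0 + distance)
    return w2 * sim_item1(ratings, m, n) + (1.0 - w2) * feature_sim


ratings = {"u1": {"A": 5, "B": 4}, "u2": {"A": 3, "B": 3}, "u3": {"A": 1, "B": 2}}
same_genres = {"A": [1, 0], "B": [1, 0]}
different_genres = {"A": [1, 0], "B": [0, 1]}
```

With identical genre vectors the combined similarity of A and B equals the rating similarity; with disjoint genres the feature term pulls it down.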

3) Selection of Nearest Neighbor

Based on an analysis of the advantages and disadvantages of nearest neighbor selection methods, threshold selection is adopted. Let $w_3$ be the user similarity threshold: when the similarity between user $i$ and user $j$ is greater than or equal to $w_3$, that is, $\mathrm{simUser}(i,j) \ge w_3$, user $j$ is a member of the nearest neighbor group of user $i$. Similarly, let $w_4$ be the item similarity threshold: when the similarity between item $m$ and item $n$ is greater than or equal to $w_4$, that is, $\mathrm{simItem}(m,n) \ge w_4$, item $n$ is a member of the nearest neighbor group of item $m$.
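Threshold selection amounts to a simple filter. In this sketch the similarities to the target are assumed to be precomputed and stored in a dictionary; the function name is hypothetical.

```python
def threshold_neighbors(similarities, threshold):
    """Return every candidate whose similarity to the target is >= threshold."""
    return {candidate for candidate, sim in similarities.items() if sim >= threshold}


# Hypothetical comprehensive similarities between a target user and three others.
sims = {"u2": 0.9, "u3": 0.5, "u4": 0.7}
neighbors = threshold_neighbors(sims, threshold=0.7)
```

The same function serves for items by passing item similarities and the item threshold.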

There are usually two methods for selecting nearest neighbors. The first is K-Nearest Neighbor (KNN): the candidate users are sorted in descending order of the calculated similarity, and the K most similar users are selected. The second is threshold selection: a similarity threshold is preset, and every user whose similarity meets the threshold is included in the nearest neighbor set of the target user [

Comparing the two methods, KNN is simple, but it has the following disadvantages. The number of nearest neighbors is specified artificially, so the K selected users are not necessarily truly close to the target user. In addition, the value of K is difficult to determine: if K is too large, the neighborhood becomes too broad and includes users of low similarity, and if K is too small, there are too few neighbors; both cases lower accuracy.
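The first drawback of KNN can be seen in a small sketch. The function name and the similarity values are made up for illustration.

```python
def top_k_neighbors(similarities, k):
    """Return the k candidates with the highest similarity, best first."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]


# Only one candidate is genuinely similar to the target user.
sims = {"u2": 0.9, "u3": 0.2, "u4": 0.1}
chosen = top_k_neighbors(sims, k=2)
```

With K fixed at 2, the weakly similar user `u3` is selected simply to fill the quota, whereas a threshold of, say, 0.7 would keep only `u2`.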

Threshold selection avoids these defects of KNN, and it is the method used in the improved collaborative filtering algorithm of this paper. However, threshold selection has its own defect: the threshold value is difficult to determine. In this paper, the genetic algorithm is adopted to optimize the threshold selection so as to improve the accuracy of the collaborative filtering algorithm.

After the nearest neighbor group has been calculated, the score of the target user for items not yet evaluated is predicted by the recommendation formula. The formula is shown in Equation (9).

$$P_{\mathrm{Item}}(t,u) = \bar{r}_t + \frac{\sum_{k \in N_t}(r_{u,k}-\bar{r}_k) \times \mathrm{simItem}(t,k)}{\sum_{k \in N_t}\mathrm{simItem}(t,k)} \tag{9}$$

In Equation (9), $\bar{r}_t$ and $\bar{r}_k$ are the average scores of item $t$ and item $k$, $N_t$ is the nearest neighbor set of item $t$, and $r_{u,k}$ is the score of user $u$ on the neighbor item $k$.
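A hedged sketch of the item-based prediction in Equation (9) is given below. It assumes the deviation term uses the user's score on each neighbor item $k$, that neighbors the user has not rated are skipped, and that the prediction falls back to the item mean when no rated neighbor exists; these choices and all names are illustrative, not taken from the paper.

```python
def predict_item(user_ratings, item_means, neighbor_sims, target):
    """Predict the user's score for `target` from its nearest-neighbor items."""
    numerator = 0.0
    denominator = 0.0
    for item, sim in neighbor_sims.items():
        if item in user_ratings:  # assumption: ignore neighbors the user has not rated
            numerator += (user_ratings[item] - item_means[item]) * sim
            denominator += sim
    if denominator == 0:
        return item_means[target]  # assumption: fall back to the item mean
    return item_means[target] + numerator / denominator


user_ratings = {"B": 5, "C": 2}                 # scores of user u
item_means = {"A": 3.0, "B": 4.0, "C": 3.0}     # average score per item
neighbor_sims = {"B": 0.8, "C": 0.2}            # simItem(A, k) for k in N_A
prediction = predict_item(user_ratings, item_means, neighbor_sims, "A")
```

In this example the user liked the strongly similar item B more than average, so the prediction for A rises above A's mean of 3.0.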

Similarly, the two results can be combined by weighting, as in Equation (10).

$$P(t,u) = w_5 \times P_{\mathrm{User}}(t,u) + (1-w_5) \times P_{\mathrm{Item}}(t,u) \tag{10}$$

In Equation (10), $w_5$ is the weight, $P_{\mathrm{User}}(t,u)$ is the recommendation result based on the user-user relationship, $P_{\mathrm{Item}}(t,u)$ is the recommendation result based on the item-item relationship, and $P(t,u)$ is the final mixed score of user $u$ on item $t$.
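As a worked illustration of Equation (10), with made-up values $w_5 = 0.4$, $P_{\mathrm{User}}(t,u) = 4.0$ and $P_{\mathrm{Item}}(t,u) = 3.5$ (none of which come from the paper):

```latex
P(t,u) = 0.4 \times 4.0 + (1 - 0.4) \times 3.5 = 1.6 + 2.1 = 3.7
```

A larger $w_5$ moves the final score toward the user-based prediction, and a smaller one toward the item-based prediction.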

The hybrid recommendation algorithm is a collaborative filtering algorithm that combines the user’s historical rating information, feature information, and item score information by weighting. The basic idea is to combine user score similarity with user feature similarity by weighting to obtain the nearest neighbor group of the user, and from it the recommendation result based on the user-user relationship. Likewise, item score similarity is combined with item feature similarity by weighting to obtain the nearest neighbor group of the item, and from it the recommendation result based on the item-item relationship. Finally, the user-based and item-based recommendation results are combined to generate the final result.

The hybrid collaborative filtering recommendation algorithm described above addresses the cold start and data sparsity problems, but introduces the problem of selecting thresholds and weights. On the one hand, appropriate weights $w_1$, $w_2$ and $w_5$ must be selected; on the other hand, the calculation of the nearest neighbor groups of users and items requires the similarity thresholds $w_3$ and $w_4$. Each parameter takes values in [0, 1]. If the exhaustive method is used to select the best combination, then even with a step size of 0.1 there are $11^5 = 161{,}051$ combinations. Assuming each evaluation takes 10 seconds, finding the optimum exhaustively would take more than half a month (about 18.6 days), which greatly reduces the efficiency of the algorithm [
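The cost of the exhaustive search can be checked with a few lines of arithmetic, using the step size of 0.1 and the assumed 10 seconds per evaluation from the text. The variable names are illustrative.

```python
values_per_parameter = 11   # 0.0, 0.1, ..., 1.0
n_parameters = 5            # w1 .. w5
seconds_per_run = 10        # assumed cost of one evaluation

combinations = values_per_parameter ** n_parameters
total_days = combinations * seconds_per_run / (24 * 60 * 60)
```

The count grows exponentially with the number of parameters and with finer step sizes, which is why a heuristic search is attractive.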

The genetic algorithm is used to select the optimal combination of the five parameters $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, so that the mean absolute error (MAE) between the recommendation results and the test data set is as small as possible [

In the genetic algorithm optimization, the fitness of a parameter combination is defined as the reciprocal of the MAE between the test data set and the recommendation results produced by the improved algorithm with that combination, i.e. 1/MAE. The smaller the MAE of the recommendation results, the greater the fitness, and the better the parameter combination.
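The fitness definition can be sketched as follows, assuming the predicted and actual test scores are given as two aligned lists. Returning infinity for a perfect prediction is an assumption added to avoid division by zero.

```python
def mae(predicted, actual):
    """Mean absolute error between predicted and actual scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def fitness(predicted, actual):
    """Fitness of a parameter combination: the reciprocal of its MAE."""
    error = mae(predicted, actual)
    return float("inf") if error == 0 else 1.0 / error


# Hypothetical predictions on a three-rating test set.
error = mae([4, 3, 5], [5, 3, 3])
score = fitness([3.5, 4.5], [4, 4])
```

Halving the MAE doubles the fitness, so selection pressure increases as the predictions improve.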

The flow of parameter selection optimization of the hybrid collaborative filtering algorithm based on genetic algorithm is shown in

Numerical Value | Coding |
---|---|
0 | 0 |
0.1 | 1 |
0.2 | 2 |
0.3 | 3 |
0.4 | 4 |
0.5 | 5 |
0.6 | 6 |
0.7 | 7 |
0.8 | 8 |
0.9 | 9 |
1 | * |
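The coding table can be expressed as a pair of conversion functions. This sketch assumes a chromosome is a string of five symbols, one per parameter $w_1$ to $w_5$; the paper does not state the container type, and the function names are hypothetical.

```python
SYMBOLS = "0123456789*"  # position in this string = parameter value * 10


def encode(params):
    """Turn parameter values on the 0.1 grid into a chromosome string."""
    return "".join(SYMBOLS[round(p * 10)] for p in params)


def decode(chromosome):
    """Turn a chromosome string back into parameter values in [0, 1]."""
    return [SYMBOLS.index(gene) / 10 for gene in chromosome]


chromosome = encode([0.6, 0.5, 0.7, 0.3, 1.0])
params = decode(chromosome)
```

Using one symbol per parameter keeps crossover and mutation simple, since any string of these symbols decodes to a valid parameter combination.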

in the chromosome, and then substitutes them into the hybrid collaborative filtering algorithm described above. If the end condition of the algorithm is not satisfied, the next generation of genetic operations is carried out; otherwise the algorithm stops. The improved algorithm uses weights and thresholds in five places, and the values of these five parameters all lie in [0, 1]. The genetic algorithm is used to optimize the parameter combination of the recommendation algorithm, and the pseudocode of the main optimization function is shown in the main GA listing below.
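The loop described here could be realized roughly as follows. This is a sketch under stated assumptions, not the author's implementation: binary tournament selection, single-point crossover, a per-gene mutation rate of 0.1, a population of 20, and the stand-in `toy_fitness` (which replaces the real 1/MAE evaluation of the recommender) are all choices made for illustration.

```python
import random

SYMBOLS = "0123456789*"   # gene symbols; "*" stands for the value 1.0
N_GENES = 5               # one gene per parameter w1 .. w5


def decode(chromosome):
    """Map each gene symbol to its parameter value in [0, 1]."""
    return [SYMBOLS.index(gene) / 10 for gene in chromosome]


def random_chromosome(rng):
    return "".join(rng.choice(SYMBOLS) for _ in range(N_GENES))


def select(population, scores, rng):
    """Binary tournament: keep the fitter of two randomly drawn individuals."""
    a = rng.randrange(len(population))
    b = rng.randrange(len(population))
    return population[a] if scores[a] >= scores[b] else population[b]


def cross(parent_a, parent_b, rng):
    """Single-point crossover."""
    point = rng.randrange(1, N_GENES)
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng, rate=0.1):
    """Replace each gene by a random symbol with probability `rate`."""
    return "".join(
        rng.choice(SYMBOLS) if rng.random() < rate else gene
        for gene in chromosome
    )


def run_ga(fitness, pop_size=20, generations=50, target=None, seed=0):
    """Return the best chromosome seen and its fitness."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scores = [fitness(decode(c)) for c in population]
        for chromosome, score in zip(population, scores):
            if score > best_score:
                best, best_score = chromosome, score
        if target is not None and best_score >= target:
            break  # fitness target reached, stop early
        population = [
            mutate(cross(select(population, scores, rng),
                         select(population, scores, rng), rng), rng)
            for _ in range(pop_size)
        ]
    return best, best_score


def toy_fitness(params, optimum=(0.6, 0.5, 0.7, 0.3, 0.4)):
    """Stand-in for 1/MAE: larger when params are closer to a made-up optimum."""
    error = sum(abs(p - o) for p, o in zip(params, optimum)) / len(optimum)
    return 1.0 / (0.01 + error)


best, score = run_ga(toy_fitness)
```

In a real run, `toy_fitness` would be replaced by a function that runs the hybrid recommender with the decoded parameters and returns 1/MAE on the test set; each generation then costs `pop_size` evaluations instead of the full grid.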

Function name: main GA

Function: Optimized parameter combination

```
BEGIN
    % ipop: array holding the chromosomes of the i-th generation population.
    % evals: array holding the fitness of the chromosome at the same index in ipop.
    ipop = initialPops();                  % create the initial population
    for i = 1 to 50                        % at most 50 generations of genetic operations
        for k = 1 to m                     % m is the number of chromosomes in the population
            evals[k] = calculatefitnessvalue(ipop[k]);   % fitness (1/MAE) of chromosome k
            if (evals[k] > best.fitness)   % record the best fitness and its chromosome
                best.fitness = evals[k];
                best.pop = ipop[k];
            end if
        end
        if (1 / best.fitness > n)          % n is the preset MAE threshold; not yet reached
            select();                      % selection operation
            cross();                       % crossover operation
            mutation();                    % mutation operation
        else
            stop;                          % MAE threshold reached: stop the genetic algorithm
        end if
    end
END
```

In this paper, the traditional collaborative filtering algorithm is improved by combining user and item scores with feature attributes to generate recommendations. The cold start problem of the traditional collaborative filtering algorithm is solved, and the sparsity of user rating data is alleviated to some extent. The paper also describes the parameter combination optimization of the improved collaborative filtering algorithm and the process that combines the improved algorithm with the genetic algorithm. In addition, the flow and pseudocode of the combined algorithm are given to solve the parameter combination optimization problem of the collaborative filtering algorithm by introducing the genetic algorithm.

The recommendation algorithm plays a pivotal role in the development of e-commerce. As the number of users soars, the user model becomes more and more complex, and the recommendation quality of any single recommendation algorithm deteriorates. The hybrid collaborative filtering algorithm based on the genetic algorithm proposed in this paper combines multiple recommendation algorithms, can process a large amount of user data with good scalability, and can achieve better recommendation results.

This work was supported in part by a grant from the characteristics innovation project of colleges and universities of Guangdong Province (Natural Science, No. 2016KTSCX182, 2016) and a grant from the Youth Innovation Talent Project of colleges and universities of Guangdong Province (No. 2016KQNCX230, 2016).

The authors declare no conflicts of interest regarding the publication of this paper.

Zhu, Z.J. (2018) Research on Parameter Optimization in Collaborative Filtering Algorithm. Communications and Network, 10, 105-116. https://doi.org/10.4236/cn.2018.103009