Identification Model for Needy Undergraduates Based on FFM

In recent years, as enrollment in Chinese colleges has increased year by year, the identification of needy undergraduates has become increasingly important. However, the traditional way of identifying college students with financial difficulties relies mainly on manual review and collective voting, which easily introduces subjectivity and randomness. To alleviate this problem, this paper establishes an automatic identification model for needy undergraduates based on 1842 questionnaires collected from undergraduates at WHUT. Firstly, the questionnaires are preliminarily filtered using the local outlier factor algorithm. Secondly, mutual information, the Spearman rank correlation coefficient and the distance correlation coefficient are combined by the rank-sum ratio to select features and eliminate noise from irrelevant features. Thirdly, a field-aware factorization machine model is trained and compared with other models, such as Logistic Regression, SVM, etc. Finally, this paper finds that the field-aware factorization machine performs much better than the other models in the identification of needy undergraduates, and that the prominent features affecting the identification are annual family income, living expenses provided by parents, etc.


Introduction
The number of undergraduates in higher education institutions in China has been on the rise since 2000, when the country began to expand enrollment in higher education. At the same time, the tuition fees of universities and colleges have also been rising, which has put economic pressure on many students [3]. In 2015, Luo Suo and Jian Gong used a BP neural network to build a nonlinear mapping between the economic conditions of college students and their identification as needy undergraduates [4]. In 2017, Aifeng Li, Zhineng Xiao and Biyun Liang collected 36,546 records of students' dining consumption over three months, used Datist, a big data analysis software, to build a model, derived indicators of the students' dining habits, consumption behavior and situation in school, and then selected needy undergraduates [5]. In related work based on a campus big data platform and data mining technology, the data were cleaned and then analyzed with logistic regression, random forest and other algorithms, and a model for the identification of needy undergraduates was eventually established [7].
Although subsidy policies for needy undergraduates have been widely applied in China over the past decade, they still need to be improved in terms of simplicity, effectiveness and accuracy. For example, schools identify needy undergraduates mainly through manual audit and class voting, which easily introduces subjectivity and randomness.
Field-aware Factorization Machines (FFM), first proposed in 2016, are well suited to sparse data and have not yet been used to identify needy undergraduates.
Therefore, this paper adopts a three-class Field-aware Factorization Machines method to establish an identification model for needy undergraduates, so as to address the lack of uniform identification standards and the subjectivity of the identification process, and to obtain better performance than previous approaches. All the work above is supported by Wuhan University of Technology.

Data Collection
The subjects of the questionnaire survey are undergraduates of Wuhan University of Technology. The design of the questionnaire is mainly based on the questionnaire on Chinese college students' family economic situation and on the intermediate process of identifying needy undergraduates. The questionnaire contains 19 questions in total. Some questionnaires were distributed through the Internet, while others were distributed in paper form. Finally, the questionnaires were screened by two staff members and 1842 questionnaires were retained.

Data Encoding
In order to train the model, we transformed all discrete features, such as "Nation", "Locality", "City", etc., into one-hot encodings, while the coding of the continuous variables remains unchanged. The final results of the data encoding are listed in Table 1.
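The encoding step can be illustrated with a minimal sketch in plain Python. The helper name `one_hot_encode`, the column names and the toy values are assumptions made for illustration; they are not the survey's actual fields or data.

```python
def one_hot_encode(rows, categorical):
    """One-hot encode the named categorical columns; copy other columns unchanged."""
    # Collect the sorted set of categories observed for each categorical column.
    categories = {c: sorted({row[c] for row in rows}) for c in categorical}
    encoded = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if key in categorical:
                # One binary indicator per observed category of this column.
                for cat in categories[key]:
                    out[f"{key}={cat}"] = 1 if value == cat else 0
            else:
                out[key] = value  # continuous variables keep their original coding
        encoded.append(out)
    return encoded


# Hypothetical toy records (illustrative values only).
rows = [
    {"Locality": "urban", "City": "large", "annual_income": 52000.0},
    {"Locality": "rural", "City": "small", "annual_income": 18000.0},
]
encoded = one_hot_encode(rows, categorical={"Locality", "City"})
```

Each discrete feature expands into as many binary columns as it has categories, which is why the encoded data set is sparse.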

Anomaly Detection
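Because invalid questionnaires reduce the accuracy of the model, these data need to be cleaned. This paper uses the LOF (Local Outlier Factor) algorithm for this purpose.
The definitions of some terms and symbols in LOF are as follows [8]:

1) k-distance
The k-distance of point p is $d_k(p) = d(p, o)$, where o is the kth point closest to point p (not including p) and $d(p, o)$ is the distance between p and o.

2) k-distance neighborhood
The k-distance neighborhood $N_k(p)$ of point p is the set of all points (not including p) whose distance from p is no more than $d_k(p)$.

3) Reachability distance
The kth reachability distance from point o to point p is $\mathrm{reach\text{-}dist}_k(p, o) = \max\{d_k(o),\ d(p, o)\}$, that is, if point p is within the k nearest neighbors of point o, the kth reachability distance is the k-distance of o. Otherwise, it is the real distance between o and p.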

4) Local reachability density
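The local reachability density of point p is expressed as:
$$\mathrm{lrd}_k(p) = 1 \Big/ \left( \frac{\sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o)}{|N_k(p)|} \right)$$
where $N_k(p)$ is the k-distance neighborhood of point p. It is the reciprocal of the average reachability distance between p and the points in its k-distance neighborhood. The higher the density, the more likely p is to belong to the same cluster as its neighbors; the lower the density, the more likely p is to be an outlier.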

5) Local outlier factor
The local outlier factor of point p is expressed as:
$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$
where $N_k(p)$ is the k-distance neighborhood of point p and $\mathrm{lrd}_k$ is the local reachability density. It is the average ratio of the local reachability density of the points in the k-distance neighborhood of p to the local reachability density of p. If the ratio is close to 1, the density of p is similar to that of its neighbors, and p may belong to the same cluster as its neighbors. If the ratio is less than 1, the density of p is higher than that of its neighbors, and p is a dense point. If the ratio is greater than 1, the density of p is lower than that of its neighbors, and p is more likely to be an outlier [9].
The main flow of LOF is as follows:
1) For each data object p in the data set, find its k-distance neighborhood and calculate the reachability distances;
2) Calculate the local reachability density of data object p;
3) Calculate the local outlier factor of data object p;
4) Repeat the above steps to calculate the local outlier factor of all data objects, sort them, and select outliers based on a preset threshold.
The algorithm was applied to the 1842 questionnaires collected in the survey of needy undergraduates, and 1756 valid questionnaires were finally retained.
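The LOF flow can be sketched in plain Python as follows. This is a brute-force illustration under stated assumptions, not the implementation used in the experiment: it uses the standard reachability distance max(k-distance of the neighbor, actual distance), assumes Euclidean distance and no duplicate points, and the toy points, `k` and `threshold` are hypothetical values.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point (brute force; assumes no duplicate points)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]  # k-distance of point i
        k_dist.append(kd)
        # k-distance neighborhood; points tied at the k-distance are included
        neighbors.append([j for d, j in others if d <= kd])

    def reach_dist(i, o):
        # Reachability distance between point i and its neighbor o.
        return max(k_dist[o], dist[i][o])

    # Local reachability density: reciprocal of the mean reachability distance.
    lrd = [len(neighbors[i]) / sum(reach_dist(i, o) for o in neighbors[i])
           for i in range(n)]

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[o] for o in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


# Hypothetical toy data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
threshold = 1.5  # assumed cut-off; the preset threshold of the paper is not stated
outliers = [i for i, s in enumerate(scores) if s > threshold]
```

In this toy example the four clustered points obtain a factor of 1, while the isolated point obtains a factor well above the assumed threshold and is flagged.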

Feature Selection
As some of the features in this questionnaire are redundant, feature selection is carried out in order to make better use of prior knowledge and to avoid or alleviate overfitting [10]. Mutual information, the Spearman rank correlation coefficient and the distance correlation coefficient are used below to carry out feature selection.
For the mutual information R1: when R1 is 0, there is no correlation between the two variables; when R1 is positive, the two are likely to occur together; when R1 is negative, the two are negatively correlated, that is, they tend to be mutually exclusive.
The result is shown in Figure 1. It can be seen that the two most representative features are monthly average living expenses and annual family income, followed by the number of elderly people to support; the less significant features are number of workers, failing status, ethnicity, family type, etc.

Spearman Rank Correlation Coefficient
Spearman rank correlation coefficient R2 is a nonparametric measure of statistical dependence and assesses the monotonic relation between two variables.
The Spearman rank correlation coefficient uses each variable's rank instead of the value itself; tied values receive the average of their positions in the ascending order of the values. A perfectly monotone relationship yields a Spearman coefficient of −1 or 1, and a value of 0 indicates no correlation [12]. The coefficient equals the Pearson correlation coefficient computed on the ranks:
$$R_2 = \frac{\sum_{i=1}^{m} (r_i - \bar{r})(q_i - \bar{q})}{\sqrt{\sum_{i=1}^{m} (r_i - \bar{r})^2 \sum_{i=1}^{m} (q_i - \bar{q})^2}}$$
where $r_i$ and $q_i$ are the ranks of the ith observations of s and t, $\bar{r}$ and $\bar{q}$ are the mean ranks, and m is the number of observations. The value of R2 is between −1 and 1. When the value is 1, the two random variables s and t are perfectly positively correlated. When the value is −1, there is a completely negative correlation between s and t. When the value is 0, there is no monotonic association between s and t [13].
The result is shown in Figure 2. It can be seen that the two most representative features are monthly average living expenses and annual family income, followed by the number of old people to support, housing area, household debt, household work and city size. The less significant features are failing status, ethnicity, etc.
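As an illustration of the rank-based computation, the following plain-Python sketch computes the Spearman coefficient as the Pearson correlation of average ranks. The function names and the toy inputs are assumptions for illustration and do not come from the questionnaire data.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(s, t):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(s), average_ranks(t))


# A monotone but nonlinear relation still gives a coefficient of 1.
rho_monotone = spearman([1, 2, 3, 4], [1, 4, 9, 16])
rho_reversed = spearman([1, 2, 3], [3, 2, 1])
```

Because only the ranks enter the computation, any strictly increasing transformation of a feature leaves its score unchanged.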

Distance Correlation Coefficient
The distance correlation coefficient R3 is used to test the independence of two variables s and t. When R3 = 0, s and t are independent of each other; the larger R3 is, the stronger the dependence between s and t.
The distance correlation coefficient of s and t is expressed as follows [14]:
$$R_3(s, t) = \frac{\mathrm{dCov}(s, t)}{\sqrt{\mathrm{dVar}(s)\,\mathrm{dVar}(t)}}$$
where $\mathrm{dCov}(s, t)$ is the distance covariance of s and t, and $\mathrm{dVar}(s) = \mathrm{dCov}(s, s)$ is the distance variance.
The results are shown in Figure 3. It can be seen that the two most representative features are monthly average living expenses and annual family income. The contribution of the other variables is relatively insignificant.
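A minimal plain-Python sketch of the sample distance correlation for two one-dimensional variables is given below. It follows the usual double-centered distance matrix construction; the function names and toy inputs are assumptions for illustration.

```python
def _centered_distances(x):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row_mean = [sum(r) / n for r in d]  # equals the column means (d is symmetric)
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def distance_correlation(s, t):
    """Sample distance correlation of two equally long 1-D sequences."""
    n = len(s)
    a, b = _centered_distances(s), _centered_distances(t)
    # Squared sample distance covariance and squared distance variances.
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_s = sum(v * v for row in a for v in row) / n ** 2
    dvar2_t = sum(v * v for row in b for v in row) / n ** 2
    if dvar2_s == 0 or dvar2_t == 0:
        return 0.0  # a constant variable carries no dependence information
    return (max(dcov2, 0.0) / (dvar2_s * dvar2_t) ** 0.5) ** 0.5


# A linear relation gives 1; a symmetric nonlinear relation is still detected.
r_linear = distance_correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_quadratic = distance_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The quadratic example has a Pearson correlation of zero, yet its distance correlation is clearly positive, which is the reason for including this measure alongside the rank correlation.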

Rank-Sum Ratio
In order to integrate the above three methods for feature selection, we applied the rank-sum ratio comprehensive evaluation method to the mutual information R1, the Spearman rank correlation coefficient R2 and the distance correlation coefficient R3 to obtain the contribution ranking of the 60 features. Equation (8) is used to calculate the rank-sum ratio, where n is the number of indices.
The ranking results are shown in Figure 4.
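The aggregation can be sketched as follows. Since the exact normalization of the paper's rank-sum ratio is not reproduced here, this sketch assumes the common form in which the ranks of each feature are summed over the indices and divided by (number of indices × number of features); the function names and the toy scores are hypothetical.

```python
def ranks(values):
    """1-based ascending ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    size = len(values)
    # First and last 1-based position of each value in the sorted list.
    return [(ordered.index(v) + 1 + size - ordered[::-1].index(v)) / 2
            for v in values]


def rank_sum_ratio(scores_by_index):
    """Assumed form: sum of a feature's ranks / (number of indices * number of features)."""
    n_indices = len(scores_by_index)
    n_features = len(scores_by_index[0])
    rank_lists = [ranks(scores) for scores in scores_by_index]
    return [sum(r[i] for r in rank_lists) / (n_indices * n_features)
            for i in range(n_features)]


# Hypothetical scores of three features under the three indices R1, R2, R3.
r1 = [0.9, 0.2, 0.5]
r2 = [0.8, 0.1, 0.6]
r3 = [0.7, 0.3, 0.4]
rsr = rank_sum_ratio([r1, r2, r3])
# Feature indices sorted from the highest to the lowest rank-sum ratio.
ranking = sorted(range(len(rsr)), key=lambda i: rsr[i], reverse=True)
```

Working on ranks rather than raw scores makes the three indices comparable even though they are measured on different scales.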

Principles of Factorization Machine
Factorization Machine (FM) was first proposed by Rendle in 2010 [15]. The factorization method is essentially applied to solve the problem of feature combination under sparse data. It is a general model that can be applied to any real-valued feature vector.
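For a given feature vector $x = (x_1, x_2, \ldots, x_n)$, the expression of the second-order FM model is as follows:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$
where n is the feature dimension, $w_0$ is the global offset, $w_i$ is the strength of the ith variable, and $\langle v_i, v_j \rangle$ is the interaction between the ith and jth variables [17]. A separate parameter $w_{ij}$ for each interaction is not adopted, because under sparse data most feature pairs co-occur too rarely for each $w_{ij}$ to be estimated reliably; factorizing the interaction as $\langle v_i, v_j \rangle$ allows the parameters to be shared across pairs.

Figure 2. The final scores of the features using the Spearman rank correlation coefficient. Refer to Table 1 for the features in Figures 2, 3 and 4.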
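The second-order FM prediction can be written directly from its definition, as in the plain-Python sketch below. The function name `fm_predict` and all numerical values are hypothetical and serve only to make the computation concrete.

```python
def fm_predict(x, w0, w, v):
    """Second-order FM: global offset + linear terms + factorized pairwise terms.

    x  : list of n feature values
    w0 : global offset
    w  : list of n linear weights
    v  : list of n latent vectors, each of dimension k
    """
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Interaction weight of features i and j is the dot product <v_i, v_j>.
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


# Hypothetical toy parameters with n = 3 features and k = 2 latent dimensions.
x = [1.0, 0.0, 1.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_hat = fm_predict(x, w0, w, v)
```

Only the pair of features 1 and 3 is active here, so the prediction is 0.5 + 0.4 + 1.0 = 1.9.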

Principles of FFM
Field-aware Factorization Machine (FFM) was first proposed by Yuchin Juan et al. in 2016 [18]. On the basis of FM, FFM groups features of the same nature into the same field. Taking "Health of parents" as an example, "Health of parents = Healthy", "Health of parents = Either seriously ill", "Health of parents = Both seriously ill" and "Health of parents = Single family" all describe the health condition of the parents and can be put into the same field.
In general, all features generated from the same categorical feature by one-hot encoding can be placed in the same field.
In FFM, a latent vector $v_{i, f_j}$ is learned for each feature $x_i$ and for each field $f_j$. Therefore, the latent vector is related not only to the feature, but also to the field. In other words, different latent vectors are used when the feature "Health of parents = Healthy" interacts with the feature "annual household income" and with the feature "household housing situation", which is consistent with the intrinsic difference between "annual household income" and "household housing situation".
If the n features of a sample belong to f fields, the quadratic term of FFM has nf latent vectors. In the FM model, there is only one latent vector for each feature. In fact, FM can be regarded as a special case of FFM, namely the FFM model in which all features are attributed to a single field.
According to the field sensitivity of FFM, Equation (10) can be obtained below.
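$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i, f_j}, v_{j, f_i} \rangle x_i x_j \quad (10)$$
where $f_j$ is the field to which the jth feature belongs.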
If the dimension of the latent vector is k, then the number of parameters in the quadratic term of FFM is nfk, far greater than the nk of the FM model [19] [20] [21].
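The field-aware interaction can be made concrete with the following plain-Python sketch, in which `v[i][f]` is the latent vector that feature i uses when it interacts with a feature of field f. The function name, the field assignment and all numerical values are hypothetical.

```python
def ffm_predict(x, fields, w0, w, v):
    """Second-order FFM prediction.

    x      : list of n feature values
    fields : fields[j] is the field index of feature j
    w0, w  : global offset and linear weights
    v      : v[i][f] is the latent vector of feature i against field f
    """
    n = len(x)
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == 0 or x[j] == 0:
                continue  # sparse input: inactive features contribute nothing
            vi = v[i][fields[j]]  # feature i seen from the field of feature j
            vj = v[j][fields[i]]  # feature j seen from the field of feature i
            y += sum(a * b for a, b in zip(vi, vj)) * x[i] * x[j]
    return y


# Hypothetical toy setting: n = 3 features, f = 2 fields, k = 2 latent dimensions.
x = [1.0, 1.0, 0.5]
fields = [0, 0, 1]
w0 = 0.5
w = [0.1, 0.1, 0.2]
v = [
    [[1.0, 1.0], [0.0, 2.0]],  # feature 0: against field 0, against field 1
    [[0.0, 1.0], [1.0, 1.0]],  # feature 1
    [[1.0, 1.0], [3.0, 3.0]],  # feature 2
]
y_hat = ffm_predict(x, fields, w0, w, v)
n_quadratic_params = sum(len(vec) for per_field in v for vec in per_field)
```

The toy model stores n × f × k = 12 quadratic parameters, compared with n × k = 6 for an FM of the same size.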

Application of FFM to the Identification of Needy Undergraduates
In order to fit our data set, we extended the traditional FFM model to a three-class FFM. The input of the model is the data set consisting of the 50 features remaining after feature selection.
The output of the model is the probabilities that a student is a non-needy, needy or extremely needy undergraduate. According to the principle of maximum probability, we choose the class with the highest probability. The softmax function is used as the activation function:
$$\hat{Y}_{p,c} = \frac{e^{z_{p,c}}}{\sum_{c'=1}^{3} e^{z_{p,c'}}}$$
where $z_{p,c}$ denotes the model output for class c on sample p before activation. The cross-entropy loss is used as the loss function:
$$L_p = -\sum_{c=1}^{3} Y_{p,c} \log \hat{Y}_{p,c}$$
where $Y_p$ is the true label of sample p expressed in the form of one-hot encoding. Specifically, if sample p is a non-needy undergraduate, then $Y_p = (1, 0, 0)$. $\hat{Y}_p$ is the output vector obtained after sample p is fed to the model, namely $\hat{Y}_p = (\hat{Y}_{p,1}, \hat{Y}_{p,2}, \hat{Y}_{p,3})$.
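The softmax activation, the maximum-probability decision and the cross-entropy loss for three classes can be sketched in plain Python as follows. The raw class scores are hypothetical stand-ins for the outputs of a three-class model, and the class order (non-needy, needy, extremely needy) is an assumption.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_prob):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)


# Hypothetical raw scores of one sample for the three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # principle of maximum probability
y_true = [1, 0, 0]                         # one-hot label: non-needy (assumed order)
loss = cross_entropy(y_true, probs)
```

With a one-hot label the loss reduces to the negative log-probability assigned to the true class, so confident correct predictions are penalized least.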

Coefficient Setting
As described above, an identification model for needy undergraduates based on three-class Field-aware Factorization Machines was established.
In the experiment, the feature dimension (n) is 50, the latent vector dimension (k) is 30 and the number of fields is 18. The Adagrad algorithm is used for gradient updates, with a learning rate of 0.01; the maximum number of iterations is 30.
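For reference, one Adagrad update can be sketched as below. This is a generic textbook form, not the training code of the experiment: the placement of the small constant `eps` and the toy objective are assumptions, and only the learning rate of 0.01 and the 30 iterations are taken from the setting above.

```python
def adagrad_step(params, grads, accum, lr=0.01, eps=1e-8):
    """In-place Adagrad update: per-parameter step scaled by accumulated squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g                               # running sum of squared gradients
        params[i] -= lr * g / (accum[i] ** 0.5 + eps)   # larger history gives a smaller step


# Hypothetical toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
params, accum = [0.0], [0.0]
for _ in range(30):
    grads = [2.0 * (params[0] - 3.0)]
    adagrad_step(params, grads, accum)
```

Because each step is divided by the root of the accumulated squared gradients, frequently updated parameters take smaller steps than rarely updated ones, which suits sparse one-hot features.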

Bootstrap Method
In order to minimize the randomness in the experiment, we repeat the experiment 10 times using bootstrap. The specific process is as follows:
1) Randomly select 100 samples as the test set;
2) From the remaining data, randomly select 1000 samples as the training set;
3) Train the FFM model using the training set;
4) Calculate the accuracy on the test set.
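The repeated sampling procedure can be sketched in plain Python as follows. The sketch assumes that samples are drawn without replacement within each repetition and that the accuracies are averaged afterwards; `majority_baseline` is a hypothetical stand-in for training and scoring the FFM model, and the data are synthetic.

```python
import random


def repeated_sampling(data, train_and_score, repeats=10, n_test=100, n_train=1000, seed=0):
    """Repeat the random test/train split and collect the accuracy of each run."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test = shuffled[:n_test]                    # 1) random test set
        train = shuffled[n_test:n_test + n_train]   # 2) random training set from the rest
        accuracies.append(train_and_score(train, test))  # 3) train, 4) accuracy
    return accuracies


def majority_baseline(train, test):
    """Hypothetical stand-in model: always predict the most frequent training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)


# Synthetic data with three classes, sized like the cleaned questionnaire set.
data = [((i,), i % 3) for i in range(1756)]
accuracies = repeated_sampling(data, majority_baseline)
mean_accuracy = sum(accuracies) / len(accuracies)
```

Any model can be compared under the same protocol by passing a different `train_and_score` function.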

Result
In order to evaluate the performance of our model, we carried out comparative experiments with several other models: Logistic Regression, SVM, Bayesian Network, Decision Tree and FM. The experimental process is the same as described above. The final results are shown in Table 2 below. It can be seen from Table 2 that the FFM model has the best performance.

Conclusions
From the results of feature selection, it is found that the prominent features affecting the identification of needy undergraduates are annual family income, living expenses provided by parents, family farming, family houses, and total number of children. Therefore, when needy undergraduates are evaluated manually, more consideration should be given to these factors. In addition, the FFM model learns different latent vectors for different fields, so as to further improve the precision of the model.

Future Work
In the future, we will continue our work in two directions: the data and the model. So far, we have only collected data from undergraduates of Wuhan University of Technology. In reality, however, the standards for identifying needy undergraduates still differ between schools. Therefore, we will collect data on undergraduates from other schools to further improve the generalization ability of the model.
Recently, more complex Factorization Machine models have been proposed, such as DeepFM [23] and xDeepFM [24]. We will try to apply these models to the identification of needy undergraduates in the future.