1. Introduction
The number of undergraduates in Chinese higher education institutions has been rising since 2000, when the country began to expand enrollment in higher education. At the same time, tuition fees at universities and colleges have also been rising, which puts considerable economic pressure on many students admitted from rural areas. It is therefore essential to identify and fund needy undergraduates, both to cultivate qualified talents for the country and to add vitality to the world of knowledge. However, because of the large number of undergraduates, the identification of needy undergraduates is often inconsistent, so an accurate and effective method is needed to support the financial aid of needy undergraduates.
At present, countries around the world rely mainly on family economic surveys to identify needy undergraduates, and each country's standards are based on restrictions on who may receive support. The United States takes family income as the only criterion for identifying needy undergraduates, because its well-developed income verification and tax collection system can effectively report and supervise residents' unearned income. In Japan, household income and asset indicators are combined with various classification indicators to determine undergraduates' family economic status. Uganda relies on proxy variables, such as the type of the father's job and vehicle, to estimate family income. The Nigerian Student Loan Board uses a four-factor means test that measures a family's financial status by the parents' occupation, income, household size and the number of children in education. In Malawi, a family that wants to receive a student loan must meet one of the following conditions: the parents or guardians are unable to provide financial assistance to the student, the parents or guardians do not have a clear and fixed source of income, or there are other economic reasons approved by the loan committee. In some Latin American countries, the household economic survey is quite rigorous and detailed. In Peru, for example, parents of undergraduates applying for student loans are even interviewed about property such as houses, cars and land, as well as their jobs, employers and wages [1].
In 2004, Pathman, Konrad et al. analyzed data on 723 undergraduates from 69 states, collected by statistical methods, to study the impact of receiving state-funded scholarships and loan repayment programs on needy undergraduates [2]. In 2009, Jiyun Kim, Stephen L. DesJardin and Brian P. McCall used a random utility model to explore the effects of student expectations about financial aid on postsecondary choice, focusing on income and racial/ethnic differences [3]. In 2015, Luo Suo and Jian Gong used a BP neural network to create a nonlinear mapping between the economic conditions of college students and the identification of needy undergraduates [4]. In 2017, Aifeng Li, Zhineng Xiao and Biyun Liang collected 36,546 records of students' dining consumption over three months, built a model with Datist, a big data analysis software, to extract students' dining habits, consumption behaviors, in-school situations and consumption indicators, and then selected needy undergraduates [5]. In 2018, Tao Bairui, Liu Kaida et al. established a GA-SVM-based [6] targeted poverty reduction model for needy undergraduates using the information from freshmen's admission and undergraduates' daily consumption. In 2019, Yao Bei extracted five categories of feature clusters through data cleaning on a campus big data platform with data mining technology, applied logistic regression, random forest and other algorithms for analysis, and eventually established a model for the identification of needy undergraduates [7].
However, although financial subsidies for needy undergraduates have been widely applied in China over the past decade, the policy still needs to be improved in terms of simplicity, effectiveness and accuracy. For example, schools identify needy undergraduates mainly through manual review and class voting, which easily introduces subjectivity and randomness.
Field-aware Factorization Machines (FFM) was first proposed in 2016; it is well suited to sparse data but has not previously been used to identify needy undergraduates. Therefore, this paper adopts a three-class Field-aware Factorization Machines method to establish an identification model for needy undergraduates, so as to address the lack of uniform identification standards and the subjectivity of the identification process, and it achieves better performance than previous work. All the work above is supported by Wuhan University of Technology.
2. Data Collection and Processing
2.1. Data Collection
The subjects of the questionnaire survey are undergraduates of Wuhan University of Technology. The design of the questionnaire is mainly based on the questionnaire on Chinese college students' family economic situation and on the intermediate steps of the identification of needy undergraduates. There are 19 questions in total. Some questionnaires were distributed through the Internet, while others were distributed in paper form. Finally, the questionnaires were screened by two staff members and 1842 questionnaires were retained.
2.2. Data Encoding
In order to train the model, we transformed all discrete features, such as "Nation", "Locality" and "City", into one-hot encodings, while the continuous variables were kept unchanged. The final outcomes of the data encoding are listed in Table 1.
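As a brief illustration of this step, the sketch below one-hot encodes a few toy categorical columns with pandas while leaving continuous columns untouched; the column names and values are made up for illustration and are not the actual questionnaire fields.

```python
import pandas as pd

# Toy rows standing in for questionnaire answers; names and values are illustrative.
df = pd.DataFrame({
    "Nation": ["Han", "Hui", "Han"],
    "City": ["Large", "Small", "Medium"],
    "AnnualFamilyIncome": [30000, 12000, 80000],     # continuous, kept as-is
    "MonthlyLivingExpenses": [800, 600, 1500],       # continuous, kept as-is
})

encoded = pd.get_dummies(df[["Nation", "City"]])     # one-hot encode the discrete features
features = pd.concat(
    [encoded, df[["AnnualFamilyIncome", "MonthlyLivingExpenses"]]], axis=1
)
print(features)
```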
2.3. Anomaly Detection
Since invalid questionnaires affect the accuracy of the model, it is necessary to clean the data. This paper uses the LOF (Local Outlier Factor) algorithm for this purpose.
The definitions of some terms and symbols in LOF are as follows [8]:
1) K-distance
$$d_k(p) = d(p, o) \tag{1}$$

where o is the kth closest point to point p (not including p itself).
2) Reachable distance
The kth reachable distance from point o to point p is defined as:
$$\mathrm{reach\text{-}dist}_k(o, p) = \max\{d_k(p), d(o, p)\} \tag{2}$$

that is, if point o is within the k nearest neighbors of point p, the kth reachable distance is the k-distance of p; otherwise, it is the real distance between o and p.
3) K-domain
The k-domain of point p, marked as $N_k(p)$, is the collection of all points within the k-distance radius of point p, including the points on the boundary. The number of elements in $N_k(p)$ is marked as $|N_k(p)|$.
4) Local reachability density
The local reachable density of point p is expressed as:
$$\mathrm{lrd}_k(p) = 1 \Big/ \frac{\sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o)}{|N_k(p)|} \tag{3}$$

$\mathrm{lrd}_k(p)$ represents the reciprocal of the average reachable distance of the points in the k-domain of point p. The higher the density value is, the more likely the point belongs to the same cluster as its neighbors; the lower the density, the more likely it is to be an outlier.
5) Local outlier factor
The local outlier factor of point p is expressed as:
$$\mathrm{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}}{|N_k(p)|} \tag{4}$$

$\mathrm{LOF}_k(p)$ represents the average ratio of the local reachable density of the points in the k-domain of point p to the local reachable density of point p. If the ratio is close to 1, the density of p is similar to that of its neighborhood points, and p may belong to the same cluster as them. If the ratio is less than 1, the density of p is higher than that of its neighborhood points, and p is a dense point. If the ratio is greater than 1, the density of p is lower than that of its neighborhood points, and p is more likely to be an outlier [9].
The main flow of the LOF is as follows:
1) For each data object p in the data set, find its k-domain $N_k(p)$ and calculate the reachable distances of its points;
2) Calculate the local reachability density of data object p;
3) Calculate the local outlier factor of data object p;
4) Repeat the above steps to calculate the local outlier factor for all data objects, sort them, and select outliers according to a preset threshold.
The algorithm is applied to the 1842 questionnaires collected in the survey of needy undergraduates and finally retains 1756 valid questionnaires; a minimal code sketch of this step is given below.
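The sketch uses scikit-learn's LocalOutlierFactor as one possible implementation of the LOF procedure described above; the feature matrix and the neighborhood size are synthetic placeholders rather than the paper's exact data and settings.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(1842, 60))            # stand-in for the 1842 encoded questionnaires

lof = LocalOutlierFactor(n_neighbors=20)   # k used for the k-domain (illustrative value)
labels = lof.fit_predict(X)                # -1 marks outliers, +1 marks inliers
scores = -lof.negative_outlier_factor_     # LOF_k(p): larger values are more anomalous

X_clean = X[labels == 1]                   # keep only the records judged valid
print(f"kept {X_clean.shape[0]} of {X.shape[0]} records")
```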
2.4. Feature Selection
As some of the features in this questionnaire are redundant, feature selection is carried out in order to make better use of prior knowledge and to avoid or alleviate overfitting [10]. Mutual information, the Spearman rank correlation coefficient and the distance correlation coefficient are used below for feature selection.
2.4.1. Mutual Information
Mutual information measures the amount of information that one random variable contains about another, i.e., the reduction in the uncertainty of one random variable due to knowledge of the other. Consider two random variables, a feature s and the true label t, with joint probability mass function $p(s, t)$ and marginal probability mass functions $p(s)$ and $p(t)$. The mutual information $R_1$ is the relative entropy between the joint distribution and the product distribution $p(s)p(t)$ [11]. The specific calculation formula is as follows:

$$R_1(s; t) = \sum_{s}\sum_{t} p(s, t) \log \frac{p(s, t)}{p(s)p(t)} \tag{5}$$
When $R_1$ is 0, the feature and the label are independent; the larger $R_1$ is, the more information the feature carries about the label. For an individual pair of values, the corresponding term in the sum is positive when the two values tend to occur together and negative when they tend to exclude each other.
The result is shown in Figure 1. The two most representative features are monthly average living expenses and annual family income, followed by the number of elderly to support; the less significant features are the number of workers, failing status, ethnicity, family type, etc.
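As an illustration, the sketch below scores each feature against the need label with scikit-learn's mutual information estimator; X and y are synthetic stand-ins for the encoded questionnaire matrix and the three-level labels, so the resulting ranking is not the one reported in Figure 1.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1756, 60)).astype(float)  # stand-in for 60 encoded features
y = rng.integers(0, 3, size=1756)                      # 0/1/2: non-needy / needy / extremely needy

r1 = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ranking = np.argsort(r1)[::-1]                         # most informative features first
print(ranking[:5], r1[ranking[:5]])
```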
2.4.2. Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient $R_2$ is a nonparametric measure of statistical dependence that assesses the monotonic relation between two variables. For the Spearman rank correlation coefficient, each variable's rank is used instead of its value, where the rank of tied values is the average of their positions in the ascending order of the values. A perfectly monotone relationship yields a Spearman coefficient of −1 or 1, while a value of 0 corresponds to no correlation [12].
Figure 1. The final scores of the features using mutual information. Refer to Table 1 for the features in Figure 1.
The specific calculation formula is as follows:
$$R_2 = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)} \tag{6}$$

where $r(s_i)$ is the rank of $s_i$, $r(t_i)$ is the rank of $t_i$, $d_i = r(s_i) - r(t_i)$, N corresponds to the number of samples, and $(s_i, t_i)$ is the observed value of the ith sample point.
The value of $R_2$ lies between −1 and 1. A value of 1 means that the two random variables s and t are completely positively correlated in rank; a value of −1 means that s and t are completely negatively correlated; a value of 0 means that there is no monotonic correlation between s and t [13].
The result is shown in Figure 2. The two most representative features are monthly average living expenses and annual family income, followed by the number of elderly to support, housing area, household debt, household work and city size. The less significant features are failing status, ethnicity, etc.
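A small sketch of computing $R_2$ for a single feature column with SciPy is shown below; the "living expenses" values and the derived labels are fabricated purely to make the example runnable.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
expenses = rng.normal(1200, 300, size=1756)                 # illustrative feature values
label = (expenses < 1000).astype(int) + (expenses < 800).astype(int)  # 0/1/2 need level

r2, p_value = spearmanr(expenses, label)                    # rank-based correlation and p-value
print(f"R2 = {r2:.3f}, p = {p_value:.3g}")
```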
2.4.3. Distance Correlation Coefficient
The distance correlation coefficient $R_3$ measures the dependence between the two variables s and t. When $R_3 = 0$, s and t are independent of each other; the larger $R_3$ is, the greater the correlation between s and t.
The correlation coefficient of s and t is expressed as follows [14]:

$$R_3(s, t) = \frac{\operatorname{dCov}(s, t)}{\sqrt{\operatorname{dVar}(s)\,\operatorname{dVar}(t)}} \tag{7}$$

where $\operatorname{dCov}(s, t)$ is the distance covariance of s and t, and $\operatorname{dVar}(s) = \operatorname{dCov}(s, s)$.
The results are shown in Figure 3. It can be seen that the two most representative features are monthly average living expenses and annual family income. And the contribution of the other variables is relatively insignificant.
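For reference, the sketch below computes a sample distance correlation directly from double-centered pairwise distance matrices; it is a generic textbook-style implementation on synthetic data, not the paper's code.

```python
import numpy as np

def distance_correlation(s, t):
    """Biased sample distance correlation between two 1-D arrays."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    def centered(x):
        d = np.abs(x[:, None] - x[None, :])               # pairwise distance matrix
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(s), centered(t)
    dcov2 = (A * B).mean()                                 # squared distance covariance
    dvar_s, dvar_t = (A * A).mean(), (B * B).mean()
    denom = np.sqrt(dvar_s * dvar_t)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
s = rng.normal(size=500)
t = s ** 2 + 0.1 * rng.normal(size=500)                    # nonlinear but clearly dependent
print(distance_correlation(s, t))                          # noticeably larger than 0
```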
2.4.4. Rank-Sum Ratio
In order to integrate the above three methods for feature selection, we applied the rank-sum ratio comprehensive evaluation method to the mutual information $R_1$, the Spearman rank correlation coefficient $R_2$ and the distance correlation coefficient $R_3$ to obtain a contribution ranking of the 60 features. Equation (8) is used to calculate the rank-sum ratio, where n is the number of indices, m is the number of features being ranked, and $R_{ij}$ is the rank of feature i under index j.
$$\mathrm{RSR}_i = \frac{1}{m n}\sum_{j=1}^{n} R_{ij} \tag{8}$$
The ranking results are shown in Figure 4. The ten lowest-ranked features (51st to 60th), namely Northeast, Divorced Family, Insured Residents on Record, Ethnic, Small City, Property Damage, Earthquake or Fire, Always Passed, the Number of Preschoolers and Northern Coast, are removed. Finally, 50 features are left.
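The sketch below shows one way the rank-sum ratio of Equation (8) could be computed over the three score vectors; the scores are random placeholders for $R_1$, $R_2$ and $R_3$, and taking the absolute value of the Spearman scores is an assumption, since only the strength of the relation matters for ranking.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
r1, r2, r3 = rng.random(60), rng.random(60) * 2 - 1, rng.random(60)   # placeholder scores

scores = np.vstack([r1, np.abs(r2), r3])              # n = 3 indices over 60 features
ranks = np.vstack([rankdata(row) for row in scores])  # rank each index across the features
n, m = scores.shape                                   # n indices, m features
rsr = ranks.sum(axis=0) / (m * n)                     # rank-sum ratio per feature, Eq. (8)

order = np.argsort(rsr)[::-1]                         # highest-contributing features first
print(order[:10])
```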
3. Factorization Machine
3.1. Principles of Factorization Machine
The Factorization Machine (FM) was first proposed by Rendle in 2010 [15]. Factorization is applied essentially to solve the problem of feature combination under sparse data, and FM is a general model that can be applied to any real-valued feature vector.
For a given input vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, the expression of the second-order FM model is as follows:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \tag{9}$$
Figure 2. The final scores of the features using the Spearman rank correlation coefficient. Refer to Table 1 for the features in Figure 2.
Figure 3. The final scores of the features using distance correlation coefficient. Refer to Table 1 for the features in Figure 3.
Figure 4. The outcomes of the rank-sum ratio of the 60 features. Refer to Table 1 for the features in Figure 4.
where n is the feature dimension, $w_0$ is the global offset, $w_i$ models the strength of the ith variable $x_i$, and $\mathbf{v}_i \in \mathbb{R}^k$ is a latent vector introduced for the feature $x_i$, with the hyperparameter k ($k \ll n$) [16].
The inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the ith and jth variables $x_i$ and $x_j$ [17]. An independent parameter $w_{ij}$ for each interaction is not adopted; by factorizing the interaction weights, the parameters of the cross terms are no longer independent of each other. For example, the coefficients of $x_1 x_2$ and $x_1 x_3$ ($\langle \mathbf{v}_1, \mathbf{v}_2 \rangle$ and $\langle \mathbf{v}_1, \mathbf{v}_3 \rangle$, respectively) share the common latent vector $\mathbf{v}_1$, so every sample that contains a non-zero combination involving $x_1$ can be used to learn $\mathbf{v}_1$. Even under sparse data, FM is therefore able to learn the parameters of the cross terms well.
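A compact numerical sketch of Equation (9) is given below, using the standard O(kn) rewriting of the pairwise term; all weights are random placeholders rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 30                                  # feature dimension and latent dimension
w0 = rng.normal()                              # global offset
w = rng.normal(size=n)                         # linear weights
V = rng.normal(scale=0.1, size=(n, k))         # one latent vector v_i per feature

def fm_score(x):
    linear = w0 + w @ x
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_{if} x_i)^2 - sum_i v_{if}^2 x_i^2 ]
    interaction = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))
    return linear + interaction

x = rng.random(n)
print(fm_score(x))
```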
3.2. Principles of FFM
The Field-aware Factorization Machine (FFM) was first proposed by Yuchin Juan in 2016 [18]. On the basis of FM, FFM groups features of the same property into the same field. Taking the above categorical variable "Health of parents" as an example, "Health of parents = Healthy", "Health of parents = Either seriously ill", "Health of parents = Both seriously ill" and "Health of parents = Single family" all represent the parents' health condition and can be put into the same field. In general, all the one-hot features generated from the same categorical variable can be placed in the same field.
In FFM, a latent vector $\mathbf{v}_{i, f}$ is learned for each feature $x_i$ and each field f. Therefore, the latent vector is related not only to the feature but also to the field. In other words, different latent vectors are used when the feature "Health of parents = Healthy" interacts with the feature "annual household income" and with the feature "household housing situation", which is consistent with the intrinsic difference between "annual household income" and "household housing situation".
If the n features of a sample belong to f fields, the quadratic term of FFM has nf latent vectors, whereas in the FM model there is only one latent vector for each feature. In fact, FM can be regarded as a special case of FFM, namely the FFM model in which all features are attributed to a single field.
According to the field awareness of FFM, the model can be written as Equation (10):
$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_{i, f_j}, \mathbf{v}_{j, f_i} \rangle x_i x_j \tag{10}$$

where $f_j$ is the field to which the jth feature belongs.
If the dimension of the latent vectors is k, then the number of parameters in the quadratic term of FFM is nfk, far more than in the FM model [19] [20] [21].
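The sketch below evaluates the field-aware pairwise term of Equation (10) for a sparse sample; the field assignment and the latent tensor are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, k = 50, 18, 30                        # features, fields, latent dimension
field_of = rng.integers(0, f, size=n)       # f_j: the field each feature belongs to
V = rng.normal(scale=0.1, size=(n, f, k))   # n*f latent vectors v_{i, field}

def ffm_interaction(x):
    total = 0.0
    nz = np.nonzero(x)[0]                   # only non-zero features contribute
    for a, i in enumerate(nz):
        for j in nz[a + 1:]:
            v_i = V[i, field_of[j]]         # latent vector of i, aware of j's field
            v_j = V[j, field_of[i]]         # latent vector of j, aware of i's field
            total += (v_i @ v_j) * x[i] * x[j]
    return total

x = np.zeros(n)
x[[2, 7, 19]] = 1.0                         # a sparse, one-hot style sample
print(ffm_interaction(x))
```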
4. Establishment of Identification Model for Needy Undergraduates
4.1. Application of FFM to the Identification of Needy Undergraduates
In order to fit our data set, we extended the traditional FFM model to a three-class classification FFM. The final model is described as follows:
$$\phi_c(\mathbf{x}) = w_0^{(c)} + \sum_{i=1}^{n} w_i^{(c)} x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_{i, f_j}^{(c)}, \mathbf{v}_{j, f_i}^{(c)} \rangle x_i x_j, \quad c = 1, 2, 3 \tag{11}$$

where c indexes the three classes (non-needy, needy and extremely needy undergraduates), $w_0^{(c)}$ is the global offset of class c, $w_i^{(c)}$ is the linear weight of the ith feature for class c, and $\mathbf{v}_{i, f_j}^{(c)}$ is the latent vector of feature i with respect to field $f_j$ for class c.
The input of the model is the data set consisting of the remaining 50 features after feature selection.
The output of the model is the probability that an undergraduate is non-needy, needy or extremely needy. According to the principle of maximum probability, the class with the largest probability is chosen as the prediction. The softmax function is used as the activation function:
$$\hat{y}_c = \frac{e^{\phi_c(\mathbf{x})}}{\sum_{c'=1}^{3} e^{\phi_{c'}(\mathbf{x})}}, \quad c = 1, 2, 3 \tag{12}$$
The cross-entropy loss is used as the loss function:
$$L = -\sum_{p}\sum_{c=1}^{3} t_{p, c} \log \hat{y}_{p, c} \tag{13}$$

where $\mathbf{t}_p$ is the true label of sample p expressed in one-hot form; specifically, if sample p is a non-needy undergraduate, then $\mathbf{t}_p = (1, 0, 0)$. $\hat{\mathbf{y}}_p$ is the output vector obtained after sample p is fed to the model, namely $\hat{\mathbf{y}}_p = (\hat{y}_{p,1}, \hat{y}_{p,2}, \hat{y}_{p,3})$ [22].
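The output layer described by Equations (12) and (13) can be sketched as follows; the three class scores stand in for the per-class FFM outputs and are not produced by a trained model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(t_onehot, y_prob):
    return -np.sum(t_onehot * np.log(y_prob + 1e-12))

scores = np.array([1.3, 0.2, -0.8])          # placeholder scores phi_1, phi_2, phi_3
probs = softmax(scores)                      # P(non-needy), P(needy), P(extremely needy)
t = np.array([1.0, 0.0, 0.0])                # one-hot label of a non-needy undergraduate

print(probs, cross_entropy(t, probs))
print("predicted class:", int(probs.argmax()))  # principle of maximum probability
```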
4.2. Experiments
4.2.1. Coefficient Setting
According to the passage above, an identification model for needy undergraduates based on three-class classification Field-aware Factorization Machines was established.
In the experiment, the feature dimension (n) is 50, the latent vector dimension (k) is 30 and the number of fields (f) is 18. The Adagrad algorithm is used for gradient updates with a learning rate of 0.01, and the maximum number of iterations is 30.
4.2.2. Bootstrap Method
In order to minimize randomness in the experiment, we repeat the experiment 10 times using a bootstrap-style resampling procedure. The specific process is as follows (a code sketch follows the list):
1) Randomly select 100 sets of data as the test set;
2) In the remaining data set, randomly select 1000 sets of data as the training set;
3) Train FFM model using the training set;
4) Calculate the accuracy $acc_i$ of the FFM model on the test set;
5) Return to step 2 and repeat the experiment ten times;
6) Calculate the average of the ten accuracies as the final performance of the FFM model.
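A minimal sketch of this repeated-holdout procedure is shown below. Since the paper's FFM implementation is not reproduced here, a scikit-learn LogisticRegression is used as a placeholder classifier and the data are synthetic; in practice, the trained three-class FFM would take its place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1756, 50))                         # stand-in for the 50 selected features
y = rng.integers(0, 3, size=1756)                  # stand-in three-class labels

idx = rng.permutation(len(X))
test, rest = idx[:100], idx[100:]                  # step 1: fixed 100-sample test set

accs = []
for _ in range(10):                                # steps 2-5, repeated ten times
    train = rng.choice(rest, size=1000, replace=False)   # step 2: 1000 training samples
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])  # step 3
    accs.append(model.score(X[test], y[test]))     # step 4: accuracy on the test set

print("mean accuracy:", np.mean(accs))             # step 6: final performance
```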
4.3. Result
In order to demonstrate the performance of our model, we compared it with several other models, including Logistic Regression, SVM, Bayesian Network, Decision Tree and FM. The experimental procedure is the same as described above. The final results are shown in Table 2 below.
It can be seen from Table 2 that the FFM model has the best performance.
5. Conclusions
From the results of feature selection, it is found that the prominent features affecting the identification of needy undergraduates are annual family income, living expenses provided by parents, family farming, family housing, and the total number of children. Therefore, when making manual evaluations of needy undergraduates, more consideration should be given to these factors.
Table 2. The outcomes of the six different methods.
In addition, the comparative experiments show that FM and FFM perform well on classification problems with sparse data, because they can effectively learn feature combinations. Compared with FM, FFM introduces the concept of the field: features of the same property are attributed to the same field and different latent vectors are learned for different fields, which further improves the precision of the model.
6. Future Work
In the future, we will continue our work in two directions. One is the data, the other is the model.
When collecting data, we only collected data on needy students from Wuhan University of Technology. In reality, however, there are differences in the identification criteria across schools. Therefore, we will collect data on undergraduates from other schools to further improve the generalization ability of the model.
Recently, more complex Factorization Machine models have been proposed, such as DeepFM [23] and xDeepFM [24]. We will try to apply these models to the identification of needy undergraduates in the future.
Acknowledgements
This paper is financially supported by the National Undergraduates Innovation and Entrepreneurship training Program, Wuhan University of Technology, China (No. S201910497069).