Quality Assessment of Training Data with Uncertain Labels for Classification of Subjective Domains

In order to improve the performance of classifiers in subjective domains, this paper defines a metric to measure the quality of subjectively labelled training data (QoSTD) by means of K-means clustering. The QoSTD is then used as a weight on the predicted class scores to adjust the likelihoods of instances. Moreover, two measurements are defined to assess the performance of classifiers trained on subjectively labelled data. Binary classifiers of Traditional Chinese Medicine (TCM) Zhengs are trained and re-trained on a real-world data set, using the support vector machine (SVM) and discriminant analysis (DA) models, so as to verify the effectiveness of the proposed method. The experimental results show that the consistency of the likelihoods of instances with the corresponding observations is increased notably for the classes, especially in the cases where the training data set has relatively low QoSTD. The experimental results also indicate how to eliminate mislabelled instances from the training data set when re-training classifiers in subjective domains.


Introduction
Recently, much research has aimed at predicting the status of individuals in subjective domains, including their emotional states, their health, and their personality, by using a set of training data acquired from a variety of sensors and interpreted or labelled by a first person or a third person [1] [2] [3]. As described by C. E. Brodley in [4], labelling noise in training data sets is emergent in such domains for several reasons, including data-entry error, inadequacy of the information used to label each object, and, especially, uncertainty of the states. In [5], a statistical taxonomy of label noise inspired by [6] is summarized. Three kinds of models are defined: noisy completely at random, noisy at random, and noisy not at random. All of these models assume that the true classes exist, and whether a labelling error occurs is indicated by introducing a binary variable. However, in the above domains of emotion, health, or personality, the absolute ground truth is unknown. The subjects, including first persons and third persons, subjectively provide labels as their choices, so that class-uncertain label noise naturally appears, whether these labels are provided by a skillful expert or a nonprofessional. For example, in the domain of Traditional Chinese Medicine (TCM), the status of health is described fundamentally by 13 Zhengs, which are diagnosed by TCM doctors based on information acquired through the five senses. Further, the severity of each of these 13 Zhengs is scored according to subjective observation. Accordingly, the scores of the 13 Zhengs labelled by TCM doctors are ambiguous, because the absolute ground truth of the 13 Zhengs is unknown.
Many methods have been proposed to deal with label noise. In the literature, there are three main approaches [5]. The first focuses on building algorithms that are robust to label noise; the second tries to improve the quality of the training data by identifying mislabeled data; the third, label noise-tolerant learning algorithms, aims at building a label noise model simultaneously with a classifier, which uncouples the two components of the data generation process and improves the results of the classifier. All of these methods rest on the premise that the label errors are independent of the ground truth of the classes, that is, that the ground truth of the classes exists even though the observed labels are contaminated for some reason. However, this premise clearly does not hold for subjective domains, because the observed labels are also influenced by the uncertainty of the classes.
In order to deal with the above issue of label noise in training data caused by subjective labelling, especially in domains where the ground truth is uncertain, we define a metric, QoSTD, intended to measure the quality of training data with uncertain labels for classification in subjective domains, so as to predict the states of a person's emotion, health, and so on.

Y. Dai
We trained binary classifiers for the states based on the support vector machine (SVM) model and the discriminant analysis (DA) model, so as to validate the relation of QoSTD with the performance of classification. Furthermore, the QoSTD is used as a weight on the predicted class scores to adjust the likelihoods of the instances without the absolute ground truth. To evaluate the effectiveness of QoSTD in dealing with the label noise introduced by subjective labelling, we used the TCM Zheng training data set from [7] and [8] for experiments.
The experimental results show that the proposed method improved the consistency of the likelihoods of instances with the corresponding observations notably for the classes, especially in the cases where the training data set has relatively low QoSTD. The experimental results also indicated how to eliminate mislabelled instances from the training data set when re-training classifiers in subjective domains.

Related Works
The literature contains many studies on classification in the presence of label noise [4]. In [5], a method for identifying and eliminating mislabeled training instances for supervised learning is proposed. That paper focuses on the issue of determining whether or not to apply filtering to given data. However, for the work described there, the data were artificially corrupted; applying the method to relatively noise-free datasets should therefore not significantly impact the performance of the final classification procedure. Moreover, the authors indicated that a future direction of this research would be to extend the filter approach to correct labelling errors in training data. However, it is difficult to judge labelling errors in subjective domains, because the absolute ground truth is unknown. In [9], the authors propose an unsupervised mixture model into which the supervised information is introduced, so as to compare the supervised information given by the learning data with an unsupervised modelling. For this model, the probability that the jth cluster belongs to the ith class is introduced to measure the consistency between classes and clusters. However, it is not possible to obtain an explicit solution for this probability for the classes of subjective domains. In [10], a self-training semi-supervised support vector machine (SVM) algorithm and its corresponding model selection method are proposed to train a classifier with small training data. The model introduces the Fisher ratio, which represents the separability of the corresponding set.
It is obvious that the above parameter is not available for the classes of subjective domains when the ground truth of the classes is unknown. In [11], the quality of class labels in medical domains is considered. However, the ground truth of the training data used in the experiments is assumed to be certain, and the data were corrupted artificially to analyze the impact of injected noise on the classification. Moreover, reference [12] analyzes a number of pieces of evidence supporting a single subjective hypothesis within a Bayesian framework. Reference [13] introduces an emotion-processing system that is based on fuzzy inference and subjective observation. In [14], to make the annotation more reliable, the proposed method integrates local pairwise comparison labels to minimize a cost that corresponds to the global inconsistency of the ranking order. In [15], the authors construct subjective classification systems to predict the sensation of reality from multimedia experiences based on EEG and peripheral physiological signals such as heart rate and respiration. In [16], the authors propose a machine learning based data fusion algorithm that can provide real-time per-frame training and decision based cooperative spectrum sensing. For labelled data imbalance, the authors in [17] propose a framework based on the correlations generated between concepts; the general idea is to identify negative data instances which have certain positive correlations with data instances in the target concept, to facilitate the classification task. In [18], robust principal component analysis and linear discriminant analysis are used to identify the features, and a support vector machine (SVM) is applied to classify the tumor samples of gene expression data based on the identified features. However, none of these methods considers how to deal with the effects of mislabelled training data on the classification.
On the other hand, various methods have been proposed that utilize TCM to infer the health status of an individual as a means of auto-diagnosing.References [7] and [8] propose methods that use TCM Zheng to infer the health status of individuals by using images of their face and eyes, data on their emotional and physical state, and Zheng scores assigned by different TCM doctors (TCMDs).
References [19] and [20] analyze the effect of multimodal sensor data on Zheng classification. However, none of these papers considers introducing the metric QoSTD as the weights of the predicted class scores of the instances, so as to improve the reliability of classification in subjective domains.

Measuring the Quality of Subjectively Labelled Training Data
Because the target is status classification in subjective domains, the eigenfeature vectors of the instances are obtained by calculating the eigenvalues and eigenvectors of AᵀA; this is based on the method of principal component analysis (PCA) [21]. Ranking the eigenvalues in descending order, the corresponding top P eigenvectors are selected to form a matrix U of size M × P. Then, the matrix of eigenfeatures of the samples is computed by Equation (1):

EF = A · U,   (1)

where the element ef_{s,p} of EF is indexed by the sample s and the eigenfeature p. Thus, the size of EF is S × P.
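As a concrete sketch of Equation (1), the snippet below computes the top-P eigenvectors of AᵀA by power iteration with deflation and projects the samples onto them. The power-iteration routine is only an illustrative stand-in (the paper does not specify its PCA implementation), and all function names are hypothetical:

```python
import math

def top_eigvecs(C, P, iters=200):
    """Top-P eigenvectors of a symmetric M x M matrix C, found by
    power iteration with deflation (illustrative, not production PCA)."""
    M = len(C)
    C = [row[:] for row in C]                        # copy; deflated in place
    vecs = []
    for _ in range(P):
        v = [math.sin(k + 1.0) for k in range(M)]    # generic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(C[r][c] * v[c] for c in range(M)) for r in range(M)]
            lam = math.sqrt(sum(x * x for x in w))
            if lam < 1e-12:
                break
            v = [x / lam for x in w]
        vecs.append(v)
        for r in range(M):                           # deflate: C <- C - lam*v*v^T
            for c in range(M):
                C[r][c] -= lam * v[r] * v[c]
    return vecs

def eigenfeatures(A, P):
    """Equation (1): EF = A * U, with U the top-P eigenvectors of A^T A."""
    S, M = len(A), len(A[0])
    C = [[sum(A[s][r] * A[s][c] for s in range(S)) for c in range(M)]
         for r in range(M)]                          # C = A^T A  (M x M)
    U = top_eigvecs(C, P)
    return [[sum(A[s][m] * U[p][m] for m in range(M)) for p in range(P)]
            for s in range(S)]                       # EF is S x P
```

With S samples of M features, the result has the S × P shape stated in the text.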
The eigenfeature vector is then used to represent the instance. The samples belonging to a given state and those not belonging to that state are considered to overlap due to the subjectivity of the labelling. Accordingly, a metric called QoSTD is defined to measure how well the training data set can be divided into binary classes. This allows us to explore the influence of the features and the subjectively labelled data on the state that is perceived. QoSTD is calculated based not only on the partition of the training data, but also on its clustering ability; we call these two metrics the partition and the clustering, and together they determine the performance of classification on the training data. Let the score of State j for sample s labelled by Observer i be denoted as z_s^{ij}. In the training data set, the instances with scores larger than 0 for State j are considered to be labelled as State j and compose the positive data set, and those with a score of 0 for State j compose the negative data set. The data points are clustered into two clusters by K-means; we assume that the cluster with more positive instances is the positive cluster, and the cluster with more negative instances is the negative cluster. The clustering clu_ij of the data set for State j labelled by Observer i is then defined by Equation (2) as the proportion of instances labelled as State j that are clustered into the positive cluster of State j, where # indicates the number of data points; the partition par_ij is defined analogously by Equation (3) from the labelled partition of the data.
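Because the printed forms of Equations (2) and (3) did not survive extraction, the sketch below implements one plausible reading of clu_ij: run 2-means on the eigenfeatures and take the fraction of positively labelled instances that fall into the positive cluster. The K-means routine and the function names are illustrative assumptions:

```python
import random

def kmeans2(points, iters=50, seed=0):
    """Plain 2-means on eigenfeature vectors; returns a cluster id (0/1)
    per point. Illustrative only -- any K-means routine would do."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)              # initial centroids
    assign = [0] * len(points)
    for _ in range(iters):
        d = lambda p, c: sum((a - b) ** 2 for a, b in zip(p, c))
        assign = [0 if d(p, c0) <= d(p, c1) else 1 for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                cent = [sum(col) / len(members) for col in zip(*members)]
                if k == 0:
                    c0 = cent
                else:
                    c1 = cent
    return assign

def clu_metric(points, labels, assign):
    """One plausible reading of clu_ij: the fraction of positively
    labelled instances (score > 0) that fall into the positive cluster,
    i.e. the cluster holding more positive instances."""
    pos_in = [sum(1 for l, a in zip(labels, assign) if l > 0 and a == k)
              for k in (0, 1)]
    pos_cluster = 0 if pos_in[0] >= pos_in[1] else 1
    n_pos = sum(1 for l in labels if l > 0)
    return pos_in[pos_cluster] / n_pos if n_pos else 0.0
```

On well-separated data whose labels agree with the clusters, this reading yields clu_ij = 1, matching the paper's statement that a value of 1 means complete separability.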

Using QoSTD for Classification
As mentioned above, the metric QoSTD_ij could be used as a weight of the predicted class scores. The following is the scheme that trains classifiers utilizing QoSTD_ij. Generally, an existing supervised learning algorithm, for example, discriminant analysis or SVM, is used to obtain the predicted class score score_s^{ij} of an instance, which is then adjusted by Equation (5):

r_score_s^{ij} = QoSTD_ij · score_s^{ij},   (5)

where r_score_s^{ij} denotes the adjusted score of instance s belonging to State j labelled by Observer i.
Then, the likelihood of the instance belonging to State j is calculated by Equation (6):

l_s^{ij} = 1 / (1 + exp(−a · r_score_s^{ij})),   (6)

where l_s^{ij} indicates the likelihood of instance s belonging to State j labelled by Observer i, and the parameter a is the slope of the logistic function.
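Under the reading of Equation (6) as a logistic squashing of the QoSTD-weighted score (the slope parameter a and the thresholds near 0 and 1 suggest this form, but it is a reconstruction), Equations (5) and (6) together reduce to a couple of lines:

```python
import math

def adjusted_likelihood(score, qostd, a):
    """Equation (5): weight the predicted class score by QoSTD_ij;
    Equation (6), read as a logistic with slope a (a reconstruction)."""
    r_score = qostd * score                      # Equation (5)
    return 1.0 / (1.0 + math.exp(-a * r_score))  # Equation (6)
```

With the paper's settings, a = 1/100 would be used with SVM scores and a = 50 with DA scores; a zero score always maps to a likelihood of 0.5.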
For an instance s, if the value of the adjusted likelihood l_r_s^{ij} is more than a threshold T_max, it is assigned to the positive class of State j; if that value is less than a threshold T_min, it is assigned to the negative class of State j; otherwise, whether the instance belongs to State j is uncertain, and the uncertain instances are eliminated. Con_ij, defined by Equation (7), reflects the consistency of the labelled scores of the assigned instances in the training data with their likelihoods. Let the labelled scores that are larger than 0 be denoted as pz_s^{ij}, the likelihoods of the assigned instances as l_ra_s^{ij}, and the number of the assigned instances as S1_ij; Equation (7) then measures the correlation between pz_s^{ij} and l_ra_s^{ij} over the S1_ij assigned instances. On the other hand, Recall_ij, defined by Equation (8), reflects the ratio of the number of assigned instances to the total:

Recall_ij = S1_ij / S.   (8)
It is obvious that the larger the values of Con_ij and Recall_ij are, the better the classifier performs. The whole training procedure is as follows:
Step 1: Construct the binary classification model.
Step 2: Calculate the adjusted likelihood l_r_s^{ij} of the instances by Equations (5) and (6).
Step 3: If T_min ≤ l_r_s^{ij} ≤ T_max, the instance s is not assigned to State j; otherwise, the instance s is assigned to the positive or negative class of State j according to the likelihood.
Step 4: Eliminate the unassigned instances and repeat from Step 1 with the refined training data set, until the limited number of rounds or the objective values of Con_ij and Recall_ij are reached.
After constructing the binary classification model, a new instance is assigned to the positive class of State j if its l_r_s^{ij} is larger than T_max, and to the negative class if its l_r_s^{ij} is less than T_min; otherwise, the class of the instance is uncertain.
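The step list above can be sketched as a loop. Here `train_fn` stands in for any supervised scorer (the paper uses SVM and DA), and Con_ij is computed as the Pearson correlation of labelled scores with likelihoods, which is only an assumption since Equation (7) itself is not fully recoverable:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 on degenerate input."""
    n = len(xs)
    if n == 0:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def refine(instances, labels, train_fn, qostd, a=50.0,
           t_max=0.99, t_min=0.01, max_rounds=20,
           con_obj=0.7, recall_obj=0.1):
    """Steps 1-4: train, adjust likelihoods with QoSTD (Equations (5)-(6)),
    keep only confidently assigned instances, and re-train on the rest."""
    data, labs, total = list(instances), list(labels), len(instances)
    for _ in range(max_rounds):
        scorer = train_fn(data, labs)                    # Step 1
        liks = [1.0 / (1.0 + math.exp(-a * qostd * scorer(x)))
                for x in data]                           # Step 2
        keep = [i for i, l in enumerate(liks)
                if l > t_max or l < t_min]               # Step 3
        recall = len(keep) / total                       # Equation (8)
        con = pearson([labs[i] for i in keep],
                      [liks[i] for i in keep])           # Con as correlation
        if con >= con_obj and recall >= recall_obj:
            break
        data = [data[i] for i in keep]                   # Step 4: refine, repeat
        labs = [labs[i] for i in keep]
    return scorer, con, recall
```

A degenerate `train_fn`, such as `lambda d, l: (lambda x: 1.0 if x > 0 else -1.0)`, is enough to exercise the loop end to end.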

Training Data Set
The class scores predicted by the classifiers are converted into likelihoods by Equations (5) and (6). For the SVM model, the value of the parameter a in Equation (6) is set to 1/100, and for the DA model it is set to 50. Moreover, T_max = 0.99 and T_min = 0.01. An instance s is assigned to Zheng j if the calculated likelihood is larger than 0.99; if the likelihood is less than 0.01, the instance s does not belong to Zheng j; otherwise, the assignment of the instance s is uncertain. The training procedure is repeated with the refined training data set until the limited number of rounds or Recall_Obj_ij is reached. For several Zhengs, the QoSTD value is less than 0.6. It is observed that the quality of the Zheng scores labelled by TCMD2 and TCMD3 is not as good as that of the scores labelled by TCMD1 and TCMD4. We thus conclude that QoSTD_ij can be used as a criterion for judging the quality of subjectively labelled training data. If QoSTD_ij is less than a threshold, the subsequent learning procedure should be abandoned, so as to ensure the performance of the classification.

About Adjusting the Predicted Class Scores
As described in Section 3, the predicted class scores of the instances are adjusted by introducing QoSTD_ij as the weights. For comparison, Equation (9) defines Con_ori_ij in the same way as Equation (7), but using the likelihoods l_a_s^{ij} of the assigned instances computed from class scores that are not adjusted by Equation (5).
Table 2 shows the corresponding results while the scores of the instances are labelled by the TCM doctor identified as 1 (TCMD1), and Table 3 shows those while the scores are labelled by the TCM doctor identified as 3 (TCMD3). For Table 3, the results for Zheng 1 are empty, because there were no instances labelled by TCMD3 for Zheng 1 in the training data set.
From the results of Table 2 and Table 3, we can see that the values of Con_ij are increased for almost all of the Zhengs compared with those of Con_ori_ij.

About Re-Training
As described in Section 4, the training procedure is repeated to construct the classifiers with the refined training data set until the limited number of rounds or Recall_Obj_ij is reached. In our experiments, the limited number of rounds is set to 20, Con_Obj_ij is set to 0.7, and Recall_Obj_ij is set to 0.1. Table 4 shows the results of the maximal Con_ij, the corresponding round, and the difference between the maximal and first-round Con_ij for the thirteen Zhengs, while the scores of the instances are labelled by the TCM doctor identified as 1 (TCMD1). Table 5 shows the corresponding results while the scores of the instances are labelled by the TCM doctor identified as 3 (TCMD3).
From the results of Table 4 and Table 5, we can see that the differences between the maximal and first-round Con_ij are larger than or equal to 0, regardless of the TCM doctors and the models used to train the classifiers; in most cases, these values are larger than 0. For Zheng 2 in the case of TCMD1, the maximal Con_{1,2} is 0.80, and the corresponding round is the 9th round with the SVM model. For Zheng 2 in the case of TCMD3, the maximal Con_{3,2} is 0.90 at the 4th round with the DA model, and 0.60 at the 6th round with the SVM model. This matches the issue described in [18]: discarding an uncertain instance from the training data set may degrade the performance of the classification, because that instance may be an exception rather than an error in a small training data set. So we cannot say that consecutively re-training the classifiers with the refined training data set must improve the performance of the classification. The proposed solution regarding this issue is to find the round that makes Con_ij maximal while satisfying the condition regarding Recall_Obj_ij, so as to adopt the classifier trained in that round.

Conclusions
This paper defined the QoSTD metric, an aggregation of two components which reflect the clustering and partitioning ability of the training data set. The training data include the features extracted from the multimodal sensor data of subjects, the subjective scores of various items in a first-person questionnaire, and observation scores of classes in a subjective domain provided by third persons. Using this metric, we can analyze the influence of subjectively labelled data on the quality of the training data, and we can estimate the sufficiency of the training data for classification. When the QoSTD for a particular class is less than a predetermined value, the training data for this class cannot deliver the required classification performance.
The kinds of data used as training data are generally diverse. For instance, the data used to extract features or attributes may include data measured by sensors or other equipment, or data from first-person questionnaires; with direct observation of the object, the states of the instance are labelled by third persons for supervised learning. Although the obtained data are heterogeneous, all of the features extracted from the different modes are handled in the same way. For example, the histogram, shape, and texture of an image are features of the image mode, and the blood pressure measured by a bio-sensor is a feature of the bio-sensor mode. All of these features are considered homogeneous. They are denoted as a_s^{mn} and are normalized for each data set. Here, s, m, and n indicate indices of the sample, the mode, and a feature of that mode. The combined features of all training samples yield a matrix A of size S × M, where S and M are the number of samples and the total number of features, respectively. On the other hand, the state scores labelled by the third person for each instance are denoted as z_s^{ij}, with values ranging from 0 to 10. Here, s, i, and j indicate indices of the sample, the observer, and the state, respectively.
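A minimal sketch of assembling the matrix A follows, assuming min-max normalization per feature within each mode (the paper says only that features are normalized for each data set; the function and key names are hypothetical):

```python
def build_feature_matrix(samples, modes):
    """Assemble A (S x M): concatenate per-mode features, min-max
    normalizing each feature column within its mode's data set."""
    cols = []
    for mode in modes:                       # e.g. "tongue", "face", "feelings"
        width = len(samples[0][mode])
        for n in range(width):
            col = [s[mode][n] for s in samples]
            lo, hi = min(col), max(col)
            cols.append([(v - lo) / (hi - lo) if hi > lo else 0.0
                         for v in col])
    return [list(row) for row in zip(*cols)]  # back to per-sample rows
```

Each row of the result is one sample's concatenated, normalized feature vector, ready for the PCA step of Equation (1).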
The numerator of Equation (2) indicates the number of samples which are labelled as State j and clustered into the positive cluster of State j. So, the larger the values of par_ij and clu_ij are, the better the separability of the training data set for State j is; if these values are equal to 1, the training data are completely separable. Accordingly, the quality of the training data set for classifying State j labelled by Observer i is defined as QoSTD_ij by the following expression, which is an aggregation of par_ij and clu_ij.

QoSTD_ij = w1 · par_ij + w2 · clu_ij,   (4)

where w1 and w2 are the weights of partition and clustering, reflecting the importance of the partition and the clustering ability of the training data in the classification. In the case that these two factors are equally important, both are set to 0.5. The value of QoSTD_ij is equal to 1 if the training data set is completely separable for the State j labelled by Observer i.
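Equation (4) as reconstructed is a one-liner; with w1 = w2 = 0.5 it returns 1 exactly when both par_ij and clu_ij are 1:

```python
def qostd(par, clu, w1=0.5, w2=0.5):
    """Equation (4) as reconstructed: a weighted sum of the partition
    and clustering metrics; w1 + w2 = 1 keeps the result in [0, 1]."""
    return w1 * par + w2 * clu
```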

Figure 1 shows an example of 120 data points of ef_{s,1} and ef_{s,2}. Figure 1(a) is the scatter of the instances regarding State j1 labelled by an observer, and Figure 1(b) is the distribution regarding State j2. The dark blue points indicate the positive instances belonging to that state, and the light blue points indicate the negative instances not belonging to it. Figure 1(c) is the clustering of the data points by K-means; the instances of cluster 1 are indicated by light orange points, and the instances of cluster 2 by dark orange points.
For case (a), cluster 1 is regarded as the negative cluster and cluster 2 as the positive cluster according to the results of Figure 1(a) and Figure 1(c), and likewise for case (b). Based on the definition of QoSTD, and combining the clustering results of Figure 1(c), it is obvious that the quality of the data points' distribution in Figure 1(a) is better than that in Figure 1(b) for training the classification model, although the positive and negative classes overlap in both cases. In fact, the value of QoSTD_ij for case (a) is 0.78, and that for case (b) is 0.43. So, we think that the larger the value of QoSTD_ij is, the better the quality of the training data set labelled by Observer i is for classifying State j. When QoSTD_ij = 1, the data set labelled by Observer i can be divided completely into two classes, positive or negative, for State j.

QoSTD_ij could be used to judge the quality of training data for classification. Accordingly, the value of QoSTD_ij is considered as a weight of the predicted class scores of the instances regarding State j. For calculating QoSTD_ij, the data modes used as the training set are determined based on the context in which the data were collected and the available computational capacity. Next, the feature matrix A is extracted from the multimodal data set. The eigenfeature matrix EF is obtained by Equation (1). Then, the value of QoSTD_ij for State j labelled by Observer i is calculated using Equations (2), (3), and (4).
The uncertain instances are eliminated from the training data set, and the classification model is trained again with the refined training data. Two measurements, Con_ij and Recall_ij, are introduced to assess the performance of classifying the classes without the absolute ground truth; their objective values are denoted as Con_Obj_ij and Recall_Obj_ij. Then, the whole training procedure is as below.
In this study, the real-world training data set used in [7] and [8] is utilized for predicting the individual's health status represented by the states of TCM's thirteen Zhengs (Cold syndrome, Pyretic syndrome, Deficiency of vital energy, Qi stagnation, Blood asthenia, Blood stasis, Jinxu, Phlegm retention, Heart syndrome, Lung syndrome, Spleen syndrome, Liver syndrome, Kidney syndrome), so as to validate the effectiveness of the proposed method. This dataset contains multimodal sensor data about the health status of various individuals. These data include scores of measured physical states and reports of subjective information obtained by first-person questionnaires; in addition, features are extracted from images of the individual's tongue, face, and eyes. The corresponding labelled data set comprises the scores of thirteen Zhengs given by four TCM doctors (TCMDs) who inspected and diagnosed the provided samples. The labelled Zheng scores range from 0 to 10. However, most of these data have values less than 5, because the subject volunteers were students at the university and thus generally healthy. The data from the first-person questionnaires contain nine types of feelings and thirteen physical states related to health status, as proposed by the World Health Organization (WHO). The scores of the corresponding items range from 0 to 5. The features that were extracted from the images of the faces and tongues are shown in Figure 2. The extracted features were combined with the above feelings and physical states to form the matrix A.
Each of these items is a mode of the features. The training data set includes five modes: Feelings, Physical States, Eye, Tongue, and Face. The modes and the number of features for each mode are shown in Table 1; the total number of features is 71. There are 150 instances from 32 individuals in the dataset, each of which includes 71 features and the corresponding thirteen Zheng scores labelled by the four TCMDs. The matrix EF of eigenfeature vectors of the instances is obtained by Equation (1), calculating the eigenvalues and eigenvectors of AᵀA. Then, the matrix EF is used to train the binary classifiers of TCM Zhengs. In order to verify the statement that QoSTD_ij can be utilized as the weight of the predicted class score to improve the performance of the classifiers, especially in the case that the training data are subjectively labelled and the ground truth is uncertain, any existing supervised classification model is applicable; two kinds of classifiers are trained here. One is an SVM model trained with the MATLAB (MathWorks, Natick, MA, USA) function fitcsvm, using a polynomial kernel of order three. The other is a DA model trained with the MATLAB function fitcdiscr. Based on these binary classification models, the class scores of the instances belonging to Zheng j are obtained with the MATLAB function predict. Then, the class scores are used to calculate the likelihood measures of the corresponding instances by Equations (5) and (6).

Figure 3 shows the values of QoSTD_ij used as the weights of the predicted class scores. To explore how adjusting the predicted class scores improves the performance of the classification, another measurement is introduced that reflects the consistency of the labelled scores of the assigned instances with their likelihoods in the case where the class scores are not adjusted. This measurement, Con_ori_ij, is calculated by Equation (9).

Figure 3. Quality of the training data set for all thirteen Zhengs.

Table 2 lists QoSTD_ij and the first-round results of Con_ij and Con_ori_ij of the instances in the above training data set for the thirteen Zhengs, obtained with the DA-based and the SVM-based binary classifiers, while the scores of the instances are labelled by the TCM doctor identified as 1 (TCMD1).
The values of Con_ij are increased for almost all of the Zhengs, compared with the results of Con_ori_ij. In the case of TCMD1, compared with the corresponding values of Con_ori_ij, the values of Con_ij for all of the Zhengs improve with the SVM classification model, and they rise for twelve of the thirteen Zhengs with the DA model. The case of TCMD3 is similar: except for two Zhengs with the DA model, the values of Con_ij are increased for the Zhengs with both the SVM and DA models. In particular, the increases are relatively notable for most Zhengs in the cases where QoSTD_ij is less than 0.5. Moreover, most values of Con_ij are larger with the SVM model than with the DA model; however, most of the rates of increase of Con_ij over Con_ori_ij are larger with the DA model. This means that the DA-based classifiers are more sensitive to QoSTD_ij than the SVM-based classifiers, although the SVM-based classifiers seem to have the better classification ability. Accordingly, we can say that adjusting the predicted class scores with QoSTD_ij as the weights really improves the performance of the trained classifiers, especially when the classifiers are trained on a data set with low values of QoSTD_ij, regardless of the classification model used.
The difference between the maximal and first-round Con_ij is relatively high in the case of TCMD3, which corresponds to the relatively low QoSTD_ij. So we can deduce that eliminating the unassigned examples from the training data set and re-training the binary classifiers with the refined training data set can really improve the performance of the classifiers, especially in the case that QoSTD_ij is relatively low.

However, for some of the Zhengs in Table 4 and Table 5, the maximal values of Con_ij do not reach the value of Con_Obj_ij, which is set as 0.7, although they rise after re-training. This indicates that the provided training data set cannot make the corresponding classifiers achieve the required performance for these Zhengs; in such cases, constructing the classifiers for these states should be given up. It is also noted that the round of re-training that makes Con_ij maximal differs from case to case; for example, for Zheng 2 in the case of TCMD1, the maximal Con_{1,2} is 0.84, and the corresponding round is the 2nd round with the DA model. As a whole, we used the real-world training data set to train the classification models. This training data set involved thirteen Zhengs labelled by TCM doctors, with label noise caused by the unknown absolute ground truth, whereas current research in the literature almost always uses artificially corrupted training data sets. The experiments verified that QoSTD_ij is relevant to the performance of classifying classes without the absolute ground truth: there is a high positive correlation between QoSTD_ij and Con_ij, and introducing QoSTD_ij as the weights of the predicted class scores to adjust the likelihoods of the instances really improved the performance of the classifiers.
The QoSTD metric measures the quality of training data subjectively labelled by observers (i), and is used to improve the prediction of states (j) without the absolute ground truth. QoSTD_ij was used as the weight of the predicted class scores to adjust the likelihoods of the instances. Moreover, two measurements, Con_ij and Recall_ij, were defined in order to assess the performance of the classifiers trained on the subjectively labelled data in a more suitable way. The training procedure was repeated with the refined training data set until the objective values of Con_ij and Recall_ij were reached. To verify the effectiveness of the proposed method, a real-world training data set was used to train classifiers based on the DA and SVM classification models. This training data set involved thirteen Zhengs labelled by TCM doctors, with label noise caused by the unknown absolute ground truth. The experimental results showed the effectiveness of the proposed method in improving the performance of the classifiers for instances without the absolute ground truth. Furthermore, the proposed method indicated how to eliminate instances with label noise from the training data set. As future work, we intend to use other training data sets in the fields of emotion, personality, and so on, to train classifiers based on the proposed method, so as to verify its effectiveness in improving classification in subjective domains.
If the maximal Con_ij reaches Con_Obj_ij and the corresponding Recall_ij is larger than Recall_Obj_ij, the classifier of that round is adopted; if not, the final classification model cannot be determined, and the training procedure is given up.

Table 1. Modes and the number of features.
