Classification of Attribute Mastery Patterns Using Deep Learning

Identifying the attribute mastery patterns of examinees is very important in cognitive diagnosis assessment. There are many methods to classify attribute mastery patterns, and many studies have been done to diagnose what individuals have mastered. In this paper, Monte Carlo computer simulation is used to study the classification of attribute mastery patterns by Deep Learning. Four results were found. First, Deep Learning can be used to classify attribute mastery patterns efficiently. Second, the complexity of the attribute structure decreases the accuracy of the classification; the order of influence is linear, convergent, unstructured and divergent, meaning that the divergent structure is the most complicated and its accuracy is the lowest among the four structures. Third, with increasing rates of slipping and guessing, the accuracy of the classification decreases correspondingly, which is consistent with existing research results. Finally, the results are influenced by the size of the training sample, and the proper sample size is in need of deeper discussion.


Introduction
Cognitive Diagnosis Assessments (CDAs) are used to evaluate the strengths and weaknesses of subjects in terms of the cognitive skills they have learned [1]. Different from other methods, CDAs not only provide general evaluation results, but also show detailed information on individual cognitive skills. According to the evaluation results of CDAs, subjects can then be remediated or instructed. The cognitive skills are also called attributes in CDAs. The core of CDAs is to identify or classify the attributes or cognitive skills that a subject has mastered; this is the attribute mastery pattern of the subject. Accurate and effective identification of the attribute mastery pattern directly affects the results of the evaluation in CDAs.
Psychometric models are often used to identify or classify attribute mastery patterns. In the past several years, the psychometric models of CDAs have developed rapidly and variously. For example, a series of models has been explored based on the DINA model [2]: HODINA (higher-order DINA) [3], GDINA (generalized DINA) [4], RDINA (reparameterized DINA) [4], HORDINA (higher-order reparameterized DINA) [5], P-DINA [6], sequential G-DINA [7] and the multi-level GDINA model [8]. However, these psychometric methods usually make strict assumptions about the specific probability function form of the subjects' item responses. Classification will be poor if the observed data do not fit the model well [9], and current psychometric methods usually work well for large-scale assessments but are unfit for small-scale assessments at the classroom level [10] [11] [12]. Nonparametric methods have therefore been proposed, including the Hamming distance discrimination method [13], clustering methods [10] [11] [14] and the general nonparametric classification method [12].
With the development of Artificial Intelligence (AI), great progress has been made in its core algorithms. The Artificial Neural Network (ANN) algorithm has previously been used to classify attribute mastery patterns. Current research indicates that neural networks can be used in CDAs to classify attribute mastery patterns, with the advantages that they do not depend on assumptions about the distribution of subjects and can minimize classification error [15]. Some research showed that different parameters and attribute numbers of an artificial neural network had an impact on classification accuracy. Compared with parametric methods, the performance of the ANN approach was clearly better, especially when model-data misfits were present [9]. A later study found that ANN was more accurate than the DINA model in recovering skill prerequisite relations [16].
Although the ANN has been used in CDAs and has more advantages than parametric methods under some conditions, there are still some unknown problems to be studied: how the structures of the attributes might affect the ANN method in CDAs, and whether the parameters of guessing and slipping impact the accuracy of the classification. This paper explores the above aspects and contains the following parts. First, the Deep Learning algorithm of the ANN is introduced. In the second part, the DINA model is presented, as it is the framework of this study and is used to generate the real response matrix. Third, in the simulation study, the structures of the attributes, the parameters of guessing and slipping, and the training sample in the ANN are discussed, focusing on how they may affect the accuracy of DL in CDAs.
Finally, the results of the simulation are summarized and discussed.

Introduction to Deep Learning
Deep Learning (DL) [20] [21] [22] [23] [24] is one of the algorithms in ANN, and it has been widely used in face recognition, natural language processing and image processing. To introduce DL, the basic concept is derived from a single neuron, shown in Figure 1. On the left are the input quantities, including x_1, x_2, x_3 and the constant (+1); in the middle is the node; on the right is the output function. The relationship between the input quantities and the output is as follows:

h_{w,b}(x) = f( \sum_{i=1}^{3} w_i x_i + b )    (1)

The parameter b is the intercept, shown as +1 in Figure 1, and f is the activation function. The sigmoid function is often chosen as the activation function:

f(z) = 1 / (1 + e^{-z})    (2)
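A minimal Python sketch of the single neuron defined by Equations (1) and (2); the weight and input values below are arbitrary illustrations:

```python
import math

def sigmoid(z):
    # Equation (2): f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b):
    # Equation (1): h(x) = f(w1*x1 + w2*x2 + w3*x3 + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Three inputs plus the intercept b (the "+1" node in Figure 1).
print(neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], 0.1))
```

Because of the sigmoid activation, the output always lies in the interval (0, 1).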
There is only one neuron and only one layer from the input to the output in Figure 1; this is DL with one layer. The multi-layer case is shown in Figure 2.
n_l represents the number of layers. From left to right, as shown in the figure above, the network includes the input layer, the middle hidden layers, and the rightmost output layer. The parameter w_{ij}^{(l)} is the connection weight between the jth unit of layer l and the ith unit of layer l + 1, and b_i^{(l)} is the offset (bias) term. The initial values of the parameters w and b are randomly generated from N(0, 1). The back propagation algorithm is widely used to estimate the parameters w and b [23] [24].
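The multi-layer feed-forward computation with N(0, 1) initialization of w and b can be sketched as follows (a NumPy illustration; the layer sizes are arbitrary, not those used later in this study):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(layer_sizes):
    # w and b for each layer are drawn from N(0, 1), as in the paper.
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    # Feed-forward pass: each layer applies the sigmoid of w·a + b.
    a = np.asarray(x, dtype=float)
    for w, b in params:
        a = 1.0 / (1.0 + np.exp(-(w @ a + b)))
    return a

params = init_params([3, 4, 2])        # input, one hidden, output layer
output = forward([1.0, 0.0, 1.0], params)
print(output.shape)                    # (2,)
```

The back propagation algorithm then adjusts these randomly initialized parameters against the training data.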
It has been implemented in a MATLAB program. In this paper, it is used to classify the mastery patterns of subjects in CDAs, as the focus of this study is the application of DL in CDAs, not the back propagation algorithm for estimating the parameters w and b. More information about the back propagation algorithm can be found in Rumelhart, Hinton & Williams [23] and LeCun [24].
As an algorithm in Artificial Intelligence, DL has many advantages. One is that overfitting can be controlled. Overfitting occurs when a model learns the details and noise in the training data to such an extent that it negatively affects the model's performance on new data; Regularization, Dropout and Early-stopping are often applied to prevent it. Another advantage is its weak assumptions: DL does not depend on the distribution of the parameters during estimation. The initial values of w and b can be generated randomly, and they are then estimated by the back propagation algorithm based on the sample data.
None of the parameters, including w and b, has a strong assumed distribution, and some are outside the models themselves. However, the application of DL is usually based on large-scale data; whether DL can be used on small-scale data, and how it would perform, remains to be discussed.
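Of the overfitting-prevention techniques mentioned above, early stopping is the simplest to sketch: training halts once the validation loss stops improving. The `step` function and the loss values below are hypothetical:

```python
def train_with_early_stopping(step, patience=3, max_epochs=150):
    # step() trains for one epoch and returns the validation loss; stop
    # once the loss has not improved for `patience` consecutive epochs.
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        loss = step()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_loss

# Hypothetical validation losses: improvement stalls after the third epoch.
losses = iter([1.0, 0.8, 0.7, 0.75, 0.76, 0.77, 0.9])
print(train_with_early_stopping(lambda: next(losses)))  # 0.7
```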

Deep Learning in CDAs
The process from the input quantities to the output quantities is the training, or learning, in DL. The output quantities are the objects to be identified or classified. Take the recognition of a dog image as an example: the input is the image to be identified, and the output indicates whether it shows a dog. The parameters w and b are estimated during training by the back propagation algorithm. This raises a question: what are the input and output quantities in CDAs with DL? In CDAs, the attribute mastery patterns of subjects are classified from their response matrix on the examination, and the cognitive diagnosis model is the connection between the skills mastered and the real responses. When DL is used in CDAs, the real responses are the input, the real skills mastered are the output, and the connection between them is the hidden layers. However, the real skills mastered by the subjects are unknown.
To overcome this limitation, we can use the ideal mastery patterns based on the attribute hierarchy, which are also the ideal item response data, consisting of response patterns that can be fully accounted for by the presence or absence of attributes without random errors or slips [9].

DINA Model
The DINA model [2] is the basic cognitive diagnosis model, and its item response function is as follows:

\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}},    P(X_{ij} = 1 | \alpha_i) = (1 - s_j)^{\eta_{ij}} g_j^{1 - \eta_{ij}}

where \alpha_i is the attribute mastery pattern of examinee i, \eta_{ij} = 1 when examinee i has mastered all attributes required by item j, and s_j and g_j are the slipping and guessing parameters of item j.
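A minimal Python sketch of the DINA item response function, P(X = 1 | α) = (1 − s)^η · g^(1 − η), where η = 1 only when all attributes required by the item are mastered; the parameter values below are illustrative:

```python
import numpy as np

def eta(alpha, q):
    # eta = 1 only if the examinee masters every attribute the item requires.
    return int(np.all(np.asarray(alpha) >= np.asarray(q)))

def p_correct(alpha, q, s, g):
    # DINA: P(X = 1 | alpha) = (1 - s)^eta * g^(1 - eta)
    e = eta(alpha, q)
    return (1 - s) ** e * g ** (1 - e)

# Mastering both required attributes: correct with probability 1 - s.
print(p_correct([1, 1, 0], [1, 1, 0], s=0.1, g=0.2))  # 0.9
# Missing a required attribute: correct only by guessing, probability g.
print(p_correct([1, 0, 0], [1, 1, 0], s=0.1, g=0.2))  # 0.2
```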

Q-Matrix
In CDAs, the Q-matrix is a J × K binary matrix used to relate attributes to items [25] [26], where J is the test length and K is the number of attributes. The element q_{jk} in row j and column k equals 1 if attribute k is needed by item j, and 0 otherwise. Thus, given K attributes, there are at most 2^K − 1 distinct item attribute profiles; for example, with 3 attributes there are at most 2^3 − 1 = 7 profiles [27], and more information can be found in the original papers.
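As an illustration, a hypothetical Q-matrix for J = 4 items and K = 3 attributes, together with the 2^K − 1 distinct non-zero attribute profiles:

```python
import numpy as np
from itertools import product

# Hypothetical Q-matrix: rows are items, columns are attributes;
# Q[j, k] = 1 means item j requires attribute k.
Q = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [1, 1, 1]])

K = Q.shape[1]
# With K attributes there are at most 2^K - 1 distinct non-zero profiles.
profiles = [p for p in product([0, 1], repeat=K) if any(p)]
print(len(profiles))  # 7
```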

Attribute Structure in the Simulation
Four kinds of structures of six attributes were compared in the simulation: Linear, Convergent, Divergent and Unstructured, as shown in Figure 4.
As Figure 4 shows, in the Linear hierarchy, individuals who want to master attribute A2 must master attribute A1 first; after both A1 and A2 have been mastered, A3 can be mastered, which means each later attribute is mastered on the basis of the skills before it. In the Divergent hierarchy, after A1 has been mastered, A2 and A3 can be mastered, but there is no relation between A2 and A3. In the Convergent hierarchy, mastery of A6 is based on the chain A1, A2, A3, A5 or the chain A1, A2, A4, A5. In the Unstructured hierarchy, A1 is the prerequisite of A2, A3, A4, A5 and A6, but there is no relation among A2, A3, A4, A5 and A6. In this paper, the performance of DL in CDAs is discussed under these four kinds of hierarchy.
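The prerequisite logic described above determines which attribute patterns are permissible under a hierarchy. A sketch for the Linear case (the dictionary encoding is our own illustration, not from the paper):

```python
from itertools import product

# Linear hierarchy over six attributes: each attribute requires the
# previous one, i.e. A1 -> A2 -> A3 -> A4 -> A5 -> A6.
linear_prereq = {1: [], 2: [1], 3: [2], 4: [3], 5: [4], 6: [5]}

def permissible_patterns(prereq):
    # A pattern is permissible if every mastered attribute also has all
    # of its prerequisite attributes mastered.
    attrs = sorted(prereq)
    patterns = []
    for bits in product([0, 1], repeat=len(attrs)):
        p = dict(zip(attrs, bits))
        if all(all(p[q] == 1 for q in prereq[a]) for a in attrs if p[a]):
            patterns.append(bits)
    return patterns

# A linear hierarchy admits only the "prefix" patterns: 7 in total,
# from (0,0,0,0,0,0) up to (1,1,1,1,1,1).
print(len(permissible_patterns(linear_prereq)))  # 7
```

The other three hierarchies of Figure 4 would simply use different prerequisite dictionaries.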

Q-Matrix of the Attribute Structures
The ideal attribute mastery patterns derived from Figure 4 are shown in Table 1.

The Simulation of Classification with DL in CDAs
The hierarchy of the cognitive attributes, the slipping and guessing parameters of the examinees, and the sizes of the training and testing samples in DL are all considered in this paper. Four kinds of hierarchy with six attributes were studied: Divergent, Convergent, Linear and Unstructured, with the structures shown in Figure 4. The values of slipping (s) and guessing (g) were s = g = 0.05, 0.1, 0.15 or 0.2. The sample size was 1000, divided into two cases: 500 for training and 500 for testing, or 800 for training and 200 for testing. The whole study thus had a total of 4 × 4 × 2 = 32 conditions. Each test had the same length of 35 items. The credibility and feasibility of classifying the attribute mastery patterns based on DL were compared under the different conditions.
In this simulation study, the response matrix of the examinees was the input layer of DL, and the attribute mastery pattern was the output layer. The classifier was obtained by training on existing data, consisting of the simulated response matrix and the ideal attribute mastery patterns, and the remaining subjects were then classified. For example, if the whole simulated sample is 1000, the response matrix and attribute mastery patterns of 500 samples are used for training, and the remaining 500 samples are tested and classified based on the trained model.
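The simulated response matrix and the training/testing split just described can be sketched as follows, with responses generated under the DINA model; the attribute patterns and Q-matrix here are random placeholders, not the ones in Table 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_responses(alphas, Q, s, g):
    # eta[i, j] = 1 when examinee i masters every attribute item j needs;
    # then X[i, j] = 1 with probability (1 - s)^eta * g^(1 - eta).
    eta = np.all(alphas[:, None, :] >= Q[None, :, :], axis=2)
    p = np.where(eta, 1 - s, g)
    return (rng.random(p.shape) < p).astype(int)

# Placeholder data: 1000 examinees, 6 attributes, 35 items.
alphas = rng.integers(0, 2, size=(1000, 6))
Q = rng.integers(0, 2, size=(35, 6))
X = simulate_responses(alphas, Q, s=0.1, g=0.1)

# One of the two splits used in the study: 500 training, 500 testing.
X_train, X_test = X[:500], X[500:]
print(X_train.shape, X_test.shape)  # (500, 35) (500, 35)
```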
To explore the settings of the model hyperparameters and optimizer hyperparameters, some trials were carried out. The results were best in comparison when set to the following values: the number of layers is 5; the input layer is the response matrix, whose size is the number of item responses; the output layer is the attribute mastery pattern, whose size is the number of mastery patterns; and the three hidden layers have 80, 80 and 60 neurons respectively. As for the optimizer hyperparameters, the learning rate is 1, the number of epochs is 150, and the batch size is 100. Of course, there may be other settings that make the results more accurate, and this requires further experimentation.
The simulation process consists of the following steps. First, once the number and structure of the attributes had been identified as above, the Qr matrix was computed as shown in Table 1, and the ideal mastery patterns were also presented in Table 1. Second, the subjects were generated based on the ideal mastery patterns, with a sample of 1000 subjects, and the real response matrix was simulated under the DINA model. The sample was then divided into two groups, one for training and the other for testing or classification with DL in CDAs. Finally, the factors that influence the accuracy of the identification by DL in CDAs were discussed and summarized. The formulations of the indicators are shown as follows.

The indicators are the Pattern Match Ratio (PMR) and the Marginal Match Ratio (MMR):

PMR = (1/F) \sum_{f=1}^{F} M_f / N,    MMR_k = (1/F) \sum_{f=1}^{F} n_k / N

where F is the number of simulation replications, M_f is the number of examinees whose estimated attribute mastery pattern is the same as the true pattern, n_k is the number of examinees whose mastery of attribute k is estimated correctly, and N is the number of examinees. The mean PMR and mean MMR across the F = 100 replications were obtained and reported as percentages under each data set.
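For a single replication, the two match ratios can be computed directly from the estimated and true pattern matrices (a small sketch with made-up patterns; averaging over F replications is then a plain mean):

```python
import numpy as np

def pmr(est, true):
    # Pattern Match Ratio: share of examinees whose whole estimated
    # pattern equals the true pattern.
    return np.mean(np.all(est == true, axis=1))

def mmr(est, true):
    # Marginal Match Ratio: per-attribute share of examinees whose
    # mastery of that attribute is estimated correctly.
    return np.mean(est == true, axis=0)

true = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]])
est  = np.array([[1, 1, 0], [1, 1, 0], [0, 1, 1], [1, 1, 0]])
print(pmr(est, true))  # 0.5 (2 of 4 whole patterns recovered)
print(mmr(est, true))  # per attribute: 1.0, 0.75, 0.75
```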

Results
The results of the PMR are shown in Table 2. It can be seen from the table that the PMR decreases as the values of s and g increase for the different attribute structures. Moreover, the attribute structures affect the results of classification by DL: the accuracy of the classification decreases with the complexity of the attribute structure. For example, the Divergent structure is relatively the most complicated, and the accuracy for this structure is the lowest.
On the contrary, the Linear structure is relatively the simplest, and its accuracy is the highest. The training sample size in DL is another factor influencing the PMR: the value of PMR increases as the training sample size increases.
The results of the MMR are shown in Table 3. The value of MMR changes with the attribute structure: the more complex the hierarchy, the lower the MMR. For example, the Divergent structure was the most complicated, and its MMR was the lowest, while the Linear structure was the simplest, and its MMR was the highest. Second, the value of MMR decreases as s and g increase. In addition, the training sample size in DL also has an impact on the MMR: the larger the sample, the greater the MMR.

Discussion
Classifying attribute mastery patterns is of great importance in CDAs, and the accuracy of classification directly affects the credibility of CDAs. A number of studies have made careful comparisons of the existing methods; some have extended existing methods and some have explored new ones. As an ANN algorithm, DL has been widely used in industrial applications of AI. DL, as a nonparametric classification method, has also been used in CDAs, and it has been found that the performance of DL was better than that of parametric methods when the observed data did not fit the model well [9]. Against this background, this study is of great value and significance for the application of DL in CDAs, especially for the evaluation of online learning in the future.
The results showed that the more complicated the attribute structure, the lower the classification accuracy; the larger the slipping and guessing values, the lower the classification accuracy; and the larger the training sample size, the higher the classification accuracy. Unlike the previous methods [18] [30], DL in this paper had the lowest accuracy on the Divergent structure, which was the most complicated; the order of accuracy from best to worst was Linear, Convergent, Unstructured and Divergent. These results suggest that when the guessing and slipping parameters are large and the attribute structure is complex, increasing the training sample size can be considered to improve classification accuracy; and when the attribute structure is linear, DL can be chosen as the first choice for classification.
The focus of this paper was to discuss the factors that influence the identification accuracy of DL in CDAs. Only six attributes, four structures and four guessing and slipping rates were discussed, and there are still many problems to be studied. First, whether DL can be used on a small scale and how it would perform will be explored later. Second, there are still some deficiencies in the practical application of this paper, and the application of DL in CDAs should be explored further in the future. Several issues need further study: the current number of attributes was only 6, and as the number of attributes increases, how will the complexity of the attribute structure affect the diagnostic classification of DL; when the Q-matrix is misspecified, what will happen in DL for CDAs with different hierarchies; and how will the parameters of DL, such as the number of layers and neurons, affect the classification? These issues will be studied further. The above conclusions are limited to the simulation setting using the DINA model, with 6 attributes, 4 different structures, and 4 different guessing and slipping rates.