
The academic community currently faces challenges in analyzing and evaluating the progress of a student's academic performance. In practice, classifying student performance is a scientifically challenging task. Some recent studies apply cluster analysis to evaluate student results and use statistical techniques to partition scores according to performance; this approach, however, is not efficient. In this study, we combine two techniques, the K-means algorithm and the Elbow method, to evaluate student performance. Based on this combination, the results will be more accurate when analyzing and evaluating the progress of a student's performance. The methodology was implemented on student test scores to identify the most informative groupings.

Clustering is one of the most significant techniques in data mining that explores data sets [

The Elbow method is one of the most common approaches for identifying the optimal number of clusters K in a data set. Syakur et al. took advantage of this approach by combining the traditional K-means algorithm with the Elbow method to determine the optimal number of clusters for segmented performance profiles [

The primary objective of this research was to introduce a new clustering model that combines the K-means algorithm with four functionalities: the Elbow method, scaling, normalization, and standardization of a dataset. The dataset, for computer science students from Oakland University, was organized by the following attributes: course name, course grade, cumulative GPA, and the number of learning segments requiring more attention. Together, these attributes characterized a student's performance. After clustering the students into groups, an improvement plan was structured for each group, emphasizing the areas where each student was not performing well and recommending chapters to review, homework to retake, and topics to dedicate more attention to.

The rest of this paper is organized as follows: Section Two describes the broad theoretical methodology applied in the study; Section Three presents the results and discussion; and Section Four concludes the paper with suggestions for future work.

In the proposed system, the main intent of this research study was to optimize the number of clusters for the K-means approach, using the Elbow method to define the number of clusters during the evaluation process. The proposed method is described by the following flowchart in

Datasets covering one semester of study for computer science students were selected to analyze and make predictions about future student performance. The datasets were passed through four stages of processing. The first stage converted the grade/course data type from a string to a single numeric value. The second stage was scaling, which applies standardization/normalization to dataset features whose magnitudes vary more than others. The third stage applied the Elbow method to the dataset to define the optimal number of K-clusters. The fourth stage ran the K-means algorithm on the dataset to partition the students into clusters based on their performance, while the SSE (the Error Sum of Squares: the sum of the squared differences between each observation and its group's mean) was calculated and recorded to determine the number of clusters.
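The four stages can be sketched end-to-end with Scikit-Learn, which the study names for scaling. The letter-grade map and the sample records below are illustrative assumptions, not the study's actual dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stage 1: convert letter grades (strings) to numbers (assumed grade-point map).
GRADE_POINTS = {"A": 4.0, "A-": 3.8, "B+": 3.6, "B": 3.3, "B-": 3.0}
records = [("B+", 1.54), ("B-", 1.54), ("A", 1.65),
           ("A", 1.65), ("A-", 1.87), ("B-", 1.87)]
X = np.array([[GRADE_POINTS[g], gpa] for g, gpa in records])

# Stage 2: scale, so that neither grade nor GPA dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Stage 3: Elbow method -- record the SSE (inertia) for candidate values of k.
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
       for k in range(1, 5)}

# Stage 4: run K-means with the k chosen at the elbow of the SSE curve.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

In practice the elbow is read off a plot of `sse` against k; here k = 2 is only a placeholder choice for the tiny sample.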

Scaling a dataset is a method of standardizing/normalizing the range of its independent variables and a common requirement for many machine learning estimators implemented in Scikit-Learn (for the Python programming language). If one feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features as correctly as expected [

$z = \frac{x - \mu}{\sigma}$ (1)

In this formulation, z is the standardized/normalized value, x is the raw value of the data point, μ is the population mean, and σ is the population standard deviation of the dataset.
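Equation (1) can be applied with a few lines of numpy; the GPA values below are made up for the example.

```python
import numpy as np

def z_score(x):
    """Standardize values via Equation (1): z = (x - mu) / sigma."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()        # population mean
    sigma = x.std()      # population standard deviation (ddof=0)
    return (x - mu) / sigma

gpas = [1.54, 1.65, 1.87, 2.0, 2.33]   # illustrative GPA values
z = z_score(gpas)
```

After standardization the values have mean 0 and standard deviation 1, which puts every feature on the same footing.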

The K-means technique is a type of partitioning/clustering method that was first established by J. B. MacQueen [

$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2$ (2)

In this formulation, W(C_k) represents the within-cluster variation, x_i is a data point, C_k is the cluster to which it belongs, and μ_k is the mean value of the points assigned to cluster C_k. The total within-cluster sum of squares (TW), which measures compactness, is therefore

$TW = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2$ (3)
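Equations (2) and (3) can be checked directly in numpy; the toy points and labels below are illustrative assumptions.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Compute W(C_k) per cluster (Eq. 2) and the total TW (Eq. 3)."""
    W = {}
    for k in np.unique(labels):
        pts = X[labels == k]
        mu_k = pts.mean(axis=0)                   # cluster mean
        W[k] = float(((pts - mu_k) ** 2).sum())   # within-cluster sum of squares
    return W, sum(W.values())

# Two tight toy clusters (illustrative values).
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])
W, TW = within_cluster_ss(X, labels)
```

A good clustering makes each W(C_k), and hence TW, as small as possible.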

The Elbow approach is a technique which looks at the percentage of variance explained as a function of the number of clusters K in the K-means [

The Elbow method is described by Equation (4) as the within-groups sum of squares (WSS), where for each of the m data points the squared distance is measured from the data point x_i to the centroid c_i of the cluster it is assigned to [

$WSS = \sum_{i=1}^{m} (x_i - c_i)^2$ (4)

Combining the K-means and Elbow methods locates the optimal value of k, the number of clusters to form. The Elbow method is used to choose the best number of clusters k for grouping data with the K-means technique, and can be expressed by the sum of squared errors [

$SSE = \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - c_k \rVert_2^2$ (5)

In this formulation, K is the number of clusters formed, S_k is the k-th cluster with centroid c_k, and x_i represents the data points assigned to each cluster.
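The SSE of Equation (5) as a function of k, from which the elbow is read, can be sketched with a minimal pure-numpy Lloyd's iteration; this stands in for a library K-means, and the data points are invented for the example.

```python
import numpy as np

def kmeans_sse(X, k, n_init=5, n_iter=50, seed=0):
    """Minimal Lloyd's K-means; returns the best SSE (Eq. 5) over n_init restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        # Start from k distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            # Assign each point to its nearest center, then recompute the centers.
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, float(d.min(axis=1).sum()))
    return best

# Three tight, well-separated pairs of points (illustrative data).
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0],
              [5.1, 4.9], [9.0, 1.0], [9.1, 0.9]])
sse_curve = {k: kmeans_sse(X, k) for k in range(1, 5)}
```

Plotting `sse_curve` against k shows a sharp drop that levels off at the elbow (k = 3 for this data), which is taken as the number of clusters to form.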

In this study, the previously described approach has been followed based upon a computer generated algorithm to group students in multiple clusters based on their performance.

From the proposed system, the preprocessing stage has been used to convert the data type from string to numeric value, as shown in

Course Number | Grade | GPA
---|---|---
CSI-1420 | B+ | 1.54
CSI-2310 | B− | 1.54
CSI-2300 | A | 1.65
CSI-2999 | A | 1.65
CSI-3660 | A− | 1.87
CSI-2310 | B− | 1.87

Course Number | Grade | GPA
---|---|---
CSI-1420 | 3.6 | 1.54
CSI-2310 | 3 | 1.54
CSI-2300 | 4 | 1.65
CSI-2999 | 4 | 1.65
CSI-3660 | 3.8 | 1.87
CSI-2310 | 3.2 | 1.87
CSI-2999 | 4 | 2
CSI-2999 | 4 | 2.1
CSI-2310 | 2.8 | 2.33
CSI-2999 | 2.9 | 2.36
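The conversion between the two tables above can be sketched as a simple lookup; the letter-to-point map is inferred from the tables and is an assumption, not an official grading scale.

```python
# Letter-grade -> grade-point conversion for the preprocessing stage.
# The mapping below is inferred from the tables and is an assumption.
GRADE_POINTS = {"A": 4.0, "A-": 3.8, "B+": 3.6, "B": 3.4, "B-": 3.0}

def to_numeric(rows):
    """Replace the Grade string in each (course, grade, gpa) row with a number."""
    return [(course, GRADE_POINTS[grade], gpa) for course, grade, gpa in rows]

raw = [("CSI-1420", "B+", 1.54), ("CSI-2310", "B-", 1.54), ("CSI-2300", "A", 1.65)]
print(to_numeric(raw))
# [('CSI-1420', 3.6, 1.54), ('CSI-2310', 3.0, 1.54), ('CSI-2300', 4.0, 1.65)]
```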

In

The scaling technique is structured to standardize/normalize the datasets, a common procedure followed by many machine learning estimators that use Scikit-Learn. If one feature of the dataset has values much larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected. Likewise, if the data features do not look like normally distributed data, the estimator may behave badly. For example, assume there are two features describing a person, weight in pounds (lbs.) and height in feet (ft.), and the goal is to predict whether the person needs an "S" or "L" size shirt; a simple approach takes the sum of weight and height to determine the best fit. To clarify, suppose one person in cluster X = (175 lbs., 5.9 ft.) wears size "L" and another in cluster Y = (115 lbs., 5.2 ft.) wears size "S". A third person Z = (140 lbs., 6.1 ft.) will then be assigned to the cluster nearest to X or Y. If the features are not scaled, the height will barely affect the clustering, and Z will be allotted to cluster "S". From
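The shirt-size example can be checked numerically; standardizing over just three people is illustrative only.

```python
import numpy as np

# Shirt-size example: weight (lbs.) dominates height (ft.) when unscaled.
X = np.array([175.0, 5.9])   # person in cluster X, size "L"
Y = np.array([115.0, 5.2])   # person in cluster Y, size "S"
Z = np.array([140.0, 6.1])   # new person to classify

def nearest(z, x, y):
    """Return which cluster center, "X" or "Y", is closer to z."""
    return "X" if np.linalg.norm(z - x) < np.linalg.norm(z - y) else "Y"

print(nearest(Z, X, Y))      # unscaled: weight dominates, Z lands in "Y" (size S)

# Standardize each feature across the three people, then compare again.
data = np.vstack([X, Y, Z])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
print(nearest(scaled[2], scaled[0], scaled[1]))   # scaled: height counts, Z lands in "X"
```

After scaling, the 0.9 ft. height difference between Z and Y carries as much weight as the pound differences, pulling Z into the "L" cluster.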

From

In

In

as expected at k = 2: because the dataset features contain different ranges of values, it is important to scale the features to the same range to obtain more accurate results from the Elbow method. A small dataset was considered in this study; with a large dataset, the differences in the elbow between the scaled and unscaled data would be more noticeable. Applying this model demonstrated that the optimal number of clusters can be found for any given dataset.

This research paper described a simple and efficient method to help college students who are under-performing, are very close to falling under their university's minimum academic standards, or are on academic probation as indicated by their GPA. Many universities have a 2.0 minimum GPA standard and will issue a warning when a student's GPA falls below it, providing a probationary period of one semester to raise the GPA. Students on probation are usually not able to participate in college activities, including working at the school or receiving scholarships, and will find themselves out of school in one semester if they do not improve their GPA. Previous research combining the K-means clustering algorithm with different methodologies provided the inspiration for creating an efficient procedure that gives failing students targeted suggestions for raising their performance above the 2.0 minimum standard. This procedure combined the K-means algorithm with four functionalities: the Elbow method, scaling, normalization, and standardization. First, a dataset was selected containing GPAs and grades per course for computer science students from Oakland University. The dataset was set up so that students were clustered into groups based on performance similarity for GPA and grades per course, with the number of clusters k provided to the K-means algorithm as input. Next, the Elbow method was applied to the same dataset to define its optimal number of clusters.
Following that, the K-means algorithm was combined with the Elbow method and scaled data, with the datasets passed through four stages of processing: 1) converting the grade/course data type from string to number; 2) scaling, applying both standardization and normalization to the dataset features with differing ranges, to improve clustering accuracy; 3) applying the Elbow method to the dataset to define the optimal number of clusters K for grouping students by performance, while the SSE (Sum of Squared Errors) was calculated and recorded to determine the number of clusters; and 4) comparing the three scenarios, where the numerical results revealed that the third scenario, in which the data was scaled before combining K-means with the Elbow method, was the most accurate and efficient at optimally clustering students based on their academic results. This clearly demonstrated the advantage of scaling the data before combining the K-means algorithm with the Elbow method. After clustering students into groups, an improvement plan was structured for each group based on its performance, focusing on the areas where students were not performing well and suggesting chapters to review, homework to retake, and topics to dedicate more focus to.

The authors declare no conflicts of interest regarding the publication of this paper.

Omar, T., Alzahrani, A. and Zohdy, M. (2020) Clustering Approach for Analyzing the Student’s Efficiency and Performance Based on Data. Journal of Data Analysis and Information Processing, 8, 171-182. https://doi.org/10.4236/jdaip.2020.83010

- z: Standardized value.

- x: Raw value of the data point.

- μ: Population mean.

- σ: Population standard deviation of the dataset.

- k: Number of clusters.

- W(C_k): Total within-cluster variation.

- x_i: Data point in a cluster.

- C_k: Cluster for each data point.

- WSS: Within-groups Sum of Squares.

- SSE: Sum of Squared Errors.

- μ_k: Mean value of the points assigned to cluster C_k.

- TW: Total within-cluster sum of squares (measures compactness).