Analyzing Differences between Online Learner Groups during the COVID-19 Pandemic through K-Prototype Clustering

Online learning is an important means of study and has been adopted in many countries worldwide. However, only with the COVID-19 epidemic have researchers been able to collect and analyze massive online learning datasets. In this article, we analyze the differences between online learner groups using an unsupervised machine learning technique, k-prototypes clustering. Specifically, we use a questionnaire designed by domain experts to collect online learning data, and investigate students' online learning behavior and learning outcomes by analyzing the collected questionnaire data. Our results suggest that students with better learning media generally exhibit better online learning behavior and learning outcomes than those with poor learning media. In addition, in both economically developed and undeveloped regions, the number of students with better learning media is smaller than the number of students with poor learning media. Finally, the results show that, whether in an economically developed or undeveloped region, the availability of learning media is an important factor affecting online learning behavior and learning outcomes.


Introduction
Online learning has been growing continuously in the past two decades. Distance education has evolved from offline to online settings with access to the Internet, and, affected by the COVID-19 epidemic, online learning has become a global learning method. Schools and universities have witnessed unprecedented use of online collaboration tools and applications to support the continuing education of students and educators. The scale of digitally supported online learning or remote education increased exponentially in 2020 for those with digital devices connected to the Internet, and likely changed the way education is provided forever [1]. There is no doubt that online education will become an important part of the new global education landscape.

How to cite this paper: Ge, G.G., Guan, Q.L., Wu, L.S., Luo, W.Q. and Zhu, X.Y. (2022) Analyzing Differences between Online Learner Groups during the COVID-19 Pandemic through K-Prototype Clustering. Journal of Data Analysis and Information Processing, 10, 22-42. https://doi.org/10.4236/jdaip.2022.101002
During the COVID-19 epidemic, China carried out large-scale online learning.
The purpose of this study is to research and analyze some of the learning problems students encountered. To this end, we designed a questionnaire and collected a large amount of relevant data. We introduce our research questions in detail in Section 1.2, and describe our questionnaire design, data collection, and data processing in Section 2.

Online Learning
Online learning has been on the increase in the past few decades, but many researchers have focused on specific areas of online education, such as innovations in online learning strategies [2], quality in online education [3], designing sociable online learning environments [4], self-regulated learning in Open Online Courses [5], challenges of online learning [6], self-efficacy and self-regulation in online learning [7], attrition and achievement gaps in online learning [8], and online course dropout [9].
[10] reviewed research on online learning from 1993 to 2004. They reviewed 76 articles and divided the research into four themes: 1) course environment; 2) learners' outcomes; 3) learners' characteristics; and 4) institutional and administrative factors. The authors describe the first theme, the course environment (n = 41, 53.9%), as an overarching theme that includes classroom culture, structural assistance, success factors, online interaction, and evaluation. For their second theme, [10] found that studies focused on exploring learning outcomes in the cognitive and affective domains through various research methods used in the teaching process (n = 29, 38.2%). Another theme focused on learners' characteristics (n = 12, 15.8%), including the social interaction, instructional design, and demographics of online learners. The final theme of their report was the institutional and administrative aspects (n = 13, 17.1%) of online learning. Their findings revealed a lack of scholarly research in this area: most institutions did not have formal policies in place for course development or for faculty and student support in training and evaluation [11]. [12] reviewed 695 articles on distance education and online learning from 2000 to 2008. In this review, the top three topics were interaction and communities of learning (n = 122, 17.6%), instructional design (n = 121, 17.4%), and learner characteristics (n = 113, 16.3%). The least-studied themes (each under 3% of studies) were management and organization (n = 18), research methods in DE and knowledge transfer (n = 13), globalization of education and cross-cultural aspects (n = 13), innovation and change (n = 13), and costs and benefits (n = 12). This study examined research areas in online learning, trends, priority areas, and gaps in distance education research.
[11], building on the previous systematic reviews [10] [12] [13], reviewed 619 articles on online learning from 2009 to 2018. Online learning research in this study is grouped into twelve research themes: Learner characteristics, Instructor characteristics, Course or program design and development, Course Facilitation, Engagement, Course Assessment, Course Technologies, Access, Culture, Equity, Inclusion, and Ethics, Leadership, Policy and Management, Instructor and Learner Support, and Learner Outcomes. In this review, Engagement (n = 179, 28.92%) and Learner Characteristics (n = 134, 21.65%) were the two themes researchers studied most, while articles focusing on Instructor Characteristics (n = 21, 3.39%) were least common. Table 1 shows some of the most and least researched themes on online learning in recent years. Current research in online learning is predominately focused on engagement and learner characteristics. Engagement can be subdivided into many areas, such as social presence [14] [15] [16], teaching presence [17] [18] [19], learner-learner interactions [15] [20] [21], participation patterns in online discussion [22] [23], and so on. Although many studies have been conducted on specific online learning topics, these studies share three problems: 1) they pay more attention to systemic research on education and neglect the detailed teaching experience; 2) it is difficult for them to collect a large amount of data on the research object; 3) there are few studies on the amount of learning media and the impact of learning behavior on students' learning effects. Our research addresses these three deficiencies.

The Present Study
Based on questionnaires designed by education experts and the related data sets collected, this study explores the effects of different educational resources on students' learning behaviors and online learning results. Lee [24] examined perceptions of adequate resources that could facilitate or inhibit students' adoption of an online learning system, and indicated that improving resources is necessary to help students understand and use the online learning system. The study of [25] contributes to knowledge about how textbook resources could be leveraged in a bite-sized e-learning environment. Here we explore the difference between the hardware learning media of students in different clusters.
Finally, previous work has shown that perceived resources [24] affect online learning adoption: the richer the perceived resources, the more positive the influence on online learning. Accordingly, we investigate how clusters with different hardware resources differ in online learning, and examine the difference in learning media and learning effects between students in economically developed regions and students in economically undeveloped regions. As such, this study was guided by four research questions:
• Research question 1: How to distinguish online learners through analyzing their questionnaire data, i.e., how to separate similar online learners from dissimilar ones?
• Research question 2: What do the similar online learners have in common?
• Research question 3: How are the online learners' learning behaviors, e.g., participation and learning time, affected by learning media?
• Research question 4: What are the impacts of learning media on online learners' experiences, such as learning satisfaction and learning outcomes?

Methods
To study our problem, we designed a questionnaire and collected a large amount of data, processed the data, clustered it with the unsupervised machine learning method k-prototypes, and computed statistics on the clustered data. Finally, hypothesis tests were applied to the cluster data. The third part of our article covers data processing and clustering, the fourth part is a statistical analysis of the data, and the fifth part is hypothesis testing; some of the conclusions are very meaningful for the development of online education.

Study Design
This study presents an up-to-date data analysis using an unsupervised machine learning approach. The data for this study come from the questionnaire we administered. Our survey has four main subject groups: students, teachers, parents, and school administrators. We invited education experts to design a different questionnaire for each role. In our research, we mainly study the questionnaire for secondary school students. This questionnaire has a total of 20 questions, covering the region of the student, the grade the student is attending, the length of study time per day, learning behavior, learning status, learning expectations, etc.

Participants
Study participants were primary and secondary school students from an anonymous province in China, their parents, teachers, and their school administrators.
All participants joined our study by filling out online questionnaires anonymously and voluntarily. The area where the students are located is distributed in both urban and rural areas, and the grade distribution is from primary school to high school.
The choice of participants via an online questionnaire follows the literature [26]. The total number of students in the anonymous province is about 15 million. In China, students in grades 1-6, 7-9, and 10-12 are called primary school, middle school, and high school students, respectively. Approximately 37.5% of students, together with their parents, teachers, and school administrators, participated in the survey. All participants were viewed as ideal for this study, as we were interested in the learning situation of students with different learning media in economically developed and economically undeveloped regions. This was the first large-scale online learning conducted by regular schools off campus, which provides important data for our study.

Collecting Data
We collected a total of 5,791,860 student questionnaires. Common concerns with the data we collected include potential cheating and speeding. We define cheating as inconsistency between answers within the questionnaire, and speeding as clicking through questionnaire tasks as quickly as possible, paying minimal or no attention to the task itself. To detect these two factors, we added consistency-check questions to the questionnaire and recorded the time each participant took to fill it out.

Data Cleaning
Data cleaning techniques have been extensively covered in multiple surveys [27] [28] and tutorials [29] [30]. In our study we focus mainly on student data, and we define dirty data in the following three categories:
• Data entry errors: Our questionnaire asks how long the student studies online every day; the selectable range is 0 - 15 hours, and anything beyond this range is considered erroneous input.
• Cheating data errors: The questionnaire contains two related questions: one asks the location and category of the school, the other asks the grade. Based on the answer to the first question, we can determine whether the answer to the second question is consistent. For example, if a participant answers that the school is an urban secondary school, the grade should be between grade 7 and grade 12 (in our study, grades 1 to 6 are defined as primary school, and grades 7 to 12 as secondary school). If the grade is not in this range, we treat the record as cheating data. The same method is used for primary school data.
• "Speeding" data errors: We recorded the time it took each participant to fill in the questionnaire. The student questionnaire has a total of 20 questions and 90 options, so it takes at least 90 seconds to complete. If a participant finished in less than 90 seconds, we regard the response as speeding data.
According to the above three standards, we cleaned the student data: the records of 775,516 participants were deleted, leaving the records of 5,015,344 participants. We also removed some redundant data options; after cleaning, 58 options remain.
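As a sketch, the three cleaning rules can be applied as boolean filters in pandas. The column names and values below are hypothetical stand-ins for the actual questionnaire fields:

```python
import pandas as pd

# Hypothetical column names; the real questionnaire fields differ.
df = pd.DataFrame({
    "online_hours":  [3.5, 20.0, 6.0, 2.0],   # daily online study time
    "school_type":   ["urban_secondary", "rural_primary",
                      "urban_secondary", "rural_primary"],
    "grade":         [8, 9, 10, 3],            # grades 1-12
    "fill_seconds":  [240, 310, 45, 180],      # time spent on the form
})

# Rule 1: entry errors -- study time must lie in the 0-15 hour range.
valid_hours = df["online_hours"].between(0, 15)

# Rule 2: cheating -- school category and grade must be consistent
# (secondary school: grades 7-12, primary school: grades 1-6).
is_secondary = df["school_type"].str.contains("secondary")
valid_grade = (is_secondary & df["grade"].between(7, 12)) | \
              (~is_secondary & df["grade"].between(1, 6))

# Rule 3: speeding -- 20 questions / 90 options need at least 90 seconds.
not_speeding = df["fill_seconds"] >= 90

clean = df[valid_hours & valid_grade & not_speeding]
```

On this toy frame, the second row is dropped for an out-of-range study time and the third for speeding.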

Data Analysis
This study used a combination of unsupervised machine learning (k-prototypes clustering) and non-parametric statistical analyses. The objective of clustering is to partition a set of data objects into clusters such that data objects in the same cluster are more similar to each other than to those in other clusters [31]. Partition clustering algorithms are widely applied clustering statistical methods. The k-means algorithm is used to analyze numeric data, and the k-modes algorithm extends k-means to cluster categorical data [32]. The k-prototypes algorithm integrates the k-means and k-modes algorithms to cluster mixed data [33]. K-prototypes clustering has been used in education fields, such as virtual learning environments [34], educational contents [35], student health monitoring systems [36], and other educational technologies [37]. For an overview of clustering analysis, see [38], and for clustering specifically in educational technology applications, see [39].
We perform k-prototypes clustering using the kmodes [40] [41], pandas, and scikit-learn metrics packages in Python. For k-prototypes clustering, one must determine how many clusters the analysis will create. Here we use the Sum of Squared Errors (SSE) score and the average silhouette, sometimes referred to as cluster validity metrics, to determine the optimal number of clusters. Both can be invoked from the metrics package of scikit-learn. Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license; it provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. These methods are implemented according to the ideas of [42]. Because the data in our questionnaire contain both numerical and categorical attributes, we calculated clusters with a combined measure: Euclidean distance for the numerical attributes and Hamming distance for the categorical attributes. We also used a variety of non-parametric statistics because the data distributions after clustering were non-normal. All non-parametric analyses were conducted in IBM SPSS 23, which offers a range of advanced features, including ad hoc analysis, hypothesis testing, and reporting, making it easy to access and manage data and to select and perform analyses. We used Mann-Whitney U tests or Kruskal-Wallis H tests, and for pairwise comparisons we report p-values adjusted with Bonferroni corrections.
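To make the mixed-distance idea concrete, here is a minimal, self-contained NumPy sketch of k-prototypes (an illustrative re-implementation, not the kmodes package code used in the study): squared Euclidean distance on the numeric attributes plus a weighted mismatch (Hamming) count on the categorical ones, with mean/mode prototype updates. The toy data, the weight gamma, and the deterministic initialization are assumptions for illustration:

```python
import numpy as np

def k_prototypes(X_num, X_cat, k, gamma=1.0, n_iter=20):
    """Minimal k-prototypes sketch: squared Euclidean distance on the
    numeric columns plus gamma times the Hamming (mismatch) count on
    the categorical columns, following Huang's algorithm."""
    n = len(X_num)
    # deterministic initialization: evenly spaced points as prototypes
    idx = np.linspace(0, n - 1, k).astype(int)
    cent_num, cent_cat = X_num[idx].copy(), X_cat[idx].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # mixed distance from every point to every prototype
        d_num = ((X_num[:, None, :] - cent_num[None, :, :]) ** 2).sum(-1)
        d_cat = (X_cat[:, None, :] != cent_cat[None, :, :]).sum(-1)
        labels = np.argmin(d_num + gamma * d_cat, axis=1)
        for j in range(k):
            members = labels == j
            if not members.any():
                continue
            cent_num[j] = X_num[members].mean(axis=0)   # numeric: mean
            for c in range(X_cat.shape[1]):             # categorical: mode
                vals, counts = np.unique(X_cat[members, c],
                                         return_counts=True)
                cent_cat[j, c] = vals[np.argmax(counts)]
    return labels

# Toy mixed data: daily study hours (numeric) and main device (categorical).
X_num = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])
X_cat = np.array([["phone"], ["phone"], ["phone"],
                  ["pc"], ["pc"], ["pc"]])
labels = k_prototypes(X_num, X_cat, k=2)
```

On this toy data the first three respondents and the last three end up in different clusters, mirroring how the mixed distance separates groups that differ in both study time and device.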

Results
• Research question 1: How to distinguish online learners through analyzing their questionnaire data, i.e., how to separate similar online learners from dissimilar ones?
We use the unsupervised clustering method k-prototypes to distinguish learners with different characteristics. Before clustering the data, we used the Sum of Squared Errors and the average silhouette to evaluate into how many categories the data should be clustered. The two evaluations of the optimal number of clusters are shown in Figure 1 and Figure 2. In Figure 2, the k value at the point with the highest score is the number of clusters that should be used. According to the evaluation results, the data classification effect is best when the number of clusters k is 2. Therefore, we use k = 2 to cluster online learners in both developed and undeveloped regions. According to the question types in our survey questionnaire, we divide the questions into three major groups, corresponding to our research questions 2 to 4. Question 2 covers the objective information part of the questionnaire filled out by online learners; we attribute these questions to objective questions. Questions 3 and 4 mainly discuss issues related to students' subjective wishes. In later sections, we detail the content of their questions.
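Both cluster validity metrics can be computed directly. The following NumPy sketch (an illustrative re-implementation, not the scikit-learn code used in the study) shows the SSE (elbow) criterion and the average silhouette on a toy data set:

```python
import numpy as np

def sse(X, labels):
    """Within-cluster Sum of Squared Errors (elbow criterion)."""
    total = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return float(total)

def avg_silhouette(X, labels):
    """Average silhouette over all points (Euclidean distances)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        # a: mean distance to the other members of i's own cluster
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == j].mean()
                for j in np.unique(labels) if j != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated groups: a k = 2 labeling scores close to 1.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
```

In practice one computes both metrics for several candidate values of k and picks the k at the elbow of the SSE curve or the peak of the average silhouette, as in Figure 1 and Figure 2.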
According to the estimated k value, we divide the undeveloped regions into two clusters, and likewise the economically developed regions into two clusters. We computed statistics on the clusters in each case: Table 2 gives the statistical results of the two clusters in the undeveloped regions, and Table 3 gives the statistical results of the two clusters in the developed regions. In the tables, we report the results of the ten options contained in the above questions. Let us first compare the data of the two clusters in the undeveloped regions. In Table 2, we can clearly see that cluster 1 and cluster 2 differ significantly in the distribution of these ten options. In terms of online study hours, the average learning time of cluster 1 is 4.75, while that of cluster 2 is 10.16. In terms of the learning media used, both clusters mainly use smartphones for online learning, but there are gaps between them in computers, tablets, and paper materials: the percentages of cluster 2 for these three options are higher than those of cluster 1; for example, 26.5% of cluster 1 uses computers for online learning, versus 30.4% of cluster 2. Computers, tablets, and other learning equipment can only be purchased under certain economic conditions, which indicates that the second cluster may be the group with better learning media in economically undeveloped regions. At the same time, the percentage of special education in cluster 2 is higher than in cluster 1, which indicates that the school educational resources of cluster 2 should be better than those of cluster 1.
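Per-cluster summaries of this kind can be produced with a pandas groupby. The data frame below uses hypothetical column names and toy values chosen to echo the 4.75 and 10.16 averages above:

```python
import pandas as pd

# Hypothetical cleaned responses with an assigned cluster label.
answers = pd.DataFrame({
    "cluster":       [1, 1, 2, 2, 2],
    "study_hours":   [4.0, 5.5, 9.0, 11.0, 10.5],
    "uses_computer": [0, 1, 1, 0, 1],   # 1 = uses a computer to learn
})

# Per-cluster mean study time and share of computer users, the kind
# of per-option summary reported in Table 2 and Table 3.
summary = answers.groupby("cluster").agg(
    mean_hours=("study_hours", "mean"),
    computer_share=("uses_computer", "mean"),
)
```

For binary (0/1) options, the per-cluster mean directly gives the percentage of students selecting that option.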
These results indicate that clusters with richer teaching and learning media may have better learning behaviors and learning effects.
• Research question 3: How are the online learners' learning behaviors, e.g., participation and learning time, affected by learning media?
We divide students' online learning behaviors into three parts: classroom learning behaviors, learning behaviors when students encounter problems, and learning behaviors after class. The items in these three parts are as follows:
• RQ3-1: Homework submission;
• RQ3-2: In-class test;
• RQ3-3: Video conference;
• RQ3-4: In-class commenting;
• RQ3-5: Viewing homework that achieved an excellent grade;
• RQ3-7: Live commenting;
• RQ3-8: Discussion;
• RQ3-9: Solving independently by searching online;
• RQ3-10: Re-watch recorded lectures when you encounter knowledge points that you have not mastered;
• RQ3-11: Attend Q&A sessions organized by teachers;
• RQ3-12: Ask teachers by using social platforms;
• RQ3-13: Communicate with other students;
• RQ3-14: Re-watch lecture videos after class;
• RQ3-15: Carefully studied other course materials provided by your teacher;
• RQ3-16: Carried out home-based self-study activities;
• RQ3-17: Ask the teacher when you encounter a problem;
• RQ3-18: The quality of the work done online be as good as offline.
Table 4 and Table 5 report statistics on the learning behaviors of students in undeveloped and developed regions, respectively. Items 1 - 8 in the tables are students' classroom learning behaviors, items 9 - 13 are students' learning behaviors when they encounter problems, and items 14 - 18 are students' learning behaviors after class.
These three types of learning behaviors correspond to questions 4, 11, and 14 in our questionnaire. Table 4 (cluster data statistics of undeveloped regions) gives the online learning behaviors of students in different clusters in undeveloped regions. Observing the table, we can see that cluster 2 performs better than cluster 1 in all three types of learning behaviors. In terms of classroom learning behaviors, the most common behaviors are homework submission, in-class tests, and viewing excellent homework; more than 80% of each cluster has submitted homework, while the least common behavior, screen sharing, is only about 10%. There is a large gap between the clusters in in-class tests: cluster 2 undeveloped (in-class test mean = 0.44) > cluster 1 undeveloped (in-class test mean = 0.33). When students encounter knowledge they do not understand in online learning, they usually re-watch the recorded lectures or solve the problem independently by searching online; they seldom attend the Q&A sessions organized by teachers or ask teachers via social platforms. Even so, the students in cluster 2 outperform those in cluster 1 on both of these behaviors: cluster 2 undeveloped (Q&A mean = 0.34) > cluster 1 undeveloped (Q&A mean = 0.25), and cluster 2 undeveloped (ask teachers via social platforms mean = 0.34) > cluster 1 undeveloped (ask teachers via social platforms mean = 0.25). This suggests that cluster 2 may have better teacher resources than cluster 1. After class, most students earnestly study the other course materials provided by the teacher and carry out self-study activities at home. Table 5 gives the statistical information of the different clusters in developed regions. In developed regions, we again find that cluster 2 performs better than cluster 1, and the gap in some aspects is relatively large. For example, the proportion of students in cluster 2 taking in-class tests is 16% higher than that of cluster 1, and the proportions of cluster 2 students attending teacher-organized Q&A sessions and asking teachers questions via social platforms are 10% and 13% higher than those of cluster 1, respectively. This shows that even in developed regions, due to differences in learning media, different student groups may have different learning behaviors. In terms of the various learning behaviors, the students in cluster 1 are clearly worse off than those in cluster 2, and the gap is larger in developed regions.
Although developed and undeveloped regions each split into different student groups due to differences in teacher resources and learning media, the data in Table 4 and Table 5 show that the learning behaviors of students in developed regions are generally better than those in undeveloped regions. We compare cluster 1 in developed regions with cluster 1 in undeveloped regions, and cluster 2 in developed regions with cluster 2 in undeveloped regions, in order to find the differences between the groups with fewer learning media in the two kinds of regions, and likewise between the groups with more teacher resources and learning media. Comparing the cluster 1 groups, students in developed regions perform better, for example in attending Q&A sessions organized by teachers. Comparing the cluster 2 groups, i.e., the groups with better learning media in developed and undeveloped regions, there is also a large difference between them, which shows that economic development status determines differences in learning media, which in turn largely affect students' learning behavior.
• Research question 4: What are the impacts of learning media on online learners' experiences, such as learning satisfaction and learning outcomes?
In addition to differences in teacher resources, learning media, and learning behaviors, there are also some differences in students' online learning experience. Table 6 and Table 7 give the data of the undeveloped and developed regions, respectively, obtained after clustering. Table 6 gives the statistics after the undeveloped regions are divided into two clusters. Comparing the data of cluster 1 and cluster 2 in undeveloped regions, there are some obvious differences. The students in cluster 1 are not as serious as the students in cluster 2 (cluster 1 undeveloped (mean = 2.46) > cluster 2 undeveloped (mean = 2.30)), and the probability of eyestrain caused by staring at screens for long periods is higher in cluster 2 than in cluster 1 (cluster 1 undeveloped (mean = 0.75) < cluster 2 undeveloped (mean = 0.81)). Cluster 2 also scores higher on confusion in setting up the platforms (cluster 1 undeveloped (mean = 0.17) < cluster 2 undeveloped (mean = 0.22)) and on digital resource utilization ability (cluster 1 undeveloped (mean = 0.34) < cluster 2 undeveloped (mean = 0.38)). This indicates that the students in cluster 2 have more online learning media than those in cluster 1: they use more teaching software and believe that they have cultivated their digital resource utilization ability. The same pattern appears in their satisfaction with teachers (cluster 1 undeveloped (mean = 1.92) > cluster 2 undeveloped (mean = 1.90)) and with online learning media (cluster 1 undeveloped (mean = 2.13) > cluster 2 undeveloped (mean = 2.09)): students in cluster 2 are more satisfied with both. Table 7 shows the statistical results of the two clusters in developed regions.
The result of the difference between the two clusters in the developed regions is basically the same as the result of the difference between the two clusters in the undeveloped regions, but the differences between some internal clusters in the developed regions may be greater, for example, digital utilization ability (cluster 1 developed (Mean = 0.42) < cluster 2 developed (Mean = 0.53)), communication ability (cluster 1 developed (Mean = 0.26) < cluster 2 developed (Mean = 0.33)).
To compare developed and undeveloped regions, we also compare cluster 1 of the developed regions with cluster 1 of the undeveloped regions, and cluster 2 of the developed regions with cluster 2 of the undeveloped regions. For cluster 1, the groups in developed regions have a better experience in learning status (cluster 1 undeveloped (mean = 2.46) > cluster 1 developed (mean = 2.31)), satisfaction with teachers (cluster 1 undeveloped (mean = 1.92) > cluster 1 developed (mean = 1.79)), and satisfaction with online learning media. For cluster 2, the difference between developed and undeveloped regions is similar to that of cluster 1. This indicates that differences in learning media lead to differences in students' online learning experience. In general, there is more learning media in developed regions, and students' satisfaction with online learning and their overall experience are higher; at the same time, they also feel more tired and use more online teaching software. Students in undeveloped regions are not as well off as those in developed regions in terms of online learning satisfaction and learning fatigue, but they believe that online education has changed the inequality of educational resources to a certain extent, allowing them to hear more courses from famous teachers.

Conclusions
In the previous sections, we reported statistics on the data when answering the research questions.
Although the statistics of the different groups in different regions differ, are their distributions the same? To answer this, we performed Mann-Whitney U tests of four kinds: developed-region cluster 1 vs. cluster 2, undeveloped-region cluster 1 vs. cluster 2, developed-region cluster 1 vs. undeveloped-region cluster 1, and developed-region cluster 2 vs. undeveloped-region cluster 2. For each of these four tests, we tested the groups' performance on questions 2 to 4; the p-values were all below 0.001, which indicates that the distributions of the different groups are not the same. Therefore, our research method of collecting data from different groups and comparing them is meaningful.
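For illustration, a Mann-Whitney U test with a Bonferroni adjustment can be sketched as follows; this is a plain-Python normal-approximation version standing in for the SPSS procedure actually used in the study:

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation,
    adequate for samples as large as those in this study.
    Ties receive midranks; no tie correction is applied."""
    n1, n2 = len(x), len(y)
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + j + 1) / 2.0        # average of ranks i+1 .. j
        for t in range(i, j):
            ranks[pooled[t][1]] = mid
        i = j
    r1 = sum(ranks[:n1])               # rank sum of the first sample
    u = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return u, p

def bonferroni(pvals):
    """Bonferroni adjustment for a family of pairwise comparisons."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]
```

With samples of millions of respondents the normal approximation is essentially exact; for small samples an exact test would be preferable.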
Online education has long been a research direction for educators. Due to the COVID-19 epidemic, large-scale online learning by students provides a good case for our research. This allowed us to collect large-scale data that are difficult to collect normally, because, under normal circumstances, not so many people choose to study online. When most scholars research online learning, the participants are often relatively limited; for example, many studies draw their participants from a single website or a single university.
This has led to very little research on the online learning situation of students in different regions. Our research can provide a good reference for future research on online learning and economic development. Education plays an important role in economic development, and at the same time the economy provides an important guarantee for the development of education. Higher education has had a positive and significant impact on economic development, with engineering and natural science majors playing the most prominent role in this process [43]. The research of Mishra and Agarwal [44] showed that for undeveloped countries, economic growth often corresponds to an increase in education expenditure. Our research likewise starts from different conditions of economic development to study the differences in the online learning status of different student groups. We asked four questions in total and answered them based on the statistical results. We can summarize the answers as follows: 1) Whether in developed or undeveloped regions, students form different groups based on their learning media, teacher resources, and learning behaviors; 2) Within the same region, different groups differ considerably in learning media, and the difference between groups in undeveloped regions is smaller than the difference between groups in developed regions; 3) Learning media in developed regions are better than those in undeveloped regions.
This leads students in developed regions to perform better than students in undeveloped regions in terms of online learning behavior, and their learning experience is also better; 4) Students in developed regions spend more time on online learning on average, have more learning software installed, and are tutored by teachers more often, so they feel more exhausted than students in undeveloped regions. However, our research also has some shortcomings. Our data did not include test scores, which makes it difficult for us to analyze whether students who learn online perform better than offline students. We could also study whether students' good online learning behaviors affect their academic performance. We may explore these questions in future work.