Human Perception of Group Synchronization Error in Remote Learning: Dependencies of Voice and Video Contents in One-Way Communication
Int. J. Communications, Network and System Sciences

This paper examines how human perception of group (or inter-destination) synchronization error in remote learning depends on voice and video contents, by means of Quality of Experience (QoE) assessment. In our assessment, we use two videos and three voices (two voices for one video and one voice for the other video). We also investigate the influences of silence periods in the voices and of temporal relations between the voices and videos (called the tightly-coupled and loosely-coupled contents here). The voices are spoken by a teacher according to the videos. Each subject, acting as a student, assesses the group synchronization quality by watching each lecture video while listening to the corresponding explanation voice, and then answers whether he/she perceives the group synchronization error or not. Assessment results illustrate that silence periods reduce the perception rate of the error, and that the error is more easily perceived for tightly-coupled contents than for loosely-coupled ones.

The applications sometimes need to output multiple media streams synchronously at all the terminals (i.e., group (or inter-destination) synchronization) [10] [11] [12] [13]. If the output timings of each media unit (MU), which is an information unit for media synchronization such as a video picture or a voice packet, differ among multiple terminals, the quality of experience (QoE) [14] [15] may be seriously degraded. This may influence the learning effect in remote learning, for example.
To solve the problem, it is necessary to carry out group synchronization control [16]-[21], which adjusts the output timing of each MU among multiple terminals (or destinations). In [22] and [23], two types of error ranges are employed under media synchronization control. One is the imperceptible range, in which users cannot perceive the error, and the other is the allowable range, in which users feel that the synchronization error is allowable. Thus, it is important to clarify human perception of group synchronization errors. However, the perception has not been sufficiently clarified so far.
Some papers clarify the human perception of media synchronization errors such as lip synchronization and pointer synchronization errors [24] [25] [26]. In [24], Steinmetz clarifies the human perception of lip synchronization error and pointer synchronization error. He concludes that lip synchronization errors within about ±80 ms are hardly perceivable, and almost everyone perceives errors beyond around ±160 ms; also, pointer synchronization errors between about −750 ms and +500 ms are hardly perceivable, and almost everyone perceives errors less than around −1000 ms or larger than about +1250 ms. The results mean that the human perception depends on voice and video contents. In [25], Staelens et al. investigate the influence of lip synchronization error on the ability to perform real-time language interpretation during video conferencing. Younkin and Corriveau obtain the minimum amount of audio-visual synchronization error that can be detected by users [26]. However, the authors of [24] [25] [26] do not handle the human perception of group synchronization error.
In this paper, therefore, we clarify the human perception of group synchronization error in remote learning. To investigate how the perception depends on voice and video contents, we use two types of video contents and the corresponding voice contents with/without silence periods; the two types of contents differ in how tightly the voice and video contents are related to each other.
The remainder of this paper is organized as follows. Section 2 describes the group synchronization in remote learning. Section 3 explains the assessment method. Section 4 presents and discusses assessment results. Section 5 concludes the paper.

Group Synchronization in Remote Learning
In remote learning, it is necessary to perform the group synchronization control [15] [16] [17] [18] [19], which tries to output each MU simultaneously at all the different terminals in multicast communication. If the control is not carried out, each MU cannot be output at the same time at the terminals; that is, the group synchronization error occurs.
The configuration of our remote learning system is shown in Figure 1. The system consists of a single teacher terminal, N (≥1) student terminals, and a file server. The teacher terminal uses a microphone, and each student terminal employs a headset. The file server multicasts a video stream to the teacher terminal and all the student terminals. The teacher orally explains the video contents while watching the video. The voice stream of the teacher captured via the microphone is multicast from the teacher terminal to all the student terminals. Each student listens to the teacher's voice while watching the same video.
In Figure 1, the video delay from the file server to the teacher terminal is denoted by D_ft, and the video delay from the file server to student terminal i (1 ≤ i ≤ N) is denoted by D_fsi. Also, the voice delay from the teacher terminal to student terminal i is denoted by D_tsi. We can define the two types of group synchronization errors as the differences among D_ft, D_fs1, …, D_fsN, and those among D_ts1, …, D_tsN in Figure 1. We assume here that global clocks (that is, clock ticks at the sources and destinations have the same advancement, and the current local times are also the same [10] [11]) are used at all the terminals and the file server.
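As a minimal sketch of the definition above, the following Python snippet computes the pairwise differences among the video delays and among the voice delays. The function name `pairwise_errors`, the terminal labels, and all delay values are hypothetical and chosen only for illustration; they are not taken from the assessment.

```python
from itertools import combinations

def pairwise_errors(delays_ms):
    """Return {(a, b): delay[a] - delay[b]} for every pair of terminals."""
    return {(a, b): delays_ms[a] - delays_ms[b]
            for a, b in combinations(delays_ms, 2)}

# Hypothetical delays in ms (illustrative values, not measurements).
video_delays = {"teacher": 120, "student1": 80, "student2": 200}  # D_ft, D_fs1, D_fs2
voice_delays = {"student1": 40, "student2": 90}                   # D_ts1, D_ts2

# Group synchronization errors about the video and about the voice.
video_errors = pairwise_errors(video_delays)
voice_errors = pairwise_errors(voice_delays)
```

With N student terminals, the video errors cover every pair among the teacher and student terminals, whereas the voice errors cover only pairs of student terminals.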

Assessment Method
In our assessment system, we set N = 1 and D_ts1 = 0 for simplicity. In this case, the group synchronization error about the video is expressed by D_ft − D_fs1. If the voice delay exists (i.e., D_ts1 ≠ 0), the student starts to hear the voice at D_ft + D_ts1. Therefore, if D_ft + D_ts1 − D_fs1 = 0, the student does not perceive any synchronization error. We examined the influence of D_ts1 on the human perception of group synchronization error [26] and found that the perception rate of the error depends on the group synchronization error plus the voice delay. Thus, we can set D_ts1 = 0 without loss of generality in this paper. Because the group synchronization error is D_ft − D_fs1, we can produce the error by introducing a difference of D_ft − D_fs1 between the starting times of the voice and video at the student terminal, as shown in Figure 2.
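The relation between the delays and the start-time difference can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the sign convention (a positive error means the voice starts later than the video at the student terminal) is inferred from the expression D_ft + D_ts1 − D_fs1.

```python
def effective_error_ms(d_ft, d_fs1, d_ts1=0):
    """Error seen at student terminal 1: D_ft + D_ts1 - D_fs1 (all in ms)."""
    return d_ft + d_ts1 - d_fs1

def start_offsets_ms(error_ms):
    """Return (voice_start, video_start) offsets in ms, both non-negative,
    such that voice_start - video_start equals error_ms.
    Assumed convention: positive error delays the voice, negative error
    delays the video."""
    if error_ms >= 0:
        return error_ms, 0
    return 0, -error_ms

# Hypothetical example: produce an error of -150 ms with D_ts1 = 0.
voice_start, video_start = start_offsets_ms(effective_error_ms(100, 250))
```

Delaying only one of the two outputs keeps the other starting immediately, so each presentation is lengthened by no more than the absolute error.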
In Figure 2, student terminal 1 stores the video files and thus also plays the role of the video server. The terminal also stores voice files that were recorded in advance so that the explanation is always spoken in the same way in the assessment; one of the authors played the teacher's role and recorded her voice. Each subject used the headset at student terminal 1. We produced the group synchronization error by changing the start times of voice and video outputs at student terminal 1. At the beginning of the assessment, we presented the perfect situation (i.e., the group synchronization error is zero) to each subject; that is, we started to output the voice and video files simultaneously at the student terminal. We used the single stimulus method [27] for QoE subjective assessment. However, when a subject requested the perfect situation during the assessment, we showed it again. The subject did not know the value of the error presented in the assessment.
After presenting each error, we asked each subject (student) the following question: "Did you perceive the group synchronization error?" The subject answered either "Yes" or "No." He/she judged whether the error was perceived or not by monitoring the temporal relation between the teacher's voice and the displayed video contents.
To examine the dependency on voice and video contents, as shown in Table 1 and Figure 3, we used three voices (called Voices 1, 2, and 3 here) and two videos (called Videos 1 and 2), chosen in terms of the following two factors: temporal relation (called tightly-coupled or loosely-coupled in this paper) and silence periods (with or without silence periods). Tightly-coupled contents have tighter temporal relations between the voice and video than loosely-coupled contents, in which the voice does not have such relations. As loosely-coupled contents, we did not use contents without silence periods (see Table 1), because without silence periods the voice contents change according to the scene changes in the video contents; that is, the voice and video contents are tightly-coupled in this case. Note that voice and video contents in lip synchronization are much more tightly related to each other than the tightly-coupled contents handled in this paper.

No Silence Period
We used Video 1 and Voice 1, which teach I/O devices (a mouse and a printer). The video and voice contents explain the structure and pointer of the mouse, and the charging, exposing, developing, transferring, and fusing of the printer. Their output duration is 1 minute and 32 seconds, as shown in Figure 3(a). The voice does not include any silence period.

Tightly-Coupled Contents
We employed Video 1 and Voice 2 explaining the I/O devices (see Table 2). The explanation of Voice 2 is almost the same as that of Voice 1, but Voice 2 is simplified from Voice 1 by reducing the number of words to produce silence periods, as shown in Figure 3(b). The number of words in Voice 2 is about 110, and that in Voice 1 is around 170. Note that Voice 2 starts to explain each video scene when the scene change occurs. During the assessment, each subject watched Video 1 and listened to Voice 2. We changed the group synchronization error from −700 ms to +550 ms at intervals of 50 ms.
The total assessment time per subject was about 95 minutes including break times. The subjects were 15 females, and their ages were between 28 and 37.

Loosely-Coupled Contents
We used Video 2, which has six scenes of animals [28]-[33] (cat, bear, dog, elephant, bird, and tiger, as shown in Table 2), and Voice 3, which teaches general English vocabulary, namely the names of these six animals, as follows: "This animal is xxx, it's called yyy, and its spell is zzz." In this explanation, "yyy" and "zzz" are Eng- It should be noted that the locations of silence periods in Figure 3(c) are similar to those in Figure 3(b). We selected a random order of the three different starting times for each subject. For the starting time of 0 sec., we changed the group synchronization error from −700 ms to −300 ms at intervals of 50 ms in random order. For the starting time of 2.5 sec., we changed the error from −600 ms to +600 ms at intervals of 100 ms in random order. For the starting time of 5.0 sec., we changed the error from +300 ms to +700 ms at intervals of 50 ms in random order.
The total assessment time per subject was about 80 minutes including break times. The subjects were 13 females and 2 males, and their ages were between 33 and 39.
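The error levels and randomized presentation order described above can be sketched as follows. The error ranges and step sizes come from the text; the names `error_levels`, `LEVELS`, and `presentation_plan`, the grouping of errors by starting time, and the seed value are assumptions made for illustration only.

```python
import random

def error_levels(first_ms, last_ms, step_ms):
    """Inclusive list of error values (ms) from first_ms to last_ms."""
    return list(range(first_ms, last_ms + 1, step_ms))

# Error levels per starting time (sec.), as stated in the text.
LEVELS = {
    0.0: error_levels(-700, -300, 50),
    2.5: error_levels(-600, 600, 100),
    5.0: error_levels(300, 700, 50),
}

def presentation_plan(rng):
    """Shuffle the order of the starting times, then the errors within each.
    Grouping the errors by starting time is an assumption of this sketch."""
    starts = list(LEVELS)
    rng.shuffle(starts)
    plan = []
    for start in starts:
        errors = LEVELS[start][:]   # copy so LEVELS stays unchanged
        rng.shuffle(errors)
        plan.append((start, errors))
    return plan

# One plan per subject; the seed 0 is an arbitrary illustrative choice.
plan = presentation_plan(random.Random(0))
```

This yields 9, 13, and 9 error levels for the starting times of 0 sec., 2.5 sec., and 5.0 sec., respectively.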

Assessment Results
In this section, we show assessment results for voice contents having no silence period (Voice 1) and silence periods (Voice 2). We also show the results of tightly-coupled and loosely-coupled contents for voices with silence periods (Voices 2 and 3, respectively).

No Silence Period
We plot the perception rate as a function of the group synchronization error for Voice 1 and Video 1 in Figure 4 (the results of Voice 2 are explained in the subsection on tightly-coupled contents). The error was produced by changing the start times of voice and video at the student terminal as described in Section 3. In Figure 4, we see that the perception rate is 0% when the group synchronization error is between about −150 ms and +150 ms (the results are almost the same as those in [34]). When the absolute error exceeds about 150 ms, the perception rate starts to increase and reaches 100% at the absolute error of 500 ms. If we assume that the imperceptible range denotes a range in which the perception rate is less than or equal to 20% [34] [35], the range is between around −200 ms and +200 ms. If the allowable range is assumed to be a range in which the perception rate is greater than or equal to 60% [34] [35], the range is beyond the absolute error of about 300 ms. The group synchronization error in this paper is more difficult to perceive than the lip synchronization error [24], which is hardly perceivable only within about ±80 ms, but it is more easily perceived than the pointer synchronization error [24] (see the lip and pointer synchronization errors in Section 1).
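The way the perception rate and the two thresholds are applied can be sketched as follows. The 20% and 60% thresholds come from the text; the function names and the answers of five subjects are hypothetical and are not the measured data behind Figure 4.

```python
def perception_rate(answers):
    """Percentage of subjects who answered "Yes" (True)."""
    return 100.0 * sum(answers) / len(answers)

def errors_where(rates, predicate):
    """Sorted error values (ms) whose perception rate satisfies predicate."""
    return sorted(e for e, r in rates.items() if predicate(r))

# Hypothetical Yes/No answers of five subjects per error value (ms).
answers = {
    -300: [True, True, True, False, False],
    -200: [True, False, False, False, False],
    0: [False, False, False, False, False],
    200: [True, False, False, False, False],
    300: [True, True, True, False, False],
}
rates = {error: perception_rate(a) for error, a in answers.items()}

# Errors whose rate is at most 20% (the imperceptible-range threshold).
within_20 = errors_where(rates, lambda r: r <= 20.0)
# Errors whose rate is at least 60% (the second threshold used in the text).
at_least_60 = errors_where(rates, lambda r: r >= 60.0)
```

The range boundaries are then read from the smallest and largest error values in each list.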

Tightly-Coupled Contents
As described earlier, the perception rate for Voice 2 and Video 1 is also shown in Figure 4. From the figure, we find that the perception rate for Voice 2 is 0% when the error is between about −250 ms and +250 ms. The imperceptible range is between around −250 ms and +250 ms, and the allowable range of the absolute error is larger than about 300 ms. The imperceptible range of Voice 2 is wider than that of Voice 1, whereas the allowable ranges of Voices 1 and 2 are almost the same. Therefore, we can conclude that the perception rate of group synchronization error depends on the voice contents.

Loosely-Coupled Contents
In Figure 5, we plot the perception rate versus the group synchronization error for Video 2 and Voice 3. The figure includes the results of the three different starting times of 0 sec., 2.5 sec., and 5.0 sec. We find in the figure that when the starting time is 0 sec. and the group synchronization error is larger than about −500 ms, the perception rate is 0%. When the starting time is 2.5 sec., the perception rate is 0%; that is, no one perceives the group synchronization error for the errors from around −600 ms to +600 ms. When the starting time is 5.0 sec. and the group synchronization error is less than about +500 ms, the perception rate is 0%.
From the above considerations, we can obtain the following results. Because the perception rate is 0% when the error is larger than about −500 ms and the starting time is 0 sec., and when the error is less than around +500 ms and the starting time is 5.0 sec., the perception rate is 0% when the error is between about −500 ms and +5500 ms (i.e., 5 sec. + 500 ms) and the starting time is 0 sec.
In the same way, the perception rate is 0% when the error is between about −3000 ms (i.e., −2.5 sec. − 500 ms) and +3000 ms (i.e., 2.5 sec. + 500 ms) and the starting time is 2.5 sec. Also, the perception rate is 0% when the error is between around −5500 ms (i.e., −5 sec. − 500 ms) and +500 ms and the starting time is 5.0 sec. The ranges are much wider than those of the tightly-coupled contents.
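The arithmetic behind these three ranges can be reproduced with a short sketch. The constants are the approximate values quoted above; the function `zero_rate_range_ms` is hypothetical, and applying the same formula to starting times other than 0, 2.5, and 5.0 sec. would be an assumption that the text does not test.

```python
LATEST_START_MS = 5000  # latest starting time used in the assessment (5.0 sec.)
MARGIN_MS = 500         # approximate bound read from Figure 5

def zero_rate_range_ms(start_ms):
    """Approximate (lower, upper) error range in ms with a 0% perception rate
    for a voice starting time of start_ms."""
    lower = -start_ms - MARGIN_MS
    upper = (LATEST_START_MS - start_ms) + MARGIN_MS
    return lower, upper

# The three starting times used in the assessment.
ranges = {start: zero_rate_range_ms(start) for start in (0, 2500, 5000)}
```

Each range spans 6000 ms in total, against roughly 500 ms (from about −250 ms to +250 ms) for the tightly-coupled contents with Voice 2.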
Therefore, we can conclude that the perception rate of group synchronization error depends on the starting time of the voice as well as the temporal relations between voice and video contents.
From the above discussions, the human perception of group synchronization error depends on the voice and video contents.

Conclusions
In this paper, we investigated the dependencies of voice and video contents on human perception of group synchronization error for remote learning by carrying out QoE subjective assessment. Assessment results showed that the error is more easily perceived for voice contents without silence periods than for those with silence periods. Therefore, silence periods reduce the perception rate of the error. We also found that the error is more easily perceived for tightly-coupled contents than for loosely-coupled ones. We further confirmed that the perception rate is dependent on the voice and video contents.
As the next step of our study, we will investigate the two-way communication case in which a teacher and multiple students can interactively discuss with each other in a lecture. In addition, we need to handle a variety of contents, because there exist dependencies of contents on QoE.