Accuracy and Response Speed of Eye Center Annotation Using Eye Movement Models: Validating the Effectiveness of Eyesight Detection
1. Introduction
Eye iris center annotation holds significant value in ophthalmic diagnostics and surgery [1]. Accurate real-time eye annotation not only supports the early diagnosis, long-term monitoring, and auxiliary treatment of ophthalmic diseases but also provides essential support for certain oculomotor research. For instance, in the early screening of amblyopia in children, precise annotation of the eye center can detect subtle eye tremors, enabling early detection and intervention.
Optokinetic Nystagmus (OKN) is a natural reflexive eye movement in oculomotor studies, reflecting the health status of the visual system. Through accurate eye center annotation, physicians can observe the minute variations in eye tremors, allowing for early detection of abnormalities and providing a basis for subsequent treatment.
Eyesight (visual acuity) refers to the ability of the human eye, specifically the fovea centralis, to resolve the minimum spacing between two points. It is usually measured by the minimum angle of resolution (MAR), which can be converted into logarithmic visual acuity (LogMAR) for quantitative comparison [2]. Eyesight examination is an important part of ophthalmic examinations, helping doctors evaluate the eye health of patients. Traditional eyesight detection methods are mainly subjective, using visual acuity charts such as the Snellen chart and E-chart [3] [4]. Although widely used [5], they rely on the patient’s language ability and active cooperation [6], leading to an error rate of up to 30% in infants and young children, individuals with intellectual disabilities, and uncooperative adults (such as malingerers). In addition, because results are affected by factors such as letter spacing and chart lighting, the accuracy and repeatability of traditional subjective visual acuity tests are often questioned.
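The MAR-to-LogMAR conversion mentioned above can be illustrated with a minimal sketch. It assumes the standard definitions (MAR in arcminutes is the reciprocal of the Snellen fraction, and LogMAR is its base-10 logarithm); the function names are illustrative and do not come from this study.

```python
import math


def snellen_to_logmar(numerator: float, denominator: float) -> float:
    """Convert a Snellen fraction (e.g. 20/40) to LogMAR."""
    if numerator <= 0 or denominator <= 0:
        raise ValueError("Snellen terms must be positive")
    # MAR (arcminutes) is the reciprocal of the Snellen fraction.
    mar = denominator / numerator
    return math.log10(mar)


def logmar_to_decimal(logmar: float) -> float:
    """Convert LogMAR back to decimal visual acuity (1 / MAR)."""
    return 10 ** (-logmar)


for num, den in [(20, 20), (20, 40), (20, 200)]:
    print(f"{num}/{den}: LogMAR = {snellen_to_logmar(num, den):.2f}")
```

Under these definitions, 20/20 corresponds to LogMAR 0.00, 20/40 to about 0.30, and 20/200 to 1.00, so larger LogMAR values indicate poorer acuity.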
Existing studies have shown a correlation between OKN and eyesight [7].
Mediapipe, developed by Google, is a cross-platform framework that supports efficient facial landmarking and other computer vision tasks [8]. It uses deep learning models to achieve high-precision facial feature point detection [9] [10]. Dlib is a widely used open-source library that provides facial landmarking, face recognition, and face detection capabilities. It uses machine learning-based methods for facial feature point detection [11]. Haar Cascade is a traditional computer vision method provided by OpenCV, widely used for face detection. It uses Haar features and a cascade classifier for object detection [11]. RetinaFace is a deep learning-based facial detection method that uses an efficient Convolutional Neural Network (CNN) for face detection [12].
2. Objective
This study aims to comprehensively compare and analyze the accuracy and response time of four widely used facial landmarking algorithms—Mediapipe, Dlib, Haar Cascade, and RetinaFace—in eye iris center annotation tasks. Additionally, it intends to establish an objective eyesight detection method based on collecting Optokinetic Nystagmus (OKN) responses and explore its application value in the adult population.
3. Methodology
3.1. Self-Collected Dataset
This study uses a dataset of eye images that includes a variety of ages, genders, and lighting conditions. Each image in the dataset is annotated with the true eye center position, which serves as the ground truth for comparing the algorithm’s annotation results.
3.2. Algorithm Selection and Experimental Setup
We selected four facial landmarking algorithms for the experimental tests: Mediapipe, Dlib, Haar Cascade, and RetinaFace. Custom programs were written to conduct the tests and collect results.
3.3. Evaluation Metrics
Accuracy is quantified by the Euclidean distance, Mean Squared Error (MSE), and Mean Absolute Error (MAE) between the algorithm’s annotation and the true eye center position. The real-time processing capability of each algorithm is measured in frames per second (FPS), which reflects its performance on real-time video streams. The detection rate is the number of successfully detected images divided by the total number of images.
The correlation between OKN signals and eyesight is verified in three steps. First, the best-performing algorithm from the comparison is used to annotate the eye center and extract OKN signals. Next, the OKN signals are paired with the subjective eyesight test results (Snellen visual acuity chart) of the corresponding subjects. Finally, the paired data are input into different machine learning models for training and verification to explore the correlation between OKN signals and eyesight.
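The accuracy metrics and detection rate described in this section can be sketched as follows. This is a minimal illustration, not the study's evaluation code: it assumes that MSE and MAE are computed over the per-image Euclidean distances and that a failed detection is represented by `None`. The function names are hypothetical.

```python
import math


def euclidean(pred, truth):
    """Euclidean distance between two (x, y) points, in pixels."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])


def evaluate(predictions, ground_truth):
    """Summarize annotation quality.

    predictions: list of (x, y) tuples, or None where detection failed.
    ground_truth: list of (x, y) tuples of the same length.
    """
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    # Only successfully detected images contribute to the error metrics.
    distances = [
        euclidean(p, t)
        for p, t in zip(predictions, ground_truth)
        if p is not None
    ]
    n = len(distances)
    return {
        "detection_rate": n / len(ground_truth),
        "mae": sum(distances) / n if n else float("nan"),
        "mse": sum(d * d for d in distances) / n if n else float("nan"),
    }


result = evaluate(
    [(3, 4), None, (0, 0), (1, 1)],
    [(0, 0), (5, 5), (0, 0), (1, 2)],
)
print(result)
```

Excluding failed detections from the error metrics keeps accuracy and detection rate separate, which matches the way they are reported as distinct quantities here.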
4. Experimental Results and Analysis
This section presents the experimental results of the four facial landmarking algorithms—Mediapipe, Dlib, Haar Cascade, and RetinaFace—on the eye center annotation task, and provides a detailed analysis of their accuracy and response time.
4.1. Annotation Accuracy
Figure 1. Scatter plot of Euclidean distances of each algorithm per image.
By analyzing the experimental data, we evaluated the accuracy of the four algorithms in eye center annotation. The following are the specific results based on various charts and evaluation metrics:
The scatter plot (Figure 1) intuitively displays the annotation error for each algorithm on different images. The distance values for Haar Cascade show significant fluctuations, especially on certain images where the annotation error is notably higher than that of the other models. This indicates that Haar Cascade performs inconsistently when handling changes in facial angles. In contrast, Dlib, RetinaFace, and Mediapipe show more stable annotation errors, with their distributions being relatively close to each other.
Figure 2. Box plot of Euclidean distances of each algorithm.
Figure 3. Histogram of Euclidean distance distribution of each algorithm.
The box plot (Figure 2) further confirms the instability of Haar Cascade in terms of annotation accuracy. Haar Cascade exhibits the largest fluctuation range in annotation errors, with the median of its distance values being higher than the other models, indicating poor robustness in eye annotation. In contrast, Dlib and Mediapipe show lower median errors with narrower error distribution ranges, validating their superior accuracy. RetinaFace ranks just behind these two algorithms.
The histogram (Figure 3) shows the distribution of Euclidean distances for different algorithms. Haar Cascade’s distance values exhibit a bimodal distribution, indicating significant bias and instability in its annotation results. In contrast, Dlib, RetinaFace, and Mediapipe have most of their distance values concentrated within a smaller range, validating their consistency in annotation accuracy.
Figure 4. Bar chart of MSE and MAE of Dlib, RetinaFace, and Mediapipe.
Figure 5. Bar chart of detection rate of each algorithm.
Due to Haar Cascade’s larger errors in eye annotation, we compare the MSE (Mean Squared Error) and MAE (Mean Absolute Error) bar charts only for the three models: Dlib, RetinaFace, and Mediapipe. The bar charts (Figure 4) reveal that RetinaFace’s error values are significantly higher than those of the other models, further highlighting its disadvantage in accuracy. In contrast, Mediapipe shows the lowest error values, demonstrating its superior performance in eye center annotation.
The detection rate bar chart (Figure 5) shows that both Mediapipe and RetinaFace achieved a detection rate of 100%, consistently completing the eye annotation task. In contrast, Haar Cascade had the lowest detection rate, only 50%, which further confirms its poor performance in complex scenarios. Dlib’s detection rate was also below 80%.
Table 1. Statistical table of loading time, detection time, total time, and frame rate of each algorithm.
| Algorithm | Mean Load Time (s) | Mean Detect Time (s) | Mean Total Time (s) | Std Load Time (s) | Std Detect Time (s) | Std Total Time (s) | Frame Rate Detect (fps) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dlib | 0.0031 | 0.037 | 0.040 | 0.0038 | 0.021 | 0.022 | 27.06 |
| Haar Cascade | 0.0036 | 0.038 | 0.041 | 0.0017 | 0.020 | 0.022 | 26.54 |
| Mediapipe | 0.0027 | 0.0056 | 0.0083 | 0.0033 | 0.0024 | 0.0054 | 177.08 |
| RetinaFace | 0.0038 | 3.01 | 3.0 | 0.0019 | 0.89 | 0.89 | 0.33 |
For each algorithm, we calculated its loading time, detection time, and total processing time, and used frames per second (FPS) as a measure of response speed. As Table 1 shows, all four models perform well in terms of loading time. Mediapipe shows the best FPS performance, reaching 177.08 FPS, significantly higher than the other algorithms, making it suitable for real-time annotation. Haar Cascade and Dlib achieve frame rates of 26.54 and 27.06 FPS, respectively. While they can handle typical real-time tasks, their frame rates appear insufficient for rapid eye movement tracking. RetinaFace, with only 0.33 FPS, responds extremely slowly and is unsuitable for annotation tasks in real-time video streams.
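The relationship between per-frame detection time and FPS can be made concrete with a short sketch. It assumes that FPS is the reciprocal of the mean detection time; the exact procedure used to produce Table 1 is not specified, so `time_detector` and `fps_from_mean_time` are hypothetical helpers rather than the study's code.

```python
import statistics
import time


def time_detector(detect, frames):
    """Time `detect` on each frame; return mean, std, and FPS of detection."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        detect(frame)
        durations.append(time.perf_counter() - start)
    mean_t = statistics.mean(durations)
    # Sample standard deviation needs at least two measurements.
    std_t = statistics.stdev(durations) if len(durations) > 1 else 0.0
    fps = 1.0 / mean_t if mean_t > 0 else float("inf")
    return {"mean_s": mean_t, "std_s": std_t, "fps": fps}


def fps_from_mean_time(mean_seconds):
    """Frames per second implied by a mean per-frame time in seconds."""
    return 1.0 / mean_seconds


# Mean detection times (seconds per frame) as reported in Table 1.
table1_detect = {
    "Dlib": 0.037,
    "Haar Cascade": 0.038,
    "Mediapipe": 0.0056,
    "RetinaFace": 3.01,
}
for name, seconds in table1_detect.items():
    print(f"{name}: {fps_from_mean_time(seconds):.2f} FPS")
```

The reciprocals come out at roughly 27, 26, 179, and 0.33 FPS, close to the reported frame rates; the small differences are expected because the tabulated times are rounded.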
4.2. Further Experiments
To further assess Mediapipe’s performance, we manually brightened and darkened the images to verify the algorithm’s adaptability under different lighting conditions. The test results showed that Mediapipe was more accurate on brightened images, while its accuracy was slightly lower on darkened images. By using multi-threaded processing and Haar Cascade ROI calibration, Mediapipe achieved a detection rate of 96.43% on the adjusted dataset.
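One simple way to produce brightened and darkened test images is to scale pixel intensities by a gain factor and clamp the result to the valid range. The sketch below assumes 8-bit grayscale images stored as nested lists; the gain values are arbitrary examples, since the adjustment actually used in the experiments is not specified.

```python
def adjust_brightness(image, gain):
    """Scale every pixel by `gain` and clamp to the 8-bit range [0, 255].

    image: list of rows, each a list of integer intensities.
    gain > 1 brightens the image, gain < 1 darkens it.
    """
    return [
        [min(255, max(0, round(px * gain))) for px in row]
        for row in image
    ]


original = [[0, 100, 200]]
print(adjust_brightness(original, 1.5))  # brightened copy
print(adjust_brightness(original, 0.5))  # darkened copy
```

Running the same annotation algorithm on the original and adjusted copies, then measuring the Euclidean distance between the two annotated centers, gives the kind of "Normal vs Dark" and "Normal vs Bright" comparison summarized in Table 2.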
Table 2. Statistical table of Euclidean distance under different lighting conditions.
| Comparison | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Normal vs Dark | 1.15 | 1.13 | 0 | 0 | 1 | 1.41 | 4.12 |
| Normal vs Bright | 0.85 | 0.678 | 0 | 0 | 1 | 1.19 | 2.03 |
Additionally, we conducted repeatability tests on normal, brightened, and darkened images. The results (Table 2) showed a 100% repeatability rate across all three tests, indicating that Mediapipe performs consistently under different processing conditions and further supporting its robustness.
Figure 6. OKN waveform.
We also tested Mediapipe’s real-time annotation on dynamic video streams. It provided stable annotation results and produced a standard OKN waveform (Figure 6), demonstrating its feasibility for ophthalmic applications. The evaluation results of the machine learning models trained on the paired OKN and eyesight data are listed in Table 3.
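An OKN waveform is typically a sawtooth: a slow drift that follows the stimulus, followed by a fast reset in the opposite direction. As a rough sketch of how such a waveform could be analyzed from the annotated horizontal eye-center positions, the code below labels each inter-frame step by its velocity and counts slow-to-fast transitions. The velocity threshold and function names are assumptions for illustration and are not taken from this study.

```python
def split_okn_phases(x_positions, fps, fast_threshold):
    """Label each inter-frame step as 'slow' or 'fast'.

    x_positions: horizontal eye-center position per frame (pixels).
    fps: frame rate of the video stream.
    fast_threshold: absolute velocity (pixels/second) above which a
        step is treated as a fast phase (assumed value, tune per setup).
    """
    labels = []
    for prev, cur in zip(x_positions, x_positions[1:]):
        velocity = (cur - prev) * fps
        labels.append("fast" if abs(velocity) > fast_threshold else "slow")
    return labels


def count_beats(labels):
    """Count OKN beats as the number of slow-to-fast transitions."""
    return sum(
        1 for a, b in zip(labels, labels[1:]) if a == "slow" and b == "fast"
    )


# Synthetic sawtooth trace: slow drift of +1 px/frame, then a fast reset.
trace = [0, 1, 2, 3, 0, 1, 2, 3, 0]
labels = split_okn_phases(trace, fps=30, fast_threshold=60)
print(labels)
print(count_beats(labels))
```

A frame-to-frame velocity threshold is only meaningful when the frame rate is high enough to resolve the fast phase, which is one reason the FPS comparison matters for this application.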
Table 3. Evaluation results of machine learning models for eyesight detection.
| Model | Mean Squared Error (MSE) | Mean Absolute Error (MAE) |
| --- | --- | --- |
| Regression Tree | 0.043 | 0.139 |
| Random Forest Regression | 0.042 | 0.141 |
| Support Vector Machine Regression | 0.055 | 0.162 |
| KNN Regression | 0.056 | 0.171 |
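To show how a regression model of the kind listed in Table 3 is scored with MSE and MAE, here is a minimal KNN regression sketch in plain Python. The two features and all data values are invented toy numbers for illustration only; the features actually extracted from the OKN signals are not specified, and this is not the study's model.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training samples."""
    order = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: two hypothetical OKN features per subject, target in LogMAR.
train_x = [(0.9, 3.0), (0.8, 2.8), (0.5, 2.0), (0.4, 1.5)]
train_y = [0.0, 0.1, 0.4, 0.5]
test_x = [(0.85, 2.9), (0.45, 1.8)]
test_y = [0.1, 0.4]

predictions = [knn_predict(train_x, train_y, q, k=2) for q in test_x]
print("MSE:", mse(test_y, predictions))
print("MAE:", mae(test_y, predictions))
```

MSE penalizes large errors more heavily than MAE, so reporting both, as Table 3 does, indicates whether a model's error is dominated by a few large misses.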
5. Discussion
This study compared four widely used facial landmarking algorithms (Mediapipe, Dlib, Haar Cascade, and RetinaFace), assessing their accuracy and response time in eye iris center annotation tasks. Mediapipe’s core strengths lie in outstanding real-time processing, efficient facial feature annotation, and strong robustness under varying lighting conditions. By integrating deep learning with hardware acceleration, it delivers high-precision, low-latency eye annotation while maintaining high FPS in dynamic video streams, which is crucial for long-term ophthalmic home monitoring. It also balances accuracy, speed, and low hardware resource demands. However, its detection rate on the brightness-adjusted dataset (96.43%) is below 100%, which calls for future optimization to cut computational overhead and boost performance on resource-constrained devices.
This study also has certain limitations. First, the dataset does not include samples from patients with ophthalmic diseases, so the applicability of the algorithm to such patients needs to be further verified. Second, the algorithm’s performance in occlusion scenarios (such as wearing glasses, squinting, and eye closure) was not tested, and future research should supplement the relevant experiments. Third, OKN signal collection may be disturbed by eye movement artifacts, and more effective signal preprocessing methods need to be explored to improve signal quality.
6. Outlook and Future Work
This study experimentally compared the performance of four facial landmarking algorithms (Mediapipe, Dlib, Haar Cascade, and RetinaFace) in eye center annotation tasks, evaluating their accuracy, response time, and robustness. Although Mediapipe performed best overall, it still has room for improvement, particularly in robustness in complex environments and computational resource consumption. Future research could therefore focus on improving and expanding the algorithm in the following areas:
Future work could incorporate a Multi-Task Learning (MTL) framework to jointly optimize facial feature annotation and eye center annotation. By sharing some network layers and feature representations, the algorithm can improve performance on multiple related tasks simultaneously. In addition, enhancing the ability to detect other key information while processing eye annotation would further improve the comprehensiveness and accuracy of ophthalmic diagnostic systems.
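A common way to express this kind of joint optimization is a weighted sum of per-task losses over shared parameters. The formulation below is a generic sketch, not a design proposed in this study; the symbols $\theta_s$ (shared layers), $\theta_{lm}$ and $\theta_{c}$ (task-specific heads), and the weights $\lambda_{lm}$ and $\lambda_{c}$ are all assumed notation.

```latex
\[
\mathcal{L}_{\text{total}}(\theta_s, \theta_{lm}, \theta_{c})
  = \lambda_{lm}\, \mathcal{L}_{lm}(\theta_s, \theta_{lm})
  + \lambda_{c}\, \mathcal{L}_{c}(\theta_s, \theta_{c}),
\qquad \lambda_{lm}, \lambda_{c} \ge 0
\]
```

Here $\mathcal{L}_{lm}$ would be the facial landmark loss and $\mathcal{L}_{c}$ the eye center loss; because both terms depend on $\theta_s$, gradients from each task update the shared layers, and the weights control the trade-off between the two tasks.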
Future work should also expand the dataset to include samples from patients with various ophthalmic diseases and investigate the correlation between OKN signals and eyesight in more depth, so as to further improve the effectiveness of Mediapipe in eyesight detection and promote its clinical application [13].