A Restricted, Adaptive Threshold Segmentation Approach for Processing High-speed Image Sequences of the Glottis

In this paper, we propose a restricted, adaptive threshold approach for segmenting images of the glottis acquired by high-speed video-endoscopy (HSV). The approach first identifies, for each image frame, a region of interest (ROI) that encloses the extent of vocal-fold motion, as estimated from the difference image sequences. Threshold segmentation is then restricted to the identified ROI of each frame of the original image sequences; the restricted frames are referred to as sub-image sequences. The threshold value is adapted for each sub-image frame and is determined by the respective minimum gray-scale value, which typically corresponds to a spatial location within the glottis. Because only a simple threshold method is used, the proposed approach is practical and highly efficient for segmenting a large number of image frames. Results obtained from the segmentation of representative clinical image sequences are presented to verify the proposed method.


Introduction
Laryngeal imaging-based analysis of vocal fold motion has proved valuable both for diagnosing voice disorders and for understanding the mechanism of voice production. High-speed digital imaging (HSDI), or high-speed video-endoscopy (HSV), has now become a clinical reality for imaging the vibrating vocal folds. HSDI systems record images of the vibrating vocal folds at a typical rate of 2000 frames/sec, which is fast enough to resolve a specific, sustained phonatory vocal fold vibration. In the literature [1][2][3][4][5][6][7][8][9], the glottal area waveform (GAW), along with other spatiotemporal waveforms of the glottis, has been successfully used to analyze vocal fold vibration, which may correlate with voice condition. The credibility of the analysis strongly depends on accurate extraction of the GAW from images of the glottis. To obtain the GAW, the glottis, or the vocal fold opening region, needs to be segmented and its area calculated on a frame-by-frame basis. Clearly, it is crucial to develop effective and highly efficient segmentation algorithms for this purpose.
Image segmentation is fundamental to image understanding and computer vision [10][11][12][13]. Establishing an efficient segmentation algorithm remains challenging because no universal algorithm exists for all image segmentation tasks.
The purpose of image segmentation is to divide an image into regions that are meaningful to some higher-level process. In this research, the meaningful region is the glottis, the air space between the pair of vocal folds. Several algorithms for glottis segmentation have been reported in the literature, including the region growing algorithm [5,14,15] and the active contour algorithm [16][17][18][19][20]. However, these approaches have limitations that make them impractical for analyzing HSV image data sets. The region growing algorithm depends heavily on the selection of the seed point, which requires prior knowledge of the location of the glottis [10]. The active contour algorithm, on the other hand, is extremely time consuming and susceptible to noise [11].
In a clinical setting, the HSV system is capable of capturing images of the vibrating vocal folds at a rate of at least 2000 frames per second. During an examination, the patient is instructed to produce a sustained vowel phonation, with a typical recording time of 2 seconds. In other words, each HSV recording typically contains about 4000 image frames that need to be processed for further analysis and interpretation of the dynamic behavior of the vocal folds [4]. As a result, it is essential to develop effective and efficient methods to segment the glottis rapidly and accurately. Since each HSV recording is short, it is reasonable to assume that tremors of the clinician's hand and of the subject's neck and head are negligible. Additionally, the following assumptions should hold:
• The illumination is constant during the recording.
• The camera position is fixed during the recording.
While the motion of the vocal folds causes changes in the gray level in some regions, the gray-level intensity within other (motionless) regions remains almost unchanged. To segment the glottis successfully by thresholding, it is necessary to achieve well-behaved histogram distributions. Since the motionless region is not of interest, it should first be removed. For this purpose, the motion cue is used to obtain a sub-image whose size adapts to the opening/closure status of the glottis. As a result, the size of each sub-image varies so that it contains only a minimal but complete region of interest. In this way, the original image data are greatly reduced, which speeds up segmentation and allows the simplest threshold method to be applied efficiently and successfully to segment the glottis.
In this work, we propose a two-step segmentation scheme based on the vocal fold motion analysis and adaptive thresholding as detailed in the following Method section.

Method
In this paper, the adaptive thresholding segmentation approach is based on an evaluation of motion using the difference image at corresponding spatial locations in the image sequence, which highlights the region enclosing the extent of vocal-fold motion. The images are then segmented by adaptive thresholding within a restricted region of the original image, termed the sub-image. The threshold value varies for each image and is determined from the minimum grayscale pixel value in the sub-image, which typically corresponds to a location within the glottis.
We designed the following scheme for the segmentation task, as illustrated in Figure 1:
1) Manually select an image frame from an HSDI recording where the vocal fold opening region is the smallest, as the reference image (RI).
2) Obtain the binary difference image (DI) based on the RI.
3) Use a median filter to eliminate the isolated points labeled "1" in the DI.
4) Obtain the sub-image, which has a variable size for each image frame, based on the DI.
5) Select the threshold value based on the lowest pixel value in each sub-image frame and segment the sub-image.

Introduction to Image Segmentation and Motion Analysis
As illustrated in Figure 2, each image from a laryngeal image recording should be segmented into two regions: the vocal fold opening region (glottis), which is the object, and the remaining region, which is considered the background. In general, image segmentation techniques can be categorized into three classes [11]: 1) characteristic feature thresholding or clustering; 2) edge detection; and 3) region extraction. Among them, thresholding is the simplest and most efficient. Thresholding is the transformation of an input gray-level image $f(i, j)$ into an output (segmented) binary image $g(i, j)$:

$$g(i, j) = \begin{cases} 1, & f(i, j) \geq T \\ 0, & f(i, j) < T \end{cases} \qquad (1)$$

where $T$ is the threshold value, $g(i, j) = 1$ for image elements of objects, and $g(i, j) = 0$ for image elements of the background (or vice versa). From Equation (1), it is clear that correct threshold selection is crucial for successful segmentation.
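As a minimal sketch of the thresholding transformation in Equation (1), the following Python snippet binarises a small gray-level image stored as a list of rows. The function name `threshold_image`, the `object_is_dark` flag (which covers the "or vice versa" case, relevant here because the glottis is darker than its surroundings), and the sample values are illustrative assumptions, not part of the original method.

```python
def threshold_image(f, T, object_is_dark=False):
    """Binarise a gray-level image f (list of rows) with threshold T.

    By default pixels with f >= T are labeled 1 (object).
    With object_is_dark=True the labeling is inverted: pixels with f < T
    are labeled 1, which suits a dark object on a bright background.
    """
    if object_is_dark:
        return [[1 if v < T else 0 for v in row] for row in f]
    return [[1 if v >= T else 0 for v in row] for row in f]


# Hypothetical 3x3 image with intensities normalised to [0, 1];
# the dark pixels in the middle column stand in for the glottis.
f = [
    [0.80, 0.75, 0.82],
    [0.70, 0.10, 0.78],
    [0.76, 0.12, 0.81],
]

g_bright = threshold_image(f, T=0.5)
g_dark = threshold_image(f, T=0.5, object_is_dark=True)
print(g_dark)
```

The two outputs are complements of each other, which illustrates why the choice of labeling convention and of $T$ together decide what counts as the object.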
Motion is a powerful cue used by humans and many animals to extract objects of interest from a background of irrelevant detail [21]. The motion cue can be applied to segmentation in both the spatial and frequency domains. In this work, we use the basic spatial techniques, since our application focuses on motion analysis in the spatial domain.

Glottis Area Segmentation
The difference image is typically obtained by motion analysis in the spatial domain and is defined as a binary image:

$$d(i, j) = \begin{cases} 1, & |f_1(i, j) - f_2(i, j)| > \varepsilon \\ 0, & \text{otherwise} \end{cases}$$

where $d(i, j) = 1$ represents image areas enclosing motion, while $d(i, j) = 0$ represents image areas with no or little motion; $f_1$ and $f_2$ are two consecutive gray-level image frames within the original image sequences, and $\varepsilon$ is a small positive number.
Here, we define the difference image (DI), a binary image, slightly differently, as described below:

$$DI(x, y, t) = \begin{cases} 1, & \text{if } |f(x, y, t) - RI(x, y)| > T_1 \\ 0, & \text{otherwise} \end{cases}$$
where $T_1$ is a positive constant. The optimal value of $T_1$ is determined by experimenting with different datasets. The parameter $t$ refers to the image frame recorded at time $t$. Similarly, $DI(x, y, t) = 1$ represents the vocal fold motion enclosure in an image frame at time $t$, and $DI(x, y, t) = 0$ represents the background area within an image frame at time $t$.
$RI(x, y)$ is the selected reference image frame against which every input image is compared. As mentioned earlier, an image frame having the minimum glottis area is manually selected as the RI.
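The DI defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: intensities are taken to be normalised to [0, 1], the comparison uses the absolute difference between the frame and the RI, and the function name `difference_image` and the sample arrays are hypothetical.

```python
def difference_image(frame, reference, t1=0.10):
    """Return a binary DI: 1 where |frame - reference| > t1, else 0.

    frame and reference are gray-level images of equal size (lists of rows).
    The absolute difference is an assumption of this sketch.
    """
    return [
        [1 if abs(p - r) > t1 else 0 for p, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


# Hypothetical reference image (glottis nearly closed).
reference = [
    [0.80, 0.80, 0.80, 0.80],
    [0.80, 0.60, 0.80, 0.80],
    [0.80, 0.80, 0.80, 0.80],
]
# Hypothetical frame at time t (glottis open, so some pixels are much darker).
frame = [
    [0.80, 0.78, 0.80, 0.80],
    [0.75, 0.10, 0.15, 0.80],
    [0.80, 0.12, 0.80, 0.82],
]

di = difference_image(frame, reference, t1=0.10)
print(di)
```

Small intensity fluctuations (0.02 to 0.05 in this example) stay below $T_1$ and are labeled 0, while the pixels darkened by the glottal opening are labeled 1.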
In each frame of the DI sequences, there might be pixels far from the glottis that are mislabeled as "1". The main reasons for this mislabeling are as follows: 1) the illumination is not constant during the image recording; 2) the vocal folds are not rigid. As a result, some regions near the vocal folds undergo moderate motion as the vocal folds vibrate.
To obtain the sub-image accurately and ensure that it encloses the entire region of the glottis, we apply a median filter to the DI for noise removal.
Median filtering is a non-linear smoothing method that is widely used because it reduces the blurring of edges [10]. This smoothing technique has been shown to be effective in eliminating spike noise. The key operation in median filtering is to replace the brightness of an individual pixel in the image by the median of the brightness values of several pixels in its neighborhood. The use of the median value can therefore reduce the effect of individual noise spikes and smooth the image.
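For a binary image such as the DI, the median over a window reduces to a majority vote. The sketch below illustrates this in plain Python. It is not claimed to reproduce Matlab's median filter exactly: zero padding at the borders, the window placement for even sizes, and the tie rule (ties go to 0) are assumptions of this example, and the name `median_filter_binary` is hypothetical. A 3×3 window is used in the example so that the median is unambiguous.

```python
def median_filter_binary(di, m=3, n=3):
    """Median-filter a binary image with an m x n window (majority vote).

    Pixels outside the image count as 0 (zero padding). A pixel becomes 1
    only if more than half of the m*n window positions are 1.
    """
    rows, cols = len(di), len(di[0])
    top, left = (m - 1) // 2, (n - 1) // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            ones = 0
            for a in range(m):
                for b in range(n):
                    r, c = i - top + a, j - left + b
                    if 0 <= r < rows and 0 <= c < cols:
                        ones += di[r][c]
            out[i][j] = 1 if 2 * ones > m * n else 0
    return out


# Hypothetical DI: a 3x3 motion region plus one isolated mislabeled pixel.
noisy_di = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

filtered_di = median_filter_binary(noisy_di, m=3, n=3)
for row in filtered_di:
    print(row)
```

The isolated pixel at the top-left corner is removed while the core of the motion region survives. The corners of the region are eroded as well, which is one reason the window size has to be tuned to the data.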
In the sub-image sequences, each image frame ideally contains a minimal region that entirely encloses the extent of vocal fold motion. After the median filtering operation, the binary DI sequences are constructed, from which we determine the ROI used for the subsequent restricted, adaptive threshold segmentation applied to the sub-image sequences.
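One plausible way to turn a filtered DI into a rectangular ROI is to take the bounding rectangle of all pixels labeled 1 and crop the original frame to it. The text does not spell out the exact rule, so the bounding-rectangle choice, the function names `bounding_roi` and `crop`, and the sample data below are assumptions made for illustration.

```python
def bounding_roi(di):
    """Return (top, bottom, left, right), inclusive, enclosing all 1-pixels.

    Returns None when the DI contains no 1-pixels (no detected motion).
    """
    rows = [i for i, row in enumerate(di) if any(row)]
    if not rows:
        return None
    cols = [j for j in range(len(di[0])) if any(row[j] for row in di)]
    return rows[0], rows[-1], cols[0], cols[-1]


def crop(frame, roi):
    """Cut the sub-image described by roi out of a gray-level frame."""
    top, bottom, left, right = roi
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


# Hypothetical median-filtered DI for one frame.
filtered_di = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Dummy gray levels chosen so that each value encodes its row and column.
frame = [[10 * r + c for c in range(6)] for r in range(5)]

roi = bounding_roi(filtered_di)
sub_image = crop(frame, roi)
print(roi, sub_image)
```

Because the DI changes from frame to frame, the rectangle and therefore the sub-image size change with the opening and closing of the glottis.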
Further, we propose to use a variable threshold value for segmenting each sub-image. It is known a priori that the darkest pixel, with the minimum gray-level intensity, should lie within the glottis, and in principle all pixels within the glottis should have lower values than areas outside the glottis in the sub-image. We thus obtain the threshold value from the grayscale minimum value.
The algorithm is designed as follows:
1) Find the grayscale minimum ($L$) of each sub-image frame.
2) Obtain the threshold value $T_2 = L + c_2$.
3) Repeat the above steps frame by frame.
Here, $c_2$ is a constant; its determination is described in the following section.
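The steps above can be sketched as follows for a single sub-image. Intensities are assumed to be normalised to [0, 1], and pixels strictly below the threshold are labeled as glottis; both points, along with the function name `adaptive_threshold_segment` and the sample values, are assumptions of this illustration.

```python
def adaptive_threshold_segment(sub_image, c2=0.15):
    """Segment one sub-image with a threshold derived from its darkest pixel.

    Returns (mask, t2), where t2 = L + c2, L is the grayscale minimum,
    and mask is 1 where the pixel value is below t2 (assumed glottis).
    """
    L = min(min(row) for row in sub_image)
    t2 = L + c2
    mask = [[1 if v < t2 else 0 for v in row] for row in sub_image]
    return mask, t2


# Hypothetical sub-image; the dark middle column stands in for the glottis.
sub_image = [
    [0.70, 0.12, 0.65],
    [0.60, 0.05, 0.18],
    [0.72, 0.15, 0.68],
]

mask, t2 = adaptive_threshold_segment(sub_image, c2=0.15)
print(t2, mask)
```

Applying the function to every sub-image in turn corresponds to step 3. Since $L$ is recomputed per frame, the threshold follows frame-to-frame brightness changes inside the ROI.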
After segmenting the sub-image sequences using the respective threshold values, we obtain binary segmented image sequences.
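Once the binary masks are available, the glottal area waveform (GAW) mentioned in the Introduction follows by counting the pixels labeled 1 in each frame. The short sketch below assumes that the area is expressed in pixels; the function name `glottal_area_waveform` and the toy masks are hypothetical.

```python
def glottal_area_waveform(masks):
    """Return the glottal area, in pixels, for each binary mask in a sequence."""
    return [sum(sum(row) for row in mask) for mask in masks]


# Three hypothetical consecutive masks showing an opening glottis.
masks = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
]

gaw = glottal_area_waveform(masks)
print(gaw)
```

The count does not depend on the mask dimensions, so it works even though the sub-image size varies from frame to frame.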

Parameters Determination
In this work, we use Matlab as the platform for all analyses. The proposed segmentation method requires the following parameters: 1) the size of the median filter mask, [m, n]; and 2) the threshold value $T_1$ and the constant $c_2$. Different parameters can lead to different segmentation results. These parameters were determined by trial and error. The parameters used in the following analyses are $T_1$ = 0.10, $c_2$ = 0.15, and [m, n] = [4, 4].

Discussion
Among threshold selection methods based on gray-level histograms, the Otsu method is widely used in many applications [22]. It is a nonparametric and unsupervised method for automatic threshold selection and image segmentation. An optimal threshold is selected by the discriminant criterion, namely, so as to maximize the separability of the resultant classes in gray levels.
All pixels mislabeled "1" were effectively removed by the median filter. Finally, a series of segmentation results are shown in Figure 6, where both the sub-image region (rectangular ROI) and the accurately delineated glottis contour are outlined.
A comparison between the segmentation results obtained from three randomly selected consecutive HSDI frames using the Otsu method and our method is shown in Figure 7. The top row shows the segmentation results obtained in the full image frame by the Otsu method, and the lower row shows the results obtained with our method. It is clear that our first step, obtaining the sub-image, is critical for achieving robust and accurate segmentation results.
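For reference, the Otsu criterion used in this comparison can be sketched as an exhaustive search for the gray level that maximizes the between-class variance of the histogram. The snippet assumes integer gray levels in 0..255 and operates on a flat list of pixel values; the function name `otsu_threshold` and the sample data are illustrative and are not taken from the experiments reported here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t maximizing the between-class variance.

    Class 0 holds pixels with value <= t and class 1 holds the rest.
    pixels is a flat list of integers in the range 0..levels-1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w0 = 0        # number of pixels in class 0
    sum0 = 0      # sum of gray levels in class 0
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # class 0 is still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Hypothetical bimodal data: a dark cluster and a bright cluster.
pixels = [10] * 20 + [12] * 10 + [200] * 50 + [205] * 20
t = otsu_threshold(pixels)
dark = [p for p in pixels if p <= t]
print(t, len(dark))
```

On cleanly bimodal data such as this, the selected threshold separates the two clusters. On a full endoscopic frame, other dark areas outside the glottis also enter the histogram, which is consistent with the observation that restricting the analysis to the sub-image matters.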

Conclusion
We developed a new approach for restricted, adaptive segmentation of images of the glottis acquired with an HSV system. By defining a sub-image set based on the vocal fold motion cue, the subsequent threshold process is efficiently restricted to an ROI so that the effects of the background are minimized, leading to a robust and accurate segmentation outcome. From the segmentation results obtained on several clinical HSDI data sets using the proposed method, we conclude that our method is effective and practical for applications in clinical settings.

Figure 1. The scheme for the two-step segmentation.

Figure 2. (a) An image frame from the HSDI recording, and (b) the grey-level intensity profile along the mid-line of the vocal fold.

Figure 3. Comparison of the results of segmentation; the upper row shows two input images, the middle row shows the segmented images using our two-step approach, and the lower row shows the segmented images using the Otsu method.

Figure 4. Sub-image frames showing the defined rectangular ROI.

Figure 5. The left column shows four difference images, and the right column shows the results after applying a 4×4 median filter.

Figure 6. Serial segmentation results: the rectangle marks the defined ROI within which a restricted thresholding is performed to delineate the glottis (outlined).

Figure 7. Results of segmentation from direct thresholding (top row) and from our algorithm (lower row).