Feature detection in chemical sensor images falls under the general topic of mathematical morphology, where the goal is to detect “image objects” (e.g., peaks or spots) in an image. Here, we propose a novel method for object detection that generalizes to a k-dimensional object obtained from an analogous higher-dimensional technology source. Our method is based on the smoothing decomposition, Data = Smooth + Rough, where the “rough” (i.e., residual) object from a k-dimensional cross-shaped smoother provides information for object detection. We demonstrate properties of this procedure with chemical sensor applications from various biological fields, including genetic and proteomic data analysis.

Numerous chemical sensor platforms and technologies require image analysis techniques to isolate the signal from the associated noise in the sensor. In a one-dimensional chemical sensor setting, for example, several technologies produce spectra where scientists can gain information from associated peaks, or grayscale images where the features appear as streaks or lines. Meanwhile, in a two-dimensional setting, associated technologies produce images whose features are spots. Such image analyses usually involve methods where the goal is to identify and quantify the size of an image feature or object, i.e. feature detection and quantification.

Feature detection in multi-dimensional images is an area of great interest in a variety of applications, ranging from astronomy to proteomics [1-7]. Proposed methods employ image segmentation techniques such as watershed methods, thresholding operators, and wavelet reconstruction methods to locate the features contained in a one-dimensional or two-dimensional image. Feature detection in larger, high-dimensional datasets is also a growing area of research; see, for example, [8, 9]. The proposed algorithms and methods, however, usually apply solely to the application and technology of interest and may not be applicable to images of other forms or varying dimensionality.

Determining the locations and boundaries associated with various chemical sensor features has been a problem considered by computer scientists and engineers (under the guise of image analysis), as well as mathematicians and statisticians (via mathematical morphology). Mathematical morphology (MM) is the science of analyzing and processing geometric structures (e.g. local maxima) in digital images via various processing techniques [10-15]. Examples of common MM functions include opening, closing, thinning, binning, thresholding, and watershed methods, which have been employed in numerous applications including pedestrian detection [

This paper combines aspects of feature detection, data smoothing, and residual analysis to develop a new bump detection method not only for one- or two-dimensional images, but for k-dimensional images for any k. The method is straightforward and can be applied universally to higher-dimensional images, providing researchers with a detection and quantification method for any chemical sensor technology whose features of interest are bumps.

In our method, a specialized median (referred to hereafter as an s-median) smoother is developed, where the s-median determines the median associated with the intensity values that lie spatially in the cross-shaped structuring element. Consider a k-dimensional (kD) image represented by, where x is a point location in the Cartesian coordinate system. We let denote the kD smoothed image obtained by using an s-median operator with “arms” of length on; window size examples are provided in

After applying this s-median throughout the raw image, we examine the associated residual image (the “rough” component of the decomposition Data = Smooth + Rough) to obtain information regarding bump detection and quantification. The residual image contains k-dimensional cross features, where local maxima identify the bump center and local minima outline the shape of the bump. We can use this information, for example, to identify peaks and their associated area in one-dimensional applications involving spectral data, or to detect and quantify spots in two-dimensional images. Sections 2.1 and 2.2 introduce the theoretical underpinnings for our method and demonstrate the procedure for continuous and discrete functions, while Section 2.3 extends these ideas to study the behavior of the s-median operator in the presence of noise.
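The cross-shaped smoother and its residual can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' FIASCO code: the function names `s_median` and `residual` are hypothetical, and replicating edge values at the image border is an assumption, since the text does not specify boundary handling.

```python
import numpy as np

def s_median(image, c):
    """Median over a cross-shaped window with arms of length c (any dimension).

    Hypothetical helper: borders are handled by replicating edge values,
    which is an assumption and not taken from the paper.
    """
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, c, mode="edge")
    center = tuple(slice(c, c + n) for n in image.shape)
    # The centre pixel is counted once, then 2*c neighbours along each axis.
    stack = [padded[center]]
    for axis, n in enumerate(image.shape):
        for offset in range(-c, c + 1):
            if offset == 0:
                continue
            idx = list(center)
            idx[axis] = slice(c + offset, c + offset + n)
            stack.append(padded[tuple(idx)])
    return np.median(np.stack(stack), axis=0)

def residual(image, c):
    """Rough = Data - Smooth, using the cross-shaped median as the smoother."""
    return np.asarray(image, dtype=float) - s_median(image, c)

# Usage: a single noise-free 2D Gaussian bump on a 21 x 21 grid.
yy, xx = np.mgrid[-10:11, -10:11]
bump = np.exp(-(xx**2 + yy**2) / (2 * 3.0**2))
rough = residual(bump, c=4)
peak = np.unravel_index(np.argmax(rough), rough.shape)
```

In this toy example the largest residual sits at the bump center, while a strictly monotone 1D ramp gives zero residual away from the borders.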

This section develops the theoretical underpinnings for the s-median smooth and, subsequently, the residual image in the context of continuous functions. We derive the smooth for 1D and 2D theoretical models by characterizing the median operator via the function mapping between input and output values.

Let and denote the cumulative distribution function (cdf) and probability density function (pdf), respectively, for a random variable evaluated at the point; analogously, we denote the cdf and pdf for a random variable at point. Let be a function that maps from the support set (for the random variable) to the support set (for the random variable Y). For our applications, is obtained from an optical device such as a charge-coupled device camera or laser scanner. Our goal is to obtain an expression for [and thus] which then determines the median of, , i.e. satisfies . Note that, in our notation for one dimension, at a given location, x. Thus, for simplicity, we will denote as (or) with the implicit understanding that (or) is also a function of k and c.

Consider the case where is strictly monotone on the interval. Then, for increasing,

while, for decreasing,

Strict monotonicity in implies its invertibility for any, i.e.; in particular, by definition of, we have

. Hence, in the monotone case, , where denotes the median associated with the random variable, and “” denotes statistical equivalence as defined in [

For the piecewise monotone situation, we define

as an open interval with

. Let , be the smallest collection of disjoint open intervals such that is strictly monotone on each. By definition, is strictly monotonically increasing on if, for any two values such that, holds. Analogously, is strictly monotonically decreasing if for such that. Note that continuity may not be enough for the to be countable, where we define countability as in [

where. While the sequence partitions, does not necessarily partition; e.g., see

Let, i.e. the th decomposition of, and defines the indicator function of by (0) if is (not) in the interval. By definition, is strictly monotone. Thus, for, we have, implying that any function can be decomposed into the sum of its strictly monotone components,. Accordingly, we see that

, where, for

increasing on,

and, for decreasing on,

For all but the most simple functions, there is no closed form solution by which to define. Nevertheless, the above equations will allow for the calculation of and thus using computational methods.
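One simple computational route is to approximate the median of the transformed variable by dense sampling. The sketch below is not the paper's derivation: it assumes X is uniform over the smoothing window, uses the placeholder name `g` for the mapping, and replaces the cdf equations with a brute-force grid. It contrasts the monotone case, where the median of g(X) is g evaluated at the median of X, with a piecewise monotone bump.

```python
import numpy as np

def median_of_mapped_uniform(g, lo, hi, n=200_001):
    """Approximate the median of Y = g(X) for X ~ Uniform(lo, hi).

    Hypothetical helper: dense sampling stands in for the cdf derivation.
    """
    x = np.linspace(lo, hi, n)
    return float(np.median(g(x)))

# Monotone case: the median of g(X) equals g(median of X) = exp(1).
m_monotone = median_of_mapped_uniform(np.exp, 0.0, 2.0)

# Piecewise monotone case: a Gaussian bump centred in the window [-2, 2].
def bump(x):
    return np.exp(-x**2 / 2.0)

m_bump = median_of_mapped_uniform(bump, -2.0, 2.0)

# The residual at the bump centre is positive because the window median
# falls below the peak height.
residual_at_peak = bump(0.0) - m_bump
```

For the bump, half of the window satisfies |x| < 1, so the median is close to exp(-1/2) and the residual at the center is strictly positive.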

In the 2D continuous case, we introduce the function where the goal is to obtain an expression for. From standard probability theory such as in

[

and is the joint pdf for and. Note that this is the general case for obtaining the cdf of in terms of and. For our specialized median, however, our sample space for and must be defined in terms of another parameter, say, where controls the width of the smoothing window in each dimension.

methods can be used to compute M_{Z} and thus.

Let denote a “discrete” function, i.e. a function with discrete/countable realizations from the continuous function. By definition, is a function that maps from the support set (of the discrete random variable X) to the support set (for the discrete random variable Y). This section considers computational results associated with and its impact on the s-median.

Let and denote the discrete cdf and probability mass function (pmf), respectively, for the arbitrary random variable evaluated at the point. Let be as defined above. The calculation of the s-median proceeds by assuming Discrete Uniform() as defined in [

We can analogously represent using discrete random variables and as we did for the continuous case, namely

where indicates probability. If we assume that X~ Discrete Uniform(), then

and with and

denoting the ceiling and floor functions, respectively.

For the special case of a strict monotone discrete function on the full interval,

does not depend on the direction of monotonicity for. Figures 4(a)-(c) show the images from our technique applied to a simple one-dimensional discrete piecewise monotone function.
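The monotone special case can be checked numerically. The sketch below assumes a window of 2c+1 equally weighted points (the discrete uniform assumption) and takes the median as the (c+1)-th order statistic; the helper name `window_median` and the example sequences are illustrative only.

```python
import numpy as np

def window_median(values, center, c):
    """Median of the 2c+1 equally weighted values centred at `center`."""
    window = np.sort(values[center - c:center + c + 1])
    # With 2c+1 equally likely points, the median is the (c+1)-th order statistic.
    return window[c]

increasing = np.arange(11) ** 2   # strictly increasing sequence
decreasing = increasing[::-1]     # the same values in decreasing order

center, c = 5, 3
res_inc = increasing[center] - window_median(increasing, center, c)
res_dec = decreasing[center] - window_median(decreasing, center, c)
```

Both residuals are zero: over a strictly monotone window the center value is the median, whichever way the sequence runs.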

For the 2D discrete case, we define the sample space with the following definition.

Definition 2.1 Let be a discrete uniform on, be a discrete uniform on, and (x^{*}, y^{*})^{ } be a fixed point such that x^{*} and y^{*}, respectively. Then let be of the form,

Let define a mapping from the support sets and, of the random variables X and Y respectively, to the support set, for the random variable Z. We define the functions, and

, such that,

and, where is the smallest set of disjoint open intervals such that is strictly monotone on each,; and is the smallest set of disjoint open intervals such that is strictly monotone on each,. In this setting,

defines the cdf of . Nicely, all of the above quantities can be computed since we specified the distributions for and. Note the zero quantity in the third line is due to the intersection of the sets containing only the single point. The and depend on the length of and, respectively.

As in the continuous setting, there is usually no closed form solution for and thus, but the solution can be determined numerically. Figures 4(d)-(f) show the images from our technique applied to a simple two-dimensional discrete piecewise monotone function.

In this manuscript, we directly show the calculations for one and two dimensions. However, our method can be extended to higher dimensions () as demonstrated in [

In this section, we examine the properties of our procedure in light of Gaussian noise. In the 1D noise-free setting for image, it can be shown (with our proposed methods) that for any when is the location of the absolute maximum, and when the sequence contained in each dimension of the smoothing window is monotone. Further, under certain circumstances associated with 1D images, when is the location of a local minimum in our image; see [

Consider adding independent and identically distributed (i.i.d.) Gaussian noise to the 1D monotonic sequence, where. Let, where denotes the true signal at location, equals the step size at in the monotonic sequence such that, and denotes normally distributed noise of mean zero and standard deviation. We fix for our examples such that the signal-to-noise ratio () remains constant within each simulation.

We examine the case when at an arbitrary location since, as shown in [

for any when. As the step size increases relative to the standard deviation of the noise, naturally for any, we expect the probability to converge to one. Hence, for noise-free monotone images, for all.

In the presence of noise, however, the monotone signal becomes contaminated such that decreases as increases. Intuitively, as c increases, the number of points in the smoothing window increases, so there are more “opportunities” for other points to be the median, making the residual nonzero at that location.
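This effect can be explored with a small Monte Carlo sketch. It assumes a linear ramp with constant step size and i.i.d. Gaussian noise; the function name, parameter names, and default values are hypothetical and do not reproduce the paper's simulation settings.

```python
import numpy as np

def prob_zero_residual(c, step, sigma, n_sim=20000, seed=0):
    """Estimate P(residual = 0) at the centre of a noisy monotone window.

    The residual is zero exactly when the centre value is the window median.
    """
    rng = np.random.default_rng(seed)
    signal = step * np.arange(-c, c + 1)
    noisy = signal + rng.normal(0.0, sigma, size=(n_sim, 2 * c + 1))
    # The window length is odd, so the median is one of the observed values.
    return float(np.mean(noisy[:, c] == np.median(noisy, axis=1)))

# Usage: the same step-to-noise ratio with increasing arm length c.
estimates = {c: prob_zero_residual(c, step=1.0, sigma=1.0) for c in (1, 3, 5)}
```

With pure noise (`step=0`) each of the 2c+1 points is equally likely to be the median, so the estimate should be near 1/(2c+1). As the step grows relative to `sigma`, it approaches one.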

Given the local maximum at in a noise-free 1D spot (mountain),. In the presence of noise, we can estimate (via simulation) the probability that the 1D residual image intensity value at the local maximum location is positive; i.e., we can estimate when and

the absolute maximum location as a function of for different values of in a 1D image. Analogously,

To further illustrate the importance of the size of the smoothing window in detecting spots, Figures 8(a)-(b) show a set of four Gaussian spots with different standard deviations.

To confirm that large values of c more effectively find spots, Figures 11(a)-(c) show a sequence of three spots in order of increasing size with noise. Figures 11(d)-(f) are the R_{2,2} images corresponding to Figures 11(a)-(c), respectively. Figures 11(g)-(i) are the images corresponding to Figures 11(a)-(c).

detect larger spots in the presence of noise, and (2) in the presence of noise, larger values of c are more effective for detecting spots.

Collectively, Figures 6-11 illustrate the tradeoff that must be considered when determining the arm size for the s-median smoother. Large values of c are more likely to yield positive residuals at the maximum in the I image; however, the residuals associated with large values of c are also more likely to be nonzero in the presence of noise over monotonic regions. In other words, large values of c improve spot detection in noisy images, but they may also cause two distinct spots to merge into one. A balance between these two issues will be critical in choosing the optimal c value(s) for peak or spot finding (see Section 3.4).
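The first half of this tradeoff can be illustrated by simulation. The sketch below assumes a 1D Gaussian peak of unit height with additive i.i.d. Gaussian noise and estimates how often the residual at the true peak location is positive; the names `peak_sd` and `noise_sd` and the chosen values are illustrative assumptions.

```python
import numpy as np

def prob_positive_peak_residual(c, peak_sd, noise_sd, n_sim=20000, seed=0):
    """Estimate P(residual > 0) at the centre of a noisy 1D Gaussian peak."""
    rng = np.random.default_rng(seed)
    offsets = np.arange(-c, c + 1)
    signal = np.exp(-offsets**2 / (2.0 * peak_sd**2))
    noisy = signal + rng.normal(0.0, noise_sd, size=(n_sim, offsets.size))
    # The residual at the centre is positive when the centre exceeds the window median.
    return float(np.mean(noisy[:, c] > np.median(noisy, axis=1)))

# Usage: a short arm versus a long arm on the same peak and noise level.
p_small = prob_positive_peak_residual(c=1, peak_sd=3.0, noise_sd=0.2)
p_large = prob_positive_peak_residual(c=8, peak_sd=3.0, noise_sd=0.2)
```

A longer arm reaches further down the flanks of the peak, so the window median drops and the center is more likely to exceed it.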

In this section, we present the results from applying our method to biologically motivated chemical sensor array data, including mass spectrometry, gel electrophoresis, and spotted microarray data. In mass spectrometry, the relevant data are represented as spectra where the associated peaks in the intensity plots represent proteins (or peptides) present in a sample. Obtaining the location and intensity of these peaks aids in identifying sample proteins for further study. Gel electrophoresis data are represented in the form of 2D images composed of protein spots. Again, investigators are interested in detecting these features in order to isolate their location in the image and potentially extract the associated protein sample for further analysis. Finally, spotted microarray data are represented as two-dimensional images of spots in a 2D matrix structure. Feature detection is key in order for the genetic data to be properly summarized and thus for these technologies to have utility in diagnosing disease or assessing putative biomarkers.

The code to perform our method is written using FIASCO, a collection of statistical software created in the Department of Statistics at Carnegie Mellon University that was originally designed to analyze functional magnetic resonance imaging (fMRI) data. The computer code used for this work is available upon request from the corresponding author. In the following, we demonstrate our spot detection technique on the various example sets noted above.

Matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry is a technology that can be used to profile protein markers from tissue or bodily fluids, such as serum or plasma, in order to compare biological samples from different patients or different conditions. The output from a MALDI-TOF experiment consists of a measured intensity for each mass-to-charge ratio (m/z) value; see

Our s-median derived image can be used to detect peaks in MALDI-TOF images and thus locate peptides present in the sample. The spectrum for each sample consists of a single vector, I, so applying the s-median is equivalent to applying a running median to the I image. The dataset in question was obtained from the Proteomics Core Laboratory at Roswell Park Cancer Institute. We use these real data to examine the results of applying the s-median to a MALDI-TOF spectrum. In this example, we set the bandwidth (i.e. the value of in) to 500 data points, which corresponds to approximately a 95 m/z bandwidth.
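The running-median view of the 1D case lends itself to a compact sketch. The example below uses a synthetic spectrum (a decaying baseline plus two Gaussian peaks), not the Roswell Park data; the helper names, the edge-replication padding, and the `min_height` threshold are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def running_median(spectrum, c):
    """Running median with a window of 2c+1 points (edges replicated)."""
    padded = np.pad(np.asarray(spectrum, dtype=float), c, mode="edge")
    return np.median(sliding_window_view(padded, 2 * c + 1), axis=-1)

def find_peaks(spectrum, c, min_height):
    """Return indices of local maxima of the residual above min_height."""
    rough = np.asarray(spectrum, dtype=float) - running_median(spectrum, c)
    interior = rough[1:-1]
    is_peak = (interior > rough[:-2]) & (interior >= rough[2:]) & (interior > min_height)
    return np.flatnonzero(is_peak) + 1, rough

# Synthetic spectrum: smooth decaying baseline plus peaks at 500 and 1300.
x = np.arange(2000)
baseline = 5.0 * np.exp(-x / 800.0)
spectrum = (baseline
            + 3.0 * np.exp(-(x - 500) ** 2 / (2 * 10.0**2))
            + 2.0 * np.exp(-(x - 1300) ** 2 / (2 * 20.0**2)))

peaks, rough = find_peaks(spectrum, c=100, min_height=0.5)
```

Because the baseline is monotone, the running median tracks it closely and the residual is essentially flat away from the peaks, so no separate baseline-correction step is needed in this toy case.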

Another application of this spot detection technique is on images obtained from two-dimensional difference gel electrophoresis (2D-DIGE) experiments such as those described in [

For our 2D-DIGE examples, we will focus on images representing portions of the 2D gels examining morphogenesis in Drosophila obtained from the Minden laboratory at Carnegie Mellon University [28,29]. These images are obtained from a charge-coupled device (CCD) camera and the protein spots in these images allow the researchers to obtain a protein expression signature of the sample under a given condition or given time point. The images under study have been normalized according to the model described in [

Genetic microarrays are a popular analysis tool to study genetic changes associated with diseases such as breast cancer [

The classic equation, Data = Smooth + Rough, is well known to statisticians studying regression techniques or smoothing methods for datasets. In this manuscript, we demonstrate an application of this equation, resulting in a new operator where the residual image derived from a novel smoother can be used to locate spots or mountains in an image. This method combines the residual operator from statistics with the structuring element (cross-shaped window) in the field of mathematical morphology. Major advantages of our method include fast running time, broad application to many image types, and universal spot detection regardless of scale. That is, irrespective of a spot’s size and height, its location will be detected via our method. This aspect alleviates the need to alter or change the grey scales in an image when searching for spots of varying intensities.

As demonstrated, this method uses the s-median operator to smooth images. Other window operators can be considered; however, they have different implications for the residual image. For example, if a mean cross (i.e. “s-mean”) smoother is used on the Gaussian mountain in

grid or “box” shaped window sequence is used. Here, we now obtain a residual image that looks like a starburst instead of a cross. As a result, the spot center is now potentially more difficult to identify. The shape of the smoothing window (cross vs. box) and the summary statistic used (median versus mean) thus affect the R image and the ability to detect the mountains in an image.

The issue of rotation invariance is an important concept within mathematical morphology operators used in image detection. Rotation invariance implies that the resultant image does not change when arbitrary rotations are applied to its input argument. In general, our spot finding method is rotation invariant for the Gaussian spots with zero correlation (e.g., spots of the type shown in

When using the s-median operator for spot finding, the major consideration is the arm length of the smoothing window or, equivalently, the number of pixels included in the smoothing window (structuring element). The s-median smoother naturally removes noise from the image, hence the size of the smoothing window essentially decides the amount of smoothing applied to the dataset. From Figures 11 and 17, the choice of c is critical: choosing c too large will oversmooth the image and blend spots together, while choosing c too small will undersmooth the image and cause spurious spots due to noise to appear as real spots. Since choosing c is essentially choosing a smoothing parameter, there are several available methods to consider when selecting an optimal value. The general method for choosing smoothing parameters is based on cross validation algorithms described in [

The optimal choice of c is related to the larger statistical subject of the bias-variance tradeoff. Choosing c too small leads to a highly variable residual image (too many spurious spots), while choosing c too large leads to a residual image with a large bias term (missing small spots). Similarly, the optimal choice of c is related to several other problems in statistics, such as the optimal choice of bandwidth in kernel density estimation [

A major concern in proposing image analysis algorithms is comparing competing methods. Unfortunately, due to the cost of these technologies and the lack of a gold standard for measuring the signal of the chemical sensor, it is difficult to design statistically appropriate benchmarks or quality control studies to assess these image analysis techniques for a given chemical sensor. Although it is relatively simple to simulate “bumps” or mountains in an image, the difficulty arises in deciding the type of noise to impose upon the simulated images. In the presence of most noise distributions, the success of our proposed method will depend on the choice of smoothing parameter, c. It is outside the scope of this manuscript to perform a thorough comparison of competing spot finding algorithms against a set of noise distributions. For future work, we propose performing comparisons such as those in [42, 43] to establish conditions in simulated and real datasets where our methods are superior to competing methods. The main goal of this manuscript is to establish a new method for spot finding in images and demonstrate its performance on a variety of different biological images derived from chemical sensors.

This manuscript develops a new method for spot finding and illustrates the technique’s great utility and applicability within several chemical sensor datasets such as mass spectrometry spectra, gel electrophoresis images, and microarray images. This method can be easily extended to mountains in k dimensions and can be extended to further quantify the amount of signal present in other emerging chemical sensors with Gaussian profiles.

The authors are grateful to the Roswell Park Cancer Institute Proteomics laboratory and the Minden laboratory at Carnegie Mellon University for generously providing their data to illustrate our method. We also thank the reviewers of this manuscript for their valuable feedback and insights.