Virgo Cluster Membership Based on K-Means Algorithm

The Virgo cluster of galaxies is of great importance for studying the development of the universe because of its proximity to Earth and its position at the center of the local supercluster. The main problem facing Virgo cluster studies is that the cluster shares its right ascension (RA) and declination (DEC) ranges with a large number of background and foreground galaxies. This study aims to estimate Virgo cluster membership geometrically and statistically. It employs Virgo cluster data prepared by Harvard University. The radial velocity (RV) data of the Virgo cluster were treated and used as a substitute for the galaxies' missing third dimension, taking advantage of the proportionality between radial velocity and distance. The data were processed with the K-means algorithm, using Matlab 2014, together with visual and logical exclusion of outlying galaxies, to determine the rational center of the Virgo cluster. The results were presented, compared and discussed. Finally, the distances of galaxies from the Virgo cluster center were used along with normal probability distribution characteristics to identify the most probable Virgo cluster members within the Virgo range. The results showed that, out of 17,466 objects surveyed in the Virgo range, only a few galaxies were estimated to be genuine Virgo members.


Introduction
The study of the Virgo cluster is important for exploring the development stages of the universe. It is the nearest large, high-density cluster to Earth, at about 19 megaparsecs (Mpc) [1]. Its proximity allows it to be monitored with higher accuracy, which makes it an ideal laboratory for testing hypotheses of structure formation in the universe [2]. The Virgo cluster is a somewhat irregular cluster with a concentration of galaxies at its center. In three-dimensional (3D) space, the Virgo cluster constitutes the nucleus of the Local Super Cluster (LSC) of galaxies [3]. Its location in the crowded center of the LSC causes it to share its right ascension (RA) and declination (DEC) ranges with a large number of unrelated background and foreground galaxies [4]. These galaxies strongly influence investigations of the Virgo cluster structure [5].
Optimization techniques rely primarily on arranging the states of candidate members within the search space according to chosen optimization characteristics. Once the organized states of all members are reached, selecting the optimal solution is a straightforward process [6]. Different state identification techniques are employed for different optimization problems. For cluster membership assessment, a number of optimization techniques with varying degrees of complexity and accuracy have been employed, such as hierarchical clustering techniques [7], the expectation-maximization (EM) algorithm [8], density-based clustering algorithms [9] and the Cuckoo Search algorithm [10]. K-means is one of the most promising of these techniques and has been employed for geometrical galaxy membership identification [11].
K-means is used to group galaxies into a given number K of galaxy clusters and to identify the clusters' centers. This is done by optimizing the Euclidean distances between galaxies and the identified cluster centers [12]. These distances are then used to evaluate the most probable cluster membership of each galaxy.
Another important technique is based on normal distribution characteristics [13]. It can be applied to the Euclidean distances between galaxies' locations and their cluster center. In this technique, the probability of each galaxy's membership in the cluster is evaluated, and a probability threshold is used to differentiate between cluster members and field galaxies.
The current study attempts to differentiate between Virgo cluster galaxies and unrelated field galaxies using 2D K-means and 3D K-means as well as other optimization techniques. The process was performed using Matlab 2014 on the Windows 10 operating system, and the results were plotted using Grapher©. The proposed technique was verified against other reviewed studies.

Related Work
Recently, computational techniques coupled with greater computational power have expanded astronomical capacity, leading to a better understanding of the universe. Studying different stellar objects and their development processes leads to a better understanding of the past, present and future of our planet, solar system and galaxy.
In [14], the authors presented an automated stellar cluster analysis tool that applies standard tests to stellar clusters to determine their basic parameters. The tool has a set of functions used to obtain precise and objective values for a given cluster's center, radius, luminosity and integrated color magnitude.
In [15], a data clustering technique is proposed. It is based on descriptive data analysis and can be used to uncover the structure of multivariate data sets. It relies on the K-means clustering technique. K-means clustering is a center-defining model in which each cluster is represented by one vector. It was developed by MacQueen [16] as a tool to give researchers qualitative and quantitative insights into large multivariate data sets. K-means clustering has proved useful in data mining and investigative data analysis and is used to provide unique and definitive means of data grouping. Large data sets are the result of advances in information technology and the growth of computational power. The popularity of K-means over other clustering techniques stems from its ease of implementation, low memory requirements and computational efficiency.
In [17], K-means clustering was employed to develop a new method for open cluster membership determination. The developed algorithm allows efficient discrimination between cluster members and field stars. The results showed that the algorithm can evaluate the membership probability of stars without assumptions regarding the spatial distribution of stars in the cluster or field.
The researchers in [18] investigated the membership of the Virgo cluster, the closest and consequently most studied cluster of galaxies. They employed classical methods to evaluate Virgo cluster membership probability. The difficulty with Virgo stems from sharing its field with a large number of non-Virgo galaxies and stars. Several reasons favor studies of Virgo, such as its nearness, which makes it a main candidate for surveys. Virgo also exhibits a full range of galaxy luminosities and morphological types. Faint Virgo galaxies proved difficult to separate from the overwhelming number of background galaxies, which was especially an issue in the older literature. Investigations of galaxies are consistently biased in favor of massive and more luminous galaxies. This trend has several causes, such as the underrepresentation of low-luminosity galaxies in most galaxy catalogs, owing to the larger volume over which more luminous galaxies can be sampled.

K-Means
The current research employed 2D K-means and 3D K-means as the tools to estimate the Virgo cluster center. This section presents the 2D and 3D K-means clustering algorithms. Both operate on a set of n member coordinate vectors, $X = \{x_1, x_2, \ldots, x_n\}$. The K-means process is carried out in two steps [19].
In the first step, the members are divided into a given number K of groups, $G = \{G_1, G_2, \ldots, G_K\}$, by minimizing the mean squared distance between the members and their group's predicted center [20] as follows:
$$\underset{G}{\arg\min} \sum_{k=1}^{K} \sum_{x_i \in G_k} \lVert x_i - c_k \rVert^2$$
where $c_k$ is the center of group $G_k$. In the second step, each group center is relocated to the arithmetic mean of the locations of the group's members [20] as follows:
$$c_k = \frac{1}{|G_k|} \sum_{x_i \in G_k} x_i$$
Finally, the first and second steps are repeated until no more changes in group membership are observed.
The 2D and 3D K-means are illustrated in Algorithm 1 and Algorithm 2.

2D K-Means Algorithm
Input:
$X = \{x_1, x_2, \ldots, x_n\}$ // set of n data items
K // number of desired clusters
Output: A set of K clusters and $c_k$, the center of each cluster.
Steps:
Repeat
• Calculate the distance between each data point and the cluster centers using: $d(x_i, c_k) = \sqrt{(x_{i,1} - c_{k,1})^2 + (x_{i,2} - c_{k,2})^2}$
• Assign each data point to the cluster whose center is nearest to it.
• Recalculate each cluster center using: $c_k = \frac{1}{|G_k|} \sum_{x_i \in G_k} x_i$, where $G_k$ is the set of data points assigned to cluster k.
Until the centers do not change.

3D K-Means Algorithm
Input:
$X = \{x_1, x_2, \ldots, x_n\}$ // set of n data items
K // number of desired clusters
Output: A set of K clusters and $c_k$, the center of each cluster.
Steps:
Repeat
• Calculate the distance between each data point and the cluster centers using: $d(x_i, c_k) = \sqrt{(x_{i,1} - c_{k,1})^2 + (x_{i,2} - c_{k,2})^2 + (x_{i,3} - c_{k,3})^2}$
• Assign each data point to the cluster whose center is nearest to it.
• Recalculate each cluster center using: $c_k = \frac{1}{|G_k|} \sum_{x_i \in G_k} x_i$, where $G_k$ is the set of data points assigned to cluster k.
Until the centers do not change.
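The repeat loop of the two algorithms above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the Matlab code used in the study: the function name `kmeans`, the random choice of initial centers, and the `max_iter` and `seed` parameters are assumptions introduced here. The same function handles the 2D and 3D cases, since the distance is computed over however many coordinate columns are supplied.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means for an (n, d) array of points; d = 2 or 3 here."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: the initial centers are k distinct data points picked at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Euclidean distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its nearest center.
        new_labels = dists.argmin(axis=1)
        # Stop once no point changes group.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of its group; an empty group keeps its old center.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels
```

With K = 1 every point falls in the single group, so the loop settles on the arithmetic mean of the data after one update.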
The proposed method consists of two steps. The first is to employ 2D K-means and 3D K-means to evaluate the most probable Virgo cluster center, with the data treated as one group (K = 1), and consequently to evaluate the distance of each galaxy from the Virgo cluster center. The second is to use the RA and DEC positions with the normal probability distribution to derive the membership probability of Virgo cluster galaxies. Finally, a set value of standard deviation is used to eliminate the most improbable galaxies and identify the most probable Virgo cluster galaxies.
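Because the data are treated as a single group (K = 1), the first step reduces to taking the arithmetic mean of the coordinates and measuring each galaxy's Euclidean distance from it. The sketch below illustrates this under stated assumptions: the function name `center_and_distances` is hypothetical, and the coordinate columns are used exactly as supplied, since the text does not say whether RA, DEC and RV were rescaled to comparable units before clustering.

```python
import numpy as np

def center_and_distances(coords):
    """Return the K = 1 center and each row's distance from it.

    coords: (n, 2) array of [RA, DEC] or (n, 3) array of [RA, DEC, RV].
    """
    coords = np.asarray(coords, dtype=float)
    # With one group, the K-means center is the arithmetic mean of all members.
    center = coords.mean(axis=0)
    # Euclidean distance of every galaxy from that center.
    distances = np.linalg.norm(coords - center, axis=1)
    return center, distances
```

Passing two columns gives the 2D case and passing three columns gives the 3D case.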

Normal Probability Distribution
Normal probability distribution characteristics evaluate the probability that an individual data point is a member of a given data set [21]. For this purpose, the standard normal random variable Z is evaluated for the data point according to the following equation [22]:
$$Z = \frac{x - \bar{x}}{\sigma}$$
Finally, the probability is obtained from the following equation [13]:
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-Z^2/2}$$
where:
x is the investigated data point.
$\bar{x}$ is the arithmetic mean of the data set under consideration.
σ is the standard deviation of the data set under consideration.
The data set arithmetic mean is calculated as follows:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where n is the number of data points. Standard deviation is a measure that quantifies the amount of variation in a set of data values. A lower standard deviation indicates that the data points are gathered closer to the data set mean (the center of the Virgo cluster of galaxies), while a higher standard deviation indicates that the data are spread further from the data set mean [23]. Standard deviation is calculated as follows:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Perfectly normally distributed data would have equal mean, median and mode values. The distribution would also be asymptotic and symmetrical, centered on the data mean value. The interval between the mean and (mean + σ) would enclose 34.1% of all data elements. The intervals between (mean + σ) and (mean + 2σ), and between (mean + 2σ) and (mean + 3σ), would enclose 13.6% and 2.1% of all data elements, respectively [24].
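The quantities in this section can be checked numerically with the Python standard library. The sketch below is illustrative only: the function names are hypothetical, and the normal density is used as the probability expression on the assumption that this is the form intended by [13]. The helper `fraction_between` reproduces the band percentages quoted above from the normal cumulative distribution.

```python
import math

def z_score(x, mean, sigma):
    """Standard normal variable Z for a data point x."""
    return (x - mean) / sigma

def normal_pdf(x, mean, sigma):
    """Normal probability density at x (assumed form of the probability equation)."""
    z = z_score(x, mean, sigma)
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fraction_between(z_low, z_high):
    """Fraction of a normal population whose Z lies between z_low and z_high."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(z_high) - cdf(z_low)

# Bands on one side of the mean: about 34.1%, 13.6% and 2.1% of the data.
bands = [fraction_between(0, 1), fraction_between(1, 2), fraction_between(2, 3)]
```

The three band values are one-sided, so the full interval (mean − σ, mean + σ) encloses roughly twice the first value.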

Virgo Cluster Data
The raw Virgo cluster data used in the current study are from the Virgo subset of the Two Micron All Sky Survey (2MASS) catalogs. The data were delivered by the Center for Astrophysics (CfA) of Harvard University, as reported in 2007 [25]. The raw Virgo data contain all the galaxies with RA from 11.0 to 14.0 hours and DEC from −10 to +35 degrees. This roughly rectangular region is centered around 12.5 hours and +12 degrees, where Harvard believes that the Virgo cluster center is located. In 2MASS, RA is given in hours, minutes and seconds, running from 0 to 24 hours, and DEC is given in degrees, minutes and seconds, running from −90 to +90 degrees. The heliocentric radial velocity (RV) of the Virgo galaxies is used as a substitute for the galaxies' missing third dimension, taking advantage of the proportionality between radial velocity and distance [26].
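The coordinate formats described above can be converted to decimal values before clustering, and the proportionality between RV and distance is presumably Hubble's law, v = H0 · d. The sketch below shows both conversions under stated assumptions: the function names are hypothetical, and the default Hubble constant of 70 km/s/Mpc is an illustrative value that does not come from the study.

```python
def ra_hms_to_degrees(hours, minutes, seconds):
    """Convert RA in hours, minutes, seconds to decimal degrees (1 hour = 15 degrees)."""
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)

def dec_dms_to_degrees(sign, degrees, minutes, seconds):
    """Convert DEC in degrees, arcminutes, arcseconds to decimal degrees.

    sign is +1 or -1 and is passed separately so that values between
    -1 and 0 degrees keep their sign.
    """
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)

def hubble_distance_mpc(rv_km_s, h0=70.0):
    """Distance in Mpc from radial velocity via v = H0 * d.

    h0 is in km/s/Mpc; the default is an assumed illustrative value.
    """
    return rv_km_s / h0

# The nominal field center quoted in the text: 12.5 hours, +12 degrees.
center_ra_deg = ra_hms_to_degrees(12, 30, 0)
center_dec_deg = dec_dms_to_degrees(+1, 12, 0, 0)
```

Because the relation is a simple proportionality, using RV directly as the third coordinate differs from using the Hubble distance only by a constant scale factor.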

Result
The first methodology employed is based on 2D K-means; the results are shown in Figure 1. The square shape of the plotted data supports the Harvard suggestion that the data include background and foreground objects that do not belong to the Virgo cluster.
To further investigate the location of the data center, the RA and DEC data along with the RV values, representing the third dimension, and a K value of 1 were fed to the 3D K-means. To study the distribution of the raw data and their probable membership in the Virgo cluster, the RA and DEC data were plotted with RV in 3D.
In the second step, the normal distribution parameters of the derived RA and DEC positions of galaxies were used to derive the probability of each galaxy's membership in the Virgo cluster. Finally, a threshold of σ, chosen to enclose a set fraction of the galaxies, is applied to the data set. The resulting set of galaxies represents the most probable Virgo cluster members. The plotted data indicate the most probable Virgo cluster galaxies among all the 2D data of the 2MASS catalog, as shown in Figure 3. The circle in Figure 3 indicates the most probable range of Virgo cluster members.
[Fit statistics recovered from the figure: coefficient of determination, R-squared = 0.0351475; residual mean square, sigma-hat-squared = 34.2374.]
[Figure caption: The most probable range of Virgo cluster member galaxies in the 2MASS data.]
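One way to read the σ threshold step is as a circular cut around the cluster center: galaxies whose distance from the center exceeds a chosen multiple of the spread are discarded. The sketch below illustrates that reading only. The function name `select_within_sigma`, the use of the root-mean-square distance as the spread, and the default `n_sigma` of 1 are all assumptions, since the threshold value used in the study is not recoverable from the text.

```python
import numpy as np

def select_within_sigma(distances, n_sigma=1.0):
    """Boolean mask of galaxies lying within n_sigma spreads of the center.

    distances: each galaxy's distance from the cluster center.
    The spread is taken as the root-mean-square distance (an assumption).
    """
    distances = np.asarray(distances, dtype=float)
    spread = np.sqrt(np.mean(distances ** 2))
    return distances <= n_sigma * spread

# Hypothetical distances: five galaxies near the center and two far outliers.
example = np.array([0.1, 0.2, 0.15, 0.1, 0.2, 22.0, 23.0])
mask = select_within_sigma(example, n_sigma=1.0)
```

In 2D the retained galaxies fall inside a circle around the center, which matches the circular range drawn in Figure 3; a smaller `n_sigma` gives a tighter and more conservative membership list.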

Conclusion
In this paper we presented a new method for galaxy cluster membership determination based on the K-means algorithm. 3D K-means identification, along with normal distribution characteristics, was used to identify the most probable galaxies of the Virgo cluster as well as the Virgo cluster center. To demonstrate the quality of the method and to test how well it handles real clusters, the proposed technique was applied to the raw Virgo cluster data in 2MASS as supplied by Harvard. The probability results showed that the Virgo cluster contains 1300 galaxies in the 2MASS data. These results support the use of the proposed algorithm as a better and clearer way to determine group membership. The method can handle large data sets both objectively and automatically, and it includes functions to identify the structure of the cluster. The developed technique proved able to perform global searches successfully to solve the set problem.