A new projection method for biological semantic map generation

Low-dimensional representation is a convenient method of obtaining a synthetic view of complex datasets and has been used in various domains for a long time. When the representation is related to words in a document, this kind of representation is also called a semantic map. The two most popular methods are self-organizing maps and generative topographic mapping. The second approach is statistically well-founded but far less computationally efficient than the first. On the other hand, a drawback of self-organizing maps is that they do not project all points, but only map nodes. This paper presents a method of obtaining the projections for all data points complementary to the self-organizing map nodes. The idea is to project points so that their initial distances to some cluster centers are as conserved as possible. The method is tested on an oil flow dataset and then applied to a large protein sequence dataset described by keywords. It has been integrated into an interactive data browser for biological databases.


INTRODUCTION
Thanks to the availability of the human and other genomes and the rapid progress of biotechnologies and information technologies, numerous large biomedical datasets have been generated. Modern biomedical information thus corresponds to a high volume of heterogeneous data that doubles in size every year and that covers very different data types, including phenotypic data, genotypic data as well as standards, processes, protocols or treatments used to generate information from raw data. In this context, systemic approaches are now needed to store, analyze and compare the huge amount of relevant information.
In addition, the knowledge provided by classical query services on biological data is often unsatisfactory (e.g. a list of proteins or sequences) and there is a need for user-friendly visual representations of the data. Such a representation exists and is called a feature or semantic map. It is used to visualize "land maps" in two or three dimensions that represent, for example, the distribution (similarity and neighborhood) of protein annotations in biological databases. When query results are represented on the map, the distribution of the proteins can be easily observed, as well as their proximity to clusters labeled according to their content. In addition, it is straightforward to superimpose the information obtained from additional requests. Thus, a semantic map can greatly facilitate the interpretation of results from large scale data analyses. To quote a few examples, semantic maps have already been used in fluid mechanics [1], astronomy [2], internet data mining [3,4], scientific literature mining [5] and biology [6].
Many low-dimensional representation methods have been devised [5,7,8,9] and two of the most popular are the WEBSOM method [9] and the Generative Topographic Mapping (GTM) [1]. These two methods are briefly outlined below.
The first and second author contribute equally to this paper.

SciRes Copyright © 2010 JBiSE

WEBSOM originates from self-organizing maps [10], a classification algorithm in which nodes move towards cluster centers. In WEBSOM, the nodes are fixed on a two-dimensional grid and at the same time live in the space of the dataset, typically a $p$-dimensional space $\mathbb{R}^p$. First, a point $y$ is picked at random from the dataset. Next, the closest node $m_c$ in $\mathbb{R}^p$ is selected and then each node $m_j$ moves towards $y$ according to the equation
$$m_j(t+1) = m_j(t) + \alpha(t)\, h_{cj}(t)\, \bigl(y - m_j(t)\bigr),$$
where $\alpha(t)$ is the learning rate decreasing in time and $h_{cj}(t)$ is a neighborhood function in the two-dimensional grid. These steps are then iterated for all data points. The initialization of the nodes in the p-dimensional space can be performed randomly, but a more effective method is to select points along the two first principal axes of the dataset [4]. Finally, the dataset is used again by assigning each point to its closest node in the p-dimensional space using a Euclidean distance. Then, for each node, the number of points it has captured is taken as its density up to a given scaling factor (the size of the dataset).
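The node update and the final density assignment described for WEBSOM can be sketched as follows. This is a minimal illustration under stated assumptions, not the WEBSOM implementation: the function names, the linear decay of the learning rate, the Gaussian neighborhood and the random initialization from data points are all choices made here for the example.

```python
import math
import random


def train_som(data, rows, cols, epochs=10, alpha0=0.5, sigma0=None, seed=0):
    """Train a small self-organizing map; returns node vectors keyed by grid cell."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 if sigma0 is not None else max(rows, cols) / 2.0
    # Assumed initialization: each node starts at a randomly chosen data point.
    nodes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for y in rng.sample(data, len(data)):      # visit the points in random order
            frac = step / total_steps
            alpha = alpha0 * (1.0 - frac)          # learning rate decreasing in time
            sigma = sigma0 * (1.0 - frac) + 1e-3   # shrinking neighborhood width
            winner = min(nodes, key=lambda g: math.dist(nodes[g], y))
            for g, m in nodes.items():
                grid_d2 = (g[0] - winner[0]) ** 2 + (g[1] - winner[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # neighborhood on the grid
                for k in range(dim):
                    m[k] += alpha * h * (y[k] - m[k])        # move the node towards y
            step += 1
    return nodes


def node_densities(data, nodes):
    """Fraction of the dataset captured by each node (closest node, Euclidean distance)."""
    counts = {g: 0 for g in nodes}
    for y in data:
        counts[min(nodes, key=lambda g: math.dist(nodes[g], y))] += 1
    return {g: c / len(data) for g, c in counts.items()}
```

Dividing the counts by the size of the dataset corresponds to the scaling factor mentioned in the text; only the map nodes receive coordinates on the grid, which is the limitation the method of this paper addresses.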
The generative topographic map (GTM) [1] is a statistical method which is provably (locally) convergent and which does not require a shrinking neighborhood or a decreasing step size. It is a generative model: the data is assumed to arise by probabilistically picking points in a low-dimensional space and mapping them to the observed high-dimensional input space. The statistical model can be described in the following way: a latent point $x$ is mapped to the data space by $y(x; W) = W\phi(x)$, with the basis functions $\phi$ typically equal to radially symmetric Gaussians centered on the nodes of a two-dimensional grid, and the observed data are distributed around $y(x; W)$ with isotropic Gaussian noise of inverse variance $\beta$. The parameters $W$ and $\beta$ of the model are estimated through the expectation-maximization (EM) algorithm [11]. This model can be considered to be the probabilistic counterpart of SOM/WEBSOM. However, the WEBSOM method is quicker than GTM when large amounts of data must be dealt with, especially if the winner selection is optimized so that millions of documents and nodes can be treated [4].
An alternative choice is to follow Flexer's approach [12], which first clusters the points in the data space and then projects the cluster centers using Sammon's multidimensional scaling method [13]. However, this means that only a subset of points is effectively projected. In this paper, we present a complementary method that projects all points using their distances to the cluster centers.
First, this new projection method is presented; then it is evaluated on a benchmark dataset and compared to other methods. Finally, it is used in the results section to generate a semantic map in the context of a new integrative navigator for biological databases.

METHODS
The principle of the presented method is to project points after they have been clustered and the cluster centers have been projected onto a two-dimensional map. This is done by conserving as much as possible the original distances between the points and the cluster centers. Basically, for each point indexed by $i$, the two-dimensional coordinates $x_i$ are sought so as to minimize the difference between the distances computed in the $p$-dimensional data space and those computed on the map.
This comes down to finding the point $x_i$ in two dimensions minimizing the following function: [...] with $H$ the Hessian and $\nabla$ the gradient of this function. The function to be optimized is not convex, as the Hessian is not always positive semi-definite. To show this, it is sufficient to find a point $X$ verifying $X' H X < 0$. In particular, we show that $H_{11}$ can be negative, which is also sufficient. First let us note that [...] Consequently, a global optimization process was performed using different initial values. Each cluster center projection was used as an initial value and the best solution after convergence was kept.
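The multi-start optimization described above can be illustrated with the following sketch. It assumes an unweighted least-squares objective, $f(x) = \sum_k (\lVert x - c_k \rVert - d_k)^2$, where $c_k$ are the projected cluster centers and $d_k$ the data-space distances from the point to the centers; the exact function used in the paper may differ (for instance by a weighting). The steepest-descent fallback, the step halving and the small offset applied to each start are safeguards added here, and all function names are hypothetical.

```python
import math


def stress(x, centers_2d, target_d):
    """Sum of squared differences between map distances and data-space distances."""
    return sum((math.dist(x, c) - d) ** 2 for c, d in zip(centers_2d, target_d))


def grad_hess(x, centers_2d, target_d, eps=1e-12):
    """Analytic gradient and symmetric 2x2 Hessian (h00, h01, h11) of `stress` at x."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for c, d in zip(centers_2d, target_d):
        u0, u1 = x[0] - c[0], x[1] - c[1]
        r = max(math.hypot(u0, u1), eps)
        g0 += 2.0 * (r - d) * u0 / r
        g1 += 2.0 * (r - d) * u1 / r
        iso = 1.0 - d / r          # negative when the map distance is too short
        w = d / r ** 3
        h00 += 2.0 * (iso + w * u0 * u0)
        h01 += 2.0 * (w * u0 * u1)
        h11 += 2.0 * (iso + w * u1 * u1)
    return (g0, g1), (h00, h01, h11)


def newton_from(start, centers_2d, target_d, max_iter=100, tol=1e-12):
    """Newton-Raphson iterations with a steepest-descent fallback and step halving."""
    x = tuple(start)
    fx = stress(x, centers_2d, target_d)
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = grad_hess(x, centers_2d, target_d)
        if math.hypot(g0, g1) < tol:
            break
        det = h00 * h11 - h01 * h01
        if h00 > 0.0 and det > 0.0:      # Hessian positive definite: Newton step
            s0 = (h11 * g0 - h01 * g1) / det
            s1 = (h00 * g1 - h01 * g0) / det
        else:                            # non-convex region: follow the gradient
            s0, s1 = g0, g1
        t, improved = 1.0, False
        for _ in range(40):              # halve the step until the stress decreases
            cand = (x[0] - t * s0, x[1] - t * s1)
            f_cand = stress(cand, centers_2d, target_d)
            if f_cand < fx:
                x, fx, improved = cand, f_cand, True
                break
            t *= 0.5
        if not improved:
            break
    return x, fx


def project_point(point, centers_hd, centers_2d, offset=1e-3):
    """Project one data point; every projected cluster center serves as a start."""
    target_d = [math.dist(point, c) for c in centers_hd]
    # The offset avoids starting exactly on a center, where the gradient is undefined.
    starts = [(c[0] + offset, c[1] + offset) for c in centers_2d]
    return min((newton_from(s, centers_2d, target_d) for s in starts),
               key=lambda res: res[1])
```

Under this assumed objective, the first diagonal entry of the Hessian contains the terms $2(1 - d_k / r_k)$ with $r_k = \lVert x - c_k \rVert$, which become negative when the point lies closer to a center on the map than in the data space. This matches the non-convexity noted in the text and motivates trying several starting points.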

Validation Using the Oil Flow Dataset
To validate the new point projection method, a previously established oil flow dataset [14] was used as a benchmark. This training dataset is available at http://www.ncrg.aston.ac.uk/GTM/ and contains 1000 points in 12 dimensions corresponding to 12 measurements on the mixture of oil, water and gas passing through a pipeline. The three phases in the pipe can belong to three different configurations corresponding to laminar, homogeneous and annular flows. First, the dataset was clustered into 15 clusters and the cluster centers were projected according to Sammon's multidimensional scaling method [13]. Then the 1000 points were projected in two dimensions using the method described above. The results are shown in Figure 1, where it can be seen that the three different groups are rather well linearly separated. The groups obtained with the GTM and principal component analysis (PCA) methods are shown in Figures 2 and 3 respectively. In order to objectively measure the quality of these results, we computed the ratio of the between-class inertia to the total inertia for each method. For our method, GTM and PCA, we obtained ratios of 0.83, 0.25 and 0.23 respectively, thus confirming the visual impression. Nevertheless, it should be stated that, if only separation is desired and not specifically linear separation, GTM performs better, even though it has the drawback of making the underlying grid very visible.

Semantic Map Generation for Biological Database
The Laboratory of Genomics and Integrative Bioinformatics (LGBI) at the IGBMC Strasbourg has developed a new high-performance biomedical information system, called the BIRD System [15,16]. BIRD is able to integrate heterogeneous data very quickly, either from the large generalist databases (sequence, structure, function and evolution, etc.) or from specialized databases dedicated to high throughput biology (transcriptomics, interactomics, etc.), in a relational database (IBM DB2). Thus, it allows massive sets of biomedical data to be organized according to real world requirements. An original biological query engine, called BIRDQL, has been designed to facilitate access to the heterogeneous databases and to allow pertinent information extraction via a web server. This system has been used in the Decrypthon computing grid [17] in order to provide data to the runtime applications. To complete the visualization and analysis functionalities of the BIRD System, the new method described above to build semantic maps was integrated in the BIRD query engine (BIRDQL). The maps can be used to explore the data using a combination of high level queries and area selections (Figure 4). The method was tested by building a semantic map of the Uniprot database [18] using the keyword descriptions for each protein. After removal of redundant vectors, we obtained 60,000 vectors in a 914-dimensional space corresponding to the 914 keywords extracted from about 6 million proteins. In the following lines, to avoid focusing on the numerical details, we will consider $n$ proteins described by keywords, where $n$ stands for 60000 and the number of keywords is 914.
Before projecting the points, some preliminary steps were necessary:
Step 1: dimension reduction. The proteins were described by keywords and were thus represented by points in a space with one dimension per keyword. As in the preprocessing step of WEBSOM [3,4], an initial dimension reduction was performed to reduce the number of coordinates using random projection directions. More specifically, random vectors [...], giving the reduced points $y_1, \ldots, y_n$.
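A random projection of this kind can be sketched as follows. The details of the original step are not recoverable from the text, so this is only an assumed variant: the directions are drawn from a Gaussian distribution, the target dimension is a free parameter and the function name is hypothetical.

```python
import random


def random_projection(points, target_dim, seed=0):
    """Project points onto `target_dim` random Gaussian directions."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    # One random direction per output coordinate, scaled so that squared
    # lengths are preserved on average.
    scale = 1.0 / target_dim ** 0.5
    directions = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
                  for _ in range(target_dim)]
    # Each reduced coordinate is the dot product of the point with one direction.
    return [[sum(r_k * x_k for r_k, x_k in zip(direction, point))
             for direction in directions]
            for point in points]
```

For example, `random_projection(keyword_vectors, 50)` would map binary keyword vectors to 50 coordinates; the value 50 is purely illustrative. Because the map is linear, distances between points are approximately preserved when the target dimension is large enough.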
Step 2: mixture models clustering. In a second step, these points were clustered using mixture models. Mixture models are a powerful method to cluster datasets of points described by coordinates. The points are assumed to be independent realizations from a mixture of several distributions. Here the mixture is only briefly described for $G$ components. A general presentation of this method and its applications can be found in [19,20,21,22].
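As an illustration of clustering with a mixture model, the following sketch implements a classification-EM style iteration, in which each point is assigned to its most probable component before the parameters are re-estimated. It assumes spherical Gaussian components with a fixed common variance, which is a simplification chosen for this example and not necessarily the model used here; the function name and parameters are hypothetical.

```python
import math


def cem(points, means, sigma2=1.0, max_iter=100):
    """Classification EM for a mixture of spherical Gaussians with fixed variance."""
    g = len(means)
    means = [list(m) for m in means]
    weights = [1.0 / g] * g
    labels = None
    for _ in range(max_iter):
        # E-step and C-step: assign each point to the component with the
        # largest weighted log-density (hard assignment).
        new_labels = []
        for y in points:
            scores = [math.log(max(weights[k], 1e-300))
                      - math.dist(y, means[k]) ** 2 / (2.0 * sigma2)
                      for k in range(g)]
            new_labels.append(max(range(g), key=scores.__getitem__))
        if new_labels == labels:
            break                      # the partition is stable
        labels = new_labels
        # M-step: re-estimate weights and means from the current partition.
        for k in range(g):
            members = [y for y, lab in zip(points, labels) if lab == k]
            weights[k] = len(members) / len(points)
            if members:                # an empty component keeps its old mean
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means, weights
```

With equal weights and a common variance, this iteration behaves like k-means; the mixing weights make it favor components that already hold many points.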
The estimation of the different coefficients of the mixture model is commonly performed via the EM (Expectation-Maximization) algorithm of Dempster [11]. Here, in order to simplify the estimation, a variant of the EM algorithm called CEM was used [22]. In this application, $G$ was chosen to be equal to 30.
Step 3: cluster centers projection. Once the clusters were obtained, the centers of gravity were computed in the p-dimensional space. Then, multidimensional scaling (MDS) [23] was applied on the cluster centers to produce two-dimensional coordinates. MDS was used because Sammon's method [13] failed on this dataset, since it produced many points with the same coordinates.
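The projection of cluster centers by multidimensional scaling can be sketched with classical (Torgerson) MDS, which is assumed here since the text does not state which MDS variant was used. The eigenvectors are obtained by power iteration with deflation to keep the example free of external libraries; the function names are hypothetical.

```python
import math


def _top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v.
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return value, v


def classical_mds(distances, dims=2):
    """Embed items in `dims` dimensions from a symmetric matrix of pairwise distances."""
    n = len(distances)
    sq = [[distances[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering of the squared distances gives the Gram matrix B.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        value, vector = _top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][d] = scale * vector[i]
        # Deflation: remove the component just extracted before the next axis.
        b = [[b[i][j] - value * vector[i] * vector[j] for j in range(n)]
             for i in range(n)]
    return coords
```

The sketch assumes Euclidean input distances, for which the Gram matrix has no negative eigenvalues. The distance matrix would hold the pairwise distances between the centers of gravity, and the returned coordinates would play the role of the projected cluster centers.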
After these three preliminary steps, the points were projected on the map using the new projection method. The density for each point x of the map was then estimated using a kernel method [24].
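A kernel density estimate on the map can be sketched as follows. The kernel and bandwidth used in the paper are not stated in the surviving text, so a two-dimensional Gaussian kernel with a user-chosen bandwidth is assumed; the function name is hypothetical.

```python
import math


def kernel_density(x, projected_points, bandwidth):
    """Gaussian kernel density estimate at map position x (two dimensions)."""
    # Normalization of a 2D isotropic Gaussian, averaged over all projected points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(projected_points))
    return norm * sum(
        math.exp(-((x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2) / (2.0 * bandwidth ** 2))
        for p in projected_points
    )
```

Evaluating this function on a regular grid of map positions gives a density surface; a larger bandwidth produces a smoother map.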

Then, a color scale ranging from purple to white, with intermediary colors red, orange and yellow, was assigned to each point according to its density. The map is represented in Figure 4.
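Such a color scale can be sketched as a piecewise-linear interpolation between anchor colors. The RGB values of the anchors and the convention that low densities are purple and high densities are white are assumptions made for this example; the text only gives the order of the colors.

```python
# Anchor colors (RGB); assumed values for purple, red, orange, yellow and white.
COLOR_STOPS = [(128, 0, 128), (255, 0, 0), (255, 165, 0), (255, 255, 0), (255, 255, 255)]


def density_to_color(density, max_density):
    """Map a density in [0, max_density] to an RGB triple by linear interpolation."""
    if max_density <= 0:
        return COLOR_STOPS[0]
    # Position along the scale, from 0 (first stop) to 4 (last stop).
    t = min(max(density / max_density, 0.0), 1.0) * (len(COLOR_STOPS) - 1)
    i = min(int(t), len(COLOR_STOPS) - 2)
    frac = t - i
    lo, hi = COLOR_STOPS[i], COLOR_STOPS[i + 1]
    return tuple(round(a + frac * (b - a)) for a, b in zip(lo, hi))
```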
This visual representation allows a global comprehension of the whole database, which is easier to understand than numerical or textual data. Some important keywords shared by many proteins are visible on this map, such as kinase, ligase and protease. At the same time, frequent keywords, such as "complete proteome", that are non-informative, are avoided because they are shared by several clusters. Another observation is that the density is far from being homogeneous, the map being more crowded in the bottom-left corner than elsewhere.
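The choice of cluster labels, where frequent keywords shared by several clusters are avoided, can be sketched as follows. The precise rule used to decide that a keyword is shared is not given in the text; here a keyword is treated as shared when it ranks among the `top_n` most frequent keywords of more than one cluster, which is an assumption, as are the function name and data layout.

```python
from collections import Counter


def cluster_labels(cluster_keywords, top_n=3):
    """Pick one label per cluster.

    cluster_keywords maps a cluster id to a list of keyword sets, one per protein.
    """
    top = {}
    for cid, proteins in cluster_keywords.items():
        counts = Counter(k for keywords in proteins for k in keywords)
        top[cid] = [k for k, _ in counts.most_common(top_n)]
    # Keywords that are among the most frequent in more than one cluster are shared.
    occurrences = Counter(k for ks in top.values() for k in ks)
    shared = {k for k, n in occurrences.items() if n > 1}
    # The label is the most frequent keyword that is not shared, if there is one.
    return {cid: next((k for k in ks if k not in shared), None)
            for cid, ks in top.items()}
```

With this rule, a keyword such as "complete proteome" that dominates every cluster is skipped, and a more specific keyword is kept as the label.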
When using the integrated biological query engine BIRD-QL of the BIRD System via a web service or http protocol, as shown in Figure 5, the selected proteins are represented on the maps by a plus sign of a given color. If different selections have been performed, different colors are used. An example is shown in Figure 6, where proteins selected by a query with the keyword "apoptosis" are shown by blue plus signs. Some of these proteins were selected by the user and are surrounded by a white square. One of the proteins, DNJA3, belongs to the small cluster labeled "disease mutation" but does not possess the "disease mutation" keyword. Interestingly, its deficiency implies dilated cardiomyopathy [25] (MIM-608382).
There is still room for improvement in the construction of semantic maps, both at the algorithmic level and at the software functionality level. The point projection is formalized as a global optimization problem and, currently, it is solved simply by using different starting points with the Newton-Raphson method. However, global optimization methods could also be tested [26,27]. From a practical point of view, it would also be useful to determine how many clusters or nodes are necessary to achieve a good projection of the data points.

CONCLUSIONS
The main contribution of this work is a new computational solution to the construction of semantic maps. The idea is to project points by locating them according to cluster centers. This method can thus be coupled with other methods such as self-organizing maps or Flexer's approach.

Figure 1. New projection of the dataset. Results of the presented projection on the oil flow dataset. Crosses, circles and plus-signs represent stratified, annular and homogeneous multi-phase configurations respectively. The separations between the three groups are clearly identified.

Figure 2. Oil flow dataset after GTM. After projection of the oil flow dataset using the Generative Topographic Mapping, the three groups are clearly separated, but in a complex way that is far from linear.

Figure 3. Oil flow dataset after PCA projection. After projection of the oil flow dataset using principal component analysis, the separation of the three groups is not clearly identified. In particular, the crosses are very scattered.

Figure 4. Semantic map with density colours and most frequent keyword labels.
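For the mixture model of Step 2, if $\pi_1, \ldots, \pi_G$ indicate the different weights of the $G$ components, the likelihood of the model for $n$ points is expressed as:
$$L = \prod_{i=1}^{n} \sum_{k=1}^{G} \pi_k\, f(y_i \mid \theta_k),$$
where $f(\cdot \mid \theta_k)$ denotes the density of the $k$-th component with parameters $\theta_k$.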

Figure 5. The global architecture of the Semantic Map Discovery prototype coupled with the BIRD System using the BirdQL query engine.

Figure 6. Semantic map with selected proteins. The labels represent the most frequent keywords present inside the cluster points which are not shared between different clusters.