Discrete Differential Geometry of n-Simplices and Protein Structure Analysis

This paper proposes a novel discrete differential geometry of n-simplices, originally developed for protein structure analysis. Unlike previous works, we consider the connection between space-filling n-simplices. Using cones of an integer lattice, we naturally introduce a tangent-bundle-like structure on a collection of n-simplices. We have applied this mathematical framework to the analysis of protein structures. In this paper, we propose a simple encoding method that translates the conformation of a protein backbone into a 16-valued sequence.


Introduction
This paper proposes a novel discrete differential geometry of n-simplices, which was originally developed for protein structure analysis [1] [2]. Discrete differential geometry is the study of discrete equivalents of the geometric notions and methods of classical differential geometry [3] [4]. It mainly deals with polygonal curves and polyhedral surfaces whose properties are analogous to their continuous counterparts, where the smooth theory is established as the limit of the discrete theory.
In contrast, we consider the connection between space-filling n-simplices. We define the gradient of n-simplices and obtain a flow of n-simplices by piling up n-cubes diagonally. The second derivative along a trajectory is given as a binary-valued sequence for any n (> 1). As a result, we can encode the shape of an n-dimensional object by approximating it with a trajectory of n-simplices that sweeps the occupied area.
A protein is a sequence of amino acids linked by peptide bonds that folds into a unique three-dimensional structure in nature. Protein backbone structure is usually studied via manually curated hierarchical classification [5] [6], but there also exist studies on differential geometric approaches to protein structure analysis [7]-[11]. In discrete differential geometry of protein backbones, proteins are usually represented as a polygonal chain, where curvature and torsion are defined at each vertex [7].
In our method, protein backbone structures are approximated by a trajectory of 3-simplices (tetrahedrons). In particular, we consider the second derivative along a trajectory to encode local protein structures. Our method performs comparably with more sophisticated but more time-consuming methods that are specifically designed for protein structure analysis [12] [13]. In the following, we first describe the discrete differential geometry of n-simplices. Then, we apply the mathematical framework to the analysis of protein structures and propose a simple encoding method that translates the conformation of a protein backbone into a 16-valued sequence.

Basic Ideas
Recall that an n-simplex is an n-dimensional polytope which is the convex hull of its n + 1 vertices. As an introduction, we consider the case of n = 2 before giving the definitions in the general case. In the case of n = 2, we obtain a flow of 2-simplices (triangles) by piling up unit cubes in the three-dimensional Euclidean space $\mathbb{R}^3$, as shown in Figure 1(a).
First, cubes are piled up in the direction of $(-1, -1, -1)$, where the three upper faces of each unit cube are divided into two triangles by a diagonal line. Then, the diagonal lines on the faces of the cubes form a drawing on the surface of the "peaks and valleys" of cubes. By projecting the drawing onto a hyperplane that is perpendicular to $(1, 1, 1)$, we obtain a flow of triangles. For example, the grey "slant" triangles on the surface specify the closed trajectory of the grey "flat" triangles on the hyperplane in Figure 1(a).
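The projection step can be illustrated with a minimal sketch. It assumes the hyperplane passes through the origin; the function names and the choice of the in-plane basis (e1, e2) are hypothetical and serve only this illustration.

```python
import math

def project_onto_diagonal_plane(p):
    """Orthogonally project a 3-D point onto the plane through the origin
    that is perpendicular to (1, 1, 1)."""
    s = sum(p) / 3.0  # per-coordinate component of p along (1, 1, 1)
    return tuple(c - s for c in p)

def plane_coordinates(p):
    """Express the projected point in a 2-D orthonormal basis of the plane.
    The basis (e1, e2) is an arbitrary choice made for this sketch."""
    e1 = (1 / math.sqrt(2), -1 / math.sqrt(2), 0.0)
    e2 = (1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6))
    q = project_onto_diagonal_plane(p)
    return (sum(a * b for a, b in zip(q, e1)),
            sum(a * b for a, b in zip(q, e2)))

# The three cube edges leaving the origin project to vectors of equal
# length that are 120 degrees apart, so each projected face is a rhombus
# and each half-face is a "flat" triangle.
edges = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
flat = [plane_coordinates(e) for e in edges]
```

The cube diagonal itself, (1, 1, 1), projects to the origin, which is why stacking cubes along (−1, −1, −1) does not move the drawing within the plane.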

Differential Structure
For convenience, we use monomials to represent the coordinates of points. That is, a point $(l_1, l_2, \ldots, l_n) \in \mathbb{R}^n$ is represented by the monomial $x_1^{l_1} x_2^{l_2} \cdots x_n^{l_n}$ of n indeterminates for integer n (n > 1). First of all, we give the definition of "slant" and "flat" n-simplices. Consider an n-cube in the n-dimensional Euclidean space $\mathbb{R}^n$. Note that the facets of n-cubes are (n − 1)-dimensional unit cubes. To obtain "slant" n-simplices, we divide each of the n facets which contain the origin […]
Definition 1. For any integer n > 1, the n-dimensional standard lattice $L_n$ is the collection of all integer points of $\mathbb{R}^n$, i.e., $L_n = \{ x_1^{l_1} x_2^{l_2} \cdots x_n^{l_n} \mid l_1, l_2, \ldots, l_n \in \mathbb{Z} \}$.
Definition 2. For any integer n > 1, the collection $S_n$ of all slant n-simplices is defined by […], where $\mathrm{Sym}_n$ is the n-th symmetric group and […] denotes the convex hull of n points […], where […] is defined as a monomial of degree n − 1, i.e., […]. For simplicity, we occasionally denote […], where […].
[…] For example, […] (Figure 1(c)). We obtain a flow on $B_n$ by patching these local trajectories together. To define the "second derivative" along a trajectory, we impose a kind of "smoothness condition" on local trajectories.
Definition 7 (Smoothness condition). Let Γ be a section of $TB_n$ on […], where […]. Then, we impose the following conditions on the local trajectory: […] and […] is included in both […]. Here, […] corresponds to the contact surface between two consecutive slant n-simplices.
As an example, consider the case of n = 2 shown in Figure 1(d), where the gradient at the current triangle […] is […]. Then, the gradient at the next triangle […] can be either $x_1 x_2$ or $x_2 x_3$. Otherwise, we could not connect the two consecutive slant triangles over the trajectory "smoothly", as shown in the figure.

Tangent Cone and Section of $TB_n$
Now we give the definition of the "peaks and valleys" of n-simplices (Figure 1 […]), where […]. Then […]. As an example, consider the "peaks and valleys" shown in Figure 1(a), which is induced by […]. Let us start from the triangle […] (grey) and move downward (Figure 2): […]. Since we move downward, the next triangle is […]. Continuing the process, we obtain a closed trajectory of length 10.
Finally, we consider the variation of the gradient, i.e., the "second derivative", along a trajectory. Thanks to the smoothness condition, the variation of the gradient along a trajectory can be specified as a binary-valued sequence.
Continuing the process, we obtain a binary sequence of length 10, DDDUDUUUDU, which describes the shape swept by the trajectory of triangles.
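As a rough illustration of how such a binary sequence behaves, the following minimal sketch treats the second derivative as a value in {D, U} that is held constant along the trajectory and negated (D to U, U to D) at certain steps. In the paper, the steps at which the value changes are determined by the gradients of consecutive triangles; here they are simply taken as input, and the function names are hypothetical.

```python
# Negation on the two-letter alphabet: -D = U and -U = D.
NEGATE = {"D": "U", "U": "D"}

def second_derivative(initial, flip_steps, length):
    """Build a D/U sequence of the given length, starting from `initial`
    and negating the current value at every step listed in `flip_steps`."""
    value = initial
    out = []
    for t in range(length):
        if t in flip_steps:
            value = NEGATE[value]
        out.append(value)
    return "".join(out)

def flip_steps_of(sequence):
    """Recover the steps at which a D/U sequence changes value."""
    return {t for t in range(1, len(sequence))
            if sequence[t] != sequence[t - 1]}

# Flip steps read off the sequence reported for the closed trajectory.
flips = flip_steps_of("DDDUDUUUDU")
```

Because the initial value is arbitrary, starting from U instead of D yields the letter-wise negated sequence with the same flip steps.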

Encoding of Protein Backbone Structure
In the case of n = 3, we obtain a flow of 3-simplices (tetrahedrons), which is used for protein structure analysis. In this section, we propose a simple encoding method which translates the conformation of a protein backbone into a sequence of letters from a 16-letter alphabet (called D2 codes), using the second derivative along trajectories of tetrahedrons.
First, we consider all the fragments of five consecutive amino acids occurring in a protein. Each fragment is approximated by a tetrahedron sequence of length five, where we permit translation and rotation during the process to absorb the irregularity inherent in actual protein structures.
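The enumeration of fragments can be sketched as a sliding window over the residue sequence. This is only an illustration of the windowing; the tetrahedron approximation itself is not shown, and the function name is hypothetical.

```python
def five_residue_fragments(residues):
    """Yield (center_index, fragment) for every window of five
    consecutive residues; the center residue is the third of the five."""
    for start in range(len(residues) - 4):
        yield start + 2, residues[start:start + 5]

# A chain of seven residues has three such fragments, centered on
# residues 2, 3, and 4 (0-based); the two residues at each end of the
# chain are never the center of a fragment.
windows = list(five_residue_fragments("ABCDEFG"))
```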
Next, we compute the second derivative along the tetrahedron sequences to obtain binary-valued sequences of length five. We assign each binary-valued sequence, denoted as a base-32 number, to the center amino acid of the corresponding fragment. For example, DDDUD is denoted by "2", DUDDU is denoted by "9", DUDUD is denoted by "A", and so on. Then, we obtain a one-dimensional representation of the protein backbone structure by arranging the base-32 numbers in the order in which the corresponding amino acids appear in the protein. See [1] for a detailed description of the algorithm.
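The three examples above are consistent with reading D as 0 and U as 1, most significant bit first. The following minimal sketch encodes a five-letter D/U word under that reading. The digit alphabet beyond "A" (0-9 followed by A-V) is an assumption based on the usual base-32 convention, and the function names are hypothetical.

```python
# Assumed digit alphabet for base 32: 0-9, then A-V.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def encode_word(word):
    """Map a five-letter D/U word to one base-32 digit (D = 0, U = 1,
    most significant bit first)."""
    if len(word) != 5 or set(word) - {"D", "U"}:
        raise ValueError("expected five letters from {D, U}")
    value = 0
    for letter in word:
        value = value * 2 + (1 if letter == "U" else 0)
    return DIGITS[value]

def encode_backbone(words):
    """Concatenate the digits of the per-fragment words, in chain order."""
    return "".join(encode_word(w) for w in words)

code = encode_backbone(["DDDUD", "DUDDU", "DUDUD"])
```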
Figure 3 shows an example of the D2-encoding of a protein. As the figure shows, our method successfully captures not only recurring structural features of the protein (strands, turns, caps, helices) but also distortions such as kinks.

Discussion
In this paper, we first described the discrete differential geometry of n-simplices. Then, we applied the mathematical framework to the analysis of protein structures and proposed a simple encoding method which translates the conformation of a protein backbone into a 16-valued sequence.
Unlike previous works, our version of discrete differential geometry studies the connection between space-filling n-simplices. Considering cones of an integer lattice, we have naturally introduced a tangent-bundle-like structure on n-simplices. One notable consequence is the smoothness condition, i.e., a restriction on the variation of the gradient along a trajectory. In particular, we can encode the shape of an n-dimensional object by approximating it with a trajectory of n-simplices that sweeps the occupied area.
As for protein structure analysis, since we do not use clustering analysis to encode local structures, our approach not only provides an intuitively understandable description of protein structures but also covers a wide variety of distortions. Our method performs comparably with more sophisticated but more time-consuming methods that are specifically designed for protein structure analysis. In the SHREC'10 Protein Model Classification track, we achieved results comparable to those of more sophisticated methods, using the length of the longest common subsequence as the measure of structural similarity [12]. At the homology level of the CATH95 data set, our method performed best among all the individual classifiers considered in [13].
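For reference, the length of the longest common subsequence of two code strings can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not the code used in [12]; any normalization of the score is left out because the text does not specify one.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    using O(len(a) * len(b)) time and O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# Two short strings over the base-32 digit alphabet share the
# subsequence "29A".
similarity = lcs_length("229A", "29A2")
```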

[…] where $-D = U$ and $-U = D$. Then, we can encode the conformation of a trajectory by the second derivative along the trajectory. As an example, consider the trajectory of Figure 2 again. First, set any initial value […]. The value of the second derivative is D until […], where it changes to U because the gradient of […]