Predicting tbx22 Zebrafish Protein Structure Using Multi-Level Prediction Tools and Demonstration of Conserved Structural Domains in Relation to Orthologous tbx22 Proteins in Humans

Biological functions of proteins play a key role in the development of any organism. The gene tbx22 is a member of a phylogenetically conserved family of genes that share a common DNA-binding domain, the T-box. This study examines the similarity in the developmental pattern influenced by the transcription factors TBX22 and tbx22 in H. sapiens and D. rerio, respectively. Secondary and tertiary structures of the proteins were predicted using standard structure prediction software such as Phyre 2, Predict Protein, SWISSMODEL and PSIPRED, and the homology of the proteins was compared. Protein homology prediction shows more than 65% homology between the 2 organisms. Superimposing the predicted protein structures reveals conserved domains between the human and zebrafish proteins. Additional supporting data from Genomatix MatBase and MatInspector show higher matrix family scores for BRAC (Brachyury gene, mesoderm developmental factor) in human and zebrafish. Transcription factor and promoter element analysis with Transcriptome Viewer, Gene2Promoter and GenomeInspector reveals a high degree of homology between the 2 organisms. The bioinformatic, proteomic and protein structural analysis approaches shown here explain in detail the relationship between the human and zebrafish tbx22 gene, protein and transcription factor. These studies also support zebrafish as a predictive model for numerous developmental patterning events in higher vertebrates.


Introduction
TBX22 is a member of a phylogenetically conserved family of genes that share a common DNA-binding domain, the T-box. The T-box genes are an ancient family of transcription factors that are well conserved throughout all metazoans [1] [2]. Genes associated with the T-box are significantly involved in the regulation of the developmental processes of early mesoderm induction, specification, patterning and somite formation in many organisms [3]. All the T-box genes share a conserved, homologous DNA sequence encoding the T-box [4]. Mutations in human TBX22 cause X-linked cleft palate with ankyloglossia syndrome, affecting palatogenesis [5]. Characterization of several Tbx genes in humans suggests a role for these genes in craniofacial development [6]. The human form of TBX22 has 5 transcripts, 2 of which are protein coding. TBX22 is part of the TBX1 subfamily, which also includes the genes TBX1, 10, 15, 18 and 20. TBX1 has been shown to regulate palatogenesis, and a similar complex defect is found in DiGeorge syndrome. The DiGeorge pattern of defects has been observed in the van gogh/tbx1 zebrafish mutant [7]. The two zebrafish tbx22 splice isoforms, tbx22a and tbx22b, encode proteins of 444 and 400 amino acids, respectively. Zebrafish tbx22 mRNA expression mirrors mammalian TBX22 expression and is consistent with early patterning of the vertebrate face [8].
Protein homology models can be computed by the SWISS-MODEL server homology modelling pipeline [9]-[11]. This fully automated protein structure server is accessible via the ExPASy web server and matches models online against multiple templates and an established database. Three types of modelling request (automated mode, alignment mode and project mode) are provided, which differ in the amount of user intervention. Modelling uses the top-ranking templates with ProMod-II and MODELLER. Models are built from the target-template alignment using ProMod-II. Coordinates that are conserved between the target and the template are copied from the template to the model. Insertions and deletions are remodeled using a fragment library. Side chains are then rebuilt. Finally, the geometry of the resulting model is regularized using a force field. The global and per-residue model quality is assessed using the QMEAN scoring function [12]. For improved performance, the weights of the individual QMEAN terms have been trained specifically for SWISS-MODEL [11].
Phyre 2 is a suite of web-based tools to predict and analyze protein structure, function and mutations. Its protocol guides users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. Additional tools are available to find a protein structure in a genome, to submit large numbers of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The focus of Phyre 2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools [14]. Phyre 2 builds models in 4 main stages: gathering homologous sequences, fold library scanning, loop and multiple-template modelling, and side chain placement. The resulting Protein Data Bank (PDB) files are then superimposed on one another to identify the areas they share. SuperPose is a fast and robust web server that calculates both pairwise and multiple protein structure superpositions using a modified quaternion eigenvalue approach. SuperPose is composed of two parts: a front-end web interface (written in Perl and HTML) and a back-end for alignment, superposition, RMSD calculation and rendering (written in Perl and C). The front-end accepts two kinds of input, PDB text files or PDB accession numbers, or any combination of both [15].

Sequence Analysis and Model Development
Amino acid sequences of human TBX22 and zebrafish tbx22 were obtained from the UniProt database collection of proteins [16] (Supplemental Data S1). The accession codes are Q9Y458 for human and C7FDJ3-1 and C7FDJ4-1 for the zebrafish spliceforms. Bioinformatic work was performed using CLC Main Workbench 7.6.4. The neighbor-joining (NJ) method was employed to generate the phylogenetic relationship of TBX22 between the organisms of interest for comparison. The phylogram was developed with the neighbor-joining algorithm and a k-mer based approach, with k = 15 and the Mahalanobis distance measure.
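The k-mer based distance step can be illustrated with a minimal Python sketch. It counts the k-mers in each sequence and builds the pairwise distance matrix that a neighbor-joining routine would take as input. This is not the CLC Main Workbench implementation: plain Euclidean distance is swapped in for the Mahalanobis measure, the function names are hypothetical, and the peptide strings are toy examples rather than TBX22 sequences.

```python
from collections import Counter
import math


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_distance(seq_a, seq_b, k=3):
    """Euclidean distance between two k-mer count profiles.

    Assumption: Euclidean distance stands in for the Mahalanobis measure
    named in the text, which additionally needs a covariance estimate.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    all_kmers = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[m] - counts_b[m]) ** 2 for m in all_kmers))


def distance_matrix(sequences, k=3):
    """Pairwise distances for a dict of name -> sequence."""
    names = list(sequences)
    return {(a, b): kmer_distance(sequences[a], sequences[b], k)
            for a in names for b in names}


if __name__ == "__main__":
    # Toy peptides (hypothetical, for illustration only).
    toy = {"seq1": "MKVLAAGTWRK", "seq2": "MKVLSAGTWRK", "seq3": "MPPLDDGTHHK"}
    for pair, dist in sorted(distance_matrix(toy, k=3).items()):
        print(pair, round(dist, 3))
```

With real protein sequences, k would be set to 15 as in the text; the smaller default here only keeps the toy profiles from being empty.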

Transcription Factor Binding Site Prediction
TBX22 belongs to the V$BRAC (Brachyury gene) matrix family, a mesoderm developmental factor. Analysis of the human TBX22 gene with the MatInspector tool from Genomatix showed the location of the gene on chromosome X and ALTS_NT_187635 of human GRCh38 (GeneID: 50945 GXL_16356/50945/ GXL_1808093), respectively. The zebrafish tbx22 gene is located on a GRCz10 chromosome (GeneID: 556143 GXL_891394) with 1 locus, 3 identified transcripts and 1 unique promoter sequence. The transcription factor (TF) binding site analysis was compared with other TF binding site predictions within the Genomatix system, such as the Transcriptome Viewer and Gene2Promoter tools [17].
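The idea of locating candidate binding sites in a promoter can be sketched in Python as a simple consensus scan of both strands. This is a deliberate simplification and not the MatInspector method, which scores sites against weight matrices. The half-site AGGTGTGA is a commonly cited T-box consensus and is an assumption here rather than a value taken from this study; the function names and the test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def scan_consensus(sequence, motif="AGGTGTGA", max_mismatches=1):
    """List (position, strand, matched_window, mismatches) for near matches.

    Assumption: AGGTGTGA is used as an illustrative T-box half-site; a real
    analysis would use the V$BRAC weight matrices instead of a consensus.
    """
    sequence = sequence.upper()
    hits = []
    for strand, target in (("+", motif), ("-", reverse_complement(motif))):
        for start in range(len(sequence) - len(target) + 1):
            window = sequence[start:start + len(target)]
            mismatches = sum(1 for a, b in zip(window, target) if a != b)
            if mismatches <= max_mismatches:
                hits.append((start, strand, window, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical promoter fragment, for illustration only.
    promoter = "GGCCAGGTGTGACCTTTTCACACCTGG"
    for hit in scan_consensus(promoter, max_mismatches=0):
        print(hit)
```

Positions are 0-based offsets on the forward strand, so hits on the minus strand are reported where their reverse complement occurs in the input.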

Structural Modelling
Sequence alignments were also used to identify the matches, gaps and mismatches between sequences. There are no established Protein Data Bank (PDB) files for human TBX22 or zebrafish tbx22. PDB structures were therefore generated with prediction software, namely Phyre 2 and SWISSMODEL. SWISSMODEL builds models from the target-template alignment using ProMod-II. Coordinates that are conserved between the target and the template are copied from the template to the model. Insertions and deletions are remodeled using a fragment library. Side chains are then rebuilt. Finally, the geometry of the resulting model is regularized using a force field. If loop modelling with ProMod-II does not give satisfactory results, an alternative model is built with MODELLER [18]. Superposition of 2 structures was achieved with SuperPose, a protein superposition server. SuperPose calculates protein superpositions using a modified quaternion approach. From a superposition of two or more structures, SuperPose generates sequence alignments, structure alignments, PDB coordinates, RMSD statistics, difference distance plots and interactive images of the superimposed structures. Molecular graphics and analyses were performed with the UCSF Chimera package [19].
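To make the quaternion-based superposition and its RMSD statistic concrete, the following Python sketch computes the minimum RMSD between two equally sized coordinate sets using the standard quaternion eigenvalue formulation. It is a sketch under stated assumptions and not SuperPose's modified algorithm: the atoms are assumed to be already paired one to one, the largest eigenvalue is found by shifted power iteration, and the function names and coordinates are hypothetical.

```python
import math


def centered(points):
    """Translate a list of (x, y, z) points so their centroid is the origin."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in points]


def superposed_rmsd(mobile, reference, iterations=500):
    """Minimum RMSD after optimal rotation and translation (quaternion method)."""
    if not mobile or len(mobile) != len(reference):
        raise ValueError("coordinate sets must be non-empty and equally sized")
    p, q = centered(mobile), centered(reference)
    n = len(p)
    # Cross-covariance terms S[a][b] = sum over atoms of p_a * q_b.
    s = [[sum(pi[a] * qi[b] for pi, qi in zip(p, q)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Symmetric 4x4 matrix whose largest eigenvalue gives the best overlap.
    m = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift makes the matrix positive semi-definite so power iteration
    # converges to the eigenvector of the largest eigenvalue.
    shift = max(sum(abs(v) for v in row) for row in m)
    largest = 0.0
    if shift > 0:
        vec = [1.0, 0.5, 0.25, 0.125]
        for _ in range(iterations):
            nxt = [sum(m[i][j] * vec[j] for j in range(4)) + shift * vec[i]
                   for i in range(4)]
            norm = math.sqrt(sum(v * v for v in nxt))
            vec = [v / norm for v in nxt]
        largest = sum(vec[i] * m[i][j] * vec[j]
                      for i in range(4) for j in range(4))
    total = sum(x * x + y * y + z * z for x, y, z in p + q)
    return math.sqrt(max(0.0, (total - 2.0 * largest) / n))


if __name__ == "__main__":
    # Hypothetical coordinates: the second set is the first rotated by 90
    # degrees about z and translated, so the RMSD should be close to zero.
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
    mob = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in ref]
    print(superposed_rmsd(mob, ref))
```

In practice the paired coordinates would be the C-alpha atoms of residues matched by a prior sequence or structure alignment, which is the step a server such as SuperPose performs before superposition.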

Mammalian Cell Transfection and Expression of tbx22 Proteins
HEK-293 human embryonic kidney cells (epithelial morphology) were used in this study. The cells were cultured in base medium consisting of Dulbecco's Modified Eagle's Medium (Thermo Fisher, Agawam, MA, USA), 10% heat-inactivated fetal bovine serum (HI FBS, Qualified, Thermo Fisher) and 1% penicillin-streptomycin (Thermo Fisher) in a 37˚C incubator with 5% (v/v) CO₂. The doubling time of the cells was calculated from cell growth over time. The assay was performed by seeding 0.3 × 10⁶ cells in a 6-well plate and allowing them to adhere and grow. Cells were prepared for density calculations by treatment with trypsin and dilution in media, followed by manual cell counts using a hemocytometer. Cell density was taken into consideration and cells were seeded at a standard density for initiation of the assay, reaching confluency at 1.2 × 10⁶ cells at the end of the time point. Cells were then transfected with custom-made tbx22 gene constructs using Lipofectamine 3000 (Life Technologies, L3000008). A DNA amount of 1 µg was transfected under standard conditions. Transfected cells were then allowed to grow for 3-4 days. The media was then changed to neomycin-containing media to screen for transfected cells. Cells were then scaled up to standard 25 cm² flasks for further protein production, followed by testing of the supernatant for the presence of specific proteins.
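The doubling-time calculation can be written out as a short Python sketch based on the exponential growth relation t_d = t · ln 2 / ln(N_t / N_0). The seeding and confluent cell numbers are those given above; the elapsed time is not stated in the text, so the 72 h used in the example is a hypothetical value, as is the function name.

```python
import math


def doubling_time(initial_cells, final_cells, elapsed_hours):
    """Doubling time in hours, assuming exponential growth throughout."""
    if initial_cells <= 0 or final_cells <= initial_cells or elapsed_hours <= 0:
        raise ValueError("need 0 < initial_cells < final_cells and elapsed_hours > 0")
    return elapsed_hours * math.log(2) / math.log(final_cells / initial_cells)


if __name__ == "__main__":
    seeded = 0.3e6      # cells seeded per well (from the text)
    confluent = 1.2e6   # cells at confluency (from the text)
    elapsed = 72.0      # hours between counts: hypothetical, not from the text
    doublings = math.log2(confluent / seeded)
    print(f"population doublings: {doublings:.2f}")
    print(f"doubling time: {doubling_time(seeded, confluent, elapsed):.1f} h")
```

Growth from 0.3 × 10⁶ to 1.2 × 10⁶ cells corresponds to two population doublings, so the doubling time is half of whatever interval separated the two counts.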

Sequence Identification
For each identified template, the template's quality was predicted from features of the target-template alignment. The templates with the highest quality were then selected for model building. Model building results are based on the primary amino acid sequence. Identified sequences were used for alignment and tree construction (Figure 1). Sequences from UniProt and NCBI were cross-checked before relatedness was assessed.

Transcription Factor Binding Site Prediction
Based on the ElDorado database, 2 loci were identified for humans with 20 transcripts and 8 unique promoters. One transcript, GXT_26252783, is from GeneID: 50945 GXL_16356, and 9 transcripts are from the maternally expressed X chromosome: GXT_22770400, GXT_24460915, GXT_26251083, GXT_26251084, GXT_26251085, GXT_26251086, GXT_26255662, GXT_26251089 and GXT_22770402. The identified zebrafish transcripts are GXT_23655886, GXT_24578594 and GXT_26537083. The MatInspector tool from Genomatix identified the individual binding sites in a promoter. Functional assessment of binding sites can be further confirmed with other tools, e.g. Model Inspector, Comparative Genomics, Frame Worker and Genomatix Pathway System.
Identification of phylogenetic distance is significant for the analysis of TBX22 between different species and also between humans and zebrafish, to bridge the gap in close relatedness (Figure 1(a)). Both transcription factors and isoforms were used for tree development to make sure no significant details were missed during tree building. The human transcription factor is related to its three isoforms X1, X2 and X3, which are in turn closely related to zebrafish isoforms 1 and 2 (Figure 1(b)). Comparative statistics of the protein information, including the repeating amino acids, show their structural similarity, with nearly equal numbers (Figure 2).

T-Box Genes and Brachyury Gene Matrix Family
Table 1 depicts the V$BRAC matrix family, listing the common set of genes for the Brachyury gene family in both humans and zebrafish. Certain genes exist in both organisms as splice variant forms, while a few are restricted to one organism, such as human TBX23P and zebrafish tbx16 and tbx19. TBX22 and tbx22 are similar in sequence between the two organisms, with few mismatches. All the isoforms of the TBX22 gene were examined for the presence of the BRAC matrix family genes (Figure 3). The presence of the matrix family on the gene sequence indicates that the T-box domain is contained within it and shows its location on the gene. Gene identification with all matrix families and with the V$BRAC family is shown with their locations (Figures 3(a)-(d)).
V$BRAC matrix family genes (Table 1):
Humans: EOMES, MGA, T, TBR1, TBX1, TBX10, TBX15, TBX18, TBX19, TBX2, TBX20, TBX21, TBX22, TBX23P, TBX3, TBX4, TBX5, TBX6
Zebrafish: eomesa, mgaa, ta, tbr1a, tbr1b, tbx1, tbx15, tbx16, tbx18, tbx19, tbx20, tbx21, tbx22, tbx2a, tbx2b, tbx3a, tbx4, tbx5a, tbx5b, tbx6
The amino acid (AA) sequences of the PDB structures were aligned and the sequence homology between them was identified before superimposing the structures. Sequence identity/similarity scores between the tbx22-1 and 22-2 spliceforms were 89.2%/89.4%, while tbx22 vs TBX22 has a score of 40.3%/53.3% (Table 2 and Table 3).
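Identity/similarity scores of the kind reported above can be computed from a gapped pairwise alignment, as in the following minimal Python sketch. It does not reproduce the scoring of the alignment tool used in this study: the denominator is taken to be the full alignment length, the residue similarity groups are a hypothetical conservative-substitution grouping, and the aligned strings are toy examples.

```python
# Hypothetical grouping of physicochemically similar residues (an assumption;
# alignment tools usually derive similarity from a substitution matrix).
SIMILAR_GROUPS = ("AVLIMC", "FWY", "KRH", "DE", "NQ", "ST", "G", "P")


def are_similar(res_a, res_b):
    """True if two residues are identical or fall in the same group."""
    return res_a == res_b or any(res_a in g and res_b in g for g in SIMILAR_GROUPS)


def identity_similarity(aligned_a, aligned_b, gap="-"):
    """Percent identity and percent similarity over all alignment columns."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equally long")
    identical = similar = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if gap in (res_a, res_b):
            continue  # gapped columns count toward the length only
        if res_a == res_b:
            identical += 1
        if are_similar(res_a, res_b):
            similar += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * similar / columns


if __name__ == "__main__":
    # Toy alignment (hypothetical, for illustration only).
    identity, similarity = identity_similarity("MKV-LA", "MRVTLA")
    print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

Because similarity counts conservative substitutions in addition to exact matches, it is never lower than identity, which is consistent with the paired values quoted in the text.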

Protein Homology Modeling
The SWISS-MODEL template library (SMTL version 2016-01-20, PDB release 2016-01-15) was searched with Blast and HHBlits [12] [13] for evolutionarily related structures matching the target sequences of ZF tbx22-1 and tbx22-2, with Transcription factor 5 and 1 identified as the closest PDB templates. Close to 30 templates were built for the sequence in Table S1. The newly built homology models were then linked to one another and an ideal model for the amino acid sequences was constructed for every TF of interest. SuperPose uses structural alignment to guide the superposition of two or more structures. The SuperPose web server supports the submission of either PDB-formatted files or PDB accession numbers. The structures generated here are PDB-formatted files produced with Phyre 2. Structures of zebrafish Tbx TF 22-1 and 22-2 are visualized along with their secondary structure (Figure 4(a), Figure 4(b)). The human TBX22 TF is shown separately (Figure 4(c)). Superimposing one protein over another was performed with the Structure Comparison MatchMaker tool. The resulting superposition of zebrafish tbx22-1 and tbx22-2 is shown with similar moieties (Figure 5). The superimposed structural alignment with the third PDB was generated by laying the human TBX22 TF over the result for 22-1 and 22-2 (Figure 6). Structural differences and overhanging sequence can be visualized in the superimposed model generated using UCSF Chimera 1.10.2 [17].

Tbx22 Protein Expression in Mammalian Cell System
Cells transfected and grown in standard 25 cm² flasks were transferred to standard media containing neomycin. The presence of a neomycin resistance gene in the custom plasmid construct means that only the transfected cells grow. The antibiotic resistance gene thus acts as a screening tool to eliminate the non-transfected cells. Protein production in the cells was confirmed by isolating the supernatant after 4-7 days and purifying it by magnetic bead separation, followed by SDS-PAGE analysis. The size of the protein was confirmed by silver stain analysis (Figure 7). The expressed protein was estimated to have a molecular weight of 53 kDa, which is the same as that of the tbx22 protein.

Discussion
The aim of this study was to build PDB files for the human TBX22 and ZF tbx22 transcription factors and to determine the best model fit for superimposing the protein structures on top of each other. To achieve this, a combination of bioinformatics applications was used, starting with development of an NJ phylogram to identify distant relatedness, followed by structure prediction using Phyre 2 and SWISSMODEL. Structural modelling of TBX22 shows the predicted template model to be built with more than 90% confidence for 58% of the sequence. The ZF tbx22 predicted template models had 53% and 48% sequence identity to the established PDB models, with more than 90% confidence. The predicted structures of the models were superimposed on one another and shown to share the major binding areas. Superimposing the structures of the independent proteins on each other reveals reliable common areas that contain domains in the stereo view of the protein. Although the predicted models provide strong evidence that TBX22 and tbx22 are part of the T-box protein family, built models should not be relied on if they have a low identity percentage. All the proteins in the study were verified to have at least 50% sequence identity to the simulated templates against the established PDB models and more than 60% sequence identity among themselves. Having completed this simulation, the predicted models show promise in solving this protein's structure. Supporting evidence was also obtained by expressing the protein in mammalian cells. Protein purification followed by Coomassie/silver staining showed an expressed protein of the expected size of 53 kDa. Were a ZF-specific antibody for the tbx22 protein available, this could be confirmed easily by Western blotting. Based on the data in hand, it is also evident that zebrafish would be an ideal predictive model to identify numerous developmental patterning events translatable to human development.

Conclusion
Protein prediction methods have allowed biologists to identify protein structure, functionality, orientation and other important general information. Available knowledge of existing proteins can provide a template for approaching unknown structures and functions. Uncharacterized proteins can now be defined with the help of multiple public databases and a growing set of established data sources combined with bioinformatics tools. As more tools become accessible, computational biologists can characterize unidentified or unpredicted biological molecules to solve known problems, leading to advancements and new ideas. The goal of this paper was to show the sequence similarity between T-box 22 of 2 different species and to show how they are similar to each other in sequence and structure. Supporting documentation available from databases, together with the use of appropriate tools, provides sufficient evidence that the ZF tbx22 TF shares functional homology with the human TBX22 TF.

Figure 1. (a) Detailed analysis of the phylogram using NJ, showing the distance of the TBX22 protein and transcription factor among orthologous sequences between different organisms, and (b) comparison and broad understanding of the phylogram developed between the TBX22 transcription factors of Homo sapiens and Danio rerio isoforms. The phylogram relationships between other organisms and between humans and zebrafish are shown individually and in detail.

Figure 2. Comparative statistics laid out as a histogram, showing the amino acid distribution among the human TF, its isoforms and the zebrafish TF isoforms.

Figure 3. Promoter sequences of (a) the TBX22 gene of the human genome and (c) tbx22 of the zebrafish genome, analyzed against the ElDorado database and showing all the different matrix families. The presence and location of the Brachyury gene (V$BRAC) matrix family are illustrated on the gene sequence for human TBX22 in (b) and zebrafish tbx22 in (d), respectively.
The final model was built with the SuperPose and Chimera tools. Phylogenetic analysis of relatedness shows that the ZF and human transcription factors are closely related, based on the nodal distance between them. Human TBX22 is more closely related to Gallus gallus and Xenopus tropicalis than to ZF tbx22, and is more distantly related to Rattus norvegicus and Mus musculus, making the zebrafish an ideal model for comparative studies. When only the human and zebrafish TFs are compared in a phylogram, the distance scores reveal that the two are close to each other. Based on the NJ tree scores, TF binding site prediction was analyzed for humans versus ZF, and it shows the common binding sites and common matrix families that share binding motifs. Analysis of the identified genes has shown the location of the T-box domain to be the same as the V$BRAC sequence position. The T-box binding domain is approximately 180 AA in length and is highly conserved among T homologous proteins from different species. The Brachyury TF is bound as a dimer, interacting with the major and minor grooves of the DNA. The human TBX22 TF PDB model mimics T-box transcription factors 1, 5 and 3 with sequence identities of 56.08%, 53.26% and 51.85%, respectively. The zebrafish tbx22 TF PDB model also mimics T-box transcription factors 1, 5 and 3, with sequence identities of 53.72%, 52.17% and 51.10%. The predicted models of human TBX22 and ZF tbx22 have binding areas similar to those of the established T-box TF PDB structures, showing the location of common sequence areas within the T-box domain of the TFs. Assembly of the 3 different protein models remained a major bottleneck in developing and interpreting the material before and after simulation.

Table 1. Brachyury gene family and associated homologous tbx genes in humans and zebrafish.

Table 3. Alignment of zebrafish tbx22 and human TBX22.
Amino Acid sequences of T-box 22 Transcription factor