Analysis of the Promoter Region, Motif and CpG Islands in AraC Family Transcriptional Regulator ACP92 Genes of Herbaspirillum seropedicae

Identifying promoters and their regulatory elements is one of the most important tasks in bioinformatics. To understand the regulation of gene expression, the identification and analysis of promoter regions, motifs and CpG islands are among the most important steps. Accurate promoter prediction is fundamental to the proper interpretation of gene expression patterns and to the construction and understanding of genetic regulatory systems. Therefore, the objective of this study was to analyze the promoter regions, motifs such as transcription factor binding sites, and CpG islands in AraC family transcriptional regulator ACP92 genes of Herbaspirillum seropedicae. The analysis was carried out by identifying transcription start sites in 29 ACP92 gene sequences taken from the H. seropedicae assembly in the NCBI genome browser. Accordingly, transcription start sites (TSSs) were identified, and the results indicated that 37.9% of the genes had more than one TSS whereas 62.1% had only one TSS. Seven motifs were identified in the analyzed sequences, and MV6 was revealed as the common promoter motif for all (100%) H. seropedicae ACP92 genes, serving as a binding site for transcription factors, and shared a minimum of 48.27%. When the common motif MV6 was compared against prokaryotic DNA motif databases using TOMTOM to find similar motifs, most of them are transcription


Introduction
Herbaspirillum seropedicae is a bacterial species found in roots, stems, and leaves in association with economically important species of the Poaceae family such as maize (Zea mays), rice (Oryza sativa), sorghum (Sorghum bicolor) and sugar cane (Saccharum officinarum) [1]. It is also commonly found in forage grasses such as elephant grass and in tropical fruits such as pineapple and banana [2]. It is a nitrogen-fixing proteobacterium isolated from the rhizosphere and tissues of several economically important plant species [3]. H. seropedicae is a well-characterized diazotrophic bacterium capable of establishing endophytic associations and promoting the growth of important cereals and forage grasses [4]. It has also been studied as a model of bacterial entry into host plants and of plant growth promotion [5]. The genome of H. seropedicae includes genes involved in the nitrogen fixation process and its regulation, as well as genes potentially involved in establishing an efficient interaction with the host plant. Several studies have shown that H. seropedicae supplies fixed nitrogen to the associated plant and increases grain productivity [4]. According to Munson and Scott [6], the AraC family of transcriptional regulators, present in bacterial species, is involved in a variety of cellular processes, from carbon metabolism to stress responses, and in their regulation. Correspondingly, the AraC family transcriptional regulator ACP92 genes of H. seropedicae are potential transcriptional regulators involved in a variety of cellular processes, in transcriptional control, and in the expression of genes by binding to specific promoter regions at both the transcriptional and post-translational levels.
A promoter is a key region involved in the differential transcriptional regulation of protein-coding and RNA genes [7]. Promoters are functional regions containing complex regulatory elements that determine the transcription initiation of genes [8]. DNA binding sites, or motifs, are short DNA sequences (typically 4 to 30 base pairs long, but up to 200 bp for recombination sites) that are specifically bound by one or more DNA-binding proteins or protein complexes [9]. They are often associated with specialized proteins known as transcription factors, and are thus linked to transcriptional regulation [10]. Transcription factors are DNA-binding proteins that interact with the RNA polymerase complex to activate or repress transcription. They bind to DNA at specific cis-acting regulatory elements (CAREs) and regulate the initiation of transcription, which is one of the most important control points in gene expression [11].
CpG islands are also reported to be important regulatory elements in the promoter regions of genomes [12]. CpG refers to the base cytosine (C) linked by a phosphate bond to the base guanine (G) in the DNA nucleotide sequence [13]. A structural feature that has proven useful in the detection of promoters is the so-called CpG island [14]. CpG islands play an important role in gene regulation through epigenetic changes [15]. Recent studies have shown that CpG methylation correlates with the activation of some genes [16]. DNA methylation has been shown to repress transcription initiation either directly, by interfering with the binding of transcriptional activators, or indirectly, through binding proteins [17].
Prokaryotic and eukaryotic promoters use different DNA sequences to regulate gene expression [18]. Methods for identifying promoters in eukaryotic and prokaryotic genomes using CpG islands and transcription factor binding sites (TFBS) have been developed by Anwar et al. [19]. Studies have identified promoters in 250 bp regions upstream of the gene start in Escherichia coli [20], and an approach for identifying E. coli promoters was also reported by Gordon et al. [21]. Many methods have been proposed to search for binding sites [22]. Among the large set of available motif finders, MEME is one of the most important tools for binding motif discovery [23]. Neural Network Promoter Prediction (NNPP version 2.2) is a widely used online tool for the recognition of eukaryotic promoters [24].
However, for prokaryotes, the Neural Network Promoter Prediction tool available at https://www.fruitfly.org/seq_tools/promoter.html was used [25]. Analysis of the promoter region, transcription start sites and CpG islands is among the most important issues in the study of gene expression. Identifying and analyzing these elements in Herbaspirillum seropedicae ACP92 genes, and revealing a common motif that serves as a binding site, is therefore crucial.
Therefore, this study was initiated to analyze the promoter regions, motifs such as transcription factor binding sites, and CpG islands in AraC family transcriptional regulator ACP92 genes of H. seropedicae.

Materials and Methods
Genome sequences were taken from the H. seropedicae assembly in the NCBI genome browser. Sequences beginning with ATG (the start codon) were identified among the AraC family transcriptional regulator ACP92 genes of H. seropedicae. Sequences containing a start codon were identified first, and their coding sequences were used in this analysis. Only twenty-nine AraC family transcriptional regulator ACP92 genes were found; the remaining AraC family members are pseudogenes with no ATG. These twenty-nine H. seropedicae ACP92 gene sequences were used for the analysis. To determine their respective TSSs, 1 kb sequences upstream of the start codon were excised from each gene. Promoter regions were defined as the 1 kb region upstream of each TSS. The Neural Network Promoter Prediction tool (https://www.fruitfly.org/seq_tools/promoter.html) was used with a minimum prediction score cutoff of 0.8 (scores range between 0 and 1) for prokaryotes [25]. For regions containing more than one TSS, the one with the highest prediction score was considered so as to obtain a more accurate prediction.
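The selection rule described above (excise the upstream region, keep predictions scoring at least 0.8, and retain the highest-scoring TSS) can be sketched as follows. This is only an illustration under stated assumptions: the `TssPrediction` record and both function names are hypothetical and do not reflect the output format of the prediction tool.

```python
from typing import List, NamedTuple, Optional


class TssPrediction(NamedTuple):
    """One predicted TSS (hypothetical record, not the tool's own format)."""
    position: int  # coordinate within the excised upstream sequence
    score: float   # prediction score between 0 and 1


def upstream_region(sequence: str, start_codon_index: int, length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of the start codon."""
    begin = max(0, start_codon_index - length)
    return sequence[begin:start_codon_index]


def select_tss(predictions: List[TssPrediction], cutoff: float = 0.8) -> Optional[TssPrediction]:
    """Return the highest-scoring prediction at or above the cutoff, or None."""
    passing = [p for p in predictions if p.score >= cutoff]
    if not passing:
        return None
    return max(passing, key=lambda p: p.score)


if __name__ == "__main__":
    # Illustrative values only, not data from this study.
    preds = [TssPrediction(120, 0.75), TssPrediction(430, 0.91), TssPrediction(610, 0.85)]
    print(select_tss(preds))
```

Returning `None` when no prediction reaches the cutoff keeps genes without a confident TSS distinguishable from genes with a single accepted TSS.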
The H. seropedicae ACP92 promoter sequences were analyzed with MEME, which uses an expectation maximization algorithm, via the web server (http://bioinformatics.ubc.ca/resources) to look for common motifs and transcription factors that regulate the expression of ACP92 genes. MEME has many optional inputs that modify its performance. The following options were used: 1) the zero or one occurrence per sequence model was chosen, 2) the maximum width of the motifs was 50, and 3) motif occurrences were allowed on both strands of the input DNA sequences. MEME searched for statistically significant motifs in the input sequence set and reported an E-value for each, an estimate of the number of equally well-conserved motifs expected in random sequences. The MEME output is HTML and shows the motifs as local multiple alignments of the input sequences. The MEME HTML output allows one or all of the motifs to be forwarded for further analysis, so that the identified motifs can be better characterized by other web-based programs such as TOMTOM. The TOMTOM web server was selected, and various sequence databases were searched for sequences matching the identified motif. TOMTOM reports whether the query motif closely resembles a known binding motif [26].
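The study ran MEME through a web server, but the three options listed above map onto flags of the MEME command-line program, which may help when reproducing the search locally. The sketch below only assembles the argument list and does not run anything. The file and directory names are hypothetical, and requesting seven motifs is an assumption based on the number of motifs reported, not a setting stated in the methods.

```python
from typing import List


def build_meme_command(
    fasta_path: str,
    out_dir: str,
    n_motifs: int = 7,    # assumption: matches the seven motifs reported
    max_width: int = 50,  # maximum motif width stated in the methods
) -> List[str]:
    """Assemble a MEME command line mirroring the options described in the text."""
    return [
        "meme", fasta_path,
        "-dna",                     # input sequences are DNA
        "-oc", out_dir,             # output directory (overwritten if present)
        "-mod", "zoops",            # zero or one occurrence per sequence
        "-nmotifs", str(n_motifs),  # number of motifs to report
        "-maxw", str(max_width),    # maximum motif width
        "-revcomp",                 # search both strands
    ]


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    print(" ".join(build_meme_command("acp92_promoters.fasta", "meme_out")))
```

Building the command as a list rather than a single string makes it safe to pass to `subprocess.run` without shell quoting issues.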
Two algorithms were used to find CpG islands in the H. seropedicae ACP92 promoter regions. First, CLC Genomics Workbench ver. 3.6.5 (http://clcbio.com, CLC bio, Aarhus, Denmark) was used to search for restriction enzyme MspI cutting sites (fragment sizes between 40 and 220 bp). Second, the Takai and Jones algorithm was applied with stringent search criteria: GC content ≥ 55%, observed CpG/expected CpG ratio ≥ 0.65, and length ≥ 500 bp [27]. The CpG island searcher program (CpGi130), available at http://dbcat.cgm.ntu.edu.tw//, was used.
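The three stringent thresholds above can be checked for a single candidate region with a few lines of code. The sketch below is a simplification and not the CpG island searcher program itself: the full algorithm scans the sequence with sliding windows and merges neighbouring hits, whereas this version only tests one region. The observed/expected ratio is computed with the commonly used formula (number of CpG × length) / (number of C × number of G); the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def meets_stringent_criteria(
    seq: str,
    min_gc: float = 0.55,
    min_ratio: float = 0.65,
    min_length: int = 500,
) -> bool:
    """Test one candidate region against the three thresholds given in the text."""
    return (
        len(seq) >= min_length
        and gc_content(seq) >= min_gc
        and obs_exp_cpg(seq) >= min_ratio
    )


if __name__ == "__main__":
    # Synthetic sequences for illustration, not data from this study.
    print(meets_stringent_criteria("CG" * 250))
    print(meets_stringent_criteria("AT" * 250))
```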

Determination of Transcription Start Sites (TSSs) and Promoter Regions
Identification of transcription start sites and promoter regions is the first step towards understanding the mechanisms that regulate gene expression and their association with genetic variation in these regions [28]. Accordingly, this study first identified transcription start sites for each of the 29 transcriptional regulator ACP92 genes in Herbaspirillum seropedicae. To make the prediction more reliable for genes containing more than one TSS, the TSS with the highest prediction score was selected. The results indicated that the highest TSS numbers were three (in ACP92_RS04670 and ACP92_RS13185), four (in ACP92_RS00045) and six (in ACP92_RS19695), while the remaining genes had fewer TSSs.
In addition, 37.9% of the genes had more than one TSS whereas 62.1% had only one TSS (Table 1).
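The reported percentages can be related back to gene counts with a short tally. The sketch below is an assumption-laden illustration: the per-gene counts are placeholders chosen so that 11 of 29 genes have more than one TSS, which is the split implied by 37.9% and 62.1%; they are not the values from Table 1, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def tss_multiplicity(tss_counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (% of genes with more than one TSS, % of genes with exactly one TSS)."""
    total = len(tss_counts)
    if total == 0:
        raise ValueError("no genes supplied")
    multi = sum(1 for n in tss_counts.values() if n > 1)
    single = sum(1 for n in tss_counts.values() if n == 1)
    return (round(100 * multi / total, 1), round(100 * single / total, 1))


if __name__ == "__main__":
    # Placeholder gene names and counts: 11 genes with several TSSs, 18 with one.
    example = {f"gene_{i:02d}": 2 for i in range(11)}
    example.update({f"gene_{i:02d}": 1 for i in range(11, 29)})
    print(tss_multiplicity(example))
```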

Common Motifs and Transcription Factors
Significant motifs in the promoter regions of the H. seropedicae input sequence set were searched with MEME via the web server, together with the E-value, an estimate of the number of equally well-conserved motifs expected in random sequences. The MEME output revealed seven motifs (MV1, MV2, MV3, MV4, MV5, MV6 and MV7) in the analyzed sequences. Motif six (MV6) was found to be the common promoter motif for all (100%) H. seropedicae ACP92 genes, serving as a binding site for transcription factors, and shared a minimum of 48.27% (Table 2). MV6 was thus found to serve as a binding site for transcription factors involved in the expression and regulation of these genes. The motifs were located largely between -800 and -100 bp relative to the transcription start sites (TSSs). Relatively more motifs were found on the positive strand (96) than on the negative strand (81) of the H. seropedicae ACP92 genes (Figure 1). A sequence logo for MV6 was also generated by MEME (Figure 2). (Table 3). Among the four families, Fur (ferric uptake regulation protein) largely matched the binding motif and is also known as a transcription factor family for H. seropedicae ACP92 gene regulation.

Determination of CpG Islands in H. seropedicae ACP92 Promoter Regions
In this study, CpG islands were determined in the promoter and gene body regions of the twenty-nine H. seropedicae genes using two search algorithms. The CLC search algorithm identified one possible CpG island in the promoter region of each of the following genes: ACP92_RS01580, ACP92_RS04595, ACP92_RS11865, ACP92_RS12565, ACP92_RS15060, ACP92_RS17545, ACP92_RS18255, ACP92_RS18865, ACP92_RS19245, ACP92_RS22560, and ACP92_RS23100 (Table 4). In gene body regions, one possible CpG island was identified in all genes except ACP92_RS00045, ACP92_RS08330, ACP92_RS11865, ACP92_RS12565, ACP92_RS15060, ACP92_RS17440, ACP92_RS19845 and ACP92_RS19860 (Table 5). The second algorithm, based on restriction enzyme MspI site cutting, showed that the CpG islands yielded many fragment sizes in both promoter and gene body regions. CpG islands in promoter regions contained several fragment sizes in all genes except ACP92_RS17545, which had only two fragment sizes (62 and 70 bp) (Table 6). Similarly, CpG islands containing many fragment sizes were found in all gene body regions except those of ACP92_RS00045, ACP92_RS11865 and ACP92_RS19695 among the AraC family transcriptional regulator ACP92 genes of H. seropedicae (Table 7). These findings imply that H. seropedicae has CpG islands and that they may play an important role in the regulation of gene expression. They also indicate that H. seropedicae ACP92 genes are not poor in CpG islands.
In contrast to this study, human [29], mouse [30], and pig V1R genes [31], all from eukaryotes, were reported to be poor in CpG islands. Nevertheless, Deaton and Bird [32] reported that in vertebrates about 70% of known promoters are associated with CpG islands.
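The MspI-based fragment analysis used in this section can be illustrated with a small in silico digestion. MspI recognizes CCGG and cuts after the first C. The sketch below is not the CLC Genomics Workbench procedure: it is a minimal stand-in that assumes only fragments flanked by two MspI sites are of interest, and the function names are hypothetical.

```python
import re
from typing import List


def mspi_fragment_sizes(seq: str) -> List[int]:
    """Sizes of fragments lying between consecutive MspI (C^CGG) cut sites."""
    seq = seq.upper()
    # The enzyme cuts after the first C of each CCGG site.
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]


def size_selected(sizes: List[int], low: int = 40, high: int = 220) -> List[int]:
    """Keep fragment sizes within the 40 to 220 bp window used in the methods."""
    return [s for s in sizes if low <= s <= high]


if __name__ == "__main__":
    # Synthetic sequence for illustration, not data from this study.
    demo = "CCGG" + "A" * 58 + "CCGG" + "A" * 300 + "CCGG"
    sizes = mspi_fragment_sizes(demo)
    print(sizes, size_selected(sizes))
```

Ignoring the two end pieces of a linear sequence is a design choice made here, since they are bounded by an MspI site on one side only.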

Conclusion
Transcription factors modulate gene expression by binding to a specific DNA sequence, usually found upstream of the gene or genomic region that they control. Gene promoter regions, together with transcription factor binding sites, lie upstream of the coding sequence. CpG islands are also regulatory elements in the promoter regions of genomes and are useful in the detection of promoters. In this study, we analyzed the promoter regions, motifs and CpG islands in AraC family transcriptional regulator ACP92 genes of H. seropedicae.
Table 4. Possible CpG islands shown in graph using promoter regions.
The results of this analysis help to understand transcription factor binding regions and could allow the regulatory genetic code to be read, which predicts gene expression in bacterial species in general and in H. seropedicae in particular.
Therefore, knowledge of bioinformatics methods is important for identifying gene regulatory regions in promoter and gene body regions, and could also help to predict gene expression profiles in various bacterial species.