Identification of Driver Genes in Primary Liver Cancer by Integrating NGS and TCGA Mutation Data

Background: This study is aimed towards an exploration of mutant genes in primary liver cancer (PLC) patients by using bioinformatics and data mining techniques. Methods: Peripheral blood or paraffin-embedded tissues from 8 patients with PLC were analyzed using a 551 cancer-related gene panel on an Illumina NextSeq500 Sequencer (Illumina). Meanwhile, the data of 396 PLC cases were downloaded from The Cancer Genome Atlas (TCGA) database. The common mutated genes were obtained after integrating the mutation information of the above two cohorts, followed by functional enrichment and protein-protein interaction (PPI) analyses. Three well-known databases, including Vogelstein’s list, the Network of Cancer Gene (NCG), and the Catalog of Somatic Mutations in Cancer (COSMIC) database were used to screen driver genes. Furthermore, the Chi-square and logistic analysis were performed to analyze the correlation between the driver genes and clinicopathological characteristics, and Kaplan-Meier (KM) method and multivariate Cox analysis were conducted to evaluate the overall survival outcome. Results: In total, 84 mutation genes were obtained after 8 tions had no relationship with overall survival. Conclusion: This study investigated the mutant genes with significant clinical implications in PLC patients, which may improve the knowledge of gene mutations in PLC molecular pathogenesis.


Introduction
As the second leading cause of cancer death, the incidence of primary liver cancer (PLC) has shown a significant increase in almost all countries, especially in Asia [1]. Despite advances in many aspects of PLC treatment, including surgical treatment, arterial embolization, and systemic chemotherapy, the 5-year average survival rate is less than 10% [2] [3]. The emergence of molecular targeted drugs provides new treatment options for patients, but currently, there is no effective target for PLC. Therefore, analysis of genetic mutations in malignant progression and identification of biomarkers that would predict tumor behavior to research and develop novel target drugs are urgently needed. It is also the core target of tumor genomics in information mining.
Next-generation sequencing (NGS), a mainstream technology in oncology, is an ability to produce millions of reads in a single run. Compared with traditional gene sequencing (known as Sanger sequencing), NGS makes abundant parallel sequencing with higher throughput and lower cost [4]. Hence, the gene expression profile, mutational genes, and hotspot mutations in pathological samples from PLC patients could be detected on a large scale by NGS technology, furthermore, through bioinformatics analysis, key genes related to the disease can be screened out, which might pave the way toward novel therapeutic targets and molecular targeted drugs [5]. Recently, an increasing number of databases have been developed based on the sequencing of cancer genome, among which The Cancer Genome Atlas (TCGA) database provides relevant data such as tumor gene expression, the copy number variation (CNV), gene mutation, DNA methylation, and clinical patient prognosis. It can also provide important clues for exploring the mechanism of PLC development and searching for therapeutic targets [6] [7].
In this study, a 551 panoramic cancer gene panel was designed and 84 genetic mutations have been identified using targeted NGS in 8 PLC patients. Seventeen common mutations were obtained, following integrating top 100 genes with the highest mutation frequency of 396 PLC cases from the TCGA database. Subsequently, the functional enrichment analysis was performed and a protein-protein interaction (PPI) network was constructed. After interaction with 3 driver gene databases, 11 driver mutants were screened and visualized, furthermore, their The mutation status of 396 PLC cases filtered for "primary tumor" and "liver and intrahepatic bile ducts" were downloaded from the official website of TCGA were included, detailed information of who was obtained to analyze the correlation between mutation and clinicopathological characteristics or overall survival.
Peripheral blood (5 ml) was drawn in EDTA tubes and processed immediately (centrifugation 1500 g, 5 min, 4˚C) to collect plasma and buffy coats, which were aliquoted and stored at −80˚C until further use.
Postoperative paraffin-embedded tissue sections were collected, 10 pieces with a thickness of 5 μm, and stored in a refrigerator at −20˚C for use. DNA was isolated from whole blood, plasma, and tissue using the QIAampDNA Mini Kit (Qiagen, Hilden, Germany) following the manufacturer instructions.

Identification of Mutant Genes
The mutation genes of 8 clinical cases were visualized using R software, in which each row represented a mutated gene, and column represented a patient. The mutation types of the genes were exhibited, including single nucleotide polymorphic (SNP), single nucleotide variants (SNV), and insertion/deletion (In-Del).
TCGA PLC cancer data were screened by ticking "primary tumor" and "liver

Functional Enrichment Analysis
The KEGG is an information resource for understanding biological systems and genomic functions at the molecular level. The ggplot2 program package of R software was loaded to visualize the enrichment analysis results obtained from the DAVID database.

Construction of PPI Network
PPI network was constructed after importing the name list of mutated genes from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) online database (https://www.string-db.org/), and at the same time, the species source was limited to Homo sapiens [9].
PPI network data was imported into Cytoscape software (version 3.7.2). A functional module was identified using the Molecular Complex Detection application (MCODE) plug-in. Next, using the cytoHubba plug-in, the top 10 hub genes were selected according to the maximal clique centrality (MCC) method.

Screening and Visualization of Driver Genes
An intersection between the identified mutant genes and the driver genes from three databases, including Vogelstein's list [10], the Network of Cancer Genes (NCG) [11], the Catalog of Somatic Mutations in Cancer database (COSMIC) [12] was taken using the Venn online website.

Statistical Analysis
All statistical results were conducted using SPSS version 25

Identification of 17 Mutated Genes
Eighty-four mutated genes were identified in the blood and tissue samples from 8 PLC patients through the high-throughput detection of 551 panoramic cancer genes. As shown in Figure 1

Functional and Interactional Analysis of 17 Mutant Genes
To extensively investigate the function and mechanism of the 17 mutated genes, GO and KEGG analyses were performed using the DAVID online application (Figures 2(A)-(D)). The GO analysis results demonstrated that the 17 mutated genes were significantly enriched for regulation of nucleobase-containing compound metabolic process, regulation of nitrogen compound metabolic process, and cellular macromolecule biosynthetic process in the BP category. With regards to CC analysis, 17 mutated genes were markedly enriched in nucleoplasm, chromosome, and nucleoplasm part. The top 3 significantly enriched MF terms included heterocyclic compound binding, organic cyclic compound binding, and DNA binding. Furthermore, 17 mutated genes were mainly enriched in P53 signaling pathway, cell cycle, and HTLV-I infection in the KEGG analysis.
To explore the interaction between the 17 mutant genes, a PPI network was constructed using the STRING database. As shown in Figure 2(E), a total of 17 nodes and 76 edges were mapped in the PPI network, with a local clustering coefficient of 0.795 and a PPI enrichment p value < 1.0e -16 . Then, a significant module, consisting of 10 nodes and 44 edges, with a score of 9.778, was identified by the MCODE plugin of Cytoscape (Figure 2(F)). Using the cytoHubba plugin and MCC algorithm, the top 10 hub genes were identified from the network (Figure 2(G)).

Screening of 11 Driver Genes
The three databases, Vogelstein's list, NCG database, and COSMIC database, including 125, 711 and 576 driver genes, respectively, are commonly studied for the analysis of tumor driver genes. After analyzing the intersection of 17 mutant genes and driver genes in the above 3 well-known databases, 11 driver genes

Correlation of Driver Gene Mutations with Clinicopathological Characteristics
After integrating 8 clinical patients and 254 TCGA cases with completed clini-copathological data, the correlations between mutations for 11 driver genes and clinicopathological features were analyzed using the Chi-square test. The results   To further explore the role of the 6 driver genes with statistical significance, logistic regression analysis was used to assess the relationship between mutation status of driver genes and clinicopathological characteristics (

Survival Analysis of Driver Gene Mutations in PLC
As shown in Figure 4(A) and Figure 4 (Figure 4(E)). Open Journal of Gastroenterology from a single case. NGS technologies have allowed a high-throughput, comprehensive characterization of cancer genomes at unprecedented rates, which could improve the cancer genetic map and our understanding of the genetic landscape in liver cancer [13]. In Morishita's study, 50 genes associated with tumor development were targeted, and the relationship between the genetic mutations and the clinical characteristics of HCC patients was investigated using an NGS platform [14]. Lu et al. [15] provided a mutation spectrum of HCC tissue in 12 western Chinese cases using NGS with a panel of 372 cancer-associated genes, assisting in the investigation of the mechanism of liver carcinogenesis. Kan et al. [16] conducted whole-genome sequencing analysis on 88 HCC patients by the NGS technology, which not only verified multiple gene mutation sites that had been reported but also found new sites of BRCA2 and IGF1R gene mutations. In this study, a panel containing 551 cancer-associated genes was used in the NGS platform to analyze the gene mutations of 8 PLC cases, which is a relatively large panel up to now, providing a comprehensive genetic landscape survey of PLC patients. A total of 84 mutant genes were identified, corroborating the progression of PLC development is the accumulation of multiple genetic events. Given the number of our cases analyzed is low, a PLC TCGA dataset was included in this study. After integrating the two cohorts, 8 clinical PLC cases and 396 PLC TCGA patients, 17 common mutated genes were obtained. Among them, 12 genes, including TP53, ARID1, ARID2, ATM, ATR, CDKN2A, KMT2C, KRAS, NF1, PTEN, RB1, and RECQL4, has been reported previously in NGS-based study of the liver cancer genome [17] [18] [19], while other 5 gene mutations, including KMT2D, NF2, STAG2, TSC2, ZFHX3, have never been verified in PLC, which remain to be further explored. Of the above 17 mutant genes, TP53 has a high mutation rate of 29.455% (119/404). Consistent with our data, the NGS result in 59 liver cancer tissues showed the most mutated gene was also TP53 (35.600%) [14]. In addition, TP53 was the most frequently mutated gene in 12 HCC patients studied by Lu et al. [15], with mutation rates reported up to 41.670%.
In order to further investigate the biological function and potential pathways of the 17 mutated genes, GO and KEGG analysis were performed using David online analysis platform. Notably, the result of the GO analysis showed that the 17 mutation genes were mainly enriched in metabolic process and macromolecule biosynthetic process. As we know, the liver is an important metabolic organ of the human body, and liver dysfunction induces intracellular redox imbalance, leading to the damage of intracellular biomolecules. Recent studies have reported that metabolic rearrangement contributes to the increased risk of PLC.
Ikeno et al. [20] demonstrated GLUT-1 expression was significantly higher in tumors with mutated KRAS than in tumors with wild-type KRAS. High metabolic tumor volume is associated with KRAS mutation and poor postoperative outcomes in ICC patients. To meet the metabolic requirements for cancer cell growth, the de novo nucleotide synthetic pathway is activated to support the biology activities of cancer cells, including nucleic acid and protein synthesis, energy preservation, signaling activity, glycosylation mechanisms, and cytoskeletal function. Both oncogenes and tumor suppressors have been identified as key molecular determinants for de novo nucleotide synthesis. Oncogenic KRAS maintains high intracellular nucleotide levels by enhancing de novo synthesis of purines and pyrimidines through upregulating MYC-mediated transcriptional activation of ribose-5-phosphate isomerase A (RPIA), which has been shown to play a crucial role in the development of HCC [21]. Additionally, the inactivation of tumor suppressor TP53 has been shown to fuel nucleotide synthesis in tumor cells that contributes to the maintenance of homeostasis and the proliferation of cancer cells. Moreover, the top 2 significantly enriched KEGG terms were p53 signaling pathway and cell cycle. The tumor suppressor p53 plays a major role in cell cycle arrest and/or apoptosis, and p53 mutations and functional inactivation are linked to the pathology of PLC. He's results showed that abnormal expression of p53 and cyclinD1 can lead to the progression of HCC by regulating G 1 /S transformation [22] [23].
Genes with acquired mutations or abnormal expression that are causally associated with cancer progression are called driver genes. Identifying and understanding genetic driver mutations dramatically facilitates the development of targeted cancer therapies. For this reason, we screened 11 driver genes by comparing the 17 mutated genes with the data from 3 databases, including Vogelstein's list, NCG database, and COSMIC database, which have been often utilized to identify driver genes. In 2013, Bert Vogelstein reviewed 125 driver genes containing 71 tumor suppressor genes and 54 oncogenes, which had been defined by the 20/20 rule in a total of 294,881 mutations [9]. Since then, in many researches, the driver genes in Vogelstein's list have been regarded as well-known oncogenes and tumor suppressor genes undergoing copy number alterations in the common solid tumors, such as colon, lung, prostate, breast, etc. [24] [25] [26]. The NCG is an open-access database of 2372 genes, consisting of 1661 predicted driver genes in cancer and 711 known cancer-driving genes which contains data on gene mutations [11]. Bioinformatic analysis of an HCC dataset from the Gene Expression Omnibus (GEO) database was performed to identify cancer-related genes which were followed by imported into the NCG database and identified several driver genes, including ATC, CCND1, CREBBP, FTCD, MDH2, PPPARG, and TP53 [27]. The COSMIC database is currently the most comprehensive database of mutations in cancer, containing 576 securable driver genes [12]. El-Ayadi et al. [28] firstly reported a medulloblastoma case with concurrent IDH1 and SMARCB1 mutations after searching the catalog of somatic mutations in the COSMIC database. In this study, the above 3 databases were combined to screen driver genes from 17 mutation genes, which significantly improved accuracy and reliability.
The correlations between the 11 driver genes and clinicopathological characteristics of 262 PLC patients were performed by logistic regression analysis. The results showed that only 3 mutant genes, RB1, TP53, and KRAS had statistical Open Journal of Gastroenterology significance. The RB1 gene, which was the first tumor suppressor gene identified, has a negative regulation function on cell growth and can promote cell death. In this study, we found that patients under the age of 60 were prone to RB1 mutations, and the same result was also mentioned in Chaudhary's study [29]. To reveal the possible link between 10 driver genes and age, they conducted Mann-Whitney-Wilcoxon tests for the continuous age variable in 6 HCC cohorts. The results confirmed that RB1 is the driver gene significantly and preferably prevalent in younger patients. In addition, Chaudhary et al. also demonstrated that TP53 was preferred in males, and in terms of associations with race, TP53 showed higher relative risk in Asians. This coincides with our study, in which male or Asian patients were more likely to have TP53 mutations. TP53, as a well-known tumor suppressor gene, seems to have a higher risk of developing HCC in this study, however, Hill et al. [30] reported TP53 loss enhanced reprogramming of hepatocytes to biliary cells, which may be a mechanism facilitating the formation of hepatocyte-derived ICC. Another mutant gene with statistical significance, KRAS, increased the risk of ICC, which has been supported by Levi S. In their study, all the 15 cholangiocarcinomas patients showed a KRAS mutation at codon 12, and 9 of them contained 2 or more mutations [31].

Conclusion
The pattern of genetic alterations in cancer driver genes in PLC patients is highly diverse, which partially explains the low efficacy of available therapies. Therefore, it may be a new option to try to use NGS technology to find the driver genes and carry out targeted drug delivery. In this study, 17 mutated genes and 11 mutation driver genes in PLC were identified through bioinformatic analysis of TCGA PLC data and NGS detection of clinical samples. Among them, RB1, TP53, and KRAS have relationships with clinicopathological characteristics, including age, gender, race, primary sites, and pathological type. This study provided significant clues and basis for further understanding the molecular pathogenesis, drug development, and treatment of PLC.