Journal of Intelligent Learning Systems and Applications

Volume 12, Issue 1 (February 2020)

ISSN Print: 2150-8402   ISSN Online: 2150-8410

Google-based Impact Factor: 4.31

Transcription Factor Bound Regions Prediction: Word2Vec Technique with Convolutional Neural Network

PP. 1-13
DOI: 10.4236/jilsa.2020.121001


Genome-wide epigenomic datasets allow us to validate the biological function of motifs and to understand regulatory mechanisms more comprehensively. How different motifs determine whether transcription factors (TFs) can bind to DNA at a specific position is a critical research question. In this project, we apply computational techniques from Natural Language Processing (NLP) to predict Transcription Factor Bound Regions (TFBRs) given motif instances. Most existing motif prediction methods based on deep neural networks take base sequences with one-hot encoding as the input feature for TFBR identification, which yields low-resolution predictions and only indirect insight into binding mechanisms. Moreover, the collective effect of motifs on binding sites is difficult to determine. In our pipeline, we apply the Word2Vec algorithm with the names of motifs as input, and use a Convolutional Neural Network (CNN) to predict TFBRs as a binary classification task on the ENCODE dataset. We treat different types of motifs as separate “words” and their corresponding TFBRs as the meanings of “sentences”. One “sentence” is simply a combination of these motifs, and all “sentences” together compose the whole “passage”. For each binding site, we perform binary classification within different cell types to show the performance of our model across binding sites and cell types. Each “word” has a corresponding high-dimensional vector, and the distances between vectors can be computed, so we can extract the similarity between motifs and an explicit binding mechanism from our model. The CNN extracts features, through its feature-mapping and pooling steps, from the motif vectors produced by Word2Vec, and the model achieves a peak accuracy of 87%.
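The “motifs as words, regions as sentences” idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. As stated assumptions: the motif names and toy regions are hypothetical, simple co-occurrence counts stand in for trained Word2Vec vectors, and the convolution filter is a fixed, untrained one used only to show the feature-mapping and pooling steps.

```python
import math

# Hypothetical toy data: each "sentence" is the ordered list of motif
# names found in one region (not taken from ENCODE).
sentences = [
    ["CTCF", "SP1", "YY1"],
    ["CTCF", "YY1", "SP1"],
    ["GATA1", "TAL1", "SP1"],
    ["GATA1", "TAL1", "KLF1"],
]


def cooccurrence_vectors(sentences, window=2):
    """Stand-in for Word2Vec: describe each motif by its context counts."""
    vocab = sorted({motif for s in sentences for motif in s})
    index = {motif: i for i, motif in enumerate(vocab)}
    vectors = {motif: [0.0] * len(vocab) for motif in vocab}
    for s in sentences:
        for i, motif in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[motif][index[s[j]]] += 1.0
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two motif vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def conv1d_max_pool(sentence, vectors, kernel):
    """Slide one filter over the motif sequence, apply ReLU, then max-pool."""
    width = len(kernel)
    rows = [vectors[motif] for motif in sentence]
    scores = []
    for start in range(len(rows) - width + 1):
        total = sum(
            w * x
            for k_row, row in zip(kernel, rows[start:start + width])
            for w, x in zip(k_row, row)
        )
        scores.append(max(0.0, total))  # ReLU activation
    return max(scores) if scores else 0.0  # global max pooling


vocab, vectors = cooccurrence_vectors(sentences)
# Untrained filter spanning two adjacent motifs (all weights set to 1).
kernel = [[1.0] * len(vocab) for _ in range(2)]

if __name__ == "__main__":
    print("CTCF vs YY1:  ", cosine(vectors["CTCF"], vectors["YY1"]))
    print("CTCF vs GATA1:", cosine(vectors["CTCF"], vectors["GATA1"]))
    print("pooled feature:", conv1d_max_pool(sentences[0], vectors, kernel))
```

In a full pipeline, a trained Word2Vec model would supply the vectors and many learned filters would feed a classifier that outputs bound versus unbound. The sketch only shows how motif similarity and a pooled convolutional feature arise from sequences of motif names.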

Cite this paper

Chen, R., Dai, R. and Wang, M. (2020) Transcription Factor Bound Regions Prediction: Word2Vec Technique with Convolutional Neural Network. Journal of Intelligent Learning Systems and Applications, 12, 1-13. doi: 10.4236/jilsa.2020.121001.

Copyright © 2020 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.