Novel algorithms for accurate DNA base-calling

Abstract

The ability to decipher the genetic code of different species would lead to significant scientific achievements in important areas, including medicine and agriculture. The importance of DNA sequencing has created a need for efficient, automated identification of base sequences from the traces generated by existing sequencing machines, a process referred to as DNA base-calling. In this paper, a pattern recognition technique is adopted to minimize the inaccuracy in DNA base-calling. Two new frameworks using Artificial Neural Networks and Polynomial Classifiers are proposed to model electropherogram traces belonging to Homo sapiens, Saccharomyces mikatae and Drosophila melanogaster. De-correlation, de-convolution and normalization are implemented as part of the pre-processing stage, which minimizes data imperfections attributed to the nature of the chemical reactions involved in DNA sequencing. Discriminative features that characterize each chromatogram trace are then extracted and passed to the chosen classifiers, which assign each event to its respective base class. The models are trained such that they are not restricted to a specific species or to a specific chemical procedure of sequencing. The base-calling accuracy achieved is compared with that of the existing standards, the PHRED (Phil’s Read Editor) and ABI (Applied Biosystems, version 2.1.1) KB base-callers, in terms of deletion, insertion and substitution errors. Experimental evidence indicates that the proposed models achieve a higher base-calling accuracy than PHRED and a performance comparable to that of ABI. The results obtained demonstrate the potential of the proposed models for efficient and accurate DNA base-calling.
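The comparison in terms of deletion, insertion and substitution errors can be illustrated with a minimal sketch. The abstract does not state how the called sequences were aligned to the reference, so the code below assumes a plain minimum edit distance (Levenshtein) alignment with unit costs; the function name `count_errors` and the example sequences are hypothetical and are not taken from the paper.

```python
def count_errors(reference: str, called: str) -> dict:
    """Count substitution, insertion and deletion errors in `called`
    relative to `reference`, assuming a unit-cost edit-distance alignment
    (an assumption; the paper's actual scoring procedure may differ)."""
    n, m = len(reference), len(called)

    # cost[i][j] = minimum number of edits turning reference[:i] into called[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != called[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # base missing from the call
                cost[i][j - 1] + 1,             # extra base in the call
            )

    # Trace back along one optimal alignment and tally each error type.
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and reference[i - 1] != called[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                counts["substitution"] += 1
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


if __name__ == "__main__":
    # Hypothetical reference and base-caller output with one wrong base.
    print(count_errors("ACGTACGT", "ACGTTCGT"))
    # {'substitution': 1, 'insertion': 0, 'deletion': 0}
```

When several optimal alignments exist, the split between error types can depend on the traceback order, although the total number of errors always equals the edit distance. A real evaluation would typically fix a tie-breaking rule or use an established alignment tool.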

Share and Cite:

Mohammed, O., Assaleh, K., Husseini, G., Majdalawieh, A. and Woodward, S. (2013) Novel algorithms for accurate DNA base-calling. Journal of Biomedical Science and Engineering, 6, 165-174. doi:10.4236/jbise.2013.62020

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.