A Scalable Method for Cross-Platform Merging of SNP Array Datasets


Single nucleotide polymorphism (SNP) arrays are a recently developed biotechnology used extensively in the study of cancer genomes. The variety of available platforms makes cross-study validation and comparison difficult. Meanwhile, study sample sizes are increasing rapidly, which imposes a heavy computational burden on even the fastest PC. Here, we describe a novel method that generates a platform-independent dataset from SNP arrays of multiple platforms. It extracts the probesets common to the individual platforms and performs cross-platform normalization and summarization based on these probesets. Because platforms may differ in the number of probes per probeset (PPP), these steps produce preprocessed signals whose noise levels differ across platforms. To handle this problem, we adopt a platform-dependent smoothing strategy and produce a preprocessed dataset that demonstrates uniform noise levels across individual samples. To scale the method to a large number of samples, we devised an algorithm that splits the samples into multiple tasks and the probesets into multiple segments before submitting them to a parallel computing facility. This scheme drastically reduces computation time and increases the ability to process ultra-large sample sizes and arrays.
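
The three ideas in the abstract (common probesets, platform-dependent smoothing, and the task/segment split) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the running-median smoother, the rule mapping PPP to window width, and the reference value of 24 probes are all assumptions made for illustration. The abstract does not say which smoother or which parameters the method uses.

```python
from itertools import product
from statistics import median


def common_probesets(platforms):
    """Return the sorted probeset IDs present on every platform.

    `platforms` maps a platform name to an iterable of probeset IDs.
    """
    id_sets = [set(ids) for ids in platforms.values()]
    return sorted(set.intersection(*id_sets))


def window_for(ppp, base_window=3, reference_ppp=24):
    """Hypothetical rule: fewer probes per probeset -> wider (odd) window.

    Assumes that a lower PPP gives noisier summarized signals, so more
    smoothing is needed to reach a comparable noise level.
    """
    w = max(1, round(base_window * reference_ppp / ppp))
    return w if w % 2 == 1 else w + 1


def smooth(signal, window):
    """Running median (an assumed smoother), truncated at the edges."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


def make_jobs(n_samples, n_probesets, samples_per_task, probesets_per_segment):
    """Split samples into tasks and probesets into segments.

    Returns one (sample_range, probeset_range) pair per independent job,
    which could then be submitted to a parallel computing facility.
    """
    tasks = [range(s, min(s + samples_per_task, n_samples))
             for s in range(0, n_samples, samples_per_task)]
    segments = [range(p, min(p + probesets_per_segment, n_probesets))
                for p in range(0, n_probesets, probesets_per_segment)]
    return list(product(tasks, segments))
```

Under these assumptions, `make_jobs(10, 100, 4, 30)` yields three sample tasks and four probeset segments, that is, twelve independent jobs. Each job touches only its own block of the sample-by-probeset matrix, so a worker's memory use is bounded by the block size and not by the full dataset.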

Share and Cite:

Chen, P. and Hung, Y. (2013) A Scalable Method for Cross-Platform Merging of SNP Array Datasets. Engineering, 5, 502-508. doi: 10.4236/eng.2013.510B103.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.