Novel Methods in the Study of the Breast Cancer Genome: Towards a Better Understanding of the Disease of Breast Cancer

Rapidly developing sequencing technologies and bioinformatic approaches have provided us with an unprecedented instrument allowing for an unbiased and exhaustive characterization of the cancer genome in genetic, epigenetic and transcriptomic dimensions. This review introduces recent exciting findings and new methodologies in genomic breast cancer research. With this development, cancer genome research will illuminate new delicate interactions between molecular networks and thereby unravel the underlying biological mechanisms for cancer initiation and progression. It also holds promise for providing a molecular clock for the estimation of the temporal processes of tumorigenesis. These methods in combination with single cell sequencing will make it possible to construct a family tree elucidating the evolutionary lineage relationships between cell populations at single-cell resolution. The anticipated rapid progress in genomic breast cancer research should lead to an enhanced understanding of breast cancer biology and guide us towards novel ways to ultimately prevent and cure breast cancer.


Introduction
Breast cancer is the second most commonly diagnosed cancer and seriously threatens women's health [1]. It is a complex disease in which both genetic and environmental causes are implicated in tumorigenesis. The catalogue of inherited or somatic mutations accumulated in a cancer genome encompasses nucleotide substitutions, insertions and deletions, translocations and other chromosomal rearrangements, as well as copy number changes [2]. Considerable effort has been devoted in the last decade to identifying the spectrum of genes associated with breast cancer [3]. Genes with high-penetrance mutations, such as BRCA1 and BRCA2, are involved in approximately 70% of breast cancers in high-risk families. However, they only account for a minority of all breast cancer cases [4]. In general, <10% of breast cancer cases are thought to be hereditary in a Mendelian fashion, and usually a somatic "second hit" in the homologous normal allele is required for disease development.
Thus, identifying low-penetrance susceptibility gene variants (inherited or somatically acquired) has become an area of interest in breast cancer research. Genome-wide association studies (GWAS) are commonly used to search for correlations between disease incidence and genetic variation. GWAS routinely encompass tens of thousands of patient samples and scan the full length of the genome [5]. GWAS have identified 25 genetic loci associated with breast cancer risk [5]. Still, to date, GWAS can only account for 9%-10% of breast cancers [5]. Even when considering all types of genetic studies, some 70% of breast cancer cases remain unexplained [5,6]. It has become obvious that genetic factors only account for part of the phenotypic variance [7]. Breast cancer development is a multi-step process, and the risk increases with age. Environmental degenerative factors no doubt play an important role in breast cancer tumorigenesis. Epigenetic changes, including somatically acquired (and sometimes germline-transmitted) chemical modifications of DNA (without DNA sequence changes) as well as DNA-binding small RNAs and proteins (e.g. histones), bridge the gap between genetics and the environment, significantly improving our understanding of breast cancer [8,9].
The emergence of massively parallel sequencing technology provides researchers with an unprecedentedly powerful tool for breast cancer research. Currently, there are five commonly applied massively parallel sequencing technologies: 454 Life Sciences (Roche) applies a pyrosequencing approach [10]; Illumina/Solexa uses the principle of sequencing by synthesis (SBS) with reversible dye terminators [11]; Applied Biosystems SOLiD [12] and Complete Genomics [13] use sequencing-by-ligation strategies; and Ion Torrent [14] uses an ion-sensitive SBS principle. Although these sequencing platforms are technically quite diverse, they share many common features: a similar process of library preparation, amplification of libraries prior to sequencing, and sequencing by an automated series of enzyme-driven biochemical and fluorescent imaging-based data acquisition steps [15]. These platforms allow ultra-deep investigations of breast cancer genomes and their epigenetic modifications in a fast and cost-effective way, without requiring abundant amounts of material [2,16]. Here we briefly describe current genomic approaches applied in breast cancer research.
Although array-based approaches remain broadly applied for RNA analyses at present, transcriptome sequencing is becoming increasingly important, as sequencing has a greater dynamic range and makes it possible to discover new transcripts, sequence variants and splicing events [17,18]. RNA sequencing also allows deep mapping of short RNA fragments (17-22 nucleotides), thus substantially increasing our knowledge of the biology, diversity and abundance of small RNA populations [19,20].
Although a number of whole breast cancer genomes have already been sequenced [21-23], analyses of particularly informative sectors of the cancer genome, e.g. sequencing of captured exomes and of DNA sequences coding for known microRNAs, are likely to be carried out commonly [24]. Exome sequencing applies affinity-enrichment techniques to enrich exome sequences from the genome before sequencing, thereby allowing deep characterization of the target sequence at a decreased cost [25]. Massively parallel sequencing can also efficiently sequence small genome fragments that have been randomly collected from the tumor genome to reveal copy number changes (low-coverage sequencing) [16]. The relative number of sequenced short DNA fragments in equal-sized bins distributed along the genome can be regarded as an estimate of the relative copy number at different genomic locations [16].
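The bin-counting idea behind low-coverage copy number profiling can be illustrated with a minimal sketch. It assumes reads are already mapped and represented only by their 0-based start positions on one chromosome, and it normalises each bin to the median bin count; both the function names and the median normalisation are illustrative assumptions, not the procedure used in the cited studies, which also correct for factors such as GC content and mappability.

```python
from statistics import median

def bin_read_counts(read_starts, bin_size, n_bins):
    """Count mapped reads in equal-sized bins along one chromosome."""
    counts = [0] * n_bins
    for pos in read_starts:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def relative_copy_number(counts):
    """Scale bin counts by the median bin count (assumed to represent the baseline)."""
    centre = median(counts)
    if centre == 0:
        raise ValueError("median bin count is zero; use larger bins or more reads")
    return [c / centre for c in counts]

# Toy data (hypothetical): the third 100-bp bin carries twice as many reads.
read_starts = [10, 40, 70, 110, 150, 190, 205, 215, 230, 250, 270, 290, 310, 350, 390]
counts = bin_read_counts(read_starts, bin_size=100, n_bins=4)
profile = relative_copy_number(counts)
print(counts)   # [3, 3, 6, 3]
print(profile)  # [1.0, 1.0, 2.0, 1.0]
```

In this toy profile the third bin has a relative copy number of 2.0, which would be read as a gain relative to the rest of the region.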
Massively parallel sequencing has also dramatically increased our ability to survey genome-wide epigenetic markers. Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) uses antibodies to pull down target DNA in order to globally survey the DNA binding pattern of a protein of interest [26]. This method is also applied to measure histone modifications. DNA methylation, as an important epigenetic mechanism, has been extensively studied. To date, there are three main approaches compatible with massively parallel sequencing for genome-wide mapping of DNA methylation: 1) endonuclease digestion-based methods such as modified methylation specific digital karyotyping (MMSDK) [27]; 2) affinity enrichment-based methods such as methylated DNA binding domain sequencing (MBD-Seq) [28] and methylated DNA immunoprecipitation sequencing (MeDIP-Seq) [29]; and 3) bisulfite conversion-based methods such as MethylC-Seq (methylome) [30,31] and reduced representation bisulfite sequencing (RRBS) [32]. Exhaustive comparisons of these DNA methylation assays have recently been carried out by several groups, and these studies are invaluable when selecting methods for DNA methylation analysis [33-36].
With the cost of sequencing the whole human genome dropping towards 1000 US dollars in the near future (http://www.genome.gov/12513210) [37], a revolutionary era of personalized medical care for breast cancer patients will soon become a reality. For example, the elucidation of a number of intrinsic breast cancer subtypes [38] has added significantly to our understanding of breast cancer heterogeneity and also provides tools that can be used to select the right treatment for the right patient at the right time. The important advances in cancer genome analysis brought about by the application of massively parallel sequencing have already been discussed in detail in many other reviews [2,16,39,40]. In the present review, we will introduce and highlight some new research directions, which we expect will lead to an increased understanding of breast cancer.

Pathway-Oriented Analysis Based on Integration of Multiple "Omic" Dimensions
One important insight obtained from the mutational analyses carried out in pioneering large-scale sequencing studies of breast and colon cancer was the importance of taking a pathway-oriented strategy [41-43].
A pathway-oriented model of tumorigenesis is also supported by the observation that although different genes may be mutated in the same type of tumor, these genes often belong to a more limited number of pathways and biological processes [41]. For example, breast and colorectal cancers both have frequent mutations in PIK3CA pathway genes, but these mutations are not always in the same genes [41]. The cancer genome can be dysregulated through multiple mechanisms, including mutations in coding and non-coding sequences, alterations in DNA copy number and organization, and aberrant modifications of DNA and DNA-related proteins [16]. These abnormalities may occur simultaneously in a key gene, in an independent or synergistic manner, leading to dysfunction of this gene and thereby fueling tumorigenesis. Alternatively, these abnormalities can target different genes that are connected within a pathway and, through dysfunction of the pathway, ultimately facilitate cancer development. A classic example is the tumor suppressor gene TP53, which can be inactivated in three ways: through homozygous deletions in the 17p13.1 region; through hypermethylation of the TP53 promoter, which epigenetically silences expression; or through mutations that cripple the function of TP53. More interestingly, these mechanisms can collaborate to cause dysfunction of the gene: one allele may be inactivated by mutation while the other is subsequently lost through a copy number change or silenced by DNA methylation of its promoter region, eventually inactivating gene function completely.
Allele-specific gene expression regulated by epigenetic mechanisms was previously regarded as mainly confined to genomic imprinting. A recent study of the DNA methylome of human peripheral blood mononuclear cells demonstrated that regulation of allele-specific gene expression by allele-specific DNA methylation may exist as a more general biological phenomenon [31], which underscores the relevance of integrative analysis involving multiple dimensions of biological information. This can be even more important in cancer genome research, because genomic abnormalities observed in cancers can be associated with a broad range of biological characteristics. Thus, breast cancer is undoubtedly a complex disease, both in its biological mechanisms and in its final biological endpoints. A deeper understanding of breast cancer therefore requires broad investigations of the breast cancer genome in different dimensions, followed by integrative analysis of the findings using a pathway-oriented strategy. Additionally, pathway analyses can facilitate the selection of genes for further functional analyses [44].
Recent breast cancer genome studies have focused their efforts on integrative analysis of large-scale sequencing data. For example, many studies have combined sequencing data from the genome, exome and/or transcriptome to evaluate the impact of mutations or genome rearrangements on gene expression [45,46]. Integrating DNA copy number, RNA transcriptome, and CpG island methylation profiles, Sun et al. systematically examined the genomic features underlying the estrogen receptor positive (ER+) and estrogen receptor negative (ER-) breast cancer phenotypes [47]. These studies demonstrate the transition of breast cancer research strategy from focusing on a handful of genes to multi-dimensional investigations of the whole genome using pathway-oriented models.
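The core observation of the pathway-oriented model, that different tumors may carry alterations in different genes of the same pathway, can be made concrete with a small sketch. The gene and pathway names below are placeholders, the gene-to-pathway mapping is hypothetical, and an "alteration" stands for any event class discussed above (mutation, copy number change or promoter methylation); a real analysis would draw the mapping from a curated pathway database.

```python
from collections import defaultdict

def pathway_hit_fractions(sample_alterations, gene_to_pathways):
    """Fraction of samples with at least one altered gene in each pathway."""
    if not sample_alterations:
        raise ValueError("no samples supplied")
    hit_samples = defaultdict(set)
    for sample, genes in sample_alterations.items():
        for gene in genes:
            for pathway in gene_to_pathways.get(gene, ()):
                hit_samples[pathway].add(sample)
    n_samples = len(sample_alterations)
    return {p: len(s) / n_samples for p, s in hit_samples.items()}

# Hypothetical mapping: three genes share one pathway, one gene sits in another.
gene_to_pathways = {
    "GENE_A": ["PATHWAY_X"],
    "GENE_B": ["PATHWAY_X"],
    "GENE_C": ["PATHWAY_X"],
    "GENE_D": ["PATHWAY_Y"],
}

# Hypothetical tumors, each altered in a different gene.
sample_alterations = {
    "tumor_1": {"GENE_A"},
    "tumor_2": {"GENE_B"},
    "tumor_3": {"GENE_C"},
    "tumor_4": {"GENE_D"},
}

fractions = pathway_hit_fractions(sample_alterations, gene_to_pathways)
print(fractions)
```

In this toy cohort every individual gene is altered in only one of four tumors, yet PATHWAY_X is hit in three of four, which is the kind of signal a gene-by-gene tally would miss.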

Mitochondrial Genome Analysis
The human mitochondrial genome is a 16.6 kb double-stranded circular DNA molecule whose copy number varies widely according to cell type [48]. Because of a lack of histone protection, limited repair capacity and proximity to superoxide radicals, mitochondrial DNA (mtDNA) is more susceptible to damage than the nuclear genome [49]. Additionally, the absence of introns in the mitochondrial genome leads to more frequent coding sequence mutations, which affect mitochondrial function. Numerous somatic mutations of mtDNA have been observed in breast cancer [50,51]. Mitochondrial dysfunction has long been suspected to contribute to the development and progression of cancer [52]. At present, a primary goal is to assess the functional role of the various mitochondrial mutations in the initiation and progression of breast cancer, with a specific focus on identifying mutations associated with acquired adaptation for rapid proliferation under hypoxic conditions, as well as mutations related to drug metabolism. Thus, mtDNA mutations may have potential value as cancer biomarkers, for example to predict the metabolism of different chemotherapeutic drugs, i.e. to predict sensitivity or resistance to treatment. Moreover, the majority of mtDNA mutations have been observed to be homoplasmic in early preneoplastic and cancerous lesions, i.e. the mutated mtDNA predominates and is readily detectable in tumor biopsy material, with amounts reported to be 19-22 times more abundant than, for example, mutated TP53 DNA [53]. The success in identifying mtDNA mutations in material obtained from fine-needle aspirates underscores the potential of using this methodology in clinical practice [51,54]. Distinguishing the spectrum of cancer-related mutations from age-related mutations is necessary, since mitochondrial mutations have also been reported to occur as a function of the aging process [55].
The heterogeneity of the mitochondrial genome must be considered during analysis. To address this issue, ultra-high-depth sequencing and additional association analyses are required. Still, research into the mitochondrial genome is relatively neglected, since previous studies using massively parallel sequencing have mainly focused on characterizing the cancer nuclear genome. Thanks to the high copy number of mitochondrial genomes per cell, mitochondrial sequences are a common bonus of whole cancer nuclear genome sequencing. Taking advantage of this mitochondrial genome information will hopefully provide us with a better understanding of the associations between the breast cancer genome and the diverse range of breast cancer phenotypes. In our experience, even low-coverage genomic sequencing (typically one gigabase per sample) yields 100% coverage and more than 300× depth of the mitochondrial genome sequence. Alternatively, custom-designed enrichment assays that specifically capture mtDNA from total isolated DNA can be used to achieve in-depth target sequencing.
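A quick back-of-the-envelope calculation shows why such depth is plausible. The sketch below uses only the figures given in the text (a 16.6 kb mitochondrial genome, one gigabase of sequence per sample, 300× depth) and the standard definition of mean depth as aligned bases divided by target length; the function names are illustrative.

```python
MT_GENOME_BP = 16_600  # 16.6 kb, as stated in the text

def mean_depth(mapped_bases, target_bp=MT_GENOME_BP):
    """Mean depth = bases aligned to the target / target length."""
    return mapped_bases / target_bp

def fraction_of_run(depth, total_bases, target_bp=MT_GENOME_BP):
    """Fraction of all sequenced bases that must map to the target to reach `depth`."""
    return depth * target_bp / total_bases

total_bases = 1_000_000_000              # one gigabase per sample
mt_bases_for_300x = 300 * MT_GENOME_BP   # bases needed on mtDNA for 300x
mt_fraction = fraction_of_run(300, total_bases)

print(mt_bases_for_300x)  # 4980000
print(mt_fraction)        # roughly 0.005, i.e. about 0.5% of the run
```

Under these figures, about 5 Mb of mitochondrial sequence, or roughly 0.5% of a one-gigabase run, is enough to reach 300× depth.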

The Temporal Order of Genome Changes in the Evolution of the Breast Cancer Genome
Molecular characterization of human cancers usually yields a catalogue of genomic and epigenetic abnormalities reflecting years of somatic changes up to the sampling time point [56]. Efforts to elucidate the temporal order of aberrations typically examine a series of samples, such as matched pairs of primary tumors and metastases, or sporadic samples ordered according to different clinicopathological stages. These studies have revealed mutations associated with tumor progression and metastasis [57,58]. However, the differences between a primary tumor and its metastasis present a molecular profile of only the late stages of tumorigenesis, and the molecular characterization of progression between sporadic samples is intrinsically biased by different genetic backgrounds.
Copy-neutral loss of heterozygosity (CN-LOH), a frequently observed event in tumors, offers a unique opportunity to illustrate the longitudinal evolution of somatic events, beginning early in tumorigenesis, within a single cancer [56,59]. CN-LOH, also referred to as uniparental disomy (UPD), is the loss of one copy (allele) of a heterozygous chromosomal region followed by duplication of the other allele, yielding a homozygous chromosomal region without a copy number change (Figures 1(a) and 1(b)). The process of CN-LOH can reveal important information about the evolutionary history of somatic aberrations: if a mutation precedes a regional UPD duplication, its copy number is doubled, i.e. it becomes homozygous, whereas mutations following such a duplication event appear in haploid copy number, i.e. heterozygous [56,59]. Based on this principle, simple mutations preceding a chromosomal duplication event show discretely higher copy numbers than those occurring after duplications, and the ratio of heterozygous to homozygous mutations in CN-LOH regions directly reflects the temporal order of the duplication in tumorigenesis (Figures 1(a) and 1(b)) [56]. In practical analysis, mutations can be discretely classified as homozygous (high allele frequency) or heterozygous (low allele frequency). The difference in allele frequency, i.e. the shift between homozygous and heterozygous mutations, can reveal the temporal order of genetic events that occurred in different regions of a single cancer genome. A few somatic homozygous mutations accompanied by abundant heterozygous mutations in a CN-LOH region imply that the homozygous mutations are early events in tumorigenesis, since a long period after the duplication event would allow this region to accumulate numerous new heterozygous mutations. On the other hand, a majority of homozygous mutations with a concurrent minority of heterozygous mutations implies that the duplication event occurred in the recent past: the mutations on the lost allele have disappeared, only one allele's information is retained and doubled, and new heterozygous mutations are limited because of the short period of accumulation after the duplication event.
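The reasoning above can be sketched in a few lines. The example classifies mutations in a two-copy CN-LOH region by allele frequency and then converts the counts into a rough timing estimate. It is a deliberately simplified illustration, not the statistical model of Durinck et al. [56]: it assumes a fixed classification threshold, contamination by diploid normal cells only, and a constant mutation rate per chromosome copy. Under that last assumption, mutations acquired before the duplication accumulate on the one retained allele and mutations acquired afterwards accumulate on two copies, so the fraction of elapsed time at which the duplication occurred is n_hom / (n_hom + n_het / 2).

```python
def classify_by_allele_frequency(vafs, purity=1.0):
    """Split allele frequencies from a two-copy CN-LOH region into
    homozygous (expected near `purity`) and heterozygous (expected near
    `purity` / 2) calls, using the midpoint as an assumed threshold."""
    threshold = 0.75 * purity
    homozygous = [v for v in vafs if v >= threshold]
    heterozygous = [v for v in vafs if v < threshold]
    return homozygous, heterozygous

def duplication_time_fraction(n_hom, n_het):
    """Fraction of elapsed time before the duplication, assuming a constant
    mutation rate per chromosome copy (illustrative assumption)."""
    denominator = n_hom + n_het / 2
    if denominator == 0:
        raise ValueError("no mutations in region")
    return n_hom / denominator

# Hypothetical allele frequencies for two CN-LOH regions of one tumor.
early_region = [0.98, 0.52, 0.47, 0.50, 0.49, 0.55, 0.45]        # 1 hom, 6 het
late_region = [0.97, 0.99, 1.00, 0.95, 0.96, 0.98, 0.51, 0.48]   # 6 hom, 2 het

hom_e, het_e = classify_by_allele_frequency(early_region)
hom_l, het_l = classify_by_allele_frequency(late_region)
t_early = duplication_time_fraction(len(hom_e), len(het_e))
t_late = duplication_time_fraction(len(hom_l), len(het_l))
print(t_early, t_late)
```

In the toy data, the region dominated by heterozygous mutations is assigned an early duplication (0.25 of elapsed time), whereas the region dominated by homozygous mutations is assigned a late one, matching the qualitative argument in the text.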
This principle is also valid for trisomic regions. A trisomic region can arise in two different ways. It can be the result of a simple duplication in which one allelic chromosomal region is doubled; in this case, the trisomic region harbors both heterozygous and disomic homozygous mutations (Figure 1(c)). Alternatively, a CN-LOH event can be followed by a secondary duplication to generate a trisomic region; in this scenario, the trisomic region harbors three types of mutations: heterozygous, disomic homozygous and trisomic homozygous (Figure 1(d)).
Taken together, combining the allelic frequency of mutations with the corresponding chromosomal copy number allows the measurement of the relative order of progressive events determining a cancer's individuality [56,59]. Durinck et al. recently applied this principle to delineate the temporal order in the evolution of skin and ovarian cancers [56]. In that study, based on investigations of the allele frequency of the mutations and the corresponding copy number profile, the mutation of TP53 was revealed as an initial event preceding the substantial numbers of somatic mutations in the development of both types of cancer. Notably, the method introduced by Durinck et al. can sharply delineate the widespread genomic instability of any type of cancer, setting the stage for determining the genetic events in the progression of breast cancer too. Delineation of the temporal order in cancer evolution will offer important information for future characterization of the succession of molecular changes and identification of driver mutations in early breast cancer tumorigenesis, thereby supporting the development of novel cancer detection assays and the establishment of new innovative targeted treatment modalities [56].

Figure 1. Conceptual framework defining the temporal order of genetic events in cancer genome evolution based on the relationship between point mutations and chromosome aberrations. (a) and (b) show the principle of determining the temporal order of point mutations and copy-neutral loss of heterozygosity (CN-LOH) events [56]. In (a), homologous chromosomes (one chromosome is in red and its homolog is in green) accumulate different mutations in their alleles. These mutations are heterozygous (yellow lines). A CN-LOH event (highlighted by a blue rectangle) occurs at an early stage. Thus, the number of heterozygous mutations (yellow lines) is limited due to the relatively short time allowed for mutations to accumulate. During the CN-LOH event, the loss of one allelic chromosomal region is compensated by duplication of its homolog. The previously heterozygous mutations on the homolog become homozygous (dark blue lines) through this duplication event and can thus be classified as early. The heterozygous mutations not located in CN-LOH regions remain intact. The new mutations arising after the CN-LOH event are heterozygous (yellow lines). Since this CN-LOH occurred early, a long period was available for accumulation of more new heterozygous mutations prior to the sampling time point (the left pane in (a)). By contrast, a majority of homozygous mutations with a concurrent minority of heterozygous mutations implies that a duplication event has occurred in the recent past, in which the previous heterozygous mutations have become homozygous and new heterozygous mutations are limited due to the short period of accumulation after the duplication event. A statistical model is applied to determine the temporal order of CN-LOH by calculating the densities for the allele frequency of heterozygous and homozygous mutations. The ratio of heterozygous to homozygous mutations in CN-LOH regions directly reflects the temporal order of the duplications in tumorigenesis (the right panes in (a) and (b)). This principle can also be applied to determine the temporal order for trisomic regions. A trisomic region can be acquired by two distinct types of events. It can be the result of a simple duplication (highlighted by a red rectangle) in which one allelic chromosomal region is doubled; in this case, the trisomic region harbors both heterozygous and disomic homozygous mutations (c). Alternatively, a CN-LOH event (highlighted by a red rectangle) is followed by a secondary duplication (highlighted by a second red rectangle) to generate a trisomic region; in this scenario, the trisomic region harbors three types of mutations: heterozygous, disomic homozygous and trisomic homozygous (d).

Single-Cell Sequencing
Breast cancer is a complex disease in part because the progression of breast cancer is a dynamic evolutionary process in the temporal dimension, and in part because breast cancer neoplasms contain highly heterogeneous cell populations in the spatial dimension. Tumor heterogeneity is an unavoidable fact in cancer research, because it is related to many of the important features of tumorigenesis including tumor progression, metastasis and therapeutic resistance [60,61]. Breast cancer is a typically heterogeneous cancer type, composed of diverse malignant epithelial subpopulations mixed with non-malignant tissues, such as infiltrating stromal cells, and cells from the immune system, such as infiltrating lymphocytes [62]. In some scenarios, normal cell populations may contribute to more than 50% of the total extracted DNA or RNA [63]. To address tumor heterogeneity, one solution is to select samples enriched for tumor content (at least 80%) and perform in-depth sequencing to obtain sufficient sequenced data for characterization of dominant cancerous populations. This strategy is not optimal for studies aimed at reconstructing the evolutionary history and revealing the hierarchical structures in cell populations, since the subtle, important information from special rare subpopulations of cells may be masked, or even lost, in the data obtained from mixed bulk populations. Recently developed sequencing approaches for single cells at transcriptomic [64], genomic (DNA copy number profile) [65] and exomic [66,67] levels provide a new strategy for improved characterization of tumorigenesis. These approaches also offer promising tools for the early detection of compromised genes involved in cancer initiation, deciphering intratumour heterogeneity, monitoring the most malignant cells and capturing circulating tumor cells, thus guiding clinical therapy [63] (Figure 2).
Isolation of individual cells is a prerequisite for single-cell genomic and transcriptomic analyses. Several attempts have been made to stratify cell subpopulations using regional macrodissection, fluorescence-activated cell sorting (FACS), laser capture microdissection (LCM) and other forms of micromanipulation (Figure 2). Macrodissection retains anatomical information, making it possible to clarify the relationship between cells in spatial proximity that share the same microenvironment. Additionally, this method is easily performed without special equipment. However, one drawback is that it can only provide a gross stratification rather than single-cell resolution. FACS can collect cells according to the fluorescent characteristics of each cell, but cell populations selected on the basis of a limited number of labeled features may remain heterogeneous with respect to other cellular or molecular properties. In addition, anatomical information is lost when the tissue is dissociated into a cell suspension. LCM enables users to collect target cells individually, thereby providing an ideal and well-characterized biological material for subsequent analysis, but it is labor-intensive and time-consuming. Micromanipulation can also capture single cells from cultured cells, dissociated tissue or biopsy material according to a given feature, but with the same shortcomings as LCM.
The amount of material isolated from individual cells using the above approaches is usually very small. Thus, an amplification step of the DNA or mRNA (through amplification of cDNA) extracted from captured single cells is necessary. Whole genome DNA amplification approaches, such as PCR-based amplification [68,69] and isothermal multiple displacement amplification [70], provide tools for relatively unbiased amplification of DNA from single cells. Isothermal multiple displacement amplification in particular has been demonstrated to ensure a highly efficient, good-quality and representative amplification of the template genome [70]. RNA is prone to degradation, thus stabilization of RNA is necessary for single-cell transcriptomic analysis. To maximize the sensitivity of subsequent sequencing analyses, elimination of genomic DNA contamination is also recommended. The methods for increasing the amount of RNA include linear in vitro transcription (IVT)-based and exponential PCR-based methods [71]. With improvements in methodology, the ~10 pg of total RNA and ~0.1 pg of mRNA in a typical mammalian cell can be converted to cDNA fragments of up to 3 kb, followed by uniform amplification that can increase the yield around ten million-fold to meet the requirements of downstream analyses with high reproducibility [64,71,72]. The techniques and methods applied in single-cell transcriptome analyses have recently been highlighted and discussed by Tang et al. [71].
In early studies, single-cell genomic and transcriptomic analyses mainly used microarray-based technologies such as array-CGH and gene expression microarrays. Massively parallel sequencing not only ensures deeper measurement of DNA copy number and transcriptomic profiles, but also directly provides sequence information. Recently, several applications of massively parallel sequencing platforms to single-cell analysis have been reported [64-67]. Navin and his colleagues applied single-nucleus sequencing (SNS) to investigate tumor population structure and evolution in two human breast cancer cases through the investigation of copy number profiles [65]. SNS was demonstrated to be a reproducible method: the sequencing result from a single cell showed a high correlation (R² > 0.85) with that from a million cells [65]. Tang et al. first reported transcriptome analysis of single mouse blastomeres in combination with massively parallel sequencing technology [71]. In their study, numerous known transcripts and splicing isoform expression patterns were identified at single-cell resolution. Notably, thousands of previously unknown exon-exon junctions were found in the transcriptome, indicating the potential value of this application in transcriptomic analysis of single cancer cells [64]. Recently, single-cell exome sequencing methods were introduced [66,67]. Hou and his colleagues carried out whole-exome single-cell sequencing of a JAK2-negative myeloproliferative neoplasm [67] and Xu and his colleagues carried out single-cell exome sequencing of a clear cell renal cell carcinoma (ccRCC) [66]. These two studies opened the way for detailed analyses of a variety of tumor types and other complex diseases, thereby supporting the development of more effective therapies targeted to the relevant cells [66,67].

Figure 2. Multiple approaches can be used to obtain single cells for different analyses [64-67]. Samples can be obtained as follows: surgical removal of primary breast cancer (purple) (a); fine needle biopsy of axillary lymph node (b); fine needle biopsy of primary breast tumor (c); capturing of circulating tumor cells from blood (d); surgical dissection of thoracic vertebral metastasis (red) (e). Macrodissection collects primary breast cancer (purple) and distant metastasis (red) (f). Isolation of single cancer cells can be performed by micromanipulation (g), laser capture microdissection (h) and fluorescence-activated cell sorting (FACS) (i). Phylogenetic analysis at the single-cell level is applied to uncover the evolutionary relationship between cancer cell populations (j).

Phylogenetic analysis is a commonly used bioinformatic tool for studying the evolutionary relationship between cancer cell subpopulations [57,58,65]. Cancer progression can be regarded as a micro-evolutionary process: a cancer begins with an initiating aberration in a normal cell that confers a selective growth advantage. Subsequently, successive clonal expansions occur, fueled by the acquisition of additional aberrations and corresponding to progression stages. At the same time, there is a massive loss of clones with lower fitness. In the late phases of tumorigenesis, founder cells within the cancer give rise to seeding clones that can colonize distant organs and hence initiate a disease stage characterized by metastatic lesions [73]. In phylogenetic analysis, single cells with similar DNA sequences likely originate from a common ancestor and are located on a lineage branch with short evolutionary distances in the phylogenetic tree. The lineage branch splits when a 'speciation' (founder cell) event occurs, in which a single ancestral lineage gives rise to two or more daughter lineages (extended clones). Consequently, through phylogenetic analysis of data generated by sequencing multiple samples ordered by the progressive stages of cancer, such as normal epithelium, carcinoma in situ, infiltrating carcinoma, lymph node metastasis and distant metastasis, it would be possible to reconstruct the evolutionary relationships between single cells, identify the founders responsible for initiating the next stage, determine their molecular features, and estimate the time intervals between successive stages.
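The idea that cells with similar sequences share a short branch can be illustrated with a minimal clustering sketch. It represents each cell as a binary vector of somatic events (1 = present), measures dissimilarity as the Hamming distance, and joins the closest groups by average linkage (UPGMA). The cell names and profiles are hypothetical, and UPGMA is chosen here only for brevity; it is not claimed to be the tree-building method used in the cited studies.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two event profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def upgma_tree(profiles):
    """Average-linkage clustering of cells; returns the tree as nested tuples."""
    # Each cluster is keyed by its (sub)tree and stores the cells it contains.
    clusters = {name: [name] for name in profiles}

    def average_distance(members_a, members_b):
        total = sum(hamming(profiles[a], profiles[b])
                    for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(clusters[pair[0]],
                                                     clusters[pair[1]]))
        merged_members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = merged_members
    return next(iter(clusters))

# Hypothetical single-cell profiles over six somatic events.
cells = {
    "normal":    (0, 0, 0, 0, 0, 0),
    "primary_1": (1, 1, 0, 0, 0, 0),
    "primary_2": (1, 1, 1, 0, 0, 0),
    "met_1":     (1, 1, 1, 1, 1, 0),
    "met_2":     (1, 1, 1, 1, 1, 1),
}

tree = upgma_tree(cells)
print(tree)
```

In this toy example the two primary-tumor cells and the two metastatic cells each form their own branch. Because UPGMA assumes a constant rate of change along all lineages, a rooted reconstruction of real data would normally use the normal cell as an explicit outgroup and a method that tolerates unequal branch lengths.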
Single-cell sequencing and related bioinformatic analyses open a new avenue for breast cancer research. These methods may have great importance for future breast cancer genome studies, especially with a continuous reduction in sequencing costs and the emergence of more powerful sequencing technologies. Under current conditions, the present methodologies have some drawbacks, such as the relatively low coverage in single-cell sequencing and the incomplete exploitation of the available sequence information [65]. Compared with genomic information, transcriptomes from single cells show more variability due to the influences of epigenetic events, the circadian clock, the cell cycle, microenvironmental niches and "transcriptional noise" [71]. The evidence of stochastic gene expression among single cells underscores the importance and necessity of analyzing the transcriptomes of multiple single cells, and also highlights the challenge of understanding and interpreting gene expression results from individual cells [71]. Epigenetic abnormalities may also contribute to breast cancer progression, but DNA methylation analysis for single cells has not yet been developed, mainly due to the lack of proper amplification methods. At present, no DNA amplification method is able to properly retain DNA methylation information in newly amplified DNA copies. If a technical breakthrough occurs in single-cell epigenetic analysis, the evolutionary models currently being constructed on the basis of single-cell genomic data will be improved by the addition of epigenetic information. Simultaneous analysis of DNA, DNA modification and mRNA from the same individual cells would be an ideal strategy for the comprehensive and precise interpretation of the functional alterations occurring in single cancer cells.
Accomplishing the above goal will depend on advances in sequencing technology. Nanopore DNA sequencing is one of a number of promising single-molecule sequencing approaches that can directly sequence DNA or RNA molecules using tiny amounts of material without requiring an amplification or labeling step [74,75]. DNA methylation information would be available in the direct readout, because unmethylated cytosines can be precisely distinguished from methylated cytosines in the DNA sequence [76]. Therefore, we believe that single-cell sequencing in combination with novel sequencing technologies will bring a revolutionary change to breast cancer research.

The Microbiome
Besides the aforementioned progress, microbiome and metagenomic studies are further promising fields in cancer research. Microbes inhabiting the human body, including eukaryotes, archaea, bacteria and viruses, are collectively known as the microbiome. Bacteria alone are estimated to outnumber human cells by an order of magnitude, and the gene set of a microbiome is approximately 150 times larger than the human gene complement [77,78]. Increasing evidence implicates the microbiome as crucially important for metabolism, immune defense, and the development of diverse disorders including cancers. In recent years, microbiome research has been boosted by large-scale sequence-based human microbiome projects such as Metagenomics of the Human Intestinal Tract (MetaHIT) and the Human Microbiome Project (HMP). A variety of microbial communities have been characterized by massively parallel sequencing, sequence analysis and functional studies [77][78][79]. Following the establishment of microbiome catalogs and references as well as the development of laboratory and bioinformatic approaches, especially investigations of the correlation with host phenotype, the microbiome will become an important aspect of cancer research. In the context of breast cancer research, the next effort will be to establish a cause-and-effect relationship between the microbiome and breast cancer susceptibility.

Challenges and Progress
Rapid development of improved methods for studying the breast cancer genome poses many future challenges. Some challenges will arise from the analysis of numerous short reads, the number of which is several orders of magnitude higher than that traditionally obtained by Sanger sequencing. Thus, the first challenge is to meet growing computational requirements such as sufficient storage, data transfer and assembly. Secondly, there is an urgent need for fast, accurate and user-friendly bioinformatic approaches for data mining to realize the full potential of these improved sequencing technologies. Numerous recently published bioinformatic tools offer a wide variety of options for broad "omics" analysis, but also raise the question of which method provides the best results. Thus, exhaustive comparisons between algorithms, the incorporation of miscellaneous analytic methods into an integrative pipeline, and the evaluation of the statistical power, sensitivity and specificity of software developed for the same analyses will be required for the standardization of bioinformatics in analyses of the breast cancer genome.
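To make the comparison of tools by sensitivity and specificity concrete, the following minimal Python sketch scores two variant callers against a truth set. It assumes that every assessed site can be represented by an integer identifier and that a validated truth set is available. The site identifiers, the caller names `caller_a` and `caller_b`, and the function names are hypothetical and do not correspond to any published software.

```python
def confusion_counts(called, truth, all_sites):
    """Return (TP, FP, FN, TN) for one caller over the assessed sites."""
    tp = len(called & truth)              # true variants that were called
    fp = len(called - truth)              # calls that are not true variants
    fn = len(truth - called)              # true variants that were missed
    tn = len(all_sites - called - truth)  # non-variant sites left uncalled
    return tp, fp, fn, tn


def sensitivity_specificity(called, truth, all_sites):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, fn, tn = confusion_counts(called, truth, all_sites)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical benchmark: 20 assessed sites, 5 of which are true variants.
all_sites = set(range(1, 21))
truth = {2, 5, 8, 13, 17}
caller_a = {2, 5, 8, 13, 4}              # conservative: misses one variant
caller_b = {2, 5, 8, 13, 17, 1, 3, 4}    # permissive: more false positives

for name, calls in [("caller_a", caller_a), ("caller_b", caller_b)]:
    sens, spec = sensitivity_specificity(calls, truth, all_sites)
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

The example illustrates the trade-off that such comparisons must weigh: the permissive caller recovers every true variant at the cost of a lower specificity, whereas the conservative caller does the opposite.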
In addition to bioinformatic methods, more representative human reference genomes, the cornerstone of cancer genome research, are greatly needed. With a growing number of published reference genomes and increasing knowledge of the variation in the normal human genome, the previous single consensus representation of the genome is no longer sufficient, especially in regions with complex allelic diversity. This challenge is being addressed by an effort to create assemblies that better represent this diversity (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/). At the same time, functional annotation projects will provide the information necessary for elucidating dysfunctions of protein-coding genes (GENCODE project (http://www.sanger.ac.uk/gencode/)) [80] and defining functional elements (http://encodeproject.org/ENCODE/) [81]. Another important resource for cancer genome research is well-annotated databases. Advances in understanding the cancer genome depend on access to comprehensive catalogues of variation in the human genome in normal populations. These normal variations are collected, curated and updated by many different databases according to variation type, for example, SNPs in dbSNP [82] and the HapMap database [83], copy number variations in DGV [84] and dbVar (http://www.ncbi.nlm.nih.gov/dbvar/), and comprehensive human genome variation in The 1000 Genomes Project [85].
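One common use of such catalogues is to separate variants already known in normal populations from candidate novel aberrations. The following minimal Python sketch shows the idea under simplifying assumptions: variants are represented as (chromosome, position, reference allele, alternative allele) records, the catalogue is a small in-memory list, and all coordinates are invented for illustration. No real database or its API is queried here, and the names `Variant` and `split_by_catalogue` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def split_by_catalogue(tumor_variants, known_normal_variants):
    """Split tumor calls into (novel, catalogued) by exact allele match."""
    known = set(known_normal_variants)
    novel, catalogued = [], []
    for variant in tumor_variants:
        if variant in known:
            catalogued.append(variant)
        else:
            novel.append(variant)
    return novel, catalogued


# Hypothetical catalogue of variants seen in normal populations.
known_normal = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 2500, "C", "T"),
]

# Hypothetical variant calls from a tumor sample.
tumor_calls = [
    Variant("chr1", 1000, "A", "G"),   # exact match to a catalogued variant
    Variant("chr17", 4100, "G", "A"),  # absent from the catalogue
    Variant("chr2", 2500, "C", "A"),   # same position, different allele
]

novel, catalogued = split_by_catalogue(tumor_calls, known_normal)
print("novel candidates:", novel)
print("catalogued:", catalogued)
```

Matching on the full record rather than on position alone keeps a different substitution at a catalogued position among the novel candidates.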
A more difficult challenge is defining normal epigenetic references, because epigenetic information is reversible and occurs in highly tissue-specific and developmentally associated patterns. Recently, the Epigenomics Mapping Consortium has been working to produce a public resource of epigenomic maps (DNA methylation, histone modifications and related chromatin features) for stem cells and primary ex vivo human fetal and adult tissues representative of normal human biology, thereby offering the normal counterpart for cancer research [86]. These databases, presenting either the repertoire of oncogenic variations [3] or collections of normal variations (see above), are well curated and periodically updated, and have been of great value for cancer genome research by providing comprehensive references and aiding in the identification of novel aberrations in individual studies. The relationship between large-scale sequencing projects for the construction of reference databases and the many milestone events of cancer genome sequencing has been well described in a recent review [87].
Large-scale sequencing of cancer genomes, including breast cancer, is rapidly providing an astronomical amount of data, which will offer many new candidate genes assumed to play pivotal roles in a given cancer phenotype. Careful functional studies of mutated genes are required for ultimate proof of the relationship between cancer gene status and clinical behavior [41]. How to validate these candidate genes will become a crucial challenge for researchers using routine assays such as cell lines or animal models. High-throughput RNA interference screens, in combination with the adaptation of existing model systems, will be a promising tool for refining the potential candidates provided by large-scale sequencing through further functional studies [16].
The many applications and analyses using massively parallel sequencing platforms have not yet been fully optimized, standardized and systematically evaluated for samples routinely processed in cancer pathology in clinical practice. This leaves a gap between bench and bedside. To address this important matter, comprehensive, coordinated international collaboration is necessary for the standardization of laboratory procedures and bioinformatic analyses [24].

Conclusion
The completion of the draft of the human genome ushered in the genomic era [88]. Since then, revolutionary breakthroughs in sequencing technology, a spectacular blossoming of bioinformatics and an accelerating accumulation of sequencing data have brought unprecedented opportunities as well as challenges to cancer research. Recently, the International Cancer Genome Consortium (ICGC) was launched to coordinate the large-scale sequencing of the genomes, epigenomes, and transcriptomes of 50 different cancer types and/or subtypes [24]. The goal of the project is to define catalogues of cancer genomic abnormalities and translate the findings of these genomic analyses into clinical utility [24]. This project not only has a profound influence on present cancer research but, more importantly, heralds the start of the era of personalized medicine [24]. Consequently, we can anticipate that sequencing and genomic analysis will play an important role in clinical practice. In the not too distant future, sequencing may become a population screening approach for the early detection of breast cancer, and sequencing of the breast cancer genome of individual patients may be routinely applied to guide personalized breast cancer patient management.
(the left pane in (b)). By using sequencing or microarray technologies, heterozygous mutations (newly accumulated, indicated by open circles) and homozygous mutations (generated by the duplication, indicated by blue solid circles) in the CN-LOH region (indicated by a horizontal thick solid black line) can be identified by their allele frequency. Mutations located in non-CN-LOH regions are shown by open stars (the middle panes in (a) and (b)).

Figure 2. Schematic indicating the framework of single cell sequencing analysis for breast cancer.