Building reliable genetic maps : different mapping strategies may result in different maps

New high throughput DNA technologies resulted in a disproportion between the high number of scored markers for the mapping populations and relatively small sizes of the genotyped populations. Correspondingly, the number of markers may, by orders of magnitude, exceed the threshold of recombination resolution achievable for a given population size. Hence, only a small part of markers can be genuinely ordered in the map. The question is how to choose the most informative markers for building such a reliable “skeleton” map. We believe that our approach provides a solution to this difficult problem due to: 1) powerful tools of discrete optimization for multilocus ordering; 2) a verification procedure, which is impossible without fast and high-quality optimization, to control the map quality based on re-sampling techniques; 3) an interactive algorithm of marker clustering in complicated situations caused by significant deviation of recombination rates between markers of non-homologous chromosomes from the expected 50% (referred to as quasi-linkage or pseudo-linkage); and 4) an algorithm for detection and removing excessive markers to increase the stability of multilocus ordering.


INTRODUCTION
Genetic maps are an important tool in genomics and in numerous practical applications such as breeding, medical genetics, and gene cloning.Unfortunately, the available algorithms and software tools become less suitable with an increasing number of available markers.The objective of our study is to develop efficient methodology for building multilocus genetic maps, providing the control of the quality of maps by detecting and removing the sources of map instability.Two major problems should be addressed in multilocus genetic mapping: 1) markers that belong to non-homologous chromosomes should not be assigned to the same linkage group; and 2) markers from the same chromosome should be placed on the genetic map in the same order as the corresponding DNA sequences that reside in the chromosome.
In situations with significant deviations of the recombination rates between non-synthetic markers from the expected level (50%), the problem of correct clustering cannot be solved by an arbitrary choice of a certain (constant) threshold value of recombination or LOD, albeit this is exactly how this problem is treated in many mapping packages [1,2].Indeed, in experiments with the foregoing characteristics, the recombination values between groups of markers from different chromosomes may sometimes be smaller than the values between adjacent markers within a chromosome.This phenomenon, referred to as "quasi-linkage" (or "pseudo-linkage") can result from a combination of statistical and biological reasons and scoring errors.The statistical reasonsof pseudo-linkage are mainly caused by the sample size and number of chromosomes (n) in the genome: increased n is associated with higher chances to detect "significant" deviations from independent segregation.But the major source of pseudo-linkage is biology.Literature on this phenomenon in many species can be found in: [3][4][5] and references therein.
Mapping algorithms tend to ignore pseudo-linkage.Consequently, some non-syntenic loci may appear in the same "linkage group", which could result in contradictions between mapping results for different mapping populations and between genetic and physical maps.Thus, up to 12% of cattle markers were assigned to wrong chromosomes and contradict the physical maps (H.Lewin, personal communication).In fish genetics, pseudo-linkage is also a known phenomenon (see for review: [6]).To address this problem, we suggest a modified approach of clustering markers into linkage groups.In our scheme, clustering is conducted concurrently with multilocus ordering that also includes verification of the order and removing unreliable markers.Instead of using one threshold recombination rate (or LOD value), we employ a series of increasing recombination thresholds (or decreasing LODs).At the first, most stringent threshold, we have a minimum of danger of mixing markers from different chromosomes into one linkage group, but the result is a high number of linkage groups.By relaxing the stringency at the next steps, we allow end-to-end merging of the ordered linkage groups, excluding those that display the strongest affinity to each other by their interior parts.
In many practical cases, high-density mapping is associated with another difficult problem: a disproportion between a high number of scored markers and a relatively small population size.The number of markers may, by orders of magnitude, exceed the resolution of recombination for the given population size, so that only a minority of markers can be actually ordered.The question is how to choose the most informative markers to build a reliable "skeleton" map.If we consider a situation with, for example, k ~ 1000 markers, then for a sample size of N ~ 100, the minimum distance between markers that can be resolved in the map should be  1 cM; hence, the map length for a chromosome should be  1000 cM, which is unrealistic for the vast majority of organisms.How can the appearance of such 1000 cM maps be explained?We believe that the root is in the wrong assumption that all markers are different (resolvable by recombination).In fact, for small sample sizes, many markers comprise groups of absolutely linked markers and should be replaced by their "delegates".But even with this simplification, the number of resulting markers that differ may remain quite large, with the map length by far exceeding the expectations based on the estimates of chiasma frequencies at meiosis [7,8].Clearly, marker scoring errors generate "false recombinants": with perfect scoring most of these recombinants would not have appeared, but after excluding absolutely linked excessive markers (replacing them by delegates), it would be possible to build an "ideal" skeletonap.Another possible complicating factor is negative interference [4,5,9,10] violating the simple principle that "the entire entity is supposed to be larger than its parts" [11].
Besides close linkage combined with a limited (small) sample size, the necessity for the selection of representative markers for the skeleton map derives from the varying information content of markers (co-dominant versus dominant, missing data, distorted segregation, and scor-ing errors), "absolute" linkage between repulsion-phase dominant markers, and negative interference [5].These (or some of these) criteria are employed by other authors also.Thus, before the analysis, the authors also chose bin markers, whereas after the analysis a decision is made about excluding double recombinants and recovering missing data (usually, by assuming no recombination).The problem with the last correction aimed to reduce the map length, is that it does not affect the order of the markers.This after-ordering correction deals with maps that might have been affected by errors of marker scoring.This could cause erroneous ordering or, even worse, bringing together markers from non-homologous chromosomes.
Our objective is to get an approximation as close as possible to the true multilocus order despite the foregoing complications.A specific feature of our approach is that the choice of candidates for the skeleton map is a part of the core ordering-verification procedure focused on detecting and removing markers causing local map instability and non-monotonic changes of recombination (i.e., deviation from the expected increase of rf between a marker and its subsequent neighbors).The verification process is based on multiple re-sampling runs from the scored mapping population using the so-called jackknife approach [12,13], namely, from the initial set of N genotypes, we sampled a subset of αN genotypes (e.g., with α = 0.8-0.9)scored for the same markers.The obtained sub-sample is employed to order the map.This process is applied repeatedly (e.g., 100 times), resulting in corresponding map orders.The neighborhoods that do not change upon these jackknife runs can be referred to as stable.
Clearly, such a formulation calls for massive repeated application of multilocus ordering procedures that may be computationally very challenging in case of moderate to high-density maps.Several genomic problems, including multilocus genetic mapping, in building physical maps (contig assembling for overlapping clones and radiation hybrid mappings), assembling ESTs, and others, can be formulated as multipoint one-dimensional ordering.Despite variation among possible optimization criteria, the one-dimensional genetic or genomic ordering problems are quite similar to the well known challenging Traveling Salesperson Problem (TSP).A powerful algorithm developed for a wide class of TSP-like Vehicle Routing Problems and referred to as the Guided Evolution Strategy (GES) [14] was successfully adapted to genetic mapping [15].High performance and high precision of GES algorithms make them very suitable to address computation challenging multipoint ordering problems, especially in the context of our methodology requiring a verification analysis to ensure stability of the constructed maps.The mapping of 3000 loci of a dataset generated in a maize project at the Center for Plant Genomics, Iowa State University, can be used as an example of practical efficiency of this approach [16].

General Scheme
The core procedure of our approach includes the following stages (Figure 1): Using marker orders rather than marker cM positions as the main map characteristic, with minimum total map length as an optimization criterion.Although map length is employed by many other authors, the stability of ordering rather than the confidence intervals of the marker positions or posterior marker positions [17,18] as a criterion of the map quality, is central to our method.To evaluate the stability of ordering, we employ re-sampling procedures [11,19].
The previously mentioned formulation was possible to implement as a mapping algorithm for many markers (i.e., hundreds per chromosome) because of a novel, highly efficient method of discrete optimization developed in our lab [14].A procedure for detecting and removing problematic markers causing local "map expansion" using the instability of local neighborhoods across re-sampling runs and the deviation from the expected monotonic change of recombination rates as criteria.A stepwise procedure of merging clusters of linked markers based on the end-to-end principle in order to reduce the danger of combining markers of non-homologous chromosomes in one map.This danger may derive from sampling deviations of corresponding recombination rates from the expected 50% or from the pseudo-linkage phenomenon [4,5,20].

Clustering
The first step is calculating pairwise recombination frac-tions (rf ) for all pairs of markers using the maximum likelihood estimation procedure.Then, the number of clusters (linkage groups, LG) can be evaluated as a function of the threshold (maximal) value rf 0 , allowing the preliminary assignment of a marker to a certain LG, namely, marker m i may be assigned to an LG j if recombination between m i and at least one marker from LG j is lower than the threshold rf 0 and is the lowest compared to the distances to any other LG.Based on the obtained information, it is necessary to choose a sufficiently small value of rf 0 to exclude the possibility of getting in one LG markers from non-homologous chromosomes due to pseudo-linkage (see [4,5,21,22]).But choosing an rf 0 that is too small will result in a large number of clusters (linkage groups) that will considerably exceed the haploid number of the species.Therefore, the next steps should include controlled merging of some of the clusters by a gradual relaxing of the conditions on pseudolinkage (by increasing rf ).The specific feature of our approach is that the building and ordering of the LGs are considered as interacting procedures (see Figure 1).If some markers of two LGs appeared closer than the relaxed rf, it would be reasonable to permit merging if the closest markers of the two candidate LGs are terminal or sub-terminal ("end-to-end" merging).Merging should be forbidden if the closest markers reside in the interior part of one or both candidates.
To illustrate how this scheme works, we simulated an example with two chromosomes (A and B) with pseudolinkage.The maximum deviation from independent segregations of markers a i (chromosome A) and b j (chromosome B) was for markers with i = 5 and j = 13 (the simulated value was rf a5-b13 = 0.2, whereas the value that "occurred" was 0.19).The recombination values for consecutive adjacent markers were 0.1, excluding r 11-12 = 0.285 on chromosome A and r 8-9 = 0.33 on chromosome B. What happens when two different threshold values of recombination are chosen, e.g., rf 0 = 0.3 (a usual choice in many publications) and rf 0 = 0.15?With rf 0 = 0.3, all markers of chromosome A and the 12 last markers of chromosome B should be combined in one linkage group; the remainder of the markers of chromosome B comprised the second linkage group.The results obtained after ordering these groups followed by detecting and removing markers violating monotonic change of rf along the map are shown in Figure 2(a).Now, we start with a more stringent threshold, rf 0 =0.15.This choice resulted in four linkage groups (Figure 2(b)); upon relaxation of the threshold (0.15  0.30) linkage groups of the non-homologous chromosome would tend to merge, but not in the end-to-end manner (their internal parts proved to be the closest, rf a5-b13 = 0.19).Thus, this merging was not allowed.The next step of relaxation (rf 0 = 0.35) with the previously mentioned rule (allowing only end-to-end merging) resulted in the correct recovery of the simulated chromosomes.The presented cycle can be repeated several times until further merging will cause an appearance of LGs with large internal gaps.Clearly, this procedure can be simplified if anchor markers are available.However, the choice and usage of anchors based on literature should be cautious because of the possibility of a relatively high level of errors in some published maps.

Multilocus Ordering
As noted above, the number of scored markers may, by orders of magnitude, exceed the number of those practically resolvable by recombination markers for the given population size.Only a small portion of markers (here referred to as delegate markers) can be included in the skeleton map, with the remaining markers being attached to the delegates.Besides the non-resolvable linkage caused by small sample size, the necessity for selection of representative markers for the skeleton map derives from non-random (clustered) recombination distribution in the genome [4], varying information content of the markers (co-dominant versus dominant, missing data, distorted segregation, and scoring errors), biased recombination estimates between repulsion-phase dominant markers [19], and negative interference [5,10].
Using our approach (implemented in MultiPoint package, http://www.multiqtl.com),one may start from a linkage group with hundreds of markers and conduct several analytical steps in order to build a reliable map:  multilocus ordering;  binding together closely linked markers followed by selection of delegates (bin markers) with the highest information content and replacing the groups of tightly linked markers by their "delegates"; repeated ordering and re-sampling verification of the reduced LG to detect regions of map instability;  removing the markers causing unstable neighborhoods and violating monotonic change of recombination, followed by repeated ordering to obtain the skeleton map;  attaching the removed markers to their best intervals on the skeleton map.Our mapping algorithm is based on the reduction of the multilocus ordering problem to TSP or TSP-like formulation.Its main features were described earlier [11,19].Here we present only recent modifications and extensions of our ordering algorithm.One of the possibilities in addressing the mapping problem is to recover the marker order from a known matrix d ij of pair-wise marker distances based on estimates of the recombination rate.An important fact is that in genetic ordering problems the distances between the markers cannot be measured directly.For this reason, even an exact TSP solution does not guarantee that the obtained map will be robust to a small variation of the data, hence the importance of stability testing.Special formulations of the problem may include various restrictions (e.g., on a predefined order of anchor markers), implying a reduction of genetic mapping to a more complex constrained discrete optimization problem.
In order to improve the efficiency of our multilocus ordering algorithms, we developed a new metaheuristic approach, referred to as the Guided Evolution Strategy (GES) that combines the strengths of the Guided Local Search (GLS - [33]) and Evolution Strategies [19] in the framework of one iterative two-stage procedure.GES combines the ES and GLS metaheuristics, and these two stages are iteratively repeated until no more improvements can be found in the local search.Our experiments on 302 large-scale benchmark vehicle routing problems with constraints demonstrated that the proposed algorithm is fast, cost-effective, and highly competitive, producing the best known solutions to 82% of the constrained benchmark problems [14].We adapted this GES approach to TSP-like problems of genetic mapping, including the Fast2Opt local search procedure and new variable (adaptive) neighborhood size.The new mapping-oriented algorithm works with small neighborhoods (25-50 neighbor markers) that allows significantly accelerate the performance on large-scale problems.The algorithm was tested on standard TSPlib problems with 50-2392 points.All known best solutions were achieved for these problems.

Map Verification
The objective of the verification procedure is detecting regions with unstable neighborhoods relative to the initial ordering (in the following example, explaining the method, we used 10 markers numbered from 1 to 10).This can be achieved by repeated re-sampling of the initial dataset (jackknife, bootstrap) followed by multilocus ordering for each such derivative sample (Figure 3 .The higher the deviation from 1 (i.e., from the "diagonal" pattern) the less certain is local order.In the example, we show only two re-sampling runs.Based on this smallsize re-sampling, we can indicate certain local neighborhoods, i.e., for marker pairs 1-2, 4-5, 7-8, and 9-10.In actual analysis the number of runs should be at least a few dozen or hundreds.
Clearly, the unstable neighborhoods result from fluctuations in the estimates of recombination rates across the repeated samples; the range of fluctuations depends on the sample size and the proportion of individuals taken at each jackknife run.In our framework, jackknife analysis is a modeling tool to quantify the diversity of map versions for the treated chromosome representing the sampling (stochastic) nature of the map.The results of such an evaluation can facilitate further decision making about problematic markers.These markers can be removed from the map and then the map must be rebuilt (Figure 3).

Improving Map Stability and Building
Skeleton Map After revealing the regions of map instability, we need to make a decision about the marker (or markers) responsible for local instability.For each region, there could be more than one candidate for removal from the dataset with the objective to stabilize the order.Our choice should depend on quality of the markers (anchor markers or genes vs. other markers; co-dominant vs. dominant; concordantly c-segregating with the neighbors or displaying unique segregation; fully scored or with many missing data, etc.).Taking these criteria into account, we can check the effect of removing any of the candidates using the trial-and-error approach, namely, after the removal of a candidate marker, we can re-build the map and again test its stability based on jackknife re-sampling.This computing of intensive methodology is affordable within the framework of our approach due to the very high performance of our multilocus ordering heuristic algorithm.As a result, we will come up with a stabilized (skeleton) map.
Clearly, the skeleton map will include the most reliable (informative) markers.Likewise, any group of tightly linked markers un-resolved by recombination, due to tight linkage and small sample size, will be represented by one delegate marker, which is also selected on the basis of scoring quality or biological priority (say, gene vs. anonymous marker).The three main reasons for map instability can be mentioned here: 1) tightly linked markers (with one-two recombinants in the sample) that may produce varying local orders upon jackknife runs; 2) islands of moderately linked well ordered markers separated from other neighbors by relatively large gaps that will appear in opposite orientation of the entire island (propeller effect); and 3) regions with non-mon-otonic change in recombination due to an excess of double recombinants caused by scoring errors, negative interference, or gene conversion) [5,10,20,23,24].

Criteria for Comparing Different Map
Versions To characterize the efficiency and advantages of the proposed methodology compared to other methods, we need to define some criteria used in comparisons.These include: Map length and number of markers presented in the resulting stable map: Even with correctly unraveled marker order, the evaluated map length may be higher than the actual one, due to a certain amount of scoring errors.But deviation from the correct order will result in map length inflation even without marker scoring errors.
Controlling monotony: Some of the scoring errors may generate situations in violation of the principle that "the entire entity is supposed to be larger than its parts" [11].Normally, for three markers ordered as a-b-c, one would expect: r ac > r ab & r ac > r bc .A violation of this condition indicates that something may be wrong with the local ordering.Alternatively, a violation may be caused by negative interference or gene conversion.
Stability of local neighborhoods: With increased proportions of scoring errors, the ordering may become very sensitive even to a small sampling variation upon jackknife re-sampling.To quantify such instability, we employ a simple measure: where  2 i is variance of i th marker neighborhood: and p ij is the proportion of jackknife runs where markers i and j were adjacent neighbors.The enumeration of markers is according to the multilocus order obtained on the entire sample or based on an external order (e.g., from the literature).Stable order will give  = 1.Concordance of segregation distortion within local neighborhoods: Segregation distortion is a known phenomenon caused by various factors [4].Upon correct ordering, one would expect a certain correspondence between segregation ratios of neighbor markers (seg i and seg i+1 ), namely, a correct order should give smaller values of the following criterion S i compared to a wrong order.If the frequency of one of the two classes at a marker locus in an RIL (dihaploid, or backcross) population or one of the homozygotes in F2 is p 1 , then the normalized change of segregation ratio from marker i to i + 1 can be calculated as: with dseg i = |p 2i -p 2i+1 |, where p 2i and p 2i+1 are the frequencies of the second marker class for loci i and i + 1 normalized by p 1i and p 1i+1 , and D i,i+1 is the distance (in cM) between the markers; for F2: where p 2i and p 3i are the normalized frequencies of the heterozygote and the second homozygote.

Using Real Mapping Data for Illustration
To illustrate the efficiency of the proposed mapping strategy on real data we selected a few examples of diverse organisms: wheat, barley, oat, maize, Arabidopsis, mouse, rat, and trout using the data available on the web.The published results for the same datasets were employed for comparisons with our map versions.

RESULTS
An empirical assessment of the proposed analytical framework, in comparison with other procedures, can be conducted using both Monte-Carlo simulations and real data.The validation of the basic properties of our algorithms was provided in our earlier publications using simulated data [11,19].Here we demonstrate the advantages of the extended analytical scheme using genome mapping data on several eukaryotic species: wheat, barley, oat, maize, Arabidopsis, mouse, rat, and trout.Obviously, our intention is just to illustrate the proposed approach rather than to revise the published maps.We believe that a revision of a map, if needed, should be conducted by the research groups generating the data.
Among the illustrations provided below, the example on wheat is presented in more detail.The results on the other examples are summarized in Table 1, using the case on maize to explain how we compare the published and de novo constructed maps.One more example, for the mouse, is summarized in a separate table (Table 2) due to the fact that besides chromosome 16, our results for all other chromosomes corresponded well with the published map.Like with the mouse, our map version corresponded well with the published version using a dataset on Arabidopsis.A slight difference was detected for chromosome 5 only due to the presence of two markers with high levels of missing data, which caused problems in the published map and were excluded from our version (see Table 1).

Wheat Chromosome 1B
Wheat data from the GrainGenes website (http://wheat.pw.usda.gov/GG2/quickquery.shtml) on chromosome 1B for the RIL mapping population Synthetic  Opata with 81 markers were employed (72 markers remained after deleting absolutely linked markers).The first step was to check the marker ordering presented on the web.Using the re-calculated pair-wise rf values transformed "back" to F1 level, the length of the map (with Kosambi mapping function) was estimated as L = 444 cM.Based on re-sampling analysis, the neighborhood stability of this map was tested and found to be relatively low (Figure 4

(a)-4(c)).
A more stable ordering, at least for a sub-set of markers (comprising a skeleton map), can be achieved by removing the markers causing the observed instability and deviation from the (expected) monotonic growth of recombination rates along the map around each of the markers [11].After such "cleaning", a skeleton map with 26 markers was obtained (Figure 5(a)).The first and the last markers proved to be the same, as they were in the initial map (Figure 5(b)); the map length was reduced from 444 cM to 138 cM (!), and this was achieved without deleting "double recombinants" and replacing missing scores by those that yield non-recombinants.The improved quality of our map was accomplished by detecting and removing problematic markers, mainly with high missing levels.Historically, these markers were placed onto the map as "second wave" and "third wave" markers that were characterized much later than the first groups of markers and for a much smaller subsample of the initial mapping population.Unlike the total set with 72 markers, the order of our skeleton map with a subset of 26 markers (see Figure 5(a)) corresponded well with the published map: the relative ordering of the markers in Figure 5(b) has only one minor "inversion" (between markers Xbcd442 (#66) and Xbcd441 (#64).

Maize Chromosome 1(IRIL5)
We employed data on IBM302 intercross recombinant 583  The order of skeleton markers (which proved to be the "first wave" markers) in our version of the map displays a remarkable similarity to the order on the website map, excluding one locality.inbred line (IRIL) mapping population (http://www.maizegdb.org/cgi-bin/displaymaprecord.cgi?id=870745).To demonstrate the differences between the web version and our version of the maps the results on maize chro-mosome 1 are presented (Table 1).The length of our version of skeleton map (L MP = 357 cM) is half of that built with MapMaker (L MM = 696 cM), probably reflecting a better ordering and less discrepancy from the cy-togenetic map (see [7,8]).Note that the distal markers in the two maps coincide.The average interval size between the neighbor markers was 696:341 = 2.05 for MM and 357:182 = 1.96 for MP.
Several examples demonstrating the problems that are typical of many published maps and are easily resolvable, via detecting and removing unreliable markers, are pre-sented in the examples in Table 1 for the selected organisms, including maize.Thus, disconcordant segregations of maize markers 202-203-204 in the MM map (reflected in large values of 100dseg/D) indicate that marker 203 was erroneously placed between 202 and 204.Indeed, segregation ratios for the consequent markers 202, 203, 204 in MM map are 0.78, 0.2, and 0.75,  respectively.In МР map, marker 203 is removed.Removing a marker in a correct map usually results in an increased distance between the remaining markers.That was not the case in this example: D(202,203) = 6.13,D(203,204) = 4.31, and D(202,204) = 2.80.Thus, the presence of 203 expands the local region more than 3-fold.An even higher discrepancy in segregation ratios was detected in the neighborhood flanked by markers 304 and 308.Here three markers are problematic: 305, 306, and 307.If all five markers, from 304 till 308, are retained, the total length of the segment is equal to the sum of distances D(304,305) = 8.2, D(305,306) = 5.4,D(306,307) = 4.4, and D(307,308) = 3.5, which is 21.5.In our version of the map markers 305, 306, and 307 are deleted, and the length of the segment becomes D(304, 308) = 2.3, i.e., only ~1/10 of the original size of the interval flanked by 304 and 308.An additional argument in favor of our analysis is that unlike the removed markers, markers included on our skeleton map appear in the same order as in the original web map, excluding a few local discrepancies, namely, the revised local orders 2-1-4, 28-32-31-33, 47-49-48-50, 101-109-105-111, 289-295 -293-292-296, and 324-331-329-334 (the numbers reflect the consequent relative positions of marker loci from 1 to 341 in the original map).

Mouse
We employed data on the mapping population (C57BL/ 6JxM.spretus)F1xC57BL/6Jhttp://www.informatics.jax.org/searches/crossdata_form.shtml).The skeleton maps that we constructed for chromosomes 1-19 and X corresponded completely with the maps presented on the website.The only difference was for chromosome 16: the web map seems to have a serious local mistake, unless the authors used some additional information.Still, keeping in mind the high quality of the data, it may be instructive to compare our results with the original maps.For comparison, we excluded absolutely linked markers from the original maps.The results are shown in Table 2.For chromosome 5, the difference between our map and the web version is small and was caused by two markers (Zp3 and Ccnb1-rs1) that violated the rule that the entity cannot be smaller than its parts [11].The two versions of the map proved identical until the marker Gusb.In the web version, the interval Gusb-Zp3 was larger than the flanking interval Gusb-Gnb2; thus, deleting Zp3 seems a reasonable suggestion.Similarly, interval D5Fcr8-Brca2 is shorter than its part Ccnb1-rs1-Brca2; thus Ccnb1-rs1 was also deleted.For chromosome 16, the considerable difference between the two versions, is caused by the erroneous placement of marker D16Fcr1 at the upper distal point of the map, despite its tight linkage with Csta.

DISCUSSION
Building correct multilocus maps is usually considered a pre-requisite for diverse genomic/genetic applications, e.g., positional cloning, anchoring contigs in physical mapping, and marker-assisted selection.Dodds and coauthors [25] surprisingly found that map errors do not seem to have too much influence on QTL mapping results.Although this may be the case in some situations, in many other situations, map errors may lead to dramatic negative impacts.Purportedly, if the objective of QTL mapping was map-based cloning, then a lot of effort might be made with no results if the map order in the region of residence of the targeted QTL was wrong.This may happen even if the assignment of the QTL to the chromosome was correct.Moreover, with some types of erroneous ordering, one could detect two QTLs in a chromosome that carried only one QTL for the considered trait, whereas under the correct ordering of markers only one QTL will be detected.Obviously, wrong ordering, even on a local scale, may also reduce the efficiency of using marker positions on the map to facilitate physical mapping.Here we propose an analytical scheme for building multilocus genetic maps that allows reasonable map quality even in complicated situations with very large number of markers, disproportionately small sample sizes, and a high level of reading errors.A high ratio of scored markers to population size has become typical in recent years due to the ever-increasing availability of high throughput genomic technologies.This new situation has encountered a psychological barrier inherited from the previous generation of the mapping community, when the number of markers was very small, justifying an intention to put each marker onto the map, even if its position is poorly resolved from the neighbor(s) or if its quality is problematic.Even now, when the number of potential markers becomes rather high (and sometimes huge), there is still a tendency to follow this approach.Some authors suggest selecting a part of the markers as map bins and order the vast majority of the remaining markers relative to the bins.This is a more realistic task, but still the number of markers may by far exceed the map resolution caused by small sample sizes (see Introduction section).Consequently, the resulting maps, carrying large number of markers, may be locally unreliable.
Our approach of testing map stability includes a verification procedure based on jackknife re-sampling [11].Its major difference from the usual way of addressing this problem is that, as a measure of map stability, we consider the marker local orders [11,19] rather than the length of confidence intervals of marker map positions [17].This approach gives flexibility in detecting and removing markers causing map instability that would be less natural to implement if the cM position on the map is the measure.Moreover, it is a well known fact that recombination rates may display very high variability between different mapping populations of the same organism, due to the effects of genotype, age, and environment [4,20].This variability may cause serious problems in combining data from different mapping populations to build consensus maps [9,26].If marker order is the basis of map comparisons, this problem just does not exist.
At each of the re-sampling iterations, the multilocus mapping problem is solved using TSP-like formulation.Several well known heuristic algorithms can be applied: Tabu Search, Simulated Annealing, Guided Local Search, Genetic Algorithm (with EAX crossover), Evolution Strategy, Guided Evolution Strategy, Ant Colony Behavior (ACB), and Artificial Neural Networks (ANN) (for detailed references see [15]).Currently, the most advanced software for solving unconstrained TSP is the Concorde package (http://www.tsp.gatech.edu/index.html).Concorde TSP software was applied for solving genomic problems (e.g., [27,28].The Concorde solver uses the cutting-plane algorithm [29,30], which is an alternative to branch and bound to solve integer programming using the specific (one-dimensional) structure of the problem.This allows generating very good cuts helping to accelerate the optimization process, but only if the problem does not contain additional restrictions.
There are examples of applying the cutting plane algorithm to constrained discrete optimization problems [31,32].In both cited studies, the authors employed a multi-processor system and parallel C++ language.In particular, using the 188 processors system, a Concorde team obtained the exact solution for a 120-point problem for 10 days.Our heuristic GES algorithm found exactly the same solution in just ten seconds using a standard PC Pentium IV (2.0 Ghz).This fact illustrates that heuristic approaches for constrained optimization problems are preferable, and GES manages with this challenge effectively.We note that our mapping-oriented GES is based on hybrid technology and employs powerful properties of both Guided Local Search [33,34] and Evolution Strategy algorithms [11,19].
Returning to the discussion about the importance of re-sampling for reliable mapping, we should stress again that one of the most frequent factors of instability in small sample sizes and in a large number of markers is errors in marker reading.As a rule, the number of such errors is rather small, just a few per marker per population.But the result is that marker pairs that had to be non-resolvable (no true recombinants) or poorly resolvable (one recombinant) upon the small sample size, become "resolvable" and are somehow ordered.Thus, if, on some position of the chromosome, there are several absolutely linked markers, the small rate of scoring errors, approximately equal for all markers, should disperse the markers in some multi-dimensional sphere (where they are equally distant from each other).In fact, during the mapping of these markers, they will be ordered in a one-dimensional space (as part of the map), resulting in map length inflation proportional to the number of such markers and the rate of errors.Local marker order in such a situation should be very unstable (non-reproducible) upon jackknife re-sampling, hence the proposed method of detecting such a region by our verification procedure.Removing a considerable proportion of such markers should significantly improve the map reliability and reduce the map length.As was shown in the previously mentioned example on wheat, missing data also may be an important contributor to the instability of marker neighborhoods, hence, markers with high missing levels should be considered among the first candidates for removal during the building of skeleton maps.
The discomfort that a researcher gets from a map that is too long, explains the intention of reducing the map length, sometimes by rather artificial approaches.For example, such trials may include removing double recombinants and/or recovering the missing data by scores yielding non-recombinant genotypes after the marker ordering was finished.We found a good example for such a situation in re-analyzing the data on wheat chromosome 1B (see the previously mentioned analysis on wheat chromosome 1B).In a trial to display the map length for the multilocus order of 1B presented on the website, we used "per meiosis" recombination rates converted to Kosambi map distances.This resulted in a 1B map with L = 432 cM length, in contrast to the one published on the website with L = 104 cM.How could this huge discrepancy (432 vs. 104 cM) be explained?We managed to "reproduce" the underlying procedure with very high precision.As with many other published maps, major efforts have been invested by several teams to enrich this population with molecular markers from other populations, thereby bridging these mapping resources together.Unfortunately, new waves of markers were scored only partially, so that the level of missing data for this population was often very high, up to 50-70%.
It appeared that the problem of missing data was treated by the map constructors (or by the software they have employed) as follows.After ordering the markers, missing data were recovered by replacing the missing scores with those that resulted exclusively in non-recombinants.Clearly, the higher the level of missing data the stronger the effect of such correction will be, i.e., a reduction of the map length.Before transforming rf values for RIL to rf values for F1, double recombinants for any pair of adjacent intervals were also replaced by non-recombinants.This method of removing erroneous double recombinants seems reasonable for F2 or double haploids, but it is inappropriate given that in RIL mapping population, "double recombinants" are not necessarily a result of scoring errors or real double recombination events.Indeed, a considerable portion of "double recombinants" have likely resulted from recombination in adjacent intervals that occurred in meiosis IN DIF-FERENT generations of genotypes that remained heterozygous for those regions (in F2, F3, etc.).We considered several intervals, where our ordering was exactly the same as in the published map, but the distances were different (namely, our distances were higher than those published on the website).After conducting the previously mentioned "correction" steps, we obtained exactly the same distances as reported on the website map.The conducted revision analysis indicates that these distances may be irrelevant to the actual situation.Thus, the relatively small lengths of those maps are almost certainly an artifact introduced during the merging of different marker data sources, some of which contained high frequencies of missing data and inappropriate "error correction".
In genetic mapping, multilocus ordering is usually considered as a much more complicated problem compared to subdivision of the marker set into linkage groups.However, many examples indicate that markers from non-homologous chromosomes were assigned to the wrong chromosomes, presumably, due to pseudo-linkage.We have encountered these types of errors in analyzing cereal species (see [5]).Similarly, in cattle, such wrong assignments were found for 12% of markers/contigs (H.Lewin, personal communication).Thus, the pseudo-linkage phenomenon should be of special concern for large-scale genetic mapping.As indicated above, incorrect assignments can be caused by biological and statistical reasons.The probability of sampling deviation from the random segregation of markers from non-homologous chromosomes should grow with increased numbers of chromosomes, length of chromosomes, number of markers, and with small sample size.We propose in this paper that a considerable part of "wrong assignment" errors can be reduced by the algorithm of stepwise clustering markers into linkage groups alternated by multilocus ordering.
The possibility of re-sampling based on testing map stability by detecting and removing the markers that cause low map quality is the second major component of the proposed approach.Detection and removal of the markers responsible for local map instabilities and non-monotonic change in recombination rates allows building stable skeleton maps with minimal total length.Clearly, there is some degree of uncertainty in such choices; hence, there might be different versions of the skeleton map.In such situations, the high performance of our algorithms is an important advantage allowing further fast correction (complementing) of the skeleton map by using additional markers and/or some of the deleted markers.Further improvement of the mapping quality is achievable by joint analysis of mapping data from different mapping populations that can be referred to as consensus mapping [9,26].

Figure 1 .
Figure 1.Stepwise clustering of markers in linkage groups coordinated with multilocus ordering.

Figure 2 .
Figure 2. Reducing the chances of wrong clustering by stepwise relaxation of the threshold recombination: (a) Wrong clustering of markers, caused by a high threshold value of recopmbination in situations of pseudolinkage; (b) Preventing wrong clustering by using stringent threshold recombination.
(a)).Then, the identification of unstable regions can be conducted based on the frequency distribution of the right-side and left-side neighbors (Figures3(b)-3(d)).The identification of such regions can be conducted by summing up corresponding neighborhood matrices (Figures3(b)-3(c)) and calculating the neighborhood frequencies (Figure3(d))

Figure 3 .
Figure 3. Graphical display of the verification process: (a) multilocus orders for each jackknife (the example includes only two jackknives); (b) and (c) neighborhood matrices for each jackknife run; (d) matrix of neighbourhood frequencies averaged over all runs.

Figure 4 .
Figure 4. Stability testing of wheat Synthetic  Opata standard map of 1B chromosome by using jackknife re-sampling.The published order was chosen as a reference (diagonal).The stable order should be the one where for each marker its left-side and right-side neighbors do not vary across the repeated runs (1-1 pair along the diagonal).Notes: the numbers in brackets near the marker names indicate the marker order in the map presented on the website.(a)-(c) represent the three (overlapping) parts of the map; analysis was conducted using MultiPoint software package (http://www.multiqtl.com).

Figure 5 .
Figure 5. Stabilizing the map by deriving the skeleton map.The graph represents our trial to stabilize the wheat 1B map by detecting and removing the markers that caused the local instability of the neighborhoods.(a) and (b) represent the new (stabilized) and the original (web) versions, respectively.The order of skeleton markers (which proved to be the "first wave" markers) in our version of the map displays a remarkable similarity to the order on the website map, excluding one locality.
cM N mar -numebr of markers in the intial map version, n mar -number of delegate markers MMQTX -Map Manager QTX

Table 1 .
Examples of comparing the quality characteristics of the revised multilocus maps with those of the original maps.

Table 2 .
Comparing two versions of mouse maps.