A study on the codominant scoring of AFLP markers in association panels without prior knowledge on genotype probabilities is described. Bands are scored codominantly by fitting normal mixture models to band intensities, illustrating and optimizing existing methodology, which employs the EM-algorithm. We study features that improve the performance of the algorithm, and the unmixing in general, like parameter initialization, restrictions on parameters, data transformation, and outlier removal. Parameter restrictions include equal component variances, equal or nearly equal distances between component means, and mixing probabilities according to Hardy–Weinberg Equilibrium. Histogram visualization of band intensities with superimposed normal densities, and optional classification scores and other grouping information, assists further in the codominant scoring. We find empirical evidence favoring the square root transformation of the band intensity, as was found in segregating populations. Our approach provides posterior genotype probabilities for marker loci. These probabilities can form the basis for association mapping and are more useful than the standard scoring categories A, H, B, C, D. They can also be used to calculate predictors for additive and dominance effects. Diagnostics for data quality of AFLP markers are described: preference for three-component mixture model, good separation between component means, and lack of singletons for the component with highest mean. Software has been developed in R, containing the models for normal mixtures with facilitating features, and visualizations. The methods are applied to an association panel in tomato, comprising 1,175 polymorphic markers on 94 tomato hybrids, as part of a larger study within the Dutch Centre for BioSystems Genomics.
Amplified fragment length polymorphism (AFLP) (Vos et al. 1995) is a widely used DNA fingerprinting system. The physical end product of the AFLP procedure is a slab gel containing bands at different positions within columns of the gel. Instead of gels, capillary systems are nowadays often used. The columns are called lanes, and correspond to the different individual genomes (individuals). The bands visualize amplified DNA fragments of specific lengths, traveling in the lanes by electrophoresis. The position of a band within a lane is mainly determined by the size of the fragment, with shorter fragments traveling further. The pattern of bands within a lane is called a profile. Usually, AFLP bands are scored dominantly, i.e., binary, as absent or present. In this way, AFLP bands are dominant markers, which do not distinguish between individuals with one copy of the DNA fragment (heterozygous individuals) and two copies (homozygous individuals). However, the gels or capillary systems allow the intensities of the band to be scored as well. Assuming that the intensity of a band is a measure of the amount of amplified DNA, the band intensity can be exploited to infer the copy number of a DNA fragment. In the case of diploid organisms, an individual with the DNA fragment on two homologous chromosomes (homozygous AA) should have a more intense band than an individual with the DNA fragment on only one of two homologous chromosomes (heterozygous Aa). The heterozygous individual, in turn, should have a more intense band than an individual lacking the fragment completely (homozygous absent aa). Therefore, it must be possible to infer the copy number of an AFLP fragment from the band intensity, making the AFLP marker a codominant marker. Scoring the copy number of the AFLP fragment is also named genotype calling.
The idea to codominantly score AFLPs using the band intensities is not new. An early mention can be found in van Eck et al. (1995), and later Piepho and Koch (2000), and, in a reaction, Jansen et al. (2001) published about the statistical principles of the approach. These authors illustrate their methods by codominantly scoring AFLP markers from segregating F2 populations, with a priori known genotype frequencies 0.25, 0.50, and 0.25 for AA, Aa, and aa, respectively. As Meudt and Clarke (2007) report, codominant AFLP scoring so far is limited to model organisms and commercial crop organisms, for which genetic information already exists for accurate identification of the codominant scores. Vuylsteke (2007) mentions that codominant scoring of AFLP markers has become routine in segregating populations, as in F2 or backcross populations. Examples of studies of segregating populations, with known segregation ratio for the offspring, are, e.g., Castiglioni et al. (1999), Reamon-Büttner et al. (1998), and Deniau et al. (2006).
The aim of our study is to illustrate and optimize existing methodology for the codominant scoring of AFLP markers using data from an association panel, without a priori knowledge of allele frequencies. The association panel consists of a collection of 94 tomato hybrids, for which, due to confidentiality reasons, no pedigree information was made available.
An overview of the dataset, and analyses concerning diversity and linkage disequilibrium, containing a concise description of the codominant scoring, can be found in van Berloo et al. (2008b). Commercially available software, such as Quantar Pro (Keygene products BV 2004) from the private company Keygene NV, is rather limited in output facilities, as it gives hard classifications only, and does not contain options to back up the codominant scoring in case of an association panel. We therefore developed software, and used it for the codominant scoring of the AFLP data. In the present paper, we describe
The intensity of an AFLP band, named optical density by Piepho and Koch (2000), is a non-negative number, indicating the darkness of a band on a gray scale. Because band intensities vary from lane to lane (e.g., caused by differences in amount of DNA loaded in a lane), and due to background variation in intensity and image artifacts, the raw band intensities need to be corrected to make bands comparable between lanes. Corrections can be done in different ways. Piepho and Koch (2000) suggest to remove systematic trends discernible from monomorphic bands with the use of quadratic polynomial regression models and random lane effects, and to check for spatial correlation. In the present study, we use the correction as performed by the proprietary software of Keygene NV. This correction accounts for total lane intensity and intensity of monomorphic bands, and divides the intensities row-wise (per marker) by the maximum intensity per row, resulting in a range 0–1.
The (corrected) band intensity is related to the amount of amplified DNA at the band position. We assume a monotonic relationship: more amplified DNA tends to produce darker bands. For diploid organisms, such as tomato, this means that a homozygous individual with two copies of a fragment tends to have a band with higher intensity than a heterozygous individual with a single copy, which, in turn, has a higher intensity band than an individual lacking the fragment completely. Codominant scoring of a band is the prediction of the copy number of the fragment (or genotype class AA, Aa, or aa) from the intensity of the band. Codominant scoring is straightforward when the intensities fall into three well-separated groups. More often, however, the groups overlap, e.g., because the relationship between band intensity and copy number is non-linear, as indicated by Piepho and Koch (2000). The intensity may be bounded from above due to saturation, hampering the discrimination between heterozygous and homozygous individuals. Other problems that blur simple inference on zygosity are errors in the AFLP procedure itself [like amplification errors in the polymerase chain reaction (PCR), and gel mobility errors], and measurement errors of the band intensities. To take account of these problems, a formal approach using a statistical model is beneficial.
Statistically speaking, codominant scoring is a type of cluster analysis with a predefined number of classes (three in the case of diploid organisms). Although ordinary clustering techniques could be used, the common approach described in the literature is to fit a Gaussian (or normal) mixture model. This is an example of model-based clustering (Fraley and Raftery 2002), because a proper statistical model is used to describe the data. For an association panel of n individuals, we have per marker n intensities, labeled y1, ..., yn. The Gaussian mixture model (McLachlan and Peel 2000) for intensity yi of variety i is:
$f(y_i) = \sum_{j=1}^{3} \pi_j f_j(y_i; \mu_j, \sigma_j)$
with fj the density of a normal distribution with mean μj and standard deviation σj. The mixing probability πj is the prior probability that a randomly drawn individual belongs to group, or component, j. In the standard situation, we have three groups: 1 = no copies, 2 = one copy, and 3 = two copies. We assume for the expected intensities μj, that μ1 < μ2 < μ3. The posterior probability of cultivar i to belong to group k (k = 1, 2, 3) is
$P(\text{group } k \mid y_i) = \dfrac{\pi_k f_k(y_i; \mu_k, \sigma_k)}{\sum_{j=1}^{3} \pi_j f_j(y_i; \mu_j, \sigma_j)}$
which are conditional genotype probabilities given the marker phenotype (intensity). In total, eight unknown parameters are to be estimated by maximum likelihood: μ1, μ2, μ3, σ1, σ2, σ3, and π1, π2 (with π3 = 1 − π1 − π2). For segregating populations parameter values may be known, e.g., in the case of F2 populations, the segregation ratio is 1:2:1, hence π1 = 0.25, π2 = 0.5, π3 = 0.25. We use the EM-algorithm (Dempster et al. 1977) to get maximum likelihood estimates, treating the situation as an incomplete data problem with missing class memberships, as in Jansen (1993) and Piepho and Koch (2000). In the algorithm, the E-step, which returns estimates of the posterior class probabilities by conditioning on the data and the current parameters, and the M-step, which returns new parameter estimates, alternate until convergence. The M-step consists of separate update steps: for πj, a generalized linear model for multinomial data is fitted to the weights; for μj and σj, a weighted linear model allowing for three group means (ANOVA model) is fitted to the replicated intensities.
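The alternation of E- and M-steps can be sketched as follows. This is a minimal illustration in Python, not the authors' R routines: it assumes equal component variances (one of the restrictions studied in this paper) and uses closed-form weighted updates in the M-step rather than the GLM/ANOVA formulation described above. The function names and the example intensities are hypothetical.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density at y of a normal distribution with mean mu and sd sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(y, pi, mu, sigma):
    """Posterior class probabilities per observation, plus the log-likelihood."""
    post, loglik = [], 0.0
    for yi in y:
        joint = [p * normal_pdf(yi, m, sigma) for p, m in zip(pi, mu)]
        total = sum(joint)
        loglik += math.log(total)
        post.append([j / total for j in joint])
    return post, loglik


def m_step(y, post):
    """Weighted updates of mixing probabilities, means, and the common sd."""
    n, g = len(y), len(post[0])
    weight = [sum(row[j] for row in post) for j in range(g)]
    pi = [w / n for w in weight]
    mu = [sum(row[j] * yi for row, yi in zip(post, y)) / weight[j]
          for j in range(g)]
    ss = sum(row[j] * (yi - mu[j]) ** 2
             for row, yi in zip(post, y) for j in range(g))
    return pi, mu, math.sqrt(ss / n)


def fit_mixture(y, mu_start, sigma_start, max_iter=500, tol=1e-8):
    """Alternate E- and M-steps until the log-likelihood stops increasing."""
    g = len(mu_start)
    pi, mu, sigma = [1.0 / g] * g, list(mu_start), sigma_start
    previous = -math.inf
    for _ in range(max_iter):
        post, loglik = e_step(y, pi, mu, sigma)
        if loglik - previous < tol:
            break
        previous = loglik
        pi, mu, sigma = m_step(y, post)
    post, loglik = e_step(y, pi, mu, sigma)
    return pi, mu, sigma, post, loglik


# Hypothetical square-root-transformed intensities for one marker.
y = [0.05, 0.08, 0.10, 0.12, 0.48, 0.50, 0.52, 0.55, 0.90, 0.92, 0.95]
pi, mu, sigma, post, loglik = fit_mixture(y, [0.1, 0.5, 0.9], 0.1)
```

Each row of `post` holds the three posterior genotype probabilities of one individual. The sketch does not guard against a component losing all its weight, which a production routine would need to handle.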
In non-standard situations, the number of components g of the normal mixture model may deviate from 3. We refer to item 2 of the next section. Mixture models are a topic of ongoing statistical research, because problems exist with the identifiability of parameters, and with parameters occurring at the boundary of the parameter space. Therefore, most classical asymptotic results cannot be directly applied. Here, we supply a short review of recent work on mixture models. Böhning et al. (2007) give in an editorial an outline of the current state of the art. Slightly older is the book by McLachlan and Peel (2000), containing a wealth of references. Particularly interesting aspects of mixture modeling for our situation are: (1) hypothesis testing, (2) order selection, i.e., determination of the number of groups, (3) robustification. Recent work on hypothesis testing for the special case of testing homogeneity (i.e., discriminating a one-component from a two-component mixture) is Chen and Li (2009), Li et al. (2009), and Garel (2007). The case of testing homogeneity is not of great interest in our situation, though. Other work focuses on testing homoscedastic versus heteroscedastic normal mixtures (e.g., Lo 2008), but conclusions are meager. The topic of order selection has kept statisticians busy for a long time. A worthwhile reference on hypothesis testing for the number of components is Feng and McCulloch (1994), but they describe the case of unequal variances, which we avoid (see following section). A very recent study on order selection is Chen and Khalili (2008), using a penalized likelihood approach. Comparing with other criteria in a simulation study, they conclude that their approach performs generally but not always better. Normality-based methods for estimation have the problem of sensitivity to outliers. Different authors studied the problem. Recent studies are McLachlan et al. (2006), using mixtures of t distributions, and Cuesta-Albertos et al. (2008), using a mix of initial robust clustering for subsamples and maximum likelihood. From this overview we learn that the final word on these topics has not been said.
We study a number of features relevant to the codominant scoring methodology in association panels. Some of them relate to the EM-algorithm, aiming at enhancement or stabilization of the unmixing, others at assessment of the quality of the AFLP marker data for codominant scoring, or model selection.
Comparison of nested models is usually done by likelihood ratio tests, but in the case of mixture models theoretical problems of non-identifiability arise, as earlier described. We take interest in
In other cases we compare fits of models by comparing BICs. If the compared models have equal numbers of parameters, the comparison by BIC is equivalent to the comparison by LL.
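The BIC comparison can be illustrated with a short Python sketch. It assumes the convention BIC = −2 LL + k ln(n), in which lower values are better; the text does not state which sign convention the authors' software uses. The parameter count follows the description of the mixture model (means, standard deviations, and free mixing probabilities), and the log-likelihood values in the example are hypothetical.

```python
import math


def n_parameters(g, equal_variances=True):
    """Free parameters of a g-component normal mixture."""
    n_sd = 1 if equal_variances else g
    # g means, n_sd standard deviations, g - 1 free mixing probabilities
    return g + n_sd + (g - 1)


def bic(loglik, n_par, n_obs):
    """BIC under the assumed convention: lower is better."""
    return -2.0 * loglik + n_par * math.log(n_obs)


# Hypothetical maximized log-likelihoods for one marker on 94 individuals.
n_obs = 94
bic_two = bic(80.0, n_parameters(2), n_obs)
bic_three = bic(94.2, n_parameters(3), n_obs)
best = "three components" if bic_three < bic_two else "two components"
```

With equal numbers of parameters the penalty term is identical, so the ordering by BIC is simply the reverse of the ordering by LL, as stated above.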
The usual result from the codominant scoring of AFLP markers is a hard classification of markers into categories. The classification can be done in different ways. Piepho and Koch (2000) suggest to take the category with highest posterior probability. The proprietary genotyping software of Keygene NV uses classification rules suggested by Jansen et al. (2001): genotype i is classified as:
The threshold probability 0.98 is the default value, but other values can be chosen as well. We notice that an extra region of doubt is necessary, because it may happen that genotypes exist, which cannot be classified as A, B, H, C or D. This may occur if the groups are not well separated, so that for some genotypes, but also The right-hand side plot of Fig. 1 shows an example. We call this extra region of doubt Z = unknown, meaning 0, 1, or 2 copies. The left-hand side plot shows the classification if probability threshold 0.95 is used. In that case all genotypes can be classified as A, B, H, C, or D.
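A threshold-based hard classification with regions of doubt can be sketched as below. The exact rules are not reproduced in this text, so the ordering of the checks is an assumed reading based on the category definitions used in this paper: A = AA, B = aa, H = Aa, C = "not A", D = "not B", and Z = unknown. The function name and argument order are hypothetical.

```python
def classify(p_aa, p_Aa, p_AA, threshold=0.98):
    """Hard classification from three posterior genotype probabilities.

    Assumed reading of the rules: a single class is assigned when its
    posterior probability reaches the threshold; otherwise a region of
    doubt is assigned when two classes together reach it; otherwise Z.
    """
    if p_AA >= threshold:
        return "A"
    if p_Aa >= threshold:
        return "H"
    if p_aa >= threshold:
        return "B"
    if p_Aa + p_aa >= threshold:
        return "C"  # not A: heterozygous or homozygous absent
    if p_AA + p_Aa >= threshold:
        return "D"  # not B: homozygous present or heterozygous
    return "Z"      # unknown: 0, 1, or 2 copies


# A hypothetical individual between poorly separated groups.
label_strict = classify(0.03, 0.94, 0.03, threshold=0.98)
label_loose = classify(0.03, 0.94, 0.03, threshold=0.95)
```

Under this reading, the hypothetical individual falls into Z at threshold 0.98 but can be classified at threshold 0.95, which mirrors the behaviour described for the two plots of Fig. 1.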
The above-mentioned commonly used hard classification has a number of disadvantages. For instance, the classification rule, following from the probability threshold 0.98, is rather arbitrarily chosen. Furthermore, it is not clear how to deal with genotypes, once they are classified into one of the regions of doubt. Therefore, we propose to use instead the set of three posterior probabilities as result of the codominant scoring for genotype i. Using this approach, each genotype is allowed to belong to more than one class, with the posterior probabilities indicating the levels of membership to the classes. This type of clustering is called fuzzy clustering, see, e.g. Bezdek (1981). The resulting posterior genotype probabilities can be used in association mapping, analogously to the use of conditional QTL genotype probabilities given flanking marker information in case of QTL linkage mapping for biparental crosses.
Given the three posterior probabilities, it is straightforward to calculate predictors for the additive and dominance effects of the loci. The additive predictor for an individual is defined as $x_a = P(AA \mid y) - P(aa \mid y)$, with values between −1 and 1. The value −1 is obtained for loci which are classified as B (=aa) with probability 1. A locus has additive predictor value 1 if it is classified as A (=AA) with probability 1. The dominance predictor xd depends only on the probability of a heterozygous genotype, and is defined as $x_d = P(Aa \mid y)$, with values between 0 and 1. The additive and dominance predictors may be used, e.g., in association mapping, relating the codominant scores to phenotypic information by mixed models. A paper on genome-wide association mapping using these scores is in preparation.
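As a small worked illustration in Python, the two predictors can be computed directly from the posterior probabilities. The formulas used here are the ones implied by the stated ranges and end points (−1 for a certain aa, 1 for a certain AA, and a dominance predictor depending only on the heterozygote probability); the function names and example values are hypothetical.

```python
def additive_predictor(p_aa, p_Aa, p_AA):
    """Difference of the two homozygote probabilities, in [-1, 1]."""
    return p_AA - p_aa


def dominance_predictor(p_aa, p_Aa, p_AA):
    """Probability of the heterozygous genotype, in [0, 1]."""
    return p_Aa


# A hypothetical individual that is most likely heterozygous.
posterior = (0.10, 0.85, 0.05)
x_a = additive_predictor(*posterior)
x_d = dominance_predictor(*posterior)
```

Unlike a hard label, these predictors carry the uncertainty of the scoring into any downstream regression.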
Within the Centre for BioSystems Genomics, a Dutch plant genomic initiative (van Berloo et al. 2008a), one project aims at processes and mechanisms affecting fruit quality in tomato. Within this project an association panel, consisting of a diverse set of 94 tomato hybrids, was genotyped using AFLP with gel electrophoresis (van Berloo et al. 2008b). This set consists of 20 beef, 21 cherry, and 53 round tomato hybrids. The AFLP fingerprinting was performed at Keygene NV using standard in-house developed protocols. Fifty primer combinations were used, labeled A, B, …, Z, AA, AB, …, AX, based mostly on EcoRI/MseI and some PstI/MseI restriction enzyme combinations. The scoring range is approximately 50–550. Typically, between 50 and 100 bands are visible per primer combination per variety, the majority of which are monomorphic. Band intensities of a total of 1,175 polymorphic bands were scored by Keygene NV using the proprietary genotyping software. For 378 bands the map position is available from an integrated proprietary linkage map. We study both raw uncorrected intensities, with values in the range 0 to ≈$10^{6}$, and corrected intensities with values in the range 0–1. We refer to the dataset of band intensities of 1,175 AFLP markers on 94 tomato hybrids as the “tomato data”.
We study how the features mentioned in “Features for enhanced and stabilized unmixing, data quality and model selection” help in the codominant scoring of all 1,175 AFLP markers in the tomato data, focusing on the following topics.
We developed software routines in R (Ihaka and Gentleman 1996) for the codominant scoring of AFLP band intensities in an association panel, using the EM-algorithm. We built features into the software, as described in Materials and methods, allowing for different starting values of parameters, transformation of the response, restriction on parameters, different numbers of components, and for the types of output as described earlier. For a more detailed description of the software we refer to “Appendix”. All plots and mixture model output in this paper are results from applications of the R routines.
In Fig. 2, we show some examples of codominantly scored AFLP markers with well fitting three-component homoscedastic normal mixture models. The corrected band intensities are square-root transformed, unless mentioned otherwise. In subplots a and b, no variety is classified into a region of doubt. In subplots c and d, a few hybrids are classified as “D”. We added the boundaries of the classes into the plot, and minimum and maximum value of the raw band intensities. The variety in plot c classified as “D” has posterior probabilities
Figure 3 illustrates problems encountered in the codominant scoring of AFLP band intensities of the tomato dataset, that can be handled with the features described in “Features for enhanced and stabilized unmixing, data quality and model selection”. The subplots are labeled accordingly.
Table 1 shows the comparisons of the two types of parameter initialization of the EM-algorithm (by guesstimates and hierarchical clustering) for two-, three-, four-, and five-component homoscedastic mixture models for all 1,175 markers. We find that parameter initialization becomes more critical for more complex models. In the case of mixture models with 2 groups, initialization by guesstimates and by hierarchical clustering results in identical parameter estimates (with maximized log-likelihood differing less than $10^{-6}$) for 95% of the markers. For models with 3, 4 and 5 groups this percentage is 74, 55, and 34%, respectively. For models with more than 2 groups, the cluster initialization outperforms the guesstimates. We conclude that cluster initialization is a better procedure for supplying starting values for parameters. To avoid being trapped in a local maximum, however, we advise trying other starting values as well, using, e.g., the described guesstimates. In the following analyses we fit models using both types of parameter initialization, and choose the results corresponding to the model with highest LL.
Table 2 shows the comparison of homoscedastic and heteroscedastic three-component mixture models by BIC for a range of power transformations. Between 3 and 15 markers, depending upon the transformation used, are discarded, because the LL of the heteroscedastic model is erroneously lower than that of the (smaller) homoscedastic model, due to convergence to a local maximum of the likelihood. Among the different power transformations, the square root transformation most often (63%) gives variance stabilization.
Table 3 shows the results of the comparisons of two-, three-, four-, and five-component homoscedastic mixture models for a range of power transformations. We find some very distinctive patterns. If the square root transformation is used, the three-component model is selected most frequently (for 561 markers). Transformation by power 0.6 shows very similar results. With powers larger than 0.5, models with more groups tend to be favored, probably because large observations tend to become more outlying, and these are accommodated by more components. Using a transformation with a power smaller than 0.5, both models with 2, and with 4 or 5 groups tend to be selected more often. We conclude from Tables 2 and 3 that the square root transformation is best, both for variance stabilization and for order selection.
Table 4 shows results for the diagnostics of data quality. In the comparison of normal mixture models with 2, 3, 4 and 5 components by BIC, we find that the desired model with three components fits best for 561 markers (≈50%). For 158 markers, a model with two components fits best. Models with more than three components are chosen for 456 markers. Results on the separation of group means in the best-fitting g-component model are shown in the middle part of Table 4. Notice that the majority of the markers (69%) have well separated group means, 31% is moderately separated, and only one marker is poorly separated. The percentages well separated markers monotonically decrease with the order g of the model: 89, 80, 53, and 34%, respectively. We conclude that the separation of group means shows a relationship with the choice of best fitting model.
The bottom part of Table 4 shows counts of markers with singletons in the last and first component of the best fitting g-component mixture model (g = 2, 3, 4, 5). We find that 62 (5%) of the markers have a first component with a singleton. This percentage is not heavily dependent on which model fits best. However, the counts of markers with a singleton in the last component are much higher, and now we do see a clear relationship with the best fitting model: for markers with a best fitting three-component model, only 42 (7.5%) have a singleton in the last component, whereas markers with best fitting two-, four-, and five-component mixture models have singletons in 25, 26, and 36% of the cases, respectively.
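The singleton diagnostic can be sketched in Python as follows. The text does not spell out how observations are assigned to components for this count; the sketch assumes assignment to the component with the highest posterior probability. Function names and the example posterior matrix are hypothetical.

```python
def component_counts(post):
    """Number of observations assigned to each component.

    Each observation goes to the component with the highest posterior
    probability (an assumption of this sketch).
    """
    counts = [0] * len(post[0])
    for row in post:
        counts[row.index(max(row))] += 1
    return counts


def has_singleton(post, component):
    """True if exactly one observation is assigned to the given component."""
    return component_counts(post)[component] == 1


# Hypothetical posteriors; components are ordered by increasing mean.
post = [
    [0.99, 0.01, 0.00],
    [0.97, 0.03, 0.00],
    [0.02, 0.97, 0.01],
    [0.01, 0.98, 0.01],
    [0.00, 0.02, 0.98],
]
singleton_first = has_singleton(post, 0)
singleton_last = has_singleton(post, len(post[0]) - 1)
```

A flagged marker is a candidate for inspection rather than automatic removal, since a singleton may represent a true rare genotype.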
The problem with outlying observations is that they may be, but not necessarily are, erroneous: a component with a singleton may represent a true genotypic situation. If we assume that rare genotypes AA and aa occur approximately equally often across all markers, and that most singletons in the first component represent true aa genotypes, we conclude that if markers with best fitting three-component mixture model have singletons in the last component, most of these represent true AA genotypes. The much higher percentages of singletons in the last component found for markers with two-, four- or five-component models suggest that the intensity is erroneous outlying (whatever the reason may be), and need further examination.
Table 5 shows the results of the simulation study to underpin the LRT for HWE, as described in “Features for enhanced and stabilized unmixing, data quality and model selection”. We note that for allele frequencies p = 0.3, 0.4, 0.5 the type I error rates are close to the nominal value 0.05. For smaller values of p the LRT is slightly conservative, rejecting the null hypothesis not often enough (with error rates between 0.034 and 0.045). We suspect that the reason is data sparseness: if p is small, π1 = $p^2$ is close to zero, frequently yielding mixtures with only 1 or 2 observations in the first component. We conclude that the LRT is justified to test for mixing probabilities according to HWE.
Figure 4 shows an example of a marker with mixing probabilities according to HWE. First a mixture model with unrestricted πj is fitted, shown in subplot 4a, with LL = 94.2. Second, a mixture model with πj according to HWE is fitted, shown in 4b, with LL = 93.8 and estimated allele frequency The hypothesis test of πj according to HWE uses the test statistic LR = 2 × (94.2 − 93.8) = 0.8, and has P value Hence, the null hypothesis of HWE is not rejected.
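The arithmetic of this test can be sketched in Python. The sketch assumes a chi-square reference distribution with 1 degree of freedom, because the unrestricted model has two free mixing probabilities and the HWE model has one free allele frequency; the text does not state the reference distribution explicitly. The mixing probabilities follow the parameterization π1 = p² used in this paper, and the function names are hypothetical.

```python
import math


def hwe_mixing(p):
    """Mixing probabilities (pi1, pi2, pi3) under HWE, with pi1 = p**2."""
    return (p * p, 2.0 * p * (1.0 - p), (1.0 - p) ** 2)


def lrt_hwe(loglik_unrestricted, loglik_hwe):
    """Likelihood ratio statistic and P value, assuming chi-square with 1 df.

    For 1 df the upper tail probability equals erfc(sqrt(x / 2)).
    """
    lr = max(2.0 * (loglik_unrestricted - loglik_hwe), 0.0)
    return lr, math.erfc(math.sqrt(lr / 2.0))


# Log-likelihoods of the two fitted models for the marker of Fig. 4.
lr, p_value = lrt_hwe(94.2, 93.8)
reject_hwe = p_value <= 0.05
```

Under the stated assumption the P value is about 0.37, so `reject_hwe` is false, in line with the conclusion that the null hypothesis of HWE is not rejected.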
The results for all selected markers are shown in Table 6 (cf. Table 2 in van Berloo et al. 2008b). If the LRT gives a P value of >0.05, the null hypothesis of HWE for the marker is not rejected, and we accept the mixture model with mixing probabilities according to HWE. We find large differences between chromosomes in the percentage of markers in HWE, ranging from low percentages on chromosomes 4, 5, and 8 to (almost) 100% on chromosomes 3 and 9. In the selection of unmapped markers, 53% do not show evidence against HWE.
In this paper we describe a method for the codominant scoring of AFLP markers in association panels without prior knowledge of genotype probabilities. AFLP bands are scored codominantly by fitting normal mixture models to the band intensities per marker, using the EM-algorithm. The EM-algorithm is used for maximum likelihood estimation of normal mixture parameters. It is known for its slow convergence rate, but proved fast enough for the size of the example dataset we analyze here. We study a number of features that facilitate the codominant scoring of AFLP bands, like different parameter initializations for the normal mixture fitting, restrictions on parameters (equal standard deviations, equal or nearly equal distances between component means, mixing probabilities according to HWE), easy data transformation, and outlier removal. Histogram visualization with superimposed normal densities, and optional classification scores and other grouping information assists further in the codominant scoring of the bands. The methods for codominant scoring with facilitating features are implemented in a program in R, that is available from the authors.
Traditionally, the output from codominant scoring based on mixture models is the “hard” classification of genotypes into categories “A”, “B”, “H”, augmented with regions of doubt “C” (=“not A”) and “D” (=“not B”), for which an extra region of doubt “Z” (=“B or H or A”) is needed for completeness. How cultivars classified into regions of doubt should be dealt with remains unclear, and depends on the purpose of the subsequent analysis. For example, in standard QTL mapping a marker label “C” or “D” may be changed into an informative label “A”, “H”, or “B”, using information from flanking markers. This is not possible in association mapping, where only information on the marker itself is used. We propose to replace the hard classification by a fuzzy classification: use the posterior probabilities of individuals to belong to each of the three genotype classes AA, Aa, or aa. The posterior probabilities are direct results of the fitted mixture model without the intervening threshold needed for a hard classification. Given the posterior genotype probabilities, predictors of additive or dominance effects are easy to calculate, and can be used, e.g., in association studies.
The EM-algorithm for fitting normal mixture models needs starting values of the parameters. We have studied two types of starting values, and find that cluster-based starting values outperform (what we call) guesstimates of the starting values, especially for more complex models. We recommend to fit models twice using both methods for starting values, and choose the fitted model with highest LL.
The EM-algorithm necessarily converges to a local maximum of the likelihood. Recently, papers appeared describing attempts for global optimization of the likelihood, using methods from Operations Research (Heath et al. 2009; Jank 2006a, b). Heath et al. (2009) mention that repeat application of EM (as we propose here) may achieve similar results. A further study into the global optimization of the likelihood in mixture models is advisable.
We find empirical evidence favoring the square root transformation to arrive at homoscedastic normal mixture models.
We have studied criteria for data quality of AFLP markers with respect to codominant scoring, focusing on optimal number of components of the mixture model, separation of components, and occurrence of outliers. In our example dataset (an association panel of tomato), the desired normal mixture model with three components, valid for diploid organisms, is selected by BIC for about half of the 1,175 polymorphic bands (if choosing from models with 2, 3, 4, or 5 components). A model with more than three components is selected for about 38% of the markers. Models with more than three components make no sense for diploid organisms, if the components of the mixture model correspond to copy numbers of a unique DNA fragment for the different genotypes. However, if an AFLP band consists of two different DNA fragments of equal length, which we call collision (see Gort et al. 2006, 2008), a four- or five-component model cannot be ruled out. A model with two components, which could have a biologically sound interpretation, is selected by BIC for only 13% of the markers.
In total, 69% of the markers with best-fitting g-component models have well separated components. This percentage declines with g. Models with good separation are to be preferred, because they will lead to crisp classifications: posterior probabilities close to 0 or 1. Markers with best fitting two-, four-, or five-component models have in 25–35% of the cases a single observation assigned to the component with highest mean, whereas for markers with best fitting three-component model this is only 7%. For the component with lowest mean we find 5–10% singletons in all cases. From this, we cautiously conclude that markers, with two-, four- or five-component mixture models selected as best, contain more often an erroneous outlying observation than markers with three-component models selected best.
From the above we can distill a recipe for the automatic selection of AFLP markers that can be reliably and consistently scored: select markers with best fitting three-component mixture model according to BIC, good separation of components, lack of singletons, robustness against parameter initialization, and robustness against slight data transformation. We have seen that many markers do not show the preferred number of three clusters, or have other characteristics that make them less optimal. An interesting question is what should be done with these markers. We do not recommend discarding these markers blindly, but instead using map information to decide on their use. If it concerns a mapped marker with many other neighbouring markers, it could easily be discarded. If the map is rather sparse, it may be worthwhile to check what is causing the problem.
The LRT to test for mixing probabilities according to HWE appears to be reasonable, as we find from a simulation study. In the example association panel, large differences in percentages of markers in HWE are found between the chromosomes, with percentages ranging from 6–18% (chromosomes 4 and 5) to 95–100% (chromosomes 3, 6, and 9). These differences may be caused by population substructure in the set of tomato cultivars. We found that chromosomes 4 and 5 contain markers related to the cherry/non-cherry subgroups.
Codominant scoring can also be exploited in AFLP mapping studies. AFLP maps are almost always based on dominantly scored markers. Piepho (2001) describes how band intensities can be used to infer the recombination frequency, and next to order the markers on a map. The information of band intensities is used by Pérez-Enciso and Roussot (2002) in a general pedigree to estimate identity by descent probabilities, to be used in subsequent QTL mapping strategies. For completeness, we note that AFLP markers can be codominant in another sense. If two AFLP fragments differ in size by a few basepairs, e.g., by an indel, but are identical in other respects, and originate from the same locus, they can be used as codominant markers. Such bands or fragments are called allelic markers. Special algorithms and software can find such markers, and score them codominantly (Meudt and Clarke 2007). An example of a study of this type of codominance is Wong et al. (2007).
Liu (2007) urges caution in the use of codominant scoring because of the non-linear nature of the polymerase chain reaction, which is at the basis of the AFLP procedure, and even discourages the use in case of samples from random mating populations. We have demonstrated, though, in this study of an unstructured association panel of hybrids, that large numbers of AFLP markers can be scored codominantly in a satisfactory way. The main advantage of codominantly scoring AFLPs is obviously being able to distinguish heterozygous from homozygous individuals. Even if some uncertainty about the true genotypic class of a cultivar remains, and some AFLP bands are lost due to low data quality, this advantage makes the codominant scoring of AFLPs in association panels worthwhile.
We thank Ralph van Berloo for a description of the codominant scoring done by Keygene NV. We further acknowledge the contributions of the Center for BioSystems Genomics (Project CBSG 2012: BB9-12), and the Generation Challenge Program (Project GCP-G4007.09).
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
We wrote software routines for the codominant scoring of AFLP profiles in R (Ihaka and Gentleman 1996), which are available from the authors. In the software we fit and visualize mixture models, using the EM-algorithm. The main routine takes, besides the normalized intensities and optionally the raw intensities, a number of arguments to allow for the different features described earlier. The arguments are concisely described below.
The definition of the R function with all arguments follows here:
The routine returns the estimated means, standard deviations, prior probabilities, and posterior probabilities. For mixtures of 2 or 3 groups the hard classifications are given as well. In the case of Gaussian mixtures the log likelihood is also returned. Based on the data and the model fit, a histogram visualization with fitted densities can be produced. Optionally, the observations can be plotted on the x-axis using a color coding corresponding to the hard classification. We use the following color codes: red = A, green = H, blue = B, violet = C, magenta = D, black = Z.