Abstract
Inference of population structure from genetic markers is helpful in diverse situations, such as association and evolutionary studies. In this paper, we describe a twostage strategy in inferring population structure using multilocus genotype data. In the first stage, we use dimension reduction methods such as singular value decomposition to reduce the dimension of the data, and in the second stage, we use clustering methods on the reduced data to identify population structure. The strategy has the ability to identify population structure and assign each individual to its corresponding subpopulation. The strategy does not depend on any population genetics assumptions (such as HardyWeinberg equilibrium and linkage equilibrium between loci within populations) and can be used with any genotype data. When applied to real and simulated data, the strategy is found to have similar or better performance compared with STRUCTURE, the most popular method in current use. Therefore, the proposed strategy provides a useful alternative to analyse population data.
Keywords:
population structure; subpopulation; singular value decomposition; dimension reduction; clusteringIntroduction
Information about the population structure of species is useful in a variety of situations, such as admixture mapping, subspecies classification, genetic barrier detection and evolutionary study [15]. For example, anthropologists may have the debris of ancient people, supplied by archaeologists, and want to learn about the relationship between the ancient people and modern populations to infer the evolutionary history of human beings. Population structure can be identified based on visible characteristics such as language, culture, physical appearance and geographical region. But this can be subjective and may bear no relevance to genetics [3]. In other situations, the presence of population structure may constitute a practical nuisance. In association studies, casecontrol design is often used to identify genetic variants underlying complex traits by comparing allele frequencies between unrelated individuals who are affected and those who are unaffected. The presence of population structure can lead to spurious associations between a candidate marker and a phenotype, however, as a result of population structure in the sample [6,7]. In forensic studies, the identification of reference groups is very important, but this can be difficult when population structure exists [4,8]. In all of these situations, the first step is to identify population substructure.
Pritchard et al. introduced a modelbased clustering method to infer population structure and assign individuals to populations using multilocus genotype data [3]. They used a Bayesian formulation to generate the posterior distribution using a Markov chain Monte Carlo (MCMC) method based on Gibbs sampling. Their main modelling assumptions were HardyWeinberg equilibrium within populations and linkage equilibrium between loci within each population [3]. This is the predominant method currently used in genetic studies. Some other methods have been proposed,[912] but all are modelbased (parametric) methods. These methods have their own advantages and disadvantages. They all have model assumptions because of their parametric nature.
Here, we describe a twostage strategy for inferring population structure, which is an alternative and complementary approach to STRUCTURE[3,10] for exploring data. In the first stage, we use methods such as singular value decomposition (SVD) to reduce the dimension of data and then perform clustering on the reduced data. This twostage strategy is widely used in knowledge induction and representation to determine similarities between the meaning of words and passages by analysis of large text corpora [13,14]. Our method does not use assumptions such as HardyWeinberg equilibrium for populations or linkage equilibrium for loci. Here we show that our method is faster and has comparable (when model assumptions hold) or better performance (when model assumptions fail) than STRUCTURE when applied to real and simulated data [3,10].
In the next section, we describe the strategy and methods we use and some of the advantages of the approach we take. We illustrate our method with examples and make comparisons with STRUCTURE in the Results section. In the Discussion section, we highlight issues in the methods, the potential use of the methods, and future work.
Methods
It is well known that cluster analysis is difficult in highdimensional space because standard clustering algorithms such as the ExpectationMaximization (EM) algorithm[15] and the Kmeans method are probably trapped in local minima [13,16]. Although many initialisation methods have been proposed to deal with this problem, they have had only limited success [13]. Therefore, a twostage procedure seems valuable: first reduce the dimension of the original space and then cluster in the reduced (lowdimensional) space. In general, any dimension reduction methods and clustering methods can be plugged into this twostage framework. In this paper, we use SVD as the dimensionreduction method, and the mixture model and Kmeans as the clustering methods. We also propose a nonparametric clustering method, which can be viewed as a variant of the Kmeans method, for small sample sizes.
Dimension reduction
SVD is widely used in knowledge induction and representation and information retrieval. For example, SVD plays a key role in latent semantic analysis (LSA) or latent semantic indexing (LSI). The semantic dimensions are thought to contain redundant and noisy information, which can be separated out and should be ignored. Bartell et al. showed that the document representations given by LSI are equivalent to the optimal representations found when solving a particular multidimensional scaling problem in which the given interobject similarity information is provided by the inner product similarities between the documents themselves [17]. LSI automatically computes a much smaller semantic subspace from the original text collection. This improves recall and precision in information retrieval; information filtering or text classification; word sense disambiguation; word sorting and relatedness judgments; the prediction of learning from text; and summarising skills [14,18]. The effectiveness of LSI in empirical studies is often attributed to the reduction of noise, redundancy and ambiguity [14,18,19]. By introducing a dual probabilistic model based on similarity concepts, Ding showed that semantic correlations could be characterised quantitatively by their statistical significance  that is, the likelihood [18]. He further showed that LSI is the optimal solution of the model and proved the existence of the optimal semantic subspace. This model explains theoretically the performance improvements observed for LSI.
Mathematically speaking, SVD is a matrix decomposition technique. A realvalued mbyn matrix (say X) can be represented uniquely (up to certain trivial rearrangements of columns and subspace rotations, in the case of duplicated singular values) as the product of three matrices:
where both U and V are column orthonormal and S is a diagonal matrix of singular values [20]. There is a direct relationship between SVD and principal component analysis (PCA) when PCA is performed from the covariance matrix using the following equations:
If each row of X is normalised (centred and unitary), the covariance matrix ∑ of data X is XX^{T}. We know that:
where A is an orthonormal matrix and the λ values are the eigenvalues of ∑. The decomposition is unique up to some trivial column rearrangements. Matrix A contains the principal components of columns of X. From equations (2) and (4), we can see that the left singular vectors U are the same as the principal components of columns of X. Similarly, the right singular vectors V are the same as the principal components of rows of X.
Clustering
We choose to use two clustering methods: one is mixture 4, proposed by Figueiredo et al., based on the mixture model,[21] the other is Kmeans. The advantages of mixture 4 are that it is capable of selecting the number of components (ie the number of clusters) and that it is relatively robust to the initialisation of the parameters. Figueiredo et al. [21] used the following finite mixture models:
where Y = [Y_{1},..., Y_{d}]^{T}, a ddimensional random variable with y = [y_{1},...,y_{d}]^{T }being a realisation of Y; α_{1},...,α_{k }are the mixing probabilities; each θ_{m }is the set of parameters defining the mth component; and θ = {θ_{1},..., θ_{k}, α_{1},...,α_{k}} is the complete set of parameters needed to specify the mixture. They used the minimum message length criterion and the componentwise EM algorithm to integrate estimation and model selection in a single algorithm [22]. This method can avoid another wellknown drawback of the EM algorithm for mixture fitting  namely, the possibility of convergence towards a singular estimate at the boundary of the parameter space [21].
Kmeans is a commonly used nonparametric clustering method, but it has some drawbacks. We propose a clustering method (densitybased mean clustering [DBMC]), which is a variant of Kmeans but can avoid some of its drawbacks. The details of DBMC are provided in Appendix 1.
Number of subpopulations
There are many methods for estimating the number of clusters which can also be used for estimating the number of subpopulations. Zhu et al. showed that Bayesian information criterion (BIC) from their mixture model performed better than STRUCTURE in inferring the number of subpopulations [23]. All of these methods can be integrated into the clustering procedure.
Missing data imputation
It is not uncommon to have missing values in genetic studies. Such data can be manually flagged and excluded from subsequent analyses [24]. Many analytical methods, such as PCA or SVD, require complete matrices [25]. Although some studies reported dealing with SVD/PCA with missing data,[2629] they often rely on specific probabilistic models and have limited generalisability. Although one solution to the missing data problem is to repeat the experiment, and this method has been used in the validation of microarray analysis algorithms,[30] this strategy may be too expensive and impractical for most studies. Therefore, we need to estimate the missing values from nonexperimental methods.
There is little published literature concerning missing value estimations for genotype data from human populations. The uniqueness of this issue is that the genotype data are categorical by nature. Note that Sen and Churchill[31] and Broman et al. [32] discussed using the EM algorithm and the hidden Markov model to deal with missing genotypes, but they were mostly concerned with experiments involving inbred animals.
Genotype data are usually in the form of large matrices of genotypes of marker loci (columns) from different persons (rows) [3]. Without loss of information, we can transform this personmarker matrix into a genotypeperson matrix. For each marker, all of the genotypes appearing in the data are listed, one genotype per row, with a value of one for the cell if a person (column) has this genotype, and zero otherwise. Using this reformatting we now have a large 01 matrix. We can view this genotypeperson matrix as a frequency matrix, with each cell denoting the frequency of the person (its column) who has the genotype that is denoted by its row. Such frequency matrices are commonly used in LSA and are called 'worddocument matrices' or 'termdocument matrices' [14,18,33]. Because we can fully reconstruct the original personmarker matrix from this genotypeperson matrix, there is no information loss in this transformation. We impute the missing values on the basis of this genotypeperson matrix. In this study, we used an imputation method which is similar to the 'K nearestneighbour' (KNN)based method used in Troyanskaya et al. [34] The rationale underlying this method is that where data points are clustered together (similar) in the lower dimension, we can expect them to be clustered together (similar) in the higher dimension as well. In this way, the missing dimensions of a data point (individual in our case) can be estimated by its neighbours (those which are very similar to the data points under study), with no missing data in these dimensions. Details of this method are provided in Appendix 2. In this study, we did not iterate to impute the missing values (ie we only use KNN once).
Results
Data
To evaluate fully the performance of the proposed strategy, we applied it to two real datasets and two simulated datasets. The first real dataset was that reported in Rosenberg et al. ,[5] which has genotypes at 377 autosomal microsatellite markers in 1,056 individuals from 52 populations. Here, we considered the whole dataset, as well as two American populations, the Pima and Surui populations, with 25 and 21 individuals, respectively, to demonstrate our methods. The other real dataset was from the HapMap project;[35] we used genotype data from 45 Chinese and 44 Japanese on chromosome 17 which was released in October 2005. A dataset is formed by the 500 most informative single nucleotide polymorphisms (SNPs) using the methods proposed in Rosenberg et al. [36]
One simulated dataset was generated under the coalescent model using MS, a program developed by Hudson [37]. A progenitor population gave rise to two subpopulations 3,000 generations ago. Subpopulation 1 had a constant size of 10,000, began to grow exponentially 1,000 generations ago and has now reached 40,000. Subpopulation 2 had a constant size of 2,000 before 2,000 generations, and then instantaneously expanded to 10,000 and has remained at that size until the present. We also assume that the mutation rate per site per generation is 10^{8 }and that we are interested in a segment of 10 kilobases. No recombination is set. In this fashion, we have generated 100 such chromosomal segments, each segment harbouring 2777 SNPs. The chromosomal segments are pooled together to produce more genotype data. A dataset is formed by randomly sampling 400 haplotypes (200 individuals) from subpopulation 1 and 200 haplotypes (100 individuals) from subpopulation 2. The second simulated dataset was taken from Tang et al. [12] (http://www.fhcrc.org/science/labs/tang/ webcite). This dataset contains 50 individuals from each of the two ancestral populations and 200 individuals from the admixed population. The true individual admixture values of the admixed individuals are also available.
Population structure identification
We used STRUCTURE 2.0[10] and our SVDbased procedure on the datasets. In the second stage (the clustering stage) of our procedure, we used both the mixture model and Kmeans methods for clustering. For STRUCTURE, we tried the four available models. If not stated explicitly, we used the default model with admixture and correlated allele frequencies and set both burnin length and number of MCMC replications after burnin to be 20,000.
For Rosenberg et al.'s full dataset,[5] we followed their procedure and ran analyses (STRUCTURE 2.0 and mixture 4) multiple times for the number of populations (clusters) from two to six. Table 1 shows the Pearson correlation coefficients between the results of the STRUCTURE and mixture 4 analyses using the first five principal components. For number of populations (clusters) K = 5 and 6, the relatively small correlation coefficients of two clusters (cluster 5 for K = 5 and cluster 1 for K = 6) are caused by the accumulated differences between the estimates of the two methods (the probabilities of belonging to one cluster, or membership coefficients, must sum to 1 across clusters).
We tried the four models available in STRUCTURE 2.0 on the PimaSurui data subset with 100 randomly chosen markers. The model assuming independent allele frequencies among populations with no admixture yielded the best results. Burnin length and number of MCMC replications after burnin were all 10,000 in the analyses. The results are summarised in Table 2. Clearly, it is difficult to draw the conclusion that there are two populations in the dataset from the above results, but once we set K = 2, STRUCTURE 2.0 can assign each individual correctly to the population it belongs to.
Table 2. Results from STRUCTURE 2.0 on the PimaSurui data with 100 randomly selected markers without missing genotypes.
Before performing SVD, we first transformed the data into the genotypeperson format. Wall et al. reported that preprocessing is critical in SVD/PCA,[38] which is well known in LSA [14,18,19,39]. We applied the socalled tfidf transformation[18,39] on the genotypeperson matrix. For the PimaSurui subdataset with 100 randomly chosen markers, on the reduced twodimensional space, mixture 4 finds two clusters. The mixture 4 and Kmeans methods assign individuals correctly to their populations with two clusters.
Figure 1 plots pairwise cosine similarities between individuals in the reduced twodimensional space. Figure 2 shows the pairwise cosine similarities between individuals using the original data without reduction. It seems that the original data are noisier and that SVD not only reduces the dimension but also reduces noise.
Figure 1. Pairwise cosine similarities between individuals in the reduced twodimensional space of the PimaSurui data with 100 randomly selected markers without missing genotypes.
Figure 2. Pairwise cosine similarities between individuals using the original Pima  Surui data with the same 100 markers as in Figure 1.
To evaluate our method's performance, we reduced the number of markers. When the numbers of markers were 80, 60, 40 and 20, both methods performed equally well (data not shown). When the number of markers was reduced to ten, both methods still performed well, with STRUCTURE slightly better when the marker information was limited (data not shown).
To compare the performance of the methods fully, we conducted a simulation study using population genetics models. Table 3 shows the results for three datasets from the simulation study. It has been reported that STRUCTURE provides very stable estimates when the model assumptions hold [5]. In the presence of tightly linked SNPs, STRUCTURE not only performed worse but was also not very stable. For example, Table 3 shows that when 494 SNPs were used, the best performance of STRUCTURE only misclassified three individuals; however, the median number of misclassified individuals by STRUCTURE was 33. For the same dataset, the time taken for the analyses (using a laptop with Intel Pentium M 1.80 GHz CPU, 512 MB RAM and Windows XP) was about 30 minutes for STRUCTURE and about 28 seconds for the proposed method computing 100 largest singular values in MatLab software (The Mathworks, Inc.).
Table 3. The performance of STRUCTURE 2.0 and the singular value decomposition (SVD)based method on the simulated data.
For the dataset from the HapMap project, STRUCTURE best performance's misclassified two of the 45 Chinese individuals and two of the 44 Japanese individuals, whereas the best performance of the proposed strategy classified all individuals correctly. The inferior performance of STRUCTURE was partly due to the fact that some of the markers in the dataset were tightly linked. For example, the distances between some SNPs were less than 100 base pairs. It is well known that Chinese and Japanese populations are closely related and very difficult to distinguish in genetics studies. It is not too difficult, however, to distinguish between informative markers. This also confirms that SNPs can be very informative in inferring population structures [40].
We used the simulated data with admixed individuals to evaluate the performance of the proposed strategy on data exhibiting population admixture. Figure 3 shows the results. The membership coefficients were calculated following the method of Nascimento et al. [41] We found that STRUCTURE performed slightly better than the proposed strategy. This is expected because when the model assumptions hold, parametric methods with correct assumptions should always perform better.
Figure 3. The results of STRUCTURE and the singular value decomposition (SVD)based method based on the simulated data set of Tang et al. [12]For the SVDbased method, ten principal components were used. The Kmeans method was used for clustering. The individual admixture (IA) values were calculated using the method of Nascimento et al. [41].
Evaluation of DBMC
We used our DBMC method on the reduced data to evaluate its performance. DBMC performed well in different reduced dimensions. Figure 4 shows the formation of the initial partitioning by DBMC. There were four points (blue circles without other symbols superimposed) left ungrouped by Gap statistics. They were classified into the second cluster by the initialisation procedure. Therefore, after the densitybased initialisation, all but one individual (number 42 in the figures with coordinates  5.95710,  0.19782) were correctly classified. Only one iteration was required to finish the clustering correctly.
Figure 4. Illustration of the formation of the initial partitioning by densitybased mean clustering on the first two principal components of the PimaSurui data with 100 randomly selected markers without missing genotypes. The numbers (1 and 2 here) over the triangles indicate the order of cluster starting points identified. Points with black plus signs indicate the points identified as the first cluster by Gap statistics. Points with red crosses indicate the points identified by Gap statistics as the second cluster. Blue circles represent the original points in the twodimensional space after singular value decomposition. Downward triangles indicate the starting points for each cluster. Points with red crosses indicate the points identified by Gap statistics as the second cluster.
Discussion
We have described a twostage strategy for using multilocus genotype data to examine population structure and to assign individuals to populations. We prefer to call this approach a strategy, instead of a method, because it provides a framework, not just one method. One can choose different dimension reduction and clustering methods to fit into the framework.
Our strategy does not rely on any population genetic assumptions, such as HardyWeinberg equilibrium and linkage equilibrium between loci within populations. This means that violation of the assumptions does not invalidate our strategy. For modelbased methods, the violation of assumptions makes these methods invalid, at least theoretically, although some methods may be robust to certain departures from assumptions. We have shown, through simulation and real data analyses, that the proposed approach is not affected by departure from the linkage equilibrium assumption for markers in the data; however, tightly linked markers may provide redundant information, so more markers are usually needed. In this situation, the validity of our strategy is not affected, but the validity of modelbased methods becomes questionable.
It is reported that pretransformation is critical in SVD [18,38,39]. We choose to use the tfidf transformation, which is widely used in information retrieval, but other pretransformations are possible. Before the pretransformation, it would be helpful to eliminate the noninformative rows and columns [19]. In our experiment, we only eliminated the markers in situations where just one genotype appears in the whole sample. It is possible to use some criteria (such as entropy) to filter out the noninformative data. This would make the analysis faster (because the matrix becomes smaller) and more efficient (because the remaining matrix is more informative).
There are many dimension reduction techniques besides SVD, such as PCA and its variants (probabilistic PCA, nonlinear PCA etc), correspondence analysis, multidimensional scaling, independent component analysis, projection pursuit and projection pursuit regression, principal curves and methods based on topologically continuous maps and neural networks [42]. The reason we chose SVD is that many efficient algorithms exist for this method. Because we only need a few principal components, even more efficient algorithms are available for this purpose. Although the personmarker matrix can be large if there are many markers or individuals, the SVD procedure can be performed very quickly. In addition, SVD seems to reduce noise, as shown in Figures 1 and 2. It is likely that other dimension reduction methods may also yield good results.
It would be more statistically sound to view our genotypeperson matrix as a binary matrix. Several methods have been proposed for PCA on binary data,[4345] but they implicitly assume that the observations are independent across the dimensions, which does not apply in our casethe genotypes are by no means independent at each marker. Nevertheless, we have tried the logistic PCA[44] on our data, but the results were not as good as those achieved by treating the genotypeperson matrix as a real matrix. Figure 5 shows the first three principal components from logistic PCA using the criterion that the log likelihood change is less than 0.1. It is obvious that the two clusters are not clearly separated. Although we could perform dimension reduction using probabilistic PCA on the original genotype data matrix, which is a categorical matrix, no methods are available for such analyses. One PCA method for categorical data is implemented in SPSS software (SPSS Inc.) through the use of the optimal scaling (or optimal scoring) approach to turn the categorical problem into a quantitative one,[46] and applies PCA to the numeric matrix. The coding scheme of the categorical data in the original matrix may affect the resulting numeric matrix, however; moreover, one needs to set the optimal scaling level for analysis variables subjectively, whereas we intend to avoid subjective choices in our strategy.
Figure 5. The first three principal components from logistic principal component analysis using the criterion that the log likelihood change is less than 0.1 in the PimaSurui data.
It is an open question as to how to choose dimensionality for the reduced space when using SVD or PCA. Much work has been done on this topic, including likelihood ratio, Minimum Description Length (MDL), Akaike information criterion (AIC), BIC,[47] Laplace's method,[48] and probabilistic PCA model [26]. All of these methods are based on some probabilistic model, however  usually the normality assumption. In our case, the assumption is obviously not appropriate. Laplace's method seems to give a more reasonable choice than others, but can only serve as a guideline. Because our purpose is clustering, one possible way for choosing the optimal dimension is by clustering results. For each given dimension, we can perform cluster analysis on the reduced space and evaluate the resulting clusters  for example, between to withincluster variation. We can then select the optimal dimension as the one with the best clustering evaluation. Methods based on appropriate models (perhaps binary or categorical models) or nonparametric (empirical) approaches should be more appropriate for our problem  and we are planning to investigate this in the future.
Ando observed that, in LSA, using SVD, the topics underlying outlier documents (ie those documents that are very different from other documents) tend to be lost as lower numbers of dimensions are chosen [33]. A general explanation of the good performance of LSI is that when eigenvectors with smaller eigenvalues are left out, noise is eliminated, and, as a result, the similarities among the linguistic units are measured more accurately than in the original space. According to the mathematical formulation of SVD, dimensional reduction comes from two sources: outlier documents and minor terms. These two types of noise are mathematically equivalent and are inseparable under SVD. However, people do not want to consider the outlier documents as 'noise', when their interest is in characterising the relationships among the documents while all the documents are assumed to be equal. In our case, fine structure (small numbers of individuals who are very different from others) may be lost, especially when the sample size is small. Hastie et al. noted that finer structure can be lost with any dimension reduction method [49]. Ando proposed an algorithm which differs from SVD in that terms and documents are treated in a nonsymmetrical way [33]. By scaling the vectors in each computation of eigenvectors, his algorithm tries to eliminate noise from the minor terms without eliminating the influence of the outlier documents. Further analyses are needed to evaluate this method.
In this paper, we chose to use mixture models as our clustering methods. The advantages of the mixture models include their readiness for use, their ability to choose the number of clusters automatically and their computational efficiency. These are by no means the only choice, however, and we have also considered Kmeans methods here. In fact, both mixture models and DBMC perform well. Conventional Kmeans performs a little worse (given the number of clusters as a priori). In our analysis it happened that some initial values produced very different (worse) clustering results by the conventional Kmeans. In general, when the sample size is large and the model provides a reasonable description of the data, mixture models (modelbased methods in general) perform well. When the clusters are restricted to globular regions, Kmeans should work well. In our analysis of the PimaSurui dataset, our sample size was not small (25 and 21 for two clusters) and the cluster shapes were convex (data not shown), so it is not surprising that both mixture models and Kmeans performed well.
We used cosine similarity to measure 'similarity' between individuals. Because cosine similarity is easy to interpret and simple to compute, it is widely used in text mining and information retrieval [14,18,19]. It is natural to measure 'similarity' between vectors by their inner product. Cosine is closely related to inner product and correlation. If the vectors have unit length, cosine is equivalent to inner product. If the vectors are centred, cosine is the same as correlation.
Our strategy can be used for identifying populations and assigning individuals in situations where there is little information about population structure. It should also be useful in situations where cryptic population structure is a concern, such as in casecontrol studies in association mapping.
In summary, we find that the strategy we have described in this paper has the ability to identify population structure, make correct inferences of the number of subpopulations and assign individuals to their corresponding subpopulation. Most of all, it is model free and does not depend on any genetics assumptions. Although it has several advantages over its parametric counterpart, as pointed out by Tang et al.: [12] 'no one method is universally preferable'; however, it provides a useful alternative to analyse genetic data for population structure inferences.
Appendix 1
A variant of the Kmeans method
Kmeans is a commonly used nonparametric clustering method, but it has the following drawbacks:
(1) The initial partition may affect the results. Randomisation is often used but has limited success [13].
(2) The procedure may not converge. If the procedure is not well defined, it is quite possible for the procedure to oscillate indefinitely between two or more partitions and never converge. This defect was recognised by the developer of the Kmeans method [50].
(3) It cannot determine the number of clusters, which is either preset or visually determined.
We propose a clustering method which is based on Kmeans and can avoid the above drawbacks. The basic idea of our method is that to identify a cluster starting from the point with the highest density around it in the current dataset. To be more specific, suppose we are given n data points X_{1}, X_{2},..., X_{n}. Let π_{1}, π_{2},..., π_{k }denote a partitioning of the data into k disjoint clusters such that
The algorithm of this method (densitybased mean clustering [DBMC]) is as follows. Vary the total number of clusters from k = 1, 2,..., K. For each k, perform the following procedure:
1. For every data point in the sample, calculate the distance to its mth (usually three or four) nearest neighbour and identify the point that has the smallest value (highest density). Choose this point as the starting point.
2. Find the point nearest to the starting point and merge these two points to form a cluster. Repeat until all the points are in the cluster. This results in a sequence of nested clusters.
3. Use the Gap statistic[49] to obtain the cluster size and form one cluster. The Gap statistic uses the criterion of betweentototal variance for the goodness of a cluster. The Gap statistic selects the optimal cluster among the nested clusters as the one with the biggest difference (Gap) between the observed and the expected variance by permutation.
4. For the remaining data, repeat steps 1 to 3, until k clusters are found or all points are included (no point left). This leads to the initial partitioning, namely Calculate the centroids for each cluster, denoted as , and set the index of iteration to t = 0.
5. For each data point, find the nearest centroid and assign the point to the cluster represented by this centroid. This results in a new partitioning:
Compute the new centroids and repeat this updating procedure until a certain stopping criterion described below is met.
6. Evaluate the resulting clusters as described below. Among all of the k values studied, select the best clustering according to the quality evaluated in step 6.
For step 5, a stopping criterion is needed, an example is:
where one choice for E() is the objective function discussed by Dhillon and Modha[19] and an alternative candidate is the betweento total (betweencluster plus withincluster) sums of squares [51]. To ensure the convergence of when we use the betweento total sums of squares as the stopping criterion at each iteration, we choose the new partitioning to have a larger value of the betweento withinsums of squares, or else the iteration stops. Therefore, the algorithm outlined above never results in a decrease in the E(.) value, which is bounded from above by some constant [19]. Therefore, if DBMC is iterated indefinitely, then the value of E(.) will eventually converge. Note that this only means that the algorithm procedure will converge, but it does not imply that the underlying partitioning converges [19,52].
There are a number of methods for estimating the number of clusters [53]. Here, we chose the Gap statistic using the resampling method [54]. Suppose that the maximum possible number of clusters in the data is M. The basic idea of the Gap method for estimating the number of cluster K is to identify , , which provides the strongest significant evidence against the null hypothesis H_{0 }of K = 1, that is, 'no cluster' in the data. The Gap method employs the socalled uniformity hypothesis, which states that the data are sampled from a uniform distribution in the ddimensional space. It compares an observed internal index, such as the withinclusters sum of squares, to its expectation under a reference null distribution via resampling, and chooses the smallest k which maximises the Gap statistic as the number of clusters [53,54]. The basic idea of the Gap statistics for estimating cluster size[49] is similar to that of the Gap method for estimating the number of clusters.
Appendix 2
Missing genotype imputation
If we consider a person A who has a missing value in marker G, the KNNbased method would find K other persons, who have observed genotypes for marker G and are most similar to A in genotypes in markers other than G. A weighted average of frequencies of genotypes of marker G from the K closest persons is then used as an estimate for the missing value in person A. In the weighted average, the contribution of each person is weighted by the similarity of his/her genotypes to those of person A. We then use SVD on the imputed matrix to obtain the projections of each person onto a reduced space; choose the K nearest neighbours for person A in the reduced space; and repeat the KNN imputation. This iteration is repeated until some preset criteria are met. In summary, the algorithm works as follows.
1. Start with the genotypeperson matrix X which has missing values.
2. Compute cosine similarity between x*, who has missing values, and all other persons, using only those coordinates not missing in x*. Identify the K nearest neighbours.
3. Impute the missing coordinates of x* by the weighted average of the corresponding coordinates of the K individuals closest to produce X^{0}. Set i = 0.
4. Apply SVD to the complete matrix X^{i }to derive the reduced space and identify the K nearest neighbours in the reduced space.
Set i < i + 1 and repeat steps 3 and 4 until a preset number of iterations is reached, or M^{i } M^{i+1}/M^{i} is below some threshold, where M^{i }is the entire imputed matrix at the ith stage.
Acknowledgements
We thank Dr Mário Figueiredo for providing the source code for his mixture model and patiently answering questions; Mr Preslav Nakov for providing information about preprocessing; Dr Thomas Minka for providing the source code for the choice of dimensionality for PCA; Dr Michael Tipping for answering questions about dimension reduction for binary data; Dr Andrew Schein for providing the code for logistic PCA; Dr Noah Rosenberg for providing the data; Drs Rebecka Jörnsten and Hongyuan Zha for discussion on clustering number selection; Dr Andrew Pakstis for providing information about marker selection; and Dr Baolin Wu for general discussions. This work was supported in part by National Institutes of Health grants GM59507 and GM57672.
References

Corander J, Waldmann P, Sillanpaa MJ: Bayesian analysis of genetic differentiation between populations.
Genetics 2003, 163:367374. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Patterson N, Hattangadi N, Lane B, et al.: Methods for highdensity admixture mapping of disease genes.
Am J Hum Genet 2004, 74:9791000. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data.
Genetics 2000, 155:945959. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rosenberg NA, Burke T, Elo K, et al.: Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds.
Genetics 2001, 159:699713. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rosenberg NA, Pritchard JK, Weber JL, et al.: Genetic structure of human populations.
Science 2002, 298:23812385. PubMed Abstract  Publisher Full Text

Lander ES, Schork NJ: Genetic dissection of complex traits.
Science 1994, 265:20372048. PubMed Abstract  Publisher Full Text

Risch NJ: Searching for genetic determinants in the new millennium.
Nature 2000, 405:847856. PubMed Abstract  Publisher Full Text

Kim JJ, Verdu P, Pakstis AJ, et al.: Use of autosomal loci for clustering individuals and populations of East Asian origin.
Hum Genet 2005, 117:511519. PubMed Abstract  Publisher Full Text

Dawson KJ, Belkhir K: A Bayesian approach to the identification of panmictic populations and the assignment of individuals.
Genet Res 2001, 78:5977. PubMed Abstract

Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies.
Genetics 2003, 164:15671587. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Satten GA, Flanders WD, Yang Q: Accounting for unmeasured population substructure in casecontrol studies of genetic association using a novel latentclass model.
Am J Hum Genet 2001, 68:466477. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: Analytical and study design considerations.
Genet Epidemiol 2005, 28:289301. PubMed Abstract  Publisher Full Text

Ding CH, X H, H Z, H S: Adaptive dimension reduction for clustering high dimensional data.

Landauer TK, Foltz PW, Laham D: Introduction to latent semantic analysis.
Discourse Process 1998, 25:259284. Publisher Full Text

Dempster AP, Laird NM, Rubin DB: Maximum likelihood for incomplete data via the EM algorithm (with discussion).

Moore AW: Very fast EMbased mixture model clustering using multiresolution kdtrees. In Advances in Neural Information Processing Systems. Edited by Kearns M, Solla S, Cohn D. MIT Press, Cambridge, MA; 1999:543549.

Bartell BT, Cottrell GW, Belew RK: Latent semantic indexing is an optimal special case of multidimensional scaling.
Proc SIGIR'92 Research and Development in Information Retrieval 1992, 161167.

Ding CH: A probabilistic model for dimensionality reduction in information retrieval and filtering.
Proc 1st SIAM Computational Information Retrieval Workshop 2000.

Dhillon ID, Modha DS: Concept decomposition for large sparse text data using clustering.
Machine Learning 2001, 42:143175. Publisher Full Text

Golub G, Van Loan C: Matrix Computations. Johns Hopkins University Press, Baltimore, MD; 1996.

Figueiredo M, Jain AK: Unsupervised learning of finite mixture models.

Celeux G, Chrétien S, Forbes F, Mkhadri A: A componentwise EM algorithm for mixtures.

Zhu X, Zhang S, Zhao H, Cooper RS: Association mapping, using a mixture model for complex traits.
Genet Epidemiol 2002, 23:181196. PubMed Abstract  Publisher Full Text

Alizadeh AA, Eisen MB, Davis RE, et al.: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling.
Nature 2000, 403:503511. PubMed Abstract  Publisher Full Text

Alter O, Brown PO, Botstein D: Singular value decomposition for genomewide expression data processing and modeling.
Proc Natl Acad Sci USA 2000, 97:1010110106. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bishop CM: Variational principle components.
Proc 9th International Conference on Artificial Neural Networks 1999, 1:509514.

Brand ME: Incremental singular value decomposition of uncertain data with missing values.
European Conference on Computer Vision (ECCV) 2002, 2350:707720.

Chan KL, Lee TW, Sejnowski TJ: Handling missing data with variational learning of ICA. In Advances in Neural Information Processing Systems. Edited by Michael J, Kearns MN, Solla SA. MIT Press, Cambridge, MA; 2003:415420.

Roweis S: EM algorithms for PCA, SPCA. In Advances in Neural Information Processing Systems. Edited by Michael J, Kearns MN, Solla SA. MIT Press, Cambridge, MA; 1998.

Butte AJ, Ye J, Haring HU, et al.: Determining significant fold differences in gene expression analysis.

Sen S, Churchill GA: A statistical framework for quantitative trait mapping.
Genetics 2001, 159:371387. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Broman KW, Wu H, Sen S, Churchill GA: R/qtl: QTL mapping in experimental crosses.
Bioinformatics 2003, 19:889890. PubMed Abstract  Publisher Full Text

Ando RK: Latent semantic space: Iterative scaling improves precision of interdocument similarity measurement.

Troyanskaya O, Cantor M, Sherlock G, et al.: Missing value estimation methods for DNA microarrays.
Bioinformatics 2001, 17:520525. PubMed Abstract  Publisher Full Text

The International HapMap Consortium: The international HapMap project.
Nature 2003, 426:789796. PubMed Abstract  Publisher Full Text

Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry.
Am J Hum Genet 2003, 73:14021422. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation.
Bioinformatics 2002, 18:337338. PubMed Abstract  Publisher Full Text

Wall ME, Rechtsteiner A, Rocha LM: Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis. Edited by Berrar DP, Dubitzky W, Kluwer MG. Norwell, MA; 2003.

Nakov P, Popova A, Mateev P: Weight functions impact on LSA performance.

Liu N, Chen L, Wang S, et al.: Comparison of singlenucleotide polymorphisms and microsatellites in inference of population structure.
BMC Genet 2005, 6:S26. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Nascimento S, Mirkin B, MouraPires F: Modeling proportional membership in fuzzy clustering.
IEEE Trans Fuzzy Syst 2003, 11:173186. Publisher Full Text

CarreiraPerpinan M: A review of dimension reduction techniques.
Technical Report CS9609, Department of Computer Science, University of Sheffield, Sheffield, UK 1997.

Collins M, Dasgupta S, Schapire RE: A generalization of principal component analysis to the exponential family.

Schein A, Saul L, Ungar L: A generalized linear model for principal component analysis of binary data.
Proc of the Ninth International Workshops Artificial Intelligence and Statistics 2003.

Tipping M: Probabilistic visualisation of highdimensional binary data. In Advances in Neural Information Processing Systems. Edited by Michael J, Kearns MN, Solla SA. MIT Press, Cambridge, MA; 1999:592598.

Jornsten R, Yu B: Simultaneous gene clustering and subset selection for sample classification via MDL.
Bioinformatics 2003, 19:11001109. PubMed Abstract  Publisher Full Text

Xu G, Zha H, Golub G, Kailath T: Fast algorithms for updating signal subspaces.
IEEE Trans Circuits Syst II: Analog and Digital Signal Processing 1994, 41:537549. Publisher Full Text

Hastie T, Tibshirani R, Eisen MB, et al.: "Gene shaving" as a method for identifying distinct sets of genes with similar expression patterns.

MacQueen J: Some methods for classification and analysis of multivariate observations.

Calinski RB, Harabasz J: A dendrite method for cluster analysis.

Pollard D: Quantization and the method of kmeans.
IEEE Trans Inform Theory 1982, 28:199205. Publisher Full Text

Fridlyand J, Dudoit S: Application of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method.
Technical Report 600, Department of Statistics, University of California, Berkeley, CA 2001.

Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic.
JR Stat Soc Ser B 2001, 63:411423. Publisher Full Text