Emerging technologies now make it possible to genotype hundreds of thousands of genetic variations in individuals, across the genome. The study of loci at finer scales will facilitate the understanding of genetic variation at genomic and geographic levels. We examined global and chromosomal variations across HapMap populations using 3.7 million single nucleotide polymorphisms to search for the most stratified genomic regions of human populations and linked these regions to ontological annotation and functional network analysis. To achieve this, we used five complementary statistical and genetic network procedures: principal component (PC), cluster, discriminant, fixation index (FST) and network/pathway analyses. At the global level, the first two PC scores were sufficient to account for major population structure; however, chromosomal level analysis detected subtle forms of population structure within continental populations, and as many as 31 PCs were required to classify individuals into homogeneous groups. Using recommended population ancestry differentiation measures, a total of 126 regions of the genome were catalogued. Gene ontology and networks analyses revealed that these regions included the genes encoding oculocutaneous albinism II (OCA2), hect domain and RLD 2 (HERC2), ectodysplasin A receptor (EDAR) and solute carrier family 45, member 2 (SLC45A2). These genes are associated with melanin production, which is involved in the development of skin and hair colour, skin cancer and eye pigmentation. We also identified the genes encoding interferon-γ (IFNG) and death-associated protein kinase 1 (DAPK1), which are associated with cell death, inflammatory and immunological diseases. An in-depth understanding of these genomic regions may help to explain variations in adaptation to different environments. Our approach offers a comprehensive strategy for analysing chromosome-based population structure and differentiation, and demonstrates the application of complementary statistical and functional network analysis in human genetic variation studies.
Keywords:discriminant analysis; principal component analysis; cluster analysis; fixation index; FST; population structure; gene network
The comprehensive identification and control of population genetic structure and dissection of polymorphism are important steps in genomic studies aimed at gene mapping through (either directly or indirectly) linkage disequilibrium (LD) [1-4]. Previous estimates of population structure have provided tremendous insight into population genetics and human evolution, and have increased our knowledge of the distribution of genetic variation and relationships among human populations [5-8]. Until recently, however, these studies have been based on limited numbers of loci/genes or small fractions of the genome and thus have provided only average estimates of quantities such as fixation index (FST) across whole genomes and populations .
The study of a few genes with significant population differentiation can be extremely efficient -- for example, in screening for potential tumour markers or drug targets. Such analyses do not reach the full potential of genome-wide experiments to increase our understanding of whole biological processes, however. What is needed instead is a holistic approach to analysing the entire genome which shows major population differentiation and allows biologists to develop an integrated understanding of the functional networks/pathways driving population diversity . Pääbo  suggested that, in variation studies, rather than 'populations', 'ethnicities' or 'races', a more efficient approach for studying within- and between-individual human chromosomal variation is to consider the genome of any particular individual as a mosaic of haplotype blocks.
To date, analyses of the relationship between genetic variation and ancestral geographic origin have been limited to a few regions or genes because large-scale, genome-wide single nucleotide polymorphism (SNP) data from geographically diverse individuals have not been available. Given that levels of diversity/polymorphism are directly related to recombination (meiosis) and mutation rates that differ within and among chromosomes, and that genes are not randomly distributed along chromosomes, the precise genes contributing to disease development and genealogy are not the same on each chromosome or part thereof [13-19].
Recently, Grimwood et al.  showed that the gene density on chromosome 19 is more than double the genome-wide average. Hence, the precision of equal segregation estimates of recombination fraction for all chromosomes and population-average values are not reliable, in terms of chromosome biological and evolutionary significance. The Santa Cruz Biotechnology group also announced a human chromosome database that features a chromosome-based index, which includes the chromosomal location of known human genes and links to the National Center for Biotechnology Information (NCBI) mRNA, protein and Online Mendelian Inheritance in Man (OMIM) databases (http://www.scbt.com webcite).
With the growing emphasis on dense SNPs and genome-wide association studies, and the recent accumulation of large, publicly available data-sets [21,22] -- such as the completion of HapMap, with over 3.7 million SNPs across the genome -- there is an increasing need not only for fine-scale resolution of clines of population structure, but also to identify functional pathways in genomic regions of major population differentiation with influences on disease risk . The thorough evaluation of the extent of fine-scale genetic structure among closely neighbouring populations, as well as the study of the ability to infer individual membership down to a particular population within a continent, have only begun in the past five years.
The objectives of this study were as follows: (i) to examine the extent and patterns of within-and between-chromosomal variations; (ii) to determine population genetics structure and population membership and (iii) to identify SNPs with major population differentiation and link this information with ontological annotation and functional networks/pathways.
Materials and methods
Data mining, processing and description
We downloaded the HapMap SNP data (http://www.hapmap.org webcite, release # 24, on NCBI B36 assembly, dbSNP b126). The HapMap project contains genotypes from 60 unrelated Caucasians from the USA with northern and western European ancestry (CEU), 60 unrelated Yoruba individuals from Ibadan, Nigeria (YRI), 45 Japanese individuals from Tokyo (JPT) and 45 Han Chinese individuals from Beijing (CHB) (http://www.hapmap.org webcite). Two criteria were used to filter the SNPs included in the analysis: (i) locus call rate ≥95 per cent (ie we excluded all SNPs with more than 5 per cent missing data) and (ii) the SNP should be shared among populations so that the same sets of SNPs were used throughout in the population comparisons. A computer program using Python (http://www.python.org webcite) was written to export and pre-process SNP genotype information from the databases. Genotypes were summarised for each population. For each dataset, the number of alleles per locus (SNP) was coded to a string of numbers to obtain a full design matrix of alleles (the cells give the number of copies of each major allele for each individual: zero, one or two). Figure 1 depicts our approach to SNP mining, multivariate chromosomal and population diversity and network analysis strategies. Of the total 3.7 million SNPs in the HapMap data release,[23,25] 809,000 SNPs fulfilled the criteria and were used in this analysis.
Figure 1. Schematic presentation of single nucleotide polymorphism (SNP) mining, multivariate chromosomal and population diversity and network analysis strategies. There are ~3.7 million SNPs in the HapMap data release. Genotypes were summarised for each population. For each dataset, the number of alleles per locus (SNP) was coded to a string of numbers to obtain a full design matrix of alleles (the cells give the number of copies of each major allele for each individual: 0, 1 or 2). Two criteria were used to filter the SNPs included in the analysis: (i) locus call rate ≥ 95 per cent (ie we excluded all SNPs with more than 5 per cent missing data); and (ii) the SNP should be shared among populations, so that the same sets of SNPs were used throughout in the population comparisons. From the total of ~3.7 million SNPs in the HapMap data release, only 809,624 SNPs were eligible for analysis.
Multivariate statistical techniques (namely, principal component [PC], cluster, discriminant, network analyses and FST statistics) were used to examine chromosomal structure within and between populations and associated functional networks by estimating chromosomal overall differentiation values. The analysis was carried out either using all SNPs together or separately for each chromosome. Because PC analysis (PCA) does not take into account group differences in reducing the dataset to a few representative variables, and it can be difficult to make appropriate inference about population relationships from the PC scatter plot, we further analysed the data using cluster analysis (CA) to classify individuals into mutually exclusive groups with high homogeneity within clusters and with low homogeneity between clusters. In other words, CA provides a visual assessment and identifies individuals who are similar (or dissimilar) to one another. To further confirm the grouping obtained in CA, discriminate analysis (DA) was performed. DA consists of the separation of a priori given classes for each individual. The variance-covariance between classes is maximised and the variance-covariance within classes is minimised under simultaneous consideration of all analysed data.
PCA was done using the EIGENSOFT software package (http://genepath.med.harvard.edu/~reich/Software.htm webcite) either on all SNPs simultaneously (all loci together) or separately per each chromosome. The analysis follows singular value decomposition, a procedure that produces eigenvectors, corresponding eigenvalues and proportions of eigenvalues, as well as the scores of the PCs . Using PCA, we estimated axes of variation corresponding to ancestry. The first eigenvector separates the samples in a way that explains the largest amount of variability, while the second and subsequent ones explain lesser amounts of variability. The spatial relationships of populations in each chromosome and all chromosomes were presented by plotting the scores of the first and second PCs. The numbers of significant PCs (at the level of p < 0.05) were tested using Tracy-Widom statistics. Pairwise population genetic diversity was determined by calculation of Wright's FST using EIGENSOFT. FST values indicate how much of the genetic variability between individuals from different populations is due to population affiliation.
Hierarchical clustering of molecular variance was followed using the similarity for qualitative data (SIMQUAL) module with the first 10 PCs that account for most of the variation. Average taxonomic distance matrices (DIST) were computed as a measure of genetic distance. This matrix was subjected to unweighted pair-group method analysis (UPGMA) to generate a dendrogram using the Sequential, Agglomerative, Hierarchical and Nested (SAHN) module. Both numerical taxonomic analyses were performed using the Numerical Taxonomy and Multivariate Analysis System Program, version 2.11f (NTSYS-pc) . The cophenetic correlation coefficient was calculated, and Mantel's test  was performed, to check the goodness of fit of a CA.
In DA, a linear combination of features that best separates two or more groups of objects is sought. The discriminant functions are determined based on the maximisation of the ratio of the external (between populations) to the internal (between individuals within the same population) variability . The values of Wilks' lambda (λ) and their X2 statistics are used to evaluate the number of significant discriminant functions. In turn, to determine the most important features of the objects, partial Wilks' λ and its Fisher statistics were utilised . Discriminant function analysis  was done following the SAS system  DISCRIM, CANDISC and STEPDISC procedures, and significance was tested using Wilks' λ . In order to avoid the limitation of a large number of alleles compared with the number of observations and the correlation occur in allele frequencies, we ran discriminant analysis using the uncorrelated SNPs in the top significant PCs. This ensures that variables submitted to DA are perfectly uncorrelated and that their number is lower than that of analysed individuals. Linear discriminant analysis is similar to logistic regression and is useful for building a predictive model of group membership based on observed characteristics. The procedure yields a set of discriminant functions based on the linear combinations of variables that provide the best discrimination between groups.
In the final set of analyses, a dataset containing a total of 126 genomic regions linked to SNPs that differed between populations (FST ≥ 0.5) was uploaded into the Ingenuity Pathways Analysis (IPA) 8.7 network analysis (Ingenuity Systems, Redwood City CA, USA). The network generated from the 126 input genes (called focus genes) uses both direct and indirect relationships/connectivity. These networks were ranked by scores that measured the probability that the genes were included in the network by chance alone. Networks with scores of three or more were classified as not being generated by random chance . The significance threshold for Fisher's exact test to determine the probability that each biological function and/or disease assigned to that network is due to chance alone was 0.05 or less. Canonical pathways associated with input genes were elucidated with a statistical significance value. The gene ontology (GO) analysis was used to identify functional commonalities between the genes based on the number of shared ancestors in gene products (http://gostat.wehi.edu.au webcite).
Estimates of FST differ between chromosomes and populations
The empirical genome-wide distribution of FST showed heterogeneity in chromosomal ancestry across the genome (Figure 2). The average FST values for autosomes and sex chromosomes were significantly different (0.120 and 0.210, respectively; t-test, t = 16.1, p < 10-15). The higher average FST for chromosome X compared with autosomes might indicate differences in inheritance mechanisms that potentially affect sex chromosomes and autosomes differently [34-36]. Similarly, statistically significant differences in FST estimates (FST analysis per chromosome- FST_SNP_CHROM) among autosomal chromosomes were detected. The variability that drives FST distribution among autosomes could be due to variations in natural selection and/or recombination rates during meiosis . Based on Wright's qualitative guidelines, FST statistics range from 0 (no differentiation) to 1 (fixed difference between populations for different alleles). Values of FST less than 0.05 represent low or little population genetics differentiation, values between 0.05 and 0.15 represent moderate population divergence, values between 0.15 and 0.25 indicate large population differentiation and FST values greater than 0.25 represent very large population divergence . Usually, an FST > 0.5 is considered sufficient for ancestry differentiation.
Figure 2. Pairwise FST chromosomal and population comparisons of the HapMap SNP dataset. A simple measure of population differentiation is Wright's FST, which measures the fraction of total genetic variation due to between-population differences. It could also represent a matrix of pairwise net distance (divergence) among the population.
The global pairwise FST value estimated for the 210 worldwide samples using all loci together (FST_SNP_ALL) was 0.130 (p < 10-6). As FST increases, populations become more distant and/or unrelated to each other . We observed an average genetic differentiation between CEU and YRI (FST = 0.153), CEU and CHB (FST = 0.110), CEU and JPT (FST = 0.111), YRI and CHB (FST = 0.190) and YRI and JPT (FST = 0.192) (Figure 2). It is evident that more divergence has occurred between YRI and each of the three other populations than between the other pairs of populations. Genetic distances between CHB and JPT populations were low (mean FST = 0.007), indicating that substantial gene flow compensates for the effects of genetic drift. This low FST value as a result of high similarities in allele frequencies between CHB and JPT samples motivates researchers to analyse CHB and JPT populations jointly, as a single panel .
Regardless of the populations compared, most of the variation was observed within populations (average, 87 per cent versus 13 per cent variation observed between populations). Within-population diversity reflects the number of different types in the population, taking into account their frequencies. By contrast, between-population differentiation measures variation based on the relative frequencies of types within these subpopulations and, ideally, measures the average distance of subpopulations from their respective lumped remainders. The fact that only 13 per cent of the total genetic variation results from differences between populations indicates that alleles present in one population are also present in other populations [40-43]. The remaining 87 per cent represents the average difference between members of the same population. One way to interpret this number is to say that the expected genetic difference between unrelated individuals from distant continents exceeds by 13 per cent the expected difference between members of the same community . An interesting common feature in population genetics studies of humans, animals, plants and other types of species is that within-population diversity is greater than between-population diversity [45-48]. This estimate is highly consistent for protein polymorphisms, blood groups, microsatellites, SNPs and morphological/phenotypic markers [49,50]. Therefore, it is necessary to quantify and control population structure, not only for major population differences, but also for subtle variation/structure arising within populations.
Significant numbers of PCs vary by chromosome
As shown in Figure 3, the first component -- which accounted for 50.2 per cent of variation -- separated YRI populations from CEU and CHB/JPT, while the second (accounting for 24.2 per cent of the total variance) could be associated with an Africa/Europe gradient. The overlap between the CHB and JPT populations suggests a low level of genetic differentiation, as shown by the pairwise FST divergence (Figure 2). Overall, these results demonstrate a clear partition of the West African populations considered. On a genome-wide average scale, about 74 per cent of the diversity in human samples was explained by the first two PCs. The eigenvalues for PC3-PC10 showed a plateau, suggesting that the first two PCs account for most of the populations' average substructure in this analysis. Such a genome-wide level of population structure may lead to an erroneous conclusion that the samples are genetically homogeneous. Thus, correction for population structure is only as good as the level of structure (at a finer or coarser level) that one wishes to correct.
Figure 3. Plot for the first two principal components (PCs) for HapMap individual for the genome-wide average shows the relationships between human populations in terms of their geographical origin. On a genome-wide average scale, about 74 per cent of the diversity in human population was explained on the basis of the first two PCs.
In order to obtain fine-scale resolution of population membership, PCA was performed on each chromosome. Our analysis showed that the contribution of the first two PCs in classifying geographical regions varied among chromosomes, ranging from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranged from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranged from 18 per cent (Chr X) to 27 per cent for Chr 15 (Supplementary Figure S1 (Figure 6)). As shown in Figure 4, on a finer scale, the number of significant PCs accounting for population differentiation varies from 2 (Chr 2) to 31 (Chr X) among chromosomes. The higher number of significant PCs on Chr X explains why it has the lowest chromosome-wise contribution to the first two PCs (Supplementary Figure S1 (Figure 6)).
Figure 4. Significant numbers of PCs among chromosomes in the HapMap dataset. On a finer scale, the number of significant PCs that account for population differentiations vary from 2 to 31 among chromosomes.
Figure S1. Chromosome-wise principal component analysis (PCA) analysis of the entire HapMap dataset. The first PC accounted for more than double the variance of the second PC. The level of contribution of the first two PCs across chromosomes in classifying geographical regions are presented here. The chromosome-wise contribution of the first two PCs ranges from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranges from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranges from 18 per cent (Chr X) to 27 per cent for Chr 15.
We next characterised the genetic relationships existing among the four different populations. The diagrammatic output of CA (constructed from PCs) for the mean of 210 individuals indicated that these individuals could be clustered into groups that basically coincided with their geographical distribution (data not shown). The analysis confirmed the distinctiveness of the CEU and YRI populations and the close average genetic distance between the CHB and JPT populations (Supplementary Figure S2 (Figure 7)). The results of the chromosomal-based CA (data not shown) were comparable to those of the PCA, and both methods classified racial populations into separate groups.
Figure S2. Unweighted pair-group method analysis dendrogram (a branching diagram used to show the relationships between members of a group) based on average taxonomic distance matrices among population means of HapMap SNP datasets. The cluster analysis (CA; constructed from principal components) for the mean of 210 individuals indicates the distance at which the various groups are formed and join together. CA, which is based on the means for all individuals from each geographical origin, was used to obtain similarities among individuals according to their correlation measures across all SNP datasets. Branch height represents dissimilarity. Note that, compared with YRI and CEU branch height, the CHB and JPT branch height is much shorter, representing that the genetic distance between these two populations is relatively close.
DA predicts population membership for 70 per cent of individuals
Although the overall population differentiation between the CHB and JPT populations appeared low using PCA and FST analysis, DA  indicated that ~30 per cent of the two populations were misclassified (Table 1). Thirty of the CHB individuals (n = 45) were correctly classified (67 per cent), while among the JPT individuals (n = 45), 38 were correctly classified (Table 1). The European and African populations were 100 per cent correctly classified to their respective groups. The classification matrix presented in Table 1 summarises the predictive ability of DA when each sample is assigned to a particular geographical region. Given the growing interest in high-density whole-genome association and admixture studies, DA is informative because misclassified individuals can be identified and assigned to their appropriate ancestral populations. Levels of correct and incorrect classification of human populations to their geographical regions of origin differed for each chromosome (Supplementary Table 1). For example, correct classifications of the 45 CHB individuals to their geographical regions of origin ranged from 23 (for Chr 6) to 35 (for Chr X). Chr 9 had the highest number of misclassified individuals and Chr X had the lowest. On the other hand, correctly classified individuals in the JPT population ranged from 25 (for Chr 9) to 36 (for Chr 19). Chr 18 had the lowest and Chr 9 the highest number of misclassified individuals in the JPT population. Chr 9 had the highest number of misclassifications in both the CHB and JPT populations. The variation in misclassification rate might indicate the existence of mosaic chromosomal blocks derived from other populations as a result of common ancestry or admixture. The use of more PCs might improve classification accuracy.
Table 1. Classification matrix for HapMap individuals based on SNP markers using DA
The summarised results of a stepwise DA to select variables with the most classification power are shown in Table 2. Wilks' λ and associated F-values are used as indices of discriminatory power and are presented for each successive step. To test the statistical significance of the discriminant function, the value of Wilks' λ (~0.00) was assessed . The Wilks' λ test showed that the ratio of the within-groups sum of squares to the total sum of squares was significant (Wilks' λ 9.53E-5, p < 0.001); thus, the null hypothesis of CEU = CHB = JPT = YRI was rejected.
Table 2. Stepwise order of inclusion of variables in the DA that distinguishes between human populations
The eigenvalue is the ratio of the between-groups sum of squares to the within-groups sum of squares . This value measures the spread of the group centroids in the dimension of multivariate space (eigenvalue 10587.53; p < 0.0001). The canonical correlation measures the association between discriminant scores and groups. This association appeared to be statistically significant (stepwise elimination, R2 = 0.99, 0.96; p < 0.0001), and the data were subjected to the stepwise procedure. The first canonical discriminant function had a high eigenvalue, accounting for more than 81 per cent of the total variance. The first and second functions together accounted for 100 per cent of the variance.
Functional networks and pathways in highly stratified genomic regions
To characterise the main functional networks/pathways underlying genes with substantial population differentiation, we carried out network analysis (see Materials and methods) for between-population comparisons (between CEU-YRI, CEU-CHB/JPT and YRI-CHB/JPT). A summary of networks, molecular and cellular functions, diseases and disorders and canonical pathways associated with the genomic regions are presented in Table 3. A total of 126 genes were significantly differentiated among populations and eligible for network analysis, which led to the identification of five significant networks (Figure 5). Network 1 was centred on the nuclear factor (NF)-κB complex and had 19 focus genes; network 2 was centred on tumour necrosis factor (TNF) and had 14 focus genes; network 3 was centred on v-myc myelocytomatosis viral oncogene homologue (MYC) and had 13 focus genes; network 4 was centred on interferon-γ (IFNG) and had 12 focus genes; and network 5 was centred on hepatocyte nuclear factor-4α and had 12 focus genes. Interestingly, although no genes were shared among all these five different networks, two networks (networks 1 and 2) contained chloride intracellular channel 4 (CLIC4), which plays a role in apoptosis, differentiation and diabetes. The overlap of this gene suggests that similar biological pathways were targeted by selection in these populations. In addition, the gene involved in skin, hair and eye pigmentation -- including oculocutaneous albinism II (OCA2), hect domain and RLD 2 (HERC2), ectodysplasin A receptor (EDAR) and solute carrier family 45, member 2 (SLC45A2) were over-represented in our GO analysis (http://gostat.wehi.edu.au webcite) (Table 4). Enriched GO biological function terms include cytoskeletal protein binding (p = 1.1 × 10-6), actin binding (p = 1.0 × 10-6) and fibroblast growth factor receptor antagonist activity (p = 6.2 × 10-5).
Table 3. IPA summary of associated networks, molecular and cellular functions, diseases and disorders and canonical pathways for the 126 genes mapped to significantly differentiated genomic regions
Figure 5. IPA network analysis for 126 genes mapped to significantly differentiated genomic regions. Genes with red nodes are focus genes in our analysis, the others are generated through the network analysis from the Ingenuity Pathways Knowledge Base (http://www.ingenuity.com webcite). Edges are displayed with labels that describe the nature of the relationship between the nodes. The lines between genes represent known interactions, with solid lines representing direct interactions and dashed lines representing indirect interactions. Nodes are displayed using various shapes that represent the functional class of the gene product.
Table 4. Gene Ontology analysis for the 126 genes mapped to significantly differentiated genomic regions
In conclusion, our approach offers a complementary statistical strategy for summarising overall variability and global versus chromosomal structure, assessing population structure and identifying genomic regions driving genetic divergence among populations. We first used PCA (to reduce data dimensionality); however, because PCA does not take into account group differences in reducing the dataset to a few representative variables, we further analysed the data using CA to classify individuals into mutually exclusive groups with high homogeneity within clusters and low homogeneity between clusters. To further confirm and predict group membership, DA was performed using the top significant PCs. PCs were used to ensure that variables submitted to DA were perfectly uncorrelated, and that their number was lower than that of analysed individuals. Finally, using FST (to study population differentiation) analysis, we described the importance of chromosome-based population genetic structure to identify differing genomic regions driven by natural selection. We followed the target genomic regions using network/pathway analysis to elucidate their roles and functional implications in human genetic variations and diseases.
Although most researchers traditionally focus on the top few axes of variation in a dataset, substantial information about population structure exists in lower-ranked chromosomal level PCs. Adjustment of global ancestry between study subjects may lead to false positives when chromosomal (local) population structure is an important confounding factor . Using chromosome-based analysis, fine-scale substructure was detectable beyond the broad population level classifications that previously have been explored using genome-wide average estimates in this dataset. The study of population structure in terms of chromosomes has broader practical relevance to researchers who use genetics and genomics approaches in gene mapping because genetic diversity is directly related to recombination rate (meiosis), which differs among chromosomes, and genes are not randomly distributed along chromosomes.
By restricting our analysis to each chromosome independently, instead of using global average estimates, we have reported for the first time that the number of fine-scale subpopulations is chromosome dependent. For example, chromosome 2 has two significant PCs which account for population differentiation, whereas chromosome X has 31. This result suggests that one has to examine a large enough number of PCs in order to find all the significant population differences. Thus, the variation in the number of chromosome-specific significant PCs might indicate the detection of a population structure that could have been missed if the average of all chromosomes was used. Even though chromosome 1 is the largest chromosome, followed by chromosome 2, the number of significant PCs that account for structure is lower in both of these chromosomes than in the rest of the chromosomes, indicating that genome size does not correlate with the biological complexity of organisms . Interestingly, similar results were reported by Becquet et al  in their study of chimpanzee population genetics structure. In plants, a recent study showed that the optimal number of subpopulations required to correct population structure is trait dependent . This study reminded us that the number of subpopulations for one trait may not be optimal for other traits. The current analytical approach using genome-wide average PCs as a covariate will control for confounding due to global ancestry but will not control for confounding due to the local (chromosome-based) ancestry effect. It is increasingly important to recognise intra-chromosomal variation, especially when populations have been recently admixed.
Similar to the results of chromosome-based PCA analysis, DA shows that the classification of populations to their correct geographical regions of origin is chromosome dependent. For example, in our analysis, the number of CHB individuals correctly classified to their geographical regions of origin ranged from 23 (for Chr 6) to 35 (for Chr X), while correctly classified individuals in the JPT population ranged from 25 (for Chr 9) to 36 (for Chr 19). Given the growing interest in tracing ancestral origins or contributions in genetically mixed populations, DA is informative and appealing because misclassified individuals can be identified and grouped into appropriate populations prior to large-scale genotyping.
To avoid single-marker FST-based inferences for selection, which can be misleading, we ran an in-depth investigation of the patterns of genetic variation in and around the highly differentiated loci and their effects on the phenotype using network/ontology analyses. We overlaid 126 genes (selected based on FST > 0.5) onto the Ingenuity Pathways Knowledge Database (http://www.ingenuity.com webcite). Using this analytical approach, we confirmed the over-representation of genes implicated in hair and skin development (OCA2, HERC2, EDAR and SLC45A2) in two of the top networks (Table 3). EDA-A1 and EDA-A2 are two isoforms of ectodysplasin that are encoded by the anhidrotic ectodermal dysplasia (EDA) gene. Genetic variability in the EDA ligand has been associated with loss of hair, sweat glands and teeth . The non-synonymous SNP rs1385699 identified within the EDA2 receptor gene (EDA2R) is fixed in both Asian populations, where as an R57K substitution in EDA2R has derived-allele (T) frequencies of 100 per cent. The EDA2R gene product is involved in the positive regulation of NF-κB transcription factor activity specifically within the hair follicle, TNF receptor activity, embryonic development and apoptosis . These genes were previously reported as candidates involved in human pigmentation phenotypes and in the development of skin cancer . The most striking difference provided by our more direct approach was the over-representation of canonical pathways related to androgen and oestrogen metabolism (Supplementary Figure S3 (Figure 8)) and gene groups implicated in the functional category of inflammation, as well as hair and skin development (Figure S4 (Figure 9)).
Figure S3. Global canonical pathways of the 126 genes linked to genomic regions of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. The first four sets of functions shown represent a p-value of less than 0.01. Bars that are above the line indicate significant enrichment of a pathway.
Figure S4. The 16 most significant functional categories from IPA linked to the 126 genes of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. Bars that are above the line indicate significant enrichment of a function.
In critically evaluating our results, it is important to note that our analyses, and hence interpretations, are subject to several limitations. First, an important caveat in the use of population-level genetic databases such as HapMap is the ascertainment criterion that was imposed during the initial selection of polymorphic SNPs to be assayed, and the subsequent release of the HapMap database primarily focused on SNPs that were common. The fundamental theorem underpinning HapMap is the common disease common variance (CD/CV) hypothesis .
Secondly, the HapMap study (Phase III) is currently being extended to include additional samples and diverse populations (http://www.hapmap.org webcite). The number of SNPs genotyped in Phase III is substantially fewer (~1.5 million SNPs) than in the present study, however, thereby providing less density and coverage. Such low coverage may miss important loci in regions of elevated molecular divergence in related populations, such as between CHB and JPT . When whole-genome sequences (such as (http://www.1000genomes.org webcite)) become widely available, the ability to use many rare variants to identify short shared genomic segments will perhaps allow routine identification of geographical regional or village-level ancestries, given a suitably large and carefully collected reference sample [65,66]. The 1000 Genomes Project, which aims to provide a whole-genome sequence resource for at least 1,200 individuals sampled from multiple population groups globally, will be invaluable for understanding the practical consequences of SNP ascertainment biases.
Thirdly, a SNP with a large difference in allele frequency between populations is a strong candidate to explain large differences in disease prevalence between populations [67,68]. This is because disease is tightly linked to survival and reproductive success, and genes responsible for variation in disease should have the most differentiated SNP frequencies between human populations. Indeed, studies have suggested that genes associated with complex diseases such as cardiovascular disease and type 2 diabetes have been targets for positive natural selection . If disease genes have often been targeted by selection, then identifying loci that have experienced selection may aid in disease-related research . Further studies are required to determine the extent to which differences in allele frequencies between populations predict disease prevalence differences between populations, however.
The study of population genetic structure between chromosomes is a fundamental issue in population biology because it helps us to obtain a deeper understanding of the ancestral population and associated evolutionary processes. For example, understanding heterogeneity in chromosomal ancestry in an admixed population is important because it can be a confounding factor when variation in admixture levels among individuals across chromosomes causes false-positive associations in genetic association studies. In addition, this analysis can be a source of statistical power for ancestry -- phenotype correlation studies that use observed racial/ethnic differences to find mosaic regions of the genome and map loci influencing complex phenotypes . The distribution of SNP density along chromosomes will inform us about chromosomal segments that are more susceptible to selective pressures or differential patterns. Understanding how chromosomal variations in ancestry relate to disease risk is a major challenge to the biomedical research community . Particularly, in the USA, there has been a significant intermixing among racial/ethnic groups, thereby creating a complex pattern of ancestral populations which are a mosaic of multiple continental populations. The development of population structure adjustment based on chromosome will provide higher-resolution genographic maps and offer investigators designing genetic association studies more powerful tools for detecting stratification.
The final question we need to answer is, what causes population differentiation? Humans have wide altitudinal and latitudinal distribution ranges, and hence, different individuals may face very different environmental constraints and selection pressures. Population differentiation could arise as a result of geographical separation and subsequent drift and/or bottlenecks; natural selection (ie the local adaptation process by which organisms become adapted to their environments); differential admixture with other populations; and (possibly) different mutation rates (eg differential exposure to ionising radiation, environmental toxins, etc.). A central theme in evolutionary biology is that natural selection acting on heritable phenotypic variation will result in adaptation and differentiation among local populations inhabiting environments differing in their selective regimes . Natural selection may confer an adaptive advantage to individuals in a specific environment if an allele provides a competitive advantage. Alleles under selection are likely to occur only in those geographical regions where they confer an advantage. Alleles associated with harmful traits decrease in frequency, while those associated with beneficial traits become more common. Local adaptation acting in concert with other processes (eg recombination) is sufficiently pervasive to confound measurements of population differentiation, making a single such genome-wide measurement somewhat unreliable, especially when applied to any specific chromosome or region.
In summary: population differentiation, at a genetic level, is the result of numerous processes; differentiation is measurable and quantifiable by a variety of approaches; and most of the processes leading to differentiation affect all autosomes equally, except for natural selection, which leads to extreme values that reflect local adaptation due to natural selection. We also note that rather than some 'normal' distribution of FST values, with exceptional values occasionally reflecting natural selection, there is substantial inter-chromosomal variation in the inferred patterns and characteristics of population structure. These inter- and intra-chromosomal variations, either across the genome as a whole or along single chromosomes, may directly affect population divergence. This study underlines the potential of chromosome-based analysis of genome-wide data to quantify substructure in populations that might otherwise appear relatively homogeneous. Before embarking on a large-scale genomic study, proper control of chromosome-wise stratification/confounding, predicting population memberships is crucial.
Principal component analysis (PCA)
PCA was performed using a correlation matrix, which was then subjected to eigenvector analysis to extract the principal components (PCs) that can summarise the variation in a data matrix X, consisting of N rows (samples) and K columns (variables: single nuclear polymorphisms [SNPs]), in terms of a few underlying and informative scores or latent variables . The X-matrix is decomposed as the product of two matrices, the (N × A) score matrix, T, and the (A × K) loading matrix, P', where A is the number of PCs, plus a (N × K) 'noise' matrix of residuals E.
where T is the score matrix summarising the X-variables and P' is the loading matrix showing the influence of the variables on the projection model. E is the residual matrix expressing the deviations between the original values and the projections. In general, PCA transforms a number of correlated allele frequencies into a smaller number of uncorrelated synthetic variables, or PCs.
Cluster analysis (CA)
For the HapMap SNP dataset that was found to be polymorphic among CEU, CHB, JPT and YRI samples, Pearson correlation coefficients were computed for the 210 individuals. The individuals were then grouped by a hierarchical clustering algorithm using the average linkage method, which was implemented using NTSYS v2.1 software . The genetic distance between each pair of individuals, m and m', was summarised by the allele-sharing method, D (m, m'), as follows:
where l is the number of loci for which both individuals have been tested.
Discriminant function analysis
Discriminant functions based on population grouping were obtained by the stepwise inclusion of SNPs to minimise Wilks' lambda (λ) between groups, as described by Rechner  and as follows: L = b1x1 + b2x2 + b3x3 ... + bzxz; where x1 through xz represent the various predictor variables (SNPs); b1 through bz represent the weight associated with each of the predictor variables; and L is the object's resultant qualitative discrimination score, with a cut-off score to assign objects to one group or another. Objects with L > × are assigned to one group, and those with L < × are assigned to another group, based on allele frequency differences. L represents classifying variables.
Fixation index (FST) estimates between Populations
Global FST values for pairwise population comparisons were calculated using genome-wide SNP allele frequency variances estimated from the unrelated individuals in each HapMap population (CEU, CHB, JPT and YRI), following Wright . The formula used was as follows:
where is the average allele frequency (over all populations) of the i-th allele, m is the number of alleles and Fi is the value of FST for each allele. SNP-specific FST measures of population genetic differentiation based on allele frequencies in two populations, a metric of variation within a population versus between populations, are outlined below, following McKeigue . In this formula, p1 and p2 denote the frequencies of a particular allele in population 1 and population 2, respectively.
Network and gene ontology analysis of genes showing differentiation between populations
Ingenuity Pathways Analysis (IPA) was used to organise genes showing evidence of selection into networks of interacting genes and to identify pathways containing functionally related genes . More precisely, network analysis consists of searching for direct and indirect interactions between candidate genes and all other molecules (genes, gene products or small molecules) contained in the Ingenuity Pathways Knowledge Base (IPKB). The complete list of gene identifiers was uploaded into IPA, and each was mapped to its corresponding IPKB gene object. Candidate genes are eligible for network generation if there is at least one wild-type IPKB interacting molecule. Based on the information available for eligible candidate genes (focus genes), IPA further constructs networks by maximising the number of focus genes and their inter-connectivity within the limit of 35 molecules per network. Note that additional highly connected non-focus molecules are also included. Finally, for each network, a right-tailed Fisher exact test is implemented to evaluate how likely it is that the focus genes it contains might be found together by chance. Only those networks with a score (-log[p value]) greater than three were considered as significant . In addition, networks might be inter-connected (ie sharing at least one molecule), which strengthens the importance for the underlying biological functions. Networks are graphically represented by nodes with various shapes (according to the molecule type) and edges (according to their biological relationships).
The likelihood for a gene pair to be regulated in the same manner increases with the similarity of their gene ontology (GO) description. The GO similarity score between two gene products is based on the number of shared ancestors. As a gene product might be assigned with multiple GO terms, we seek the maximum similarity score between all possible combinations. As we seek to discover gene-gene interactions, we reformulate the GO approach as follows. Let gene i and gene j be assigned hi and hj GO terms, respectively. Then, the GO similarity for the gene (i, j) pair is taken to be the maximum number of shared ancestors for all combinations of the hi and hj . IPA essentially evaluates the enrichment of particular biological processes and molecular functions of gene sets by examining information collected by databases such as GO, Kyoto Encyclopedia of Genes and Genomes or the IPKB.
Table S1. Discriminant analysis classification accuracy and associated percentage across the genome and population