Sibship T² association tests of complex diseases for tightly linked markers

Fan, Ruzong; Knapp, Michael

doi:10.1186/1479-7364-2-2-90

Primary research
Published: 01 June 2005

Human Genomics volume 2, Article number: 90 (2005)

Abstract

For population case-control association studies, the false-positive rates can be high due to inappropriate controls, which can occur if there is population admixture or stratification. Moreover, it is not always clear how to choose appropriate controls. Alternatively, the parents or normal sibs can be used as controls for affected sibs. For late-onset complex diseases, parental data are not usually available. One way to study late-onset disorders is to perform sib-pair or sibship analyses. This paper proposes sibship-based Hotelling's T² test statistics for high-resolution linkage disequilibrium mapping of complex diseases. For a sample of sibships, suppose that each sibship consists of at least one affected sib and at least one normal sib. Assume that genotype data of multiple tightly linked markers/haplotypes are available for each individual in the sample. Paired Hotelling's T² test statistics are proposed for high-resolution association studies using normal sibs as controls for affected sibs, based on two coding methods: 'haplotype/allele coding' and 'genotype coding'. The paired Hotelling's T² tests take into account not only the correlation among the markers, but also the correlation within each sib-pair. The validity of the proposed method is justified by rigorous mathematical and statistical proofs under the large sample theory. The non-centrality parameter approximations of the test statistics are calculated for power and sample size calculations. By carrying out power and simulation studies, it was found that the non-centrality parameter approximations of the test statistics were accurate. By power and type I error analysis, the test statistics based on the 'haplotype/allele coding' method were found to be advantageous in comparison to the test statistics based on the 'genotype coding' method. The test statistics based on multiple markers can have higher power than those based on a single marker.
The test statistics can be applied not only for bi-allelic markers, but also for multi-allelic markers. In addition, the test statistics can be applied to analyse the genetic data of multiple markers which contain double heterozygotes -- that is, unknown linkage phase data. An SAS macro, Hotel_sibs.sas, is written to implement the method for data analysis.

Introduction

In recent years, there has been great interest in association studies of complex diseases [1–6]. By association studies, we mean linkage disequilibrium (LD) mapping of genetic traits. For population case-control studies, the marker allele frequency in cases can be compared with that of controls using χ² test statistics [7–11]. If there is association between one marker and the trait locus, it is expected that the χ² tests would lead to significant results. Essentially, this method can be applied to analyse the data for one marker at a time. For multiple markers, the linkage phase may be unknown [12], and the method cannot be applied simultaneously to analyse the data of multiple markers which contain double heterozygotes. With the development of dense maps such as single nucleotide polymorphisms (SNPs), haplotype maps and high-resolution micro-satellites in the human genome, enormous amounts of genetic data on human chromosomes are becoming available [13–15]. It is therefore of interest to build appropriate models and useful algorithms for association mapping of complex diseases that can use multiple markers/haplotypes simultaneously.

For tightly linked genetic markers, one may perform association studies of complex diseases based on the Hotelling's T² test statistics [16]. For population case-control data, Xiong et al. proposed two-sample Hotelling's T² test statistics to analyse genotype data of multiple bi-allelic markers such as SNPs [17]; in addition, logistic regression models were proposed [2, 18]. To analyse multi-allelic micro-satellite or haplotype data, Fan and Knapp extended the method of Xiong et al. using two coding methods -- 'haplotype/allele coding' and 'genotype coding' [19]. For the genetic data of nuclear families or parent-offspring pairs, paired Hotelling's T² test statistics were proposed, in order to perform association studies based on multiple markers/haplotypes [20].

For late-onset complex diseases, parental data are usually not available. One way to study late-onset disorders is to perform sib-pair or sibship analyses [21, 22]. This paper proposes sibship-based paired Hotelling's T² test statistics for high-resolution LD mapping of complex diseases. For a sample of sibships, suppose that each sibship consists of at least one affected sib and at least one normal sib. Assume that genotype data for multiple markers are available for each individual in the sample. Paired Hotelling's T² test statistics are proposed for high-resolution association studies, using normal sibs as controls for affected sibs. The paired Hotelling's T² tests not only take the correlation among the markers into account, but also the correlation within each sib-pair. The validity of the proposed method is justified by rigorous mathematical and statistical proofs under the large sample theory. The non-centrality parameter approximations of the test statistics are calculated for power calculations and comparisons; these are included in the section: Supplementary information: Non-centrality parameters. Type I error rates are calculated by simulations to evaluate the performance of the proposed test statistics. In the section: Supplementary information: Simulation study, the results from the simulation study are presented, to show that the non-centrality parameter approximations of the test statistics are accurate. An SAS macro, Hotel_sibs.sas, was written to implement the method and can be downloaded from the authors' website http://www.stat.tamu.edu/~rfan/software.html/.

Methods

We assume that a disease locus D is located in a chromosome region. Suppose that the disease locus has two alleles D and d. Allele D is disease susceptible and d is normal. Assume that the disease-susceptible allele D has population frequency P_D, and the normal allele d has population frequency P_d.

Paired Hotelling's T² test statistics

In the region of the disease locus D, assume that J tightly linked markers H₁, ..., H_J are typed. By tightly linked, we mean that the markers are so close to each other that the recombination fractions among markers are 0. Let us denote the alleles of marker H_j by $H_{j 1}, \dots, H_{j n_{j}}$, where n_j denotes the number of its alleles. Here, markers can be micro-satellites, bi-allelic markers such as SNPs, or haplotypes. If H₁, ..., H_J are phase-known haplotypes, the methods developed in this paper are still valid, since the haplotypes can be treated as markers; but the related terminology needs to be changed accordingly. Usually, haplotypes consist of phase-unknown markers; in these cases, we prefer to analyse the genotype marker data directly, instead of estimating the haplotypes first and then analysing the haplotype data. The method developed in this paper can be used to analyse phase-unknown genotype data directly. Consider N sib-pairs, each consisting of an affected sibling and a normal sibling. We define coding vectors $X_{i}^{(A)}$ and $Y_{i}^{(U)}$ for the affected sibling and normal sibling of the i-th sib-pair, respectively, in one of the following two ways [19, 20].

(i) Haplotype/allele coding: For the affected sibling of the i-th sib-pair, let $G_{i j}^{(A)}$ be his/her genotype at marker H_j. Define $X_{i}^{(A)} = (z_{i 1 1}^{(A)}, \dots, z_{i 1 (n_{1} - 1)}^{(A)}, \dots, z_{i J 1}^{(A)}, \dots, z_{i J (n_{J} - 1)}^{(A)})^{\tau}$, where $z_{i j k}^{(A)}$ is the number of alleles H_jk for the affected sibling of the i-th sib-pair -- that is,

$$
z_{i j k}^{(A)} = \begin{cases} 2 & \text{if } G_{i j}^{(A)} = H_{j k} H_{j k} \\ 1 & \text{if } G_{i j}^{(A)} = H_{j k} H_{j l},\ l \neq k \\ 0 & \text{else.} \end{cases}
$$

Here and hereafter, the superscript τ denotes the transposition of a matrix or a vector. The dimension of $X_{i}^{(A)}$ is $(n_{1} - 1) + \dots + (n_{J} - 1) = \sum_{j = 1}^{J} n_{j} - J$ , which is usually smaller than dimension $\sum_{j = 1}^{J} n_{j} (n_{j} + 1) / 2 - J$ of the following genotype coding method.

(ii) Genotype coding: Note that $G_{i j}^{(A)}$ can be one of n_j(n_j + 1)/2 possible choices: n_j homozygous genotypes H_jkH_jk, and n_j(n_j - 1)/2 heterozygous genotypes H_jkH_jl, k < l. Depending on the genotype, let us define an indicator vector $X_{i j}^{(A)} = (x_{i j 1}^{(A)}, \dots, x_{i j (n_{j} - 1)}^{(A)}, x_{i j 1 2}^{(A)}, \dots, x_{i j 1 n_{j}}^{(A)}, \dots, x_{i j (n_{j} - 1) n_{j}}^{(A)})^{\tau}$. Here, $x_{i j k}^{(A)}$ is the indicator variable of genotype H_jkH_jk, defined by $x_{i j k}^{(A)} = \begin{cases} 1 & \text{if } G_{i j}^{(A)} = H_{j k} H_{j k} \\ 0 & \text{else} \end{cases}$; and $x_{i j k l}^{(A)}$, k < l, is the indicator variable of genotype H_jkH_jl, defined by $x_{i j k l}^{(A)} = \begin{cases} 1 & \text{if } G_{i j}^{(A)} = H_{j k} H_{j l} \\ 0 & \text{else} \end{cases}$. The dimension of $X_{i j}^{(A)}$ is n_j(n_j + 1)/2 - 1 -- that is, the total number n_j(n_j + 1)/2 of genotypes of marker H_j minus 1 to remove the redundancy. Let $X_{i}^{(A)} = (X_{i 1}^{(A)\tau}, \dots, X_{i J}^{(A)\tau})^{\tau}$ be the combined genotype coding of the J markers H₁, ..., H_J. The dimension of $X_{i}^{(A)}$ is $\sum_{j = 1}^{J} n_{j} (n_{j} + 1) / 2 - J$.

For the unaffected sibling of the i-th sib-pair, let $G_{i j}^{(U)}$ be his/her genotype at marker H_j. One may define a vector $Y_{i}^{(U)}$ in the same way, based on either the 'genotype coding' or 'haplotype/allele coding' method. Table 1 in reference 19 gives an example of 'genotype coding' and 'haplotype/allele coding' for a marker with three alleles, to illustrate the above two coding methods.
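The two coding methods can be illustrated with a minimal Python sketch for a single marker. The function names, and the convention of labelling alleles 1, ..., n_j and writing a genotype as an unordered pair, are assumptions made for this illustration; they are not taken from the authors' SAS macro.

```python
from itertools import combinations


def allele_coding(genotype, n_alleles):
    """Haplotype/allele coding: copies of alleles 1..n-1 carried by the genotype."""
    a, b = genotype
    return [(a == k) + (b == k) for k in range(1, n_alleles)]


def genotype_coding(genotype, n_alleles):
    """Genotype coding: indicators of homozygotes 1..n-1, then heterozygotes k < l."""
    g = tuple(sorted(genotype))
    homo = [int(g == (k, k)) for k in range(1, n_alleles)]
    hetero = [int(g == pair) for pair in combinations(range(1, n_alleles + 1), 2)]
    return homo + hetero


# A marker with three alleles (labels 1, 2, 3 are illustrative).
print(allele_coding((1, 1), 3))    # [2, 0]
print(allele_coding((1, 3), 3))    # [1, 0]
print(genotype_coding((1, 3), 3))  # [0, 0, 0, 1, 0]
```

For a marker with three alleles, the allele coding has 2 components and the genotype coding has 5, in line with the dimensions n_j - 1 and n_j(n_j + 1)/2 - 1 given in the text.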

Table 1 Type I error rates of N = 200 or 300 sib-pairs at a significance level α = 0.01 using one marker, H₁, or two markers, H₁ and H₂.

Let $\bar{X}^{(A)} = \sum_{i = 1}^{N} X_{i}^{(A)} / N$ and $\bar{Y}^{(U)} = \sum_{i = 1}^{N} Y_{i}^{(U)} / N$ be average coding vectors of affected and unaffected siblings, respectively. Intuitively, $\bar{X}^{(A)}$ and $\bar{Y}^{(U)}$ should be similar vectors if the disease locus D is not associated with markers H_j, j = 1, ..., J. In the Appendix, we prove that the expected value of $\bar{X}^{(A)} - \bar{Y}^{(U)}$ is 0 if there is no association. Hence, one may build a test statistic based on the difference $\bar{X}^{(A)} - \bar{Y}^{(U)}$ to test the association between disease locus D and markers H_j. To do this, one needs to consider the variance-covariance matrix of $\bar{X}^{(A)} - \bar{Y}^{(U)}$. Since siblings' marker genotypes are related to each other, $\bar{X}^{(A)}$ and $\bar{Y}^{(U)}$ are not independent. Moreover, $X_{i}^{(A)}$ and $Y_{i}^{(U)}$ are paired with each other in a sib-pair. Therefore, paired T² test statistics can be used to test the association between disease locus D and markers H_j as follows. Define a paired-sample variance-covariance matrix by

$$
\begin{aligned}
S &= \frac{1}{N - 1} \sum_{i = 1}^{N} \left[(X_{i}^{(A)} - Y_{i}^{(U)}) - (\bar{X}^{(A)} - \bar{Y}^{(U)})\right] \left[(X_{i}^{(A)} - Y_{i}^{(U)}) - (\bar{X}^{(A)} - \bar{Y}^{(U)})\right]^{\tau} \\
&= \frac{1}{N - 1} \Big[\sum_{i = 1}^{N} (X_{i}^{(A)} - \bar{X}^{(A)}) (X_{i}^{(A)} - \bar{X}^{(A)})^{\tau} - \sum_{i = 1}^{N} (X_{i}^{(A)} - \bar{X}^{(A)}) (Y_{i}^{(U)} - \bar{Y}^{(U)})^{\tau} \\
&\qquad - \sum_{i = 1}^{N} (Y_{i}^{(U)} - \bar{Y}^{(U)}) (X_{i}^{(A)} - \bar{X}^{(A)})^{\tau} + \sum_{i = 1}^{N} (Y_{i}^{(U)} - \bar{Y}^{(U)}) (Y_{i}^{(U)} - \bar{Y}^{(U)})^{\tau}\Big].
\end{aligned}
$$

A paired Hotelling's T² statistic can be defined as $T^{2} = N (\bar{X}^{(A)} - \bar{Y}^{(U)})^{\tau} S^{- 1} (\bar{X}^{(A)} - \bar{Y}^{(U)})$ [16, 23]. Let us denote the above Hotelling's T² statistic for 'haplotype/allele coding' as T_H, and the Hotelling's T² statistic for 'genotype coding' as T_G. Assume that the sample size N is sufficiently large that the large-sample theory applies. Under the null hypothesis of no association, the statistic T_H (or T_G) is asymptotically distributed as central χ² with $\sum_{j = 1}^{J} n_{j} - J$ (or $\sum_{j = 1}^{J} n_{j} (n_{j} + 1) / 2 - J$) degrees of freedom. Under the alternative hypothesis of association, T_H (or T_G) is asymptotically distributed as non-central χ². For power calculation and comparison, the non-centrality parameter of statistic T_H or T_G can be derived under the alternative hypothesis of association.
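As a hedged illustration of the definition above, the following pure-Python sketch computes the paired statistic from the within-pair differences of the coding vectors. It is not the authors' Hotel_sibs.sas macro; the names `solve` and `paired_t2` and the toy data are assumptions introduced here, and the small Gaussian-elimination routine simply stands in for a library linear solver.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def paired_t2(x_aff, y_unaff):
    """T^2 = N * dbar' S^{-1} dbar, where d_i = X_i - Y_i for sib-pair i."""
    n = len(x_aff)
    d = [[xa - yu for xa, yu in zip(x, y)] for x, y in zip(x_aff, y_unaff)]
    p = len(d[0])
    dbar = [sum(row[j] for row in d) / n for j in range(p)]
    # Paired-sample variance-covariance matrix S of the differences.
    s = [[sum((row[j] - dbar[j]) * (row[k] - dbar[k]) for row in d) / (n - 1)
          for k in range(p)] for j in range(p)]
    w = solve(s, dbar)  # w = S^{-1} dbar
    return n * sum(dj * wj for dj, wj in zip(dbar, w))


# Toy data: six sib-pairs, one bi-allelic marker, allele coding (values are made up).
affected = [[2], [1], [1], [0], [2], [1]]
normal = [[1], [1], [0], [1], [1], [1]]
print(round(paired_t2(affected, normal), 6))  # 1.0
```

The resulting value would be compared with the χ² cut-off for the degrees of freedom stated above; with real data, N must be large enough for S to be non-singular.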

For general sibships each containing at least one affected sibling and at least one normal sibling, the Hotelling's T² test statistics T_H and T_G above can be generalised as follows. Assume that N sibships are available. In the i-th sibship, assume that n_i siblings are affected and m_i siblings are normal. Let $\bar{X}_{i}^{(A)}$ and $\bar{Y}_{i}^{(U)}$ be average coding vectors of affected and normal siblings, respectively. To be precise, let $X_{i j}^{(A)}$, j = 1, ..., n_i, be the coding vectors of the affected siblings of the i-th sibship. Then, $\bar{X}_{i}^{(A)} = \sum_{j = 1}^{n_{i}} X_{i j}^{(A)} / n_{i}$; $\bar{Y}_{i}^{(U)}$ is defined accordingly. Utilising $\bar{X}_{i}^{(A)}$ to replace $X_{i}^{(A)}$ and $\bar{Y}_{i}^{(U)}$ to replace $Y_{i}^{(U)}$ in the above paragraph and defining $\bar{X}^{(A)} = \sum_{i = 1}^{N} \bar{X}_{i}^{(A)} / N$ and $\bar{Y}^{(U)} = \sum_{i = 1}^{N} \bar{Y}_{i}^{(U)} / N$, we may define the related Hotelling's T² test statistics T_H and T_G.
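The reduction of a general sibship to one pair of averaged coding vectors can be sketched as follows. The data layout (each sibship as a tuple of affected and normal coding vectors) and the function names are assumptions made for illustration only.

```python
def mean_vector(vectors):
    """Component-wise average of equal-length coding vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def sibship_pairs(sibships):
    """Reduce each sibship to (mean affected vector, mean normal vector)."""
    return [(mean_vector(aff), mean_vector(unaff)) for aff, unaff in sibships]


# Hypothetical sibship: two affected sibs and one normal sib, two bi-allelic
# markers under allele coding.
sibships = [([[2, 1], [1, 1]], [[0, 1]])]
print(sibship_pairs(sibships))  # [([1.5, 1.0], [0.0, 1.0])]
```

The averaged vectors then take the place of the individual sib-pair vectors in the paired statistic, so each sibship contributes one difference vector regardless of its size.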

Non-centrality parameters

The derivation of non-centrality parameters of sib-pairs is provided in the section Supplementary information: Non-centrality parameters.

Results

Type I errors

Tables 1, 2 and 3 show type I error rates of test statistics T_H and T_G at a significance level α = 0.01, using one marker H₁ or two markers H₁ and H₂. Three models are considered. In model I, one marker H₁ is used in analysis: H₁ is a bi-allelic marker with equal allele frequency P(H₁₁) = P(H₁₂) = 0.50. In model II, two bi-allelic markers H₁ and H₂ are used in analysis, where P(H_ij) = 0.5, i, j = 1, 2, $Δ_{H_{11} H_{21}} = 0.05$. In model III, one marker H₁ is used in analysis, where H₁ is a quadri-allelic marker with allele frequencies P(H₁₁) = P(H₁₂) = 0.35, P(H₁₃) = P(H₁₄) = 0.15.

Table 2 Type I error rates of N = 200 or 300 sibships at a significance level α = 0.01 using one marker, H₁, or two markers, H₁ and H₂.

Table 3 Type I error rates of N = 200 or 300 sibships at a significance level α = 0.01 using one marker, H₁, or two markers, H₁ and H₂.

Each time, 5,000 simulated datasets are generated and each dataset contains N = 200 or 300 sibships under the assumption that there is no association between the marker(s) and the disease locus; a type I error rate is then calculated as the proportion of the 5,000 datasets for which the empirical test statistics are greater than, or equal to, the cut-off point at the significance level α = 0.01. The process is repeated 100 times. Thus, 100 type I error rates are calculated. The mean, standard deviation, minimum and maximum of the 100 type I error rates are presented in the entries of Tables 1, 2 and 3. Since the disease locus is not associated with the marker(s), the empirical test statistics which are greater than or equal to the cut-off point at the significance level α = 0.01 are treated as false positives. Thus, the type I error rates of Tables 1, 2 and 3 are empirical results.
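The simulation procedure can be sketched in Python for the simplest setting, model I with sib-pairs only. This is an assumption-laden illustration rather than the authors' simulation code: parents are drawn under Hardy-Weinberg proportions, each sib receives one randomly chosen allele from each parent, and, because the marker is unassociated with the disease under the null hypothesis, the affected/normal labels are assigned arbitrarily. The number of datasets is reduced from 5,000 to keep the example quick.

```python
import random

CHI2_1DF_01 = 6.6349  # upper 1% point of the chi-square distribution, 1 df


def sib_allele_count(father, mother, rng):
    """Copies of allele H11 inherited by one child (one allele per parent)."""
    return rng.choice(father) + rng.choice(mother)


def simulate_null_t2(n_pairs, freq, rng):
    """T_H for one bi-allelic marker when marker and disease are unassociated."""
    diffs = []
    for _ in range(n_pairs):
        father = [int(rng.random() < freq) for _ in range(2)]
        mother = [int(rng.random() < freq) for _ in range(2)]
        affected = sib_allele_count(father, mother, rng)
        normal = sib_allele_count(father, mother, rng)
        diffs.append(affected - normal)
    mean = sum(diffs) / n_pairs
    var = sum((d - mean) ** 2 for d in diffs) / (n_pairs - 1)
    return n_pairs * mean * mean / var


def type1_error(n_datasets, n_pairs, freq=0.5, seed=1):
    """Proportion of null datasets whose statistic reaches the 1% cut-off."""
    rng = random.Random(seed)
    hits = sum(simulate_null_t2(n_pairs, freq, rng) >= CHI2_1DF_01
               for _ in range(n_datasets))
    return hits / n_datasets


print(type1_error(n_datasets=200, n_pairs=200))
```

Repeating `type1_error` with different seeds and summarising the resulting rates would mirror the mean, standard deviation, minimum and maximum reported in the tables.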

In Table 1, only sib-pairs are used in the calculations. In each sib-pair, one sibling is affected and the other one is normal. In Table 2, combinations of both sib-pairs and sibships of size 3 are used: the number of sib-pairs is equal to N/2; the number of sibships of size 3 is N/2; in each of N/4 sibships of size 3, one is affected and the other two are normal; in the remaining N/4 sibships of size 3, two are affected and the other one is normal. In Table 3, combinations of sib-pairs and sibships of sizes 3 and 4 are used: the number of sib-pairs is equal to N/2; the number of sibships of size 3 is N/5; and the number of sibships of size 4 is 3N/10; in each of N/10 sibships of size 3, one is affected and the other two are normal; in the remaining N/10 sibships of size 3, two are affected and the other one is normal; in each of N/10 sibships of size 4, one is affected and the other three are normal; in each of N/10 sibships of size 4, two are affected and the other two are normal; in the remaining N/10 sibships of size 4, three are affected and the other one is normal.
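The Table 3 mixture of sibship structures can be tallied with a short sketch. The function name and the (affected, normal) keying are illustrative assumptions; the counts follow the description above and assume N is divisible by 10.

```python
def table3_design(n):
    """Map (number affected, number normal) to the number of such sibships."""
    return {
        (1, 1): n // 2,                                     # sib-pairs
        (1, 2): n // 10, (2, 1): n // 10,                   # sibships of size 3
        (1, 3): n // 10, (2, 2): n // 10, (3, 1): n // 10,  # sibships of size 4
    }


design = table3_design(200)
by_size = {}
for (n_aff, n_norm), count in design.items():
    by_size[n_aff + n_norm] = by_size.get(n_aff + n_norm, 0) + count

print(sum(design.values()))  # 200
print(by_size)               # {2: 100, 3: 40, 4: 60}
```

For N = 200 this gives N/2 = 100 sib-pairs, N/5 = 40 sibships of size 3 and 3N/10 = 60 sibships of size 4, which together account for all N sibships.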

From the results presented in Tables 1, 2 and 3, it is clear that T_H has a lower type I error than T_G. That is, the test statistic of the 'haplotype/allele coding' method has a lower type I error than the test statistic of the 'genotype coding' method. The 'haplotype/allele coding' method leads to more robust and reliable test statistics. The type I error rates of the test statistic of the 'haplotype/allele coding' method are reasonable for models I, II and III when N = 200. In addition, the type I error rates of the test statistic of the 'genotype coding' method are reasonable for models I and II when N = 200. The type I error rates of the test statistic for the 'genotype coding' method are slightly higher than the nominal level 0.01 for model III when N = 200 and become lower when N = 300. Note that the number of degrees of freedom for tests T_H and T_G is 3 and 9, respectively, for model III. Hence, the number of degrees of freedom for test T_G is large for model III. When the number of degrees of freedom of a test is large, the asymptotic approximation can be problematic. In this case, a large sample is necessary to keep the type I error rates in a reasonable range.

The results are similar in Tables 1, 2 and 3. Thus, the type I error rates are little affected by the varying structure of the sibships. The reason for this is that we basically take averages of the coding vectors for sibships whose size is larger than 2.

Power calculation and comparison

To make power comparisons, we consider four genetic models: heterogeneous recessive, heterogeneous dominant, additive and multiplicative. For optimistic models, Table 4 gives penetrance probabilities taken from Nielsen et al. or Fan and Knapp [11, 19]. For less optimistic models, Table 5 lists penetrance probabilities taken from Fan and Knapp [19]. For j = 1, ..., J, let us denote the measures of LD between allele H_jk of the marker H_j and the disease locus D by Δ_jk = P(H_jkD) - P(H_jk)P_D, k = 1, ..., n_j. Here, P(H_jkD) is the frequency of haplotype H_jkD, and P(H_jk) is the frequency of allele H_jk. For two bi-allelic markers H₁ and H₂, let $\Delta_{H_{1} H_{2}} = P (H_{11} H_{21}) - P (H_{11}) P (H_{21})$ be the measure of LD between the two markers, where P(H₁₁H₂₁) is the frequency of haplotype H₁₁H₂₁. Assume that the two markers H₁ and H₂ flank the disease locus D in the order H₁DH₂. Let $\Delta_{1 D 2} = P (H_{11} D H_{21}) - P (H_{11}) \Delta_{21} - P_{D} \Delta_{H_{1} H_{2}} - P (H_{21}) \Delta_{11} - P (H_{11}) P_{D} P (H_{21})$ be the measure of the third-order LD [24]. Here, P(H₁₁DH₂₁) is the frequency of haplotype H₁₁DH₂₁.
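The LD measures defined above translate directly into code. The sketch below uses hypothetical function names and made-up haplotype frequencies; only the formulas themselves come from the text.

```python
def pairwise_ld(p_hap, p_a, p_b):
    """Delta = P(ab) - P(a) P(b) for alleles a and b at two different loci."""
    return p_hap - p_a * p_b


def third_order_ld(p_h11_d_h21, p_h11, p_d, p_h21, d11, d21, d_h1h2):
    """Delta_1D2 as defined in the text for the locus order H1-D-H2."""
    return (p_h11_d_h21 - p_h11 * d21 - p_d * d_h1h2 - p_h21 * d11
            - p_h11 * p_d * p_h21)


# Illustrative values: P(H11) = 0.5, P_D = 0.15, P(H11 D) = 0.115.
delta_11 = pairwise_ld(0.115, 0.5, 0.15)
print(round(delta_11, 6))  # 0.04

# With no LD of any order, the three-locus haplotype frequency is the product
# of the allele frequencies and the third-order measure vanishes.
print(round(third_order_ld(0.5 * 0.15 * 0.5, 0.5, 0.15, 0.5, 0.0, 0.0, 0.0), 6))  # 0.0
```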

Table 4 First set of parameters of simulated genetic models.

Table 5 Second set of parameters of simulated genetic models.

Figure 1 shows power curves of T_H and T_G against the measure of LD Δ₁₁ at a significance level α = 0.05 using two bi-allelic markers H₁ and H₂, when P(H_i1) = P(H_i2) = 0.50, i = 1, 2, P_D = 0.15 and N = 200 sib-pairs for the first set of parameters of the four genetic models of Table 4. The power curves of T_H1 and T_G1 are calculated based on one marker H₁. In the graphs, Delta_11 = Δ₁₁; the other parameters are given in the legend of the figure. Figure 2 shows power curves of T_H and T_G against the measure of LD Δ₁₁ at a significance level α = 0.05 using two bi-allelic markers H₁ and H₂, when P(H_i1) = P(H_i2) = 0.50, i = 1, 2, P_D = 0.15 and N = 600 sib-pairs for the second set of parameters of the four genetic models listed in Table 5. Similarly to Figure 1, the power curves of T_H1 and T_G1 are calculated based on one marker H₁. The other parameters are the same as those of Figure 1.

From Figures 1 and 2, it is clear that T_H generally has a higher power than T_G. This is consistent with the results of Fan and Knapp for population case-control studies and Fan et al. for nuclear family data [19, 20]. This is most likely due to the large number of degrees of freedom of the test statistic T_G. The power of T_H (or T_G) based on two markers H₁ and H₂ is generally higher than that of T_H1 (or T_G1), which is only based on one marker H₁. Hence, it is advantageous to use two markers rather than one marker in the analysis. This observation can be generalised -- that is, it is advantageous to use multiple tightly linked markers in analysis. Note that the number of degrees of freedom of test statistic T_G can increase rapidly as the number of markers increases. This is particularly true when multi-allelic markers are used in analysis; but the number of degrees of freedom of T_H only increases by one if one more bi-allelic marker is added to the analysis. Thus, T_H has the advantage of high power when multiple markers are used; in addition, the number of degrees of freedom of T_H would not be very large. For optimistic models in Table 4, the sample sizes required to achieve certain power levels are lower than those of the less optimistic models in Table 5.
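The growth of the degrees of freedom under the two codings follows from the formulas given in the Methods, $\sum_{j} n_{j} - J$ for T_H and $\sum_{j} n_{j} (n_{j} + 1) / 2 - J$ for T_G. The short sketch below, with hypothetical function names, tabulates them for a few marker sets.

```python
def df_allele_coding(n_alleles):
    """Degrees of freedom of T_H: sum over markers of (n_j - 1)."""
    return sum(n - 1 for n in n_alleles)


def df_genotype_coding(n_alleles):
    """Degrees of freedom of T_G: sum over markers of (n_j (n_j + 1) / 2 - 1)."""
    return sum(n * (n + 1) // 2 - 1 for n in n_alleles)


# Each list gives the number of alleles n_j of the markers in the analysis.
for markers in ([2], [2, 2], [2, 2, 2], [4], [4, 4]):
    print(markers, df_allele_coding(markers), df_genotype_coding(markers))
# [2] 1 2
# [2, 2] 2 4
# [2, 2, 2] 3 6
# [4] 3 9
# [4, 4] 6 18
```

Each added bi-allelic marker contributes one degree of freedom to T_H and two to T_G, while each quadri-allelic marker contributes three and nine, respectively.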

Not only can the test statistics T_H and T_G be applied to analyse the genetic data of the bi-allelic markers, but they can also be applied to analyse the genetic data of the multi-allelic markers. Figure 3 shows the power curves of T_H and T_G against the measure of LD Δ₁₁ at a significance level α = 0.05 using a quadri-allelic marker H₁, when P(H₁₁) = P(H₁₂) = 0.35, P(H₁₃) = P(H₁₄) = 0.15, P_D = 0.15 and N = 200 sib-pairs for the first set of parameters of the four genetic models of Table 4. The other parameters are given in the legend of the Figure. Figure 4 shows power curves of T_H and T_G at a significance level α = 0.05 using a quadri-allelic marker H₁, when P(H₁₁) = P(H₁₂) = 0.35, P(H₁₃) = P(H₁₄) = 0.15, P_D = 0.15 and N = 600 sib-pairs for the second set of parameters of the four genetic models of Table 5. Similarly to Figures 1 and 2, T_H generally has a higher power than that of T_G.

In addition to the power curves of T_H and T_G, which are based on sib-pair data, Figures 3 and 4 show the simulated power curves of ST_H and ST_G, which are based on sibships of varying structures. In Figure 3, combinations of both sib-pairs and sibships of size 3 are used to calculate the simulated power curves of ST_H and ST_G: the number of sib-pairs is equal to N/2 = 100; the number of sibships of size 3 is N/2 = 100; in each of N/4 = 50 sibships of size 3, one is affected and the other two are normal; in the remaining N/4 = 50 sibships of size 3, two are affected and the other one is normal. In Figure 4, combinations of sib-pairs and sibships of sizes 3 and 4 are used to calculate the simulated power curves of ST_H and ST_G: the number of sib-pairs is equal to N/2 = 300; the number of sibships of size 3 is N/5 = 120; and the number of sibships of size 4 is 3N/10 = 180; in each of N/10 = 60 sibships of size 3, one is affected and the other two are normal; in the remaining N/10 = 60 sibships of size 3, two are affected and the other one is normal; in each of N/10 = 60 sibships of size 4, one is affected and the other three are normal; in each of N/10 = 60 sibships of size 4, two are affected and the other two are normal; in the remaining N/10 = 60 sibships of size 4, three are affected and the other one is normal.

To calculate the simulated power curves ST_H and ST_G, the interval (0, 0.045) of the LD measure Δ₁₁ is uniformly divided into 20 subintervals in Figures 3 and 4. Correspondingly, the 20 subintervals lead to 21 endpoints. For each endpoint, there is a set of parameters for each power curve. Using the set of parameters, 2,500 datasets are simulated for each endpoint. For each dataset, the empirical statistics T_H and T_G are calculated. The simulated power is the proportion of the 2,500 simulated datasets for which the empirical statistic is larger than the cut-off point of the corresponding χ²-distribution at a 0.05 significance level.
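The grid of LD values and the definition of simulated power can be sketched as follows. The function names and the four example statistics are hypothetical; the cut-off 7.815 is the upper 5% point of the χ² distribution with 3 degrees of freedom, which would apply to T_H for a quadri-allelic marker.

```python
def ld_grid(upper=0.045, n_sub=20):
    """Endpoints of n_sub equal subintervals between 0 and upper."""
    return [upper * i / n_sub for i in range(n_sub + 1)]


def simulated_power(statistics, cutoff):
    """Proportion of simulated statistics larger than the chi-square cut-off."""
    return sum(t > cutoff for t in statistics) / len(statistics)


grid = ld_grid()
print(len(grid))  # 21

# Four made-up statistics standing in for the 2,500 simulated datasets.
print(simulated_power([1.2, 8.4, 9.9, 3.3], cutoff=7.815))  # 0.5
```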

From Figures 3 and 4, it can be seen that the simulated power ST_H is generally higher than the power of T_H, and the simulated power ST_G is generally higher than the power of T_G. Intuitively, sibships of large size contain more information than that of a sib-pair. The test statistics T_H and T_G can accurately capture the information contained in sibships of large size. Moreover, it can also be seen in Tables 1, 2 and 3 that the type I error is not inflated by including sibships of varying structure.

Simulation study

To evaluate the accuracy of the non-centrality parameter approximations, we performed simulations for the power curves in Figures 1, 2, 3 and 4. The results are presented in the section: Supplementary information: Simulation study. It can be seen that the approximations are excellent.

Discussion

The goal of this study was to develop sibship-based Hotelling's T² test statistics for high-resolution association mapping of complex diseases. This extends our previous research on paired Hotelling's T² test statistics for nuclear family data or parent-offspring pairs [20]. For late-onset complex diseases, parental data are usually not available. This motivated us to perform sib-pair or sibship analyses to study late-onset disorders. Based on two coding methods -- 'haplotype/allele coding' and 'genotype coding' -- paired Hotelling's T² test statistics T_H and T_G are proposed for high-resolution association studies, using normal sibs as controls for affected sibs. The test statistics can be applied to any number of markers, which can be either bi-allelic or multi-allelic. After power calculation and comparison, it was found that it is advantageous to use two markers rather than one marker in the analysis. This observation can be generalised -- that is, it is advantageous to use multiple tightly linked markers in analysis. The test statistic T_H based on the 'haplotype/allele coding' method is generally more powerful than the test statistic T_G based on the 'genotype coding' method. This is most likely due to the large number of degrees of freedom of T_G. Moreover, the type I error rates of the test statistic T_H are lower than those of test statistic T_G.

For population case-control association studies, false-positive rates can be high due to inappropriate controls, which can occur if there is population admixture or stratification [25]. Moreover, it is not always clear how to choose the appropriate controls. Alternatively, the parents or normal sibs can be used as controls of affected sibs [22, 26–29]. For parental/sibling controls, the methods proposed by Fan and Knapp [19] and Xiong et al. [17] are not valid, since cases and controls are correlated with each other. The two sample Hotelling's T² test statistics only take into account the correlation among markers [17, 19]. For sibship data, not only the correlation among the markers but also the correlation within each sib-pair needs to be taken into account. The paired Hotelling's T² test statistics T_H and T_G developed in this paper correctly take both the correlation among the markers and the correlation within each sib-pair into account. The proposed method is potentially useful in association mapping of late-onset complex diseases.

Cordell and Clayton [2] and Chapman et al. [18] proposed logistic regression models for population-based case control studies or family studies. Both our proposed method and the logistic regression models can be used in association studies of multi-locus marker data. One advantage of the logistic regression models is that it is easy to add covariates to model the environmental effects, in addition to the genetic effects; however, it is not clear how to incorporate the environmental effects into our Hotelling's T² test statistics. While we are able to calculate the non-centrality parameters for our T² test statistics for power and sample size calculations, it is not clear if one might get similar results for the logistic regression models. In the study by Cordell and Clayton [2], the authors mainly discuss the analysis of SNP data and only briefly describe a way to analyse the multi-allelic markers data. We feel that more investigations are necessary in order for multi-allelic markers data to be used in the logistic regression models. By contrast, our proposed T² can be used to analyse either bi-allelic or multi-allelic marker data, or both simultaneously. Moreover, more investigations are needed to make power comparisons of the two methods.

In Figures 3 and 4, we show that the power of test statistics T_H and T_G based on combinations of sibships of varying structures is generally higher than the power of the test statistics based on sib-pairs. This is because the test statistics T_H and T_G use the average coding vectors for sibships whose sizes are larger than 2. This averaging strategy does not affect the mean of the coding vectors $\bar{X}^{(A)}$ and $\bar{Y}^{(U)}$, but it reduces the variability of the within-sibship differences and so leads to a smaller variance-covariance matrix S, which increases the test statistics. Moreover, it can be seen from Tables 1, 2 and 3 that the type I error is not inflated by including sibships of varying structure. Although the proposed test statistics benefit from this, it is unlikely that they are optimal. One possible improvement would be to use weighted sibships in constructing test statistics. In this paper, we assume that there are no missing data. For practical genotype data, genotypic information may be missing at some markers for a portion of the sample [26]. As a result, the methods used here need to be updated to address the problem of missing data. Another issue is that it is not clear how to combine population data, nuclear family data and sibship data in one single analysis. In practice, the three types of genetic data can be available. They can be analysed separately, but it would be preferable to combine them in a unified analysis, which may lead to higher power. These issues need more in-depth investigation.

Appendix

Consider a sib-pair in which one sibling is affected and the other is unaffected/normal. For convenience, assume that the first sibling is affected and the second sibling is normal. Let us denote A₁ = (the first sibling is affected), U₂ = (the second sibling is unaffected). Let f_DD, f_Dd = f_dD and f_dd be the probabilities that an individual with genotypes DD, Dd and dd is affected with the disease, respectively. Since allele D is disease susceptible, one may assume that f_DD ≥ f_Dd ≥ f_dd. Let $\bar{f}_{D D} = 1 - f_{D D}$, $\bar{f}_{D d} = 1 - f_{D d}$ and $\bar{f}_{d d} = 1 - f_{d d}$. Denote the disease prevalence in the population by $A = f_{D D} P_{D}^{2} + 2 f_{D d} P_{D} P_{d} + f_{d d} P_{d}^{2}$, and let $\bar{A} = \bar{f}_{D D} P_{D}^{2} + 2 \bar{f}_{D d} P_{D} P_{d} + \bar{f}_{d d} P_{d}^{2} = 1 - A$. Assume that the affected status of an individual depends only on his/her own genotype at the disease locus. Let us denote the event (i IBD) = the sib-pair shares i genes identical by descent (IBD) at the disease locus D. Then the joint probability

$$
\begin{aligned}
P (A_{1}, U_{2}) &= P (A_{1}, U_{2} \mid 2\ \text{IBD}) / 4 + P (A_{1}, U_{2} \mid 1\ \text{IBD}) / 2 + P (A_{1}, U_{2} \mid 0\ \text{IBD}) / 4 \\
&= \frac{1}{4} \Big[\sum_{s, t \in \{D, d\}} f_{s t} \bar{f}_{s t} P_{s} P_{t} + 2 \sum_{s, t, q \in \{D, d\}} f_{s t} \bar{f}_{t q} P_{s} P_{t} P_{q} + \sum_{s, t, q, r \in \{D, d\}} f_{s t} \bar{f}_{q r} P_{s} P_{t} P_{q} P_{r}\Big] \\
&= \frac{1}{4} \Big[\sum_{s, t \in \{D, d\}} f_{s t} \bar{f}_{s t} P_{s} P_{t} + 2 \sum_{s, t, q \in \{D, d\}} f_{s t} \bar{f}_{t q} P_{s} P_{t} P_{q} + A \bar{A}\Big],
\end{aligned}
$$

(1)

where s, t, q, r take values of disease allele D and d. To calculate the above equations, we consider the three partitions (2 IBD), (1 IBD) and (0 IBD). These three partitions have probabilities 1/4, 1/2 and 1/4, respectively. Conditional on each partition, the corresponding conditional probabilities are then calculated. The frequency of homozygous genotype H_jkH_jk in an affected sibling is given by:

\begin{array}{l} a_{j k k} & = P [G_{i j}^{(A)} = H_{j k} H_{j k} | A_{1}, U_{2}] \\ = P [G_{i j}^{(A)} = H_{j k} H_{j k}, A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] / \\ P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P (H_{j k} s) P (H_{j k} t) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P (H_{j k} t) P (H_{j k} s) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} P (H_{j k} s) P (H_{j k} t) Ā] / P (A_{1}, U_{2}) . \end{array}

(2)

Similarly, the frequency of homozygous genotype H_jkH_jk in an unaffected sibling is given by:

\begin{array}{l} ā_{j k k} & = P [G_{i j}^{(U)} = H_{j k} H_{j k} | A_{1}, U_{2}] = P [G_{i j}^{(U)} = H_{j k} H_{j k}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P (H_{j k} s) P (H_{j k} t) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P (H_{j k} t) P (H_{j k} s) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} P (H_{j k} s) P (H_{j k} t) A] / P (A_{1}, U_{2}) . \end{array}

(3)

Note that $ā_{j k k}$ can be calculated by the formula for a_jkk by substituting f_st with ${\bar{f}}_{s t}$ and vice versa. Note that the haplotype frequencies are P(H_jkD) = Δ_jk + P(H_jk)P_D and P(H_jkd) = -Δ_jk + P(H_jk)P_d. Under the null hypothesis of no association between the markers H_j, j = 1, 2, ..., J, and the disease locus D (that is, Δ_jk = 0 for all j and k), the haplotype frequencies are equal to the product of allele frequencies; for example, P(H_jkD) = P(H_jk)P_D and P(H_jkd) = P(H_jk)P_d. Substituting these products into equations (2) and (3), the factor P(H_jk)² can be taken out of each sum and the remaining bracket equals P(A₁, U₂) of equation (1), so that $a_{j k k} = ā_{j k k} = P {(H_{j k})}^{2}$ .
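As an illustration of equations (1), (2) and (3), the following minimal Python sketch evaluates P(A₁, U₂), a_jkk and ā_jkk for a single marker allele. The function name `homozygote_freqs`, its argument names and the dictionary layout of the penetrances are assumptions made for this example; they do not come from the paper.

```python
from itertools import product

def homozygote_freqs(f, p_D, p_H, delta):
    """Evaluate equations (1)-(3) for one marker allele/haplotype H_jk.

    f     : penetrances keyed by disease genotype, {"DD": .., "Dd": .., "dd": ..}
    p_D   : frequency of the disease allele D
    p_H   : frequency P(H_jk)
    delta : LD coefficient Delta_jk between H_jk and D
    Returns (P(A1, U2), a_jkk, abar_jkk).
    """
    P = {"D": p_D, "d": 1.0 - p_D}
    # P(H_jk D) = Delta + P(H_jk) P_D,  P(H_jk d) = -Delta + P(H_jk) P_d
    PH = {"D": delta + p_H * P["D"], "d": -delta + p_H * P["d"]}
    ST = list(product("Dd", repeat=2))
    STQ = list(product("Dd", repeat=3))

    def pen(s, t):            # f_st, symmetric in s and t
        return f["".join(sorted(s + t))]

    def pen_bar(s, t):        # 1 - f_st
        return 1.0 - pen(s, t)

    A = sum(pen(s, t) * P[s] * P[t] for s, t in ST)   # prevalence
    A_bar = 1.0 - A

    # Equation (1): mixture over 2, 1 and 0 alleles shared IBD
    joint = 0.25 * (
        sum(pen(s, t) * pen_bar(s, t) * P[s] * P[t] for s, t in ST)
        + 2.0 * sum(pen(s, t) * pen_bar(t, q) * P[s] * P[t] * P[q] for s, t, q in STQ)
        + A * A_bar
    )

    def conditional(focal, other, other_marginal):
        # Shared structure of equations (2) and (3)
        ibd2 = sum(focal(s, t) * other(s, t) * PH[s] * PH[t] for s, t in ST)
        ibd1 = sum(focal(s, t) * other(t, q) * PH[t] * PH[s] * P[q] for s, t, q in STQ)
        ibd0 = sum(focal(s, t) * PH[s] * PH[t] for s, t in ST) * other_marginal
        return (0.25 * ibd2 + 0.5 * ibd1 + 0.25 * ibd0) / joint

    a_kk = conditional(pen, pen_bar, A_bar)     # equation (2)
    abar_kk = conditional(pen_bar, pen, A)      # equation (3)
    return joint, a_kk, abar_kk
```

Calling the function with `delta=0.0` should return P(H_jk)² for both frequencies, which is the null-hypothesis statement made above.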

Similarly, the frequency of the heterozygous genotype H_jkH_jl, k ≠ l, in an affected sibling can be calculated as follows:

\begin{array}{l} a_{j k l} & = P [G_{i j}^{(A)} = H_{j k} H_{j l} | A_{1}, U_{2}] = P [G_{i j}^{(A)} = H_{j k} H_{j l}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} (P (H_{j k} s) P (H_{j l} t) + P (H_{j k} t) P (H_{j l} s)) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} (P (H_{j k} t) P (H_{j l} s) + P (H_{j k} s) P (H_{j l} t)) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} (P (H_{j k} s) P (H_{j l} t) + P (H_{j k} t) P (H_{j l} s)) Ā] / P (A_{1}, U_{2}) . \end{array}

(4)

The frequency of the heterozygous genotype H_jkH_jl, k ≠ l, in an unaffected sibling can be calculated as follows:

\begin{array}{l} ā_{j k l} & = P [G_{i j}^{(U)} = H_{j k} H_{j l} | A_{1}, U_{2}] = P [G_{i j}^{(U)} = H_{j k} H_{j l}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} (P (H_{j k} s) P (H_{j l} t) + P (H_{j k} t) P (H_{j l} s)) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} (P (H_{j k} t) P (H_{j l} s) + P (H_{j k} s) P (H_{j l} t)) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} (P (H_{j k} s) P (H_{j l} t) + P (H_{j k} t) P (H_{j l} s)) A] / P (A_{1}, U_{2}) . \end{array}

(5)

Note that $ā_{j k l}$ can be calculated by the formula for a_jkl by substituting f_st with ${\bar{f}}_{s t}$ and vice versa. Under the null hypothesis of no association between the markers H_j, j = 1, 2, ..., J, and the disease locus D (that is, Δ_jk = 0 for all j and k), the haplotype frequencies are equal to the product of the allele frequencies; for example, P(H_jkD) = P(H_jk)P_D, P(H_jkd) = P(H_jk)P_d, P(H_jlD) = P(H_jl)P_D and P(H_jld) = P(H_jl)P_d. From equations (4) and (5), $a_{j k l} = ā_{j k l} = 2 P (H_{j k}) P (H_{j l})$ . Therefore, the expectation $E ({\bar{X}}^{(A)} - Ȳ^{(U)} | A_{1}, U_{2}) = 0$ for the 'genotype coding' method.

For the 'haplotype/allele coding' method, equations (2), (3), (4) and (5) imply

\begin{array}{l} E (z_{i j k}^{(A)} | A_{1}, U_{2}) & = 2 a_{j k k} + \sum_{l \neq k} a_{j k l}, E (z_{i j k}^{(U)} | A_{1}, U_{2}) \\ = 2 ā_{j k k} + \sum_{l \neq k} ā_{j k l} . \end{array}

(6)

Under the null hypothesis of no association between the markers H_j, j = 1, ..., J, and the disease locus D, each expectation in equation (6) equals $2 P {(H_{j k})}^{2} + \sum_{l \neq k} 2 P (H_{j k}) P (H_{j l}) = 2 P (H_{j k})$ . Hence the expectation $E (z_{i j k}^{(A)} - z_{i j k}^{(U)} | A_{1}, U_{2}) = 2 P (H_{j k}) - 2 P (H_{j k}) = 0$ for the 'haplotype/allele coding' method.
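The expectations in equation (6) can be evaluated numerically from equations (1) to (5). The sketch below does this for one marker with an arbitrary number of alleles; the function name `expected_allele_counts`, the list-based inputs and the 0-based allele indexing are assumptions of this example, and the inputs are assumed to satisfy Σ_k P(H_jk) = 1 and Σ_k Δ_jk = 0.

```python
from itertools import product

def expected_allele_counts(f, p_D, p_H, delta):
    """E(z_k^(A) | A1, U2) and E(z_k^(U) | A1, U2) for every allele k of one marker.

    f     : penetrances {"DD": .., "Dd": .., "dd": ..}
    p_D   : frequency of disease allele D
    p_H   : list of marker allele frequencies P(H_jk), summing to 1
    delta : list of LD coefficients Delta_jk, summing to 0
    Returns (ez_A, ez_U, P(A1, U2)).
    """
    P = {"D": p_D, "d": 1.0 - p_D}
    n = len(p_H)
    PH = [{"D": delta[k] + p_H[k] * P["D"], "d": -delta[k] + p_H[k] * P["d"]}
          for k in range(n)]
    ST = list(product("Dd", repeat=2))
    STQ = list(product("Dd", repeat=3))

    def pen(s, t):
        return f["".join(sorted(s + t))]

    def pen_bar(s, t):
        return 1.0 - pen(s, t)

    A = sum(pen(s, t) * P[s] * P[t] for s, t in ST)
    joint = 0.25 * (
        sum(pen(s, t) * pen_bar(s, t) * P[s] * P[t] for s, t in ST)
        + 2.0 * sum(pen(s, t) * pen_bar(t, q) * P[s] * P[t] * P[q] for s, t, q in STQ)
        + A * (1.0 - A)
    )

    def pair(k, l, s, t):
        # Marker genotype H_k H_l carried on disease alleles s and t
        if k == l:
            return PH[k][s] * PH[k][t]
        return PH[k][s] * PH[l][t] + PH[k][t] * PH[l][s]

    def geno_freq(k, l, focal, other, other_marginal):
        # Equations (2)/(4) for the affected sib, (3)/(5) for the unaffected sib
        ibd2 = sum(focal(s, t) * other(s, t) * pair(k, l, s, t) for s, t in ST)
        ibd1 = sum(focal(s, t) * other(t, q) * pair(k, l, s, t) * P[q] for s, t, q in STQ)
        ibd0 = sum(focal(s, t) * pair(k, l, s, t) for s, t in ST) * other_marginal
        return (0.25 * ibd2 + 0.5 * ibd1 + 0.25 * ibd0) / joint

    def mean_count(k, focal, other, other_marginal):
        # Equation (6): 2 * homozygote frequency + sum of heterozygote frequencies
        hets = sum(geno_freq(k, l, focal, other, other_marginal)
                   for l in range(n) if l != k)
        return 2.0 * geno_freq(k, k, focal, other, other_marginal) + hets

    ez_A = [mean_count(k, pen, pen_bar, 1.0 - A) for k in range(n)]
    ez_U = [mean_count(k, pen_bar, pen, A) for k in range(n)]
    return ez_A, ez_U, joint
```

With all entries of `delta` set to zero, both returned lists should equal 2·P(H_jk) element by element, so their difference vanishes as stated above.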

Supplementary information: Non-centrality parameters

Consider N sib-pairs, each consisting of an affected sibling and a normal sibling. For convenience, assume that the first sibling is affected and the second sibling is normal in each sib-pair. Let us denote A₁ = (the first sibling is affected), U₂ = (the second sibling is unaffected). For 'haplotype/allele coding', the coding vector of the affected sibling in the i-th sib-pair is $X_{i}^{(A)} = {(z_{i 11}^{(A)}, \dots, z_{i 1 (n_{1} - 1)}^{(A)}, \dots, z_{i J 1}^{(A)}, \dots, z_{i J (n_{J} - 1)}^{(A)})}^{τ}$ . Similarly, $Y_{i}^{(U)} = {(z_{i 11}^{(U)}, \dots, z_{i 1 (n_{1} - 1)}^{(U)}, \dots, z_{i J 1}^{(U)}, \dots, z_{i J (n_{J} - 1)}^{(U)})}^{τ}$ is the coding vector of the normal sibling. Denote the variance-covariance matrix of $X_{i}^{(A)} - Y_{i}^{(U)}$ by $\Sigma_{h a p} = Var (X_{i}^{(A)} - Y_{i}^{(U)} | A_{1}, U_{2}) = Var (X_{i}^{(A)} | A_{1}, U_{2}) - Cov (X_{i}^{(A)}, Y_{i}^{(U)} | A_{1}, U_{2}) - Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2}) + Var (Y_{i}^{(U)} | A_{1}, U_{2})$ . The elements of the above variance-covariance matrices are given in Appendices A, B and C: $Var (X_{i}^{(A)} | A_{1}, U_{2})$ and $Var (Y_{i}^{(U)} | A_{1}, U_{2})$ in Appendix A, and $Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2})$ in Appendices B and C. Using the quantities $E (z_{i j k}^{(A)} | A_{1}, U_{2})$ and $E (z_{i j k}^{(U)} | A_{1}, U_{2})$ given in the Appendix to the manuscript, $E (X_{i}^{(A)} - Y_{i}^{(U)} | A_{1}, U_{2})$ can be calculated. The non-centrality parameter λ_H of Hotelling's statistic T_H is given by $λ_{H} = N E {(X_{i}^{(A)} - Y_{i}^{(U)} | A_{1}, U_{2})}^{τ} {[\Sigma_{h a p}]}^{- 1} E (X_{i}^{(A)} - Y_{i}^{(U)} | A_{1}, U_{2})$ .
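Once the mean difference vector and its variance-covariance matrix are available, the non-centrality parameter is a quadratic form. The following sketch computes N μ^τ Σ⁻¹ μ with the standard library only; the helper names `solve` and `noncentrality` are illustrative choices, and the linear system is solved by ordinary Gaussian elimination rather than by any method stated in the paper.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("variance-covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def noncentrality(n_pairs, mean_diff, sigma):
    """lambda = N * mu^T Sigma^{-1} mu, with mu = E(X - Y | A1, U2)."""
    weights = solve(sigma, mean_diff)      # Sigma^{-1} mu without forming the inverse
    return n_pairs * sum(m * w for m, w in zip(mean_diff, weights))
```

For example, `noncentrality(3, [1.0, 1.0], [[2.0, 1.0], [1.0, 2.0]])` evaluates 3 · (2/3). Solving the linear system avoids forming Σ⁻¹ explicitly, which is usually the more stable choice numerically.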

For the 'genotype coding' method, the coding vector of the affected sibling in the i-th sib-pair is $X_{i j}^{(A)} = {(x_{i j 1}^{(A)}, \dots, x_{i j (n_{j} - 1)}^{(A)}, x_{i j 12}^{(A)}, \dots, x_{i j 1 n_{j}}^{(A)}, \dots, x_{i j (n_{j} - 1) n_{j}}^{(A)})}^{τ}$ , j = 1, ..., J. Similarly, $Y_{i j}^{(U)} = {(x_{i j 1}^{(U)}, \dots, x_{i j (n_{j} - 1)}^{(U)}, x_{i j 12}^{(U)}, \dots, x_{i j 1 n_{j}}^{(U)}, \dots, x_{i j (n_{j} - 1) n_{j}}^{(U)})}^{τ}$ is the coding vector of the normal sibling. Let a_jkl and $ā_{j k l}$ be the frequencies of genotype H_jkH_jl in affected and unaffected siblings given in the Appendix to the manuscript. Then,

E [X_{i j}^{(A)} | A_{1}, U_{2}] = {(a_{j 11}, \dots, a_{j (n_{j} - 1) (n_{j} - 1)}, a_{j 12}, \dots, a_{j 1 n_{j}}, \dots, a_{j (n_{j} - 1) n_{j}})}^{τ},

(1)

E [Y_{i j}^{(U)} | A_{1}, U_{2}] = {(ā_{j 11}, \dots, ā_{j (n_{j} - 1) (n_{j} - 1)}, ā_{j 12}, \dots, ā_{j 1 n_{j}}, \dots, ā_{j (n_{j} - 1) n_{j}})}^{τ} .

(2)
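The ordering of the components in the two expectation vectors above can be made concrete with a short sketch. It assumes that each component of the 'genotype coding' vector is a 0/1 indicator of the corresponding genotype (which is what expectations equal to genotype frequencies suggest) and that alleles are labelled 1, ..., n_j; the function names are hypothetical.

```python
def haplotype_coding(genotype, n_alleles):
    """'Haplotype/allele coding': z_k = copies of allele k, for k = 1..n-1."""
    return [genotype.count(k) for k in range(1, n_alleles)]

def genotype_coding(genotype, n_alleles):
    """'Genotype coding': indicators ordered as in the expectation vectors.

    First the homozygotes 11, ..., (n-1)(n-1), then the heterozygotes
    12, ..., 1n, ..., (n-1)n. The homozygote nn is the omitted baseline.
    """
    g = tuple(sorted(genotype))
    homo = [int(g == (k, k)) for k in range(1, n_alleles)]
    het = [int(g == (k, l))
           for k in range(1, n_alleles)
           for l in range(k + 1, n_alleles + 1)]
    return homo + het
```

For a marker with three alleles, `genotype_coding((1, 3), 3)` has length (n_j - 1) + n_j(n_j - 1)/2 = 5 with a single 1 in the position of genotype 13, while `haplotype_coding((1, 3), 3)` counts one copy of allele 1 and none of allele 2.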

Using $E [X_{i j}^{(A)} | A_{1}, U_{2}]$ and $E [Y_{i j}^{(U)} | A_{1}, U_{2}]$ , one may calculate the expectation $E [{\bar{X}}^{(A)} - Ȳ^{(U)} | A_{1}, U_{2}] = {(E {[X_{i 1}^{(A)} - Y_{i 1}^{(U)} | A_{1}, U_{2}]}^{τ}, \dots, E {[X_{i J}^{(A)} - Y_{i J}^{(U)} | A_{1}, U_{2}]}^{τ})}^{τ}$ . Let $\Sigma_{g e n o} = Var (X_{i}^{(A)} - Y_{i}^{(U)} | A_{1}, U_{2}) = Var (X_{i}^{(A)} | A_{1}, U_{2}) - Cov (X_{i}^{(A)}, Y_{i}^{(U)} | A_{1}, U_{2}) - Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2}) + Var (Y_{i}^{(U)} | A_{1}, U_{2})$ be the variance-covariance matrix of $X_{i}^{(A)} - Y_{i}^{(U)}$ . Then the non-centrality parameter λ_G of Hotelling's statistic T_G is given by $λ_{G} = N E {[{\bar{X}}^{(A)} - Ȳ^{(U)} | A_{1}, U_{2}]}^{τ} {[\Sigma_{g e n o}]}^{- 1} E [{\bar{X}}^{(A)} - Ȳ^{(U)} | A_{1}, U_{2}]$ . The elements of the above variance-covariance matrices are given in Appendices D and E: $Var (X_{i}^{(A)} | A_{1}, U_{2})$ and $Var (Y_{i}^{(U)} | A_{1}, U_{2})$ in Appendix D, and $Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2})$ in Appendix E.

Appendix A

Consider the 'haplotype/allele coding' method. The variance-covariance matrices are

\begin{gathered} Var (X_{i}^{(A)} | A_{1}, U_{2}) \\ = Var [{(z_{i 11}^{(A)}, \dots, z_{i 1 (n_{1} - 1)}^{(A)}, \dots, z_{i J 1}^{(A)}, \dots, z_{i J (n_{J} - 1)}^{(A)})}^{τ} | A_{1}, U_{2}], \end{gathered}

\begin{gathered} Var [Y_{i}^{(U)} | A_{1}, U_{2}] \\ = Var [{(z_{i 11}^{(U)}, \dots, z_{i 1 (n_{1} - 1)}^{(U)}, \dots, z_{i J 1}^{(U)}, \dots, z_{i J (n_{J} - 1)}^{(U)})}^{τ} | A_{1}, U_{2}] . \end{gathered}

The variance of the number of the alleles H_jk in the affected sibling and unaffected sibling can be calculated as

\begin{array}{l} Var (z_{i j k}^{(A)} | A_{1}, U_{2}) & = E [{(z_{i j k}^{(A)})}^{2} | A_{1}, U_{2}] - {[E (z_{i j k}^{(A)} | A_{1}, U_{2})]}^{2} \\ = 4 a_{j k k} + \sum_{l \neq k} a_{j k l} - {[2 a_{j k k} + \sum_{l \neq k} a_{j k l}]}^{2}, \end{array}

\begin{array}{l} Var (z_{i j k}^{(U)} | A_{1}, U_{2}) & = E [{(z_{i j k}^{(U)})}^{2} | A_{1}, U_{2}] - {[E (z_{i j k}^{(U)} | A_{1}, U_{2})]}^{2} \\ = 4 ā_{j k k} + \sum_{l \neq k} ā_{j k l} - {[2 ā_{j k k} + \sum_{l \neq k} ā_{j k l}]}^{2} . \end{array}

Similarly, the covariance between the number of alleles H_jk and the number of alleles H_jl, l ≠ k, in the affected sibling and unaffected sibling can be calculated as

\begin{gathered} Cov (z_{i j k}^{(A)}, z_{i j l}^{(A)} | A_{1}, U_{2}) \\ = E (z_{i j k}^{(A)} z_{i j l}^{(A)} | A_{1}, U_{2}) \\ - E (z_{i j k}^{(A)} | A_{1}, U_{2}) E (z_{i j l}^{(A)} | A_{1}, U_{2}) \\ = P (G_{i j}^{(A)} = H_{j k} H_{j l} | A_{1}, U_{2}) \\ - [2 a_{j k k} + \sum_{k^{'} \neq k} a_{j k k^{'}}] [2 a_{j l l} + \sum_{l^{'} \neq l} a_{j l l^{'}}] \\ = a_{j k l} - [2 a_{j k k} + \sum_{k^{'} \neq k} a_{j k k^{'}}] [2 a_{j l l} + \sum_{l^{'} \neq l} a_{j l l^{'}}], \end{gathered}

\begin{gathered} Cov (z_{i j k}^{(U)}, z_{i j l}^{(U)} | A_{1}, U_{2}) \\ = E (z_{i j k}^{(U)} z_{i j l}^{(U)} | A_{1}, U_{2}) \\ - E (z_{i j k}^{(U)} | A_{1}, U_{2}) E (z_{i j l}^{(U)} | A_{1}, U_{2}) \\ = ā_{j k l} - [2 ā_{j k k} + \sum_{k^{'} \neq k} ā_{j k k^{'}}] [2 ā_{j l l} + \sum_{l^{'} \neq l} ā_{j l l^{'}}] . \end{gathered}
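The within-marker variance and covariance formulas above depend only on the genotype frequencies of the sibling in question. A minimal sketch, assuming the frequencies are supplied as a dictionary keyed by sorted allele pairs (a layout chosen for this example, not taken from the paper):

```python
def allele_count_moments(geno_freq, k, l):
    """Var(z_k) and Cov(z_k, z_l), k != l, from single-marker genotype frequencies.

    geno_freq : {(a, b): frequency} with a <= b, e.g. the a_jkl (affected sib)
                or the abar_jkl (unaffected sib).
    """
    def het_sum(allele):
        # Sum of heterozygote frequencies that contain the allele
        return sum(p for (a, b), p in geno_freq.items()
                   if a != b and allele in (a, b))

    def mean(allele):
        # E(z) = 2 * homozygote frequency + heterozygote frequencies
        return 2.0 * geno_freq.get((allele, allele), 0.0) + het_sum(allele)

    # E(z_k^2) = 4 * a_kk + sum of heterozygote frequencies
    var_k = 4.0 * geno_freq.get((k, k), 0.0) + het_sum(k) - mean(k) ** 2
    # z_k * z_l is non-zero (and equal to 1) only for genotype H_k H_l
    cov_kl = geno_freq.get(tuple(sorted((k, l))), 0.0) - mean(k) * mean(l)
    return var_k, cov_kl
```

As a sanity check, feeding in Hardy-Weinberg genotype frequencies with allele frequencies p_k should give the familiar binomial/multinomial values 2p_k(1 - p_k) and -2p_k p_l.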

For j ≠ g, assume that markers H_j and H_g flank disease locus D in the order H_jDH_g. Let P(H_jkDH_gh) be the frequency of haplotype H_jkDH_gh. The frequencies of other haplotypes are denoted accordingly. For the i-th sib-pair, let $G_{i D}^{(U)}$ be the disease genotype of the unaffected sibling and $G_{i D}^{(A)}$ be the disease genotype of the affected sibling. To calculate the covariance between $z_{i j k}^{(A)}$ and $z_{i g h}^{(A)}$ , denote for j ≠ g, k ≠ k', h ≠ h',

\begin{array}{l} g_{k k h h}^{(A, j g)} & = E [1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} | A_{1}, U_{2}] \\ = P [G_{i j}^{(A)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P [G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i g}^{(A)} = H_{g h} H_{g h}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P [G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i g}^{(A)} = H_{g h} H_{g h}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} f_{s t} {\bar{f}}_{q r} P [G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i g}^{(A)} = H_{g h} H_{g h}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = q r]] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h}) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P (H_{j k} t H_{g h}) P (H_{j k} s H_{g h}) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h}) Ā] / P (A_{1}, U_{2}) \end{array}

\begin{array}{l} g_{k k h h^{'}}^{(A, j g)} & = E [1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} | A_{1}, U_{2}] \\ = P [G_{i j}^{(A)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P [G_{i j}^{(A)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P [G_{i j}^{(A)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} f_{s t} {\bar{f}}_{q r} P [G_{i j}^{(A)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = q r]] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} (P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k} s H_{g h^{'}})) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} (P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k} s H_{g h^{'}})) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} (P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k} s H_{g h^{'}})) Ā] / P (A_{1}, U_{2}) \end{array}

\begin{array}{l} g_{k k^{'} h h}^{(A, j g)} & = E [1_{(G_{i j}^{(A)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} | A_{1}, U_{2}] \\ = P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} f_{s t} {\bar{f}}_{q r} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(A)} = s t, G_{i D}^{(U)} = q r]] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h})) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h})) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h})) Ā] / P (A_{1}, U_{2}) \end{array}

\begin{array}{l} g_{k k^{'} h h^{'}}^{(A, j g)} & = E [1_{(G_{i j}^{(A)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} | A_{1}, U_{2}] \\ = P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, \\ G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, \\ G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} f_{s t} {\bar{f}}_{q r} P [G_{i j}^{(A)} = H_{j k} H_{j k^{'}}, \\ G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, G_{i D}^{(A)} = s t, G_{i D}^{(U)} = q r]] / P (A_{1}, U_{2}) \\ = [\frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} {\bar{f}}_{s t} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h^{'}}) \\ + P (H_{j k} s H_{g h^{'}}) P (H_{j k^{'}} t H_{g h}) + P (H_{j k} t H_{g h^{'}}) P (H_{j k^{'}} s H_{g h})) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} f_{s t} {\bar{f}}_{t q} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h^{'}}) \\ + P (H_{j k} s H_{g h^{'}}) P (H_{j k^{'}} t H_{g h}) + P (H_{j k} t H_{g h^{'}}) P (H_{j k^{'}} s H_{g h})) P_{q} \\ + \frac{1}{4} \sum_{s, t \in \{D, d\}} f_{s t} (P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h^{'}}) + P (H_{j k} s H_{g h^{'}}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h^{'}}) P (H_{j k^{'}} s H_{g h})) Ā] / P (A_{1}, U_{2}) . \end{array}

For k = 1,..., n_j - 1 and h = 1,..., n_g - 1, j ≠ g, the covariance in the affected sibling is

\begin{gathered} Cov (z_{i j k}^{(A)}, z_{i g h}^{(A)} | A_{1}, U_{2}) \\ = E [z_{i j k}^{(A)} z_{i g h}^{(A)} | A_{1}, U_{2}] - E [z_{i j k}^{(A)} | A_{1}, U_{2}] E [z_{i g h}^{(A)} | A_{1}, U_{2}] \\ = 4 g_{k k h h}^{(A, j g)} + 2 \sum_{h^{'} \neq h} g_{k k h h^{'}}^{(A, j g)} + 2 \sum_{k^{'} \neq k} g_{k k^{'} h h}^{(A, j g)} \\ + \sum_{k^{'} \neq k} \sum_{h^{'} \neq h} g_{k k^{'} h h^{'}}^{(A, j g)} - [2 a_{j k k} + \sum_{k^{'} \neq k} a_{j k k^{'}}] [2 a_{g h h} + \sum_{h^{'} \neq h} a_{g h h^{'}}] . \end{gathered}

Similarly, for k = 1,..., n_j - 1 and h = 1,..., n_g - 1, j ≠ g, the covariance

\begin{gathered} Cov (z_{i j k}^{(U)}, z_{i g h}^{(U)} | A_{1}, U_{2}) \\ = E [z_{i j k}^{(U)} z_{i g h}^{(U)} | A_{1}, U_{2}] - E [z_{i j k}^{(U)} | A_{1}, U_{2}] E [z_{i g h}^{(U)} | A_{1}, U_{2}] \\ = 4 ḡ_{k k h h}^{(U, j g)} + 2 \sum_{h^{'} \neq h} ḡ_{k k h h^{'}}^{(U, j g)} + 2 \sum_{k^{'} \neq k} ḡ_{k k^{'} h h}^{(U, j g)} \\ + \sum_{k^{'} \neq k} \sum_{h^{'} \neq h} ḡ_{k k^{'} h h^{'}}^{(U, j g)} - [2 ā_{j k k} + \sum_{k^{'} \neq k} ā_{j k k^{'}}] [2 ā_{g h h} + \sum_{h^{'} \neq h} ā_{g h h^{'}}] . \end{gathered}

where $ḡ_{k k h h}^{(U, j g)}$ , $ḡ_{k k h h^{'}}^{(U, j g)}$ , $ḡ_{k k^{'} h h}^{(U, j g)}$ and $ḡ_{k k^{'} h h^{'}}^{(U, j g)}$ are the expected genotype frequencies in the normal sibling as follows:

\begin{array}{l} ḡ_{k k h h}^{(U, j g)} & = E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(U)} = H_{g h} H_{g h})} | A_{1}, U_{2}], \\ ḡ_{k k h h^{'}}^{(U, j g)} & = E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(U)} = H_{g h} H_{g h^{'}})} | A_{1}, U_{2}], \\ ḡ_{k k^{'} h h}^{(U, j g)} & = E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(U)} = H_{g h} H_{g h})} | A_{1}, U_{2}], \\ ḡ_{k k^{'} h h^{'}}^{(U, j g)} & = E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(U)} = H_{g h} H_{g h^{'}})} | A_{1}, U_{2}] . \end{array}

To calculate $ḡ_{k k h h}^{(U, j g)}$ , $ḡ_{k k h h^{'}}^{(U, j g)}$ , $ḡ_{k k^{'} h h}^{(U, j g)}$ and $ḡ_{k k^{'} h h^{'}}^{(U, j g)}$ , one may use the formulae of $g_{k k h h}^{(A, j g)}$ , $g_{k k h h^{'}}^{(A, j g)}$ , $g_{k k^{'} h h}^{(A, j g)}$ and $g_{k k^{'} h h^{'}}^{(A, j g)}$ by substituting f_st with ${\bar{f}}_{s t}$ and vice versa (and hence Ā with A).
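The weights 4, 2, 2 and 1 in the between-marker covariance are simply the products of the allele counts at the two markers. The sketch below computes the same covariance directly from a joint two-marker genotype distribution; the dictionary layout and the function name `allele_count_cov` are assumptions of this example, and the joint frequencies would in practice come from the g (or ḡ) quantities defined above.

```python
def allele_count_cov(joint_freq, k, h):
    """Cov(z_jk, z_gh) from a joint genotype distribution at markers j and g.

    joint_freq : {(genotype_j, genotype_g): probability}; each genotype is a
                 tuple of two allele labels, e.g. ((1, 2), (1, 1)).
    """
    e_prod = e_j = e_g = 0.0
    for (geno_j, geno_g), prob in joint_freq.items():
        z_j = geno_j.count(k)      # 2 for a homozygote, 1 for a heterozygote, else 0
        z_g = geno_g.count(h)
        e_prod += z_j * z_g * prob  # contributes with weight 4, 2, 2 or 1
        e_j += z_j * prob
        e_g += z_g * prob
    return e_prod - e_j * e_g
```

If the two markers are independent the covariance is zero, and if the second marker is an exact copy of the first it reduces to the single-marker variance of the allele count.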

Appendix B

The conditional covariance

\begin{array}{l} Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2}) & = E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] \\ - E [Y_{i}^{(U)} | A_{1}, U_{2}] E [X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] \\ = \frac{E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} 1_{A_{1}} 1_{U_{2}}]}{P (A_{1}, U_{2})} \\ - E [Y_{i}^{(U)} | A_{1}, U_{2}] E [X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] . \end{array}

For the 'haplotype/allele coding' method, the expectations $E [Y_{i}^{(U)} | A_{1}, U_{2}]$ and $E [X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}]$ are given by two quantities $E (z_{i j k}^{(A)} | A_{1}, U_{2})$ and $E (z_{i j k}^{(U)} | A_{1}, U_{2})$ (see Appendix to the paper). To get $E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} 1_{A_{1}} 1_{U_{2}}]$ , we will calculate $E [z_{i j k}^{(U)} z_{i j k}^{(A)} 1_{A_{1}} 1_{U_{2}}]$ and $E [z_{i j k}^{(U)} z_{i j l}^{(A)} 1_{A_{1}} 1_{U_{2}}]$ , l ≠ k in this Appendix. In Appendix C, we will calculate the expectation $E [z_{i j k}^{(U)} z_{i g h}^{(A)} 1_{A_{1}} 1_{U_{2}}]$ for j ≠ g. Note that:

\begin{gathered} E [z_{i j k}^{(U)} z_{i j k}^{(A)} 1_{A_{1}} 1_{U_{2}}] \\ = E [(2 \cdot 1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} + \sum_{l \neq k} 1_{(G_{i j}^{(U)} = H_{j k} H_{j l})}) \\ (2 \cdot 1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} + \sum_{l \neq k} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l})}) 1_{A_{1}} 1_{U_{2}}] . \end{gathered}

(3)

Since the siblings can share 2, 1 and 0 genes identical by descent (IBD) at the disease locus D with probabilities 1/4, 1/2 and 1/4, respectively, the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j k} H_{j k}, A_{1}, U_{2}, \\ (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j k} H_{j k}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P (H_{j k} s) P (H_{j k} t) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P (H_{j k} t) P (H_{j k} s) P (H_{j k} q) \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) P (H_{j k} q) P (H_{j k} r) . \end{array}

(4)
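The IBD decomposition used for the expectation just derived can be checked by brute force. The following sketch enumerates all parental haplotype combinations and all transmissions to the two sibs, then compares the result with the closed-form sum over the 2, 1 and 0 IBD classes. It assumes random mating and no recombination between the marker and the disease locus; all function names, the haplotype encoding as (marker allele, disease allele) pairs and the penetrance helper are hypothetical choices for this example.

```python
from itertools import product

def make_penetrance(f):
    """Turn {"DD": .., "Dd": .., "dd": ..} into a symmetric function pen(s, t)."""
    return lambda s, t: f["".join(sorted(s + t))]

def enumerate_sibpair(hap_freq, pen, k):
    """Brute-force P(A1, U2) and P(G_U = H_k H_k, G_A = H_k H_k, A1, U2).

    hap_freq : {(marker_allele, disease_allele): population frequency}
    """
    haps = list(hap_freq)
    p_joint = 0.0
    p_both_kk = 0.0
    for h1, h2, h3, h4 in product(haps, repeat=4):
        weight = hap_freq[h1] * hap_freq[h2] * hap_freq[h3] * hap_freq[h4] / 16.0
        father, mother = (h1, h2), (h3, h4)
        # a, b index the haplotypes passed to the affected sib; c, d to the other sib
        for a, b, c, d in product((0, 1), repeat=4):
            sib_a = (father[a], mother[b])
            sib_u = (father[c], mother[d])
            pr = (weight
                  * pen(sib_a[0][1], sib_a[1][1])
                  * (1.0 - pen(sib_u[0][1], sib_u[1][1])))
            p_joint += pr
            if all(h[0] == k for h in sib_a + sib_u):
                p_both_kk += pr
    return p_joint, p_both_kk

def ibd_formula(hap_freq, pen, k, alleles=("D", "d")):
    """Closed form: weights 1/4, 1/2, 1/4 for 2, 1 and 0 alleles shared IBD."""
    PH = {s: hap_freq[(k, s)] for s in alleles}
    ibd2 = sum((1 - pen(s, t)) * pen(s, t) * PH[s] * PH[t]
               for s, t in product(alleles, repeat=2))
    ibd1 = sum((1 - pen(s, t)) * pen(t, q) * PH[t] * PH[s] * PH[q]
               for s, t, q in product(alleles, repeat=3))
    ibd0 = sum((1 - pen(s, t)) * pen(q, r) * PH[s] * PH[t] * PH[q] * PH[r]
               for s, t, q, r in product(alleles, repeat=4))
    return 0.25 * ibd2 + 0.5 * ibd1 + 0.25 * ibd0
```

With two marker alleles and two disease alleles the enumeration visits only 4⁴ × 16 configurations, so the two functions can be compared directly for any choice of haplotype frequencies and penetrances.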

For l ≠ k, one may calculate the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} \\ = H_{j k} H_{j l}, A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j k} H_{j l}, G_{i D}^{(U)} \\ = s t, G_{i D}^{(A)} = s t] + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} \\ = H_{j k} H_{j l}, G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} \\ = H_{j k} H_{j l}, G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} \cdot 2 P (H_{j k} t) P (H_{j k} s) P (H_{j l} q) \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) [P (H_{j k} q) P (H_{j l} r) \\ + P (H_{j k} r) P (H_{j l} q)] . \end{array}

(5)

Similarly, one has the following expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{A_{1}} 1_{U_{2}}] \\ = \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} \cdot 2 P (H_{j l} s) P (H_{j k} t) P (H_{j k} q) \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j l} t) \\ + P (H_{j k} t) P (H_{j l} s)] P (H_{j k} q) P (H_{j k} r) . \end{array}

(6)

For l ≠ k, one may calculate the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j l}, G_{i j}^{(A)} \\ = H_{j k} H_{j l}, A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j l}, G_{i j}^{(A)} = H_{j k} H_{j l}, G_{i D}^{(U)} \\ = s t, G_{i D}^{(A)} = s t] + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j l}, G_{i j}^{(A)} \\ = H_{j k} H_{j l}, G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} \\ = H_{j k} H_{j l}, G_{i j}^{(A)} = H_{j k} H_{j l}, G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} [P (H_{j k} s) P (H_{j l} t) + P (H_{j k} t) P (H_{j l} s)] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} [P (H_{j k} t) [P (H_{j l} s) P (H_{j l} q)] + P (H_{j l} t) \\ \times [P (H_{j k} s) P (H_{j k} q)]] + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j l} t) \\ + P (H_{j k} t) P (H_{j l} s)] [P (H_{j k} q) P (H_{j l} r) \\ + P (H_{j k} r) P (H_{j l} q)] . \end{array}

(7)

For l₁ ≠ l₂, l₁ ≠ k and l₂ ≠ k, one may calculate the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l_{1}})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l_{2}})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j l_{1}}, G_{i j}^{(A)} \\ = H_{j k} H_{j l_{2}}, A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j l_{1}}, G_{i j}^{(A)} = H_{j k} H_{j l_{2}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j l_{1}}, G_{i j}^{(A)} = H_{j k} H_{j l_{2}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j l_{1}}, G_{i j}^{(A)} = H_{j k} H_{j l_{2}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} \cdot 2 P (H_{j k} t) P (H_{j l_{1}} s) P (H_{j l_{2}} q) \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j l_{1}} t) + P (H_{j k} t) P (H_{j l_{1}} s)] \\ \times [P (H_{j k} q) P (H_{j l_{2}} r) + P (H_{j k} r) P (H_{j l_{2}} q)] . \end{array}

(8)

By using equations (4), (5), (6), (7) and (8), we may calculate $E [z_{i j k}^{(U)} z_{i j k}^{(A)} 1_{A_{1}} 1_{U_{2}}]$ in (3). If k ≠ l, then

\begin{array}{l} E [z_{i j k}^{(U)} z_{i j l}^{(A)} 1_{A_{1}} 1_{U_{2}}] \\ = E [(2 \cdot 1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} + \sum_{m \neq k} 1_{(G_{i j}^{(U)} = H_{j k} H_{j m})}) \\ \times (2 \cdot 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} + \sum_{n \neq l} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})}) 1_{A_{1}} 1_{U_{2}}] \\ = 4 E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ + 2 E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j k})} 1_{A_{1}} 1_{U_{2}}] \\ + 2 \sum_{n \neq k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}] \\ + 2 E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ + 2 \sum_{m \neq k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ + \sum_{m \neq k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j m})} 1_{A_{1}} 1_{U_{2}}] \\ + \sum_{m \neq k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j k})} 1_{A_{1}} 1_{U_{2}}] \\ + \sum_{m \neq k, l} \sum_{n \neq m, k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}] \\ + E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j k})} 1_{A_{1}} 1_{U_{2}}] \\ + \sum_{n \neq k, l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}] . \end{array}

(9)

First, one may calculate the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) P (H_{j l} q) P (H_{j l} r) \end{array}

(10)

For n ≠ k, l, one may have the following expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) [P (H_{j l} q) P (H_{j n} r) \\ + P (H_{j l} r) P (H_{j n} q)] . \end{array}

(11)

For m ≠ k, l, one may have the following expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j l}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j m} t) \\ + P (H_{j k} t) P (H_{j m} s)] P (H_{j l} q) P (H_{j l} r) . \end{array}

(12)

For m ≠ k, l and n ≠ m, k, l, one may have the following expectation:

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j m}, G_{i j}^{(A)} = H_{j l} H_{j n}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j m} t) \\ + P (H_{j k} t) P (H_{j m} s)] [P (H_{j l} q) P (H_{j n} r) \\ + P (H_{j l} r) P (H_{j n} q)] . \end{array}

(13)

Using equations (5), (6), (7), (8), (10), (11), (12) and (13), we may calculate the terms of equation (9).

Appendix C

For j ≠ g, the expectation

\begin{array}{l} E [z_{i j k}^{(U)} z_{i g h}^{(A)} 1_{A_{1}} 1_{U_{2}}] \\ = E [(2 \cdot 1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} + \sum_{k^{'} \neq k} 1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})}) \\ \times (2 \cdot 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} + \sum_{h^{'} \neq h} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})}) 1_{A_{1}} 1_{U_{2}}] . \end{array}

(14)
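Equation (14) rests on the fact that the number of copies of haplotype H_jk in a genotype equals twice the homozygote indicator plus the sum of the heterozygote indicators. The short Python sketch below checks that identity by enumeration. The function names and the integer labelling of haplotypes are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations_with_replacement

def allele_count(genotype, k):
    """Count copies of haplotype k in an unordered genotype (a, b)."""
    a, b = genotype
    return (a == k) + (b == k)

def allele_count_via_indicators(genotype, k, n_haplotypes):
    """Same count written as in equation (14): 2 * 1(hom) + sum of 1(het)."""
    g = tuple(sorted(genotype))
    hom = 1 if g == (k, k) else 0
    het = sum(1 for k2 in range(n_haplotypes)
              if k2 != k and g == tuple(sorted((k, k2))))
    return 2 * hom + het

# Hypothetical marker with three haplotypes labelled 0, 1, 2.
n = 3
genotypes = list(combinations_with_replacement(range(n), 2))
agree = all(allele_count(g, k) == allele_count_via_indicators(g, k, n)
            for g in genotypes for k in range(n))
```

Multiplying two such sums, one for the normal sib at marker j and one for the affected sib at marker g, gives the four kinds of indicator products that are evaluated in equations (15) to (18).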

Suppose that blocks/markers H_j and H_g flank disease locus D in the order H_jDH_g. The expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h}) \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P (H_{j k} t H_{g h}) P (H_{j k} s) P (q H_{g h}) \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) P (q H_{g h}) P (r H_{g h}) . \end{array}

(15)

If h' ≠ h, the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} [P (H_{j k} s H_{g h}) P (H_{j k} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k} s H_{g h^{'}})] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} [P (H_{j k} t H_{g h}) P (H_{j k} s) P (q H_{g h^{'}}) \\ + P (H_{j k} t H_{g h^{'}}) P (H_{j k} s) P (q H_{g h})] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P (H_{j k} s) P (H_{j k} t) [P (q H_{g h}) P (r H_{g h^{'}}) \\ + P (r H_{g h}) P (q H_{g h^{'}})] . \end{array}

(16)

If k ≠ k', the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} [P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h})] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} [P (H_{j k} t H_{g h}) P (H_{j k^{'}} s) P (q H_{g h}) \\ + P (H_{j k^{'}} t H_{g h}) P (H_{j k} s) P (q H_{g h})] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j k^{'}} t) + P (H_{j k} t) P (H_{j k^{'}} s)] \\ \times P (q H_{g h}) P (r H_{g h}) . \end{array}

(17)

If k ≠ k' and h ≠ h', the expectation

\begin{array}{l} E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} 1_{A_{1}} 1_{U_{2}}] \\ = P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ A_{1}, U_{2}, (2 IBD) \cup (1 IBD) \cup (0 IBD)] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = s t] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = t q] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} P [G_{i j}^{(U)} = H_{j k} H_{j k^{'}}, G_{i g}^{(A)} = H_{g h} H_{g h^{'}}, \\ G_{i D}^{(U)} = s t, G_{i D}^{(A)} = q r] \\ = \frac{1}{4} \sum_{s, t \in \{D, d\}} {\bar{f}}_{s t} f_{s t} [P (H_{j k} s H_{g h}) P (H_{j k^{'}} t H_{g h^{'}}) \\ + P (H_{j k} t H_{g h}) P (H_{j k^{'}} s H_{g h^{'}}) + P (H_{j k} s H_{g h^{'}}) P (H_{j k^{'}} t H_{g h}) \\ + P (H_{j k} t H_{g h^{'}}) P (H_{j k^{'}} s H_{g h})] \\ + \frac{1}{2} \sum_{s, t, q \in \{D, d\}} {\bar{f}}_{s t} f_{t q} [P (H_{j k} t H_{g h}) P (H_{j k^{'}} s) P (q H_{g h^{'}}) \\ + P (H_{j k^{'}} t H_{g h}) P (H_{j k} s) P (q H_{g h^{'}}) + P (H_{j k} t H_{g h^{'}}) P (H_{j k^{'}} s) P (q H_{g h}) \\ + P (H_{j k^{'}} t H_{g h^{'}}) P (H_{j k} s) P (q H_{g h})] \\ + \frac{1}{4} \sum_{s, t, q, r \in \{D, d\}} {\bar{f}}_{s t} f_{q r} [P (H_{j k} s) P (H_{j k^{'}} t) \\ + P (H_{j k} t) P (H_{j k^{'}} s)] [P (q H_{g h}) P (r H_{g h^{'}}) + P (r H_{g h}) P (q H_{g h^{'}})] . \end{array}

(18)

Appendix D

For the 'genotype coding' method, the coding vector of the affected sibling in the i-th sib-pair is $X_{i j}^{(A)} = {(x_{i j 1}^{(A)}, \dots, x_{i j (n_{j} - 1)}^{(A)}, x_{i j 12}^{(A)}, \dots, x_{i j 1 n_{j}}^{(A)}, \dots, x_{i j (n_{j} - 1) n_{j}}^{(A)})}^{τ}$, j = 1,..., J. Similarly, $Y_{i j}^{(U)} = {(x_{i j 1}^{(U)}, \dots, x_{i j (n_{j} - 1)}^{(U)}, x_{i j 12}^{(U)}, \dots, x_{i j 1 n_{j}}^{(U)}, \dots, x_{i j (n_{j} - 1) n_{j}}^{(U)})}^{τ}$, j = 1,..., J, is the coding vector of the normal sibling in the i-th sib-pair. Using the expectations $E [X_{i j}^{(A)} | A_{1}, U_{2}]$ and $E [Y_{i j}^{(U)} | A_{1}, U_{2}]$ given in equations (1) and (2), one may calculate the following variance-covariance matrices:

\begin{gathered} Var (X_{i j}^{(A)} | A_{1}, U_{2}) \\ = \mathrm{diag} (a_{j 11}, \dots, a_{j (n_{j} - 1) (n_{j} - 1)}, a_{j 12}, \dots, a_{j 1 n_{j}}, \dots, a_{j (n_{j} - 1) n_{j}}) \\ - E [X_{i j}^{(A)} | A_{1}, U_{2}] E {[X_{i j}^{(A)} | A_{1}, U_{2}]}^{τ}, \\ Var (Y_{i j}^{(U)} | A_{1}, U_{2}) \\ = \mathrm{diag} (\bar{a}_{j 11}, \dots, \bar{a}_{j (n_{j} - 1) (n_{j} - 1)}, \bar{a}_{j 12}, \dots, \bar{a}_{j 1 n_{j}}, \dots, \bar{a}_{j (n_{j} - 1) n_{j}}) \\ - E [Y_{i j}^{(U)} | A_{1}, U_{2}] E {[Y_{i j}^{(U)} | A_{1}, U_{2}]}^{τ} . \end{gathered}

(19)
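The diagonal-minus-outer-product form of equation (19) is what one obtains for a vector of indicators of mutually exclusive events. The sketch below illustrates this in Python, assuming (as the form of equation (19) suggests) that each entry of the genotype coding vector is the indicator of one genotype. The probabilities are hypothetical values chosen only for illustration, and `indicator_covariance` is an illustrative name.

```python
def indicator_covariance(probs):
    """Covariance of mutually exclusive indicators: diag(p) - p p^T."""
    n = len(probs)
    return [[(probs[r] if r == c else 0.0) - probs[r] * probs[c]
             for c in range(n)] for r in range(n)]

# Hypothetical conditional genotype probabilities. One genotype is left out
# of the coding vector, so the listed probabilities sum to less than 1.
a = [0.2, 0.3, 0.1]
cov = indicator_covariance(a)
```

Each diagonal entry is p(1 - p), and each off-diagonal entry is negative because two different genotypes cannot occur together at the same marker.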

The covariances between x_ijk, x_ijkk' and x_igh, x_ighh' are given by

\begin{array}{ll} Cov (x_{i j k}^{(A)}, x_{i g h}^{(A)} | A_{1}, U_{2}) & = g_{k k h h}^{(A, j g)} - a_{j k k} a_{g h h}, \\ Cov (x_{i j k}^{(A)}, x_{i g h h^{'}}^{(A)} | A_{1}, U_{2}) & = g_{k k h h^{'}}^{(A, j g)} - a_{j k k} a_{g h h^{'}}, \\ Cov (x_{i j k k^{'}}^{(A)}, x_{i g h}^{(A)} | A_{1}, U_{2}) & = g_{k k^{'} h h}^{(A, j g)} - a_{j k k^{'}} a_{g h h}, \\ Cov (x_{i j k k^{'}}^{(A)}, x_{i g h h^{'}}^{(A)} | A_{1}, U_{2}) & = g_{k k^{'} h h^{'}}^{(A, j g)} - a_{j k k^{'}} a_{g h h^{'}} . \end{array}

(20)

Similarly,

\begin{array}{ll} Cov (x_{i j k}^{(U)}, x_{i g h}^{(U)} | A_{1}, U_{2}) & = \bar{g}_{k k h h}^{(U, j g)} - \bar{a}_{j k k} \bar{a}_{g h h}, \\ Cov (x_{i j k}^{(U)}, x_{i g h h^{'}}^{(U)} | A_{1}, U_{2}) & = \bar{g}_{k k h h^{'}}^{(U, j g)} - \bar{a}_{j k k} \bar{a}_{g h h^{'}}, \\ Cov (x_{i j k k^{'}}^{(U)}, x_{i g h}^{(U)} | A_{1}, U_{2}) & = \bar{g}_{k k^{'} h h}^{(U, j g)} - \bar{a}_{j k k^{'}} \bar{a}_{g h h}, \\ Cov (x_{i j k k^{'}}^{(U)}, x_{i g h h^{'}}^{(U)} | A_{1}, U_{2}) & = \bar{g}_{k k^{'} h h^{'}}^{(U, j g)} - \bar{a}_{j k k^{'}} \bar{a}_{g h h^{'}} . \end{array}

(21)

Using results of equations (19), (20) and (21), one may calculate $Var (X_{i}^{(A)} | A_{1}, U_{2})$ and $Var (Y_{i}^{(U)} | A_{1}, U_{2})$ for the 'genotype coding' method.

Appendix E

In this Appendix, we calculate the following covariance matrix for the 'genotype coding' method:

\begin{array}{l} Cov (Y_{i}^{(U)}, X_{i}^{(A)} | A_{1}, U_{2}) \\ = E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] - E [Y_{i}^{(U)} | A_{1}, U_{2}] E [X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] \\ = \frac{E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} 1_{A_{1}} 1_{U_{2}}]}{P (A_{1}, U_{2})} - E [Y_{i}^{(U)} | A_{1}, U_{2}] E [X_{i}^{{(A)}^{τ}} | A_{1}, U_{2}] . \end{array}

The probability P(A₁, U₂) is given in the Appendix to the manuscript, and the components of the expectations $E [X_{i}^{(A)} | A_{1}, U_{2}]$ and $E [Y_{i}^{(U)} | A_{1}, U_{2}]$ are given in equations (1) and (2). For $E [Y_{i}^{(U)} X_{i}^{{(A)}^{τ}} 1_{A_{1}} 1_{U_{2}}]$, we note the following results:

The expectation $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{A_{1}} 1_{U_{2}}]$ is given by (4); for l ≠ k, the expectation $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l})} 1_{A_{1}} 1_{U_{2}}]$ is given by (5); for l ≠ k, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j k})} 1_{A_{1}} 1_{U_{2}}]$ is given by (6); for l ≠ k, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l})} 1_{A_{1}} 1_{U_{2}}]$ is given by (7); for l₁ ≠ l₂, l₁ ≠ k, l₂ ≠ k, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j l_{1}})} 1_{(G_{i j}^{(A)} = H_{j k} H_{j l_{2}})} 1_{A_{1}} 1_{U_{2}}]$ is given by (8); for l ≠ k, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}]$ is given by (10); for l ≠ k, n ≠ k, l, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}]$ is given by (11); for l ≠ k, m ≠ k, l, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j l})} 1_{A_{1}} 1_{U_{2}}]$ is given by (12); for l ≠ k, m ≠ k, l, n ≠ m, k, l, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j m})} 1_{(G_{i j}^{(A)} = H_{j l} H_{j n})} 1_{A_{1}} 1_{U_{2}}]$ is given by (13). In addition, for j ≠ g, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} 1_{A_{1}} 1_{U_{2}}]$ is given by (15); $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} 1_{A_{1}} 1_{U_{2}}]$ is given by (16); $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h})} 1_{A_{1}} 1_{U_{2}}]$ is given by (17); finally, $E [1_{(G_{i j}^{(U)} = H_{j k} H_{j k^{'}})} 1_{(G_{i g}^{(A)} = H_{g h} H_{g h^{'}})} 1_{A_{1}} 1_{U_{2}}]$ is given by (18).
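Once the joint expectations listed above are available, each entry of the cross-covariance matrix follows from the identity at the start of this Appendix. The Python sketch below applies that identity to a two-entry coding vector. All numbers are hypothetical and serve only to show the arithmetic, and `cross_covariance` is an illustrative name rather than anything defined in the paper.

```python
def cross_covariance(joint_mass, p_event, mean_u, mean_a):
    """Cov(Y, X | event) = E[Y X^T 1_event] / P(event) - E[Y | event] E[X | event]^T."""
    return [[joint_mass[r][c] / p_event - mean_u[r] * mean_a[c]
             for c in range(len(mean_a))] for r in range(len(mean_u))]

# Hypothetical inputs for a two-entry coding vector.
p_event = 0.1                                   # stands in for P(A1, U2)
joint_mass = [[0.010, 0.004], [0.006, 0.012]]   # E[y_r x_c 1_{A1} 1_{U2}]
mean_u = [0.25, 0.30]                           # E[Y | A1, U2]
mean_a = [0.35, 0.28]                           # E[X | A1, U2]
cov = cross_covariance(joint_mass, p_event, mean_u, mean_a)
```

With these inputs the (1, 1) entry is 0.010 / 0.1 - 0.25 × 0.35 = 0.0125.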

Supplementary information: Simulation study

In order to evaluate the accuracy of the non-centrality parameter approximations, we performed simulations for the power curves in Figures 1, 2, 3 and 4 of the paper. To do this, we divided the interval (0, 0.065) (or (0, 0.045)) of the LD measure Δ₁₁ uniformly into 20 subintervals for Figures 1 and 2 (or Figures 3 and 4). The 20 subintervals have 21 endpoints. For each endpoint, there is a set of parameters for each power curve. Using this set of parameters, 2,500 datasets were simulated for each endpoint. For each dataset, the empirical statistics T_H, T_G, T_H1 and T_G1 were calculated. The simulated power is the proportion of the 2,500 simulated datasets for which the empirical statistic is larger than the cut-off point of the corresponding χ²-distribution at the 0.05 significance level.
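The simulated power described above is a rejection proportion over repeated datasets. The Python sketch below shows that calculation in a deliberately simplified setting: the within-pair differences are drawn from a bivariate normal distribution as a stand-in for the genotype data simulated in the paper, the statistic is a generic paired Hotelling-type quadratic form, and the number of pairs, the shift and the seed are hypothetical choices. It is not the authors' simulation program.

```python
import math
import random

# Exact 0.95 quantile of the chi-square distribution with 2 degrees of freedom.
CUTOFF_DF2 = -2.0 * math.log(0.05)

def paired_t2(diffs):
    """Paired statistic N * dbar' S^{-1} dbar for two-dimensional differences."""
    n = len(diffs)
    m0 = sum(d[0] for d in diffs) / n
    m1 = sum(d[1] for d in diffs) / n
    s00 = sum((d[0] - m0) ** 2 for d in diffs) / (n - 1)
    s11 = sum((d[1] - m1) ** 2 for d in diffs) / (n - 1)
    s01 = sum((d[0] - m0) * (d[1] - m1) for d in diffs) / (n - 1)
    det = s00 * s11 - s01 * s01
    quad = (s11 * m0 * m0 - 2.0 * s01 * m0 * m1 + s00 * m1 * m1) / det
    return n * quad

def simulated_power(shift, n_pairs=100, n_datasets=2500, seed=1):
    """Proportion of simulated datasets whose statistic exceeds the cut-off."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        diffs = []
        for _ in range(n_pairs):
            e0 = rng.gauss(0.0, 1.0)
            e1 = 0.5 * e0 + rng.gauss(0.0, 1.0)  # correlated second component
            diffs.append((shift + e0, e1))
        if paired_t2(diffs) > CUTOFF_DF2:
            hits += 1
    return hits / n_datasets

null_rate = simulated_power(0.0)   # empirical type I error rate
power = simulated_power(0.2)       # empirical power under a mean shift
```

With a zero shift the proportion estimates the type I error rate, which should lie near the nominal 0.05 level. A non-zero shift plays the role of a non-zero LD measure.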

From Figures 1, 2, 3 and 4, it can be seen that the theoretical power curves of T_H, T_G, T_H1 and T_G1 are very close to the simulated power curves. Thus, the non-centrality parameter approximations are accurate.
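A theoretical power value of the kind compared here is the probability that a non-central χ² variable exceeds the central χ² cut-off. The Python sketch below evaluates that probability for 2 degrees of freedom, using the standard representation of the non-central χ² distribution as a Poisson mixture of central χ² distributions. The restriction to 2 degrees of freedom, the function names and the example value of the non-centrality parameter are assumptions made to keep the example short. The paper's statistics generally have other degrees of freedom.

```python
import math

def chi2_sf_even(x, df):
    """Survival function of a central chi-square with even degrees of freedom."""
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= half / i
        total += term
    return math.exp(-half) * total

def noncentral_power_df2(lam, alpha=0.05, n_terms=200):
    """P(chi2_2(lam) > cut-off) as a Poisson(lam / 2) mixture of central terms."""
    cutoff = -2.0 * math.log(alpha)        # exact quantile for 2 df
    weight = math.exp(-lam / 2.0)          # Poisson probability of 0
    power = 0.0
    for j in range(n_terms):
        power += weight * chi2_sf_even(cutoff, 2 + 2 * j)
        weight *= (lam / 2.0) / (j + 1)    # next Poisson probability
    return power

size = noncentral_power_df2(0.0)    # reduces to the significance level
power = noncentral_power_df2(5.0)   # hypothetical non-centrality parameter
```

When the non-centrality parameter is zero the function returns the significance level itself, which gives a quick check of the implementation.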

Figure 1. The simulated power curves of T_H, T_G, T_H1 and T_G1 are plotted. The corresponding parameters are the same as those in Figure 1 of the paper. Abbreviation: LD = linkage disequilibrium.

Figure 2. The simulated power curves of T_H, T_G, T_H1 and T_G1 are plotted. The corresponding parameters are the same as those in Figure 2 of the paper. Abbreviation: LD = linkage disequilibrium.

Figure 3. The simulated power curves of T_H and T_G are plotted. The corresponding parameters are the same as those in Figure 3 of the paper. Abbreviation: LD = linkage disequilibrium.

Figure 4. The simulated power curves of T_H and T_G are plotted. The corresponding parameters are the same as those in Figure 4 of the paper. Abbreviation: LD = linkage disequilibrium.


Acknowledgements

M. Knapp was supported by grant KN 370/1-1 (Project D1 of FOR 423) from the Deutsche Forschungsgemeinschaft. R. Fan was supported by the National Science Foundation Grant DMS-0505025.

Author information

Authors and Affiliations

Department of Statistics, The Texas A&M University, 447 Blocker Building, College Station, Texas, 77843-3143, USA
Ruzong Fan
Institute of Medical Biometry, Informatics and Epidemiology, University of Bonn, Sigmund Freud Strasse 25, D-53105, Bonn, Germany
Michael Knapp

Corresponding author

Correspondence to Ruzong Fan.

About this article

Cite this article

Fan, R., Knapp, M. Sibship T²association tests of complex diseases for tightly linked markers. Hum Genomics 2, 90 (2005). https://doi.org/10.1186/1479-7364-2-2-90

Received: 28 April 2005
Accepted: 28 April 2005
Published: 01 June 2005
DOI: https://doi.org/10.1186/1479-7364-2-2-90
