Detecting genes contributing to longevity using twin data

Begun, Alexander

doi:10.1186/1479-7364-4-2-73

Primary research
Published: 01 December 2009


Human Genomics volume 4, Article number: 73 (2009)


Abstract

Searching for genes contributing to longevity is a typical task in association analysis. A number of methods can be used for finding this association, ranging from the simplest method based on contingency tables to more complex algorithms that involve demographic data and allow us to estimate the genotype-specific hazard functions. The independence of individuals is the common assumption in all these methods. At the same time, data on related individuals such as twins are often used in genetic studies. This paper proposes an extension of the relative risk model to encompass twin data. We estimate the power and also discuss what happens if we treat the twin data using the univariate model.

Introduction

Most common diseases and traits have a complex structure, for which the phenotype is determined by interactions between genetic and environmental factors. As any individual genetic variant can have a relatively modest effect on a disease or trait, linkage analysis has less power than association analysis. Classical association studies in their simplest form compare the frequency of alleles or genotypes for candidate genes between cases and controls. These candidate genes are usually chosen on the basis of biological hypotheses or from previous linkage analyses.

To identify genes associated with longevity, information on genotype frequencies for two or more age groups is needed. A significant trend of genotype frequencies with age can indicate a gene-longevity association. In the basic 'gene frequency method', only the genotype frequencies in different age groups are compared [1–3]. Some extensions of this method involve the use of demographic information about the population under study and allow the estimation of initial frequencies, relative risks and the age trajectories of mortality for candidate genes. These methods are known in the literature as the 'parametric method', the 'semi-parametric method', the 'non-parametric method' and the 'relative risk method' [4]. The use of these methods, however, has two limitations. First, the initial gene frequencies in all cohorts represented in the study must be the same. Secondly, the mortalities for genotypes must not depend on the birth year of the cohort. In two recently published papers [5, 6], the authors remove the first limitation by assuming a time trend in the genetic frequencies of subsequent birth cohorts. In principle, the time and the cohort covariates influencing mortality can be incorporated into the models too. The flexible parameterisation in the extended relative risk model [6] also allows detection of the antagonistic pleiotropic effect.

The methods mentioned above have been developed for datasets consisting of independent individuals. In this paper, we propose a method for detecting longevity genes in datasets consisting of twin pairs. This method retains all the advantages of the relative risk model for univariate data described previously [6].

Materials and methods

To analyse the gene-longevity association, two datasets are needed: the genotype data and the univariate survival data for the individuals involved in the study. To improve the accuracy and power of the study, the longevity data for twins can additionally be analysed. Denoting longevity and non-longevity alleles at an autosomal locus by a and A, respectively, assume that the frequencies P_g of genotypes AA, aA or Aa, and aa at the moment of birth are P_AA, P_Aa and P_aa, respectively. If the Hardy-Weinberg equilibrium holds, then P_AA = (1-P_a)², P_Aa = 2P_a(1-P_a) and P_aa = P_a², where P_a is the frequency of the allele a at the moment of birth. We parameterise P_a as follows:

P_{a} = 1 - \frac{1}{1 + e^{ν + δ x + R φ(x, x_{0})}}, \quad x = T - t.

(1)

In accordance with (1), the logit of P_a is a linear function of the unknown parameters R, ν and δ, where (R, ν, δ) ranges over ℝ³. This parameterisation includes the sudden change in the allele frequency, on the logit scale, by the value Rφ(x, x₀) in the cohort T - x₀ and the slow linear cohort effect ν + δx of the allele frequency. Here, T stands for the year of data collection, x for the age, and t for the birth year of the cohort. We assume that the value of x₀ is known. The step function φ(x, x₀) is defined by φ(x, x₀) = 1 for 0 ≤ x ≤ x₀ and φ(x, x₀) = 0 for x > x₀.
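As a minimal sketch of parameterisation (1) combined with the Hardy-Weinberg frequencies, the following Python fragment evaluates P_a for a cohort of age x. The function names are illustrative and not part of the original method; the numerical values are the simulation settings quoted in the Results section (ν = -2, δ = 0.005, R = 0.5, x₀ = 50).

```python
import math

def allele_frequency(x, nu, delta, R, x0):
    """P_a from equation (1) for the cohort aged x at data collection."""
    phi = 1.0 if 0 <= x <= x0 else 0.0      # step function phi(x, x0)
    logit = nu + delta * x + R * phi        # logit of P_a is linear in (R, nu, delta)
    return 1.0 - 1.0 / (1.0 + math.exp(logit))

def hardy_weinberg(p_a):
    """Genotype frequencies at birth under Hardy-Weinberg equilibrium."""
    return {"AA": (1 - p_a) ** 2, "Aa": 2 * p_a * (1 - p_a), "aa": p_a ** 2}

# Cohort born in the year of data collection (x = 0).
p_newborn = allele_frequency(0, nu=-2.0, delta=0.005, R=0.5, x0=50)
genotype_freqs = hardy_weinberg(p_newborn)
```

With these settings the newborn cohort has logit(P_a) = -1.5, which reproduces the value P_a ≈ 0.182 reported for the simulations.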

To estimate the genotype frequencies for twin pairs, we need to calculate the bivariate survival functions. One possible approach is to use the correlated gamma-frailty model, which provides simple analytical expressions for the bivariate survival functions [7]. Assume that an individual's instantaneous risk of death μ for genotype g ∈ {aa, Aa, AA} at age x, as measured by the hazard of mortality, is μ(x, Z, g) = Zμ_{0, g}(x), where Z is the gamma-distributed frailty (unobserved risk of mortality) with mean 1 and variance σ², and μ_{0, g}(x) is the baseline hazard. The univariate survival function $S_{g}(x) = E e^{-Z H_{g}(x)} = (1 + σ^{2} H_{g}(x))^{-1/σ^{2}}$ is the Laplace transform of the gamma probability density function evaluated at the cumulative hazard $H_{g}(x) = \int_{0}^{x} μ_{0, g}(t) dt$. For related individuals, we assume that life spans T₁ and T₂ are conditionally independent, given frailties Z₁, Z₂ and genotypes g₁, g₂. In general, frailties Z₁ and Z₂ can have unequal variances. Below, we shall assume, for simplicity, that Z₁ and Z₂ are identically distributed. If Corr(Z₁, Z₂) = ρ, E(Z₁) = E(Z₂) = 1 and Var(Z₁) = Var(Z₂) = σ², then:

\begin{gathered} P\{T_{1} > x_{1}, T_{2} > x_{2}\} = S_{g_{1}, g_{2}}(x_{1}, x_{2}) \\ = \frac{S_{g_{1}}(x_{1})^{1 - ρ} S_{g_{2}}(x_{2})^{1 - ρ}}{\left(S_{g_{1}}(x_{1})^{-σ^{2}} + S_{g_{2}}(x_{2})^{-σ^{2}} - 1\right)^{ρ / σ^{2}}} \end{gathered}

(2)
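The closed form (2) is easy to evaluate once the two marginal survival values are known. The sketch below is illustrative only: the function names and the cumulative hazard values 0.4 and 0.9 are assumptions chosen for the example, not quantities from the study.

```python
def univariate_survival(H, sigma2):
    """S_g(x) = (1 + sigma^2 H_g(x))^(-1/sigma^2) under gamma frailty."""
    return (1.0 + sigma2 * H) ** (-1.0 / sigma2)

def bivariate_survival(S1, S2, rho, sigma2):
    """Equation (2): joint survival expressed through the marginal survivals."""
    denom = (S1 ** (-sigma2) + S2 ** (-sigma2) - 1.0) ** (rho / sigma2)
    return S1 ** (1.0 - rho) * S2 ** (1.0 - rho) / denom

# Hypothetical cumulative hazards for the two twins at ages x1 and x2.
S1 = univariate_survival(0.4, sigma2=1.0)
S2 = univariate_survival(0.9, sigma2=1.0)
joint = bivariate_survival(S1, S2, rho=0.5, sigma2=1.0)
```

Two limiting cases serve as sanity checks: with ρ = 0 the joint survival factorises into S1 · S2, and setting one marginal survival to 1 returns the other marginal.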

Here, S_{g1, g2}(x₁, x₂) is the bivariate survival function at ages x₁ and x₂ for twins with genotypes g₁ and g₂, respectively. We relate cumulative hazard functions with some unknown function H₀(x) as follows:

H_{g} (x) = c_{g} x + a_{g} H_{0} {(x)}^{b_{g}}

(3)

with unknown a_g ≥ 0, b_g ≥ 0 and c_g ≥ 0. Such a parameterisation, where the cumulative hazards H_g(x) rather than the survival functions S_g(x) for different genotypes are parametrically related (eg $S_{g}(x) = S_{0}(x)^{b_{g}}$), is more flexible and allows us to detect the antagonistic pleiotropic effect [6]. Without loss of generality, we can assume that a_AA = b_AA = 1.

For univariate and bivariate survival functions in the whole population, it holds that:

\begin{gathered} S(x) = \sum_{g} P_{g} S_{g}(x), \\ S^{MZ}(x_{1}, x_{2}) = \sum_{g} P_{g, g}^{MZ} S_{g, g}^{MZ}(x_{1}, x_{2}), \\ S^{DZ}(x_{1}, x_{2}) = \sum_{g_{1}, g_{2}} P_{g_{1}, g_{2}}^{DZ} S_{g_{1}, g_{2}}^{DZ}(x_{1}, x_{2}) \end{gathered}

(4)

Here, P_g, $P_{g, g}^{MZ}$ and $P_{g_{1}, g_{2}}^{DZ}$ are the univariate genotype frequencies and the bivariate genotype frequencies for monozygotic (MZ) and dizygotic (DZ) twin pairs, respectively, at the moment of birth. Since the frailty correlation ρ can be different for MZ and DZ twins, we use the upper index MZ or DZ in the notation for bivariate survival. Given the univariate survival S(x) and the parameters, we calculate the baseline cumulative hazard H₀(x) using a simple bisection procedure [6]. For univariate genotype frequencies, we will use the values given above. To calculate the bivariate genotype frequencies, note that for MZ twin pairs, g₁ = g₂ = g and $P_{g, g}^{MZ} = P_{g}$. Assuming independent transmission of the maternal and paternal alleles to the offspring, we can use the standard formulae for DZ twin pairs:

\begin{gathered} P_{aa, aa}^{DZ} = P_{aa}^{2} + (1/2) P_{aa} P_{Aa} + (1/16) P_{Aa}^{2} \\ P_{aa, Aa}^{DZ} = P_{Aa, aa}^{DZ} = (1/2) P_{aa} P_{Aa} + (1/8) P_{Aa}^{2} \\ P_{aa, AA}^{DZ} = P_{AA, aa}^{DZ} = (1/16) P_{Aa}^{2} \\ P_{Aa, AA}^{DZ} = P_{AA, Aa}^{DZ} = (1/2) P_{Aa} P_{AA} + (1/8) P_{Aa}^{2} \\ P_{Aa, Aa}^{DZ} = (1/2) P_{aa} P_{Aa} + 2 P_{aa} P_{AA} + (1/2) P_{Aa} P_{AA} + (1/4) P_{Aa}^{2} \\ P_{AA, AA}^{DZ} = P_{AA}^{2} + (1/2) P_{AA} P_{Aa} + (1/16) P_{Aa}^{2} \end{gathered}

(5)
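The formulae in (5) can be checked numerically: over the nine ordered genotype pairs they should sum to 1, and under Hardy-Weinberg equilibrium each marginal should reproduce the univariate frequency P_g. The sketch below does this for an allele frequency of 0.2, a value chosen only for illustration; the function name is likewise hypothetical.

```python
def dz_pair_frequencies(P):
    """Equation (5): joint genotype frequencies for ordered DZ twin pairs."""
    aa, Aa, AA = P["aa"], P["Aa"], P["AA"]
    freq = {
        ("aa", "aa"): aa**2 + 0.5 * aa * Aa + Aa**2 / 16,
        ("aa", "Aa"): 0.5 * aa * Aa + Aa**2 / 8,
        ("aa", "AA"): Aa**2 / 16,
        ("Aa", "AA"): 0.5 * Aa * AA + Aa**2 / 8,
        ("Aa", "Aa"): 0.5 * aa * Aa + 2 * aa * AA + 0.5 * Aa * AA + Aa**2 / 4,
        ("AA", "AA"): AA**2 + 0.5 * AA * Aa + Aa**2 / 16,
    }
    # The table is symmetric in the two twins: fill in the mirrored pairs.
    for (g1, g2), value in list(freq.items()):
        freq[(g2, g1)] = value
    return freq

p_a = 0.2  # illustrative allele frequency
P = {"AA": (1 - p_a) ** 2, "Aa": 2 * p_a * (1 - p_a), "aa": p_a ** 2}
dz = dz_pair_frequencies(P)
total = sum(dz.values())
marginal = {g: sum(v for (g1, _), v in dz.items() if g1 == g) for g in P}
```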

The frequencies $π_{g, g}^{MZ}(x)$ and $π_{g_{1}, g_{2}}^{DZ}(x)$ of the genotype pairs (g, g) and (g₁, g₂) at any age x for MZ and DZ twin pairs can be calculated from the formulae:

π_{g, g}^{M Z} (x) = \frac{P_{g} S_{g, g}^{M Z} (x, x)}{S^{M Z} (x, x)}

(6)

π_{g_{1}, g_{2}}^{DZ}(x) = \frac{P_{g_{1}, g_{2}}^{DZ} S_{g_{1}, g_{2}}^{DZ}(x, x)}{S^{DZ}(x, x)}

(7)
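The chain from the population survival S(x) to the age-specific genotype frequencies can be sketched as follows: recover H₀(x) from S(x) by bisection using (3) and the first line of (4), then evaluate (6) through (2) with g₁ = g₂ and x₁ = x₂. This is a minimal illustration under stated assumptions, not the implementation used in [6]: the function names, the age x = 80, the allele frequency 0.18 and the "true" baseline value 1.5 are hypothetical, while the genotype parameters follow the simulation settings in the Results section.

```python
def cumulative_hazard(H0, x, par):
    """Equation (3): H_g(x) = c_g x + a_g H0(x)^b_g."""
    return par["c"] * x + par["a"] * H0 ** par["b"]

def survival(H, sigma2):
    return (1.0 + sigma2 * H) ** (-1.0 / sigma2)

def population_survival(H0, x, P, pars, sigma2):
    """First line of equation (4): genotype mixture of univariate survivals."""
    return sum(P[g] * survival(cumulative_hazard(H0, x, pars[g]), sigma2) for g in P)

def baseline_by_bisection(S_target, x, P, pars, sigma2, tol=1e-12):
    """Solve population_survival(H0) = S_target for H0 >= 0.
    Assumes 0 < S_target <= population survival at H0 = 0."""
    lo, hi = 0.0, 1.0
    while population_survival(hi, x, P, pars, sigma2) > S_target:
        hi *= 2.0                      # expand until the root is bracketed
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if population_survival(mid, x, P, pars, sigma2) > S_target:
            lo = mid                   # survival still too high: H0 must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mz_genotype_frequencies(H0, x, P, pars, sigma2, rho_mz):
    """Equation (6), with S_{g,g}^{MZ}(x, x) taken from equation (2)."""
    weights = {}
    for g in P:
        S = survival(cumulative_hazard(H0, x, pars[g]), sigma2)
        S_pair = S ** (2 * (1 - rho_mz)) / (2 * S ** (-sigma2) - 1) ** (rho_mz / sigma2)
        weights[g] = P[g] * S_pair
    total = sum(weights.values())      # equals S^{MZ}(x, x) in equation (4)
    return {g: w / total for g, w in weights.items()}

P = {"AA": 0.82 ** 2, "Aa": 2 * 0.18 * 0.82, "aa": 0.18 ** 2}
pars = {"AA": {"a": 1.0, "b": 1.0, "c": 0.0},
        "Aa": {"a": 0.8, "b": 1.2, "c": 0.0},
        "aa": {"a": 0.8, "b": 1.2, "c": 0.0}}
S_observed = population_survival(1.5, 80.0, P, pars, 1.0)   # pretend S(80) is known
H0_hat = baseline_by_bisection(S_observed, 80.0, P, pars, 1.0)
pi_80 = mz_genotype_frequencies(H0_hat, 80.0, P, pars, 1.0, rho_mz=0.5)
```

Bisection is sufficient here because, for a_g > 0 and b_g > 0, the mixture survival is strictly decreasing in H₀, so the root is unique.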

Assuming that the variance σ² does not depend on genotype and zygosity, we have the following unknown vector parameter:

\begin{gathered} θ = (R, δ, ν, a_{a a}, a_{a A + A a}, b_{a a}, b_{a A + A a}, c_{a a}, c_{a A + A a}, \\ c_{A A}, σ^{2}, ρ_{M Z}, ρ_{D Z}) . \end{gathered}

Here, ρ_MZ and ρ_DZ are the frailty correlations for MZ and DZ twins. We estimate the unknown parameter vector θ by maximising the likelihood function:

\mathrm{Lik}_{g} = \left(\prod_{i = 1}^{N_{g}^{MZ}} π_{g_{i}, g_{i}}^{MZ}(x_{i}, θ)\right) \left(\prod_{i = 1}^{N_{g}^{DZ}} π_{g_{i1}, g_{i2}}^{DZ}(x_{i}, θ)\right)

(8)

(the maximum likelihood estimates [MLE]), where x_i is the age of twin pair i at the moment of data collection, and $N_{g}^{MZ}$ and $N_{g}^{DZ}$ are the observed numbers of MZ and DZ twin pairs in the genetic dataset (twin pairs with known genotypes and ages), respectively. To choose the optimal model, we can use the likelihood ratio test for nested models and either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) for non-nested models. Under the null hypothesis, we assume that a_aa = a_Aa = b_aa = b_Aa = 1 and c_aa = c_Aa = c_AA = 0. Significant deviation from this hypothesis can indicate a gene-longevity association.

If, in addition to genetic data, the data on the longevity of related individuals such as twins are also available, we can use this information to improve the accuracy of statistical estimates and to increase the power. Denote the life spans of the twin pair i in the demographic dataset by (x_i1, x_i2), where $i = 1, \dots, N_{d}^{MZ}$ for MZ twin pairs and $i = 1, \dots, N_{d}^{DZ}$ for DZ twin pairs. We assume that twin pairs in the sample are chosen at random and that all twins are deceased. Although censored data are less informative than non-censored data, they can also be included in the analysis. The bivariate probability density function for a twin pair with longevities x_i1 and x_i2 can be calculated as follows:

\begin{gathered} \frac{\partial^{2} S^{j}(x_{i1}, x_{i2})}{\partial x_{i1} \partial x_{i2}} = \sum_{g_{1}, g_{2}} P_{g_{1}, g_{2}}^{j} \frac{S_{g_{1}, g_{2}}^{j}(x_{i1}, x_{i2})}{S_{g_{1}}(x_{i1}) S_{g_{2}}(x_{i2})} \frac{\partial S_{g_{1}}(x_{i1})}{\partial S(x_{i1})} \\ \times \frac{\partial S_{g_{2}}(x_{i2})}{\partial S(x_{i2})} \frac{\partial S(x_{i1})}{\partial x_{i1}} \frac{\partial S(x_{i2})}{\partial x_{i2}} \\ \times \left( (1 - ρ_{j})^{2} + \frac{ρ_{j} (1 - ρ_{j}) \left(S_{g_{1}}(x_{i1})^{-σ^{2}} + S_{g_{2}}(x_{i2})^{-σ^{2}}\right)}{S_{g_{1}}(x_{i1})^{-σ^{2}} + S_{g_{2}}(x_{i2})^{-σ^{2}} - 1} \right. \\ \left. + \frac{ρ_{j} (ρ_{j} + σ^{2}) S_{g_{1}}(x_{i1})^{-σ^{2}} S_{g_{2}}(x_{i2})^{-σ^{2}}}{\left(S_{g_{1}}(x_{i1})^{-σ^{2}} + S_{g_{2}}(x_{i2})^{-σ^{2}} - 1\right)^{2}} \right) \end{gathered}

(9)

with j = MZ, DZ. We can write the likelihood function for the demographic dataset in the form:

\begin{gathered} \mathrm{Lik}_{d} = \left(\prod_{i = 1}^{N_{d}^{MZ}} \frac{\partial^{2} S^{MZ}(x_{i1}, x_{i2}; θ)}{\partial x_{i1} \partial x_{i2}}\right) \\ \times \left(\prod_{i = 1}^{N_{d}^{DZ}} \frac{\partial^{2} S^{DZ}(x_{i1}, x_{i2}; θ)}{\partial x_{i1} \partial x_{i2}}\right) \end{gathered}

(10)

The unknown parameters can now be found by maximising the joint likelihood function Lik_g × Lik_d.
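The genotype-specific factor in the density (9) follows from differentiating (2) with respect to both marginal survival values: with u = S_{g1}^{-σ²}, v = S_{g2}^{-σ²} and D = u + v - 1, the mixed derivative is S_{g1,g2}/(S_{g1} S_{g2}) multiplied by (1 - ρ)² + ρ(1 - ρ)(u + v)/D + ρ(ρ + σ²)uv/D². The sketch below compares this closed form with a central finite difference; the function names and the evaluation point (0.6, 0.45) are illustrative assumptions.

```python
def joint_survival(S1, S2, rho, sigma2):
    """Equation (2) as a function of the two marginal survival values."""
    D = S1 ** (-sigma2) + S2 ** (-sigma2) - 1.0
    return S1 ** (1 - rho) * S2 ** (1 - rho) * D ** (-rho / sigma2)

def mixed_derivative(S1, S2, rho, sigma2):
    """Closed form of d^2 S_{g1,g2} / (dS_{g1} dS_{g2}) obtained from (2)."""
    u, v = S1 ** (-sigma2), S2 ** (-sigma2)
    D = u + v - 1.0
    bracket = ((1 - rho) ** 2
               + rho * (1 - rho) * (u + v) / D
               + rho * (rho + sigma2) * u * v / D ** 2)
    return joint_survival(S1, S2, rho, sigma2) / (S1 * S2) * bracket

def finite_difference(S1, S2, rho, sigma2, h=1e-4):
    """Central-difference approximation of the same mixed derivative."""
    f = lambda a, b: joint_survival(a, b, rho, sigma2)
    return (f(S1 + h, S2 + h) - f(S1 + h, S2 - h)
            - f(S1 - h, S2 + h) + f(S1 - h, S2 - h)) / (4 * h * h)

analytic = mixed_derivative(0.6, 0.45, rho=0.5, sigma2=1.0)
numeric = finite_difference(0.6, 0.45, rho=0.5, sigma2=1.0)
```

Multiplying this factor by the derivatives of the marginal survival functions with respect to age, as in (9), gives the contribution of one genotype pair to the bivariate density. For ρ = 0 the factor reduces to 1, as expected for independent life spans.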

Results

To carry out the numerical experiments, we used simulated data. To generate datasets with a sample size of $N_{g}^{MZ} = 1000$, $N_{g}^{DZ} = 2000$ for genotype data and of $N_{d}^{MZ} = 150$, $N_{d}^{DZ} = 300$ for longevity data, we assumed that:

The action of the dominant allele a on longevity can be characterised by parameters a_AA = b_AA = 1, c_aa = c_AA = 0, a_aa = a_Aa = 0.8, b_aa = b_Aa = 1.2;
The survival function for genotype AA has the form
\begin{gathered} \tilde{S}(x) = \left(1 + s^{2} \tilde{H}(x)\right)^{-1/s^{2}}, \\ \tilde{H}(x) = \tilde{c} x + \tilde{a} (e^{\tilde{b} x} - 1) / \tilde{b} \end{gathered}
(11)
with ã = 2.5·10⁻⁵, b̃ = 0.1, c̃ = 0 and ln s² = -4.5;

Individual frailties for twins are gamma-distributed, with mean 1 and variance σ² = 1. Frailty correlations ρ_MZ and ρ_DZ are equal to 0.5 and 0.25, respectively;
The Hardy-Weinberg equilibrium at the moment of conception holds. There is no genotype selection before birth;
The slow continuous component of the cohort effect has parameters ν = -2 and δ = 0.005. This corresponds to a frequency P_a ≈ 0.182 for individuals born in year T (the year of data collection) and to a decrease in the frequency of about 0.4 per cent per year. The sudden jump of P_a with parameter R = 0.5 occurred in the cohort T-50;
The birth dates of all twin pairs from the longevity dataset are uniformly distributed over the cohort interval [T-110, T-100]. The ages of the twins from the genetic dataset at the moment of data collection are uniformly distributed over the age interval [0, 105] years.
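The paper does not describe how the twin life spans were generated. One common way to obtain correlated gamma frailties with mean 1, variance σ² and correlation ρ is the shared-component construction Z_i = σ²(Y₀ + Y_i), with independent Y₀ ~ Gamma(ρ/σ²) and Y_i ~ Gamma((1 - ρ)/σ²). The sketch below uses this construction as an assumption and, as a further simplification, draws life spans from a single Gompertz cumulative hazard with ã = 2.5·10⁻⁵ and b̃ = 0.1, ignoring genotype differences. All function names and the seed are hypothetical.

```python
import math
import random

def correlated_frailties(rho, sigma2, rng):
    """Shared-component construction: Z_i = sigma^2 * (Y0 + Y_i), 0 < rho < 1."""
    y0 = rng.gammavariate(rho / sigma2, 1.0)            # component shared by both twins
    y1 = rng.gammavariate((1.0 - rho) / sigma2, 1.0)    # twin-specific components
    y2 = rng.gammavariate((1.0 - rho) / sigma2, 1.0)
    return sigma2 * (y0 + y1), sigma2 * (y0 + y2)

def gompertz_lifespan(z, a, b, rng):
    """Solve z * H(x) = E for x, with H(x) = a (exp(b x) - 1) / b and E ~ Exp(1)."""
    e = rng.expovariate(1.0)
    return math.log(1.0 + b * e / (z * a)) / b

def sample_corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2009)
frailties, lifespans = [], []
for _ in range(20000):                                   # MZ-like pairs: rho = 0.5
    z1, z2 = correlated_frailties(0.5, 1.0, rng)
    frailties.append((z1, z2))
    lifespans.append((gompertz_lifespan(z1, 2.5e-5, 0.1, rng),
                      gompertz_lifespan(z2, 2.5e-5, 0.1, rng)))

z1s = [z[0] for z in frailties]
z2s = [z[1] for z in frailties]
mean_z1 = sum(z1s) / len(z1s)
corr_z = sample_corr(z1s, z2s)
```

Conditional on the frailty Z, the inversion step gives P(T > x | Z) = exp(-Z H(x)), which is the conditional survival assumed in the correlated frailty model.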

Nearly one in every 100 deliveries is a twin birth, and the DZ/MZ ratio is approximately equal to 2. From this, it follows that in a stationary population consisting of 300,000 individuals with crude birth and death rates q₀ equal to 15 per 1,000, the life expectancy at birth e₀ is equal to 1,000/15 ≈ 66.7 years and we will find approximately (1/300) × (300,000 × q₀e₀) = 1,000 MZ and (2/300) × (300,000 × q₀e₀) = 2,000 DZ twin pairs. We will also find 150 MZ and 300 DZ newborn twin pairs over the ten-year cohort interval. Since the influence of a decrease in child mortality before the age of 11-13 years on the univariate survival and, therefore, on selection is relatively small, we have not included this effect in the simulated data. In general, the chosen simulation parameters produce a bivariate lifespan distribution that is similar to the true one.
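The sample sizes follow from simple arithmetic, reproduced here as a short sketch using only the figures quoted above; the variable names are illustrative.

```python
population = 300_000
crude_rate = 15 / 1000                    # births (and deaths) per person per year
life_expectancy = 1 / crude_rate          # about 66.7 years in a stationary population
mz_share, dz_share = 1 / 300, 2 / 300     # 1 twin delivery in 100, DZ/MZ ratio of 2

# Twin pairs present in the stationary population (genetic dataset).
living_mz = mz_share * population * crude_rate * life_expectancy
living_dz = dz_share * population * crude_rate * life_expectancy

# Twin pairs born over a ten-year cohort interval (longevity dataset).
births_10y = population * crude_rate * 10
newborn_mz = mz_share * births_10y
newborn_dz = dz_share * births_10y
```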

The estimates of unknown parameters and of the power for 1,000 simulations are given in Table 1. The power was calculated at the 5 per cent significance level. We have used the bivariate and the univariate models applied to the joint bivariate genetic and longevity data or to the bivariate genetic data only. The age dynamics of the hazard functions for genotypes with/without allele a and the age dynamics of the frequencies for the longevity allele/genotypes with the longevity allele are shown in Figures 1 and 2. To establish how often the true bivariate model applied to the bivariate genetic data turns out to be optimal compared with the false univariate model, we used the likelihood ratio test. Significant differences between these two models at a significance level of p < 0.05 were observed in 100 per cent of cases.

Table 1 Parameter estimates (sample means) and their standard deviations (in brackets) for 1,000 simulations, calculated using the bivariate (univariate) model applied to the joint bivariate genetic and longevity data* (***) or to the bivariate genetic data **(****)


Discussion

The maximum likelihood method yields correct estimates if the model is correctly specified. In this case, the MLE of unknown parameters under certain regularity conditions are asymptotically unbiased, normal and efficient. If we treat the bivariate data in the same way as the univariate data, and the marginal model is correctly specified, then the robust Huber-White 'sandwich' estimator of the covariance matrix of parameter estimates yields an asymptotically consistent covariance matrix [8–10]. As we see in Table 1, there is an increase in statistical power when using the more robust univariate model compared with the bivariate model. Nevertheless, the estimates of parameters a_aa and b_aa for the relative risk of the longevity genotype and the estimate of σ for the standard deviation of frailty are closer to their true values if we use the bivariate model. Including the information on longevity in the dataset, however, can substantially improve statistical estimates, increase the power and decrease the variance. It seems that implementation of the approach based on the more robust univariate model, compared with the bivariate model, is preferable for the sample sizes used in this study. Based on the correlation estimates in the MZ and DZ twins, we are able to estimate the contribution of the candidate gene to the heritability [11]. Under the null hypothesis (no heritability), we put ρ_MZ = ρ_DZ. The effect of antagonistic pleiotropy is clearly seen in Figure 1. The presence of allele a in an individual's genotype guarantees the lower hazard of mortality only up to the age of approximately 76 years. The hazard of individuals with genotype AA is then lower than that of individuals with allele a in the genotype. Similar to the univariate model, the bivariate model effectively identifies not only the slow cohort trend of P_g, including the antagonistic pleiotropic effect, but also the sudden change in this parameter.
As expected, the frequencies of allele a and of the genotypes containing allele a increase continuously in the age intervals [0-50] and [50-80], fall abruptly at the age of 50 and decrease continuously after the age of 80 (see Figure 2). Univariate and bivariate (for twins) genotype frequencies at the longevity locus at the moment of conception depend on the genotype frequencies in the parental population and the transmission probabilities. In the model we have used, two assumptions were made relating to the longevity locus. First, that the Hardy-Weinberg equilibrium holds for the parental population. Secondly, that the segregation ratio does not deviate from 0.5 [12]. In principle, we can dispense with both of these assumptions and include them as null hypotheses in the study. Significant deviation from the null hypotheses can be tested using the likelihood ratio test. Rejection of the hypothesis about the Hardy-Weinberg equilibrium can indicate possible genotype selection during the gestation period. Significant deviation from Mendelian transmission can mean, for example, that longevity is not governed by the alleles at a single locus. Population admixture and stratification can lead to linkage disequilibrium between longevity and marker loci. In such situations, the study may reveal evidence for ('spurious') association with the marker, even if it is unlinked to the longevity locus. If the sub-population factors influencing the allele frequencies in the marker and longevity loci are identified (eg ethnicity, geographical origin, etc), they can be included in the study. Another solution for this problem is to partition the association effects into between- and within-family components [13, 14]. It was shown that admixture impacts the between-family component estimate, and that the within-family component estimate is independent of any 'spurious' effects when samples from a number of population strata are combined.

References

1. De Benedictis G, Carotenuto L, Carrieri G, et al: 'Gene/longevity association studies at four autosomal loci (REN, THO, PARP, SOD2)'. Eur J Hum Genet. 1998, 6: 534-541.
2. Tan Q, Bathum L, Christiansen L, De Benedictis G, et al: 'Logistic regression models for polymorphic and antagonistic pleiotropic gene action on human aging and longevity'. Ann Hum Genet. 2003, 67: 598-607. 10.1046/j.1529-8817.2003.00051.x.
3. Garasto S, Rose G, Derango F, Berardelli M, et al: 'The study of APOA1, APOC3, and APOA4 variability in healthy ageing people reveals another paradox in the oldest old subjects'. Ann Hum Genet. 2003, 67: 54-62. 10.1046/j.1469-1809.2003.00008.x.
4. Yashin AI, De Benedictis G, Vaupel JW, Tan Q, et al: 'Genes, demography, and life span: The contribution of demographic data in genetic studies on aging and longevity'. Am J Hum Genet. 1999, 65: 1178-93. 10.1086/302572.
5. Yashin AI, Arbeev KG, Ukraintseva SV: 'The accuracy of statistical estimates in genetic studies of aging can be significantly improved'. Biogerontology. 2007, 8: 243-255. 10.1007/s10522-006-9072-4.
6. Begun A: 'A modification of the relative risk model with heterogeneity component for detecting genes contributing to longevity'. Ann Hum Genet. 2008, 72: 111-114.
7. Yashin AI, Vaupel JW, Iachine IA: 'Correlated individual frailty: An advantageous approach to survival analysis of bivariate data'. Math Popul Stud. 1995, 5: 145-159. 10.1080/08898489509525394.
8. Huber PJ: 'The behavior of maximum likelihood estimation under nonstandard conditions'. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967, University of California Press, Berkeley, CA, 1: 221-223.
9. White H: 'Maximum likelihood estimation of misspecified models'. Econometrica. 1982, 50: 1-25. 10.2307/1912526.
10. Williams RL: 'A note on robust variance estimation for cluster-correlated data'. Biometrics. 2000, 56: 645-646. 10.1111/j.0006-341X.2000.00645.x.
11. Sham P: Statistics in Human Genetics (Arnold Applications of Statistics Series). 1998, Edward Arnold, London.
12. Lalouel JM, Rao DC, Morton NE, Elston RC: 'A unified model for complex segregation analysis'. Am J Hum Genet. 1983, 35: 816-826.
13. Fulker DW, Cherny SS, Sham PC, Hewitt JK: 'Combined linkage and association analysis for quantitative traits'. Am J Hum Genet. 1999, 64: 259-267. 10.1086/302193.
14. Abecasis GR, Cardon LR, Cookson WOC: 'A general test of association for quantitative traits in nuclear families'. Am J Hum Genet. 2000, 66: 259-292.


Author information

Authors and Affiliations

Institute of Medical Informatics and Statistics, Brunswiker Strasse 10, D-24105, Kiel, Germany
Alexander Begun


Corresponding author

Correspondence to Alexander Begun.


About this article

Cite this article

Begun, A. Detecting genes contributing to longevity using twin data. Hum Genomics 4, 73 (2009). https://doi.org/10.1186/1479-7364-4-2-73


Received: 26 August 2009
Accepted: 26 August 2009
Published: 01 December 2009
DOI: https://doi.org/10.1186/1479-7364-4-2-73
