Abstract
Candidate gene studies are generally motivated by some form of pathway reasoning in the selection of genes to be studied, but seldom has the logic of the approach been carried through to the analysis. Marginal effects of polymorphisms in the selected genes, and occasionally pairwise genegene or geneenvironment interactions, are often presented, but a unified approach to modelling the entire pathway has been lacking. In this review, a variety of approaches to this problem is considered, focusing on hypothesisdriven rather than purely exploratory methods. Empirical modelling strategies are based on hierarchical models that allow prior knowledge about the structure of the pathway and the various reactions to be included as 'prior covariates'. By contrast, mechanistic models aim to describe the reactions through a system of differential equations with rate parameters that can vary between individuals, based on their genotypes. Some ways of combining the two approaches are suggested and Bayesian model averaging methods for dealing with uncertainty about the true model form in either framework is discussed. Biomarker measurements can be incorporated into such analyses, and twophase sampling designs stratified on some combination of disease, genes and exposures can be an efficient way of obtaining data that would be too expensive or difficult to obtain on a full candidate gene sample. The review concludes with some thoughts about potential uses of pathways in genomewide association studies.
Keywords:
colorectal cancer; complex diseases; folate; geneenvironment interactions; genegene interactionsIntroduction
Molecular epidemiology has advanced from testing associations of disease with single polymorphisms, to exhaustive examination of all polymorphisms in a candidate gene using haplotype tagging single nucleotide polymorphisms (SNPs), to studying increasing numbers of candidate genes simultaneously. Often, geneenvironment and genegene interactions are considered at the same time. As the number of main effects and interactions proliferate, there is a growing need for a more systematic approach to model development [1].
In recognition of this need, the American Association for Cancer Research held a special conference [2] in May 2007, bringing together experts in epidemiology, genetics, statistics, computational biology, systems biology, toxicology, bioinformatics and other fields to discuss various multidisciplinary approaches to this problem.
A broad range of exploratory methods have been developed recently for identifying interactions, such as neural nets, classification and regression trees, multifactor dimension reduction, random forests, hierarchical clustering, etc. [37] Our focus here, however, is instead on hypothesisdriven methods based on prior understanding about the structure of biological pathways postulated to be relevant to a particular disease. Our primary purpose is to contrast mechanistic and empirical methods and explore ways of combining the two.
The folate pathway as an example
Folate metabolism provides a rich example to illustrate these challenges. Folate has been implicated in colorectal cancer, [8] coronary heart disease [9] and neural tube defects, [10,11] among other conditions. Several steps in the metabolism of folate could be involved in these various diseases (Figure 1) and could have quite different effects. The pathway is complex, involving 19 enzymes or carrier proteins, with various feedback loops and two main cycles, the folate and the methionine cycles. The former is involved in pyrimidine synthesis through the action of thymidylate synthase (TS), potentially leading to uracil misincorporation into DNA and subsequent DNA damage and repair or misrepair. The latter is involved in DNA methylation through the conversion of Sadenosyl methionine (SAM) to Sadenosyl homocysteine (SAH) by DNAmethyltransferase (DNMT). These two mechanisms in particular have been suggested as important links between folate and carcinogenesis, although other possibilities include purine synthesis (via the aminoimidazolecarboxamide ribonucleotide transferase [AICART] reaction) and homocysteine itself. Because polymorphisms that tend to increase one of these effects may decrease others, their effects on disease endpoints can be quite different, depending on which part of the pathway is more important. A detailed mathematical model for this system has been developed by Nijhout et al. [12,13] and Reed et al., [14,15] based on the equilibrium solution to a set of linked ordinary differential equations for MichaelisMenten kinetics and implemented in software available at http://metabolism.math.duke.edu/ webcite.
Figure 1. Biochemical diagram of folate metabolism (reproduced with permission from Reed et al. [14]). AICART, aminoimidazolecarboxamide ribonucleotide transferase; BHMT, betainehomocysteine methyltransferase; CBS, cystathionine bsynthase; DHFR, dihydrofolate reductase; DNMT, DNAmethyltransferase; dTMP, thymidine monophosphate; FTD, 10formyltetrahydrofolate dehydrogenase; FTS, 10formyltetrahydrofolate synthase; GAR, glycinamide ribotid; GNMT, glycine Nmethyltransferase; HCOOH; formic acid; H_{2}C = O, formaldehyde; HCY, homocysteine; MAT, methionine adenosyl transferase; MS, methionine synthase; MTCH, 5,10methylenetetrahydrofolate cyclohydrolase; MTD, 5,10methylenetetrahydrofolate dehydrogenase; MTHFR, 5,10methylenetetrahydrofolate reductase; NE, navenzymatic; PGT, phosphoribosyl glycinamide transferase; SAH, Sadenosylhomocysteine; SAHH, SAH hydrolase; SAM, Sadenosylmethionine; SHMT, serine hydroxymethyltransferase; THF, tetrahydrofolate; 5mTHF, 5methylTHF; 5,10CH_{2}THF, 5,10methyleneTHF; 10fTHF, 10formylTHF; TS, thymidylate synthase.
To illustrate the various approaches, we simulated some typical data in the form that might be available from a molecular epidemiology study  specifically data on genetic variants, various environmental exposures, a disease outcome or clinical trait, and, possibly, biomarker measurements on some or all subjects. We began with a population of 10,000 individuals with randomly generated values of intracellular folate E_{1 }(the total tetrahydrofolate [THF] concentration in the six compartments forming the closed loop shown on the lefthand side of Figure 1), methionine intake E_{2 }(METin, lognormally distributed) and 14 of the key genes G shown in Figure 1. For each gene, a personspecific value of the corresponding V_{max }was sampled from lognormal distributions with genotypespecific geometric means (GMs = 0.6, 0.8, or 1.1 times the overall GM) and common geometric standard deviations (GSD = 1.1) and K_{m }appropriate for that enzyme (see Table 1 in Reed et al. [14] for reference values for V_{max }and K_{m }for each gene). The differential equations were then evaluated to determine the steadystate solutions for ten intermediate metabolite concentrations and 14 reaction rates for each individual, based on their specific environmental variables and enzyme activity rates. The probability of disease was calculated under a logistic model for each of four scenarios for the causal biological mechanism  homocysteine concentration, the rate of DNA methylation reactions and the rates of purine and pyrimidine synthesis  and a binary disease indicator Y was sampled with the corresponding probability. Only the data on (Y, E, G) were retained from the first 500 cases and 500 controls for the first level of the epidemiological analysis. For some analyses, we also simulated biomarkers [16] on stratified subsamples of these subjects, as will be described later. Various summaries of the correlations among the (X, E, G) values for the remaining 9,000 subjects were deposited into what we shall call the 'external database' for use in constructing priors, as described below (no individual Y data were used for this purpose).
Table 1. Marginal odds ratios (ORs) for the association of each gene with disease under various choices of reaction rates or intermediate metabolite concentrations as the causal risk factor (ORs are expressed relative to the low enzyme activity rate genotype)
Table 1 shows the univariate associations of each gene with disease under each assumption about the causal risk factor. In these simulations, only one of these was taken as causal at a time, each scaled with the relative risk coefficient β = 2.0 per standard deviation of the respective risk factor. When homocysteine concentration was taken as the causal factor, the strongest association was with genetic variation in the cystathionine bsynthase (CBS) and Sadenosylhomocysteine hydrolase (SAHH) genes. The remaining three columns relate to various reaction rates as causal mechanisms. For pyrimidine synthesis (characterised here by the TS reaction rate), the strongest influence was seen for genetic variation in TS and the 5,10methyleneTHF dehydrogenase (MTD) gene. For purine synthesis (reflected in the AICART reaction rate), the strongest associations were with genetic variation in the phosphoribosyl glycinamide transferase (PGT) gene and somewhat weaker for MTD and 5,10methyleneTHF cyclohydrolase (MTCH) and serine hydroxymethyltransferase (SHMT) genes; interestingly, the disease risk is not particularly related to the AICART genotype itself. When DNA methylation (reflected by the DNMT reaction rate) was assumed to be causal, none of the genetic associations were as strong as for the other three causal mechanisms, the strongest being with the 5,10methyleneTHF reductase (MTHFR) gene, SAHH and MTD. Genetic variation in DNMT was not explicitly simulated, but the reaction rates for this enzyme were identical to those for methionine adenosyl transferase (MATII) and SAHH, reflecting a ratelimiting step. Thus, genetic variation in MATII had no effect on risk, the reaction rate being driven entirely by SAHH. Other ratelimited combinations included dihydrofolate reductase (DHFR) with TS, MTD with MTCH, and PGT with AICART. Methionine intake was the strongest environmental exposure factor for the simulation with homocysteine as the causal mechanism, whereas intracellular folate had a stronger effect under the other three mechanisms.
Mechanistic vs empirical models
For the four highlighted simulations, we also conducted multiple logistic regressions in a stepwise manner, offering methionine, folate, the 14 genotypes and all 91 pairwise G × G and 28 G × E interactions (Table 2). These are difficult to interpret, however, owing to the large numbers of comparisons and unstable regression coefficients, particularly in the models that include interaction terms. In an attempt to gain greater insight into mechanisms, attention will now be turned to more pathwaydriven modelling approaches, based on hierarchical or mechanistic models. The former extend the standard logistic models summarised in Table 2 by the addition of 'prior covariates' incorporating knowledge about the relative risk coefficients predicted by the pathway. The latter attempt to model the pathways explicitly, using simplified versions of physiologically based pharmacokinetic (PBPK) models, thereby requiring stronger assumptions about reaction dynamics and population distributions of rate parameters.
Table 2. Multiple stepwise logistic regression models, including only main effects or main effects and G × G/G × E interaction terms for four different choices of the causal variable (gene names are given in Table 1; E_{1 }= intracellular folate concentration; E_{2 }= methionine intake)
Hierarchical models for diseasepathway associations
In the first level, the epidemiological data are fitted using a conventional 'empirical' model for the main effects and interactions among the various input genotypes and exposures, here denoted generically as X = (X_{ip})_{p = 1 ... p }= (E, G, G × G, G × E, G × G × E, ...); for example, a logistic regression model of the form
the sum being taken over the range of terms included in the X vector. Note that all possible effects of some predetermined complexity (eg all main effects and twoway, or perhaps higher order, interactions possibly limited to subsets relevant to the hypothesised pathway structure) are included, rather than using some form of model selection, as was done in the stepwise analyses summarised in Table 2.
In the secondlevel model, each of the regression coefficients from Eq. (1) is in turn regressed on a vector Z_{p }= (Z_{pv})_{v = 1 ... V }of 'prior covariates' describing characteristics of the corresponding terms in X; for example,
There are many possibilities for what could be included in the set of prior covariates, ranging from indicator variables for which of several pathways each gene might act in, [17]in silico predictions of the functional significance of polymorphisms in each gene, [18,19] or genomic annotation from formal ontologies [20]. Summaries of the effects of genes on expression levels ('genetical genomics') or of associations of genes with relevant biomarkers might also be used as prior covariates. Rebbeck et al. [21] provide a good review of available tools that could be used for constructing prior covariates.
Alternatively, one could model the variances, for example:
For example, suppose the X vector comprised effects for different polymorphisms within each gene and one had some prior predictors of the effects of each polymorphism (eg in silico predictions of functional effects or evolutionary conservation) and other predictors of the general effects of genes (eg their roles in different pathways or the number of other genes that they are connected to in a pathway). Then, it might be appropriate to include the former in the π'Z part of the model for the means, and the latter in the φ'Z part of the model for the variances.
So far, the secondlevel models have assumed an independence prior for each of the regression coefficients; but now, suppose we have some prior information about the relationships among the genes, such as might come from networks inferred from gene coexpression data. Let A = (A_{pq})_{p, q = 1 ... P }denote a matrix of prior connectivities between pairs of genes  for example, taking the value 1 if the two are adjacent (connected) in some network or 0 otherwise. Then, one might replace the independence prior of Eq. (2) by a multivariate prior of the form:
This is known as the conditional autoregressive model, and is widely used in spatial statistics [22]. Sample WinBUGS code to implement these and other models described below are available in an online supplement.
In applications to the folate simulation, we tried two variants of this model. First, we considered three prior covariates in Z: an indicator for whether a gene is involved in the methionine cycle; whether it is involved in the folate cycle; and the number of other genes it is connected to in the entire network (a measure of the extent to which it might have a critical role as a 'hub' gene). The A matrix was specified in terms of whether a pair of genes had a metabolite in common, either as substrate or product.
Table 3 summarises the results of several models, including these three prior covariates in the means or variance model, as well as the connectivities in the covariance model. As would be expected, in the zero mean model, all the significant parameter estimates were shrunk towards zero because of the large number of genes with no true effect in the model. In general, none of the prior covariates significantly predicted the means. The estimates of the βs in all these models were much closer to the simple maximum likelihood estimates (the first column), however, and their standard errors were generally somewhat smaller, indicating the 'borrowing of strength' from each other. In the model with covariates in the prior variances, however, the number of connections for each gene was significantly associated with the variance. In the final model, with correlations between genes being given by indicators for whether they were connected in the graph, the posterior distribution for the parameter ρ is constrained by the requirement that the covariance matrix be positive definite, but showed strong evidence of genegene correlations following the pattern given by the connectivities in Figure 1. The generally weak effects of prior covariates in these models may simply reflect the crudeness of these classifications. Below, we will revisit these models with more informative covariates based on the quantitative predictions of the differential equations model.
Table 3. Summary of hierarchical modelling fits (parameter estimates [SEs]) for selected genetic effects (β_{G}), prior covariates (Z'π) and prior correlations (σ^{2}A) for simulation with homocysteine concentration as the causal variable
Mechanistic models
Whereas hierarchical models are generally applicable whenever one has external information about the genes and exposures available, in some circumstances the dynamics of the underlying biological process may be well enough understood to support mechanistic modelling. These are typically based on systems of ordinary differential equations (ODEs) describing each of the intermediate nodes in a graphical model as deterministic quantities given by their various inputs (exposures or previous substrates) with reaction rates determined by genotypes (Figure 2). For example, in a sequence j = 1, ..., J of linear kinetic steps, with conversion from metabolite M_{j }to M_{j+1 }at rate λj and removal at rate μ_{j}, the instantaneous concentration is given by the differential equation:
Figure 2. Schematic representation of simplified onecompartment model.
leading to the equilibrium solution for the final metabolite M_{J }as:
where X_{0 }denotes the concentration of exposure E. This predicted equilibrium concentration of the final metabolite in the graph is then treated as a covariate in a logistic model for the risk of disease:
If sufficient external knowledge about the genotypespecific reaction rates is available, these could be treated as fixed constants, but more likely they would need to be estimated jointly with the βs in the disease model using a nonlinear fitting program. More sophisticated nonlinear models are possible  for example, incorporating MichaelisMenten kinetics by replacing each of the λM terms in Eq. (4) by expressions of the form:
and similarly for the μM terms. The resulting equilibrium solutions for M_{J}(E, G) are now more complex solutions to a polynomial equation. For example, with only a single intermediate metabolite with one activation rate λ and one detoxification rate μ, the solution becomes:
where and denote the lowdose slopes of the two reactions. These solutions can be either upwardly or downwardly curvilinear in E, depending on whether the term in parentheses is positive or negative (basically, whether the creation of the intermediate exceeds the rate at which it can be removed). For the fitted values in the application below (third block of Table 4), the doseresponse relationship for ME is upwardly curved for all genotype combinations (not shown).
A more realistic and more flexible model would allow for stochastic variation in the reaction rates λ_{ij }and μ_{ij }for each individual i conditional on their genotypes G_{ij}; for example, and likewise for μ_{ij }[23] or similarly for their corresponding V_{max }and K_{m }[24]. The population genotypespecific rates are, in turn, assumed to have lognormal prior distributions (and similarly for the μs), with vague priors on the population means , interindividual variances and betweengenotype variances . The individual data might be further supplemented by available biomarker measurements B_{ij }of either the enzyme activity levels or intermediate metabolite concentrations, modelled as and respectively.
The WinBUGS software [25] has an addin called PKBUGS, [26] which implements a Bayesian analysis of population pharmacokinetic parameters [2731]. More complex models can, in principle, be fitted using the addin WBDIFF http://www.winbugsdevelopment.org.uk/wbdiff.html webcite, which allows userspecified differential equations as nodes in a Bayesian graphical model.
To illustrate the approach, we consider a highly simplified model with only a single intermediate metabolite M (homocysteine). We assume this is created at linear rate λ determined by SAHH and removed at linear rate μ determined by CBS. The ratios of λ and μ between genotypes are estimated jointly with β. The first two lines of Table 4 provide the results of fitting the linear kinetics model, with and without interindividual variability in the two rate parameters. Although, of course, many other genes are involved in the simulated model, the estimated homocysteine concentrations M are strongly predictive of disease, and both genes have highly significant effects on their respective rates. Allowing additional random variability in these rates slightly increased the population average genetic effects. For the MichaelisMenten models, we allowed the V_{max}s to depend on genotype, while keeping the K_{m}s fixed. Not all the parameters can be independently estimated, but only the ratios μ_{0}/λ_{0 }and , along with the genetic rate ratios λ_{1}/λ_{0 }and μ_{1}/μ_{0}. Allowing the V_{max}s and K_{m}s to vary between subjects leads to some instability, but did not substantially alter the population mean parameter estimates. Adding in biomarker measurements B_{i }as surrogates for M_{i }for even a subset of subjects, as described below, substantially improved the precision of estimation of all the model parameters (results not shown).
Table 4. Results of Markov chain Monte Carlo fitting of singlecompartment models with homocysteine as an unobserved intermediate metabolite, created at a rate depending on SAHH (λ) and removed at a rate depending on CBS (μ), applied to the simulation taking homocysteine concentration as the causal variable
Combining mechanistic and statistical models
Such an approach is likely to be impractical for complex looped pathways like folate, however. In this case, one might use the results of a preliminary exploratory or hierarchical model to simplify the pathway to a few key ratelimiting steps, so as to yield a simpler unidirectional model for which the differential equation steadystate solutions can be obtained in closed form.
Rather than taking M(E, G) as a deterministic node in the mechanistic modelling approach described above, a fully Bayesian treatment would use stochastic differential equations to derive Pr(ME, G). For example, suppose one postulated that the rate of change dM/dt depends on the rate at which it is created as a constant rate λ(G_{1})E and the rate at which it is removed at rate μ(G_{2})M. (Of course, the exposures E could be time dependent, in which case one would be interested in the longterm average of M rather than its steady state, but in most epidemiological studies there is little information available on shortterm variation in exposures, so the following discussion is limited to the case of timeconstant exposures.) Consider a discrete number of molecules and let p_{m}(t) = Pr(M = mT = t). Then, the resulting stochastic differential equation becomes:
The solution turns out to be simply a Poisson distribution for m with mean E(m) = λE/μ. This suggests as a distribution for continuous metabolite concentrations M in some volume of size N:
where N now controls the dispersion of the distribution. More complex solutions for MichaelisMenten kinetics with a finite number of binding sites have been provided by Kou et al., [32] who showed that the classical solutions still held in expectation, but other properties  like the distribution of waiting times in various binding states  were different, appearing to demonstrate a nonMarkov memory phenomenon, particularly at high substrate concentrations. Further stochastic variability arises from fluctuations in binding affinity due to continual changes in enzyme conformation [33].
To illustrate the general idea, we fitted this simplified version of the model, treating λ and μ as fixed genotypespecific population values, yielding the estimates shown in the last line of Table 4. The dispersion parameter N cannot be estimated, but the results for other parameters are relatively insensitive to this choice; the results in Table 4 are based on either a fixed value N = 10 or using an informative Γ(100,1) prior; as N gets very large, the estimates converge to those in the first line for linear kinetics with fixed genotypespecific λ and μ.
For more complex models, for which analytic solution of the differential equations may be intractable, the technique of approximate Bayesian computation [34] may be helpful. The basic idea is, at each Markov chain Monte Carlo cycle, to simulate data from the differential equations model using the current and proposed estimates of model parameters and evaluate the 'closeness' of the simulated data to the observed data in terms of some simple statistics. This is then used to decide whether to accept or reject the proposed new estimates, rather than having to compute the likelihood itself.
A simpler approach uses the output of a PBPK simulation model as prior covariates in a hierarchical model. Let Z_{ge }= E[M(G_{g}, E_{e})] denote the predicted steadystate concentrations of the final metabolite from a differential equations model for a particular combination of genes and/or exposures (thus, Z_{gg' }might represent the predicted effect of a G × G interaction between genes g and g'). As discussed above, other Zs could comprise variances of predicted Ms across a range of input values as a measure of the sensitivity of the output to variation in that particular combination of inputs. Z_{ge }could also be a vector of several different predicted metabolite concentrations if there were multiple hypotheses about which was the most aetiologically relevant.
For the folate application, the Z matrix was obtained by correlating the simulated intermediate phenotypes v (reaction rates or metabolite concentrations) with the 14 genotypes, 91 G × G and 28 G × E interaction terms. The resulting correlation coefficients for the four simulated causal variables were then used as a vector of in silico prior covariates Z_{p }= (Z_{pv})_{v = 1..4 }for the relative risk coefficients β_{p}. The full set of correlations Z_{pv }across all ten metabolites and nine nonredundant velocities were also used to compute an adjacency matrix as A_{pq }= corr_{v}(Z_{pv}, Z_{qv}), representing the extent to which a pair of genes had similar effects across the whole range of intermediate phenotypes. The effects of these in silico covariates (Table 5) were substantially stronger than for the simple indicator variables illustrated earlier. In each simulation, the prior covariate corresponding to the causal variable was the strongest predictor of the genetic main effects.
Table 5. Summary of hierarchical modelling fits for selected genetic effects (β_{G}), prior covariates (Z'π) and prior standard deviations (σ_{β }and σ_{π}) for simulation with different intermediates as the causal variable, using the Z matrix derived from independent data from the same simulation model (see text).
Table 6. Estimated log relative risk per unit change of true longterm homocysteine concentrations, treated as a latent variable in a single compartment linearkinetics model; data simulated assuming homocysteine is the causal variable.
Designs incorporating biomarkers
Ultimately, it may be helpful to incorporate various markers of the internal workings of a postulated pathway, perhaps in the form of biomarker measurements of intermediate metabolites, external bioinformatic knowledge about the structure and parameters of the network, or toxicological assays of the biological effects of the agents under study. For example, in a multicity study of air pollution, we are applying stored particulate samples from each city to cell cultures with a range of genes experimentally knocked down to assess their genotypespecific biological activities. We will then incorporate these measurements directly into the analysis of G × E interactions in epidemiological data [35]. See Thomas, [1] Thomas et al., [2] Conti et al. [20] and Parl et al. [36] for further discussion about approaches to incorporating biomarkers and other forms of biological knowledge into pathwaydriven analyses.
Typically biomarker measurements are difficult to obtain and are only feasible to collect on a subset of a large epidemiological study. While one might consider using a simple random sample for this purpose, greater efficiency can often be obtained by stratified sampling. Suppose the parent study is a casecontrol study with exposure information and DNA already obtained. One might then consider sampling on the basis of some combination of disease status, exposure and the genotypes of one or more genes thought to be particularly important for the intermediate phenotype(s) for which biomarkers are to be obtained. The optimal design would require knowledge of the true model (which, of course, is unknown), but a balanced design, selecting the subsample so as to obtain equal numbers in the various strata defined by disease and predictors is often nearly optimal [3739]. The analysis can then be conducted by full maximum likelihood, integrating the biomarkers for unmeasured subjects over their distribution (given the available genotype, exposure and disease data) or by some form of multiple imputation, quasilikelihood [40] or MCMC methods. Here, the interest is not in the association of disease with the biomarker B itself, but rather with the unobserved intermediate phenotype M it is a surrogate for. The disease model is thus of the form Pr(YM), with a latent process model for Pr(MG, E) and a measurement model for Pr(BM).
Again, using the folate simulation as the example, we simulated biomarkers for samples of ten or 25 individuals selected at random from each of the eight cells defined by disease status, the MTHFR genotype and high or low folate intake. A measurement B of either homocysteine concentration or the TS enzyme activity level was assumed to be normally distributed around their simulated equilibrium concentrations with standard deviations 10 per cent of that the true longterm average concentrations.
These data were analysed within a conventional measurement error framework [41,42] by treating the true longterm average values of homocysteine or TS activity as a latent variable X in a model given by the following equations:
For joint analyses of homocysteine and TS activity measurements, M and B were assumed to be bivariate normally distributed with M ~ N_{2}(X'A, Σ) and B ~ N_{2}(M, T), and Y as having a multiple logistic dependence on M. Only the main effects of the 14 genes and two environmental factors were included in X for this analysis. While the model can be fitted by maximum likelihood, it is convenient to use MCMC methods, which more readily deal with arbitrary patterns of missing B data. Thus, it is not essential for the different biomarkers to be measured on the same subset of subjects, but some overlap is needed to estimate the covariances Σ_{12 }and T_{12}. More complex mechanistic models could, of course, be used in place of the regression model MX. For this model to be identifiable, however, it is essential that distinct biomarkers be available for each of the intermediate phenotypes included in the disease model.
Estimates of the effects of both homocysteine and TS enzyme activity were highly significant in univariate analyses, even though the simulated causal variable is homocysteine. In bivariate analyses, however, the TS effect became nonsignificant, owing to the strong positive correlation (r_{Σ }= 0.45; 95 per cent confidence interval [CI] 0.21, 0.71) between the residuals of M, while correlation between the residuals of the measurement errors was not significant (
 r
Dealing with reverse causation: Mendelian randomisation
The foregoing development assumes that the biomarker measurement B or the underlying phenotype M of which it is a measurement is not affected by the disease process. While this may be a reasonable assumption in a cohort or nested casecontrol study where biomarker measurements are made on stored specimens obtained at entry to the cohort rather than after the disease has already occurred, it is a well known problem (known as 'reverse causation') in casecontrol studies. In this situation, one might want to restrict biomarker measurements only to controls and use marginal likelihood or imputation to deal with the unmeasured biomarkers for cases. Alternatively, one might consider using case measurements in a model that includes terms for differential error in the measurement model, Pr(BM, Y).
These ideas have been formalised in literature known as 'Mendelian randomisation' (MR), [4347] sometimes referred to as 'Mendelian deconfounding' [48]. Here, the focus of attention is not the genes themselves, but intermediate phenotypes (M) as risk factors for disease. The genes that influence M are treated as 'instrumental variables' (IVs) [4954] in an analysis that indirectly infers the MY relationship from separate analyses of the GM and GY relationships. The appeal of the approach is that uncontrolled confounding and reverse causation are less likely to distort these relationships than they are to distort the MY relationship if studied directly. In essence, the idea of imputing M values using G as an IV in a regression of Y on E(MG) is a form of MR argument. Nevertheless, the approach is not without its pitfalls, [5558] both as a means of testing the null hypothesis of no causal connection between M and Y and as a means of estimating the magnitude of its effect. Particularly key is the assumption that the effect of G on Y is mediated solely through M. For complex pathways, the simple MR approach is unlikely to be of much help, but the idea of using samples free of reverse causation to learn about parts of the model from biomarker measurements and incorporating these into the analysis of a latent variable model is promising.
To illustrate these methods, consider the scenario where homocysteine is the causal variable for disease. The logistic regression of disease directly on homocysteine yields a logRR coefficient β of 2.57 (SE 0.22) per SD change of homocysteine (Table 7). This estimate is, however, potentially subject to confounding and reverse causation, and indeed in this simulation we generated an upward bias in BM of 50 per cent of the SD of M, which produced a substantial overestimate of the simulated β = 2. An MR estimate could in principle be obtained by using any of the genes in the pathway as an IV, MTHRF being the most widely studied. The regression of homocysteine on MTHFR yields a regression coefficient of α = 0.216 (0.079) and a logistic regression of disease on MTHFR yields a regression coefficient of γ = 0.112 (0.142), to produce an MR estimate of β = γ/α = 0.52 (0.68). Since MTHRF is only a relatively weak predictor of homocysteine concentrations in this simulation, however, it is a poor instrumental variable, as reflected in the large SE of the ratio estimate. Several other genes, exposures and interactions have much stronger effects on both homocysteine and disease risk  notably, SAHH and CBS, which yield significant MR estimates, 1.27 (0.33) and 1.09 (0.20), respectively. These differences between estimates using different IVs and their underestimation of the simulated β suggest that simple Mendelian randomisation is inadequate to deal with complex pathways.
Table 7. Mendelian randomization estimates of the effect of homocysteine on disease risk
A stepwise multiple regression model for included 13 main effects and G × G interactions and attained an R^{2 }of 0.43. Treating these predicted homocysteine concentrations as the covariate yielded a single imputation estimate of the log RR for disease of 1.32 (0.16), only slightly less precise than that from the logistic regression of disease directly on the measured values. While robust to uncontrolled confounding, this approach is not robust to reverse causation or misspecification of the prediction model; for example, it fails to include any exposure effects, which we have excluded to avoid distortion by reverse causation. More importantly, it also assumes that the entire effect of the predictors is mediated through homocysteine; this is true for this simulation, but is unlikely to be in practice. While not quite as downwardly biased as the Mendelian randomisation estimates (resulting from the improved prediction of BG), the incompleteness of the model has still produced some underestimation.
Since we have simulated the case where the biomarker measurements are distorted by disease status, one might consider one of two alternative single imputation analyses. If both cases and controls have biomarker measurements available, one might include disease status in a model for , and then set Y = 0 in the fitted regression in order to estimate the predisease values for the cases. Alternatively, one could fit the model for using data only from controls and then apply the fitted model to all subjects, cases and controls. In either case, one would use only the predicted values for all subjects, not the actual biomarker measurements for those having them. In these simulated data, these approaches yield log RR estimates of 1.28 (0.20) and 1.31 (0.20), respectively. Either of these approaches avoids the circularity of using disease status to predict BG, Y and then using it again in the regression of Y on . While the first approach uses more of the data, it requires a stronger assumption that the effect of Y on B is correctly specified, including possible interactions with G. In this simulation, the estimate of δ is 1.33 (0.06), substantially biased away from the simulated value of 0.50 because it includes some of the causal effect of X on Y. A fully Bayesian analysis jointly estimates the bias term δY in the full model . In this simulation, the fully Bayesian analysis yielded an estimate of β = 2.95 (0.22) and δ = 0.02 (1.02). Obviously, δ is so poorly estimated and β so overestimated that this approach appears to suffer from problems of identifiability that require further investigation.
In the Colon Cancer Family Registries, [59] we have predisease biospecimens on several hundred relatives of probands who were initially unaffected and subsequently became cases themselves. In a currently ongoing substudy of biomarkers for the folate pathway, it will be possible to use these samples to estimate the effect of reverse causation directly. Of course, it would have been even more informative to have both pre and postdiagnostic biomarker measurements on incident cases to model reverse causation more accurately.
Incorporating external information: Ontologies
There are now numerous databases available that catalogue various types of genomic information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is perhaps the most familiar of these for knowledge about the structure of pathways and the parameters of each step therein. Others include the Gene Ontology, Biomolecular Interaction Network Database, Reactome, PANTHER, Ingenuity Pathway Analysis, BioCARTA, GATHER, DAVID and the Human Protein Reference Database, (see, for example, Meier and Gehring, [60] Thomas et al. [61] and Werner [62] for reviews). Literature mining is emerging as another tool for this purpose, [63] although potentially biased by the vagaries of research and publication trends. Such repositories form part of a system for organising knowledge known as an 'ontology' [64]. Representation of our knowledge via an ontology may provide a more useful and broadly informative platform to generate systemwide hypotheses about how variation in human genes ultimately impacts on organismlevel phenotypes via the underlying pathway or complex system. Since the biological and environmental knowledge relevant to most diseases spans many research fields, each with specific theories guiding ongoing research, expertise across the entire system by one individual scientist is limited. While the information that contributes to each knowledge domain may contain uncertainties and sources of error stemming from the underlying experiments and studies, biases in the selection of genes and pathways chosen to be included and lack of comparability across terms and databases, an ontology as a whole can generate hypotheses and links across research disciplines that may only arise when information is integrated from several disciplines across the entire span of suspected disease aetiology. An ontology should not be taken as the truth, but rather as the current representation of knowledge that can, and should, be updated as new findings arise and hypotheses are tested. Evaluation of the accuracy of ontologies is an active research area.
In our folate simulation, we considered three prior covariates for Z in Table 3. The creation of these priors followed directly from the network representation given in Figure 1, obtained from a previously published article representing one research group's interpretation of the folate pathway [14]. An ontology, such as Gene Ontology (GO), has the potential advantage of allowing for the construction of prior covariates across a range of biological mechanisms. For example, a very refined biological process captured by the GO term folic acid and derivative biosynthetic process indicates two genes (MTCH and MS) from our example set of genes. A more general term, methionine biosynthetic process, identifies three genes (MTCH, MTHFR and MS). Finally, a broad process, such as onecarbon compound metabolic process, identifies five genes (SAHH, DHFR, MATII, MTCH and SHMT). Since an ontology has a hierarchical structure in a easily computable format, one may consider more quantitative approaches in generating prior covariates, such as the distance between two genes in the ontology. Across the full range of 184 GO terms involving one or more of these 14 genes, positively correlated sets include (MTHFR, MTCH, MS), (MTD, CBS), (FTD, DHFR) and (AICART, TS), while PGT and MTCH are negatively correlated. Figure 3 represents these correlations using a complete agglomerative clustering.
Figure 3. Hierarchical clustering of folate genes based on 184 GO terms.
Although both approaches to building prior covariates, via either the visual interpretation of a network or the use of Gene Ontology, use knowledge of biological mechanisms, they lack a formal link of these mechanisms to disease risk or organismlevel phenotypes. Such links may be critical when generating hypotheses or informing statistical analyses using biological mechanisms. Many publicly available ontologies provide a vast amount of structural information on various biological processes, but interpretation or weighting of the importance of those processes in relation to specific phenotypes will only come when ontologies from biological domains are linked to ontologies characterising phenotypes. As one example, Thomas et al. [61] created a novel ontological representation linking smokingrelated phenotypes and response to smoking cessation treatments with the underlying biological mechanisms, mainly nicotine metabolism. Most of the ontological concepts created for this specific ontology were created using concept definitions from existing ontologies, such as SOPHARM and Gene Ontology. This ontology was used in Conti et al. [20] to demonstrate the use in pathway analysis as a systematic way of eliciting priors for a hierarchical model. Specifically, the ontology was used to generate quantitative priors to reduce the space of potential models and to inform subsequent analysis via a Bayesian model selection approach.
Dealing with uncertainty in pathway structure
A more general question is how to deal with model uncertainty in any of these modelling strategies. The general hierarchical modelling strategy was first extended by Conti et al. [65] to deal with uncertainty about the set of main effects and interactions to be included in X using stochastic search variable selection [66]. Specifically, they replaced the secondlevel model by a pair of models, a logistic regression for the probability that β_{p }= 0 and a linear regression of the form of Eq. (2) for the expected values of the coefficient, given that it was not zero. In turn, the pair of secondlevel models inform the probability that any given term will be included in the model at the current iteration of the stochastic search. Thus, over the course of MCMC iterations, variables are entered and removed, and one can then estimate the posterior probability or Bayes factor (1) for each factor or possible model (2), for whether each factor has a nonzero β averaging over the set of other variables in the model, or (3) the posterior mean of each β, given that it is nonzero. Other alternatives include the Lasso prior, [67] which requires only a single hyperparameter to accomplish both shrinkage and variable selection in a natural way, and the elastic net, [68] which combines the Lasso and normal priors and can be implemented in a hierarchical fashion combining variable selection at lower levels (eg among SNPs within a pathway) and shrinkage at higher levels (eg between genes within a pathway or between pathways) (Chen et al. Presented at the Eastern North American Region Meeting of the Biometric Society; San Antonio, TX: February 2009).
In an analysis, utilising the methods described by Conti et al., [20] of the simulated data when homocysteine is the causal variable (Table 5, first column) and incorporating an exchangeable prior structure in which all genes are treated equally (ie intercept only in the prior covariate matrix, Z), the posterior probabilities of including the two modestly significant genes TS and FTD are 0.57 and 0.48, respectively. By contrast, when the prior covariate matrix is derived from the 'external database' from the simulation model and is thus more informative of the underlying mechanism, these posterior probabilities change to 0.84 and 0.14, respectively. These changes in the posterior probabilities of inclusion reflect the covariate values for these genes in relation to homocysteine concentration and the AICART reaction velocity (the two prior covariates with the largest estimated secondlevel effects). In the case of TS, the velocities for these covariates are large, resulting in an increase in the posterior probability of inclusion. By contrast, for FTD these values are much smaller and there is a subsequent decrease.
For mechanistic models, the 'topology' of the model Λ and the corresponding vector of model parameters θ_{Λ }are treated as unknown quantities, about which we might have some general prior knowledge in the form of the 'ontology' Z. In the microarray analysis world, Bayesian network analysis has emerged as a powerful technique for inferring the structure of a complex network of genes [69]. Might such a technique prove helpful for epidemiological analysis?
One promising approach is 'logic regression', which considers a set of treestructured models relating measurable inputs (genes and exposures) to a disease trait through a network of unobserved intermediate nodes representing logical operators (AND, OR, XOR etc) [70]. To allow for uncertainty about model form, a MCMC method is used to update the structure of the graphical model by adding, deleting, moving or changing the types of the intermediate nodes [71]. Although appealing as a way of representing the biochemical pathways, logic regression does not exploit any external information about the form of network. It also treats all intermediate nodes as binary, so it is more suitable for modelling regulatory than metabolic pathways where the intermediate nodes would represent continuous metabolite concentrations.
To overcome some of these difficulties, we relaxed the restriction to binary nodes, parameterising the model as:
When both input nodes (the 'parents' p_{j }= [p_{j1}, p_{j2}]) are binary, various combinations of θs will yield the full range of possible logical operators (eg AND = [0,0], OR = [1,1]), but this framework allows great flexibility in modelling interactions between continuous nodes, while remaining identifiable. The Ms are treated as deterministic nodes, so the final metabolite concentration M_{J }(E, G; Λ, θ) can be calculated via a simple recursion. The disease risk is assumed to have a logistic dependence on M_{J}. Prior knowledge about the topology can be incorporated by use of a measure of similarity of each fitted network to the postulated true network (eg the proportion of connections in the true graph which are represented in the fitted one, minus the number of connections in the fitted graph which are not represented in the true one). In the spirit of Monte Carlo logic regression, the topology of the graph is modified by proposing to add or delete nodes or to move a connection between them using the MetropolisHastings algorithm [72]. Finally, the model parameters are updated conditional on the current model form. By postprocessing the resulting set of graphs, various kinds of inference can be drawn, such as the posterior probability that a given input appears in the fitted graphs, that a pair of inputs is represented by a node in the graph, or the marginal effect of any input or combination of inputs on the disease risk. In small simulations, we demonstrated that the model could correctly identify the true network structure (or logically equivalent ones) and estimate the parameters well, while not identifying any incorrect models. In an application to data on ten candidate genes from the Children's Health Study, we were able to replicate the interactions found by a purely exploratory technique [73] and identified several alternative networks with comparable Bayes factors.
The folate pathway poses difficulties for mechanistic modelling because it is not a directed acyclic graph (DAG); although each arrow in Figure 1 is directed, the graph contains numerous cycles (feedback loops), making direct computation of probabilities difficult. In some instances, such cycles can be treated as single composite nodes with complex deterministic or stochastic laws, thereby rendering the remainder of the graph acyclic, but when there are many interconnected cycles, as in the folate pathway, such decomposition may be difficult or impossible to identify. Might it be possible, however, to identify a simpler DAG that captures the key behaviour of the network? Since any DAG would be an oversimplification and there could be many such DAGs that provide a reasonable approximation, the problem of model uncertainty is important.
A further extension of the Baurley et al. approach to the folate simulation will now be summarised. As in their approach, we assume that each node has exactly two inputs, but now distinguish three basic types of nodes, G × G, G × M (or G × E) and M × M. G × G nodes are treated as logical operators, yielding a binary output as high or low risk. G × M and G × E nodes represent intermediate metabolite concentrations, treated as continuous variables with deterministic values given by MichaelisMenten kinetics with rate parameters V_{max}(G) and K_{m}. M × M nodes are regression expressions yielding a continuous output variable with the mean parameterised as in Eq. (5). Disease risk is assumed to have a logistic dependence on one or more of the Zs. Finally, each measured biomarker B is assumed to be lognormally distributed around one of the Ms, with some measurement error variance. Rather than treating the intermediate nodes as deterministic, the likelihood of the entire graph is now calculated by peeling over possible states of all the intermediate nodes.
Figure 4 shows the topologies discovered by the MCMC search. The largest Bayes factors are obtained when using no prior topologies. With a prior topology, essentially the same networks are found, with somewhat different Bayes factors.
Figure 4. Topranking topologies without incorporating priors: left, gene only; right, genes and exposures. With no priors, the two topologies have posterior probabilities 3.9 per cent and 2.3 per cent, respectively. Using a topology derived by hierarchical clustering of the A matrix from simulated data, the topranked geneonly topology was identical to that shown on the left, with posterior probability of 9.5 per cent. Using the GO topology shown in Figure 3, the same genes were included, but reordered as (((MTHR, SAHH), MTD), SHMT)with a posterior probability of 6.4 per cent.
Pathways in a genomewide context
Genomewide association studies (GWAS) are generally seen as 'agnostic'  the antithesis of hypothesisdriven pathwaybased studies. Aside from the daunting computational challenge, their primary goal is, after all, the discovery of novel genetic associations, possibly in genes with unknown function or even with genomic variation in 'gene desert' regions not known to harbour genes. How, then, could one hope to incorporate prior knowledge in a GWAS? The response has generally been to wait until the GWAS has been completed (after a multistage scan and independent replication) and then conduct various in vitro functional studies of the novel associations before attempting any pathway modelling.
The idea of incorporating prior knowledge from genomic annotation databases or other sources as a way of improving the power of a genomewide scan for discovery has, however, been suggested by several authors. Roeder et al., [74] Saccone et al., [75] Wakefield [7678] and Whittemore [79] introduced variants of a weighted false discovery rate, while Lewinger et al. [80] and Chen and Witte [81] described hierarchical modelling approaches for this purpose. These could be applied at any stage of a GWAS to improve the prioritisation of variants to be taken forward to the next stage. For example, Sebastiani et al. [82] used a Bayesian test to incorporate external information for prioritising SNP associations from the first stage of a GWAS using pooled DNA, to be subsequently tested using individual genotyping. Roeder et al. [74] originally suggested the idea of exploiting external information in the context of using a prior linkage scan to focus attention in regions of the genome more likely to harbour causal variants, but subsequent authors have noted that various other types of information, such as linkage disequilibrium, functional characterisation or evolutionary conservation, could be included as predictors. An advantage of hierarchical modelling is that multiple sources can be readily incorporated in a flexible regression framework, whereas the weighted FDR requires a priori choice of a specific weighting scheme.
A recent trend has been the incorporation of pathway inference in genomewide association scans, [75,8389] borrowing ideas from the extensive literature on network analysis of gene expression array data [90,91]. Currently, the most widely used tool for this purpose is gene set enrichment analysis, [92] which in GWAS applications aims to test whether groups of genes in a common pathway tend to rank higher in significance. Several published applications have yielded novel insights using this approach, [9396] although others have found that no specific pathway outranks the most significant single markers, [89,97,98] suggesting that the approach may not be ideal for all complex diseases. Many other empirical approaches have been used in the geneexpression field, including Bayesian network analysis, [69,99,100] neural networks, [101] support vector machines [102] and a variety of other techniques from the fields of bioinformatics, computational or systems biology and machine learning [103111]. Most of these are empirical, although in the sense of trying to reconstruct the unknown network structure from observational data, rather than using a known network to analyse the observational data. It is less obvious how such methods could be applied to mining singlemarker associations from a GWAS, but they could be helpful in mining G × G interactions. Even simple analyses of GWAS data can be computationally demanding, particularly if all possible G × G interactions are to be included, and analyses incorporating pathway information is likely to be even more daunting. Recent developments in computational algorithms for searching highdimensional spaces and parallel cluster computing implementations may, however, make this feasible.
Recently, several authors [112116] have undertaken analyses of the association of genomewide expression data with genomewide SNP genotypes in search of patterns of genetic control that would identify cis and transactivating factors and master regulatory regions. Ultimately, one could foresee using networks inferred from gene expression directly as priors in a hierarchical modelling analysis for GWAS data, or a joint analysis of the two phenotypes, but this has yet to be attempted. Other novel technologies, such as wholegenome sequencing, metabolomics, proteomics and so on may provide other types of data that will inform pathwaybased analysis on a genomewide scale.
Conclusions
As in any other form of statistical modelling, the analyst should be cautious in interpretation. An pointed out by Jansen: [117]
'So, the modeling of the interplay of many genes  which is the aim of complex systems biology  is not without danger. Any model can be wrong (almost by definition), but particularly complex (overparameterized) models have much flexibility to hide their lack of biological relevance' [emphasis added].
A good fit to a particular model does not, of course, establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organise a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the ArmitageDoll [118] multistage model of carcinogenesis, for example, lies not in our belief that it is a completely accurate description of the process, but rather in its ability to distinguish whether a carcinogen appears to act early or late in the process or at more than one stage. Similarly, the importance of the MoolgavkarKnudson twostage clonalexpansion model [119] lies in its ability to test whether a carcinogen acts as an 'initiator' (ie on the mutation rates) or a 'promoter' (ie on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.
Although mechanistic models do make some testable predictions about such things as the shape of the doseresponse relationship and the modifying effects of timerelated variables, testing such patterns against epidemiological data tends to provide only weak evidence in support of the alternative models, and only within the context of all the other assumptions involved. Generally, comparisons of alternative models (or specific submodels) can only be accomplished by direct fitting. Visualisation of the fit to complex epidemiological datasets can be challenging. Any mechanistic interpretations of model fits should therefore consider carefully the robustness of these conclusions to possible misspecification of other parts of the model.
Acknowledgements
This work was supported in part by NIH grants R01CA92562, P50ES07048, R01CA112237 and U01ES015090 (D.C.T., D.V.C., J.B.), R01CA105437, R01CA105145, R01CA59045 (C.M.U.) and NSF grants DMS0616710 and DMS0109872 (F.N., M.R.). The authors are particularly grateful to Wei Liang and Fan Yang for programming support.
References

Thomas DC: The need for a comprehensive approach to complex pathways in molecular epidemiology.
Cancer Epidemiol Biomarkers Prev 2005, 14:557559. PubMed Abstract  Publisher Full Text

Thomas DC, Baurley JW, Brown EE, Figueiredo J, et al.: Approaches to complex pathways in molecular epidemiology: Summary of an AACR special conference.
Cancer Res 2008, 68:1002810030. PubMed Abstract  Publisher Full Text

Cook NR, Zee RY, Ridker PM: Tree and spline based association analysis of genegene interaction models for ischemic stroke.
Stat Med 2004, 23:14391453. PubMed Abstract  Publisher Full Text

Ritchie MD, Hahn LW, Roodi N, Bailey LR, et al.: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer.
Am J Hum Genet 2001, 69:138147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hoh J, Ott J: Mathematical multilocus approaches to localizing complex human trait genes.
Nat Rev Genet 2003, 4:701709. PubMed Abstract  Publisher Full Text

Tamayo P, Slonim D, Mesirov J, Zhu Q, et al.: Interpreting patterns of gene expression with selforganizing maps: Methods and application to hematopoietic differentiation.
Proc Natl Acad Sci USA 1999, 96:29072912. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

TahriDaizadeh N, Tregouet DA, Nicaud V, Manuel N, et al.: Automated detection of informative combined effects in genetic association studies of complex traits.
Genome Res 2003, 13:19521960. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Potter JD: Colorectal cancer: Molecules and populations.
J Natl Cancer Inst 1999, 91:916932. PubMed Abstract  Publisher Full Text

Frosst P, Blom HJ, Milos R, Goyette P, et al.: A candidate genetic risk factor for vascular disease: A common mutation in methylenetetrahydrofolate reductase.
Nat Genet 1995, 10:1113. PubMed Abstract  Publisher Full Text

Ulrich CM, Potter JD: Folate supplementation: Too much of a good thing?
Cancer Epidemiol Biomarkers Prev 2006, 15:18993. PubMed Abstract  Publisher Full Text

Molloy AM, Brody LC, Mills JL, Scott JM, et al.: The search for genetic polymorphisms in the homocysteine/folate pathway that contribute to the etiology of human neural tube defects.
Birth Defects Res A Clin Mol Teratol 2009, 85:28594. PubMed Abstract  Publisher Full Text

Nijhout HF, Reed MC, Budu P, Ulrich CM: A mathematical model of the folate cycle: New insights into folate homeostasis.
J Biol Chem 2004, 279:5500816. PubMed Abstract  Publisher Full Text

Nijhout HF, Reed MC, Ulrich CM: Mathematical models of folatemediated onecarbon metabolism.
Vitam Horm 2008, 79:4582. PubMed Abstract  Publisher Full Text

Reed MC, Nijhout HF, Neuhouser ML, Gregory JF, et al.: A mathematical model gives insights into nutritional and genetic aspects of folatemediated onecarbon metabolism.
J Nutr 2006, 136:265361. PubMed Abstract  Publisher Full Text

Reed MC, Thomas RL, Pavisic J, James SJ, et al.: A mathematical model of glutathione metabolism.
Theor Biol Med Model 2008, 5:8. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ulrich CM, Neuhouser M, Liu AY, Boynton A, et al.: Mathematical modeling of folate metabolism: Predicted effects of genetic polymorphisms on mechanisms and biomarkers relevant to carcinogenesis.
Cancer Epidemiol Biomarkers Prev 2008, 17:182231. PubMed Abstract  Publisher Full Text

Hung RJ, Brennan P, Malaveille C, Porru S, et al.: Using hierarchical modeling in genetic association studies with multiple markers: Application to a casecontrol study of bladder cancer.
Cancer Epidemiol Biomarkers Prev 2004, 13:10131021. PubMed Abstract  Publisher Full Text

Capanu M, Orlow I, Berwick M, Hummer AJ, et al.: The use of hierarchical models for estimating relative risks of individual genetic variants: An application to a study of melanoma.
Stat Med 2008, 27:19731992. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hung RJ, Baragatti M, Thomas D, McKay J, et al.: Inherited predisposition of lung cancer: A hierarchical modeling approach to DNA repair and cell cycle control pathways.
Cancer Epidemiol Biomarkers Prev 2007, 16:27362744. PubMed Abstract  Publisher Full Text

Conti DV, Lewinger JP, Swan GE, Tyndale RF, et al.: Using ontologies in hierarchical modeling of genes and exposures in biologic pathways. In Phenotypes and Endophenotypes: Foundations for Genetic Studies of Nicotine Use and Dependence. Edited by Swans GE. NCI Tobocco Control Monographs, Bethesda, MD; 2009:539584.

Rebbeck TR, Spitz M, Wu X: Assessing the function of genetic variants in candidate gene association studies.
Nat Rev Genet 2004, 5:589597. PubMed Abstract  Publisher Full Text

Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics (with discussion).
Ann Inst Statist Math 1991, 43:159. Publisher Full Text

Cortessis V, Thomas DC: Toxicokinetic genetics: An approach to geneenvironment and genegene interactions in complex metabolic pathways. In Mechanistic Considerations in the Molecular Epidemiology of Cancer. Edited by Bird P, Boffetta P, Buffler P, Rice J. IARC Scientific Publications, Lyon, France; 2003:127150.

Du L, Conti DV, Thomas DC: Physiologicallybased pharmacokinetic modeling platform for genetic and exposure effects in metabolic pathways.

Lunn DJ, Thomas A, Best N, Spiegelhalter D: Winbugs  A Bayesian modelling framework: Concepts, structure, and extensibility.
Stat Comput 2000, 10:325337. Publisher Full Text

Lunn DJ, Best N, Thomas A, Wakefield J, Spiegelhalter D: Bayesian analysis of population PK/PD models: General concepts and software.
J Pharmacokinet Pharmacodyn 2002, 29(3):271307. PubMed Abstract  Publisher Full Text

RacinePoon A, Wakefield J: Statistical methods for population pharmacokinetic modelling.
Stat Meth Med Res 1998, 7:6384. Publisher Full Text

Bois FY: Applications of population approaches in toxicology.
Toxicol Lett 2001, 120:385394. PubMed Abstract  Publisher Full Text

Bennett JE, Wakefield JC: A comparison of a Bayesian population method with two methods as implemented in commercially available software.
J Pharmacokinet Biopharm 1996, 24:403432. PubMed Abstract  Publisher Full Text

Wakefield J: Bayesian individualization via samplingbased methods.
J Pharmacokinet Biopharm 1996, 24:103131. PubMed Abstract  Publisher Full Text

Best NG, Tan KK, Gilks WR, Spiegelhalter DJ: Estimation of population pharmacokinetics using the Gibbs sampler.
J Pharmacokinet Biopharm 1995, 23:407435. PubMed Abstract  Publisher Full Text

Kou SC, Cherayil BJ, Min W, English BP, et al.: Singlemolecule MichaelisMenten equations.
J Phys Chem B 2005, 109:1906819081. PubMed Abstract  Publisher Full Text

English BP, Min W, van Oijen AM, Lee KT, et al.: Everfluctuating single enzyme molecules: MichaelisMenten equation revisited.
Nat Chem Biol 2006, 2:8794. PubMed Abstract  Publisher Full Text

Marjoram P, Molitor J, Plagnol V, Tavare S: Markov chain Monte Carlo without likelihoods.
Proc Natl Acad Sci USA 2003, 100:1532415328. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thomas DC: Using geneenvironment interactions to dissect the effects of complex mixtures.
J Expo Sci Environ Epidemiol 2007, 17(Suppl 2):S71S74. PubMed Abstract  Publisher Full Text

Parl F, Crooke P, Conti DV, Thomas DC: Pathwaybased methods in molecular cancer epidemiology. In Fundamentals of Molecular Epidemiology. Edited by Rebbeck TR, Ambrosone CB, Shields PG. Informa Healthcare, New York, NY; 2008:189204.

Spiegelman D, Carroll RJ, Kipnis V: Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument.
Stat Med 2001, 20:139160. PubMed Abstract  Publisher Full Text

Holcroft CA, Spiegelman D: Design of validation studies for estimating the odds ratio of exposuredisease relationships when exposure is misclassified.
Biometrics 1999, 55:11931201. PubMed Abstract  Publisher Full Text

Thomas DC: Multistage sampling for latent variable models.
Lifetime Data Anal 2007, 13:565581. PubMed Abstract  Publisher Full Text

Breslow NE, Chatterjee N: Design and analysis of twophase studies with binary outcome applied to Wilms tumor prognosis.
Appl Statist 1999, 48:457468. Publisher Full Text

Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM: Measurement Error in Nonlinear models: A Modern Perspective. 2nd edition. Chapman and Hall CRC Press, London, UK; 2006.

Thomas DC, Stram D, Dwyer J: Exposure measurement error: Influence on exposuredisease relationships and methods of correction.
Annu Rev Publ Health 1993, 14:6993. Publisher Full Text

Davey Smith G, Ebrahim S: "Mendelian randomization": Can genetic epidemiology contribute to understanding environmental determinants of disease?
Int J Epidemiol 2003, 32:122. PubMed Abstract  Publisher Full Text

Davey Smith G, Ebrahim S: Mendelian randomization: Prospects, potentials, and limitations.
Int J Epidemiol 2004, 33:3042. PubMed Abstract  Publisher Full Text

Davey Smith G, Ebrahim S: What can Mendelian randomisation tell us about modifiable behavioural and environmental exposures?
BMJ 2005, 330:10761079. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lewis SJ, Davey Smith G: Alcohol, aldh2, and esophageal cancer: A metaanalysis which illustrates the potentials and limitations of a Mendelian randomization approach.
Cancer Epidemiol Biomarkers Prev 2005, 14:19671971. PubMed Abstract  Publisher Full Text

Thompson JR, Minelli C, Abrams KR, Tobin MD, et al.: Metaanalysis of genetic studies using Mendelian randomization  A multivariate approach.
Stat Med 2005, 24:22412254. PubMed Abstract  Publisher Full Text

Tobin MD, Minelli C, Burton PR, Thompson JR: Commentary: Development of Mendelian randomization: From hypothesis test to "Mendelian deconfounding".
Int J Epidemiol 2004, 33:2629. PubMed Abstract  Publisher Full Text

Glynn RJ: Commentary. Genes as instruments for evaluation of markers and causes.
Int J Epidemiol 2006, 35:932934. PubMed Abstract  Publisher Full Text

Hernan MA, Robins JM: Instruments for causal inference: An epidemiologist's dream?
Epidemiology 2006, 17:360372. PubMed Abstract  Publisher Full Text

Brookhart MA, Wang PS, Solomon DH, Schneeweiss S: Instrumental variable analysis of secondary pharmacoepidemiologic data.
Epidemiology 2006, 17:373374. PubMed Abstract  Publisher Full Text

Buzas JS, Stefanski LA: Instrumental variable estimation in generalized linear measurement error models.

Greenland S: An introduction to instrumental variables for epidemiologists.
Int J Epidemiol 2000, 29:1102. PubMed Abstract  Publisher Full Text

Martens EP, Pestman WR, de Boer A, Belitser SV, et al.: Instrumental variables: Application and limitations.
Epidemiology 2006, 17:260267. PubMed Abstract  Publisher Full Text

Didelez V, Sheehan N: Mendelian randomization as an instrumental variable approach to causal inference.
Stat Meth Med Res 2007, 16:309330. Publisher Full Text

Nitsch D, Molokhia M, Smeeth L, DeStavola BL, et al.: Limits to causal inference based on Mendelian randomization: A comparison with randomized controlled trials.
Am J Epidemiol 2006, 163:397403. PubMed Abstract  Publisher Full Text

Bautista LE, Smeeth L, Hingorani AD, Casas JP: Estimation of bias in nongenetic observational studies using "Mendelian triangulation".
Ann Epidemiol 2006, 16:675680. PubMed Abstract  Publisher Full Text

Thomas DC, Conti DV: Commentary. The concept of "Mendelian randomization".
Int J Epidemiol 2004, 33:2125. PubMed Abstract  Publisher Full Text

Newcomb PA, Baron J, Cotterchio M, Gallinger S, et al.: Colon cancer family registry: An international resource for studies of the genetic epidemiology of colon cancer.
Cancer Epidemiol Biomarkers Prev 2007, 16:23312343. PubMed Abstract  Publisher Full Text

Meier S, Gehring C: A guide to the integrated application of online data mining tools for the inference of gene functions at the systems level.
Biotechnol J 2008, 3:13751387. PubMed Abstract  Publisher Full Text

Thomas PD, Mi H, Swan GE, Lerman C, et al.: A systems biology network model for genetic association studies of nicotine addiction and treatment.
Pharmacogenet Genomics 2009, 19:538551. PubMed Abstract  Publisher Full Text

Werner T: Bioinformatics applications for pathway analysis of microarray data.
Curr Opin Biotechnol 2008, 19:5054. PubMed Abstract  Publisher Full Text

Jensen LJ, Saric J, Bork P: Literature mining for the biologist: From information retrieval to biological discovery.
Nat Rev Genet 2006, 7:119129. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, et al.: Gene ontology: Tool for the unification of biology.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Conti DV, Cortessis V, Molitor J, Thomas DC: Bayesian modeling of complex metabolic pathways.
Hum Hered 2003, 56:8393. PubMed Abstract  Publisher Full Text

George EI, McCulloch RE: Variable selection via Gibbs sampling.

Park T, Casella G: The Bayesian lasso.
J Am Stat Assoc 2008, 103:681686. Publisher Full Text

Zou H, Hastie T: Regularization and variable selection via the elastic net.
J R Stat Soc Ser B 2005, 67:301320. Publisher Full Text

Friedman N: Inferring cellular networks using probabilistic graphical models.
Science 2004, 303:799805. PubMed Abstract  Publisher Full Text

Ruczinski I, Kooperberg C, LeBlanc ML: Exploring interactions in highdimensional genomic data: An overview of logic regression, with applications.
J Multivar Anal 2004, 90:178195. Publisher Full Text

Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression.
Genet Epidemiol 2005, 28:157170. PubMed Abstract  Publisher Full Text

Hastings W: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 1970, 57:97109. Publisher Full Text

Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis.
Am J Hum Genet 2006, 78:1527. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Roeder K, Devlin B, Wasserman L: Improving power in genomewide association studies: Weights tip the scale.
Genet Epidemiol 2007, 31:741747. PubMed Abstract  Publisher Full Text

Saccone SF, Saccone NL, Swan GE, Madden PA, et al.: Systematic biological prioritization after a genomewide association study: An application to nicotine dependence.
Bioinformatics 2008, 24:18051811. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wakefield J: Bayes factors for genomewide association studies: Comparison with pvalues.

Wakefield J: Reporting and interpretation in genomewide association studies.
Int J Epidemiol 2008, 37:641653. PubMed Abstract  Publisher Full Text

Wakefield J: A Bayesian measure of the probability of false discovery in genetic epidemiology studies.
Am J Hum Genet 2007, 81:208227. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Whittemore AS: A Bayesian false discovery rate for multiple testing.
J Appl Statist 2007, 34:19. Publisher Full Text

Lewinger JP, Conti DV, Baurley JW, Triche TJ, et al.: Hierarchical Bayes prioritization of marker associations from a genomewide association scan for further investigation.
Genet Epidemiol 2007, 31:871882. PubMed Abstract  Publisher Full Text

Chen GK, Witte JS: Enriching the analysis of genomewide association studies with hierarchical modeling.
Am J Hum Genet 2007, 81:397404. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sebastiani P, Zhao Z, AbadGrau MM, Riva A, et al.: A hierarchical and modular approach to the discovery of robust associations in genomewide association studies from pooled DNA samples.
BMC Genet 2008, 9:6. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang K, Li M, Bucan M: Pathwaybased approaches for analysis of genomewide association studies.
Am J Hum Genet 2007, 81:12781283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Elbers CC, van Eijk KR, Franke L, Mulder F, et al.: Using genomewide pathway analysis to unravel the etiology of complex diseases.
Genet Epidemiol 2009, 33:419431. PubMed Abstract  Publisher Full Text

Chasman DI: On the utility of gene set methods in genomewide association studies of quantitative traits.
Genet Epidemiol 2008, 32:658668. PubMed Abstract  Publisher Full Text

Holden M, Deng S, Wojnowski L, Kulle B: GseaSNP: Applying gene set enrichment analysis to SNP data from genomewide association studies.
Bioinformatics 2008, 24:27842785. PubMed Abstract  Publisher Full Text

Bush WS, Dudek SM, Ritchie MD: Biofilter: A knowledgeintegration system for the multilocus analysis of genomewide association studies.

Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literaturederived network of biological relationships.
Bioinformatics 2005, 21:788793. PubMed Abstract  Publisher Full Text

Hong MG, Pawitan Y, Magnusson PK, Prince JA: Strategies and issues in the detection of pathway enrichment in genomewide association studies.

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, et al.: Pgc1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:267273. PubMed Abstract  Publisher Full Text

Pan W: Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data.
Stat Appl Genet Mol Biol 2005., 4
Art. 12

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, et al.: Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles.
Stat Appl Genet Mol Biol 2005., 4
Art. 12, Proc. Natl. Acad. Sci. USA Vol. 102, pp. 1554515550

Lesnick TG, Papapetropoulos S, Mash DC, FfrenchMullen J, et al.: A genomic pathway approach to a complex disease: Axon guidance and Parkinson disease.
PLoS Genet 2007, 3:e98. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Baranzini SE, Galwey NW, Wang J, Khankhanian P, et al.: Pathway and networkbased analysis of genomewide association studies in multiple sclerosis.
Hum Mol Genet 2009, 18:20782090. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Torkamani A, Topol EJ, Schork NJ: Pathway analysis of seven common diseases assessed by genomewide association.
Genomics 2008, 92:265272. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vink JM, Smit AB, de Geus EJ, Sullivan P, et al.: Genomewide association study of smoking initiation and current smoking.
Am J Hum Genet 2009, 84:367379. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Perry JR, McCarthy MI, Hattersley AT, Zeggini E, et al.: Interrogating type 2 diabetes genomewide association data using a biological pathwaybased approach.
Diabetes 2009, 58:14631467. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kasperaviciute D, Weale ME, Shianna KV, Banks GT, et al.: Largescale pathwaysbased association study in amyotrophic lateral sclerosis.
Brain 2007, 130:22922301. PubMed Abstract  Publisher Full Text

Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data.
J Comput Biol 2000, 7:601620. PubMed Abstract  Publisher Full Text

Yu J, Smith VA, Wang PP, Hartemink AJ, et al.: Advances to Bayesian network inference for generating causal networks from observational biological data.
Bioinformatics 2004, 20:35943603. PubMed Abstract  Publisher Full Text

Ritchie MD, White BC, Parker JS, Hahn CW, et al.: Optimization of neural network architecture using genetic programming improves detection and modeling of genegene interactions in studies of human diseases.
BMC Bioinformatics 2003, 4:28. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Byvatov E, Schneider G: Support vector machine applications in bioinformatics.

Schafer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks.

Wu CC, Huang HC, Juan HF, Chen ST: Genenetwork: An interactive tool for reconstruction of genetic networks using microarray data.
Bioinformatics 2004, 20:36913693. PubMed Abstract  Publisher Full Text

Franke L, van Bakel H, Fokkens L, de Jong ED, et al.: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes.
Am J Hum Genet 2006, 78:10111025. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Basso K, Margolin AA, Stolovitzky G, Klein U, et al.: Reverse engineering of regulatory networks in human b cells.
Nat Genet 2005, 37:382390. PubMed Abstract  Publisher Full Text

Kim TH, Ren B: Genomewide analysis of proteinDNA interactions.
Annu Rev Genom Hum Genet 2006, 7:81102. Publisher Full Text

Tu Z, Wang L, Arbeitman MN, Chen T, et al.: An integrative approach for causal gene identification and gene regulatory pathway inference.
Bioinformatics 2006, 22:e489e496. PubMed Abstract  Publisher Full Text

Yu H, Zhu X, Greenbaum D, et al.: Topnet: A tool for comparing biological subnetworks, correlating protein properties with topological statistics.
Nucleic Acids Res 2004, 32:328337. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Blais A, Dynlacht BD: Constructing transcriptional regulatory networks.
Genes Dev 2005, 19:14991511. PubMed Abstract  Publisher Full Text

Xie Y, Pan W, Jeong KS, Khodursky A: Incorporating prior information via shrinkage: A combined analysis of genomewide location data and gene expression data.
Stat Med 2007, 26:22582275. PubMed Abstract  Publisher Full Text

Dixon AL, Liang L, Moffatt MF, Chen W, et al.: A genomewide association study of global gene expression.
Nat Genet 2007, 39:12021207. PubMed Abstract  Publisher Full Text

Stranger BE, Forrest MS, Clark AG, Minichiello MJ, et al.: Genomewide associations of gene expression variation in humans.
PLoS Genet 2005, 1:e78. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Morley M, Molony CM, Weber TM, Devlin JL, et al.: Genetic analysis of genomewide variation in human gene expression.
Nature 2004, 430:743747. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cheung VG, Spielman RS, Ewens KG, Weber TM, et al.: Mapping determinants of human gene expression by regional and genomewide association.
Nature 2005, 437:13651369. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cheung VG, Conlin LK, Weber TM, Arcaro M, et al.: Natural variation in human gene expression assessed in lymphoblastoid cells.
Nat Genet 2003, 33:422425. PubMed Abstract  Publisher Full Text

Jansen RC: Studying complex biological systems using multifactorial perturbation.
Nat Rev Genet 2003, 4:145151. PubMed Abstract  Publisher Full Text

Armitage P, Doll R: The age distribution of cancer and multistage theory of carcinogenesis.
Br J Cancer 1954, 8(1):112. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Moolgavkar S, Knudson A: Mutation and cancer: A model for human carcinogenesis.
JNCI 1981, 66:10371052. PubMed Abstract