Skip to main content
  • Primary research
  • Open access
  • Published:

Mixture models for gene expression experiments with two species

Abstract

Cross-species research in drug development is novel and challenging. A bivariate mixture model utilizing information across two species was proposed to solve the fundamental problem of identifying differentially expressed genes in microarray experiments in order to potentially improve the understanding of translation between preclinical and clinical studies for drug development. The proposed approach models the joint distribution of treatment effects estimated from independent linear models. The mixture model posits up to nine components, four of which include groups in which genes are differentially expressed in both species. A comprehensive simulation to evaluate the model performance and one application on a real world data set, a mouse and human type II diabetes experiment, suggest that the proposed model, though highly structured, can handle various configurations of differential gene expression and is practically useful on identifying differentially expressed genes, especially when the magnitude of differential expression due to different treatment intervention is weak. In the mouse and human application, the proposed mixture model was able to eliminate unimportant genes and identify a list of genes that were differentially expressed in both species and could be potential gene targets for drug development.

Introduction

Background

Pharmaceutical medicine is an industry with huge up-front investment for rewards that may or may not come years later. A complete drug development process, including drug discovery, preclinical research (on animals) and clinical trials (on humans), is lengthy, expensive, and risky. Determined by the US Food and Drug Administration (FDA) [1], the average total cost per drug development is about $1.9 billion. The typical development time is 10 to 15 years. The overall attrition rate of a drug compound from first-in-man to registration is approximately 80%–90% [2, 3].

FDA [1] calls the preclinical and clinical research together as the ‘critical path’ development phase, where most investment required for a successful drug launch occurs. Currently, this development phase is inherently inefficient. The goal of preclinical research is to assess how a drug is absorbed, distributed, metabolized, and excreted in animals, and to use the findings to determine potential human outcomes before starting clinical trials. Yet the rate of success after a drug candidate enters Phase I is undesirably low. As mentioned in FDA [1] and Kola and Landis [3], animal models with poor clinical relevance may be accountable for this perplexity. Hence, improving translation between two species to increase the predictive power of animal models to human studies is of tremendous value to drug discovery and development.

Homology and multiple species gene expression analysis in drug development

Microarrays are tools for gene expression analysis and can be potentially useful for investigating the mechanism of drug activities that translates across species. The utility of microarray information in the drug development process is reviewed by Braxton and Bedilion [4] who embraced the idea that gene expression analysis can be a surrogate marker for the interaction between compounds and cells and should yield information about efficacy. Debouck and Goodfellow [5] believed that microarrays can be used to generate clues to patterns of gene function that can help improve the efficiency of drug development.

As stated before, one key challenge of drug development is to successfully translate the results of preclinical findings in animal models to human beings in the clinic. Pre-clinical experiments assume that the effect of the drug tested on animals is comparable to that on humans, which can only be true if a functional equivalent of the human drug target exists in the experimental species. Orthology [68] is a strong indication of functional conservation and therefore provides the best functional annotation of experimentally undetermined genes across species. Holbrook and Sanseau [9] remarked that the use of orthologs has the potential to improve the understanding of biological differences between species (animals and humans).

Many of the successful applications of cross-species microarray gene expression analysis involve orthology [1013]. Additionally, over the past decade, researchers have tried to use orthology and gene expression data to do cross-species comparison in order to understand how genes interact to perform particular biological processes [14, 15]. These studies support the idea that orthologs could be a useful tool for researchers to link experiments between species in drug development. Note that orthologous relationships can be one-to-one, one-to-many, or many-to-many [8].

The rest of the paper is structured as follows: Section ‘Joint modeling across species’ describes the proposed bivariate mixture model across species. Section ‘Simulation’ describes a simulation study undertaken to investigate the effects of different experimental designs on the power to detect important genes and on misclassification rates. Section ‘Application: the mouse and human type II diabetes experiment’ illustrates the methodology using an application to data collected in a mouse/human experiment. Section ‘Concluding remarks’ concludes.

Joint modeling across species

Let X a ij and X h il denote gene expression measurements from the ith orthologous gene pair for the jth animal and the lth human. The following independent linear models describe the association between gene expression and treatment:

X a ij = β 0 a i + β 1 a i T a j + e a ij ,
(1)
X h il = β 0 h i + β 1 h i T h l + e h il ,
(2)

where T a j and T h l are {0,1} treatment indicators, and e a ij and e h il are independent N(0, σ a 2 ) and N(0, σ h 2 ) random variables. σ a 2 and σ h 2 are variances for e a ij and e h il , respectively. In drug development, the animal research and human experiments are conducted independently - one’s results do not affect the other’s. However, the treatment effects are expected to have some kind of association between the two species. This results in our choice of using two independent models for the two species to capture the effects of treatment on gene expression.

A nine-component bivariate mixture model for two species experiments

β 1 a i and β 1 h i quantify the differential expression of the ith orthologous animal and human genes due to a treatment intervention. A given gene can be classified as non-differentially expressed (NDE) - showing no signs of treatment effects, positively differentially expressed (pDE) - showing positive treatment effects, or negatively differentially expressed (nDE) - showing negative treatment effects. Therefore, for a human and animal gene pair, there are nine possibilities for categorizing this pair of genes. Further, dependency is assumed between differentially expressed orthologs, i.e., existence of association posited only for ( β 1 a i , β 1 h i ) T in categories (1, 2, 3, 4) and zero correlation presumed for ( β 1 a i , β 1 h i ) T in categories (0, 5, 6, 7, 8). Table 1 illustrates the nine possible categories of ( β 1 a i , β 1 h i ) T . ( μ β 1 a i , μ β 1 h i ) T is the vector of population means of ( β 1 a i , β 1 h i ) T under each category.

Table 1 Possible categories of ( β 1 a i , β 1 h i ) T

In consequence of these possible patterns of ( β 1 a i , β 1 h i ) T , mixture models [16, 17] are adopted to deal with the correlation and distribution of each subgroup of genes across species. An additional advantage of mixture models is that, after prior weights for the components are specified, estimates of the posterior probabilities of population membership can be formed for each observation to give a probabilistic clustering. As a result, the pooling of information for genes across species can be exploited to better understand the underlying relationship between the treatment intervention for both species.

Tailoring the mixture model to two-species experiments with restrictions on the parameters made according to Table 1 and assuming that the treatment effects for non-differentially expressed genes are deterministically zero, i.e., ( β 1 a i , β 1 h i ) T = ( 0 , 0 ) T , the following bivariate normal mixture model is adopted as the prior distribution of the vector ( β 1 a i , β 1 h i ) T :

β 1 a i β 1 h i π 0 N μ a 0 μ h 0 , η a 0 2 ρ 0 η a 0 η h 0 ρ 0 η a 0 η h 0 η h 0 2 + π 1 N μ a 1 μ h 1 , η a 1 2 ρ 1 η a 1 η h 1 ρ 1 η a 1 η h 1 η h 1 2 + π 2 N μ a 2 μ h 2 , η a 2 2 ρ 2 η a 2 η h 2 ρ 2 η a 2 η h 2 η h 2 2 + π 3 N μ a 3 μ h 3 , η a 3 2 ρ 3 η a 3 η h 3 ρ 3 η a 3 η h 3 η h 3 2 + π 4 N μ a 4 μ h 4 , η a 4 2 ρ 4 η a 4 η h 4 ρ 4 η a 4 η h 4 η h 4 2 + π 5 N μ a 5 μ h 5 , η a 5 2 ρ 5 η a 5 η h 5 ρ 5 η a 5 η h 5 η h 5 2 + π 6 N μ a 6 μ h 6 , η a 6 2 ρ 6 η a 6 η h 6 ρ 6 η a 6 η h 6 η h 6 2 + π 7 N μ a 7 μ h 7 , η a 7 2 ρ 7 η a 7 η h 7 ρ 7 η a 7 η h 7 η h 7 2 + π 8 N μ a 8 μ h 8 , η a 8 2 ρ 8 η a 8 η h 8 ρ 8 η a 8 η h 8 η h 8 2 ,
(3)

where π k is the probability that an observation belongs to the kth component, with k = 0 8 π k =1and π k 0. The following restriction of the parameter space is imposed: μa0 = 0, μh0 = 0, μa1 ≥ 0, μh1 ≥ 0, μa2 ≤ 0, μh2 ≤ 0, μa3 ≥ 0, μh3 ≤ 0, μa4 ≤ 0, μh4 ≥ 0, μa5 = 0, μh5 ≥ 0, μa6 = 0, μh6 ≤ 0, μa7 ≥ 0, μh7 = 0, μa8 ≤ 0, μh8 = 0, ηa0 = 0, ηh0 = 0, ηa5 = 0, ηa6 = 0, ηh7 = 0, ηh8 = 0, ρ0 = 0, ρ5 = 0, ρ6 = 0, ρ7 = 0, and ρ8 = 0.

According to the theory of least squares, the marginal distribution of ( β ˆ 1 a i , β ˆ 1 h i ) T , the parameter estimates, has means equal to the prior means of ( β 1 a i , β 1 h i ) T and variances involving contributions from the prior distribution of ( β 1 a i , β 1 h i ) T and the conditional distribution of ( β ˆ 1 a i , β ˆ 1 h i ) T given ( β 1 a i , β 1 h i ) T . The marginal distribution of ( β ˆ 1 a i , β ˆ 1 h i ) T is as follows:

β ˆ 1 a i β ˆ 1 h i π 0 N 0 0 , σ a 0 2 0 0 σ h 0 2 + π 1 N μ a 1 μ h 1 , σ a 1 2 ρ 1 σ a 1 σ h 1 ρ 1 σ a 1 σ h 1 σ h 1 2 + π 2 N μ a 2 μ h 2 , σ a 2 2 ρ 2 σ a 2 σ h 2 ρ 2 σ a 2 σ h 2 σ h 2 2 + π 3 N μ a 3 μ h 3 , σ a 3 2 ρ 3 σ a 3 σ h 3 ρ 3 σ a 3 σ h 3 σ h 3 2 + π 4 N μ a 4 μ h 4 , σ a 4 2 ρ 4 σ a 4 σ h 4 ρ 4 σ a 4 σ h 4 σ h 4 2 + π 5 N 0 μ h 5 , σ a 5 2 0 0 σ h 5 2 + π 6 N 0 μ h 6 , σ a 6 2 0 0 σ h 6 2 + π 7 N μ a 7 0 , σ a 7 2 0 0 σ h 7 2 + π 8 N μ a 8 0 , σ a 8 2 0 0 σ h 8 2 ,
(4)

An EM algorithm is developed to accomplish the nontrivial likelihood maximization, along with methodology for handling singular covariance matrices that arise during the implementation of the algorithm. (See the Appendix for details). Gene membership is determined according to the maximum posterior probability that an observation ( β ˆ 1 a i , β ˆ 1 h i ) T comes from the kth component of the mixture.

Simulation

The following Monte Carlo simulation studies investigated the performance of the proposed mixture model using information across two species in comparison to the traditional microarray method using just one-species information when identifying genes associated with treatment stimulus under several different scenarios.

Several factors influence the sampling properties of the estimated treatment effects ( β ˆ 1 a i , β ˆ 1 h i ) T using the mixture model were considered in the simulation:

  •  Replicates (number of arrays) per treatment for each species: n a and n h for animals and humans, respectively.

  •  Number of orthologous genes in each category: n k , k = 0,…,8, the kth category.

  •  Array noise: e a ij and e h ij in (1) and (2). Also recall that, by assumption, e a ij and e h il are independent N(0, σ a 2 ) and N(0, σ h 2 ) random variables.

  •  Parameters in (3) by which the sampling distribution of ( β ˆ 1 a i , β ˆ 1 h i ) T is determined.

With so many variables, it is impractical to study the sampling properties of the fitted model without fixing some variables. That is, an experimental design for the simulation in which these factors are completely considered is not feasible. The simulation study instead focused on three aspects. First, although high-density microarrays provide useful genome-wide data, they are often associated with a substantial amount of experimental noise that could affect the performance of the analysis. Hence, it is of interest to investigate how the array noise would affect the model efficiency on gene identification across species.

Second, the sample size of cross-species experiment is likely to be different, and may be one of the deciding factors of the power associated with the modeling approach. In particular, the efficiency of gene identification, whether the proposed model gains power over one-species experiments through pooling information across species, should be examined carefully, especially when the sample size of the experiments is small.

Third, over-fitting may be of concern. The proposed mixture model is, by its nature, highly structured and data driven. If the data are not driven by all nine categories as the model suggests, will the mixture model fail? Is the proposed model flexible enough to handle different types of data structure? To examine the model performance systematically and to test if the proposed model will fail when too many components are used to fit the data where there are actually fewer clusters, two types of data were generated: all nine categories non-empty (case I) and some of the nine components empty (case II).

In simulation studies case I and case II, two methods of gene identification were implemented: the proposed mixture model, utilizing information across two different species, and the traditional t statistics adjusting for multiple comparisons based on single-species data. Five hundred data sets were generated for each different scenario under each simulation study.

Parameter determination and data generation

Theoretically, the number of genes in category 0 (non-differentially expressed in both species) should dominate others, and every other category may comprise some genes. Orthology information from HomoloGene of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/homologene) and Mouse Genome Informatics (MGI) of Jackson Laboratory [18] and the practical experience gained from analysis of two-species gene expression experiments at GlasoSmithKline (GSK) was used as reference to determine a reasonable number of genes in each category.

The vectors of the number of genes in each category (n0,n1,…,n8)T determined for simulation studies case I and case II are categories (6,000,30, 30,30,30,100,100,100,100)T and categories (6,000,30, 30,0,0,100,0,100,0)T, respectively. For experiments across species, sample sizes may differ. Considering that this proposed bivariate method could benefit from pooling information across species, especially when the sample size is small, and the practical situation, two scenarios were implemented: the number of replicates per treatment for each species is equal and small, and the number of replicates per treatment for animals is greater than humans. In addition, to evaluate how robust the proposed method is against array noise, two situations were considered: the two experiments are equally noisy and the human data are noisier than the animal’s. Furthermore, values of parameters in (3) for the sampling distribution of ( β ˆ 1 a i , β ˆ 1 h i ) T were predetermined. Variances of ( β 1 a i , β 1 h i ) T in each component were assumed to be the same: η a 1 2 = η a 2 2 = η a 3 2 = η a 4 2 = η a 7 2 = η a 8 2 = η h 1 2 = η h 2 2 = η h 3 2 = η h 4 2 = η h 5 2 = η h 6 2 =0.25. The correlation between ( β 1 a i , β 1 h i ) T in categories (1,2,3,4) was assumed to be 0.9, i.e., ρ1 = ρ2 = ρ3 = ρ4 = 0.9. Nonzero component means (μ ak ,μ hk )T, k = 0,…,8, were determined so that |μ/η| = 0.5 or 1.5. The combination of these parameters resulted in eight different scenarios for each case as presented in Table 2. Note that | μ β 1 a i | and | μ β 1 h i | represent the absolute value of the mean vector of ( β 1 a i , β 1 h i ) T .

Table 2 Combination of parameters for simulation studies case I and case II

After generating ( β 1 a i , β 1 h i ) T accordingly, the next step is to simulate the two species gene expression data X a and X h based on linear models (1) and (2). Note that β 0 a i and β 0 h i are independent N(8,1) random variables for differentially expressed genes and deterministically 0 for non-differentially expressed genes.

Simulation results

It is of interest to compare how effective the mixture model is on gene identification using information across two species with the conventional one-species approach. The conventional two-sample t test for gene selection was performed using just single species data (animals or humans) and a multiplicity adjustment was made according to the procedure proposed by Benjamini and Hochberg [19].

The results are presented in Table 3. The first section of Table 3, categories (1, 2, 3, 4, 5, 6), manifests the number of genes classified into categories (1, 2, 3, 4, 5, 6) using the the mixture model (Mixture) and the number of genes selected using human data only (Human only), with the corresponding nominal FDR controlled at FDR I . For each simulated data set, FDR I was calculated as (number of genes that are erroneously classified into categories (1, 2, 3, 4, 5, 6))/(total number of genes classified into categories (1, 2, 3, 4, 5, 6)). This was to ensure a fair comparison between the mixture model and the conventional one-species method. FDR II was calculated in the same fashion. Avg FDR I and Avg FDR II are simply the averaged values of FDR I and FDR II across the 500 simulated data sets. When the estimated nominal FDR = 0, i.e., the mixture model did not falsely categorize any genes, the nominal FDR for the single-species method was controlled at 0.0001. The second section of Table 3, categories (1,2,3,4,7,8), represents the number of genes classified into categories (1,2,3,4,7,8) using the mixture model (Mixture) and number of genes selected using animal data alone (Animal only). Beneath each set of eight simulation cases is Tukey’s Honestly Significant Difference (HSD) for a familywise error rate of 0.05 for the results obtained using the proposed mixture model. Tukey’s HSD was calculated as q0.05(8,3,992) × (MS(Error) /500)1/2, since there were eight simulation cases (500 simulated data sets in each situation) in each simulation study (case I and case II) and the error degree of freedom was 3,992. MS(Error) denotes the error mean square (= Error sum of squares/Error degree of freedom, see Table 4). q0.05(8,3,992) = 4.29.

Table 3 The number of genes selected based on (a) bivariate mixture model, (b) conventional one-species approach
Table 4 ANOVA table to quantify variability

The effect of array noise on the mixture model can be easily seen from the column of Avg FDR I and by comparing the results of sim1 vs. sim3, sim5 vs. sim7, sim9 vs. sim11, and sim13 vs. sim15. These are the cases with smaller means and variances for humans which change from 0.1 to 0.3 for each pair of comparisons. The differences between the observed FDRs for these four groups were at least 0.2 and 0.3 for case I and case II, respectively. In contrast, when means were larger, changes of variances did not seem to affect the results in the sense that the corresponding observed FDRs had barely changed while the variances of human increased. The observed FDRs for animals were not as sensitive to the array noise on humans as the observed FDRs for humans.

Under both simulation studies case I and case II, increasing the number of replicates in the animal experiment helped the gene identification for animals: more animal genes were identified for sim5 to sim8 than for sim1 to sim4, and for sim13 to sim16 than for sim9 to sim12. The corresponding FDRs were also lower for sim5 to sim8 and sim13 to sim16. In contrast, increasing the number of replicates in the animal experiment did not significantly improve the results of gene identification for humans.

Table 4 is the summary of a three-way analysis of variance (ANOVA). Three factors, each with two levels, were used in the analysis: replicates ((n a ,n h )T = (10,10)T or (100,10)T), mean magnitude ( ( | μ β 1 a i | , | μ β 1 h i | ) T = ( 0.25 , 0.25 ) T or (0.75,0.75)T), and array noise ( ( σ a 2 , σ h 2 ) T = ( 0.1 , 0.1 ) T or (0.1,0.3)T). This analysis, performed independently, quantifies the variability among the results (gene counts and observed FDRs) obtained using the proposed mixture model in Table 3 for the 16 different simulated situations under simulation studies case I and case II.

Throughout the 16 simulation cases, with nominal FDR controlled at FDR I , the bivariate mixture model outperformed the single-species method for human gene identification by always recognizing more genes with lower observed FDRs. For the animal part, the mixture model performed at least as well as the single-species method on gene selection by identifying at least as many genes. Notice that selecting genes related to humans (categories (1, 2, 3, 4, 5, 6)) seemed to be associated with higher false discovery rate than selecting genes related to animals (categories (1, 2, 3, 4, 7, 8)). Furthermore, the observed FDRs were lower for case I than for case II, regardless of the type of genes interested (differentially expressed for humans or animals).

The comprehensive simulation study suggested that the proposed model, though highly structured, offered advantages over single-species analyses, especially when the magnitude of differential expression due to different treatment intervention was weak.

Application: the mouse and human type II diabetes experiment

Background introduction

A systems biology study was completed by GlaxoSmithKline (GSK) to study the efficacy of type II diabetes drugs in both preclinical (mice) and clinical (humans) experiments. The mouse and human data were collected and preprocessed using Affymetrix MAS 5.0 at the probe set level on Affymetrix MOE430A array and Human Genome U133 Plus 2.0 array, respectively. For mice, the total number of probe sets was 22,690. Mice were fed a diet enriched in fat (58% kcal from fat) for 8 weeks prior to treatment. Most of the mice of this susceptible strain developed obesity and mild hyperglycemia and hyperinsulinemia. Control mice on an 11% low fat diet remained normal. The mouse treatment arm consisted of a diabetes drug at multiple dose levels with vehicle controls over a 2-week period. The study was a full factorial design, where 40 animals in high fat diet and 40 animals in low fat diet were randomized to receive either placebo, or different dosages of the type II diabetes drug (low, medium or high). Note that the results for mice were measured at one time point, the end of the study. For simplicity, only integration results on animals treated with placebo (10 mice) or high dose of the drug (9 mice) were demonstrated.

On the other hand, there were 54,676 probe sets in the human data set and all 59 subjects were type II diabetes patients. The gene expression measurements for human subjects were collected twice during the experiment, one at baseline before the treatment started (week 0) and the other one at the completion of the study (week 8). Among the 59 subjects, 14 only had data for only one time point and hence were not included in further data analysis. Subjects in the clinical trial were treated by either placebo, or three other type II diabetes drugs, including the one given to the mice. Titrated dosing was implemented to ensure that each person received his/her dose based on his/her body profile. Human subjects treated with placebo (number of subjects = 11) and the same type II diabetes drug given to mice (number of subjects = 13) were used in this analysis. Besides being measured at two time points, other information contained in the human data and used in the data analysis included: prior therapy (four categorical levels) and concomitant medication status (two categorical levels).

Data for both species were logarithmic transformed (base equal to 2). Methods for combining information from multiple probe sets that were not identical have been discussed in many publications [20, 21]. A gene-level transcript value in order to pair the mouse and human genes through orthology was obtained by averaging probe sets across a gene. This resulted in 13,483 and 20,252 genes for mouse and human, respectively. Missing values were replaced with array means in both data sets.

The mouse and human orthology information, MGI release 4.32, was used to map the mouse and human orthologous genes. Human and mouse orthologs (17,834) were included in release 4.32. Combining the information among MGI release 4.32 and mouse and human genes from the GSK data gave a total of 11,922 orthologs.

In addition to the measurements of gene expression for both mice and humans, different efficacy endpoints, such as blood glucose, insulin, hemoglobin A1c (HbA1c), and others were also measured during the preclinical and clinical experiments by GSK in order to evaluate the effect of treatment intervention on both species.

The purpose of this data analysis was to evaluate the capability of the proposed mixture model, an approach utilizing the information across two species, to identify genes that may help scientists reveal the biological similarity between two species (ex: mouse and human), essentially genes in categories (1,2,3,4), and so to improve the efficiency of drug development by decreasing the compound attrition rate from preclinical trials to clinical trials.

Data analysis

The estimated treatment effects for mice and humans are obtained by the following independent simple linear models:

X a ij = β 0 a i + β 1 a i T a j + e a ij ,
(5)
X h il week 8 = β 0 h i + X h il week 0 + β 1 h i T h l + β 2 h i PriorTherapy + β 3 h i ConMed + e h il ,
(6)

where i = 1,…,11,922 (number of orthologs); j = 1,…,19 (number of mice); l = 1,…,24 (number of humans). X a ij , X h il week 0 and X h il week 8 are gene expressions from the ith orthologous gene for the jth mouse, gene expression from the ith orthologous gene for the lth human before treatment intervention, and gene expression from the ith orthologous gene for the lth human at the completion of the clinical experiment, respectively. T a j and T h l are {0,1} treatment indicators, and e a ij and e h il are independent N(0, σ a 2 ) and N(0, σ h 2 ) random variables. Additionally, there are two more covariates in the human model: PriorTherapy (a four-level categorical variable indicating patients’ therapy prior to the clinical trial) and ConMed (a two-level categorical variable for concomitant medication status).

Equation (4) was used to model the distribution of the estimated treatment effects. Genes in categories (1, 2, 3, 4) are believed to be potential biomarkers that can greatly improve the design of the process of drug development as these orthologous genes interact with drugs in a way that shows some relationship between two different species (in this case, human and mouse) and so studying the behavior of these genes in preclinical trials might help scientists better understand the mechanism of drug activities in clinical trials.

Results

Parameter estimation

The maximum likelihood estimates of the parameters in the bivariate normal mixture model using the EM algorithm are given in Table 5. The estimated mixture weight for category 0 was π ˆ 0 =0.889 indicating approximately 10,599 (11,922 × 0.889) pairs of uninteresting mouse and human orthologs. (μ a ,μ h )T denotes the mean vector of each mixture component. The estimated treatment effect means of mice were in general larger than that of humans, indicating that the magnitude of the difference of expression in genes due to treatment intervention in mice tended to be larger than in genes where differential expression is exhibited in humans. The estimated variance for mice σ ˆ ak 2 tended to be larger than that for humans σ ˆ hk 2 , suggesting that overall, the variability in the mouse data set was larger than the human data set. The estimated correlation coefficients between the treatment effects for both species were (-0.531,0.380,-1,0.102) for categories (1,2,3,4), respectively. The estimated correlation for category 3 was based on only two observations. Standard errors of the parameter estimates from the two-species experiment were obtained by bootstrapping with 1,000 bootstrap replicates. Based on the bootstrap standard errors, there was little evidence of bias for the parameter estimates.

Table 5 Parameter estimates of the bivariate mixture model

Gene identification

The vector of the number of genes identified for each category was ( n ˆ 0 , n ˆ 1 , , n ˆ 8 ) T = (10,814, 41, 12, 2, 20, 168, 578, 12, 275)T. Figure 1 displays the scatter plots of ( β ˆ 1 a i , β ˆ 1 h i ) T before and after gene membership identification, including the scatter plot of ( β ˆ 1 a i , β ˆ 1 h i ) T for all orthologs, orthologs after eliminating the uninteresting ones (orthologs not in category 0), and orthologs reacting to the treatment stimulus for both species (orthologs in categories (1, 2, 3, 4)).

Figure 1
figure 1

Scatter plots of the estimated treatment effects ( β ˆ 1 a i , β ˆ 1 h i ) T before and after gene membership identification. From left to right: all orthologs, orthologs differentially expressed in either species (categories (1,2,3,4,5,6,7,8)), and orthologs differentially expressed in both species (categories (1,2,3,4)).

Based on the bivariate mixture model, among the 11,922 pairs of orthologs, 10,814 pairs did not react to the drug treatment for either species, 75 pairs of orthologs (sum of the gene counts in categories 1 through 4) showed evidence of differentiation between treatments for both species. Genes in categories (5,6,7,8) are also potential candidates for further investigation to improve the process of drug development since studying these genes might uncover the myth of the overall high attrition rate of a drug compound from preclinical trials to clinical trials.

In comparison, an attempt was made to identify differentially expressed human genes using solely the human data, i.e., traditional t statistics were used to test whether or not β 1 h i =0 and the approach of Benjamini and Hochberg [19] was used to adjust for multiple comparisons. With the nominal FDR controlled at 0.01, 0.1, 0.5, and 0.9, this single-species method failed to identify any differentially expressed human genes at any levels of FDR. With the p values histograms in Figure 2 showing an obvious difference between the observed significance of differential expression for mice and humans, specifically, the p values were nearly uniformly distributed for humans, the result was not surprising.

Figure 2
figure 2

Histograms of p values from tests of no treatment effects. From left to right: p value histogram for mice and p value histogram for humans.

Concluding remarks

This research was motivated by a fundamental yet still not well-understood problem in the drug development process. The results obtained in preclinical animal trials do not seem to translate well enough to make inferences for human clinical trials, resulting in an undesirably high attrition rate in human experiments. A bivariate mixture model which utilizes information across two species was proposed to identify genes that exhibit similar patterns of expression across species, with the hope that studying genes could help understand biological differences across species at the molecular level and ultimately help reduce attrition in drug development. It is also of great interest to identify genes active in animal, but not in human since studying this group of genes might lead to answers that explain why some drug trials fail in translation to humans. The comprehensive simulation study suggested that the proposed model, though highly structured, can accommodate various configurations of differential gene expression, especially when the magnitude of differential expression due to different treatment intervention was weak.

In the application of the bivariate mixture model on the GSK type II diabetes experiment, the mixture model was able to separate differentially expressed genes from non-differentially expressed genes. A potential multi-gene predictor may be developed according to the genes identified by the bivariate mixture model to benefit patients in therapeutic decision making.

The mixture model is highly structured, with strong but somewhat simplifying assumptions. The grouping of all genes for which the expression difference is positive in both species into a single category parameterized by a single bivariate mean may be somewhat of a simplistic approach. In practice, the data may not be normally distributed or be comprised by exactly nine groups and lead to bias and inefficiency. Forcing the normality assumption and the grouping may be inefficient. However, by modeling the least squares estimates of the expression differences as bivariate realizations from a distribution with a single mean vector, some flexibility in the model is retained. It is perhaps less important to precisely quantify the magnitude of the expression difference than to determine whether it is positive or negative and whether or not the direction is preserved across species.

Using the proposed mixture model for gene identification is completely data-driven. The simulation study in Section ‘Simulation’ indicated that the mixture model at times report a poor observed false discovery rate due to large variability in measurement of expression. Currently, gene membership classification is determined by maximizing the posterior probability that observation y i belongs to the kth cluster. It is possible to choose costs to attach weight to different types of misclassification and then choose a classification rule to minimize expected cost. This rule may lead to a more desirable list that can better accommodate goals for future research.

The mixture model currently only handles data with two treatments/cancer types/drugs, i.e., data containing only two-level variables. In practice, it is often the case that an experiment is run under multiple conditions. An extension of the model to handle experiments with factors with three-or-more levels is an important future work.

This approach focuses on identifying orthologs that show same/opposite mechanism between two species. However, genes identified by the bivariate mixture model (genes in categories (1, 2, 3, 4)) may not lead to the most powerful model for prediction of cancer types/response status for either one of the species. The most powerful model may be based on those human genes that lie in categories 5 and 6. For those genes, the corresponding animal genes show no signs of differential expression. Hence, incorporating prediction ability into the model or developing a prediction model that can help utilize the genes selected by the mixture model is of great interest and is an obvious candidate for future research.

Appendix

Mixture models and the EM algorithm

Estimating the parameters in (4) is nontrivial. Redner and Walker [22] offer an excellent review of estimating the parameters which determine a mixture density. In particular, the paper is devoted to a particular iterative procedure for numerically approximating maximum likelihood estimates (MLE) of the parameters in mixture densities. This method was formalized by Dempster et al. [23] and termed the EM (Expectation-Maximization) algorithm, and is used for numerically approximating the maximum likelihood estimates for (4).

The EM algorithm with no constraints

For a finite mixture model with C components, given data y with independent multivariate observations y1,…,y g , each y i is taken to be a realization of the mixture probability density function,

f ( y i | Ψ ) = k = 1 C π k f k ( y i | θ k ) ,

where Ψ = (θ1,…,θ C ,π1,…,π C )T, a vector of unknown parameters. f k and θ k are the density and parameters of the kth component in the mixture, respectively. π k is the probability that an observation arises from the kth component. Note that π k  ≥ 0 and k = 1 C π k =1. For classification purpose and to achieve minimum misclassification rates, y i is assigned to the population (category) for which the posterior probability that y i belongs to the kth cluster (the kth component of the mixture) is maximized. The posterior probability is given by

τ k ( y i ; Ψ ) = π k f k ( y i | θ k ) h = 1 C π h f h ( y i | θ h ) .

In summary, the EM algorithm may be implemented to maximize the likelihood of a multivariate normal mixture model by following these two steps:

E-stepThe E-step on the (j+1) iteration takes the conditional expectation of the complete-data log likelihood, given the observed data (Q( Ψ|Ψ(j))).

At the (j + 1) iteration, the E-step results in

Q(Ψ| Ψ ( j ) )= i = 1 g k = 1 C τ k ( y i ; Ψ ( j ) )(log π k +log f k ( y i | θ k )),
(7)

where f k ( y i | θ k )= ( 2 π ) - p 2 | Σ k | - 1 2 e - 1 2 ( y i - μ k ) T Σ k - 1 ( y i - μ k ) and τ k ( y i ; Ψ ( j ) )= π k ( j ) f k ( y i | θ k ( j ) ) h = 1 C π h ( j ) f h ( y i | θ h ( j ) ) .

M-stepThe M-step on the (j+1) iteration requires the global maximization of Q(Ψ|Ψ(j)) with respect to Ψ over the parameter space to give the updated estimate Ψ(j+1).

π ˆ k ( j + 1 ) = i = 1 g τ k ( y i ; Ψ ( j ) ) g ;
(8)
μ ˆ k ( j + 1 ) = i = 1 g τ k ( y i ; Ψ ( j ) ) y i i = 1 g τ k ( y i ; Ψ ( j ) ) ;
(9)
Σ ˆ k ( j + 1 ) = i = 1 g τ k ( y i ; Ψ ( j ) ) ( y i - μ k ) T ( y i - μ k ) i = 1 g τ k ( y i ; Ψ ( j ) ) .
(10)

The iterations of the EM algorithm continue until some stopping criterion is met, such as the difference of the conditional expectation of the complete-data log likelihood at the (j + 1) step and the conditional expectation of the complete-data log likelihood at the j step is sufficiently small, i.e.,

Q ( Φ ( j + 1 ) | Φ ( j ) ) - Q ( Φ ( j ) | Φ ( j ) ) ε .

Throughout this research, ε = 0.0001.

The EM Algorithm with constraints

With constraints on the parameter space, the EM algorithm derived in Section ‘The EM algorithm with no constraints’ cannot be used directly. To accommodate the restricted parameter space introduced for (4), first note that (4) is a bivariate normal mixture with nine components and the density for each component can be written as

f k y i | θ k = e - 1 2 1 σ a k 2 σ h k 2 1 - ρ k 2 σ h k 2 y a i - μ a k 2 - 2 ρ k σ a k σ h k y a i - μ a k y h i - μ h k + σ a k 2 y h i - μ h k 2 2 π σ a k 2 σ h k 2 1 - ρ k 2 ,

where y i = ( y a i , y h i ) T = ( β ˆ 1 a i , β ˆ 1 h i ) T and θ k = ( μ a k , μ h k , σ a k 2 , σ h k 2 , ρ k ) T .

The solution for the E-step remains the same. The membership probability τ k (y i ;Ψ(j)) can be obtained by taking the conditional expectation of the complete-data log likelihood, given the observed data.

As for the M-step, the MLE for π k remains the same, π ˆ k ( j + 1 ) = i = 1 g τ k ( y i ; Ψ ( j ) ) g . However, the MLEs for μ k and Σ k need to be modified according to the constrained parameter space for each mixture component.

Essentially, the MLEs on the (j + 1) iteration for (μ ak ,μ hk )T, k = 1,…,4, are

μ ˆ a 1 ( j + 1 ) = i = 1 g τ 1 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 1 ( y i ; Ψ ( j ) ) if i = 1 g τ 1 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise. μ ˆ h 1 ( j + 1 ) = i = 1 g τ 1 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 1 ( y i ; Ψ ( j ) ) if i = 1 g τ 1 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise. μ ˆ a 2 ( j + 1 ) = i = 1 g τ 2 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 2 ( y i ; Ψ ( j ) ) if i = 1 g τ 2 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise. μ ˆ h 2 ( j + 1 ) = i = 1 g τ 2 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 2 ( y i ; Ψ ( j ) ) if i = 1 g τ 2 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise. μ ˆ a 3 ( j + 1 ) = i = 1 g τ 3 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 3 ( y i ; Ψ ( j ) ) if i = 1 g τ 3 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise. μ ˆ h 3 ( j + 1 ) = i = 1 g τ 3 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 3 ( y i ; Ψ ( j ) ) if i = 1 g τ 3 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise. μ ˆ a 4 ( j + 1 ) = i = 1 g τ 4 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 4 ( y i ; Ψ ( j ) ) if i = 1 g τ 4 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise. μ ˆ h 4 ( j + 1 ) = i = 1 g τ 4 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 4 ( y i ; Ψ ( j ) ) if i = 1 g τ 4 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise.

The MLEs of Σ k , k = 1,…,4, at the (j + 1) M-step remain unchanged as in (10).

The MLEs on the (j + 1) iteration for (μ ak ,μ hk )T, k = 5,…,8, are

μ ˆ h 5 ( j + 1 ) = i = 1 g τ 5 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 5 ( y i ; Ψ ( j ) ) if i = 1 g τ 5 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise. μ ˆ h 6 ( j + 1 ) = i = 1 g τ 6 ( y i ; Ψ ( j ) ) y h i i = 1 g τ 6 ( y i ; Ψ ( j ) ) if i = 1 g τ 6 ( y i ; Ψ ( j ) ) y h i 0 ; 0 otherwise. μ ˆ a 7 ( j + 1 ) = i = 1 g τ 7 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 7 ( y i ; Ψ ( j ) ) if i = 1 g τ 7 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise. μ ˆ a 8 ( j + 1 ) = i = 1 g τ 8 ( y i ; Ψ ( j ) ) y a i i = 1 g τ 8 ( y i ; Ψ ( j ) ) if i = 1 g τ 8 ( y i ; Ψ ( j ) ) y a i 0 ; 0 otherwise.

The MLE for Σ k , k = 0,5,6,7,8, on the (j + 1) iteration is

Σ ˆ k = i = 1 g τ k ( y i ; Ψ ( j ) ) y a i 2 i = 1 g τ k ( y i ; Ψ ( j ) ) 0 0 i = 1 g τ k ( y i ; Ψ ( j ) ) y h i 2 i = 1 g τ k ( y i ; Ψ ( j ) ) .

Regularized covariance matrices in the EM algorithm

The component covariance matrices of a mixture model may be singular or near singular in the EM iterative process. When the covariance matrices corresponding to one or more components are ill-conditioned (singular or near singular), the EM algorithm breaks down. Particularly, applying the EM algorithm for a mixture model with large numbers of components when there are actually fewer groups often results in the failure of the EM algorithm due to ill-conditioning [24, 25]. Indeed, this break-down of the EM algorithm may imply that clusters contain insufficient observations and too many components are used to fit the data set where there are actually fewer clusters, or clusters contain points that are of very little variation compared to other clusters. Hence, an intuitive solution to this is to decrease the number of the mixture components. However, this immediately leads to another question: how many clusters are needed. Though an active area of research, it is beyond the scope of this study. Nonetheless, various approaches have been proposed to generate numerically non-singular covariance matrices [2633]. Among such, the regularization method proposed by Sato and Ishii [32] has been adopted to obtain numerically non-singular covariance matrices throughout this research. In Sato and Ishii [32], the regularized covariance matrix for the kth mixture component in the (j + 1) M-step is

Σ Rk ( j + 1 ) = Σ k ( j + 1 ) +α tr ( Σ k ( j + 1 ) ) p I p ,
(11)

where 0 ≤ α ≤ 1 is a small constant and I p is a p-dimensional identity matrix. If Σ k ( j + 1 ) equals 0, then tr( Σ k ( j + 1 ) ) is set to be a small threshold value (0.0001 in this research).

References

  1. FDA: Innovation and stagnation: challenge and opportunity on the critical path to new medical products. FDA White Paper. 2004

    Google Scholar 

  2. Bolten B, DeGregorio T: Trends in development cycles. Nat Rev Drug Discov. 2002, 1: 335-336.

    Article  CAS  PubMed  Google Scholar 

  3. Kola I: Landis J: Can the pharmaceutical industry reduce attrition rates?. Nat Rev Drug Discov. 2004, 3: 711-715.

    Article  CAS  PubMed  Google Scholar 

  4. Braxton S, Bedilion T: The integration of microarray information in the drug development process. Curr Opin Biotechnol. 1998, 9: 643-649.

    Article  CAS  PubMed  Google Scholar 

  5. Debouck C, Goodfellow P: DNA microarrays in drug discovery and development. Nat Genet. 1999, 21: 48-50.

    Article  CAS  PubMed  Google Scholar 

  6. Koonin E: An apology for orthologs - or brave new memes. Genome Biol. 2001, 2 (4): 1005.1-1005.2.

    Article  Google Scholar 

  7. Sonnhammer E, Koonin E: Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 2002, 18: 619-620.

    Article  CAS  PubMed  Google Scholar 

  8. Theiben G: Secret life of genes. Nature. 2002, 415: 741-

    Article  Google Scholar 

  9. Holbrook J, Sanseau P: Drug discovery and computational evolutionary analysis. Drug Discov Today. 2007, 12: 826-832.

    Article  CAS  PubMed  Google Scholar 

  10. Grigoryev D, Ma S, Irizarry R, Ye S, Quackenbush J, Garcia J: Orthologous gene-expression profiling in multi-species models: search for candidate genes. Genome Biol. 2004, 5: R34.1-R34.13.

    Article  Google Scholar 

  11. Batzoglou S, Pachter L, Mesirov J, Berger B, Lander E: Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 2008, 10: 950-958.

    Article  Google Scholar 

  12. Taher L, Rinner O, Garg S, Sczyrba A, Morgenstern B: AGenDA: gene prediction by cross-species sequence comparison. Nucleic Acids Res. 2004, 1: W305-W308.

    Article  Google Scholar 

  13. Ogorek B: Orthology-based multilevel modeling of differentially expressed mouse and human gene pairs. PhD thesis. 2008

    Google Scholar 

  14. Lelandais G, Vincens P, Badel-Chagnon A, Vialette S, Jacq C, Hazout S: Comparing gene expression networks in a multi-dimensional space to extract similarities and differences between organisms. Bioinformatics. 2006, 22: 1359-1366.

    Article  CAS  PubMed  Google Scholar 

  15. Park D, Park J, Park S, Park T, Choi S: Analysis of human disease genes in the context of gene essentiality. Genomics. 2008, 92: 414-418.

    Article  CAS  PubMed  Google Scholar 

  16. McLachlan G, Basford K: Mixture Models: Inference and Applications to Clustering. 1988, New York: Marcel Dekker, Inc.

    Google Scholar 

  17. McLachlan G, Peel D: Finite Mixture Models. 2000, New York: Wiley

    Book  Google Scholar 

  18. Blake J, Richardson J, Bult C, Kadin J: Eppig J, the Mouse Genome Database Group: The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 2002, 30: 113-115.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995, 57: 289-300.

    Google Scholar 

  20. Li C, Wong W: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Li C, Zhu D, Cook M: A statistical framework for consolidating sibling probe sets for Affymetrix GeneChip data. BMC Genomics. 2008, 9: 188-

    Article  PubMed Central  PubMed  Google Scholar 

  22. Redner R, Walker H: Mixture densities, maximum likelihood and the EM algorithm. J Soc Ind Appl Math. 1984, 26 (2): 195-239.

    Google Scholar 

  23. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B. 1977, 39: 1-38.

    Google Scholar 

  24. Fraley C, Raftery A: How many clusters? Which cluster method? Answer via model-based cluster analysis. Comput J. 1998, 41 (8): 578-588.

    Article  Google Scholar 

  25. Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002, 97 (458): 611-631.

    Article  Google Scholar 

  26. Robotka Z, Zempleni A, Hajas C, Seres C, Balazs S: Genetic algorithms and grid technologies in clustering, an example: clustering of images. Qual Reliab Eng Int. 2008, 24: 693-703.

    Article  Google Scholar 

  27. Vlassis N, Likas A: A greedy EM algorithm for Gaussian mixture. Neural Process Lett. 2002, 15: 77-87.

    Article  Google Scholar 

  28. Hsieh P, Landgrebe D: Statistics enhancement in hyperspectral data analysis using spectral-spatial labeling, the EM algorithm, and the leave-one-out covariance estimator. SPIE International Symposium on Optical Science, Engineering, and Instrumentation. 1999, Denver, Colorado, 19-24.

    Google Scholar 

  29. Snoussi H, Mohammad-Djafari A: Penalized maximum likelihood for multivariate Gaussian mixture. AIP Conference Proceedings. 2002, 36-46.

    Chapter  Google Scholar 

  30. Mao J, Jain A: A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans Neural Netw. 1996, 7: 16-29.

    Article  CAS  PubMed  Google Scholar 

  31. Archambeau C, Verleysen M: Fully nonparametric probability density function estimation with finite gaussian mixture models. Fifth International Conference on Advances in Pattern Recognition. 10-13 Dec 2003. 2003, Calcultta, India, 81-84.

    Google Scholar 

  32. Sato M, Ishii S: On-line EM algorithm for the normalized Gaussian network. Neural Comput. 2000, 12: 407-432.

    Article  CAS  PubMed  Google Scholar 

  33. Lee K: A new, EM algorithm for resource allocation network. Pattern Recognition and Data Mining. 2005, Berlin: Springer

    Google Scholar 

Download references

Acknowledgements

The authors thank GlaxoSmithKline for sponsoring the systems biology study and collaborating in the data analyses.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yuhua Su.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YS, LZ, AM and JO have participated in the development of the proposed model. YS has drafted the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Su, Y., Zhu, L., Menius, A. et al. Mixture models for gene expression experiments with two species. Hum Genomics 8, 12 (2014). https://doi.org/10.1186/1479-7364-8-12

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1479-7364-8-12

Keywords