Development of an integrated genome informatics, data management and workflow infrastructure: A toolbox for the study of complex disease genetics

Burren, Oliver S.; Healy, Barry C.; Lam, Alex C.; Schuilenburg, Helen; Dolman, Geoffrey E.; Everett, Vincent H.; Laneri, Davide; Nutland, Sarah; Rance, Helen E.; Payne, Felicity; Smyth, Deborah; Lowe, Chris; Barratt, Bryan J.; Twells, Rebecca C.J.; Rainbow, Daniel B.; Wicker, Linda S.; Todd, John A.; Walker, Neil M.; Smink, Luc J.

doi:10.1186/1479-7364-1-2-98

Primary research
Published: 01 January 2004

Oliver S. Burren¹,
Barry C. Healy¹,
Alex C. Lam¹,
Helen Schuilenburg¹,
Geoffrey E. Dolman¹,
Vincent H. Everett¹,
Davide Laneri¹,
Sarah Nutland¹,
Helen E. Rance¹,
Felicity Payne¹,
Deborah Smyth¹,
Chris Lowe¹,
Bryan J. Barratt¹,
Rebecca C.J. Twells¹,
Daniel B. Rainbow¹,
Linda S. Wicker¹,
John A. Todd¹,
Neil M. Walker¹ &
Luc J. Smink¹

Human Genomics volume 1, Article number: 98 (2004)

Abstract

The genetic dissection of complex disease remains a significant challenge. Sample tracking and the recording, processing and storage of high-throughput laboratory data alongside public domain data require the integration of databases, genome informatics and genetic analyses in an easily updated and scalable format. To find genes involved in multifactorial diseases such as type 1 diabetes (T1D), chromosome regions are defined based on functional candidate gene content, linkage information from humans and animal model mapping information. For each region, genomic information is extracted from Ensembl, converted and loaded into ACeDB for manual gene annotation. Homology information is examined using ACeDB tools and the gene structure verified. Manually curated genes are extracted from ACeDB and read into the feature database, which holds relevant local genomic feature data and an audit trail of laboratory investigations. Public domain information, manually curated genes, polymorphisms, primers, linkage and association analyses, with links to our genotyping database, are shown in Gbrowse. This system scales to include genetic, statistical, quality control (QC) and biological data such as expression analyses of RNA or protein, all linked from a genomics integrative display. Our system is applicable to any genetic study of complex disease, of either large or small scale.

Introduction

The availability of the genome sequences for human and mouse [1–3], and for other species, has provided one of the essential reagents for identifying the primary or causal polymorphisms contributing to the inherited risk of common multifactorial disease. The other prerequisite is substantial numbers of samples of affected individuals and controls, in the order of thousands.

The large amount of data from the Human Genome Project (HGP) has necessitated the use of comprehensive data repositories such as EMBL, GenBank and DDBJ, and specific subsets of genomic information such as the Single Nucleotide Polymorphism Database (dbSNP) and the database of Expressed Sequence Tags (dbEST) [4–6]. Increasingly, however, other information relevant to genomics and genetics has become available, such as protein domains [7, 8], Gene Ontology (GO; The Gene Ontology Consortium, 2001) and pathways (KEGG) [9]. This expansion of data provided the need and opportunity for databases which integrate genome sequence, homologies, SNPs, proteins, protein domains and annotations, and allow visualisation in a single integrated view [5, 10–13]. These tools have aided scientists in establishing the content of regions of interest with regard to genes, SNPs, homologies and any other features of the genome. Data warehousing strategies, such as EnsMart, have made answering complex biological queries possible without the need for computing skills and a large computer setup [12].

An essential prerequisite in our effort to find genes involved in type 1 diabetes (T1D) in both human and mouse has been the development of a modular informatics infrastructure based on freely available tools such as Gbrowse [14], ACeDB [15, 16] and Ensembl. All local genomic data are stored in a feature database; the genotyping data are stored in a separate genotyping database. Both are custom relational databases implemented in MySQL [17]. Local features can be visualised and integrated with public domain data using Gbrowse. All parts of our system are linked together with Perl and Bioperl [18]. This, together with the Gbrowse feature that allows web pages to be linked to genomic features, has allowed the integration of different types of genetic and genomic data using a single visualisation platform. Our solution will be of interest to any research group working on complex disease, providing flexibility and scalability from single gene-based analyses to genome-wide investigations.

Materials and methods

Databases

The barcode management system

The barcode management system (BMS) was developed on a Dell Latitude C600™ with a Pentium™ III processor and 256 MB of RAM under Microsoft Windows 2000™ (SP3). Coding and compilation were carried out using Microsoft Visual Basic (VB) 6.0™ and Microsoft Access 2000™. Piccolink (RF600) handheld radio barcode scanners and base stations were obtained from Nordic ID [19]. Cryo-viable labels and print ribbons were sourced from Partnered Print Solutions. Labels were printed on a Zebra TLP 2742 thermal barcode printer [20] using EnLabel 2.61 print software available from Image Computer Systems Ltd [21]. Further detailed information on hardware and software dependencies, along with detailed documentation and source code, is available from the BMS website [22].

Feature database

The feature database has been developed largely using Open Source components. The primary development environment is Linux™ (Red Hat™ 9.0), with a MySQL database backend (3.23.56) and Apache webserver (1.3.29). A Sun Enterprise 450™ (SunOS 5.8) is the main database and intranet webserver. All programming was done in Perl (5.6.0) using the standard libraries and Bioperl (1.0.2).

Genotyping database

The genotyping database uses the same components as the feature database, with additional graphics generated by the Perl GD::Graph modules. Web forms were generated with CGI::FormBuilder. The data loaders are written in Tcl and in Bourne and Korn shell with embedded SQL.

Freezer management system

The freezer management system (FMS) uses the same front-end components as the BMS and the same backend components as the genotyping database, all linked together through MySQL Connector/ODBC (3.51).

Annotation

ACeDB version 4.9f is run on a gene-by-gene basis to perform annotation. In short, manual curators make a local copy of an empty ACeDB database. Coordinates for the region of interest are obtained from Ensembl, and the information is extracted in ace format and loaded into the ACeDB database. The fmap display is used to verify the gene structure. Where the curators disagree with the Ensembl-predicted gene structure, new structures can be annotated based on an mRNA sequence using BLAT. The new structure is read into ACeDB for verification before extraction to the feature database.

SNP detection

PCR products from unrelated individuals are sequenced and gap4 sequence alignments produced. SNPs are detected by gap4 and the traces inspected manually to verify the SNP calls. As SNPs are verified, they are changed to the corresponding International Union of Biochemistry (IUB) codes. A Perl script is then used to scan the alignment and register the IUB characters as SNPs, producing four output files: a genotype file containing the genotypes of each individual at each SNP position; a file with the flanking sequences of the SNPs to facilitate genotype assay design; a SNP file for uploading into the database; and a file with the consensus of the sequence reads. The SNP file and the genotype file are uploaded via a web form into the database. The form also provides an interface for additional SNP information. The consensus sequence file is uploaded to the SRS server and into the feature database and Gbrowse.
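The scanning step can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the authors' script. It assumes, purely for illustration, that the alignment is available as a consensus string in which verified SNP positions carry a two-base IUB code, plus one equal-length aligned read per individual in which heterozygotes also carry the IUB code. The function name `scan_alignment`, the `flank` parameter and the output layout are hypothetical.

```python
# Two-base IUB (IUPAC) ambiguity codes and the alleles they stand for.
IUB = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def scan_alignment(consensus, reads, flank=5):
    """Register every IUB character in the consensus as a SNP.

    consensus: aligned consensus sequence with IUB codes at SNP positions
    reads:     dict of individual name -> aligned sequence (same length)
    flank:     number of flanking bases to report on each side
    """
    snps = []
    for i, code in enumerate(consensus):
        if code not in IUB:
            continue
        genotypes = {}
        for name, seq in reads.items():
            base = seq[i]
            if base in IUB:
                # Heterozygote: the read carries the ambiguity code itself.
                genotypes[name] = "/".join(IUB[base])
            elif base in "ACGT":
                genotypes[name] = f"{base}/{base}"
            else:
                # Gap, N or anything else: genotype unknown.
                genotypes[name] = "?/?"
        snps.append({
            "position": i + 1,  # 1-based position in the alignment
            "alleles": "/".join(IUB[code]),
            "left_flank": consensus[max(0, i - flank):i],
            "right_flank": consensus[i + 1:i + 1 + flank],
            "genotypes": genotypes,
        })
    return snps
```

Each returned record holds what the genotype, flanking-sequence and SNP files described above would need; writing those files and building the consensus file are left out of the sketch.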

Gbrowse

Generic Gbrowse version 1.50 and Perl version 5.8.0 were installed on a machine with two Intel® Xeon™ 2.80 GHz CPUs and 2 GB of RAM running the Red Hat 9 Linux operating system. Features of interest were obtained via the Ensembl Perl API and converted into GFF using in-house Perl scripts. GFF data describing plots for exon, repeat and SNP density and percentage GC content were based on downloaded Ensembl data and generated by Perl scripts. The GFF data were loaded into MySQL version 3.23.56 via the Gbrowse load_gff.pl and bulk_load_gff.pl scripts. The information was visualised using Apache web server version 2.0.46.
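As an illustration of how a density-style track such as percentage GC content might be turned into GFF, the following Python sketch computes GC content in fixed windows and emits nine-column, tab-separated lines. It is not the in-house Perl code. The source label `local`, the method name `gc_content`, the group value and the window size are all assumptions made for the example.

```python
def gc_percent_windows(seq, window=1000):
    """Yield (start, end, percent_gc) for consecutive non-overlapping windows.

    Coordinates are 1-based and inclusive, relative to the start of seq.
    """
    for offset in range(0, len(seq), window):
        chunk = seq[offset:offset + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        yield offset + 1, offset + len(chunk), 100.0 * gc / len(chunk)


def to_gff_lines(seqid, seq, window=1000, region_start=1):
    """Format GC windows as tab-separated GFF lines (hypothetical labels).

    region_start shifts the window coordinates onto chromosome coordinates.
    """
    lines = []
    for start, end, pct in gc_percent_windows(seq, window):
        fields = [
            seqid,                           # reference sequence
            "local",                         # source (assumed label)
            "gc_content",                    # method/type (assumed label)
            str(region_start + start - 1),   # start
            str(region_start + end - 1),     # end
            f"{pct:.1f}",                    # score: percentage GC
            ".",                             # strand not applicable
            ".",                             # phase not applicable
            f'GCContent "{seqid}"',          # group (assumed naming)
        ]
        lines.append("\t".join(fields))
    return lines
```

The score column carries the plotted value, so a density-style display can read it directly. Exon, repeat and SNP density tracks would follow the same pattern with a count per window instead of a percentage.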

Results

Strategy

The genetic strategy dataflow is shown in Figure 1 and the information dataflow is illustrated in Figure 2. All regions and/or genes targeted for genetic analysis are chosen based on linkage information, published literature, animal model data and known gene functions. For all regions [23], a chromosome-based coordinate system is used rather than a clone-based coordinate system. This limits recalculations and allows straightforward communication of regions, genes, primers and any other mapped features of interest, both internally and with collaborators. Initially, homology searches were performed locally using WU-BLAST [24], since Ensembl provides only the top matching homologies; however, performing homology searches locally for large regions became too resource intensive. Currently, all genomic information is extracted from a local installation of the Ensembl databases. For all target regions, sequence from the 5' and 3' ends of the region is stored in the feature database. This allows the regions to be remapped once a new genome build is released. All Ensembl queries can be run remotely on the server made available by Ensembl; however, a local installation gives a speed advantage and is less vulnerable to limitations of the Ensembl server, i.e. high loads from multiple large queries.
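The idea of remapping a region from its stored end sequences can be sketched as follows. This Python example is a deliberately simplified illustration, not the authors' procedure: it uses exact string matching and requires each end sequence to occur exactly once, whereas a real remapping step would normally use an alignment tool that tolerates mismatches. The function name `remap_region` and its arguments are hypothetical.

```python
def remap_region(new_build_seq, left_anchor, right_anchor):
    """Locate a region on a new build from its stored 5' and 3' end sequences.

    Returns (start, end) as 1-based inclusive coordinates, or None when
    either anchor is missing, occurs more than once, or the anchors are
    found in the wrong order.
    """
    def unique_hit(anchor):
        first = new_build_seq.find(anchor)
        if first == -1:
            return None                      # anchor not found
        if new_build_seq.find(anchor, first + 1) != -1:
            return None                      # anchor is ambiguous
        return first

    left = unique_hit(left_anchor)
    if left is None:
        return None
    right = unique_hit(right_anchor)
    if right is None or right < left:
        return None
    return left + 1, right + len(right_anchor)
```

Returning `None` rather than a best guess keeps ambiguous regions visible, so they can be checked by hand before any features are moved onto the new coordinates.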

For each chromosome region, exons of candidate genes and the 3 kb flanking sequence are resequenced on both strands in 32 or 96 unrelated individuals (usually affected individuals) from 500-600 bp polymerase chain reaction (PCR) amplicons for SNP identification. SNPs are identified in the sequences, extracted and stored in the feature database. SNPs are remapped against the current genome build and the sequence panel's genotypes collated in genomic order so that haplotype-tag SNPs (htSNPs) [25] can be chosen. These are essentially a subset of SNPs that best predict the other SNPs, given that SNPs tend to be in strong linkage disequilibrium (LD) within a gene or small region. A multistage design is optimal for large-scale genetic studies [26]. The htSNPs and other candidate SNPs (by position or from literature) are genotyped initially in about 25 per cent of the clinical samples, in our case, 4,000 individuals. This panel contains the same DNAs that were genotyped by sequencing of PCR products to crosscheck sequence-based and locus-specific genotyping results. A global test for association between the whole set of htSNPs and disease [26] is performed, and a low probability threshold (P < 0.2) is set as the criterion for additional genotyping in a further collection of cases/controls and families. Stage 1 and 2 (or even stage 3) genotyping data are then analysed together. Overall, there is little loss of power in such a design compared with genotyping all available families from the outset. It does, however, result in an overall saving of genotyping of approximately 70 per cent in approximately 90 per cent of non-associated genes, in addition to the saving made by genotyping htSNPs (Lowe et al. unpublished) [27].
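To make the notion of a subset of SNPs that predicts the others concrete, here is a small Python sketch of greedy tag selection based on pairwise r². This is a swapped-in simplification and not the htSNP method of reference [25]: it scores LD as the squared correlation between unphased genotypes coded 0, 1 or 2, and the 0.8 threshold, the function names and the input layout are assumptions chosen for illustration.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tags(genotypes, threshold=0.8):
    """Pick tag SNPs until every SNP is a tag or has r^2 >= threshold with one.

    genotypes: dict of SNP name -> list of genotype codes, one per individual
    """
    remaining = set(genotypes)
    tags = []
    while remaining:
        def n_covered(snp):
            return sum(
                r_squared(genotypes[snp], genotypes[other]) >= threshold
                for other in remaining
            )
        # Sorting makes tie-breaking deterministic (first name wins).
        best = max(sorted(remaining), key=n_covered)
        tags.append(best)
        remaining -= {
            other for other in remaining
            if other == best
            or r_squared(genotypes[best], genotypes[other]) >= threshold
        }
    return tags
```

For example, with three SNPs in complete LD and a fourth that is uncorrelated with them, the sketch returns two tags, so only two assays would need to be genotyped in the larger sample collections.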