Determining the genetic factors in a disease is crucial to elucidating its molecular basis. This task is challenging due to a lack of information on gene function. The integration of large-scale functional genomics data has proven to be an effective strategy to prioritize candidate disease genes. Mitochondrial disorders are a prevalent and heterogeneous class of diseases that are particularly amenable to this approach. Here we explain the application of integrative approaches to the identification of mitochondrial disease genes. We first examine various datasets that can be used to evaluate the involvement of each gene in mitochondrial function. The data integration methodology is then described, accompanied by examples of common implementations. Finally, we discuss how gene networks are constructed using integrative techniques and applied to candidate gene prioritization. Relevant public data resources are indicated. This report highlights the success and potential of data integration as well as its applicability to the search for mitochondrial disease genes.
A central task in elucidating the molecular basis of a genetic disease is identification of the causative defect, that is, the gene or genes whose mutations result in the disease. This knowledge is also crucial for developing effective therapeutic strategies that target the key molecular players rather than simply alleviating the symptoms. For the majority of Mendelian or suspected Mendelian diseases, however, the genetic basis remains undetermined; complex diseases driven by multiple genetic and environmental factors have proven even more challenging.
The most common approach used for disease gene identification is positional cloning to pinpoint the disease locus. In this approach, linkage analysis using polymorphic genetic markers isolates a region of the genome that segregates with the disease phenotype. Candidate genes are then selected from this region and screened for mutations in a patient population. This approach has helped to identify the causative defect of approximately 2400 genetic diseases (Online Mendelian Inheritance in Man (OMIM) [1, 4]). However, its success is limited by the genetic complexity of most diseases, which exacerbates issues such as inadequate sample sizes, a lack of informative meiotic crossover events, genetic heterogeneity, misdiagnosis, epistasis, and incomplete penetrance [2, 5–7]. Consequently, positional cloning frequently fails to identify a disease locus. When a locus is identified, it often contains more candidate genes than one laboratory can feasibly screen. Theoretically, these candidates can be filtered according to their biological annotation. In practice, however, annotation is usually insufficient, and therefore informed prioritization requires a more complete understanding of gene function.
High-throughput technologies generate extensive data on gene function: these technologies include genome sequencing, gene expression arrays, protein-protein interaction screens, RNA interference, mass spectrometry, and metabolite profiling. An advantage of these technologies is that all measurements in a dataset are made under uniform conditions, enabling quantitative comparison. Furthermore, data is generated on uncharacterized genes, providing indications of their function. Orthology allows this data to be transferred across species, and text mining consolidates information from decades of single-gene studies. Each genome-scale approach is biased towards different functional subsets of genes and prone to certain errors; consequently, overlap among different datatypes is often limited. In order to effectively predict gene function, a combined analysis of these datasets is required to capitalize on their strengths and compensate for their limitations. Data integration techniques are efficient at extracting information from multiple datasets, and are thus an invaluable tool for candidate gene prioritization [8, 10, 11].
Integrative approaches have been particularly useful in the functional characterization of mitochondria. In addition to producing the majority of cellular ATP through respiration, this organelle plays a central role in metabolism, ion storage, oxidative stress management, signal transduction, antiviral response, and apoptosis [12, 13]. Due to the diverse and fundamental nature of these processes, mitochondrial dysfunction can impair multiple organ systems. Mitochondrial diseases primarily affect tissues with high energy requirements (e.g., central nervous system, muscle, liver) [13, 15], and include myopathies, dystrophies, and neurodegenerative disorders. (Mitochondrial disease descriptions can be found on the United Mitochondrial Disease Foundation website www.umdf.org as well as in the OMIM database.) The frequency of mitochondrial diseases is significantly higher than expected based on the estimated number of protein components. Mitochondrial diseases follow maternal, Mendelian, or complex inheritance, since some proteins (13 in human) are encoded by the mitochondrial genome, while the vast majority are nuclear-encoded [14, 16]. For all of these reasons, it is believed that many diseases with an unknown molecular basis are mitochondrial [12, 17].
The identification of mitochondrial disease genes is limited by the insufficient characterization of the mitochondrial proteome. To date, only half of the 1500 proteins expected to localize within human mitochondria have been identified [16, 18]. Since mitochondrial and cellular functions are tightly integrated, a full characterization of mitochondria requires complementing this set with extraorganellar proteins involved in, for example, mitochondrial transcriptional regulation, biogenesis, metabolic branches, and signalling. Systematic approaches have been directed at identifying these components (i.e., predicting “parts lists”) [18–20], generating interaction networks, and developing mathematical models. Several catalogs of mitochondrial genes have been predicted (Table 1): the most comprehensive of these is the MitoP2 database, generated using data integration techniques.
By predicting genes involved in mitochondrial function, data integration techniques have proven successful at prioritizing candidate mitochondrial disease genes. For example, a screen of multiple parameters in the Saccharomyces cerevisiae deletion collection identified 466 genes whose deletion impairs respiration. The high conservation of yeast and human mitochondria allowed these genes to be mapped to their human counterparts and prioritized as mitochondrial disease candidates. Another study on Leigh syndrome (a cytochrome c oxidase deficiency) integrated RNA and protein expression data to select LRPPRC as the top candidate in the disease locus. Mutations discovered in this gene confirmed its causative role in the disease. A later integration of eight datasets established that mutations in MPV17 cause an infantile mitochondrial DNA depletion disorder, despite the gene’s previous peroxisomal annotation [19, 28]. Such success stories illustrate the power of integrative genomics.
In this report, we explain how data integration approaches are used to prioritize candidate genes according to mitochondrial function. We describe various datasets that can be used to determine mitochondrial function, compare data integration strategies, and present applications of computational network models.
A typical data integration procedure aimed at prioritizing candidate mitochondrial disease genes can be divided into three major steps (Figure 1). The first step consists of collecting multiple datasets on mitochondrial function (previous efforts have used up to 25 [16, 19, 21]), including a reference set containing known mitochondrial genes. In the second step, a discrimination analysis method is trained on the reference set to classify genes as mitochondrial or not; it is then used to optimally integrate the input datasets into a score reflecting the probability that each gene in the genome is mitochondrial. This score is used in the third step to prioritize candidate genes from a disease locus.
Here, we highlight several types of data that can be used in an integrative approach to predict proteins physically residing in mitochondria and genes functionally related to the organelle. Both classes of genes must be considered for the study of mitochondrial disease, because of the interdependence of mitochondrial and cellular processes. Selecting complementary datatypes will maximize the information captured by the integration. While the accuracy of input datasets may vary, the discrimination analysis algorithm will compensate for these variations if high-quality positive and negative reference sets are supplied. Each dataset can be evaluated by estimating its sensitivity and specificity; this is usually done by selecting a threshold and calculating the sensitivity as the fraction of positive reference proteins captured, and the specificity as the fraction of negative reference proteins excluded. Ranges of sensitivity and specificity calculated in the construction of the human MitoP2 database are indicated for applicable datasets (otherwise, MitoP2-Yeast calculations are shown).
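The threshold-based evaluation described above can be illustrated with a minimal sketch. The gene names, scores, and the function name `sensitivity_specificity` are hypothetical and chosen only for illustration; the definitions follow the ones given in the text (fraction of positive reference genes captured, fraction of negative reference genes excluded).

```python
def sensitivity_specificity(scores, positives, negatives, threshold):
    """Evaluate one dataset against reference sets at a given threshold.

    scores:    dict mapping gene -> value in the dataset (hypothetical data)
    positives: set of known mitochondrial genes
    negatives: set of genes assumed to be non-mitochondrial
    """
    # Genes the dataset "calls" mitochondrial at this threshold.
    called = {gene for gene, score in scores.items() if score >= threshold}
    sensitivity = len(called & positives) / len(positives)
    specificity = len(negatives - called) / len(negatives)
    return sensitivity, specificity


# Toy example: six reference genes with made-up scores.
scores = {"g1": 0.9, "g2": 0.7, "g3": 0.2, "g4": 0.8, "g5": 0.1, "g6": 0.3}
positives = {"g1", "g2", "g3"}
negatives = {"g4", "g5", "g6"}
sens, spec = sensitivity_specificity(scores, positives, negatives, threshold=0.5)
```

In this toy case two of three positives are captured and two of three negatives are excluded, so both values are 2/3.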
For the purpose of mitochondrial gene prediction, it is advisable to build the reference set from genes with definitive mitochondrial function based on single-gene studies. A good example is the manually-curated reference set of 870 mitochondria-localized human proteins used to construct the MitoP2 database (Table 1). In addition, discrimination analysis methods use a negative reference set, which in this case contains non-mitochondrial proteins. This set can be built from proteins that localize to other cellular compartments or have annotation that does not include mitochondrial function. However, potential drawbacks of these options are dual localization and bias of the integration against well-annotated proteins. Another option is to use the implicit negative reference set, that is, proteins not in the positive reference set. This set is enriched for non-mitochondrial proteins (at least 90% expected) and is large enough to prevent bias. The drawback in this case is the underestimation of specificity.
The human mitochondrion is a complex organelle with an estimated 1500 proteins that are highly conserved in sequence and localization. Due to experimental advantages, mitochondria of many model organisms are more thoroughly characterized. Sequence similarity to known mitochondrial proteins in other organisms is therefore an indicator of mitochondrial function in human. Indeed, relatively high ranges of sensitivity (40–48%) and specificity (35–47%) were obtained for reciprocal best BLAST hits with yeast and Neurospora mitochondrial proteins in the construction of the human MitoP2 database. This approach only detects human mitochondrial proteins conserved in other organisms; for example, the yeast mitochondrial proteome is about half the size of the human one. In addition to inferring mitochondrial function through sequence similarity, orthology permits the transfer of genome-wide datasets relevant to mitochondrial function from model organisms to human.
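The reciprocal best hit criterion mentioned above can be sketched as follows. This is an illustration only: it assumes that similarity search results have already been reduced to (query, subject, score) tuples, and the protein identifiers and scores are invented. It does not reproduce the exact procedure or cutoffs used for MitoP2.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, score) tuples, e.g. parsed from
    tabular similarity-search output (the parsing step is assumed).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical human-vs-yeast and yeast-vs-human search results.
human_vs_yeast = [("H1", "Y1", 200), ("H1", "Y2", 150),
                  ("H2", "Y2", 90), ("H3", "Y1", 120)]
yeast_vs_human = [("Y1", "H1", 210), ("Y2", "H1", 140), ("Y2", "H2", 100)]
pairs = reciprocal_best_hits(human_vs_yeast, yeast_vs_human)
```

Here only H1 and Y1 select each other; H2 and H3 each have a best hit whose own best hit is H1, so they are not reported.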
Transcription profiles in relevant conditions can indicate mitochondrial function. For example, yeast can grow by fermentation or by respiration, the latter requiring active mitochondria for ATP production. Genes are potentially mitochondrial if they show differential expression between the two conditions, in response to the diauxic shift from fermentation to respiration, or in response to overexpression of Hap4, a transcription factor of nuclear-encoded mitochondrial genes. In mouse and human, genes upregulated in response to overexpression of PGC-1α, one of several activators of mitochondrial proliferation, may play a role in mitochondrial biogenesis. Mitochondrial genes can also be predicted based on coexpression (similarity of expression profiles to those of known mitochondrial genes), which suggests coregulation. The specificity of transcription profiling is limited by the potential coregulation of mitochondrial and unrelated processes (15–43% in MitoP2-Yeast); its sensitivity is limited due to posttranscriptional and posttranslational regulation (19–43%). Gene expression profiles for conditions relevant to mitochondria can be accessed via the Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo), ArrayExpress (www.ebi.ac.uk/arrayexpress), and the Stanford Microarray Database (genome-www5.stanford.edu).
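One simple way to turn coexpression into a per-gene value is to average the Pearson correlation between a gene's expression profile and the profiles of known mitochondrial genes. The sketch below assumes this particular measure, which is one of several possible choices; the profiles and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_score(profile, reference_profiles):
    """Mean correlation of one gene's profile with known mitochondrial genes."""
    total = sum(pearson(profile, ref) for ref in reference_profiles)
    return total / len(reference_profiles)


# Made-up profiles across four conditions (e.g. a fermentation-to-respiration shift).
reference_profiles = [[1, 2, 3, 4], [2, 4, 6, 8]]
score_a = coexpression_score([1, 3, 5, 7], reference_profiles)  # rises with the reference genes
score_b = coexpression_score([4, 3, 2, 1], reference_profiles)  # falls as they rise
```

The first candidate tracks the reference genes perfectly and the second is perfectly anticorrelated, so the scores sit at the two extremes of the correlation range.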
Subcellular localization in yeast has been globally assessed using strain collections expressing GFP fusion proteins and epitope-tagged proteins. In mammalian cells, a new tagging technique has been developed whereby full-length GFP is synthesized only if the encoded protein is imported into mitochondria, eliminating the need to examine mitochondria themselves for fluorescence. These approaches will only accurately classify proteins that physically reside in the organelle and whose localization is unaffected by the tag (specificity 26%, sensitivity 80%).
Mitochondrial function can be inferred by phenotypes in the absence of the gene. The yeast deletion collection has been used to screen for mitochondrial genes by identifying mutants with fitness defects in respiratory growth conditions [24, 35]. Data generated with the yeast deletion collection is available from the Yeast Deletion Project and Proteomics of Mitochondria Database (www-deletion.stanford.edu/YDPM). In higher organisms, gene deletion is more difficult, and a common alternative is gene silencing using RNA interference (RNAi). Mitochondrial proteins have been identified by screening for knockdown phenotypes such as altered citrate synthase activity, mitochondrial morphology, oxygen consumption, or metabolism. Genes lacking a knockout or knockdown phenotype because they are redundant or essential (the latter being the case for the mitochondrial outer membrane transporters, for example) will not be detected by these analyses (specificity 50%, sensitivity 37–45% in MitoP2-Yeast).
Mitochondrial proteins encoded by nuclear genes must be imported into the organelle. This process is believed to be triggered by N-terminal signal peptides that are recognized by the import machinery and cleaved once the protein reaches its destination. Several algorithms have thus been developed to predict signal peptides based on the patterns observed in the N-terminus of mitochondrial proteins. They have achieved modest success, in part because the signature amphipathic helix is primarily a matrix-localization signal and targeting sequences for other mitochondrial compartments are not well-defined. In addition, signal sequences do not follow straightforward patterns and likely involve more of the protein than the N-terminus, especially in higher eukaryotes [40, 41]. In some cases, proteins are localized by binding to another protein (a ‘hitchhiking’ mechanism) or based on signals in their mRNA sequences. Examples of signal peptide prediction algorithms include TargetP (www.cbs.dtu.dk/services/TargetP/), SubLoc (www.bioinfo.tsinghua.edu.cn/SubLoc/), PSORT [44, 45] (psort.nibb.ac.jp), Predotar (urgi.versailles.inra.fr/predotar), and pTARGET (http://bioapps.rit.albany.edu/pTARGET/) (specificity 5–15%, sensitivity 44–61%).
Many mRNAs encoding mitochondrial proteins have been shown to associate with mitochondrion-bound polysomes in yeast and human. Once transported by these polysomes, these transcripts are cotranslationally imported into mitochondria. By isolating mitopolysomes and hybridizing the associated sequences to a microarray, the identity and abundance of the genes involved have been investigated in yeast (specificity 23%, sensitivity 13% in MitoP2-Yeast).
Protein identification by mass spectrometry is a direct technique to define the mitochondrial proteome. This is typically done by purifying mitochondria and identifying protein components using mass spectrometry [49–51]. Mass spectrometry-based methods have been used for comparative proteomics studies under several conditions in yeast and in different mouse tissues. These techniques, however, exhibit bias against proteins of low abundance (due to limitations of the dynamic range of detection) and membrane proteins [51, 53]. Moreover, the mitochondrial purification method chosen can have a considerable effect on the proteins identified or result in contamination with cytoskeletal proteins (specificity 37–83%, sensitivity 10–38%).
Mitochondrial function and localization are often inferred from interaction with known mitochondrial proteins. Evidence for protein-protein interactions can be derived from large-scale experiments, literature, functional associations, and orthology; recent reviews, however, indicate that much of this data may be spurious or biologically irrelevant [54–57]. Interaction data and resources are discussed further in the Functional Gene Networks section (specificity 26–62%, sensitivity 22–38%).
Since the mitochondrion is derived from an ancient α-proteobacterium, several of its core components are homologous to α-proteobacterial proteins. Sequence similarity to proteins in the α-proteobacterium Rickettsia prowazekii, the closest living relative of the mitochondrial ancestor, suggests mitochondrial function. Past studies have classified a protein as mitochondrial if it has a high-scoring reciprocal best BLAST hit with an R. prowazekii protein (specificity 14%, sensitivity 30%).
Scientific literature is rich in single-gene studies that offer evidence for mitochondrial function. The sheer volume of the literature makes manual extraction of this information unfeasible, necessitating the use of computational text mining algorithms. The majority of English sentences, however, cannot yet be systematically parsed and understood by these algorithms. Nevertheless, text mining has been used to infer interactions [55, 56], functional annotations, and subcellular localization (SherLoc, www-bs.informatik.uni-tuebingen.de/Services/SherLoc/).
The goal of data integration here is to assess each gene’s involvement in mitochondrial function. The discrimination analysis method considers each gene as a collection of its values in each dataset. The algorithm ‘learns’ how to use these values by fitting a mathematical model of the classification using reference set genes. It then calculates a final score indicating how to classify each gene in the genome. To avoid overfitting (that is, to ensure the method will perform well not only on the reference set but on unseen data), one typically trains the method on a subset of the reference set (training set) and evaluates its performance on the remainder of the reference set (test set). Discrimination analysis algorithms vary by the type of model used and the criteria for assessing the quality of the generated model. We describe here three types of algorithms that are standard for data integration problems and have successfully been applied to the prediction of mitochondrial function.
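The training/test partition described above can be sketched as a simple random hold-out split. The 25% test fraction, the fixed seed, and the gene names are assumptions made for the example; in practice the split is often repeated (cross-validation) rather than done once.

```python
import random


def split_reference(genes, test_fraction=0.25, seed=0):
    """Randomly partition reference genes into (training set, test set).

    The seed makes the split reproducible; test_fraction is an arbitrary
    illustrative choice, not a recommendation from the source.
    """
    rng = random.Random(seed)
    shuffled = sorted(genes)  # sort first so the result depends only on the seed
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return set(shuffled[n_test:]), set(shuffled[:n_test])


# Hypothetical positive reference set of eight genes.
reference = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
training_set, test_set = split_reference(reference)
```

The model is then fitted on `training_set` only, and its sensitivity and specificity are estimated on `test_set`, whose genes it has never seen.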
Naïve Bayes is one of the most straightforward discrimination analysis algorithms. An odds ratio is calculated for each gene in each dataset as the enrichment for positive versus negative reference genes with the same values as that of the gene (quantitative data must be discretized); the product of a gene’s odds ratios across datasets is its score, which is used to classify it as mitochondrial or not. This score equals the odds ratio of the gene being mitochondrial, assuming independence of the datasets given the gene classification. Although the independence assumption is not always fulfilled (for example, even among mitochondrial proteins, protein abundance is at least partially dependent on transcription levels), naïve Bayes can still yield good results in practice. For example, it was successfully employed to identify MPV17 as a mitochondrial disease gene using eight datasets.
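The naïve Bayes scoring scheme can be sketched in a few lines. The version below is a simplified illustration, not the implementation used in any of the cited studies: it computes, for each discretized value, the ratio of its frequency among positive versus negative reference genes, and sums the logarithms of these ratios across datasets (equivalent to taking their product). The pseudocount, which avoids division by zero, and all gene names and values are assumptions of the example. Multiplying the resulting ratio by the prior odds would give posterior odds.

```python
from math import log


def likelihood_ratios(values, positives, negatives, pseudocount=1.0):
    """For one discretized dataset, return {level: P(level|pos) / P(level|neg)}."""
    levels = sorted(set(values.values()))
    pos_genes = [g for g in positives if g in values]
    neg_genes = [g for g in negatives if g in values]
    ratios = {}
    for level in levels:
        pos = sum(1 for g in pos_genes if values[g] == level)
        neg = sum(1 for g in neg_genes if values[g] == level)
        # The pseudocount (an assumption here) keeps both frequencies nonzero.
        p_pos = (pos + pseudocount) / (len(pos_genes) + pseudocount * len(levels))
        p_neg = (neg + pseudocount) / (len(neg_genes) + pseudocount * len(levels))
        ratios[level] = p_pos / p_neg
    return ratios


def naive_bayes_scores(datasets, positives, negatives):
    """Sum log ratios across datasets, treating datasets as independent."""
    genes = set().union(*(d.keys() for d in datasets))
    scores = dict.fromkeys(genes, 0.0)
    for values in datasets:
        ratios = likelihood_ratios(values, positives, negatives)
        for gene in genes:
            if gene in values:  # a missing value contributes no evidence
                scores[gene] += log(ratios[values[gene]])
    return scores


# Two hypothetical binary datasets; c1 and c2 are unclassified candidates.
localization = {"g1": 1, "g2": 1, "g3": 0, "g4": 0, "g5": 0, "g6": 1, "c1": 1, "c2": 0}
expression = {"g1": 1, "g2": 0, "g3": 1, "g4": 0, "g5": 1, "g6": 0, "c1": 1, "c2": 0}
positives = {"g1", "g2", "g3"}
negatives = {"g4", "g5", "g6"}
scores = naive_bayes_scores([localization, expression], positives, negatives)
```

A positive log score means the combined evidence favours mitochondrial function; here c1 (positive in both datasets) scores above zero and c2 below.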
Linear predictor methods, such as linear discriminant analysis or logistic regression, can be viewed as a generalization of naïve Bayes. These methods assign a weight to each dataset, and the score of a gene depends on the weighted sum of its dataset values. The advantage over naïve Bayes is that weights of linear predictors are fitted to optimal values using the training set. An advantage they share with naïve Bayes is that the dataset weights indicate the relative contribution of the datasets to the prediction. A disadvantage is that both methods can suffer from overfitting on high-dimensional data compared to SVMs (discussed next). A recent implementation of linear prediction expanded the catalog of yeast mitochondrial proteins and their interactions, many of which were experimentally verified.
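As an illustration of a linear predictor, the sketch below fits a logistic regression by plain batch gradient descent. The learning rate, epoch count, and toy data are assumptions chosen so the example converges; a real analysis would normally rely on an established statistics library. The fitted weights show the interpretability noted above: an informative dataset receives a large weight and an uninformative one a weight near zero.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(rows, labels, rate=0.5, epochs=2000):
    """Fit per-dataset weights and a bias by batch gradient descent.

    rows:   one list of dataset values per training gene
    labels: 1 for positive reference genes, 0 for negative ones
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        bias -= rate * grad_b / len(rows)
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def score(x, weights, bias):
    """Estimated probability that a gene with dataset values x is positive."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy training set: dataset 0 is informative (with some errors), dataset 1 is not.
rows = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1], [1, 0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = fit_logistic(rows, labels)
```

With these data, a gene positive in dataset 0 is scored near 0.75 and a gene negative in it near 0.25, regardless of its value in dataset 1.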
A more recently developed discrimination analysis algorithm is the support vector machine (SVM). SVMs place each gene in multidimensional space according to its values in each dataset and compute the plane that maximally separates positive from negative genes. Thus when fitting the mathematical model in SVMs, the boundary datapoints (points that are closest to each other but belong to different groups) are most important, whereas the other methods consider all datapoints. Both from a theoretical point of view and in practice, this approach performs very well on high-dimensional datasets compared to traditional methods. SVMs are versatile in terms of choosing the mathematical functions used to generate the model. However, SVM fitting requires some parameter tuning, which is a manual and time-consuming process. Moreover, the predictor is not as easily interpretable as naïve Bayes or linear predictors. The MitoP2 database (Table 1), which is currently the most comprehensive resource available for mitochondrial proteins, provides SVM-based prediction and has facilitated the identification of mitochondrial disease genes.
Performance of the discrimination analysis method can be evaluated by selecting a threshold and estimating the sensitivity (the fraction of reference proteins scoring above the threshold) and specificity (the fraction of negative reference proteins scoring below the threshold) of the results. Figure 2 depicts the performance in terms of sensitivity and specificity (for all possible thresholds) of multiple yeast mitochondrial gene predictions, including a heuristic method (described in Prokisch et al.), a linear predictor, and an SVM-based method, relative to 24 individual datasets. At a given sensitivity, a higher specificity is preferable and vice versa. Hence, methods with curves further toward the right and the top are the most effective. This illustrates that while output varies among the integration methods, all of them clearly outperform the individual datasets. Data integration approaches to identify mitochondrial parts-lists usually select a threshold score (based on sensitivity and specificity) above which genes are deemed mitochondrial. For the purpose of prioritizing candidate disease genes, however, genes can simply be ranked for mutation screening according to their score.
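The two uses of the integrated score described above, tracing a performance curve over all thresholds and ranking the genes of a disease locus, can be sketched as follows. The scores, gene names, and function names are hypothetical.

```python
def performance_curve(scores, positives, negatives):
    """Return (threshold, sensitivity, specificity) for every distinct score."""
    points = []
    for threshold in sorted(set(scores.values()), reverse=True):
        called = {g for g, s in scores.items() if s >= threshold}
        sensitivity = len(called & positives) / len(positives)
        specificity = len(negatives - called) / len(negatives)
        points.append((threshold, sensitivity, specificity))
    return points


def rank_candidates(scores, locus_genes):
    """Order the scored genes of a disease locus from highest to lowest score."""
    scored = [g for g in locus_genes if g in scores]
    return sorted(scored, key=lambda g: scores[g], reverse=True)


# Made-up integrated scores: p* positive reference, n* negative, c* candidates.
scores = {"p1": 0.9, "p2": 0.6, "n1": 0.7, "n2": 0.2, "c1": 0.8, "c2": 0.3}
curve = performance_curve(scores, {"p1", "p2"}, {"n1", "n2"})
ranking = rank_candidates(scores, ["c2", "c1"])
```

Lowering the threshold moves along the curve from high specificity to high sensitivity, while the ranking needs no threshold at all: c1 would simply be screened for mutations before c2.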
Genome-wide datasets have provided insight into the molecular networks that govern biological processes, such as protein-protein interaction, metabolic, and regulatory networks. Functional gene networks are abstractions of these networks, where gene nodes are connected if they share a functional relationship. They are used to infer gene function based on the guilt-by-association principle: if two genes are network neighbours, they presumably contribute to the same biological process [55, 62–65]. By placing genes in a functional context, these networks offer an additional means of candidate disease gene prioritization.
Functional gene networks are constructed by integrating a collection of datasets on gene-gene interactions using techniques described in the previous section. Evidence for interactions between genes can be derived from many sources: (a) their protein products physically interact [57, 66]; (b) they exhibit genetic interactions indicated by, for example, a synthetic lethal phenotype; (c) their loss-of-function phenotypes are similar; or (d) they share another functional relationship [55, 64]. Many of the examples in the Datasets section are therefore also applicable to network construction. Several public network databases are detailed in Table 2. One example designed for disease gene identification is the tool Prioritizer, based on a human functional gene network generated by Bayesian integration.
Candidate disease genes can be prioritized according to the network proximity of disease genes associated with similar symptoms (Figure 3A). When grouping genes into modules according to their function, it has been observed that mutations in genes of a particular module often result in the same phenotype [63, 73]. Genes sharing functional modules with the approximately 150 mitochondrial disease genes identified to date are therefore valid candidates for diseases with mitochondrial symptoms. Even in a yeast mitochondrial interaction network, orthologs of related disease genes were often found in the same functional module. In addition, a recent analysis of human mitochondrial metabolism has identified several ‘co-sets’ consisting of groups of genes whose mutations (in particular, SNPs) lead to similar metabolic consequences.
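A very simple form of this guilt-by-association prioritization scores each candidate by the fraction of its direct network neighbours that are known disease genes. The sketch below assumes an unweighted, undirected network given as a list of edges; the measure and all gene names are illustrative choices rather than a published method.

```python
def neighbor_scores(edges, disease_genes, candidates):
    """Fraction of each candidate's direct neighbours that are disease genes."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for gene in candidates:
        linked = neighbors.get(gene, set())
        # Genes absent from the network get a score of zero.
        scores[gene] = len(linked & disease_genes) / len(linked) if linked else 0.0
    return scores


# Hypothetical network: D* known disease genes, C* candidates, X* other genes.
edges = [("D1", "C1"), ("D2", "C1"), ("C1", "X1"),
         ("C2", "X1"), ("C2", "D1"), ("C3", "X2")]
scores = neighbor_scores(edges, {"D1", "D2"}, ["C1", "C2", "C3"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

C1, which neighbours both disease genes, ranks first; C3, which has no disease-gene neighbour, ranks last.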
When studying multigenic diseases, networks can help to prioritize groups of candidate genes. In the event that several disease loci result from linkage analysis, the most probable combinations of genes from the loci can be predicted based on their relative positions in the network (Figure 3B). This is based on the observation that genes causing the same disease are separated by a significantly shorter-than-average network distance, which is the foundation of Prioritizer (Table 2). Therefore, if one gene has already been implicated in a multigenic disorder, additional candidates can be selected based on their network proximity to that gene. Without the advantages afforded by networks, testing all the combinations that result even from a few disease loci would be unfeasible.
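The idea of ranking gene combinations from two loci by network distance can be sketched with a breadth-first search on an unweighted network. This is only an illustration of the principle and does not reproduce the Prioritizer algorithm; the network, locus contents, and function names are hypothetical.

```python
from collections import deque
from itertools import product


def shortest_distance(neighbors, source, target):
    """Number of edges on the shortest path, or infinity if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def rank_locus_pairs(edges, locus_a, locus_b):
    """Sort all cross-locus gene pairs by increasing network distance."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return sorted((shortest_distance(neighbors, a, b), a, b)
                  for a, b in product(locus_a, locus_b))


# Hypothetical network and two disease loci with two and three genes.
edges = [("A1", "B2"), ("A2", "M"), ("M", "B1"), ("B1", "B3")]
ranked_pairs = rank_locus_pairs(edges, ["A1", "A2"], ["B1", "B2", "B3"])
```

The directly connected pair (A1, B2) comes first, followed by (A2, B1) at distance two; pairs with no connecting path are placed last.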
Data integration techniques enable informed prioritization of candidate disease genes. Their power lies in combining the strengths of various high-throughput datasets and compensating for their individual limitations. Functional gene networks extend this potential by consolidating gene-gene relationships in the context of biological processes. Candidate disease genes can thus be prioritized by network proximity to predicted or known disease genes associated with the same or clinically related diseases.
As existing technologies for both high-throughput and more focused studies progress, and new ones emerge, the effectiveness of data integration methods and network models will only improve. For example, differential proteomics generates thousands of precise measurements of protein levels and thus complements transcription as a measure of gene activity and regulation under multiple conditions; tools for analyzing this data are rapidly becoming more accurate [51, 74]. The quantification of small-molecule metabolites (also known as metabolite profiling) is contributing to the construction of network models and a comprehensive description of mitochondrial and cellular physiology [76–78]. Additional mitochondria-specific phenotyping assays, including quantitative morphology and membrane potential measurements, enable more refined characterization of mitochondrial function. In lab-on-a-chip technologies, multiple types of measurements can be generated by a single experiment in cell-like environments, enabled by micro- and nanofluidics.
By clarifying the dynamic function of genes, proteins, and metabolites, the application of more quantitative technologies is instrumental in refining models of biological networks. Quantitative models can already predict the behaviour of mitochondrial metabolism under various conditions. Such models allow simulation of the effects of perturbations (e.g., inactivating a gene) on biological systems; thus, analyzing experimental data from a diseased system and inferring the causative defect will eventually be possible in silico. For example, a model of mitochondrial metabolism has been used to identify the deficient respiratory chain complex in Leigh syndrome fibroblasts via metabolite profiling. Another study has pinpointed subsets of gene networks showing variation in gene expression associated with quantitative trait loci (QTLs) for obesity; as a result, three novel obesity genes have been identified.
In addition to identifying disease genes, network models can be used to study disease mechanisms by elucidating molecular roles in maintaining health. In particular, alterations caused by disease in portions of the underlying networks can be investigated. Therapeutic strategies can then be targeted at causative defects, their most consequential downstream effects, or affected subnetworks [83, 84]. As models are developed to encapsulate healthy and diseased states, they must be extended to capture the effects of genetic variation, including the thousands of polymorphisms yielding small effects or contributing to individual disease risk. High-throughput genotyping and genome-wide association studies are making progress towards understanding these relationships, and as these advances are mirrored on the clinical side with more comprehensive diagnostic tools, the gap between genotype and phenotype will be reduced. Eventually, personalized therapeutic strategies will be developed that cater to individual genetic makeup, environment, diet, and lifestyle. It is certain that these developments will not be realized through any one technology, and thus data integration makes them attainable by drawing on all the strengths molecular biology has to offer.
The authors would like to acknowledge Amoolya Singh, Michael DiBernardo, Christine Panagiotidis, Stéphanie Blandin, Eugenio Mancera, Fabiana Perocchi, Himanshu Sinha, and Sebastian Kühner for their help in preparing this manuscript.