High-throughput genome sequencing techniques have now reached vector biology, with an emphasis on those species that are vectors of human pathogens. The first mosquito to be sequenced was Anopheles gambiae, the vector for Plasmodium parasites that cause malaria. Further mosquitoes have followed: Aedes aegypti (Yellow fever and Dengue fever vector) and Culex pipiens (lymphatic filariasis and West Nile fever). Species currently being sequenced include the body louse Pediculus humanus (Typhus vector), the triatomine Rhodnius prolixus (Chagas disease vector) and the tick Ixodes scapularis (Lyme disease vector). The motivations for sequencing vector genomes are to further understand vector biology, with an eye on developing new control strategies (for example novel chemical attractants or repellents) or understanding the limitations of current strategies (for example the mechanism of insecticide resistance); to analyse the mechanisms driving their evolution; and to perform an exhaustive analysis of the gene repertoire. The proliferation of genomic data creates the need for efficient and accessible storage. We present VectorBase, a genomic resource centre that is both involved in the annotation of vector genomes and acts as a portal for access to the genomic information (http://www.vectorbase.org).
The burden of infectious diseases on the world remains a major challenge to medical science. Understanding the complex interactions between vector, pathogen and host is necessary for the comprehension of these diseases but has proved especially difficult (Aksoy et al., 2002; Aksoy, 2003; Xu et al., 2005).
For many decades, observation and biological experiments were the source of all the data and advances in the field of infectious diseases. The last few years have seen the start of a new era in this domain: the generation and analysis of genomic data derived from the sequencing of organisms. Genomic data can be generated relatively quickly and present a broader view, opening the way for a range of genome-wide studies (e.g. expression microarrays, RNAi knock-down studies) as well as giving a boost to hypothesis-driven experimentation for these species. Considerations of size and perceived simplicity meant that pathogens were at the forefront of the genomics era, with viral and bacterial genomes being among the first to be sequenced. The speed and accuracy of modern sequencing technologies have yielded essentially complete genome sequences for many species in a relatively fast and cost-effective manner, quickly adding the human genome to the list of sequenced organisms. The vector genomes came later, with the publication of the genome of the malaria mosquito Anopheles gambiae (Holt et al., 2002). The sequence of a genome is in itself of limited use without the associated annotation that attempts to describe the location and function of genes, as well as the control elements active in the genome. The annotation process is often based on a similarity approach that is, in turn, reliant on information from previous annotations.
This review focuses on the genomic resources for invertebrate vectors of human pathogens. We will discuss the motivations for sequencing the genomes of these species as opposed to other strategies for producing data, such as Expressed Sequence Tag (EST) sequencing. We will then present the current state of genomics resources for vector species and finally introduce VectorBase, a resource centre which organises and stores genomic data for presentation via the World Wide Web (WWW).
The most obvious reason for sequencing the genome of a vector is to improve our understanding of the organism’s biology with a view to designing new control measures, exploiting its pharmaceutical potential or developing new molecular tools for genetic manipulation. The ability to screen on a genome-wide basis is a powerful driver for genome sequencing as a cost-effective method of understanding individual genes. A good example would be the genome-wide studies that investigate the temporal, spatial and conditional expression of all genes in a single experiment. Knowledge of a genome also facilitates the quantification of polymorphisms within a species as well as comparative investigations into the complex mechanisms driving insect evolution.
A bottleneck in progress toward controlling invertebrate vectors of human pathogens is the lack of knowledge of their basic molecular biology. To better control these insects, we would like to understand at a molecular level their feeding habits, their mating behavior and mode of reproduction, their choice of habitat and, more than anything else, their relationship with the host and the pathogens. Connecting this with the genome information can help to understand these processes at a molecular level. For example, the sequencing of the A. gambiae genome has proven beneficial in identifying molecular mechanisms responsible for host seeking and other odor-mediated behaviors (Biessmann et al., 2005). A better understanding of vector population structure is essential when planning intervention strategies (Cuamba et al., 2006). Access to the genome sequence can be useful to facilitate the identification of new DNA markers such as micro-satellites and single nucleotide polymorphisms (SNPs) that make the identification of groups within the population more sensitive and cost effective.
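As an illustration of how a genome sequence can be screened for candidate micro-satellite markers, the following is a minimal sketch rather than a method used by any of the studies cited here. The function name `find_microsatellites` and its thresholds (repeat units of 2 to 6 bases, at least four tandem copies) are assumptions chosen for the example; real marker discovery would also assess flanking sequence and polymorphism within the population.

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for simple tandem repeats in seq.

    Hypothetical helper for illustration only; thresholds are arbitrary.
    """
    seq = seq.upper()
    # A unit of min_unit..max_unit bases (shortest tried first), followed by
    # at least (min_repeats - 1) further exact copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Toy sequence containing five tandem copies of the dinucleotide "AC".
print(find_microsatellites("TTGACACACACACAGGT"))  # [(3, 'AC', 5)]
```

Positions are 0-based offsets into the input string, and only perfect repeats are reported.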
The control of vectors is a key weapon in the fight against infectious diseases. A better understanding of the biology of the vector can lead to a faster identification of new targets and ultimately new control measures. For example, David et al. (2005) used the A. gambiae annotations and expression profiling using micro-arrays to identify genes involved in insecticide resistance. A potential list of 230 genes was reduced to just five that were highly regulated in insecticide resistant mosquito strains.
The identification of new transposons and repeated sequences facilitates the development of tools for the genetic manipulation of vectors. The availability of the genomic sequence helps to identify these elements faster and on a larger scale.
High efficiency transformation is possible for many mosquito species. Handler (2002) describes how the piggyBac transposon allows germ-line transformation of insects, including the yellow fever and malaria vectors Aedes aegypti (Lobo et al., 2002), A. gambiae (Grossman et al., 2001), Anopheles fluviatilis (Rodrigues et al., 2006), Anopheles stephensi (Nolan et al., 2002) and Anopheles albimanus (Perera et al., 2002). The same transposon was employed by Adelman (2004) to transform somatic cells in A. aegypti. The Hermes and Mariner transposons were successfully used to transform, respectively, the Culex pipiens quinquefasciatus germ line (Allen et al., 2001) and A. aegypti somatic cells (Coates et al., 1998). In addition, the availability of the genome sequence facilitates the development of high-throughput genome-wide technologies such as high density expression micro-arrays, genome tiling arrays, or ChIP-chip methodology. David et al. (2005), for example, developed the first micro-array to study insecticide resistance in malaria vectors. More recently, Vontas et al. (2007) monitored gene expression in insecticide resistant and susceptible strains of A. stephensi and identified a small number of genes putatively differentially expressed between the strains. Halasz et al. (2006) developed a method to analyse tiling array data and tested it on an A. gambiae tiling array, identifying non-exonic loci that were actively transcribed. ChIP-chip methodologies have been applied to Drosophila, including analyses to study the binding of transcription factors (Moses et al., 2006; Zeitlinger et al., 2007), and are likely to be applied to mosquitoes soon.
The transmission of infectious agents (pathogens) from a vector to the human host is usually by direct contact (biting or sucking). To make the most of the feeding interaction with the host, the vector requires a system of anti-coagulants, vasodilators and other modulators of the haemodynamic process. Identifying any of these compounds is potentially of interest to the medical industry. Ribeiro and collaborators have already characterized some of these agents through studies of mosquito salivary gland gene expression (Ribeiro et al., 2006; Calvo et al., 2007; Santos et al., 2007).
Recent sequencing efforts within the Insecta group have allowed the generation of increasingly accurate phylogenetic trees for organisms (Fig. 1). Genomic comparison helps to explain the evolutionary events linking species by identifying genes conserved across species, genes evolving quickly and, in particular, species-specific expansions of gene families. Such results can lead to the identification of genes linked to a given behavior (e.g. blood feeding), or involved in pathogen transmission or insecticide resistance (e.g. immunity genes). Ultimately, this could lead to new approaches in the control of disease transmission by these organisms. For example, Waterhouse et al. (2007) used the genome sequences of Drosophila melanogaster, A. gambiae and A. aegypti to compare the insect immune repertoire and identified conserved and rapidly evolving immune-related genes.
The sequencing of Expressed Sequence Tags (ESTs), short fragments of expressed sequences, often precedes the sequencing of a genome, either as a complement or as a temporary alternative. The generation of EST sequences is rapid and relatively cheap and has been used for gene discovery for species where the resources are not currently available for full genome sequencing, such as the sand fly and tsetse fly. Ribeiro and collaborators have analysed the gene expression of several mosquito (Anopheline and Aedine) and tick species using small libraries, ranging from several hundred to several thousand sequences. EST sequences were clustered together and a series of bioinformatics analyses was then applied to each consensus sequence. The results of these analyses were collated into a spreadsheet that can be queried or browsed to identify the function of each transcript cluster consensus (Ribeiro et al., 2004; Santos et al., 2004; Arca et al., 2007).
EST sequences represent a sub-set of the repertoire of genes expressed in the RNA sample from which the library was generated. Analysis of the sequences can inform us about the variety and abundance of transcripts within a cell, a tissue type, a developmental stage or an organism. Depending on the experimental design, EST libraries can be normalized, a process by which abundant transcripts are removed in order to maximize novel gene discovery, or left in their native state, where the abundance of each transcript is proportional to the expression level in the original RNA sample. Non-normalized libraries give information about expression levels and can be analysed with other libraries to identify genes differentially expressed between two samples or conditions; for example, male and female mosquitoes, or insecticide-susceptible and insecticide-resistant strains. Such studies can be carried out without any genomic sequence. Nisbet et al. (2006) describe how the use of large scale EST projects has helped the understanding of the immunology of host-parasite relationships and the potential of this knowledge in developing vaccines.
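The comparison of two non-normalized libraries can be sketched as follows. This is a minimal, purely descriptive illustration and not the procedure of any study cited above: the function names, the cluster identifiers (`c1`, `c2`) and the pseudocount are all assumptions made for the example. It scales the EST count of each cluster to counts per 1,000 ESTs, so that libraries of different sizes are comparable, and then reports the ratio of abundances. A real analysis would add a statistical test before calling a gene differentially expressed.

```python
from collections import Counter

def counts_per_thousand(cluster_ids):
    """Scale the number of ESTs per cluster to counts per 1,000 ESTs."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cluster: 1000.0 * n / total for cluster, n in counts.items()}

def abundance_ratios(library_a, library_b, pseudocount=1.0):
    """Ratio of scaled abundance (library A over library B) for each cluster.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for clusters that are absent from one of the two libraries.
    """
    a = counts_per_thousand(library_a)
    b = counts_per_thousand(library_b)
    ratios = {}
    for cluster in sorted(set(a) | set(b)):
        ratios[cluster] = (a.get(cluster, 0.0) + pseudocount) / (
            b.get(cluster, 0.0) + pseudocount
        )
    return ratios

# Each list element is the (hypothetical) cluster to which one EST was assigned.
resistant = ["c1"] * 6 + ["c2"] * 4
susceptible = ["c1"] * 2 + ["c2"] * 8
for cluster, ratio in abundance_ratios(resistant, susceptible).items():
    print(cluster, round(ratio, 2))
```

In this toy input, cluster `c1` is more abundant in the first library (ratio above 1) and `c2` in the second (ratio below 1). Applying the same calculation to a normalized library would be meaningless, because normalization deliberately removes the link between EST count and expression level.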
EST libraries are generated from polyA+ RNA and hence represent mainly expressed sequences from a genome. This information is very useful when predicting gene structures on a genomic sequence: genes predicted from EST data, even if not necessarily full-length, are of high confidence and are more likely to be valid than, for example, genes predicted ab initio. Moreover, EST data remain the best way of finding and annotating alternative splice forms.
EST libraries can also be used as a resource for the community. The Malaria Research and Reference Reagent Resource Centre (MR4 - http://www.mr4.org/) is a repository for malaria related reagents and provides, free of charge, a variety of plasmid and clone vectors, antibodies, genomic and cDNA libraries, cell lines, preserved mosquitoes etc. It contains 15 EST libraries for various Plasmodium species and 4 EST libraries for A. gambiae. Reagents are collected from scientists who wish to make them available to the scientific malaria community. ESTs are a valuable tool for identifying genes and estimating their expression level, and when coupled to the genomic data, they provide additional information to improve the annotation.
In 2002, marking the start of the genomic era in the field of human pathogen vectors, the genome of the mosquito A. gambiae, vector of malaria, was sequenced and annotated (Holt et al., 2002). The annotation was helped by the existence of Anopheline mRNA and protein data, complemented by several EST libraries, and by the huge quantity of D. melanogaster data. A few years later, in 2005/06, the genome of the yellow fever mosquito A. aegypti was sequenced and annotated (Nene et al., 2007). More recently, the list of sequenced organisms has increased with the addition of the body louse Pediculus humanus and the mosquito C. pipiens, both gene sets being currently in preparation and expected to be released at the beginning of 2008. The genome sequence from the tick Ixodes scapularis has just been released and its annotation has started. The genome of the bug Rhodnius prolixus is planned to be released mid-2008 and should be annotated soon thereafter. Further organisms (including additional A. gambiae populations, the tsetse fly Glossina morsitans, and several sand fly species) are expected in the next few years. The tsetse and sand fly projects already have a certain amount of EST data and their analysis has started (Table 1).
As the number of sequenced and annotated vectors increases, it becomes easier to analyse subsequent ones. Data from closely related organisms are often used to annotate new genomes, which makes the most of high-quality predictions across species but risks propagating incorrect predictions if care is not taken. The annotation of A. aegypti, for example, was largely based on the D. melanogaster and A. gambiae data. In addition, sequencing techniques and analysis tools continue to become cheaper, faster and more accurate. Funding agencies have recognized the need for large scale data for pathogen vectors to help the understanding of human infectious diseases.
These sequencing and annotation projects are based on international collaborations grouping scientists from the sequencing centres (e.g. the Sanger Centre, the Broad Institute, the J. Craig Venter Institute, the Genome Sequencing Centre at Washington University, the Human Genome Sequencing Centre at the Baylor College of Medicine), experts in genome annotation (e.g. sequencing centres, VectorBase) and the larger community of scientists specialising in these organisms.
With the increasing amount of data generated by the sequencing projects and their subsequent analysis, it becomes crucial to organize the storage of and access to these data. VectorBase (Lawson et al., 2007 - http://www.vectorbase.org) is an NIH-NIAID-funded Resource Centre for Invertebrate Vectors of Human Pathogens, organising information about these organisms: sequences, gene sets and related information, pictures and controlled vocabulary for mosquito and tick anatomy and physiology. A key feature of VectorBase is the Ensembl genome browser, developed at the European Bioinformatics Institute (Hubbard et al., 2007), used to display genes along the genome and to link them to related information, such as manual annotation, physical mapping data, expression data or protein and DNA similarities. Comparative data are also handled similarly to Ensembl, with homolog (ortholog/paralog) information linking genes from the various organisms and the possibility to “jump” from one genome to the other via their protein or DNA similarities. Non-comparative and non-automatic annotations are stored in a Chado database and are managed using GMOD tools. GMOD (Generic Model Organism Database - http://www.gmod.org) is a collection of software tools for creating and managing genome-scale biological databases.
Chado, developed within the FlyBase consortium, is a relational database that is part of this kit and capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny (Mungall et al., 2007). Data not handled by the classical schema are integrated via DAS (Distributed Annotation System), a protocol used to exchange biological sequence annotation. It is exploited to supply data from remote databases to the VectorBase genome browser, allowing external users to map their own data to the VectorBase genomes and make them available to the whole community while retaining the ability to update them at any time. Most of the data available through VectorBase have been generated internally, using annotation and genomic comparison pipelines to analyse the data. Manual annotation is provided for selected regions using approaches developed at FlyBase (Crosby et al., 2007). The expression data, while not generated in house, are collected via the BASE interface (http://base.thep.lu.se/) and mapped to the genomes internally.
VectorBase offers a number of tools to mine the data, including a BLAST server allowing users to compare their sequences to any of the sequences (genome, traces, ESTs, transcripts or proteins) of any of the organisms hosted, and a ClustalW tool to align sequences together. The search engine allows the user to enter the site from any keyword: tool or organism, gene name or protein identifier, micro-array name or Controlled Vocabulary term about mosquito or tick physiology, etc. VectorBase also provides GDAV (Genome De-linked Annotation Viewer), a simple set of tools allowing the publishing of EST, gene or protein annotations generated by an independent project. It is installed on the user’s own computer system, leaving the user in full control of the data. An optional component allows, via the DAS protocol, the viewing of similarities between the user's sequences and one or more genomes from the VectorBase genome browser.
The VectorBase data are available by download as flat files, either in FASTA format (sequence data) or as database dumps (on request). Additionally, BioMart (Kasprzyk et al., 2004) is a more sophisticated tool for querying the databases, building MySQL queries based on simple choices from the user, and returning tabulated flat files.
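For readers unfamiliar with the format, a downloaded sequence flat file can be read with a few lines of code. The sketch below assumes only the general FASTA convention (a header line starting with `>` followed by one or more sequence lines); the function name `read_fasta` and the record identifiers in the example are hypothetical and do not correspond to real VectorBase entries. Established libraries offer more robust parsers, so this is meant only to show the structure of the data.

```python
import io

def read_fasta(handle):
    """Read FASTA text from a file-like object into {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records

# Hypothetical two-record file; the identifiers are invented for the example.
example = io.StringIO(">seq1 first toy record\nACGT\nTTGA\n\n>seq2\nGGCC\n")
sequences = read_fasta(example)
print({name: len(seq) for name, seq in sequences.items()})
```

The same function can be applied to a file on disk by passing an open file object, for example `with open(path) as handle: read_fasta(handle)`, where `path` is wherever the downloaded file was saved.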
Eight organisms are currently available: two with a complete gene set, two with a genome and a gene set near completion and four at various stages of sequencing, going from the newly fully sequenced tick to the on-going tsetse fly, with ESTs and traces, and the sand flies for which ESTs only are available (Table 2).
VectorBase aims to be the main resource centre for the invertebrate disease vector communities, involving the scientists and generating, updating and giving easy access to the data.
Research into invertebrate vectors of human pathogens has reached a new level since the introduction of genomics. Many areas have benefited from the huge amount of data generated: vector biology and its consequences for population studies and vector control; molecular biology, with the increased understanding of biological processes and the possibility to design new tools; biomedicine, with the prospect of new biopharmaceutical targets; and evolutionary biology, with the addition of new organisms for comparative genomic analysis. Progress can be expected to accelerate as more invertebrate genomes are sequenced and annotated. Additional Anopheles genomes will complement the existing data about A. gambiae (Besansky, 2006) and help in understanding the relation between sub-species. The list can be extended to include animal and plant vectors: the pea aphid Acyrthosiphon pisum (Gauthier et al., 2007) is currently being sequenced and the tick Boophilus microplus, affecting cattle and horses (Guerrero et al., 2006), is awaiting funding. Other non-vector insects already sequenced contribute to our ability to annotate and understand vector genomes: the honey bee Apis mellifera (Consortium, 2006), the silk worm Bombyx mori (Mita et al., 2004) and the agricultural pest, the red flour beetle Tribolium castaneum (Brown et al., 2003; Wang et al., 2007). The future of research into vectors of human pathogens looks very exciting.
The core VectorBase project is funded by contract HHSN266200400039C from the NIAID, and supported, in part, by the BioMalPar network of excellence. We would like to thank all the persons involved in the VectorBase project for developing the resources described in this review, the Ensembl team for their help when using their pipelines and the scientific community working on invertebrate vectors of human pathogens for their collaborations and support.