1.  Phlebotomus papatasi SP15: mRNA expression variability and amino acid sequence polymorphisms of field populations 
Parasites & Vectors  2015;8:298.
The Phlebotomus papatasi salivary protein PpSP15 was shown to protect mice against Leishmania major, suggesting that incorporation of salivary molecules in multi-component vaccines may be a viable strategy for anti-Leishmania vaccines.
Here, we investigated PpSP15 predicted amino acid sequence variability and mRNA profile of P. papatasi field populations from the Middle East. In addition, predicted MHC class II T-cell epitopes were obtained and compared to areas of amino acid sequence variability within the secreted protein.
The analysis of PpSP15 expression from field populations revealed significant intra- and interpopulation variation. In spite of the variability detected for P. papatasi populations, common epitopes for MHC class II binding are still present and may potentially be used to boost the response against Le. major infections.
Conserved epitopes of PpSP15 could potentially be used in the development of a salivary gland antigen-based vaccine.
Electronic supplementary material
The online version of this article (doi:10.1186/s13071-015-0914-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4472253  PMID: 26022221
Sand fly; Saliva; PpSP15; Leishmaniasis; Vaccine; Expression variability; MHC class II epitopes
2.  VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases 
Nucleic Acids Research  2014;43(Database issue):D707-D713.
VectorBase is a National Institute of Allergy and Infectious Diseases supported Bioinformatics Resource Center (BRC) for invertebrate vectors of human pathogens. Now in its 11th year, VectorBase currently hosts the genomes of 35 organisms including a number of non-vectors for comparative analysis. Hosted data range from genome assemblies with annotated gene features, transcript and protein expression data to population genetics including variation and insecticide-resistance phenotypes. Here we describe improvements to our resource and the set of tools available for interrogating and accessing BRC data including the integration of Web Apollo to facilitate community annotation and providing Galaxy to support user-based workflows. VectorBase also actively supports our community through hands-on workshops and online tutorials. All information and data are freely available from our website.
PMCID: PMC4383932  PMID: 25510499
3.  Examination of the genetic basis for sexual dimorphism in the Aedes aegypti (dengue vector mosquito) pupal brain 
Most animal species exhibit sexually dimorphic behaviors, many of which are linked to reproduction. A number of these behaviors, including blood feeding in female mosquitoes, contribute to the global spread of vector-borne illnesses. However, knowledge concerning the genetic basis of sexually dimorphic traits is limited in any organism, including mosquitoes, especially with respect to differences in the developing nervous system.
Custom microarrays were used to examine global differences in female vs. male gene expression in the developing pupal head of the dengue vector mosquito, Aedes aegypti. The spatial expression patterns of a subset of differentially expressed transcripts were examined in the developing female vs. male pupal brain through in situ hybridization experiments. Small interfering RNA (siRNA)-mediated knockdown studies were used to assess the putative role of Doublesex, a terminal component of the sex determination pathway, in the regulation of sex-specific gene expression observed in the developing pupal brain.
Transcripts (2,527), many of which were linked to proteolysis, the proteasome, metabolism, catabolic, and biosynthetic processes, ion transport, cell growth, and proliferation, were found to be differentially expressed in A. aegypti female vs. male pupal heads. Analysis of the spatial expression patterns for a subset of dimorphically expressed genes in the pupal brain validated the data set and also facilitated the identification of brain regions with dimorphic gene expression. In many cases, dimorphic gene expression localized to the optic lobe. Sex-specific differences in gene expression were also detected in the antennal lobe and mushroom body. siRNA-mediated gene targeting experiments demonstrated that Doublesex, a transcription factor with consensus binding sites located adjacent to many dimorphically expressed transcripts that function in neural development, is required for regulation of sex-specific gene expression in the developing A. aegypti brain.
These studies revealed sex-specific gene expression profiles in the developing A. aegypti pupal head and identified Doublesex as a key regulator of sexually dimorphic gene expression during mosquito neural development.
PMCID: PMC4342991  PMID: 25729562
Aedes aegypti; Mosquito; Vector; Pupae; Brain; Nervous system; Dimorphism; Doublesex; Development; Optic lobe
4.  VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics 
Nucleic Acids Research  2011;40(Database issue):D729-D734.
VectorBase is a NIAID-supported bioinformatics resource for invertebrate vectors of human pathogens. It hosts data for nine genomes: mosquitoes (three Anopheles gambiae genomes, Aedes aegypti and Culex quinquefasciatus), tick (Ixodes scapularis), body louse (Pediculus humanus), kissing bug (Rhodnius prolixus) and tsetse fly (Glossina morsitans). Hosted data range from genomic features and expression data to population genetics and ontologies. We describe improvements and integration of new data that expand our taxonomic coverage. Releases are bi-monthly and include the delivery of preliminary data for emerging genomes. Frequent updates of the genome browser provide VectorBase users with increasing options for visualizing their own high-throughput data. One major development is a new population biology resource for storing genomic variations, insecticide resistance data and their associated metadata. It takes advantage of improved ontologies and controlled vocabularies. Combined, these new features ensure timely release of multiple types of data in the public domain while helping overcome the bottlenecks of bioinformatics and annotation by engaging with our user community.
PMCID: PMC3245112  PMID: 22135296
5.  Standardized Metadata for Human Pathogen/Vector Genomic Sequences 
PLoS ONE  2014;9(6):e99979.
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium’s minimal information (MIxS) and NCBI’s BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. 
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
PMCID: PMC4061050  PMID: 24936976
6.  Assessing De Novo transcriptome assembly metrics for consistency and utility 
BMC Genomics  2013;14:465.
Transcriptome sequencing and assembly represent a great resource for the study of non-model species, and many metrics have been used to evaluate and compare these assemblies. Unfortunately, it is still unclear which of these metrics accurately reflect assembly quality.
We simulated sequencing transcripts of Drosophila melanogaster. By assembling these simulated reads using both a “perfect” and a modern transcriptome assembler while varying read length and sequencing depth, we evaluated quality metrics to determine whether they 1) revealed perfect assemblies to be of higher quality, and 2) revealed perfect assemblies to be more complete as data quantity increased.
Several commonly used metrics were not consistent with these expectations, including average contig coverage and length, though they became consistent when singletons were included in the analysis. We found several annotation-based metrics to be consistent and informative, including contig reciprocal best hit count and contig unique annotation count. Finally, we evaluated a number of novel metrics such as reverse annotation count, contig collapse factor, and the ortholog hit ratio, discovering that each assesses assembly quality in unique ways.
Although much attention has been given to transcriptome assembly, little research has focused on determining how best to evaluate assemblies, particularly in light of the variety of options available for read length and sequencing depth. Our results provide an important review of these metrics and give researchers tools to produce the highest quality transcriptome assemblies.
PMCID: PMC3733778  PMID: 23837739
7.  The Evolution of the Anopheles 16 Genomes Project 
G3: Genes|Genomes|Genetics  2013;3(7):1191-1194.
We report the imminent completion of a set of reference genome assemblies for 16 species of Anopheles mosquitoes. In addition to providing a generally useful resource for comparative genomic analyses, these genome sequences will greatly facilitate exploration of the capacity exhibited by some Anopheline mosquito species to serve as vectors for malaria parasites. A community analysis project will commence soon to perform a thorough comparative genomic investigation of these newly sequenced genomes. Completion of this project via the use of short next-generation sequence reads required innovation in both the bioinformatic and laboratory realms, and the resulting knowledge gained could prove useful for genome sequencing projects targeting other unconventional genomes.
PMCID: PMC3704246  PMID: 23708298
comparative; assembly; vector; malaria; collaboration
8.  Haplotype and minimum-chimerism consensus determination using short sequence data 
BMC Genomics  2012;13(Suppl 2):S4.
Assembling haplotypes given sequence data derived from a single individual is a well studied problem, but only recently has haplotype assembly been considered for population-sampled data. We discuss a software tool called Hapler, which is designed specifically for low-diversity, low-coverage data such as ecological samples derived from natural populations. Because such data may contain error as well as ambiguous haplotype information, we developed methods that increase confidence in these assemblies. Hapler also reconstructs full consensus sequences while minimizing and identifying possible chimeric points.
Experiments on simulated data indicate that Hapler is effective at assembling haplotypes from gene-sized alignments of short reads. Further, in our tests Hapler-generated consensus sequences are less chimeric than the alternative consensus approaches of majority vote and viral quasispecies estimation regardless of error rate, read length, or population haplotype bias.
The analysis of genetically diverse sequence data is increasingly common, particularly in the field of ecoinformatics where transcriptome sequencing of natural populations is a cost effective alternative to genome sequencing. For such studies, it is important to consider and identify haplotype diversity. Hapler provides robust haplotype information and identifies possible phasing errors in consensus sequences, providing valuable information for population studies and downstream usage of resulting assemblies.
PMCID: PMC3394418  PMID: 22537299
9.  High-throughput 454 resequencing for allele discovery and recombination mapping in Plasmodium falciparum 
BMC Genomics  2011;12:116.
Knowledge of the origins, distribution, and inheritance of variation in the malaria parasite (Plasmodium falciparum) genome is crucial for understanding its evolution; however the 81% (A+T) genome poses challenges to high-throughput sequencing technologies. We explore the viability of the Roche 454 Genome Sequencer FLX (GS FLX) high throughput sequencing technology for both whole genome sequencing and fine-resolution characterization of genetic exchange in malaria parasites.
We present a scheme to survey recombination in the haploid stage genomes of two sibling parasite clones, using whole genome pyrosequencing that includes a sliding window approach to predict recombination breakpoints. Whole genome shotgun (WGS) sequencing generated approximately 2 million reads, with an average read length of approximately 300 bp. De novo assembly using a combination of WGS and 3 kb paired end libraries resulted in contigs ≤ 34 kb. More than 8,000 of the 24,599 SNP markers identified between parents were genotyped in the progeny, resulting in a marker density of approximately 1 marker/3.3 kb and allowing for the detection of previously unrecognized crossovers (COs) and many non crossover (NCO) gene conversions throughout the genome.
By sequencing the 23 Mb genomes of two haploid progeny clones derived from a genetic cross at more than 30× coverage, we captured high resolution information on COs, NCOs and genetic variation within the progeny genomes. This study is the first to resequence progeny clones to examine fine structure of COs and NCOs in malaria parasites.
PMCID: PMC3055840  PMID: 21324207
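The sliding-window idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes each SNP marker in a haploid progeny clone has already been assigned to parent 'A' or 'B', takes a majority vote within each window of consecutive markers, and reports where the majority switches. The function names, the majority-vote rule, and the window size are illustrative assumptions, not the authors' published procedure.

```python
def window_calls(parents, size):
    """Majority parent ('A' or 'B') for each window of `size` consecutive markers.

    `parents` is a string or list of 'A'/'B' calls in genomic order.
    An odd `size` avoids ties (hypothetical convention for this sketch).
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    calls = []
    for i in range(len(parents) - size + 1):
        window = parents[i:i + size]
        calls.append("A" if window.count("A") * 2 > size else "B")
    return calls


def breakpoints(parents, size):
    """Indices of windows whose majority parent differs from the previous window."""
    calls = window_calls(parents, size)
    return [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]


# One clean switch from parent A to parent B.
print(breakpoints("AAAAABBBBB", 3))
# An isolated 'B' call inside an 'A' block is smoothed over by the window.
print(breakpoints("AABAAABBBBB", 3))
```

Smoothing of this kind trades resolution for robustness: a larger window suppresses isolated genotyping errors but can also hide short gene-conversion tracts.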
10.  Breakpoint structure of the Anopheles gambiae 2Rb chromosomal inversion 
Malaria Journal  2010;9:293.
Alternative arrangements of chromosome 2 inversions in Anopheles gambiae are important sources of population structure, and are associated with adaptation to environmental heterogeneity. The forces responsible for their origin and maintenance are incompletely understood. Molecular characterization of inversion breakpoints provides insight into how they arose, and provides the basis for development of molecular karyotyping methods useful in future studies.
Sequence comparison of regions near the cytological breakpoints of 2Rb allowed the molecular delineation of breakpoint boundaries. Comparisons were made between the standard 2R+b arrangement in the An. gambiae PEST reference genome and the inverted 2Rb arrangements in the An. gambiae M and S genome assemblies. Sequence differences between alternative 2Rb arrangements were exploited in the design of a PCR diagnostic assay, which was evaluated against the known chromosomal banding pattern of laboratory colonies and field-collected samples from Mali and Cameroon.
The breakpoints of the 7.55 Mb 2Rb inversion are flanked by extensive runs of the same short (72 bp) tandemly organized sequence, which was likely responsible for chromosomal breakage and rearrangement. Application of the molecular diagnostic assay suggested that 2Rb has a single common origin in An. gambiae and its sibling species, Anopheles arabiensis, and also that the standard arrangement (2R+b) may have arisen twice through breakpoint reuse. The molecular diagnostic was reliable when applied to laboratory colonies, but its accuracy was lower in natural populations.
The complex repetitive sequence flanking the 2Rb breakpoint region may be prone to structural and sequence-level instability. The 2Rb molecular diagnostic has immediate application in studies based on laboratory colonies, but its usefulness in natural populations awaits development of complementary molecular tools.
PMCID: PMC2988034  PMID: 20974007
11.  A statistical approach to finding overlooked genetic associations 
BMC Bioinformatics  2010;11:526.
Complexity and noise in expression quantitative trait loci (eQTL) studies make it difficult to distinguish potential regulatory relationships among the many interactions. The predominant method of identifying eQTLs finds associations that are significant at a genome-wide level. The vast number of statistical tests carried out on these data makes false negatives very likely. Corrections for multiple testing error render genome-wide eQTL techniques unable to detect modest regulatory effects.
We propose an alternative method to identify eQTLs that builds on traditional approaches. In contrast to genome-wide techniques, our method determines the significance of an association between an expression trait and a locus with respect to the set of all associations to the expression trait. The use of this specific information facilitates identification of expression traits that have an expression profile that is characterized by a single exceptional association to a locus.
Our approach identifies expression traits that have exceptional associations regardless of the genome-wide significance of those associations. This property facilitates the identification of possible false negatives for genome-wide significance. Further, our approach has the property of prioritizing expression traits that are affected by few strong associations. Expression traits identified by this method may warrant additional study because their expression level may be affected by targeting genes near a single locus.
We demonstrate our method by identifying eQTL hotspots in Plasmodium falciparum (malaria) and Saccharomyces cerevisiae (yeast). We demonstrate the prioritization of traits with few strong genetic effects through Gene Ontology (GO) analysis of yeast. Our results are strongly consistent with results gathered using genome-wide methods and identify additional hotspots and eQTLs.
New eQTLs and hotspots found with this method may represent regions of the genome or biological processes that are controlled through few relatively strong genetic interactions. These points of interest warrant experimental investigation.
PMCID: PMC2974753  PMID: 20964847
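The per-trait view described in this abstract, where an association is judged against all other associations to the same expression trait rather than against a genome-wide threshold, might be sketched as follows. The z-score statistic used here is a stand-in chosen for illustration, and the function name and input layout are hypothetical; the abstract does not specify the authors' actual test.

```python
from statistics import mean, pstdev


def exceptional_association(scores):
    """Find the strongest locus for one expression trait and score how far it
    stands out from that trait's remaining associations.

    `scores` holds one association score per locus for a single trait.
    Returns (locus_index, z), where z is measured against the other loci only.
    """
    scores = list(scores)
    if len(scores) < 3:
        raise ValueError("need at least three loci")
    best = max(range(len(scores)), key=scores.__getitem__)
    rest = scores[:best] + scores[best + 1:]
    spread = pstdev(rest)
    if spread == 0:
        # All other loci are identical; any excess is unbounded on this scale.
        return best, float("inf") if scores[best] > rest[0] else 0.0
    return best, (scores[best] - mean(rest)) / spread


# A trait with one locus far above an otherwise flat profile.
locus, z = exceptional_association([1.0, 2.0, 3.0, 10.0])
print(locus, round(z, 3))
```

Ranking traits by such a within-trait score favors profiles dominated by a single strong association, which is the property the abstract highlights.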
12.  Population-level transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio zelicaon 
BMC Genomics  2010;11:310.
Several recent studies have demonstrated the use of Roche 454 sequencing technology for de novo transcriptome analysis. Low error rates and high coverage also allow for effective SNP discovery and genetic diversity estimates. However, genetically diverse datasets, such as those sourced from natural populations, pose challenges for assembly programs and subsequent analysis. Further, estimating the effectiveness of transcript discovery using Roche 454 transcriptome data is still a difficult task.
Using the Roche 454 FLX Titanium platform, we sequenced and assembled larval transcriptomes for two butterfly species: the Propertius duskywing, Erynnis propertius (Lepidoptera: Hesperiidae) and the Anise swallowtail, Papilio zelicaon (Lepidoptera: Papilionidae). The Expressed Sequence Tags (ESTs) generated represent a diverse sample drawn from multiple populations, developmental stages, and stress treatments.
Despite this diversity, > 95% of the ESTs assembled into long (> 714 bp on average) and highly covered (> 9.6× on average) contigs. To estimate the effectiveness of transcript discovery, we compared the number of bases in the hit region of unigenes (contigs and singletons) to the length of the best match silkworm (Bombyx mori) protein; this "ortholog hit ratio" gives a close estimate of the amount of the transcript discovered relative to a model lepidopteran genome. For each species, we tested two assembly programs and two parameter sets; although CAP3 is commonly used for such data, the assemblies produced by Celera Assembler with modified parameters were chosen over those produced by CAP3 based on contig and singleton counts as well as ortholog hit ratio analysis. In the final assemblies, 1,413 E. propertius and 1,940 P. zelicaon unigenes had a ratio > 0.8; 2,866 E. propertius and 4,015 P. zelicaon unigenes had a ratio > 0.5.
Ultimately, these assemblies and SNP data will be used to generate microarrays for ecoinformatics examining climate change tolerance of different natural populations. These studies will benefit from high quality assemblies with few singletons (less than 26% of bases for each assembled transcriptome are present in unassembled singleton ESTs) and effective transcript discovery (over 6,500 of our putative orthologs cover at least 50% of the corresponding model silkworm gene).
PMCID: PMC2887415  PMID: 20478048
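The "ortholog hit ratio" described in this abstract can be written as a short calculation. The sketch below assumes the hit region is given in 1-based inclusive nucleotide coordinates on the unigene and that the protein length is in amino acids, so it is multiplied by 3 to compare like with like. The function names and the coordinate convention are assumptions made for illustration.

```python
def ortholog_hit_ratio(hit_start, hit_end, protein_length_aa):
    """Bases in the unigene's hit region divided by the coding length
    (3 x amino acids) of the best-match protein."""
    if protein_length_aa <= 0:
        raise ValueError("protein length must be positive")
    if hit_end < hit_start:
        raise ValueError("hit_end must not precede hit_start")
    hit_bases = hit_end - hit_start + 1
    return hit_bases / (3 * protein_length_aa)


def count_above(ratios, threshold):
    """Number of unigenes whose ratio exceeds a threshold such as 0.5 or 0.8."""
    return sum(1 for r in ratios if r > threshold)


# A 900-base hit against a 400-residue protein covers three quarters of it.
print(ortholog_hit_ratio(1, 900, 400))
print(count_above([0.2, 0.75, 0.9], 0.5))
```

A ratio near 1.0 suggests that nearly the full coding region of the putative ortholog was recovered; values above 1.0 are possible when the hit region contains insertions relative to the reference protein.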
13.  SNP discovery via 454 transcriptome sequencing 
The Plant Journal  2007;51(5):910-918.
A massively parallel pyro-sequencing technology commercialized by 454 Life Sciences Corporation was used to sequence the transcriptomes of shoot apical meristems isolated from two inbred lines of maize using laser capture microdissection (LCM). A computational pipeline that uses the POLYBAYES polymorphism detection system was adapted for 454 ESTs and used to detect SNPs (single nucleotide polymorphisms) between the two inbred lines. Putative SNPs were computationally identified using 260 000 and 280 000 454 ESTs from the B73 and Mo17 inbred lines, respectively. Over 36 000 putative SNPs were detected within 9980 unique B73 genomic anchor sequences (MAGIs). Stringent post-processing reduced this number to > 7000 putative SNPs. Over 85% (94/110) of a sample of these putative SNPs were successfully validated by Sanger sequencing. Based on this validation rate, this pilot experiment conservatively identified > 4900 valid SNPs within > 2400 maize genes. These results demonstrate that 454-based transcriptome sequencing is an excellent method for the high-throughput acquisition of gene-associated SNPs.
PMCID: PMC2169515  PMID: 17662031
SNPs; ESTs; maize; 454 sequencing; markers
14.  Global gene expression analysis of the shoot apical meristem of maize (Zea mays L.) 
The Plant Journal  2007;52(3):391-404.
All above-ground plant organs are derived from shoot apical meristems (SAMs). Global analyses of gene expression were conducted on maize (Zea mays L.) SAMs to identify genes preferentially expressed in the SAM. The SAMs were collected from 14-day-old B73 seedlings via laser capture microdissection (LCM). The RNA samples extracted from LCM-collected SAMs and from seedlings were hybridized to microarrays spotted with 37 660 maize cDNAs. Approximately 30% (10 816) of these cDNAs were prepared as part of this study from manually dissected B73 maize apices. Over 5000 expressed sequence tags (ESTs) (about 13% of the total) were differentially expressed (P<0.0001) between SAMs and seedlings. Of these, 2783 and 2248 ESTs were up- and down-regulated in the SAM, respectively. The expression in the SAM of several of the differentially expressed ESTs was validated via quantitative RT-PCR and/or in situ hybridization. The up-regulated ESTs included many regulatory genes including transcription factors, chromatin remodeling factors and components of the gene-silencing machinery, as well as about 900 genes with unknown functions. Surprisingly, transcripts that hybridized to 62 retrotransposon-related cDNAs were also substantially up-regulated in the SAM. Complementary DNAs derived from the LCM-collected SAMs were sequenced to identify additional genes that are expressed in the SAM. This generated around 550 000 ESTs (454-SAM ESTs) from two genotypes. Consistent with the microarray results, approximately 14% of the 454-SAM ESTs from B73 were retrotransposon-related. Possible roles of genes that are preferentially expressed in the SAM are discussed.
PMCID: PMC2156186  PMID: 17764504
shoot apical meristem; global gene expression; laser capture microdissection; 454 sequencing; development; retrotransposon expression
15.  PROBEmer: a web-based software tool for selecting optimal DNA oligos 
Nucleic Acids Research  2003;31(13):3746-3750.
PROBEmer is a web-based software tool that enables a researcher to select optimal oligos for PCR applications and multiplex detection platforms including oligonucleotide microarrays and bead-based arrays. Given two groups of nucleic-acid sequences, a target group and a non-target group, the software identifies oligo sequences that occur in members of the target group, but not in the non-target group. To help predict potential cross hybridization, PROBEmer computes all near neighbors in the non-target group and displays their alignments. The software has been used to obtain genus-specific prokaryotic probes based on the 16S rRNA gene, gene-specific probes for expression analyses and PCR primers. In this paper, we describe how to use PROBEmer, the computational methods it employs, and experimental results for oligos identified by this software tool.
PMCID: PMC168975  PMID: 12824409
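The target versus non-target selection described in this abstract can be illustrated with a simple set-based sketch. It assumes fixed-length oligos (k-mers), exact matching only, and that a candidate must occur in every target sequence. These are simplifying assumptions: the near-neighbor search the abstract mentions is not implemented here, and the function names are hypothetical rather than part of PROBEmer.

```python
def kmers(seq, k):
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_oligos(targets, non_targets, k):
    """k-mers shared by every target sequence and absent from all non-targets."""
    if not targets:
        return set()
    shared = set.intersection(*(kmers(s, k) for s in targets))
    excluded = set().union(*(kmers(s, k) for s in non_targets))
    return shared - excluded


targets = ["ACGTAC", "TACGTA"]
non_targets = ["CGTACC"]
print(candidate_oligos(targets, non_targets, 4))
```

In practice a candidate that differs from a non-target sequence by only one or two bases may still cross-hybridize, which is why an exact-match filter like this one is only a first pass.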

