Alpha-11C-methyl-L-tryptophan positron emission tomography (AMT-PET) allows evaluation of brain serotonin synthesis and can also track upregulation of the immunosuppressive kynurenine pathway in tumor tissue. Increased AMT uptake is a hallmark of World Health Organization grade III–IV gliomas. Our recent study also suggested decreased frontal cortical AMT uptake in glioma patients contralateral to the tumor. The clinical significance of extratumoral tryptophan metabolism has not been established. In the present study, we investigated clinical correlates of tryptophan metabolic abnormalities in the non-tumoral hemisphere of glioma patients.
Standardized AMT uptake values (SUV) and the uptake rate constant of AMT (K [mL/g/min], a measure proportional to serotonin synthesis in non-tumoral gray matter) were quantified in the frontal and temporal cortex as well as thalamus in the non-tumoral hemisphere in 77 AMT-PET scans of 66 patients (41 males; mean age: 55±15 years) with grade III–IV gliomas. These AMT values were determined pre-treatment in 35 and post-treatment in 42 patients and were correlated with clinical variables and survival.
AMT uptake in the thalamus showed a moderate age-related increase pre-treatment (SUV, r=0.39, p=0.02) but a decrease post-treatment (K, r=−0.33; p=0.057). Females had higher thalamic SUVs pre-treatment (p=0.037) and higher thalamic (p=0.013) and frontal cortical K values (p=0.023) post-treatment. In the post-treatment glioma group, high thalamic SUVs and high thalamo-cortical SUV ratios were associated with short survival in Cox regression analysis. The thalamo-cortical ratio remained strongly prognostic (p<0.01) when clinical predictors, including age, glioma grade, and time since radiotherapy, were entered in the regression model. A long interval between radiotherapy and post-treatment AMT-PET and a high radiation dose affecting the thalamus were associated with lower contralateral thalamic or cortical AMT uptake values.
These observations provide evidence for altered tryptophan uptake in contralateral cortical and thalamic brain regions in glioma patients after initial therapy, suggesting treatment effects on the serotonergic system. Low thalamic tryptophan uptake appears to be a strong, independent predictor of long survival in patients with previous glioma treatment.
Glioma; brain; tryptophan metabolism; PET; survival
This study was conducted as a part of the Chromosome-Centric Human Proteome Project (C-HPP) of the Human Proteome Organization. The United States team of C-HPP is focused on characterizing the protein-coding genes in chromosome 17. Despite its small size, chromosome 17 is rich in protein-coding genes and contains many cancer-associated genes, including BRCA1, ERBB2 (Her2/neu), and TP53. The goal of this study was to examine the splice variants expressed in three ERBB2-expressing breast cancer cell line models of hormone receptor-negative breast cancers by integrating RNA-Seq and proteomic mass spectrometry data. The cell lines represent distinct phenotypic subtypes: SKBR3 (ERBB2+ (over-expression)/ER−/PR−; adenocarcinoma), SUM190 (ERBB2+ (over-expression)/ER−/PR−; inflammatory breast cancer) and SUM149 (ERBB2 (low expression)/ER−/PR−; inflammatory breast cancer). We identified more than one splice variant for 1167 genes expressed in at least one of the three cancer cell lines. We found multiple variants of genes that are in the signaling pathways downstream of ERBB2, along with variants specific to one cancer cell line compared to the other two cancer cell lines and to normal mammary cells. The overall transcript profiles based on read counts indicated more similarities between SKBR3 and SUM190. The top-ranking Gene Ontology and BioCarta pathways for the cell-line-specific variants pointed to distinct key mechanisms, including amino sugar metabolism, caspase activity, and endocytosis in SKBR3; different aspects of metabolism, especially of lipids, in SUM190; and cell-to-cell adhesion, integrin and ERK1/ERK2 signaling, and translational control in SUM149. The analyses indicated an enrichment in the electron transport chain processes in the ERBB2 over-expressing cell line models, and an association of nucleotide binding, RNA splicing and translation processes with the inflammatory breast cancer (IBC) models, SUM190 and SUM149.
Detailed experimental studies on the distinct variants identified from each of these three breast cancer cell line models may open opportunities for drug target discovery and help unveil their specific roles in cancer progression and metastasis.
Splice variants (SpV); splice variant protein (SpP); splice variant transcript (SpT); ERBB2+ (Her2/neu); EGFR; proteotypic peptide; I-TASSER; breast cancer subtypes
Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.
The American College of Medical Genetics and Genomics (ACMG) recently released guidelines regarding the reporting of incidental findings in sequencing data. Given the availability of direct-to-consumer (DTC) genetic testing and the falling cost of whole exome and genome sequencing, individuals will increasingly have the opportunity to analyze their own genomic data. We have developed a web-based tool, PATH-SCAN, which annotates individual genomes and exomes for ClinVar designated pathogenic variants found within the genes from the ACMG guidelines. Because mutations in these genes predispose individuals to conditions with actionable outcomes, our tool will allow individuals or researchers to identify potential risk variants in order to consult physicians or genetic counselors for further evaluation. Moreover, our tool allows individuals to anonymously submit their pathogenic burden, so that we can crowdsource the collection of quantitative information regarding the frequency of these variants. We tested our tool on 1092 publicly available genomes from the 1000 Genomes project, 163 genomes from the Personal Genome Project, and 15 genomes from a clinical genome sequencing research project. Excluding the most commonly seen variant in 1000 Genomes, about 20% of all genomes analyzed had a ClinVar designated pathogenic variant that required further evaluation.
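The annotation step described above can be pictured as a two-condition filter: keep a variant only if it lies in a guideline gene and carries a pathogenic ClinVar designation. The following is a minimal sketch of that idea, not the PATH-SCAN implementation. The record layout (`gene`, `clinvar_significance`), the function name, and the four-gene set (an illustrative subset, not the full ACMG list) are all assumptions made for this example.

```python
# Illustrative subset only; the ACMG guidelines list many more genes.
ACMG_GENES_SUBSET = {"BRCA1", "BRCA2", "TP53", "MLH1"}


def flag_pathogenic_variants(variants, genes=ACMG_GENES_SUBSET):
    """Return variants that fall in a listed gene and are labeled pathogenic.

    Each variant is a dict with hypothetical keys 'gene' and
    'clinvar_significance'; real annotation files use richer formats.
    """
    flagged = []
    for variant in variants:
        in_listed_gene = variant["gene"] in genes
        is_pathogenic = variant["clinvar_significance"].strip().lower() == "pathogenic"
        if in_listed_gene and is_pathogenic:
            flagged.append(variant)
    return flagged


# Toy input: made-up records, not real genotype data.
example_variants = [
    {"id": "var1", "gene": "BRCA1", "clinvar_significance": "Pathogenic"},
    {"id": "var2", "gene": "BRCA1", "clinvar_significance": "Benign"},
    {"id": "var3", "gene": "GENE_X", "clinvar_significance": "Pathogenic"},
]

for hit in flag_pathogenic_variants(example_variants):
    print(hit["id"], hit["gene"])
```

Any flagged variant in such a scheme is a candidate for follow-up with a physician or genetic counselor, not a diagnosis.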
Autism is a complex disease whose etiology remains elusive. We integrated previously and newly generated data and developed a systems framework involving the interactome, gene expression and genome sequencing to identify a protein interaction module with members strongly enriched for autism candidate genes. Sequencing of 25 patients confirmed the involvement of this module in autism, which was subsequently validated using an independent cohort of over 500 patients. Expression of this module was dichotomized with a ubiquitously expressed subcomponent and another subcomponent preferentially expressed in the corpus callosum, which was significantly affected by our identified mutations in the network center. RNA-sequencing of the corpus callosum from patients with autism exhibited extensive gene mis-expression in this module, and our immunochemical analysis showed that the human corpus callosum is predominantly populated by oligodendrocyte cells. Analysis of functional genomic data further revealed a significant involvement of this module in the development of oligodendrocyte cells in mouse brain. Our analysis delineates a natural network involved in autism, helps uncover novel candidate genes for this disease and improves our understanding of its molecular pathology.
autism spectrum disorders; corpus callosum; functional modules; oligodendrocytes; protein interaction network
The morbidity and mortality attributable to heritable and sporadic carcinomas of the colon are substantial and affect children and adults alike. Despite current colonoscopy screening recommendations, colorectal adenocarcinoma (CRC) still accounts for almost 140,000 cancer cases yearly. Familial adenomatous polyposis (FAP) is a colon cancer predisposition due to alterations in the adenomatous polyposis coli gene, which is mutated in most CRC. Since the beginning of the genomic era, next-generation sequencing analyses of CRC have continued to improve our understanding of the genetics of tumorigenesis and promise to expand our ability to identify and treat this disease. Advances in genome sequence analysis have facilitated the molecular diagnosis of individuals with FAP, which enables initiation of appropriate monitoring and timely intervention. Genome sequencing also has potential clinical impact for individuals with sporadic forms of CRC, providing means for molecular diagnosis of CRC tumor type, data guiding selection of tumor-targeted therapies, and pharmacogenomic profiles specifying patient-specific drug tolerances. There is even a potential role for genomic sequencing in surveillance for recurrence, and early detection, of CRC. We review strategies for diagnostic assessment and management of FAP and sporadic CRC in the current genomic era, with emphasis on the current, and potential for future, impact of genome sequencing on the clinical care of these conditions.
Colorectal adenocarcinoma; Familial adenomatous polyposis; Genome sequencing; Personalized medicine; Cancer genomics; Pharmacogenomics; Genomic medicine
Synthetic genes that confer resistance to the antibiotic nourseothricin in the pathogenic fungus Candida albicans are available, but genes conferring resistance to other antibiotics are not. We found that multiple C. albicans strains were inhibited by hygromycin B, so we designed a 1026 bp gene (CaHygB) that encodes Escherichia coli hygromycin B phosphotransferase with C. albicans codons. CaHygB conferred hygromycin B resistance in C. albicans transformed with ars2-containing plasmids or single-copy integrating vectors. Since CaHygB did not confer nourseothricin resistance and since the nourseothricin resistance marker SAT-1 did not confer hygromycin B resistance, we reasoned that these two markers could be used for homologous gene disruptions in wild-type C. albicans. We used PCR to fuse CaHygB or SAT-1 to approximately 1 kb of 5’ and 3’ noncoding DNA from C. albicans ARG4, HIS1 and LEU2, and we introduced the resulting amplicons into 6 wild-type C. albicans strains. Homologous targeting frequencies were approximately 50-70%, and disruption of both alleles of ARG4, HIS1 and LEU2 was verified by the respective transformants’ inabilities to grow without arginine, histidine and leucine. CaHygB should be a useful tool for genetic manipulation of different C. albicans strains, including clinical isolates.
The endoplasmic reticulum-associated degradation (ERAD) pathway is responsible for the translocation of misfolded proteins across the ER membrane into the cytosol for subsequent degradation by the proteasome. In order to understand the spectrum of clinical and molecular findings in a complex neurological syndrome, we studied a series of eight patients with inherited deficiency of N-glycanase 1 (NGLY1), a novel disorder of cytosolic ERAD dysfunction.
Whole-genome, whole-exome or standard Sanger sequencing techniques were employed. Retrospective chart reviews were performed in order to obtain clinical data.
All patients had global developmental delay, a movement disorder, and hypotonia. Other common findings included hypo- or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8). The nonsense mutation c.1201A>T (p.R401X) was the most common deleterious allele.
NGLY1 deficiency is a novel autosomal recessive disorder of the ERAD pathway associated with neurological dysfunction, abnormal tear production, and liver disease. The majority of patients detected to date carry a specific nonsense mutation that appears to be associated with severe disease. The phenotypic spectrum is likely to enlarge as cases with a broader range of mutations are detected.
NGLY1; alacrima; choreoathetosis; seizures; liver disease
Glioblastoma is an infiltrative malignancy that tends to extend beyond the MRI-defined tumor volume. We utilized positron emission tomography (PET) imaging with the radiotracer alpha-[11C]methyl-L-tryptophan (AMT) to develop a reliable high-risk gross tumor volume (HR-GTV) method for delineation of glioblastoma. AMT can detect solid tumor mass and tumoral brain infiltration by increased tumoral tryptophan transport and metabolism via the immunosuppressive kynurenine pathway.
We reviewed all patients in our database with histologically proven glioblastoma who underwent an AMT-PET scan prior to surgery and chemoradiation. Treated radiotherapy volumes were derived from the simulation CT with MRI fusion. The high-risk GTV based on contrast-enhanced T1-weighted MRI alone (GTVMRI) was defined as the postoperative cavity plus any residual area of enhancement on postcontrast T1-weighted images. AMT-PET images were retrospectively fused to the simulation CT, and a high-risk GTV generated by AMT-PET alone (GTVAMT) was defined using a threshold previously established to distinguish tumor tissue from peritumoral edema. A composite volume of the MRI and AMT tumor volumes was also created (combination of MRI fused with AMT-PET data; GTVMRI+AMT). In patients with definitive radiographic progression, follow-up MRI demonstrating initial tumor progression was fused with the pretreatment images and a progression volume was contoured. The coverage of the progression volume by GTVMRI, GTVAMT, and GTVMRI+AMT was determined and compared using the Wilcoxon signed-rank test.
Eleven patients completed a presurgical AMT-PET scan, seven of whom had progressive disease after initial therapy. GTVMRI (mean, 50.2 cm³) and GTVAMT (mean, 48.9 cm³) were not significantly different. Mean concordance index of the volumes was 39±15%. Coverage of the initial recurrence volume by GTVMRI (mean, 52%) was inferior to both GTVAMT (mean, 68%; p=0.028) and GTVMRI+AMT (mean, 73%; p=0.018). The AMT-PET-exclusive coverage was up to 41% of the recurrent volume. There was a tendency towards better recurrence coverage with GTVMRI+AMT than with GTVAMT alone (p=0.068). Addition of a 5 mm concentric margin around GTVMRI, GTVAMT, and GTVMRI+AMT would have completely covered the initial progression volume in 14, 57, and 71% of the patients, respectively.
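The coverage and concordance figures reported here are overlap measures between contoured volumes. The sketch below shows one way such measures could be computed from voxel sets. It makes two assumptions that the abstract does not state: the concordance index is taken as intersection over union, and volumes are represented as sets of integer voxel coordinates. Function names and the toy volumes are hypothetical.

```python
def coverage_percent(target, volume):
    """Percentage of the target volume (e.g. progression) lying inside `volume`."""
    if not target:
        raise ValueError("target volume is empty")
    return 100.0 * len(target & volume) / len(target)


def concordance_index_percent(volume_a, volume_b):
    """Overlap of two volumes as intersection over union (assumed definition)."""
    union = volume_a | volume_b
    if not union:
        raise ValueError("both volumes are empty")
    return 100.0 * len(volume_a & volume_b) / len(union)


# Toy volumes: a single row of voxels, purely for illustration.
gtv_mri = {(x, 0, 0) for x in range(0, 10)}
gtv_amt = {(x, 0, 0) for x in range(5, 15)}
gtv_combined = gtv_mri | gtv_amt          # composite of both volumes
progression = {(x, 0, 0) for x in range(8, 16)}

print(coverage_percent(progression, gtv_mri))        # MRI-only coverage
print(coverage_percent(progression, gtv_amt))        # AMT-only coverage
print(coverage_percent(progression, gtv_combined))   # composite coverage
print(concordance_index_percent(gtv_mri, gtv_amt))
```

Because the composite volume is a union, its coverage of the progression volume can never be lower than that of either component, which is consistent with the ordering of the mean coverages above.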
We found that a GTV defined by AMT-PET produced a similar volume but superior recurrence coverage compared with the treated standard MRI-determined volume. A prospective study is necessary to fully determine the usefulness of AMT-PET for volume definition in glioblastoma radiotherapy planning.
MRI; Tryptophan; PET; GTV; Volumetry; Radiation therapy; Recurrence coverage
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial, co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs -- e.g. noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
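A feed-forward loop, one of the enriched motifs mentioned above, is a three-node pattern in which regulator A targets B, and both A and B target C. The sketch below enumerates such loops in a small directed edge list. It is a brute-force illustration on made-up node names, not the motif-enrichment analysis used in the study, and the function name is hypothetical.

```python
def feed_forward_loops(edges):
    """Return sorted (a, b, c) triples with edges a->b, b->c and a->c."""
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)

    loops = []
    for a, a_targets in targets.items():
        for b in a_targets:
            if b == a:
                continue  # ignore self-regulation
            for c in targets.get(b, ()):
                # c must be a third, distinct node that a also regulates
                if c != a and c != b and c in a_targets:
                    loops.append((a, b, c))
    return sorted(loops)


# Toy regulatory network with hypothetical node names.
toy_edges = [
    ("TF_A", "TF_B"),
    ("TF_B", "gene_C"),
    ("TF_A", "gene_C"),
    ("TF_B", "gene_D"),
]

print(feed_forward_loops(toy_edges))
```

Judging whether a motif is enriched would additionally require comparing such counts against randomized networks, which this sketch does not attempt.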
The rapid development of high-throughput technologies and computational frameworks enables the examination of biological systems in unprecedented detail. The ability to study biological phenomena at omics levels in turn is expected to lead to significant advances in personalized and precision medicine. Patients can be treated according to their own molecular characteristics. Individual omes as well as the integrated profiles of multiple omes, such as the genome, the epigenome, the transcriptome, the proteome, the metabolome, the antibodyome, and other omics information are expected to be valuable for health monitoring, preventative measures, and precision medicine. Moreover, omics technologies have the potential to transform medicine from traditional symptom-oriented diagnosis and treatment of diseases towards disease prevention and early diagnostics. We discuss here the advances and challenges in systems biology-powered personalized medicine at its current stage, as well as a prospective view of future personalized health care at the end of this review.
The health of an individual depends upon their DNA as well as environmental factors (environome or exposome). It is expected that although the genome is the blueprint of an individual, its analysis with that of the other omes, such as the DNA methylome, the transcriptome, proteome, as well as metabolome will further provide a dynamic assessment of the physiology and health state of an individual. This review will help to categorize the current progress of omics analyses, and how omics integration can be used for medical research. We believe that integrative Personal Omics Profiling (iPOP) is a stepping stone to a new road to personalized health care and may improve 1) Disease risk assessment, 2) Accuracy of diagnosis, 3) Disease monitoring, 4) Targeted treatments and 5) Understanding the biological processes of disease states for their prevention.
Combined Immunodeficiency with Multiple Intestinal Atresias (CID-MIA) is a rare hereditary disease characterized by intestinal obstructions and profound immune defects.
We sought to determine the underlying genetic causes of CID-MIA by analyzing the exomic sequence of 5 patients and their healthy direct relatives from 5 unrelated families.
We performed whole exome sequencing on 5 CID-MIA patients and 10 healthy direct family members belonging to 5 unrelated families with CID-MIA. We also performed targeted Sanger sequencing for the candidate gene TTC7A on 3 additional CID-MIA patients.
Through analysis and comparison of the exomic sequence of the individuals from these 5 families, we identified biallelic damaging mutations in the TTC7A gene, for a total of 7 distinct mutations. Targeted TTC7A gene sequencing in 3 additional unrelated patients with CID-MIA revealed biallelic deleterious mutations in two of them, as well as an aberrant splice product in the third patient. Staining of normal thymus showed that the TTC7A protein is expressed in thymic epithelial cells as well as in thymocytes. Moreover, severe lymphoid depletion was observed in the thymus and peripheral lymphoid tissues from two patients with CID-MIA.
We identified deleterious mutations of the TTC7A gene in 8 unrelated patients with CID-MIA and demonstrated that the TTC7A protein is expressed in the thymus. Our results strongly suggest that TTC7A gene defects cause CID-MIA.
Damaging mutations in the gene TTC7A should be scrutinized in patients with CID-MIA. Characterization of the role of this protein in the immune system and intestinal development, as well as in thymic epithelial cells may have important therapeutic implications.
Combined Immunodeficiency with Multiple Intestinal Atresias; Tetratricopeptide Repeat Domain 7A; Whole Exome Sequencing; Thymus
Rapid growth of sequencing technologies has greatly contributed to increasing our understanding of human genetics. Yet, in spite of this growth, mainstream technologies have been largely unsuccessful in resolving the diploid nature of the human genome. Here we describe statistically aided long read haplotyping (SLRH), a rapid, accurate method based on a simple experimental protocol that requires potentially as little as 30 Gbp of sequencing in addition to a standard (50x coverage) whole-genome analysis for human samples. Using this technology, we phase 99% of single-nucleotide variants in three human genomes into long haplotype blocks of 200 kbp to 1 Mbp in length. As a demonstration of the potential applications of our method, we determine allele-specific methylation patterns in a human genome and identify hundreds of differentially methylated regions that were previously unknown. Such information may offer insight into the mechanisms behind differential gene expression.
Steroidogenic acute regulatory protein (StAR)-related lipid transfer (START) domains were first identified from mammalian proteins that bind lipid/sterol ligands via a hydrophobic pocket. In plants, predicted START domains are predominantly found in homeodomain leucine zipper (HD-Zip) transcription factors that are master regulators of cell-type differentiation in development. Here we utilized studies of Arabidopsis in parallel with heterologous expression of START domains in yeast to investigate the hypothesis that START domains are versatile ligand-binding motifs that can modulate transcription factor activity.
Our results show that deletion of the START domain from Arabidopsis Glabra2 (GL2), a representative HD-Zip transcription factor involved in differentiation of the epidermis, results in a complete loss-of-function phenotype, although the protein is correctly localized to the nucleus. Despite low sequence similarity, the mammalian START domain from StAR can functionally replace the HD-Zip-derived START domain. Embedding the START domain within a synthetic transcription factor in yeast, we found that several mammalian START domains from StAR, MLN64 and PCTP stimulated transcription factor activity, as did START domains from two Arabidopsis HD-Zip transcription factors. Mutation of ligand-binding residues within StAR START reduced this activity, consistent with the yeast assay monitoring ligand binding. The D182L missense mutation in StAR START was shown to affect GL2 transcription factor activity in maintenance of the leaf trichome cell fate. Analysis of in vivo protein–metabolite interactions by mass spectrometry provided direct evidence for analogous lipid-binding activity in mammalian and plant START domains in the yeast system. Structural modeling predicted similar-sized ligand-binding cavities in a subset of plant START domains in comparison to mammalian counterparts.
The START domain is required for transcription factor activity in HD-Zip proteins from plants, although it is not strictly necessary for the protein’s nuclear localization. START domains from both mammals and plants are modular in that they can bind lipid ligands to regulate transcription factor function in a yeast system. The data provide evidence for an evolutionarily conserved mechanism by which lipid metabolites can orchestrate transcription. We propose a model in which the START domain is used by both plants and mammals to regulate transcription factor activity.
Electronic supplementary material
The online version of this article (doi:10.1186/s12915-014-0070-8) contains supplementary material, which is available to authorized users.
Transcription; Steroidogenic acute regulatory related lipid transfer; START; StAR; Homeodomain; HD-Zip; Glabra2; Yeast; Arabidopsis; Mouse
In this study we selected three breast cancer cell lines (SKBR3, SUM149 and SUM190) with different oncogene expression levels involved in ERBB2 and EGFR signaling pathways as a model system for the evaluation of selective integration of subsets of transcriptomic and proteomic data. We assessed the oncogene status with RPKM values (Reads Per Kilobase per Million mapped reads) for ERBB2 (14.4, 400 and 300 for SUM149, SUM190 and SKBR3, respectively) and for EGFR (60.1, not detected and 1.4 for the same three cell lines). We then used RNA-Seq data to identify those oncogenes with significant transcript levels in these cell lines (total 31) and interrogated the corresponding proteomics data sets for proteins with significant interaction values with these oncogenes. The number of observed interactors for each oncogene showed a significant range, e.g. 4.2% (JAK1) to 27.3% (MYC). The percentage is measured as a fraction of the total protein interactions in a given data set vs. total interactors for that oncogene in STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, version 9.0) and I2D (Interologous Interaction Database, version 1.95). This approach allowed us to focus on 4 main oncogenes, ERBB2, EGFR, MYC, and GRB2, for pathway analysis. We used the bioinformatics resources GeneGo, PathwayCommons and NCI receptor signaling networks to identify pathways that contained the four main oncogenes and had good coverage in the transcriptomic and proteomic data sets as well as a significant number of oncogene interactors. The four pathways identified were ERBB signaling, EGFR1 signaling, integrin outside-in signaling, and validated targets of C-MYC transcriptional activation. The greater dynamic range of the RNA-Seq values allowed the use of transcript ratios to correlate observed protein values with the relative levels of the ERBB2 and EGFR transcripts in each of the four pathways.
This provided us with potential proteomic signatures for the SUM149 and SUM190 cell lines: growth factor receptor-bound protein 7 (GRB7), Crk-like protein (CRKL) and catenin delta-1 (CTNND1) for ERBB signaling; caveolin 1 (CAV1) and plectin (PLEC) for EGFR signaling; filamin A (FLNA) and actinin alpha1 (ACTN1) (associated with high levels of EGFR transcript) for integrin signaling; branched chain amino-acid transaminase 1 (BCAT1), carbamoyl-phosphate synthetase (CAD) and nucleolin (NCL) (high levels of EGFR transcript), and transferrin receptor (TFRC) and metadherin (MTDH) (high levels of ERBB2 transcript) for MYC signaling; and S100-A2 protein (S100A2), caveolin 1 (CAV1), Serpin B5 (SERPINB5), stratifin (SFN), PYD and CARD domain containing (PYCARD), and EPH receptor A2 (EPHA2) for PI3K signaling, p53 sub-pathway. Future studies of inflammatory breast cancer (IBC), from which the cell lines were derived, will be used to explore the significance of these observations.
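Two quantities in this abstract are simple ratios and can be written out explicitly. The sketch below computes RPKM from its standard definition (reads, scaled per kilobase of transcript and per million mapped reads) and an interactor coverage percentage. The coverage function reflects one reading of the description above, namely observed interactors divided by the interactors listed for that oncogene in the reference databases; this reading, the function names, and all example numbers are assumptions, not values from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def interactor_coverage_percent(observed_proteins, known_interactors):
    """Percentage of an oncogene's known interactors seen in a proteomic data set."""
    if not known_interactors:
        return 0.0
    seen = set(observed_proteins) & set(known_interactors)
    return 100.0 * len(seen) / len(known_interactors)


# Made-up numbers for illustration only.
print(rpkm(read_count=2000, transcript_length_bp=4000, total_mapped_reads=20_000_000))
print(interactor_coverage_percent({"P1", "P2", "P9"}, {"P1", "P2", "P3", "P4"}))
```

Because RPKM normalizes for both transcript length and sequencing depth, ratios of RPKM values between cell lines can be compared directly, which is what makes the transcript-ratio comparison described above feasible.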
We report progress assembling the parts list for chromosome 17 and illustrate the various processes that we have developed to integrate available data from diverse genomic and proteomic knowledge bases. As primary resources we have used GPMDB, neXtProt, PeptideAtlas, Human Protein Atlas (HPA), and GeneCards. All sites share the common resource of Ensembl for the genome modeling information. We have defined the chromosome 17 parts list with the following information: 1169 protein-coding genes, the numbers of proteins confidently identified by various experimental approaches as documented in GPMDB, neXtProt, PeptideAtlas, and HPA, examples of typical data sets obtained by RNA-Seq and proteomic studies of epithelial-derived tumor cell lines (disease proteome) and a normal proteome (peripheral mononuclear cells), reported evidence of post-translational modifications, and examples of alternative splice variants (ASVs). We have constructed a list of the 59 ‘missing’ proteins as well as 201 proteins that have inconclusive mass spectrometric (MS) identifications. In this report we have defined a process to establish a baseline for the incorporation of new evidence on protein identification and characterization as well as related information from transcriptome analyses. This initial list of ‘missing’ proteins will guide the selection of appropriate samples for discovery studies as well as antibody reagents. We have also illustrated the significant diversity of protein variants (including post-translational modifications, PTMs) using regions on chromosome 17 that contain important oncogenes. We emphasize the need for mandated deposition of proteomics data in public databases, the further development of improved PTM, ASV and single nucleotide variant (SNV) databases and the construction of websites that can integrate and regularly update such information. In addition, we describe the distribution of both clustered and scattered sets of protein families on the chromosome.
Since chromosome 17 is rich in cancer-associated genes, we have focused on the clustering of cancer-associated genes in such genomic regions and have used the ERBB2 amplicon as an example of the value of a proteogenomic approach in which one integrates transcriptomic with proteomic information and captures evidence of co-expression through coordinated regulation.
Chromosome-Centric Human Proteome Project; Chromosome 17 Parts List; ERBB2; Oncogene
Large-scale sequencing efforts have documented extensive genetic variation within the human genome. However, our understanding of the origins, global distribution, and functional consequences of this variation is far from complete. While regulatory variation influencing gene expression has been studied within a handful of populations, the breadth of transcriptome differences across diverse human populations has not been systematically analyzed. To better understand the spectrum of gene expression variation, alternative splicing, and the population genetics of regulatory variation in humans, we have sequenced the genomes, exomes, and transcriptomes of EBV transformed lymphoblastoid cell lines derived from 45 individuals in the Human Genome Diversity Panel (HGDP). The populations sampled span the geographic breadth of human migration history and include Namibian San, Mbuti Pygmies of the Democratic Republic of Congo, Algerian Mozabites, Pathan of Pakistan, Cambodians of East Asia, Yakut of Siberia, and Mayans of Mexico. We discover that approximately 25.0% of the variation in gene expression found amongst individuals can be attributed to population differences. However, we find few genes that are systematically differentially expressed among populations. Of this population-specific variation, 75.5% is due to expression rather than splicing variability, and we find few genes with strong evidence for differential splicing across populations. Allelic expression analyses indicate that previously mapped common regulatory variants identified in eight populations from the International Haplotype Map Phase 3 project have similar effects in our seven sampled HGDP populations, suggesting that the cellular effects of common variants are shared across diverse populations. 
Together, these results provide a resource for studies analyzing functional differences across populations by estimating the degree of shared gene expression, alternative splicing, and regulatory genetics across populations from the broadest points of human migration history yet sampled.
Previous gene expression studies have identified factors influencing population-level variation in gene regulation. However, these efforts have been limited to a small set of well-studied populations. By leveraging the high resolution of RNA sequencing and broad population sampling, we survey the landscape of transcriptome variation across a globally distributed set of seven populations that span a breadth of human genetic variation and major dispersal events. We assess differences in gene expression, transcript structure, and regulatory variation. We find only 44 transcripts that show significant differences in expression, likely as a result of the small sample size, but we find that 25% of the variance in gene expression is due to population differences. This is a larger fraction than previously observed, and it is likely due to the greater breadth of human diversity assayed in this study. We also find that population-specific variance is mostly due to transcription variability rather than the configuration of expressed gene products. Additionally, known common regulatory variants have similar effects across populations including those we study here. These data and results serve as a resource cataloging the wide array of gene expression regulation affecting population variation among diverse groups, improving our understanding of transcriptional diversity.
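The statement that a fraction of expression variance is "due to population differences" can be illustrated for a single gene with a one-way variance decomposition: the between-population sum of squares divided by the total sum of squares. The sketch below is only an illustration of that idea; the function name, population labels, and expression values are hypothetical, and the statistical model actually used in the study is not described in this text.

```python
from statistics import mean


def population_variance_fraction(expression_by_population):
    """Return between-population sum of squares divided by total sum of squares.

    expression_by_population maps a population label to a list of expression
    values for one gene (hypothetical input layout, one value per individual).
    """
    all_values = [v for values in expression_by_population.values() for v in values]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        # No variation at all, so nothing can be attributed to population.
        return 0.0
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in expression_by_population.values()
    )
    return ss_between / ss_total


# Toy data (made up for illustration, not taken from the study).
toy_gene = {
    "pop_A": [1.0, 2.0, 3.0],
    "pop_B": [2.0, 3.0, 4.0],
}
fraction = population_variance_fraction(toy_gene)
print(f"Fraction of variance between populations: {fraction:.3f}")
```

Averaging such a per-gene fraction over many genes is one simple way to arrive at a genome-wide summary, although other estimators (for example, mixed models) are also commonly used.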
Whole exome sequencing by high-throughput sequencing of target-enriched genomic DNA (exome-seq) has become common in basic and translational research as a means of interrogating the interpretable part of the human genome at relatively low cost. Presented here is a comparison of three major commercial exome sequencing platforms from Agilent, Illumina and Nimblegen applied to the same human blood sample. The Nimblegen platform, which is the only one to use high-density overlapping baits, provides increased efficiency of enrichment and sensitivity for detecting variants but covers fewer genomic regions than the other platforms. As a result, Nimblegen requires the least amount of sequencing to sensitively detect small variants, but Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina in particular captures the untranslated regions, which are missing from the Nimblegen and Agilent platforms. Exome sequencing and whole genome sequencing (WGS) of the same sample were also compared, demonstrating that exome-seq allows for the detection of additional small variants missed by WGS. These data suggest that WGS experiments benefit from being supplemented with targeted exome-seq data. This study serves to assist the community in selecting the optimal exome-seq platform for their experiments, as well as proving that exome-seq is capable of identifying important coding variations that are missed by a typical WGS experiment.
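The platform and exome-versus-WGS comparisons above amount to asking which variants are shared between call sets and which are found by only one of them. A minimal sketch of that comparison follows, assuming each variant is represented as a `(chromosome, position, ref, alt)` tuple; the function name, set labels, and variants are hypothetical and are not data from the study.

```python
def compare_call_sets(call_sets):
    """Return (shared, unique) for a dict mapping a label to a set of variants.

    shared: variants present in every call set.
    unique: for each label, variants present in that call set and no other.
    """
    shared = set.intersection(*call_sets.values())
    unique = {}
    for name, variants in call_sets.items():
        # Union of all variants reported by the other call sets.
        others = set().union(*(v for n, v in call_sets.items() if n != name))
        unique[name] = variants - others
    return shared, unique


# Toy call sets (made up for illustration).
calls = {
    "exome": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")},
    "wgs": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr4", 400, "T", "C")},
}
shared, unique = compare_call_sets(calls)
print(len(shared), "shared;", {name: len(v) for name, v in unique.items()})
```

In practice, variants would need to be normalized (for example, left-aligned indels) before such a comparison is meaningful; that step is omitted here.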
In parallel to the genetic code for protein synthesis, a second layer of information is embedded in all RNA transcripts in the form of RNA structure. RNA structure influences practically every step in the gene expression program [1]. Yet the nature of most RNA structures or effects of sequence variation on structure are not known. Here we report the initial landscape and variation of RNA secondary structures (RSS) in a human family trio, providing a comprehensive RSS map of human coding and noncoding RNAs. We identify unique RSS signatures that demarcate open reading frames, splicing junctions, and define authentic microRNA binding sites. Comparison of native deproteinized RNA isolated from cells versus refolded purified RNA suggests that the majority of the RSS information is encoded within RNA sequence. Over 1900 transcribed single nucleotide variants (~15% of all transcribed SNVs) alter local RNA structure. We discover simple sequence and spacing rules that determine the ability of point mutations to impact RSS. Selective depletion of RiboSNitches versus structurally synonymous variants at precise locations suggests selection for specific RNA shapes at thousands of sites, including 3′ UTRs, binding sites of miRNAs and RNA binding proteins genome-wide. These results highlight the potentially broad contribution of RNA structure and its variation to gene regulation.
Pseudorabies virus (PRV) is a neurotropic herpesvirus that causes Aujeszky’s disease in pigs. PRV strains are widely used as transsynaptic tracers for mapping neural circuits. We present here the complete and fully annotated genome sequence of strain Kaplan of PRV, determined by Pacific Biosciences RSII long-read sequencing technology.
Different trans-acting factors (TFs) collaborate and act in concert at distinct loci to perform accurate regulation of their target genes. To date, the co-binding of TF pairs has been investigated in a limited context, both in terms of the number of factors within a cell type and across cell types and in terms of the extent of combinatorial co-localizations. Here we use a novel approach to analyze TF co-localization within a cell type and across multiple cell lines at an unprecedented level. We extend this approach with large-scale mass spectrometry analysis of immunoprecipitations of 50 TFs. Our combined approach reveals large numbers of interesting and novel TF-TF associations. We observe extensive change in TF co-localizations both within a cell type exposed to different conditions and across multiple cell types. We show distinct functional annotations and properties of different TF co-binding patterns and provide new insights into the complex regulatory landscape of the cell.
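As a rough illustration of what pairwise co-localization means, the sketch below computes the fraction of one factor's binding intervals that overlap at least one interval of another factor. This is not the novel approach referred to above, which is not described in this text; it is a generic overlap measure, and the function names, interval layout `(chromosome, start, end)`, and coordinates are all hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share any position."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def co_binding_fraction(peaks_a, peaks_b):
    """Fraction of intervals in peaks_a overlapping at least one interval in peaks_b."""
    if not peaks_a:
        return 0.0
    hits = sum(any(overlaps(p, q) for q in peaks_b) for p in peaks_a)
    return hits / len(peaks_a)


# Toy binding sites for two factors (made up for illustration).
tf1_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
tf2_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

frac_tf1 = co_binding_fraction(tf1_peaks, tf2_peaks)
frac_tf2 = co_binding_fraction(tf2_peaks, tf1_peaks)
print(f"TF1 sites co-bound by TF2: {frac_tf1:.2f}; TF2 sites co-bound by TF1: {frac_tf2:.2f}")
```

The measure is asymmetric by design, since a factor with few binding sites can be almost entirely co-bound by a factor with many sites while the reverse fraction stays small. The quadratic scan is adequate for a toy example but would be replaced by sorted or indexed intervals for genome-scale data.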
Gene duplication is a significant source of novel genes, but the dynamics of gene duplicate retention versus loss are poorly understood, particularly in terms of the functional and regulatory specialization of their gene products. We compiled a comprehensive data set of S. cerevisiae phosphosites to study the role of phosphorylation in yeast paralog divergence. We found that proteins coded by duplicated genes created in the Whole Genome Duplication (WGD) event and in a period prior to the WGD are significantly more phosphorylated than other duplicates or singletons. Though the amino acid sequences of the two paralogs in a given pair tend to diverge fairly similarly from their common ortholog in a related species, the phosphorylated amino acids tend to diverge in sequence from the ortholog at different rates. We observed that transcription factors (TFs) are disproportionately present among the set of duplicate genes and among the set of proteins that are phosphorylated. Interestingly, TFs that occur on higher levels of the transcription network hierarchy (i.e., tend to regulate other TFs) tend to be more phosphorylated than lower-level TFs. We found that TF paralog divergence in expression, binding, and sequence correlates with the abundance of phosphosites. Overall, these studies have important implications for understanding divergence of gene function and regulation in eukaryotes.
phosphorylation; gene duplication; paralogs; transcription factors
Global RNA studies have become central to understanding biological processes, but methods such as microarrays and short-read sequencing are unable to describe an entire RNA molecule from 5′ to 3′ end. Here we use single-molecule long-read sequencing technology from Pacific Biosciences to sequence the polyadenylated RNA complement of a pooled set of 20 human organs and tissues without the need for fragmentation or amplification. We show that full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5′ ends. For longer RNA molecules more 5′ nucleotides are missing, but complete intron structures are often preserved. In total, we identify ~14,000 spliced GENCODE genes. High-confidence mappings are consistent with GENCODE annotations, but >10% of the alignments represent intron structures that were not previously annotated. As a group, transcripts mapping to unannotated regions have features of long, noncoding RNAs. Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.