The polyglutamine diseases are caused in part by a gain-of-function mechanism of neuronal toxicity involving protein conformational changes that result in the formation and deposition of β-sheet rich aggregates. Recent evidence suggests that the misfolding mechanism is context-dependent, and that properties of the host protein, including the domain architecture and location of the repeat tract, can modulate aggregation. In order to allow the bioinformatic investigation of the context of polyglutamines, we have constructed a database, PolyQ (http://pxgrid.med.monash.edu.au/polyq). We have collected the sequences of all human proteins containing runs of seven or more glutamine residues and annotated their sequences with domain information. PolyQ can be interrogated such that the sequence context of polyglutamine repeats in disease and non-disease associated proteins can be investigated.
Polyglutamine (polyQ) repeats are implicated in several neurodegenerative diseases, including Huntington’s disease and several spinocerebellar ataxias. It is commonly thought that a toxic gain-of-function mechanism is triggered by the presence of a polyQ tract, involving a conformational change within the protein and the formation and deposition of β-sheet rich amyloid-like fibrils (1–3).
The length of the polyQ repeat is critical to pathogenesis; however, there is evidence that other protein factors, including the location, type and number of flanking domains, can modulate pathogenesis (4–10). Although there are many human polyQ-containing proteins (11), only nine are implicated in pathogenesis, and the precise repeat threshold for pathogenesis varies within the disease subset. For example, a 37-glutamine repeat is sufficient to lead to Huntington's disease, whereas SCA3 results only when the polyQ repeat expands to 45 glutamines or more (12–14).
Many other human, non-disease related proteins contain polyQ repeats, which are intrinsically prone to expansion at the genetic level (11,15,16). In fact, a 40-glutamine repeat is the normal allele present in forkhead box P2 transcription factor, a protein that has not been found to be associated with a polyQ disease (17,18). This evidence has led to the hypothesis that protein characteristics modulate the propensity of polyQ-containing proteins to aggregate and cause disease. To investigate the variable characteristics of polyQ proteins, we have performed a bioinformatics investigation of the protein context of polyglutamine repeats and constructed a web-accessible database, termed ‘PolyQ’, of all human proteins containing a polyQ repeat of seven or more glutamines. The PolyQ database provides a tool to compare the polyQ repeat location, the occurrence and type of domains, and the number of domain repeats across disease and non-disease proteins.
The PolyQ database was populated by extracting all human sequences from the NCBI non-redundant (NR) database that contained at least seven consecutive glutamine residues. We then performed a Pfam (19) domain search to find protein domains within this subset of sequences. The NCBI NR contains many versions of the same protein, which created bias in the statistical analysis of polyQ location data. We simplified the analysis by identifying protein variants/isoforms and using only the longest isoform of each protein (which we termed the ‘master sequence’), thereby eliminating splice variants and protein fragments. Multiple variants/isoforms of each protein were first crudely identified by comparing the protein sequence following the polyQ chain: the 10 amino acids immediately after the polyQ chain of a sequence were used as a ‘search string’, and any other sequence in which this string immediately followed its own polyQ chain was presumed to be homologous to that sequence. The original sequences were then subjected to the BLASTClust (20), FORCE (21), MCL (22) and HomoClust (23) algorithms to confirm the homology groups, and the variants/isoforms were adjusted as necessary. This yielded a total of 128 master sequences from an original data set of >700 polyQ-containing human protein sequences.
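The crude first pass described above can be illustrated with a short sketch. This is not the pipeline used to build the database: the function names are hypothetical, and the sketch assumes that only the first polyQ tract of each sequence supplies the search string, that sequences are grouped when their 10-residue strings are identical, and that a sequence with fewer than 10 residues after its tract forms a group of its own. The confirmation step with BLASTClust, FORCE, MCL and HomoClust is not reproduced.

```python
import re
from collections import defaultdict

# A polyQ tract is taken here as a maximal run of seven or more glutamines.
POLYQ = re.compile(r"Q{7,}")
FLANK_LEN = 10  # residues after the tract used as the 'search string'


def polyq_tracts(seq):
    """Return (start, end) for every polyQ tract; end is exclusive."""
    return [(m.start(), m.end()) for m in POLYQ.finditer(seq)]


def flank_key(name, seq):
    """Grouping key: the 10 residues after the first polyQ tract.

    Assumption: a sequence with no tract returns None, and a sequence
    with too short a flank is keyed by its own name (a singleton group).
    """
    tracts = polyq_tracts(seq)
    if not tracts:
        return None
    end = tracts[0][1]
    flank = seq[end:end + FLANK_LEN]
    return flank if len(flank) == FLANK_LEN else ("short", name)


def master_sequences(records):
    """Group sequences by flank key and keep the longest member of each group.

    records maps a sequence name to its amino acid sequence.
    Returns a sorted list of the names chosen as master sequences.
    """
    groups = defaultdict(list)
    for name, seq in records.items():
        key = flank_key(name, seq)
        if key is not None:
            groups[key].append(name)
    return sorted(max(names, key=lambda n: len(records[n]))
                  for names in groups.values())


# Toy records (invented sequences, for illustration only).
records = {
    "isoform_a": "MKT" + "Q" * 8 + "PLSAVERTYG" + "HHHH",
    "isoform_b": "T" + "Q" * 10 + "PLSAVERTYG",
    "other": "MA" + "Q" * 7 + "GGGGSSSSAA" + "K",
    "no_tract": "MA" + "Q" * 6 + "A",
}
masters = master_sequences(records)
```

In this toy input, isoform_a and isoform_b share the same 10 residues after their polyQ tracts, so only the longer of the two is kept as a master sequence; no_tract is discarded because it has only six consecutive glutamines.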
The database can be searched according to protein name, Pfam domain or sequence. A typical search (Figure 1A) returns both a graphical summary (Figure 1A, top) and textual details (Figure 1A, bottom), organized by sequence classification (see below). The graphical summary shows pie chart and bar chart representations of the results according to sequence classification (Figure 1A, top), Pfam domain occurrence (Figure 1B) and Pfam domain repetition (Figure 1C). Retrieved database entries are listed in table format with one row per protein and three columns containing the protein name (with links to the GenBank entry), the Pfam domains and the protein sequence (with the polyQ region annotated), respectively. Homologs in the database can be included in or excluded from the search. From this view, the domain and sequence context of the polyQ sequence can be identified and further interrogated. To aid analysis, specific entries can be selected from the results (using the ‘examine’ button) and grouped together.
The data are sorted and annotated according to the following sequence classifications: N-Terminal PolyQs, sequences where the first polyQ chain appears before all Pfam domains; C-Terminal PolyQs, sequences where the last polyQ chain appears after all Pfam domains; Interdomain PolyQs, sequences where the polyQ chains appear between the first Pfam domain and the last Pfam domain; Mid Domain PolyQs, sequences in which the polyQ chain appears in the middle of a Pfam domain or overlaps a Pfam domain; No Significant Domain PolyQs, sequences that do not contain any significant Pfam domains; and Unclassified PolyQs, sequences that did not fit into any of the above classifications. Each group is readily accessed using the tabs in the web page (Figure 1A). We have further reduced the redundancy in the data by clustering sequence homologs, and have tagged known disease proteins.
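The classification rules above can be expressed as interval comparisons between polyQ tracts and Pfam domains. The following is a minimal sketch under stated assumptions, not the database's own code: tracts and domains are given as half-open (start, end) residue intervals, the function name is hypothetical, and because the text does not say which class takes precedence when a sequence satisfies more than one rule, the sketch simply returns every class that applies.

```python
# Class labels in a fixed reporting order (order chosen for this sketch).
LABELS = ["N-Terminal", "C-Terminal", "Interdomain", "Mid Domain"]


def classify_sequence(tracts, domains):
    """Return the list of classes that apply to one sequence.

    tracts  -- list of (start, end) polyQ intervals, end exclusive
    domains -- list of (start, end) significant Pfam domain intervals
    """
    if not domains:
        return ["No Significant Domain"]
    if not tracts:
        return ["Unclassified"]

    first_dom_start = min(start for start, _ in domains)
    last_dom_end = max(end for _, end in domains)
    tracts = sorted(tracts)
    found = set()

    # First tract ends before any domain begins.
    if tracts[0][1] <= first_dom_start:
        found.add("N-Terminal")
    # Last tract begins after every domain has ended.
    if tracts[-1][0] >= last_dom_end:
        found.add("C-Terminal")

    for q_start, q_end in tracts:
        overlaps = any(q_start < d_end and d_start < q_end
                       for d_start, d_end in domains)
        if overlaps:
            # Tract lies inside, or partly overlaps, a domain.
            found.add("Mid Domain")
        elif first_dom_start <= q_start and q_end <= last_dom_end:
            # Tract lies in a gap between the first and last domains.
            found.add("Interdomain")

    ordered = [label for label in LABELS if label in found]
    return ordered or ["Unclassified"]


# Example: one tract in the linker between two domains.
example = classify_sequence([(50, 60)], [(10, 40), (70, 100)])
```

Returning a list rather than a single label keeps the sketch honest about the ambiguity: a sequence with one tract before all domains and another after all domains is reported as both N-Terminal and C-Terminal.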
The website features pre-constructed pages that show the database entries separated into non-disease and disease-causing proteins. This distinction is applied to the sequence classifications above, the domain occurrence (e.g. listing all domains, Figure 1B) and the domain repeats (Figure 1C). This allows database entries to be grouped and examined according to whether the polyQ tracts are found in non-disease or disease-causing proteins (Figure 2).
PolyQ is a valuable resource for theoreticians and experimentalists looking for insights into the context of polyQ repeats in proteins and their relationship with disease. Although the query tool allows searching across much of the database, we are developing a custom interface that will allow user-configurable queries against the whole data set, as well as user customization of how the results are displayed. We are also adding structural information [e.g. from the SCOP (24), CATH (25) and PDB (26) databases] to the resource so that the structural context of polyQ repeats can be investigated.
This work is supported by the National Health and Medical Research Council and the Australian Research Council. S.P.B. and A.M.B. are NHMRC Senior Research Fellows. Funding for open access charge: National Health and Medical Research Council (Australia).
Conflict of interest statement. None declared.