The total number of human proteins studied was 18,666; of these, 99.8% contain proline. The median protein length is 436 residues. Together the proteins comprise 10,882,808 amino acids, of which 687,434 are prolyl residues, or 6.3% of all amino acids in the human proteome. The relative abundance of proline in the proteins of the human proteome is shown in the histogram in . The distributions of all 20 amino acids as a function of relative length are shown in Table S1. Of the prolyl residues, 564,316 occur as singlets (82.1%), 93,080 occur in 46,540 pairs (13.5%), and 30,038 occur in spans of 3 or more (4.4%).
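The run-length tally above can be illustrated with a simple scan over each sequence. The following is a minimal sketch in Python, not the pipeline used in this study; the function name `proline_run_stats` and the toy sequences are assumptions made for the example. It counts both the runs and the prolyl residues they contribute. The reported figures sum to the 687,434 prolyl residues when each of the 46,540 pairs contributes two residues (564,316 + 93,080 + 30,038).

```python
import re
from collections import Counter

def proline_run_stats(sequences):
    """Tally proline runs by length class over an iterable of protein sequences.

    Returns two Counters keyed by 'singlet', 'pair', and 'span3plus':
    the number of runs, and the number of prolyl residues in those runs.
    """
    runs, residues = Counter(), Counter()
    for seq in sequences:
        # Each regex match is one maximal run of consecutive prolines.
        for match in re.finditer(r"P+", seq.upper()):
            n = len(match.group())
            key = "singlet" if n == 1 else "pair" if n == 2 else "span3plus"
            runs[key] += 1
            residues[key] += n
    return runs, residues

# Toy input (hypothetical sequences, not taken from the proteome).
runs, residues = proline_run_stats(["MPAPPGPPPK", "APPPPA"])
total = sum(residues.values())
# Share of all prolyl residues found in each class, in percent.
shares = {k: round(100 * v / total, 1) for k, v in residues.items()}
```

Counting residues rather than runs is what makes the three shares add to 100%.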
Proline-Rich, Polyproline-Rich, and Proline-Poor Proteins
There are 46 proteins that contain no proline (Table S2). At the other extreme, the keratinocyte envelope protein SPRR2G (small proline rich protein 2G), 73 amino acid residues in length, is 39.7% proline. We examined the enrichment of functional annotations in the top and bottom 1% of proteins in the human proteome, ranked by their percentage of proline (Table S3 and Table S4).
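Selecting the two tails of the proline distribution amounts to a sort followed by a cut. The sketch below shows one way this could be done; the names `proline_percent` and `proline_tails`, the dictionary input, and the rounding of the 1% cut-off upward are all assumptions for illustration, and the enrichment analysis itself is not reproduced.

```python
import math

def proline_percent(seq):
    """Percentage of residues in seq that are proline (0.0 for an empty sequence)."""
    return 100.0 * seq.upper().count("P") / len(seq) if seq else 0.0

def proline_tails(proteome, fraction=0.01):
    """Return (poorest, richest) lists of protein names.

    proteome maps a protein name to its amino acid sequence. Each list holds
    the given fraction of proteins, rounded up, with at least one entry.
    """
    ranked = sorted(proteome, key=lambda name: proline_percent(proteome[name]))
    k = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

# Toy proteome with hypothetical names and sequences.
toy = {"rich": "PPPP", "none": "AAAA", "some": "APAA"}
poorest, richest = proline_tails(toy)
```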
Both the proline-rich and proline-poor proteins are heavily involved in the formation of the dermis and in keratinization, but their functional roles are very different. Proline-rich proteins include the collagens (e.g. COL3A1, COL5A1, COL1A2) and proteins of the cornified envelope (e.g. LCE2C, SPRR2G, LCE2B, SPRR2F), which rely on the properties of proline and hydroxyproline to form their helices. By contrast, many of the proline-poor proteins exhibit a coiled-coil motif; these include the intermediate filament proteins that make up the keratins (e.g. KRT9, KRT19, KRT77, KRT14). This distinction highlights the important role of proline and polyproline in determining helical structure: the two kinds of helix lie at opposite ends of the proline abundance scale. Proline-poor proteins and domains such as the keratins form one class of helices, which can assemble only by excluding proline. Proline-rich molecules such as the collagens, which adopt a contrasting triple-helical conformation, constitute another large group of proteins.
Many of the proline-poor proteins, such as those of the SNAP receptor (SNARE) complex, are involved in vesicle transport and membrane fusion, as exemplified by membrane docking, internal protein transport, and exocytosis (e.g. KDELR3, STX6, RAB2A, GNAI3, AP2S1, SNAP25). Another set of proline-poor proteins is highly enriched for the alpha-helical, calcium-binding EF-hand domain (e.g. S100A, CALB1, FLG2, MYL5). Other proline-poor proteins include the fatty acid-binding family (FABP1-5) and other lipid-binding proteins (e.g. PMP2, RBP2), as well as GTP-binding proteins (e.g. ARL5A, RAB13, GMAI1, GTBP6).
In addition to the collagens, proline-rich proteins include some of the homeobox (HOXA3, HOXA4, ESX1, CDX1, TPRX1) and forkhead box (FOXE3, FOXN1) proteins, as well as the zinc finger proteins. The proline-rich proteins also tend to be highly enriched for consecutive sequences of prolines, so-called polyproline sequences.
We examined proteins containing abundant proline, as well as proteins with stretches of contiguous prolines forming a “polyproline motif,” which we define as three or more prolines in consecutive sequence. The distribution of polyproline motifs among proteins in the human proteome is shown in & .
Table S2 provides detailed information about the number of prolines in the longest spans and the number of separate polyproline sequences.
Table S5 shows the start and end positions of each polyproline span.
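Given the definition of a polyproline motif as three or more consecutive prolines, the per-protein quantities described above (span positions, number of spans, longest span) can be derived with one regular expression. This is a sketch under stated assumptions: the function names are hypothetical, positions are reported as 1-based and inclusive, and the toy sequence is invented.

```python
import re

def polyproline_spans(seq, min_len=3):
    """Return 1-based inclusive (start, end) positions of every maximal run
    of at least min_len consecutive prolines in seq."""
    pattern = re.compile(r"P{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq.upper())]

def summarize_spans(spans):
    """Number of separate spans and the length of the longest one."""
    lengths = [end - start + 1 for start, end in spans]
    return {"n_spans": len(spans), "longest": max(lengths, default=0)}

# Hypothetical sequence: a tri-proline at 2-4 and a penta-proline at 7-11.
spans = polyproline_spans("MPPPAAPPPPPK")
summary = summarize_spans(spans)
```

Because the quantifier is greedy, each run is reported once at its full length rather than as overlapping triplets.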
We examined 11 proteins that start with a tri-proline repeat (ZBTB4, DHX34, HINFP, CD19, IRGQ, ELMO1, ELMO2, MAP1LC3C, IGLON5, IDS, and CPZ), along with 4 proteins that end with a tri-proline repeat (PYDC2, ARHGEF15, RHBDL1, and OR10S1). These proteins share no sequential patterns of amino acids and no apparent functional commonalities. Protein OR10S1, which ends with a tri-proline, is an olfactory receptor protein that interacts with odorants and triggers a neuronal response. This pattern was not found in the other proteins that are believed to be odorant detectors: 52 NP olfactory receptor protein, MOR256-8, MOR256-17, MOR256-22, OMP olfactory marker protein. Overall, we could find no association between polyproline at the initial or terminal ends and protein function.
Consecutive sequences of six or more prolines are associated with DNA/RNA processing (including zinc fingers), actin, and developmental processes. There are 27 proteins that contain from 12 to 27 consecutive repeats, 13 proteins with 11 repeats, 21 proteins with 10 repeats, and 30 proteins with 9 repeats. Their functional roles are shown in Table 1. Of the total 91 proteins in the above groups, there are 42 proteins (45%) associated with DNA/RNA processing, including 14 zinc finger proteins (15%), and 11 proteins associated with actin (12%).
| Table 1. Association of functional classes with polyproline long repeats. |
Zinc Finger Proteins
Because zinc finger proteins appear to be overrepresented, we focused on the structure and function of these molecules to gain further insight into the role of polyproline. None of the first 10 displays a regular recurring motif: PCLO, ZIF268, ZFP746, ZNF827, Zinc Family Member 5, Zinc Finger CCCH domain, ZFHX4, Zinc Finger Protein 318, Zinc Finger Homeobox protein 3, ZNF367, ZFP579, ZFPM1 (Figure S1). We suspect that the complex configurations introduced by polyproline helices disrupt long continuous motifs. On the other hand, acute angular changes in conformation could serve the geometric requirements of highly articulated intra- and intermolecular interactions.
In some zinc finger proteins (Kruppel type) that contain only singlet or doublet prolyls (that is, no triplets or longer runs of prolines, and hence none of the helices these form), there is an amino acid motif in which a prolyl recurs every 28 residues (Figure S2). A 28-residue conserved motif is a well-known feature of some zinc finger structures [24], [25]; we call it TWEAZR (Twenty-Eight Amino acid Zinc finger Repeat). It includes a linker sequence TGEH. The proline is followed by a YKCEEC sequence, and later by an HXXXH sequence. The two cysteines and the two histidines coordinate a zinc atom. For instance, among the first ten proteins free of polyproline sequences (ZNF100, ZFP726, ZFP729, ZFP732, ZFP733, ZFP736, ZFP737, ZFP739, ZNF741), we found the TWEAZR motif in each, with proline recurring every 28 residues. By contrast, in zinc finger proteins that contain prolyl triplets and their miniature helices, as well as longer consecutive repeats that may encompass such helices, this pattern breaks down, possibly because the small polyproline helices insert irregularities into the larger spiral contours of this class of zinc finger molecules.
Figure S3 shows the pattern repeated in ZNF729.
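A proline recurring every 28 residues can be screened for by chaining prolines that sit exactly one period apart. The sketch below is an illustration only and is not the procedure used to identify TWEAZR here; the function name, the exact-spacing criterion, and the toy 28-residue unit (assembled from the PYKCEEC, HXXXH, and TGEH elements named above, with alanine filler in an assumed order) are all hypothetical.

```python
def longest_periodic_proline_chain(seq, period=28):
    """Length of the longest chain of prolines spaced exactly `period` apart.

    chain[i] is the number of prolines in the chain that ends at index i,
    so a sequence with k tandem 28-residue units sharing a proline at the
    same offset yields a value of k.
    """
    s = seq.upper()
    chain = [0] * len(s)
    best = 0
    for i, residue in enumerate(s):
        if residue == "P":
            previous = chain[i - period] if i >= period else 0
            chain[i] = previous + 1
            best = max(best, chain[i])
    return best

# Toy unit of 28 residues; the filler and element order are assumptions.
TOY_UNIT = "PYKCEEC" + "A" * 12 + "HAAAH" + "TGEH"
toy_chain = longest_periodic_proline_chain(TOY_UNIT * 4)
```

In proline-rich sequences, short chains can arise by chance, so any threshold on the chain length would need to be chosen with that in mind.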
We conducted a detailed analysis of proline in zinc finger proteins according to their number of consecutive prolyl repeats, from 2 to 27. Among the 95 members containing 9 to 27 consecutive repeats, there are 13 zinc finger proteins (13.4%) (Table 2). Among molecules that contain consecutive prolyl spans of three or more (highest 22), there are 4,245 proteins, of which 83 are zinc finger proteins (1.95%). Among the proteins lacking repeats, there are 14,102 proteins, of which 425 are zinc finger proteins (3.0%). Of the first 9 zinc finger proteins that show a disorderly amino acid arrangement, 8 (89%) contain prolyl dimers in the pattern of ppx. In these 9 proteins there is a total of 63 dimers, of which 35 contain “guest” amino acids in the third position (Table S6). Common guests include glycine, asparagine, alanine, glutamine, valine, aspartic acid, histidine and lysine.
| Table 2. Abundance of Zinc Finger Proteins with Polyproline Spans. |
By contrast, among the first 10 proteins of the low-proline zinc finger protein group, only 5 contain proline dimers, all with guests in the third position (Table S7). In total there are 50 zinc finger proteins in this subgroup, containing 659,569 amino acids and 28 prolyl dimers. This is a rate of about 4 dimers per 100,000 residues (0.004%). Thus prolyl dimers are rare in such zinc finger proteins.
The repetitive pattern, TWEAZR, with prolyls recurring every 28 residues, was not found in any of the 10 zinc finger proteins containing the longest consecutive spans or in proteins with the highest percentage of prolyl residues (20%). To determine whether the presence of one or more proline trimers in a molecule is associated with the presence or absence of TWEAZR, we compared the frequency of the motif among zinc finger proteins that contain one or more trimers with its frequency among those that contain none. There are 43 zinc finger proteins in the first group. Of these, only two (ZNF189 and ZNF283) display the repetitive motif, a frequency of 4.7%. Notably, in both cases the ppp triplet is located near the beginning of the lead sequence. In ZNF189 (612 amino acids) the triplet prolyls occupy positions 6, 7, and 8. In ZNF283 (679 amino acids) the prolyl triplet occupies positions 22, 23, and 24. On the other hand, of the first 43 zinc finger proteins in the “0” category, in which the molecules contain no consecutive prolyl spans longer than 2 (that is, dimers at most), 32 display the recurring TWEAZR motif (74%) and 11 (26%) lack it.
These results indicate that a polyproline sequence of three is unlikely to be associated with the presence of the recurring TWEAZR motif within a zinc finger molecule. In the unusual instances in which tri-prolines are present, they appear to be limited to the lead sequence and do not occur within the repetitive domains.
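Only the raw frequencies are reported here (2 of 43 proteins with trimers display the motif, against 32 of 43 without). A reader wishing to quantify that contrast could apply a one-sided Fisher exact test to the 2×2 table. The sketch below is an added illustration using only the Python standard library, not an analysis performed in this study; the function name is hypothetical, and the test treats the two groups of 43 as independent samples, which is an assumption.

```python
from math import comb

def fisher_exact_lower_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of observing a count of `a` or
    fewer in the top-left cell, with all row and column totals held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, col1 - (c + d))  # smallest top-left count the margins allow
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(lowest, a + 1)
    )
    return numerator / comb(n, col1)

# Rows: with trimers, without trimers. Columns: motif present, motif absent.
p_value = fisher_exact_lower_tail(2, 41, 32, 11)
```

A small `p_value` would indicate that the scarcity of the motif among trimer-containing proteins is unlikely to be a sampling accident.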
We suggest that tri-proline helices disrupt the repetitive amino acid sequences of zinc finger proteins that we have noted and that have been previously described [26], [27]. This conclusion is supported by the fact that, of the 144 proteins in which there are 8 to 27 polyproline repeats, only formin-2 (NP_064450.3), a 1,722 amino acid actin-associated protein, displays a repetitive motif. The motif consists of 22 consecutive sequences, each comprising a quintet of prolines followed by lpgagi, commencing at residue number 976 and ending at residue number 1211 (Figure S4). Formins are multidomain proteins that are involved in actin nucleation [28], [29].
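A tandem repeat of the kind described for formin-2, five prolines followed by lpgagi, can be located with a grouped regular expression. The sketch below assumes exact copies of the unit, which real repeats need not be; the function name and the toy sequence are hypothetical, and the formin-2 sequence itself is not reproduced.

```python
import re

def tandem_repeats(seq, unit=r"P{5}LPGAGI"):
    """Return (start, end, copies) for each maximal run of back-to-back
    matches of `unit`, with 1-based inclusive positions."""
    results = []
    for m in re.finditer(r"(?:%s)+" % unit, seq.upper()):
        copies = len(re.findall(unit, m.group()))
        results.append((m.start() + 1, m.end(), copies))
    return results

# Toy sequence: three exact copies of the 11-residue unit after a 3-residue lead.
hits = tandem_repeats("AAA" + "PPPPPLPGAGI" * 3 + "KK")
```

Imperfect copies would break a run into several shorter hits, so a tolerant matcher would be needed to recover a repeat region that is not perfectly regular.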
Genetic and Acquired Disorders Related to Proline
There is a voluminous literature about hereditary disease caused by mutations involving proline. PubMed lists 6,068 citations, as of 26 May 2012. Little is known about acquired disease in humans caused by the ingestion of azetidine-2-carboxylic acid (Aze), the lower homologue of proline, containing four members in its ring instead of five (). It is a constituent of the diet. Aze eludes the gatekeeping function of prolyl aminoacyl tRNA synthetases, and is misincorporated into proteins in place of proline [1], [30], [31].