To generate our database, we defined AltORFs as ORFs located in a non-canonical reading frame of the RefORF, in the 5′ or 3′UTR of an mRNA, or partially overlapping both the RefORF and a UTR. We based our prediction algorithm on characteristics of known AltORFs (AUG as translation initiation site (TIS), and a stop codon different from the RefORF stop codon), and added a size cut-off of 40 codons to keep the database to a reasonable list of polypeptides readily analysable by LC-MS/MS or detectable by SDS-PAGE (Database S1). Criteria included in previous predictions (conservation among species, presence of an optimal Kozak context around the initiator AUG codon, location within the reference coding sequence) were not taken into account in this study because experimental evidence indicates that these criteria are not necessarily required for an AltORF to be expressed. Our database predicts 83,886 unique AltORFs with a minimum size of 40 codons. Most predicted AltORFs overlap RefORFs (41.09%) or populate 3′UTRs (46.55%). The majority of mRNAs (87.58%) have at least one predicted AltORF, and there is an average of 3.88 AltORFs per mRNA. These proportions are in agreement with the number of detectable TIS determined by ribosome profiling. Most predicted AltORFs have fewer than 100 codons, and the median alternative protein length is 57 amino acids, compared to 344 for the conventional proteome.
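The prediction criteria described above (AUG start, a stop codon different from the RefORF stop codon, at least 40 codons) can be illustrated with a minimal Python sketch. This is not the pipeline used to build the database: the function names `find_altorfs` and `classify`, the coordinate convention, the location labels, and the choice to count the start codon but not the stop codon towards the size cut-off are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 40  # size cut-off stated in the text


def classify(start, end, ref_start, ref_end):
    """Label an ORF by its position relative to the RefORF (hypothetical labels)."""
    if end <= ref_start:
        return "5'UTR"
    if start >= ref_end:
        return "3'UTR"
    if start >= ref_start and end <= ref_end:
        return "within RefORF"
    return "overlapping RefORF and UTR"


def find_altorfs(mrna, ref_start, ref_end, min_codons=MIN_CODONS):
    """Return a list of (start, end, location) tuples for candidate AltORFs.

    mrna       -- mRNA sequence in the DNA alphabet (T instead of U)
    ref_start  -- 0-based index of the RefORF start codon
    ref_end    -- index just past the RefORF stop codon (end-exclusive)
    min_codons -- minimum number of sense codons (assumption: the start
                  codon is counted, the stop codon is not)
    """
    mrna = mrna.upper()
    hits = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(mrna):
            if mrna[pos:pos + 3] != "ATG":
                pos += 3
                continue
            # Walk codon by codon to the first in-frame stop codon.
            stop = pos + 3
            while stop + 3 <= len(mrna) and mrna[stop:stop + 3] not in STOP_CODONS:
                stop += 3
            if stop + 3 > len(mrna):
                break  # no in-frame stop codon before the end of the mRNA
            end = stop + 3
            n_codons = (stop - pos) // 3
            # An ORF that ends on the RefORF stop codon is in the reference
            # frame (the RefORF itself or an isoform of it), so it is skipped.
            if end != ref_end and n_codons >= min_codons:
                hits.append((pos, end, classify(pos, end, ref_start, ref_end)))
            pos = end  # first-AUG rule: resume the scan after this stop codon
    return hits
```

Calling `find_altorfs(mrna, ref_start, ref_end)` on an annotated transcript returns half-open coordinates that can be sliced directly from the sequence. Lowering `min_codons` is only useful for testing on toy sequences.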
Using this novel alternative protein database as well as GenBank protein entries, we analyzed a HeLa cell proteomic data set that we had previously generated by LC-MS/MS. A total of 68,035 peptides from 5,558 reference proteins and 280 peptides from 129 alternative proteins were identified (Table S1). The mean sequence coverage for reference and alternative proteins was 28.8% and 32.3%, respectively. Overall, alternative proteins represented 2.27% of the total identified proteins. This result clearly shows that the contribution of alternative proteins to the proteome, and thus the number of multiple-coding genes, has been overlooked. It is noteworthy that the coding sequences of alternative proteins are spread across the different regions of mRNAs, in agreement with the predicted distribution. Co-expression of an alternative protein and its reference protein was observed for 42 genes (Table S1). For each of these genes, the average peptide intensity plot of both the reference and alternative proteins revealed large variations in co-expression ratio, indicating that a reference protein might not always be the main protein product of a gene. To confirm the expression of alternative proteins in cell lines other than HeLa, we performed LC-MS/MS on human colon cell lines and identified 45 alternative proteins (Table S1). AltORFs associated with these 45 proteins were distributed within UTRs and RefORFs with frequencies comparable to those observed in HeLa cells. Comparative analysis of alternative proteins detected in both HeLa cells and colon cell lines indicated that 14 are expressed in at least two cell lineages. This is more than expected by chance (Fisher's exact test, p
Summary of LC-MS/MS analyses of human samples.
Endogenous expression of alternative proteins in cultured cells.
SDS-PAGE in combination with LC-MS/MS is generally limited to the analysis of proteins above 10 kDa, and a low molecular weight is a known limitation in protein identification by MS. Since the majority of the predicted alternative proteome is composed of proteins shorter than 90 amino acids, which have a predicted molecular weight below 10 kDa, it is not surprising that we detected many more peptides from the conventional proteome than from the alternative proteome. To further assess the abundance of the alternative proteome relative to the conventional proteome, HeLa cell proteins were separated by 1-D SDS-PAGE, and one gel slice between the 4.6 and 10 kDa markers was digested with trypsin. The resulting peptides were analyzed by LC-MS/MS. A total of 44 reference and 14 alternative proteins were detected, and alternative proteins represented 24.14% of the total identified proteins (Table S1), showing that alternative proteins are enriched in the pool of small cellular proteins. The detection of alternative proteins with MW between 4.78 and 9.49 kDa (Table S1) is further evidence that peptides were not misassigned and that these alternative proteins are actually expressed.
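The two pieces of arithmetic in this paragraph can be checked with a short Python sketch. It assumes the common rule of thumb of about 110 Da per amino acid residue, which is an approximation and not the method used to predict molecular weights in the database; the function names are hypothetical.

```python
AVG_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average residue mass (assumption)


def approx_mw_kda(n_residues):
    """Rough molecular weight in kDa for a protein of n_residues amino acids."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000.0


def percent_alternative(n_alt, n_ref):
    """Share of alternative proteins among all identified proteins, in percent."""
    return 100.0 * n_alt / (n_alt + n_ref)


# A 90-residue protein sits just under the 10 kDa limit of the gel analysis.
print(approx_mw_kda(90))
# Gel slice between the 4.6 and 10 kDa markers: 14 alternative, 44 reference.
print(round(percent_alternative(14, 44), 2))
```

Under this approximation, the median alternative protein of 57 amino acids would weigh roughly 6.3 kDa, which falls inside the excised gel slice.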
Next, we tested the expression of alternative proteins in a variety of human tissues by LC-MS/MS. First, we analyzed normal colon and lung tissues and detected 13 and 40 alternative proteins, respectively (Table S2). In a second set of experiments, we analyzed ovarian cancer tissue areas and normal areas from the same formalin-fixed, paraffin-embedded tissue section of two patients, one presenting endometrioid ovarian cancer and the other presenting serous ovarian cancer (Fig. S1). A total of 19 alternative proteins were identified in the normal endometrium, endometrioid ovary, serous ovary, normal ovary, and serous fallopian tube (Table S2). We completed these proteomic studies with human fluids, including cerebrospinal fluid, urine, plasma, and serum, identifying 16, 47, 90, and 928 alternative proteins, respectively (Table S3). Strikingly, alternative proteins represent approximately 55% of the proteins identified in plasma and serum. Overall, we detected a total of 1,259 alternative proteins, and 47 were expressed in different cell lines and/or tissues (Table S4).
In accordance with the scanning model of translation initiation, we used the first-AUG rule to predict the TIS of the AltORFs present in our database. Since non-AUG codons can also be used as TIS, we used two independent methods to test the reliability of our TIS prediction for the alternative proteins detected above. First, the detection of N-acetylated peptides, a modification specific to protein N-termini, in 889 of the 1,259 alternative proteins detected throughout our different LC-MS/MS experiments allowed us to determine that in most cases (886/889), the alternative TIS predicted in our database was correct (Table S4). Second, we randomly selected 6 of the 129 alternative proteins detected by LC-MS/MS in the fractionated HeLa cell lysate and tested their co-expression with the corresponding reference proteins. A strategy based on the transfection of constructs with two tags, an HA tag in frame with the reference protein and a GFP tag in frame with the alternative protein, was used to report the co-expression of both proteins in transfected cells. The corresponding alternative proteins were all detected by both western blot and GFP fluorescence. Importantly, inactivating mutations (AUG to AAG) of the predicted alternative TIS significantly reduced their expression.
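Both validation steps lend themselves to a small Python illustration. The sketch below is hypothetical and is not the analysis code behind Table S4: it assumes that an N-acetylated peptide supports a predicted TIS when it matches the predicted protein either from the initiator methionine or from the residue that follows it (the initiator methionine is frequently removed before acetylation), and it models the AUG-to-AAG knock-out as a single-base substitution in a cDNA written in the DNA alphabet.

```python
def n_terminal_peptide_supports_tis(predicted_protein, peptide):
    """True if the peptide starts at residue 1, or at residue 2 after Met removal."""
    if not peptide:
        return False
    if predicted_protein.startswith(peptide):
        return True
    # Assumption: initiator methionine excision exposes residue 2 as N-terminus.
    return predicted_protein.startswith("M") and predicted_protein[1:].startswith(peptide)


def knock_out_tis(cdna, tis_index):
    """Return the cDNA with the ATG at tis_index mutated to AAG."""
    if cdna[tis_index:tis_index + 3].upper() != "ATG":
        raise ValueError("no ATG at the given position")
    return cdna[:tis_index + 1] + "A" + cdna[tis_index + 2:]


# Toy example: a made-up protein sequence and two candidate peptides.
protein = "MASKLTPEER"
print(n_terminal_peptide_supports_tis(protein, "ASKLTPEER"))  # N-terminal after Met removal
print(n_terminal_peptide_supports_tis(protein, "SKLTPEER"))   # internal peptide
print(knock_out_tis("CCATGGCT", 2))
```

A peptide that matches only internally does not contradict the prediction; it simply carries no information about the start site, which is why the check is restricted to the first two positions.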
Transfection of tagged constructs validates the expression and translation initiation site prediction of alternative proteins detected by LC-MS/MS.
Transfection of cDNAs in cultured cells is a routine technique in most laboratories. The possible unnoticed co-expression of an alternative protein with the reference protein could be a major issue, as 67.36% of human protein-coding genes are predicted to have at least one AltORF contained within the RefORF. We selected 6 well-studied RefORFs from the AltORFs database, including the tumor suppressor p53. The strategy used to detect the co-expression of reference and alternative proteins is shown in the corresponding figure. After transfection, we determined that each cDNA led to the constitutive co-expression of the alternative and reference proteins, as observed by western blot and fluorescence. Diverse subcellular distributions could be observed among the tested constructs, suggesting a variety of possible functions associated with alternative proteins.
Co-expression of alternative and reference proteins in cDNA transfection experiments is common.
Many cDNA clones identified in large-scale screening assays, including yeast two-hybrid (Y2H) studies, do not match any known protein of the conventional proteome because they represent out-of-frame clones. In Y2H, these unknown interacting proteins are usually rejected as false-positive hits; yet, we reasoned that a proportion of such clones could represent alternative proteins with real affinity for the bait. We found in the literature the partial sequences of five out-of-frame clones from a Y2H experiment performed with the tandem BRCT domain of breast cancer susceptibility protein 1 (BRCA1). One sequence was 100% identical to an alternative protein from our database whose AltORF is located in the 3′UTR of the mRNA produced from the MRVI1 gene. AltMRVI1EGFP was cloned and transfected into HeLa cells. Similar to BRCA1, AltMRVI1EGFP localized to the nucleus. We confirmed the interaction between BRCA1 and AltMRVI1EGFP by co-immunoprecipitation. Thus, AltMRVI1 is possibly a novel BRCA1-interacting protein that was already identified by Y2H but mistakenly rejected as a false-positive hit.
The alternative protein AltMRVI1 interacts with BRCA1.
To assess the evolutionary conservation of predicted alternative proteins, we generated a database of AltORFs present in mature mRNAs from different eukaryote species (Databases S2–S8), and predicted 5,019 distinct alternative proteins in Saccharomyces cerevisiae, 35,532 in Caenorhabditis elegans, 38,248 in Drosophila melanogaster, 52,454 in Xenopus tropicalis, 82,305 in Mus musculus, 57,492 in Bos taurus, and 79,874 in Pan troglodytes (Table S5). The median AltORF size of these predicted proteins ranges from 50 to 57 codons, showing that, as observed for humans, the alternative proteome is composed of small proteins compared to the reference proteome (Table S5). Although the homology between reference proteins is greater than that between alternative proteins, we identified thousands of human alternative proteins conserved with predicted AltORFs in vertebrates and hundreds in invertebrates; surprisingly, 13 alternative proteins are conserved between human and yeast, with a median sequence identity of 47.8%.
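The text does not state how sequence identity was computed, and several conventions exist. As a hedged illustration only, the Python sketch below computes percent identity over the aligned (non-gap) columns of a pairwise alignment supplied as two equal-length gapped strings, then takes the median across pairs. The function name, the gap character, the choice of denominator, and the toy sequences are all assumptions.

```python
from statistics import median


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues among columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    aligned_columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are excluded from the denominator (assumption)
        aligned_columns += 1
        if a == b:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


# Toy alignments (made-up sequences, not taken from the databases).
pairs = [("MKT-LV", "MRTALV"), ("MAAS", "MAAS"), ("MKLV", "MRIV")]
identities = [percent_identity(a, b) for a, b in pairs]
print(median(identities))
```

Dividing by the full alignment length or by the shorter sequence length instead would give lower values for gapped alignments, so the convention should be stated whenever identity figures are compared.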
Alternative proteins are conserved from human to yeast.