We partitioned the reporters into five categories of descending confidence: fully valid; RefSeq RNA valid; other gene valid; possibly valid; and invalid. Lists of these reporters and, where possible, the associated genes and transcripts are provided as additional files 1. Counts of the number of reporters in each category are shown in Table ; for all tables, our results were tabulated by chromosome, but since we detected no chromosome-specific effects, only genome-wide totals are shown. Fully valid reporters can be associated with the transcripts of a single gene and with chromosomal locations that lie within the same gene. RefSeq RNA valid reporters are associated with the RefSeq RNA transcripts of a single gene but, for one of the reasons described below, cannot be placed within the chromosomal extent of a single gene. Other gene valid reporters can be placed on the chromosome within the location of a single gene, but the reporter lies either in an intron of that gene or in a transcript not yet in RefSeq. Possibly valid reporters can be placed at a unique location on the human reference assembly, but this location is not the position of a gene currently in Entrez Gene. Despite the lack of solid evidence associating a possibly valid reporter with a gene, it makes sense to include these reporters in studies relating expression levels to genomic positions (e.g., reference [10]). The division into categories depends on the current contents of the RefSeq and Entrez databases, which change frequently.
Reporters divided into the five categories discussed in Results
As explained in Methods, we restricted our analysis to the 42683 reporters labeled as being on an autosome or on the X chromosome. We consider 25505 (60%) reporters to be fully valid and associated with a single gene in RefSeq. A fully valid reporter must have MegaBLAST alignments to the RefSeq RNA transcripts of exactly one gene in Entrez Gene, and at least one of those alignments must be high-quality as defined in Methods. An example of a reporter that fails the uniqueness test is 1591a, which has 100% identical matches to two related genes, PRAME16 and PRAME17. The alignment to a transcript should be in the forward ("sense") direction; reporters that have only alignments in the reverse direction are invalid. If a reporter has alignments in the reverse direction but also has at least one high-quality forward alignment to a RefSeq RNA whose status is more definite than "model", then the reverse alignments are ignored and the reporter remains under consideration for being fully valid or RefSeq RNA valid. Table summarizes the results of aligning reporters to RefSeq RNA. The distinction between fully valid and RefSeq RNA valid is that fully valid reporters are placed by Splign on the reference assembly of the chromosome specified by Agilent and within the Entrez extent of exactly one gene, which must be the same gene found by alignment to RefSeq RNA. Fully valid reporters are permitted to have additional alignments to their corresponding gene, to the reverse complement of a different gene, to intergenic regions, or to untranscribed pseudogenes.
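The RefSeq RNA alignment rules described above can be sketched as a small decision function. This is an illustrative sketch only, not the pipeline used in the study: the `RnaAlignment` record, the `rna_stage` function, and the returned labels are hypothetical names. It also makes two assumptions that the text does not state: any status other than "model" counts as "more definite than model", and a reporter whose reverse alignments are not rescued by such a forward alignment is treated as invalid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RnaAlignment:
    """One MegaBLAST hit of a reporter against a RefSeq RNA (hypothetical record)."""
    gene_id: str        # gene that owns the transcript
    forward: bool       # True for a sense alignment, False for reverse
    high_quality: bool  # passes the quality threshold defined in Methods
    status: str         # RefSeq record status, e.g. "model" or "reviewed"


def rna_stage(alignments):
    """Return the outcome of the RefSeq RNA alignment stage for one reporter."""
    if not alignments:
        return "no_refseq_alignment"
    forward = [a for a in alignments if a.forward]
    reverse = [a for a in alignments if not a.forward]
    if reverse:
        # Reverse hits are ignored only if a high-quality forward hit to a
        # non-"model" record exists (assumption: non-"model" = more definite).
        rescued = any(a.high_quality and a.status != "model" for a in forward)
        if not rescued:
            return "invalid_reverse_complement"
    genes = {a.gene_id for a in forward}
    if len(genes) > 1:
        return "invalid_multiple_genes"
    if not any(a.high_quality for a in forward):
        return "no_high_quality_alignment"
    return "single_gene"


# A reporter with identical forward hits to two genes fails the uniqueness test.
two_genes = [
    RnaAlignment("PRAME16", True, True, "reviewed"),
    RnaAlignment("PRAME17", True, True, "reviewed"),
]
print(rna_stage(two_genes))  # invalid_multiple_genes
```

A reporter that returns `single_gene` is still only a candidate: whether it ends up fully valid or RefSeq RNA valid depends on the chromosomal placement by Splign.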
Results of aligning all eligible reporters to the database of human RefSeq RNAs
There are 1859 (4%) reporters that align to the transcripts of a single RefSeq gene, but that Splign did not place within the accepted extent of exactly one gene. We do not consider such reporters to be fully valid, but rather place them in the category "RefSeq RNA valid." Reporters in this category fall into four distinct classes (Table ). The first class consists of reporters not placed within the accepted extent of any gene. For example, reporter 22872b is supposed to match gene SLC35D1, but maps approximately 3.6 kbp away from the RefSeq extent of SLC35D1. The second class includes reporters placed within the extent of more than one gene. The additional alignments may be to introns, or may instead be to parts of transcripts not yet included in the RefSeq RNA database. The third class includes reporters that are placed at a single location on the reference assembly, but this position lies within the extent of more than one gene because the genes themselves overlap. For example, reporter 5099c matches gene KTI12, which lies in an intron of another gene, TXNDC12. We consider it likely that reporters in this third class identify the transcripts of a single gene, but to avoid potential confusion we do not include these reporters in the list of those that are fully valid. The fourth class consists of reporters that Splign places within a gene different from the one found by aligning to RNA transcripts. We did not need to consider the class of genes on chromosome X that are duplicated on the Y chromosome; no reporters mapped to the positions of these genes.
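The four classes can be illustrated with a short sketch that compares the gene found by RNA alignment with the genes whose extents contain each Splign placement. The function `placement_class`, its labels, and the representation of a placement as a set of gene symbols are hypothetical choices made for illustration; strand handling, pseudogenes, and the chromosome check are deliberately left out.

```python
def placement_class(rna_gene, placements):
    """Classify a reporter that aligned to the RefSeq RNAs of one gene.

    rna_gene   -- gene found by alignment to RefSeq RNA
    placements -- one set per genomic placement, holding the genes whose
                  extent contains that placement (empty set = intergenic)
    """
    genes = set().union(*placements)
    if not genes:
        return "class 1: outside the extent of any gene"
    if genes == {rna_gene}:
        return "fully valid"
    if len(genes) == 1:
        return "class 4: placed in a different gene"
    if len(placements) == 1:
        return "class 3: single location in overlapping genes"
    return "class 2: placed within more than one gene"


# Placement near, but outside, the extent of its gene (cf. reporter 22872b).
print(placement_class("SLC35D1", [set()]))
# Single placement inside a gene nested in an intron of another (cf. 5099c).
print(placement_class("KTI12", [{"KTI12", "TXNDC12"}]))
```

Representing intergenic placements as empty sets lets a fully valid reporter keep additional intergenic alignments, in line with the allowance described for that category.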
Results of placing the reporters that align with the RefSeq RNA transcripts of a single gene on the chromosome
We do not require that the GeneID associated with a reporter during the validation process matches the GeneID supplied by Agilent, nor that the chromosomal positions match exactly. GeneIDs may differ because the Entrez and RefSeq databases are not static, and sometimes updates to records merit changes in an identifier. The positions found by Splign sometimes disagree with the positions supplied by Agilent because a splice site occurs within the alignment, and Splign resolves the splice differently. There are also 87 reporters that are fully valid, but align to more than one position within their identified gene.
Excluding the 27364 (64%) fully valid and RefSeq RNA valid reporters leaves 15319 (36%) reporters. Among those, 4472 were found to be invalid by the tests of alignment to RefSeq RNA (Table : sum of reverse complement and multiple genes), leaving 10847 reporters for further evaluation. We were able to associate 7603 of these with a putative transcript included in Entrez Nucleotide or the Gene Index Database [1] using the identifier provided in the annotation file; the remaining 3244 (10847 - 7603) reporters were considered invalid and were excluded from further consideration by rules described in Methods (see Table ).
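The counts quoted in this paragraph follow from one another by subtraction, which the short check below makes explicit. The variable names are ours; the numbers are those given in the text.

```python
total = 42683                  # reporters on an autosome or the X chromosome
fully_valid = 25505
refseq_rna_valid = 1859
invalid_by_rna_alignment = 4472
with_putative_transcript = 7603

rna_associated = fully_valid + refseq_rna_valid
remaining = total - rna_associated
for_further_evaluation = remaining - invalid_by_rna_alignment
without_transcript = for_further_evaluation - with_putative_transcript


def pct(n):
    """Percentage of all eligible reporters, rounded to a whole number."""
    return round(100 * n / total)


print(rna_associated, pct(rna_associated))  # 27364 64
print(remaining, pct(remaining))            # 15319 36
print(for_further_evaluation)               # 10847
print(without_transcript)                   # 3244
```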
Counts of reporters associated with a putative transcript not in RefSeq
We used Splign to place the remaining 7603 reporters on the human reference assembly. Table summarizes the results of performing this placement. Of the 7603 reporters, 2187 are placed within the extent of exactly one gene, and this gene is on the chromosome specified in the annotation file. We refer to these reporters as "other gene valid". Reporters that also align to the reverse complement of a gene are not included in this category, because we do not have sufficient evidence to suggest that the forward copy, rather than the reverse, is transcribed. Moreover, 3168 of the 7603 reporters may be placed at a single position on the reference assembly, though this position is not the location of a known gene. We categorize these reporters as "possibly valid" (column 5, Table ). Any secondary placement of a reporter suffices to exclude it from this category, because Entrez Gene does not provide sufficient evidence to determine which unit of transcription, if any, the reporter measures.
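The two placement-based categories can be summarized in a brief sketch. It is a simplified reading of the rules above, not the procedure actually run: the `Placement` record and `classify_placed` function are hypothetical, and the sketch assumes that every reporter meeting neither definition is invalid and that the chromosome check applies only to the "other gene valid" category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """One Splign placement of a reporter on the assembly (hypothetical record)."""
    chromosome: str
    sense_genes: frozenset      # genes containing the placement, forward strand
    antisense_genes: frozenset  # genes matched only as a reverse complement


def classify_placed(annotated_chromosome, placements):
    """Assign a reporter without a RefSeq RNA alignment to a category."""
    if not placements:
        return "invalid"
    sense = set().union(*(p.sense_genes for p in placements))
    antisense = set().union(*(p.antisense_genes for p in placements))
    if len(sense) == 1 and not antisense:
        # The single gene must lie on the chromosome named in the annotation file.
        in_gene = [p for p in placements if p.sense_genes]
        if all(p.chromosome == annotated_chromosome for p in in_gene):
            return "other gene valid"
        return "invalid"
    if len(placements) == 1 and not sense and not antisense:
        # Unique position outside every known gene; any second placement excludes it.
        return "possibly valid"
    return "invalid"


none = frozenset()
print(classify_placed("1", [Placement("1", frozenset({"GENE_A"}), none)]))
print(classify_placed("1", [Placement("1", none, none)]))
```

Here `GENE_A` is a placeholder symbol. The first call prints `other gene valid` and the second prints `possibly valid`.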
Results of placing the reporters that do not align with a RefSeq RNA transcript