We downloaded all 309,976 raw Sanger ESTs for Pinus taeda
from NCBI Trace Archive (http://trace.ncbi.nlm.nih.gov/Traces/trace.cgi
), which were generated by three different sequencing centers or labs including previously well-known TIGR institute. For these raw ESTs, we were able to collect the complete information about cDNA library construction protocol (i.e.
, plasmid vector, adapter or linker sequences, restriction enzyme sites, sequence name convention and associated sequencing directions), which is required by AFST to conduct accurate pattern analysis. In particular, 230,783 out of 309,976 ESTs were submitted by each center or lab into GenBank dbEST as final clean ESTs after raw EST cleanup and trimming. Therefore, this dataset represents a valuable benchmark for us to evaluate AFST performance. Also due to the availability of the complete cDNA library construction information, we were able to use two other popular Sanger EST cleanup tools – Lucy [13
] and SeqClean [15
] to process 309,976 raw ESTs for performance comparison with AFST. In order to demonstrate that our protocol performs well for cDNA libraries other than those in Pinus taeda
, we downloaded all Arachis hypogaea
(peanut) ESTs (86,939) from dbEST, which were deposited by many labs and investigators. Among them, the biggest data set was 38,709, whose complete cDNA library construction protocol information (i.e.
, pBluescript II SK as the plasmid vector and EcoR
I and Xho
I as the restriction enzyme sites, etc.) was available by extracting and parsing dbEST records. Because cDNA library construction information is not mandatory for dbEST submission, it is often difficult to get complete cDNA library constructional information among dbEST records. Therefore, we used AFST to process 38,709 peanut ESTs and detect cDNA terminus patterns.
In our pattern analysis protocol, there are three important concepts worthy of further explanation: Pattern, Confidence score and Reasonable pair. They are also implemented in our software tool AFST to identify abnormal sequences such as those with RECA and DBT and determine final clean ESTs.
“Pattern” refers to a cDNA terminus structure detected in a raw EST sequence. It is determined by the type, number, order and context of all cDNA termini in terms of the specification (expectation) given by a specific cDNA library construction protocol. To identify the pattern, we first find all putative cDNA termini existing in the sequence, then consider good/low quality regions and vector fragment positions, and finally determine the pattern with respect to the following aspects:
(1) Type of cDNA termini. There are four major types of termini [5TSS], [3TSS], [5TNS] and [3TNS] as shown in Figure . Each of these termini can be further categorized into some sub-categories. For example, [3TSS] includes 3TSS, 3TSS-1, 3TSS-2, 3TSS-3, 3TSS-4 and 3TSS-5 (also see Figure ).
(2) Number of cDNA termini. For instance, a sequence with just one [5TSS] and another sequence with two [5TSS] have different patterns.
(3) Sequential order of cDNA termini. For example, a sequence with first a [5TSS] and then a [3TSS] has a different pattern from another sequence that has first a [3TSS] and then a [5TSS].
(4) Context of cDNA termini. Flanking region context refers to the sequence region immediately before/after a terminus. There are mainly two cases: vector fragment (represented by V) and non-vector fragment (N). Furthermore, vector or non-vector fragments can be further classified into vector fragment in high quality region (HV), vector fragment in low quality region (LV), non-vector fragment in high quality region (HN) and non-vector fragment in low quality region (LN). Context is one of the basis on which terminus’ confidence is estimated by computing a confidence score (see below).
Because of sequencing errors [16
], some in-silico
identified cDNA termini might be false positives. When a terminus defined in Figure is detected, we will quantify our confidence in its detection with a confidence score, which is calculated by considering the extent of the completeness of all required sequence elements, adjacent sequence contents (contexts) and the percentage of bases that match the whole terminus.
(1) Determine the completeness score for a given terminus (A score). The completeness score is directly reflective of the number of sequence elements in the terminus. For example, the completeness score of 5TNS, which has three sequence elements (i.e., a poly(T) tail, XhoI site, and VF2), is higher than 5TNS-1 and 5TNS-2, each of which have only two sequence elements. Comparing with 5TNS, 5TNS-1 and 5TNS-2, the completeness score of 5TNS-3 is the lowest one because it only has one sequence element. The terminus with the higher completeness score is more likely to be authentic, instead of being an artifact of a sequencing error.
(2) Score a terminus according to its flanking region context (B score). Sequence contents that match the expected structures in terms of a cDNA library construction protocol deserve a higher score. For example, we expect to detect a vector fragment sequence immediately upstream of 5TSS. Correspondingly, the vector fragment, the low-quality non-vector fragment, and the high-quality non-vector fragment detected immediately upstream of 5TSS will result in the highest (100), intermediate (50) and lowest B score (0) respectively, which will be assigned to the identified 5TSS.
(3) Score the percentage of matched bases (C
score). The percentage is calculated in terms of the detected bases that are the same as the expected bases divided by length of the terminus. For example, the 3' EST NDL1_11_A06.b1_A029
in Additional file 1
: Figure S3 C has the cDNA terminal pattern of 5TNS-4+N+3TNS-5+N+3TNS+V
. Obviously the cDNA terminus detected in the front is 5TNS-4
, whereas the terminus detected at the end can be either a 3TNS
. Because the percentage of matched bases is much lower for 3TNS-5
than for 3TNS
, as well as due to the adjacent sequence contents (contexts), 3TNS
is assigned with a higher C
score than 3TNS-5
. Therefore, we identified 3TNS
as the real terminus at the end of this EST while 3TNS-5
was a false one. The formula that determines the confidence score for a given terminus is:
Reasonable pair of detected termini
Based on the expected cDNA terminus structure shown in Figure , if the cDNA insert is short enough, we should be able to detect both the 5′ end terminus and the 3′ end terminus in an EST sequence. This also means that we are able to detect a reasonable pair of cDNA termini for some ESTs, using the following definitions: (1) for a 5′-end EST, [5TSS
] and [3TSS
] is the reasonable termini pair where [5TSS
] should be upstream of [3TSS
]; (2) for a 3′-end EST, [5TNS
] and [3TNS
] is the reasonable termini pair where [5TNS
] should be upstream of [3TNS
]; (3) the distance between the two paired termini shouldn’t be too short to contain a cDNA insert (i.e. >
In our software tool AFST, essentially, we first determine all cDNA terminal patterns according to the type, number, order and context of all expected termini, and identify and filter out RECA and DBT abnormalities. We then search for reasonable terminus pairs, calculate confidence scores for all detected termini, and select the reasonable terminus pair that yields the highest cumulative confidence scores to delineate the start and end positions of the bono fide cDNA insert. Finally, the final clean sequence is obtained by trimming off both low-quality regions and vector fragments from the sequence fragment between the two termini of the best reasonable pair.