Because of the critical influence of RNA integrity on downstream experiments, there is a strong need for a reliable, reproducible, and standardized approach to classify the quality of RNA samples. The long-standing standard, a 28S to 18S peak ratio of 2.0, was shown to correlate only weakly with RNA integrity.
The Agilent 2100 bioanalyzer, a bio-analytical device based on a combination of microfluidics, microcapillary electrophoresis, and fluorescence detection, provides a platform to record the size distribution of molecules, e.g., RNA, in a digital format. Because this measurement is highly reproducible and automated, it provides the basis for a user-independent software algorithm to evaluate the integrity of RNA samples.
For the development of the RNA Integrity Number algorithm, a total of 1208 RNA samples from various sources and in different degradation states were analyzed. After assigning the samples to 10 different categories ranging from 1 (worst) to 10 (best), methods from information theory were applied to calculate features describing the curve of the electropherogram. In the following step, features that carried high information content for distinguishing the 10 categories were selected for further processing. These features were then taken as input variables for a model-training step. Here, several models were trained using artificial neural networks, a Bayesian learning approach was used to select the most probable model, and the best model was chosen for prediction of previously unseen test data. The result produced by this procedure is an algorithm called RNA Integrity Number (RIN).
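The feature-selection step above can be illustrated with a small sketch. The text only states that features with high information content about the 10 categories were selected; the estimator below (mutual information between a binned feature and the category labels) is an assumption chosen for illustration, and the function name and binning are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_bins, categories):
    """Mutual information (in bits) between a discretized feature and labels.

    Assumption: the feature has already been binned into discrete values;
    the binning scheme itself is not specified in the text.
    """
    n = len(categories)
    joint = Counter(zip(feature_bins, categories))  # counts of (bin, label) pairs
    bin_counts = Counter(feature_bins)
    cat_counts = Counter(categories)
    mi = 0.0
    for (f, c), count in joint.items():
        p_joint = count / n
        # p(f, c) / (p(f) * p(c)) written in terms of raw counts
        ratio = p_joint * n * n / (bin_counts[f] * cat_counts[c])
        mi += p_joint * math.log2(ratio)
    return mi
```

Under this sketch, a feature whose bins separate the categories well scores high, while a feature that is independent of the categories scores zero; ranking candidate features by such a score is one plausible reading of the selection step.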
Analysis of the RIN model
The RIN algorithm is based on a selection of features that contribute information about the RNA integrity. A single feature is hardly sufficient for a universal integrity measure, whereas a combination of different features covers several aspects of the measurement and is more robust against noise in the signal (see Additional file 2 for an overview of all features). To understand why the features were selected, and to increase the confidence of application specialists, it is important to give an interpretation of the features:
1. The total RNA ratio measures the fraction of the area in the region of 18S and 28S compared to the total area under the curve and reflects the proportion of large molecules compared to smaller ones. It has large values for categories 6 to 10.
2. The height of the 28S peak contributes additional information about the state of the degradation process. Because the 28S band disappears faster than the 18S band during degradation, this feature allows the onset of degradation to be detected. It has the largest values for categories 9 and 10, and zero values for categories 1 to 3.
3. The fast area ratio reflects how far the degradation has proceeded and typically has larger values for categories 3 to 6.
4. The marker height has large values for categories 1 and 2 and small values for all other categories since short degradation products will overlap with the lower marker.
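As a rough illustration of the first two features in the list above, the following sketch computes an area ratio and a peak height from a sampled electropherogram. The region boundaries, the trapezoidal integration, and all function names are assumptions made for this example; the actual region detection and baseline handling of the bioanalyzer software are not described here.

```python
def area(times, signal, start, end):
    """Trapezoidal area under the signal over segments lying within [start, end]."""
    total = 0.0
    for i in range(1, len(times)):
        if times[i - 1] >= start and times[i] <= end:
            width = times[i] - times[i - 1]
            total += 0.5 * (signal[i - 1] + signal[i]) * width
    return total


def total_rna_ratio(times, signal, region_18s, region_28s):
    """Fraction of the total area that falls in the 18S and 28S regions.

    region_18s and region_28s are hypothetical (start, end) time windows.
    """
    ribosomal = area(times, signal, *region_18s) + area(times, signal, *region_28s)
    whole = area(times, signal, times[0], times[-1])
    return ribosomal / whole if whole > 0 else 0.0


def peak_height(times, signal, region):
    """Maximum signal value inside a (start, end) window, or 0.0 if it is empty."""
    start, end = region
    return max((s for t, s in zip(times, signal) if start <= t <= end), default=0.0)
```

With these definitions, a sample whose signal is concentrated in the two ribosomal windows yields a ratio close to 1, and a sample whose 28S band has vanished yields a 28S peak height near zero, matching the qualitative behavior described for the categories above.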
Figure 7 shows the projection of the distribution of integrity categories onto a two-dimensional space spanned by the two most important features. A global non-linear relationship can clearly be observed. The experiments are grouped along a characteristic line with varying variance. The boundaries between adjacent categories are not perfectly sharp, but they are clearly visible in this projection, with some interchanges.
Figure 7. 2D visualization of integrity categories. The figure shows a projection of the categories onto the two-dimensional space spanned by the first two features of the selected combination. These are total RNA ratio and 28S peak height.
Comparing the approaches
Using a single simple feature to judge RNA integrity was already shown to be insufficient [2]. While focusing on one aspect of the electropherogram allows for a rough orientation about the integrity, it is still subjective to a high degree. Linear models based on these features show a mean squared error that is four times (degradation factor) to sixty times (28S/18S ratio) higher than that of the proposed approach.
The reason for this tremendous difference lies in the fact that neither the 28S/18S ratio nor the degradation factor reflects all properties of the RNA degradation process. For example, several samples of integrity category 10 are labeled BLACK by the degradometer software because they have low signal intensities. This happened for 42% of the samples under consideration, which are all samples that were under investigation for microarray experiments. The degradation factor contains similar information to the fast area ratio, which reflects typical characteristics of categories 3 to 7. A high ribosomal ratio is useful for detecting a certain number of high-quality samples, but the categorization is not valid for all of them. Using several features that complement one another, together with a non-linear weighting of these features, reduced the error to a minimum value that is on the order of the natural noise in the target data. The noise results from using a categorical grid for a continuous process as well as from a few abnormalities. Interestingly, almost no interchanges over more than one categorical border are observed. Thus, the classification errors appear almost exclusively at the borderline between two categories, a distinction that was also difficult for humans to make when labeling the data.
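The two error measures discussed in this comparison, the mean squared error and the number of interchanges across one or more categorical borders, can be sketched as follows. This is only an illustration under stated assumptions: predictions are taken to be continuous values that are rounded to the nearest category, and the function names are hypothetical rather than part of any published software.

```python
import math


def mean_squared_error(predicted, target):
    """Average squared difference between predictions and category labels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def interchange_counts(predicted, target):
    """Count samples by distance between rounded prediction and label.

    Returns (exact, adjacent, distant): same category, off by one category,
    and off by more than one category.
    """
    exact = adjacent = distant = 0
    for p, t in zip(predicted, target):
        rounded = math.floor(p + 0.5)  # round half up to the nearest category
        distance = abs(rounded - t)
        if distance == 0:
            exact += 1
        elif distance == 1:
            adjacent += 1
        else:
            distant += 1
    return exact, adjacent, distant
```

In these terms, the observation in the text corresponds to the `distant` count being close to zero, with the remaining errors concentrated in the `adjacent` count.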
Availability of the RIN model
The Agilent 2100 bioanalyzer system software can be downloaded from Agilent's webpage [see Additional file 1]. Version B.01.03 and later will allow for measurement reviews (free of licenses) including the calculation of the RNA integrity number [8]. Up-to-date information about the RIN project can be found at the RIN web site [9