In this paper we described a relatively simple method to predict a classifier's performance for a given sample size, through the creation and modelling of a learning curve. As prior research suggests, the learning curves of machine classifiers generally follow the inverse power law [1]. Because the goal is to predict future performance, our method assigned higher weights to data points associated with larger sample sizes. In evaluation, the weighted methods produced more accurate predictions (p < 0.05) than the unweighted method described by Mukherjee et al.
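To make the weighted fitting idea concrete, the following is a minimal sketch rather than the implementation used in our experiments. It assumes the parameterization y = a - b * x^(-c), a weight of i/n for the i-th of n curve points (so later points count more), and a simple grid search over c. The function names, the weighting scheme and the grid range are all illustrative assumptions.

```python
def fit_inverse_power_law(sizes, scores, weights=None):
    """Weighted least-squares fit of y = a - b * x**(-c).

    For a fixed decay rate c the model is linear in (a, b), so those two
    parameters have a closed-form weighted solution. The decay rate c is
    chosen by a coarse grid search (an assumption made for simplicity).
    """
    n = len(sizes)
    if weights is None:
        # Assumed scheme: the i-th point gets weight i/n, so points at
        # larger sample sizes influence the fit more.
        weights = [(i + 1) / n for i in range(n)]
    total_w = sum(weights)
    best = None
    for step in range(1, 301):  # candidate c values 0.01, 0.02, ..., 3.00
        c = step / 100.0
        z = [x ** (-c) for x in sizes]  # regressor: y = a - b * z
        z_bar = sum(w * zi for w, zi in zip(weights, z)) / total_w
        y_bar = sum(w * yi for w, yi in zip(weights, scores)) / total_w
        s_zz = sum(w * (zi - z_bar) ** 2 for w, zi in zip(weights, z))
        s_zy = sum(w * (zi - z_bar) * (yi - y_bar)
                   for w, zi, yi in zip(weights, z, scores))
        slope = s_zy / s_zz
        a = y_bar - slope * z_bar
        b = -slope
        # Weighted sum of squared residuals for this candidate c.
        sse = sum(w * (yi - (a - b * zi)) ** 2
                  for w, zi, yi in zip(weights, z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(a, b, c, size):
    """Predicted performance at a given sample size."""
    return a - b * size ** (-c)
```

Passing `weights=[1.0] * len(sizes)` gives the unweighted fit, which makes it easy to compare the two variants on the same curve.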
The evaluation experiments were conducted on free text and waveform data, using passive and active learning algorithms. Prior studies typically used a single type of data (e.g. microarray or text) and a single type of sampling algorithm (i.e. random sampling). By using a variety of data and sampling methods, we were able to test our method on a diverse collection of learning curves and assess its generalizability. For the majority of curves, the RMSE fell below 0.01 when a relatively small sample size of 200 was used for curve fitting. We observed minimal differences between the RMSE and MAE values, which indicates low variance in the errors.
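The link between the two error measures can be illustrated with a short sketch. RMSE is never smaller than MAE, and the two coincide only when all absolute errors are equal, so a small gap implies that the error magnitudes vary little. The function names and the example values below are illustrative and are not taken from our experiments.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    n = len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n


# Hypothetical predicted and observed accuracies at three sample sizes.
predicted = [0.80, 0.85, 0.90]
observed = [0.79, 0.86, 0.89]
gap = rmse(predicted, observed) - mae(predicted, observed)
```

Here every absolute error is the same size, so `gap` is zero up to floating-point rounding. Errors of uneven size would make the RMSE strictly larger than the MAE.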
Our method also provides the confidence intervals of the predicted curves. As shown in Figure , the width of the confidence interval negatively correlates with the prediction accuracy: when the predicted value deviates more from the actual observation, the confidence interval tends to be wider. As such, the confidence interval provides an additional measure to help users decide on a sample size for additional annotation and classification. In our study, confidence intervals were calculated using a variance-covariance matrix on the fitted parameters. Prior studies have stated that the variance is not an unbiased estimator when a model is tested on new data [1]. Hence, our confidence intervals may sometimes be optimistic.
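One standard way to turn a parameter variance-covariance matrix into an interval around a predicted point is the delta method, sketched below. This is an illustration under stated assumptions and not necessarily the exact computation used in our study: it assumes the parameterization y = a - b * x^(-c), a 3x3 covariance matrix ordered as (a, b, c), and a normal quantile of 1.96 for an approximate 95% interval. The function name is hypothetical.

```python
import math


def confidence_interval(params, cov, size, z=1.96):
    """Approximate interval for the predicted score at one sample size.

    params: fitted (a, b, c) for y = a - b * size**(-c)
    cov:    3x3 variance-covariance matrix of (a, b, c), as nested lists
    z:      normal quantile (1.96 is an assumed choice for about 95%)
    """
    a, b, c = params
    decay = size ** (-c)
    # Gradient of the prediction with respect to (a, b, c).
    grad = [1.0, -decay, b * decay * math.log(size)]
    # Delta method: Var(y) is approximately grad^T * cov * grad.
    variance = sum(grad[i] * cov[i][j] * grad[j]
                   for i in range(3) for j in range(3))
    half_width = z * math.sqrt(max(variance, 0.0))
    center = a - b * decay
    return center - half_width, center + half_width
```

Because the covariance matrix is estimated from the same points used for fitting, intervals computed this way inherit the optimism noted above.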
A major limitation of the method is that an initial set of annotated data is needed. This is a shortcoming shared by other SSD methods for machine classifiers. On the other hand, depending on what confidence interval is deemed acceptable, the initial annotated sample can be of moderate size (e.g. n = 100 to 200).
The initial set of annotated data is used to create a learning curve. The curve contains j data points with a starting sample size of m0 and a step size of k. The total sample size is m = m0 + (j-1)*k. The values of m0 and k are determined by users. When m0 and k are assigned the same value, m = j*k. In active learning, a typical experiment may assign m0 as 16 or 32 and k as 16 or 32. For very small data sets, one may consider using m0 = 4 and k = 4. Empirically, we found that j needed to be greater than or equal to 5 for the curve fitting to be effective.
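The relationship between m0, k, j and m can be checked with a few lines of code. The function name below is illustrative.

```python
def curve_sample_sizes(m0, k, j):
    """Sample sizes of the j points on a learning curve.

    The first point uses m0 samples and each later point adds k more,
    so the last point uses m = m0 + (j - 1) * k samples.
    """
    return [m0 + i * k for i in range(j)]


# Five points with m0 = k = 16: the last size equals j * k.
sizes = curve_sample_sizes(16, 16, 5)
total = sizes[-1]
```

With m0 = 16, k = 16 and j = 5, the sizes are 16, 32, 48, 64 and 80, so the minimum of five points already requires 80 annotated samples.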
In many studies, as well as ours, the learning curves appear to be smooth because each data point on the curve is assigned the average value from multiple experiments (e.g. 10-fold cross validation repeated 100 times). With fewer experiments (e.g. 1 round of training and testing per data point), the curve will not be as smooth. We expect the model fitting to be more accurate and the confidence interval to be narrower on smoother curves, though the fitting process remains the same for the less smooth curves.
Although the curve fitting can be done in real time, the time to create the learning curve depends on the classification task, batch size, number of features and processing speed of the machine, among other factors. The longest experiment we performed to create a learning curve, which used active learning as the sample selection method, ran on a single-core laptop for several days, though most experiments needed only a few hours.
For future work, we intend to integrate the function to predict sample size into our NLP software. The purpose is to guide users in text mining and annotation tasks. In clinical NLP research, annotation is usually expensive and the sample size decision is often made based on budget rather than expected performance. It is common for researchers to select an initial number of samples in an ad hoc fashion, annotate the data and train a model. They then increase the number of annotations if the target performance is not reached, based on the vague but generally correct belief that performance will improve with a larger sample size. The amount of improvement, however, cannot be known without the modelling effort we describe in this paper. Predicting the classification performance for a particular sample size would allow users to evaluate the cost effectiveness of additional annotations in study design. Specifically, we plan for it to be incorporated as part of an active learning and/or interactive learning process.
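As an illustration of how such a function might support study design, the sketch below uses a fitted curve to estimate the gain from additional annotations and the smallest sample size predicted to reach a target score. It assumes the parameterization y = a - b * x^(-c) with b > 0 and c > 0. The function names and parameter values are hypothetical and do not describe our software.

```python
import math


def predicted_score(a, b, c, size):
    """Predicted performance at a given sample size."""
    return a - b * size ** (-c)


def size_for_target(a, b, c, target):
    """Smallest sample size whose predicted score reaches the target.

    Returns None when the target is at or above the fitted asymptote a,
    since the model then predicts it cannot be reached.
    """
    if target >= a:
        return None
    # Solve a - b * x**(-c) = target for x and round up.
    return math.ceil((b / (a - target)) ** (1.0 / c))


# Hypothetical fitted parameters, for illustration only.
a, b, c = 0.9, 1.5, 0.5
gain = predicted_score(a, b, c, 400) - predicted_score(a, b, c, 200)
needed = size_for_target(a, b, c, 0.82)
```

Dividing `gain` by the cost of the extra annotations gives a rough measure of cost effectiveness. Both numbers are only as reliable as the fitted curve and its confidence interval.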