Producing indices composed of multiple input variables is embedded in some data processing and analytical methods. We aim to test the feasibility of creating data-driven indices by aggregating input variables according to principal component analysis (PCA) loadings. To validate the significance of both theory-based and data-driven indices, we propose principles to review innovative indices. We generated weighted indices with the variables obtained in the first years of the two-year panels in the Medical Expenditure Panel Survey initiated between 1996 and 2011. Variables were weighted according to PCA loadings and summed. The statistical significance and residual deviance of each index in predicting mortality in the second years were extracted from the results of discrete-time survival analyses. There were 237,832 individuals surviving the first years of the panels, representing 4.5 billion civilians in the United States, of whom 0.62% (95% CI = 0.58% to 0.66%) died in the second years of the panels. Of all 134,689 weighted indices, 40,803 significantly predicted mortality in the second years both with and without the adjustment of age, sex and races. The indices significant in both models could at most lead to 10,200 years of academic tenure for individual researchers publishing four indices per year or 618.2 years of publishing for journals with an annual volume of 66 articles. In conclusion, when information is aggregated based on PCA loadings, there can be a large number of significant innovative indices comprising input variables of various predictive powers. To justify the large quantities of innovative indices, we propose a reporting and review framework for novel indices based on the objectives of index creation, variable weighting, related outcomes and database characteristics. The indices selected by this framework could lead to a new genre of publications focusing on meaningful aggregation of information.
An index or composite measure can be used to represent an idea or an outcome. Many weighted indices or composite measures are the sums of the products of variable values and equal or variable-specific weights. Although there may be differences in how these indices are developed, they have been widely used in social science[1–3], health research and biomedical investigation[4, 5]. The construction of weighted or unweighted indices involves several steps, including validation of the individual measures that make up the indices, assessment of the variability between subjects, and index scoring[1, 5]. Some differences exist across disciplines or research subjects and thus specialized methods may be preferred. Besides external validity and generalizability, the statistical significance or predictive power of the produced index regarding external outcomes is also important for wider use or subsequent application to other research topics. For example, the concept of frailty, defined as a geriatric syndrome, has been characterized by indices composed of different sets of multiple indicators, especially weight loss and reduced grip strength[5, 7, 8]. A variety of frailty indices have proven useful and statistically significant in predicting major outcomes, such as mortality, surgical outcomes and occurrence of disability.
The generation of weighted or unweighted indices has been important to operationalize abstract subjects or create new research tools. However, the use of indices or composite measures is more prevalent than many may expect. The process of producing indices that are composed of a subset of variables from a database has been embedded in several data processing and analytical methods. For example, principal components (PCs) are linear combinations of the variables according to principal component analysis (PCA). Partial least squares (PLS) regression also applies a set of loadings to the original input variables, though these differ from those obtained from PCA. Additive models use multiple functions that aggregate features with potentially dissimilar coefficients to derive new input functions for outcome prediction. Neural networks use original inputs to obtain a number of derived variables that then serve as predictors for the outcomes in multi-layer models. The implicit creation of indices, most likely with unequal weights for input variables, makes us curious about whether it is possible to reproduce and fine-tune the process of information aggregation or index generation in these methods.
Combining the conventional view that takes statistical significance as the criterion for the validity of indices with the prevalent use of data aggregation and implicit weighting of input variables, we aim to propose and test a data-driven procedure of “index mining”, a systematic search for optimal variable aggregation. Taking PCA as an example of how to assign weights to input variables, we aggregate input variables according to PCA loadings and develop a PCA-based method to generate statistically significant mortality indices. After index mining, we also propose a review framework to examine the validity of newly generated indices according to the differences we identify between the data-driven approach and prevalent theory-based index-generating methods.
This secondary data analysis study was approved by the ethics committee of the Centre hospitalier de l'Université de Montréal. We generated weighted indices with the variables from the first years of the two-year Medical Expenditure Panel Survey (MEPS) panels according to PCA loadings to predict mortality in the second years (see Fig 1 for the flowchart). First, we conducted PCA with year-one variables to obtain the loadings to construct each PC. Second, we sorted the input variables by the absolute values of their loadings in each PC to generate weighted indices. The input variables with larger absolute values of loadings were summed first for each PC. Third, the indices were the sums of the products of input variables and PCA loadings. Fourth, the statistical significance and deviance of each index in predicting mortality in the second years were extracted from the results of discrete-time event history analyses.
This study analyzed the 16 longitudinal panels released from the MEPS, which has been conducted annually among the civilian non-institutionalized population in the United States since 1996 to produce nationally representative statistics. Each panel lasted for two years and consisted of five rounds of data collection. Only year-one variables were used for PCA to predict mortality in the second years.
The 16 longitudinal two-year panels of the MEPS were pooled by variable names common to all panels. There were 1,989 common year-one variables across the 16 panels (panels beginning between 1996 and 2011; see S1 Table for the list of variables and their characteristics). Only subjects participating throughout the two-year panels were retained in the data set, in addition to those deceased before the end of the two-year panels. Administrative variables and the variables that were used to flag certain circumstances in data gathering were not used for PCA. To avoid overlapping information and increase the computational feasibility, the 789 variables containing individual information in the first years of the two-year panels were retained for further variable selection and analysis.
Reserved values that identified specific responses across all variables were recoded according to the MEPS codebooks: -2 was recoded to the answer given in the previous round, -1 to inapplicability, and the others to missing values (-3, -7, -8, and -9 for “no data in round”, “refused”, “do not know”, and “not ascertained”, respectively; see S1 Table for the percentages of observations in these categories). The skewness of continuous variables was evaluated without adjusting for survey design. Log transformation was applied if the skewness of the log-transformed variable was less than that of the original variable.
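The skewness rule above can be sketched in a few lines. This is a minimal illustration rather than the study's code: the helper names and the toy data are hypothetical, the comparison uses absolute skewness (the text only says "less than"), and all values are assumed to be strictly positive because the handling of zeros is not described.

```python
import math

def skewness(values):
    """Moment-based sample skewness, ignoring survey weights."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def maybe_log_transform(values):
    """Return (values, flag); log-transform only if it reduces absolute skewness.

    Assumption: every value is strictly positive.
    """
    logged = [math.log(v) for v in values]
    if abs(skewness(logged)) < abs(skewness(values)):
        return logged, True
    return values, False

# Hypothetical right-skewed variable, e.g. an expenditure amount.
expenditure = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 150.0]
transformed, was_logged = maybe_log_transform(expenditure)
```

In this toy example the logarithm reduces the skewness, so the transformed values are kept; a symmetric variable would be returned unchanged.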
This study first selected features with a correlation-based method proposed for the purpose of removing redundant variables and increasing computational feasibility[18, 19]. The data redundancy might be created for the ease of survey implementation or data labeling. For example, different sources of income were separately asked and total income was the sum of incomes from all sources. The levels of education might be presented in years spent in school or types of highest grade completed (See S1 Table for details in variable names and labels).
Spearman’s rank-order correlation was used to create a correlation matrix of all variables, categorical or continuous[18, 19]. For each pair of variables in the correlation analysis, the subjects were dropped if there were any missing values in these two variables. The threshold for redundancy was Spearman’s rank correlation coefficient greater than 0.9. There were 251 variables left for further analysis (see Fig 1 for the flowchart). The proportions of missingness ranged from 0% to 7.18%, median 0.18% among 83 variables with any missing values. Sixty-eight of the retained variables were categorical and 15 were continuous. After variable selection, missing values in all variables were imputed with the multivariate imputation by chained equations.
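A correlation-based redundancy filter of this kind might look as follows. The sketch assumes details the text does not state: missing values are represented by `None`, the signed coefficient (not its absolute value) is compared with 0.9, and of two redundant variables the one encountered first is kept. The function and variable names are illustrative only.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation after dropping subjects with a missing value in either variable."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))

def drop_redundant(columns, threshold=0.9):
    """Keep a variable unless it correlates above the threshold with one already kept."""
    kept = []
    for name in columns:
        if all(spearman(columns[name], columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical variables; None marks a missing value.
columns = {
    "income_total": [10, 20, 30, None, 50, 60],
    "income_wage": [9, 21, 29, 41, 52, None],
    "age": [30, 25, 60, 45, 20, 50],
}
kept = drop_redundant(columns)
```

Here the wage variable is dropped because its ranks coincide with those of total income on the jointly observed subjects, while age is retained.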
Of the 71 categorical variables, three ordinal variables that ranked poverty categories (povcaty1), difficulty in using fingers to grasp (fngrdf1), and a summary measure of vision impairment (vision2) were not transformed. The other 68 nominal variables were replaced with 184 binary variables. This led to 367 variables available for PCA and 15 variables used for personal identification and control for survey design.
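The replacement of nominal variables by binary variables, and the resulting variable count, can be illustrated as below. The helper, the example variable, and the use of a dropped reference level are assumptions for illustration; the text does not say how reference categories were handled.

```python
def dummy_code(values):
    """Replace one nominal variable with one 0/1 indicator per non-reference level.

    Assumption: the first level in sorted order is the reference category.
    """
    levels = sorted(set(values))
    return {f"is_{level}": [1 if v == level else 0 for v in values]
            for level in levels[1:]}

# Hypothetical nominal variable with four levels.
region = ["northeast", "south", "west", "south", "midwest"]
indicators = dummy_code(region)

# Variable count reported in the text: 251 retained variables,
# of which 68 nominal ones are replaced by 184 binary variables.
n_pca_inputs = 251 - 68 + 184
```

The count works out to the 367 PCA inputs reported above.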
PCA has proven useful for dimension reduction and data pre-processing (PC denotes a principal component). Although there were other variants of PCA[23–25] and similar data techniques[12, 26], the choice of dimension reduction methods applicable to survey design was limited. We considered linear PCA the optimal and feasible option in view of the complex survey design. Before PCA, each variable was centered to zero and scaled to unit variance. PCA was conducted with the 367 variables while adjusting for survey design. The PC values were predicted for each subject.
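As a rough illustration of the centering, scaling and PCA steps, the following NumPy sketch runs an ordinary, unweighted PCA on toy data. It deliberately omits the survey-design adjustment used in the study, and the function name and simulated data are hypothetical.

```python
import numpy as np

def standardized_pca(X):
    """Unweighted PCA on centered, unit-variance columns.

    Returns (loadings, scores, variances); column j of `loadings` holds the
    weights of the j-th PC, ordered by decreasing variance.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)              # correlation matrix of X
    variances, loadings = np.linalg.eigh(corr)   # eigh returns ascending order
    order = np.argsort(variances)[::-1]
    variances, loadings = variances[order], loadings[:, order]
    scores = Z @ loadings                        # PC values for each subject
    return loadings, scores, variances

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]     # induce correlation between two toy variables
loadings, scores, variances = standardized_pca(X)
```

Because the inputs are standardized, the PC variances sum to the number of variables, and there are as many PCs as input variables.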
The indices were generated according to PCA loadings. Each PC was a linear combination of all input variables and could be seen as a weighted sum of all variables after the input variables had been centered and scaled. The number of PCs was the same as the number of input variables, denoted by N. In Eq 1, a PC, specified with a subscript pc, was the sum of all input variables, denoted by x, weighted by PC-specific loadings, denoted by L.
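From the definitions in this paragraph, Eq 1 can be written as follows. The subscript i for the input variables is notation assumed here, and each loading is specific to the PC in question.

```latex
\begin{equation}
PC_{pc} = \sum_{i=1}^{N} L_{i}\, x_{i}
\end{equation}
```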
PCA-based index generation proceeded as follows. Within each PC, the input variables were ordered by the absolute values of their loadings. The first index of each PC, denoted by Indexpc.n with n equal to one, was the product of the leading variable and its PC-specific loading, L1x1, where pc referred to the PC used to produce the indices and n specified the number of input variables included in the index in Eq 2. The second index was the sum of the products of the first two leading variables and their PC-specific loadings, L1x1 + L2x2. By repeating the same procedure, we included all variables weighted by loadings in each PC, and the last index in each PC was the same as the PC value. There were 367 weighted indices generated for each PC, and 134,689 for the 367 PCs in total.
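The nested construction amounts to a cumulative sum of loading-weighted variables taken in order of decreasing absolute loading. The sketch below uses NumPy with a hypothetical helper name and toy numbers; it is one way to express the procedure, not the study's implementation.

```python
import numpy as np

def nested_indices(Z, loading):
    """All nested indices for one PC.

    Z: (subjects x variables) array of centered and scaled inputs.
    loading: the PC's loading for each variable.
    Column n-1 of the result is the index built from the n variables with
    the largest absolute loadings; the last column equals the PC value.
    """
    order = np.argsort(-np.abs(loading))       # largest |loading| first
    weighted = Z[:, order] * loading[order]    # L_i * x_i in that order
    return np.cumsum(weighted, axis=1)

# Toy standardized data for three subjects and three variables.
Z = np.array([[ 1.0, -0.5,  0.2],
              [-1.0,  0.5,  1.0],
              [ 0.0,  0.0, -1.2]])
loading = np.array([0.1, -0.9, 0.4])
indices = nested_indices(Z, loading)
```

With N input variables this yields N indices per PC and N squared indices overall, which for N = 367 gives the 134,689 indices reported above.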
The outcome of interest was mortality in the second years. The survival function of the MEPS interviewees was estimated with Kaplan-Meier method and adjusted for survey design by months in the second years of the panels. We tested the differences in survival functions by sex and race/ethnicity.
The deaths in the second years were modelled in four three-month periods or quarters: January to March, April to June, July to September, and October to December. Each individual was duplicated for each quarter in which they remained alive. For example, an individual who survived throughout the second year of a MEPS panel would have four data entries representing four quarters. Each data entry was labelled alive. If someone died in the third quarter, July to September, they would have only three observations because they were no longer at risk in the fourth quarter, and the third entry was labelled dead.
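The person-period layout described here can be sketched as follows; the function and field names are hypothetical.

```python
def person_period(person_id, death_quarter=None, n_quarters=4):
    """Expand one person into one row per quarter at risk.

    death_quarter is 1 to 4 for a death in that quarter of year two,
    or None for someone who survived the whole year.
    """
    last = death_quarter if death_quarter is not None else n_quarters
    return [{"id": person_id, "quarter": q, "died": int(q == death_quarter)}
            for q in range(1, last + 1)]

survivor_rows = person_period("A")                   # four rows, all alive
decedent_rows = person_period("B", death_quarter=3)  # three rows, last one a death
```

Each row then serves as one observation in the discrete-time model, with `died` as the binary outcome.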
The survival of the MEPS participants in the second year of each panel was modelled with discrete-time event history analysis because the generated indices violated the proportional hazards assumption of the Cox model. We tested the first few indices and found that the proportional hazards assumption of the Cox regression model might not hold for most indices. In unadjusted models, death in each discrete time period (quarter) was predicted with each generated index, time, and the interaction between index and quarter (see S1 Equation for details). In adjusted models, age, sex and races were added as independent variables. Age in years was calculated from the birth and interview dates. Sex included male and female. Races were white, black, American Indians or Alaska natives, native Hawaiian or Pacific islanders, and multiple races. Event history analysis for the binary outcome, mortality, was conducted with adjustment for the complex survey design of the MEPS data sets, using the survey package in R (v3.2.2) and the RStudio environment (0.99.903).
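One common way to write an unadjusted discrete-time model of this kind is shown below. The logit link, the single linear term for quarter, and all symbols are assumptions made for illustration; the authors' own specification is given in S1 Equation. Here h_{jq} denotes the probability that subject j dies in quarter q given survival to the start of that quarter.

```latex
\begin{equation}
\operatorname{logit}\left(h_{jq}\right)
  = \alpha + \beta\,\mathrm{Index}_{j} + \gamma\, q
  + \delta\,\left(\mathrm{Index}_{j} \times q\right)
\end{equation}
```

Under the same assumptions, the adjusted model would add terms for age, sex and races to the right-hand side.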
Because of the large number of significant indices generated from the MEPS data alone, we wished to estimate the impact of new indices on academic publishing and knowledge translation, assuming that index mining required three steps within a three-month period: generation of weighted indices, formation of theories and methods, and comparisons across databases (Table 1). In each step, a manuscript for publication was drafted, so that three manuscripts were generated for one index. For each significant index, a researcher could thus publish one article every month, covering four indices per year. This might help to secure academic tenure by publishing four innovative indices per year or 12 related articles annually. The estimated impact on academic tenure was the number of years that a researcher could maintain this pace of publication, estimated by dividing the number of significant indices by four.
The average number of articles published in an academic journal was about 64 to 68 annually in 2012, 1.8 to 1.9 million articles by 28,100 active journals. We assumed that there was a journal focusing on publishing innovative and significant indices. The expected time of journal publishing in numbers of years was estimated through dividing the total number of significant weighted indices by 66.
There were 244,089 individuals surveyed throughout the two-year panels in the first to 16th MEPS panels, of whom 237,832 survived the first years of the panels. They represented 4.5 billion civilians in the United States, of whom 0.62% (95% CI = 0.58% to 0.66%) died in the second years of the panels. The demographic characteristics are listed in Table 2. The proportions of the two sexes and of white and non-white races were not statistically different across the MEPS panels (p = 1 and 0.24 respectively in Fig 2). The proportions dying in the second years of the MEPS panels were not the same across panels (p < 0.01). The Kaplan-Meier survival curves by sex and races are shown in Fig 2. The survival curves by months in the second years were significantly different across sex and races (p < 0.001 for both).
The leading variables contributing the most to the first five PCs are listed in S2 Table. The PC values of those dying in the second years of the MEPS panels were plotted against those surviving throughout the panels in Fig 3. Those dying in the second years did not seem to be evenly distributed across the first five PCs, especially in PC2 and PC3. In Fig 3a and 3b, those dying in the second years seemed to be associated with lower PC2 and PC3. Taking PC1 and PC2 as examples, the coefficients to predict the mortality risks obtained from event history analyses are shown in Tables 3 and 4. PC1 was not significant in the unadjusted model that only accounted for time, quarters in the second years, and interactions between PC1 and time (p = 0.78). However, in the adjusted model that added age, sex and races as predictors, PC1 was significantly associated with mortality risk (p < 0.001).
The p values of all 134,689 weighted indices with or without the adjustment of age, sex and races were plotted in Fig 4. Statistical significance, shown in red, prevailed in both graphs. In Table 5, the numbers of weighted indices were categorized by the numbers of variables comprising the indices. The number of input variables that significantly predicted the mortality probability in the second years was 208, 56.68% of all input variables, in both adjusted and unadjusted models. The proportions of significant indices diminished as the number of input variables increased from one to 30, 70 and 367. However, there were still large numbers of significant indices in both unadjusted and adjusted models. All weighted indices comprised both significant and insignificant input variables, and none of them could be constructed solely from significant or solely from insignificant variables.
Following the proposed publication cycles in Table 1, the 40,803 weighted indices significant in both models could lead to 10,200 years of academic tenure for individual researchers or 618.2 years of publishing for journals with an annual publication of 66 articles. If young or new researchers were wary of publishing complicated indices and would like to use significant ones composed of more than one and less than 30 variables, the 5,161 indices could lead to 1,290.25 years of academic tenure or 78.19 years of journal publishing. For certain research topics, for which a 70-item index might be acceptable, the volume of significant indices might be sufficient for 1,622.25 years of academic tenure and 98.32 years of journal publication due to the 6,489 significant indices comprising 31 to 70 input variables.
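The arithmetic behind these estimates is simple division and can be reproduced as follows. The constants come from the text; the function name and labels are illustrative, and the printed values are plain rounded quotients, which may differ in the last digit from figures that were truncated.

```python
INDICES_PER_RESEARCHER_YEAR = 4   # four indices (12 articles) per year, as assumed in the text
ARTICLES_PER_JOURNAL_YEAR = 66    # midpoint of the 64 to 68 range quoted in the text

def publication_years(n_indices):
    """Years of output for one researcher and for one journal."""
    return (n_indices / INDICES_PER_RESEARCHER_YEAR,
            n_indices / ARTICLES_PER_JOURNAL_YEAR)

for label, n in [("significant in both models", 40_803),
                 ("more than one and less than 30 variables", 5_161),
                 ("31 to 70 variables", 6_489)]:
    tenure, journal = publication_years(n)
    print(f"{label}: {tenure:.2f} researcher-years, {journal:.2f} journal-years")

# Share of all generated indices that were significant in both models.
share_significant = 40_803 / 134_689
```

The share of significant indices can be compared directly with the one-in-20 rate expected by chance at a 5% significance level.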
The data-driven index mining process and its results reveal several opportunities and challenges. First, the number of significant innovative indices comprising multiple input variables is large, and the proportion is beyond the probability that we may expect, one out of 20, when variables are aggregated according to PCA loadings. In addition to PCs that are often used in PC regression and other models, we find that aggregating input variables according to the order of absolute values of PCA loadings is an alternative way to search for composite measures or indices that significantly predict outcomes. Based on the large number of alternative indices to predict mortality obtained with this data-driven method, we suspect the process of traditional or theory-based index generation may not be optimal. For example, a frailty index whose input variables are assigned unequal weights derived from neural networks predicts adverse outcomes better than one whose input variables are assigned equal weights. A systematic approach to review new and innovative indices is required to obtain and select useful indices.
Second, all of the significant indices comprise input variables of unequal weights. This contrasts with the usual practice of assigning equal weights to all input variables[6, 8]. In addition to assigning equal or PCA-based weights to input variables, there are other methods to assign weights that have been rigorously tested based on theories or quantitative evidence. For example, the 10-year risk of cardiovascular disease is calculated based on a regression model that predicts the occurrence of cardiovascular disease. The regression coefficients, which are unequal, are regarded as the weights for input variables. The human development index is the multiplicative product of three dimensions regarding health, education and standard of living.
In this empirical study, using equal weights in most indices is not the best method to aggregate information or augment signal in this data set. Compared to the loadings obtained from PCA in S2 Table, the indices using equal weights for each input variable will not be optimal in terms of variance maximization. However, there are at least two occasions in PCA, in which the loadings of the input variables are similar. One is that the input variables are highly correlated and summing them with equal weights maximizes the variance of one of the PCs. However, this is to sum variables that resemble each other. This can be a solution to the problem of collinearity in regression models. Unfortunately, this also means the input variables do not provide information much different from each other. This type of indices may be reducible to one or two of the input variables. The other situation for homogenous loadings in one PC is that these variables have very low between-variable correlations, such as the first few leading variables in the PC1 in S2 Table. We think this would be another occasion to apply equal or homogenous weights to input variables. However, whether uncorrelated information from two measures can be summed to represent a concept may need further justification.
Third, using equal weights is a strong and possibly arbitrary assumption about the relationships between input variables and their predictive power. In Eq 1 shown below, the coefficient of the index (βindex) regarding a hypothetical outcome (y) is distributed across all input variables (xi). The input variables included in the index are thereby assigned the regression coefficients βindex wi for each xi.
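Using the symbols defined in this paragraph, the relationship can be written out as follows. The intercept β0, the error term ε and the number of input variables n are added here as assumptions for completeness.

```latex
\begin{equation}
y = \beta_{0} + \beta_{index} \sum_{i=1}^{n} w_{i} x_{i} + \varepsilon
  = \beta_{0} + \sum_{i=1}^{n} \left( \beta_{index}\, w_{i} \right) x_{i} + \varepsilon
\end{equation}
```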
This means that the weighting scheme (wi) links the relative scales of predictive power of all input variables and assumes the regression coefficients of the input variables regarding the outcome should not be estimated individually. The coefficients should be set collectively (βindex wi for each xi). For another outcome, the same restriction applies and the actual coefficients of the input variables simultaneously change in the same relative scales, as a new βindex for all input variables regarding the new outcome.
For indices created solely to represent concepts or abstract ideas that cannot be measured with single variables, the pre-determined scales or weights for all input variables may be justifiable. For example, there are proxy indices that are generated to represent functionality, emotional well-being and quality of life. However, the existence of some indices is partly justified by significant associations with major outcomes, such as mortality and surgical outcomes. They are more frequently used as outcome predictors than as proxy measures of abstract ideas.
For indices frequently used as proxy predictors, restricting the relationships and relative scales of all input variables by enforcing a single index coefficient may not be ideal. Several questions will not be easy to answer: why not use single variables directly as predictors to obtain variable coefficients (βi for each xi) if the sample size is sufficient; how to interpret the composite coefficients (βindexwi for each xi) derived from the index coefficient; how much of the outcome variability can be explained by each input variable; and how the outcome may change with the alteration of one input variable when controlling for another input variable of the index. If these questions are the major concern for researchers, using indices as proxies to predict outcomes may not be ideal.
Fourth, for the indices used as predictors, equal or PCA-based weights can be further improved using methods that combine the information from outcomes. Besides PCA, there are other data or estimation methods to take both input variables and outcomes into consideration and generate weighted composite measures, such as partial least squares (PLS) transformation. By applying the PLS projection, the weighting schemes can be searched in consideration of both outcomes and independent variables. We notice that there are many indices that are used heavily as proxy measures and generated without considering outcomes[1, 4–6, 9, 10]. In fact, there are many unexplored alternatives that can be used to determine the optimal or ideal weighting scheme for input variables. Two of the alternatives are subset selection and shrinkage methods that search the set of coefficients optimized for outcomes based on model fit criteria, such as mean square errors or Bayesian information criterion. However, this approach is not applicable for our data that requires the adjustment of survey design.
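The contrast between outcome-blind and outcome-aware weights can be illustrated with the first PCA loading vector and the first PLS weight vector, which for a single outcome is proportional to the covariances between the standardized inputs and the outcome. The sketch below uses simulated data with a continuous outcome for simplicity and ignores survey design; the function names and data are hypothetical.

```python
import numpy as np

def first_pca_loading(Z):
    """Direction of maximal variance among the standardized inputs."""
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return vt[0]

def first_pls_weight(Z, y):
    """First PLS weight vector for one outcome: proportional to cov(x_i, y)."""
    w = Z.T @ (y - y.mean())
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
n = 500
shared = rng.normal(size=n)
X = np.column_stack([shared + 0.3 * rng.normal(size=n),   # two correlated inputs
                     shared + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])                  # one independent input
y = 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)               # outcome driven by the third input
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca_w = first_pca_loading(Z)
pls_w = first_pls_weight(Z, y)
```

In this toy setting the first PC weights the two correlated inputs most heavily, whereas the PLS weights concentrate on the input that actually drives the outcome.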
Fifth, our results support the view that weak classifiers can be combined to form stronger classifiers. We find that there is no single PC-based index comprising only input variables that significantly predict mortality. Insignificant input variables can be combined to obtain new insight into the prediction of mortality. This may be partly due to the information gained from weak classifiers, which supplements the information of strong classifiers. The use of insignificant or weak predictors in the formation of new indices may need to be systematically explored and should receive more attention.
Lastly, the publication of new indices may help to secure academic tenure and journal publication. This is because the number of publications is significantly associated with tenure decisions. The large number of significant indices can help researchers to generate hypotheses or theories in order to augment their publication portfolios and secure academic tenure. For journals, this suggests it is possible to maintain the publication volume with research articles using significant indices. However, the estimated numbers of publications still need to be tested in the real world. We observe that adjusting the number of input variables in an index by a multiple of ten for publication seems to be well accepted for theory-based indices[35, 43, 44]. This publication strategy should be tried first.
With sufficient sources of publication materials, the focus may soon become how to improve the publication quality or ask authors to comply with review frameworks designed for innovative indices, such as the one we propose in Fig 5. In fact, a standardized reporting guideline should be developed and adopted to review the procedures and justification of index generation. This type of reporting guidelines has been well developed for clinical trials, epidemiological studies and systematic reviews.
To deal with the problems and questions identified for the newly generated mortality indices in a single data set, we suggest that index creators or readers assess these problems according to research objectives and through analytical methods. Based on our experience in generating PCA-based indices, it is important to first understand the problems or questions researchers may encounter while mining indices, listed in the Problems box in Fig 5. These questions concern why equal weights are imposed on input variables, whether there are outcomes to be considered, and whether there exist preferred weighting schemes that may be empirical or theory-based.
In the Assessment box in Fig 5, there are tools that can help researchers to understand the weighting schemes and relationships with outcomes, including PCA, PLS transformation, and the regression coefficients obtained from subset selection and shrinkage methods. However, there are other considerations after the initial assessment, listed in the third section, Issues to consider, in Fig 5. For example, PCs obtained from PCA can help to address the problem of collinearity. PCA loadings can provide PCA-based weighting schemes and combine weak classifiers or insignificant input variables into significant indices. However, the objective of PCA, to maximize the PC variances, may not be useful if researchers have specific outcomes to consider. Subset selection, such as forward-stepwise regression and random matrix, and shrinkage methods, such as LASSO and ridge regression, prefer and retain significant input variables. Moreover, nonlinear methods to summarize data, such as non-parametric PCA, diffusion map, and t-SNE (t-distributed stochastic neighbour embedding), are possible options to search for nonlinear projections of input variables.
After reviewing potential problems, assessment results, and important issues in the data set, there are several options toward the data set or weighting schemes, listed in the Actions to be justified section in Fig 5. The first can be the selection of input variables, whether to drop or keep variables. The other is the choice of weighting schemes, equal or unequal weights. If unequal weighting schemes are chosen, it is important to understand the global objectives and the methods to derive the weights, such as PCA or PLS transformation or other projection methods.
However, there are other considerations that also matter in the aggregation of information and the generation of index in the last section, Other considerations, in Fig 5. First, whether the new indices will be often referred as outcomes should be considered. The outcome indices can help to represent abstract ideas or concepts. Despite the shortcomings and the necessity to justify the use of equal weights, the outcome indices may be the sums of input variables with equal weights for reasons such as simplicity and interpretability. For example, the number of difficulties in the activities of daily living (ADL) provides understandable and straightforward summaries in functional status, although this adds up the number of difficulties in distinct dimensions, such as bathing and eating. The equal weighting of major dimensions of certain concept, such as functionality, is easy to comprehend and can effectively reduce the number of independent variables to one functional indicator. This is beneficial for studies of small sample sizes. However, there may be alternative weighting schemes much more preferable, if other objectives, such as to maximize aggregate variances or the covariance with the outcomes, exist.
Second, whether the sample sizes of the databases that researchers may use to generate or test new indices are large enough for PCA or other methods is also a key issue. PCA becomes unstable if the number of observations is less than the number of variables. With smaller sample sizes, PCA is more likely to be influenced by outliers in the database, and the results of PCA from different databases can vary greatly. Large sample size, universal access and data quality are the reasons why we use the MEPS database to demonstrate the procedures of PC-based index generation and examination.
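One concrete aspect of this instability is that, after centering, a data set with fewer observations than variables can support at most as many PCs with non-zero variance as the number of observations minus one. The NumPy sketch below illustrates this with simulated data; the dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_variables = 5, 10       # fewer observations than variables
X = rng.normal(size=(n_subjects, n_variables))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular values of Z; squared and divided by (n - 1) they are the PC variances.
singular_values = np.linalg.svd(Z, compute_uv=False)
pc_variances = singular_values ** 2 / (n_subjects - 1)
n_informative = int(np.sum(pc_variances > 1e-10))
```

With five subjects and ten variables, only four PCs carry any variance, so the loadings of the remaining directions are not determined by the data.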
Lastly, the role of the newly generated indices is also important to consider. For the indices that are treated as outcomes, the theories or existing evidence to combine the indices may be more important than other data objectives. For indices that serve as predictors, proxy indices, the reason why and how to combine input variables are the key to choose the weighting schemes and the methods to generate new indices.
There are several limitations to this study. First, computing power is important for index mining. The creation of the complete matrix of significance in Fig 4 requires more than six months of computing time on a regular desktop computer. Due to this limitation, we are able to test only one outcome, mortality, and are thus unable to estimate the numbers of indices significant for other outcomes, such as disease incidence and socioeconomic status change.
The weighting and summation of variables based on PCA loadings can lead to a large number of significant weighted indices regarding important outcomes, such as mortality in this study. However, the number of publishable indices may be less than that of significant indices for several reasons. The first is that adjacent indices produced according to the loadings of the same principal component may be quite similar because some of the loadings can be close to zero. There are currently no methods or algorithms to estimate the exact number of publishable indices. We are currently developing several methods to prioritize the significant indices for publication, some of which are computationally intensive. One option is to first examine the significant indices with insignificant neighbouring indices. Another is to use explicit criteria to prioritize indices relative to the neighbouring indices and among the others created from the same principal component. The criteria can be p values, model fit statistics, or effect sizes regarding specific outcomes. The chosen indices can be those with much lower p values than the neighbouring ones. Other computationally intensive methods we are developing aim to directly interpret the derived indices and select those that are both interpretable and significant for publication. This involves algorithms that interpret the derived indices and select them based on the similarity between indices and input variables in terms of certain information criteria. The methods to select the best way to aggregate information into indices remain to be further developed and justified.
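The first option, examining significant indices whose neighbouring indices are insignificant, could be sketched as follows. The function name, the 0.05 threshold and the toy p values are assumptions for illustration and do not represent the methods under development.

```python
def isolated_significant(p_values, alpha=0.05):
    """Sizes n of significant indices whose adjacent indices are both non-significant.

    p_values[n - 1] is the p value of the index built from the first n
    variables of one PC. A missing neighbour at either end counts as
    non-significant.
    """
    hits = []
    for i, p in enumerate(p_values):
        left = p_values[i - 1] if i > 0 else 1.0
        right = p_values[i + 1] if i + 1 < len(p_values) else 1.0
        if p < alpha and left >= alpha and right >= alpha:
            hits.append(i + 1)      # report the number of variables, n
    return hits

# Hypothetical p values for the seven nested indices of one PC.
toy_p_values = [0.40, 0.01, 0.30, 0.02, 0.03, 0.20, 0.004]
selected = isolated_significant(toy_p_values)
```

In this toy series the indices with two and seven variables are selected, while the adjacent significant pair with four and five variables is set aside for comparison by other criteria.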
PCA loadings can be used to assign weights to input variables and generate innovative indices. With data from 16 longitudinal 2-year MEPS panels, there are 134,689 indices derived from 251 non-redundant variables. Of all indices, there are 40,803 indices significantly associated with mortality in the second years of the MEPS panels with or without the adjustment of age, sex and races. We find that assigning equal weights to variables requires justification and clear objectives. The results help us to develop a preliminary data-driven framework to review the process of index generation. In this framework, the objectives and rationales to combine information from input variables are important issues to consider, as well as the characteristics of the databases. In the face of the possible deluge of innovative indices, we suggest the development of a standard reporting system for the publication of indices and the creation of publication channels for further discussion of information aggregation or variable stacking.
YSC is financed by the Fonds de recherche du Québec – Santé (FRQS) fellowship. The granting agencies had no role in this study.
All of the Medical Expenditure Panel Survey data can be downloaded at the following website: https://meps.ahrq.gov/data_stats/download_data_files.jsp.