Elevated alanine aminotransferase (ALT >40 IU/L) is a marker of liver injury but provides little insight into etiology. We aimed to identify and stratify risk factors associated with elevated ALT in a randomly selected population with a high prevalence of elevated ALT (39%), obesity (49%) and diabetes (30%).
Two machine learning methods, the support vector machine (SVM) and Bayesian logistic regression (BLR), were used to capture risk factors in a community cohort of 1532 adults from the Cameron County Hispanic Cohort (CCHC). A total of 28 predictor variables were used in the prediction models. The recently identified genetic marker rs738409 on the PNPLA3 gene was genotyped using the Sequenom iPLEX assay.
The four major risk factors for elevated ALT were fasting plasma insulin level and insulin resistance, increased BMI and total body weight, plasma triglycerides and non-HDL cholesterol, and diastolic hypertension. In spite of the highly significant association of rs738409 in females, the role of rs738409 in the prediction model is minimal, compared to other epidemiological risk factors. Age and drug and alcohol consumption were not independent determinants of elevated ALT in this analysis.
The risk factors most strongly associated with elevated ALT in this population are components of the metabolic syndrome and point to nonalcoholic fatty liver disease (NAFLD). This population-based model identifies the likely cause of liver disease without the requirement of individual pathological diagnosis of liver diseases. Use of such a model can greatly contribute to a population-based approach to prevention of liver disease.
High rates of chronic end-stage liver disease have been documented together with significantly elevated prevalence of diabetes and obesity among Mexican–Americans living at the United States (U.S.)/Mexico border (1–3). Most striking are data from a randomly recruited cohort from this population in which we show a high rate (~39%) of the metabolic syndrome and elevated alanine aminotransferase (ALT) levels, indicative of liver injury (4). We observe a marked gender effect with young males more likely to be obese and to have raised ALT levels (5). These rates are in the absence of evidence for excessive alcohol consumption, but in any event alcoholic and nonalcoholic fatty liver disease (NAFLD) are not exclusive processes and may be additive. These observations raise the important question as to whether this population has high rates of NAFLD and, more importantly, nonalcoholic steatohepatitis (NASH), which leads to end-stage liver disease. Although elevated ALT is known to be indicative of liver injury, it lacks diagnostic specificity for NAFLD in the absence of liver biopsy. Because the risks and the cost of liver biopsy, particularly in a disadvantaged population, are prohibitive on a large scale, we applied machine learning methods to our database in order to identify risk factors for the elevated ALT from extensively documented clinical and biological information. From this we obtain an estimate of the potential burden of NAFLD in our population of Americans of Mexican descent. This knowledge is important in health disparity populations ill equipped to bear additional burdens of preventable liver disease both economically and socially.
The Hispanic population in the city of Brownsville, Cameron County, Texas is one of the poorest in the U.S. (2). Since 2003 we have recruited >2000 healthy participants randomly selected from the community: the Cameron County Hispanic Cohort (CCHC) (2). These individuals consented to extensive sociodemographic, anthropometric and biological analyses. Weighted data from this cohort (i.e., data corrected for sampling bias based on census data to account for age, gender, tract/block and household clustering) show the prevalence of obesity and diabetes to be high: 7.9% of individuals are morbidly obese (BMI ≥40), 48.5% are obese (BMI ≥30), and 81.7% are obese or overweight (BMI ≥25). Using the 2010 definition of diabetes recommended by the American Diabetes Association (ADA) (6), 19.2% had pre-diabetes (fasting plasma glucose [FPG] 100–125 mg/dL or glycosylated hemoglobin [HbA1c] 5.7–6.4%) and 30.7% had diabetes (FPG ≥126 mg/dL or HbA1c ≥6.5%). In addition to obesity and diabetes, we found the prevalence of elevated ALT (≥40 U/L) to be 40.8%. ALT is mainly produced in the liver and released into the bloodstream as the result of liver injury. Chronic liver disease is one of the major causes of death in the adult Hispanic population in the U.S. (2008 statistics of the National Center for Injury Prevention and Control: NCIPC, http://www.cdc.gov/injury/wisqars/fatal.html). Liver disease is ranked as the eighth leading cause of death among Hispanics aged 25 to 34 years, rising to sixth at 35–44 years, and fourth between the ages of 45 and 64 years. Our own data yielded rates of 126/100,000 for end-stage liver disease in a retrospective chart review using ICD-9 codes for end-stage liver disease (1). Rates were considerably higher in males (386/100,000) and overall 8.7% of the 176 cases identified had been diagnosed with hepatocellular carcinoma. 
There were no biopsy data and only four patients had been referred for liver transplant and none had received one. This study drew charts from a Federally Qualified Clinic in Brownsville serving the same mainly uninsured Mexican–American population from which we drew the CCHC (1). As stated above, the limitation of all these data is that accurate diagnosis of the cause of liver disease depends on the relatively invasive and expensive procedure of liver biopsy. Given these constraints and the concerns raised by our data, we sought to generate more precise data on risk factors using less invasive procedures (venipuncture).
Recently, genome-wide association studies have provided a new tool in the identification of genetic susceptibility of liver injury (7,8). Two single nucleotide polymorphisms (SNP) rs738409 (causing the amino acid substitution Ile148Met) and rs2281135 in the PNPLA3 locus have been highlighted as being associated with NAFLD (7) and elevated ALT (9), respectively. Our preliminary genetic analysis of the CCHC suggests rs738409 tags the genetic association with elevated ALT better than rs2281135. In addition, our data showed the genetic susceptibility tagged by rs738409 was not biased by population structure of the admixed Mexican–American population. Therefore, genotypes of rs738409 were used as a genetic risk factor in the machine learning process.
Machine learning methods are able to automatically capture risk factors from a large number of variables. The support vector machine (SVM) and Bayesian logistic regression (BLR) are the two most representative machine learning methods for disease risk modeling. SVM is a modern machine learning method that operates by finding an optimal separating hyperplane between affected and unaffected individuals (10). SVM is particularly useful in classifying high-dimensional data and taking into account the interactions among environmental and genetic factors (11,12). Logistic regression is a classical method in disease risk modeling. BLR extends logistic regression to a Bayesian framework by incorporating prior information (13). In this study we aimed to identify the risk factors contributing to elevated ALT in the Brownsville CCHC using these two machine learning methods. We anticipate that this approach will provide a robust measure of the likely disease processes associated with abnormal liver function in this Mexican–American population without invasive liver biopsy. The information will be important for public health policy makers and planners to develop the most efficient prevention and disease management strategies at the population level.
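For readers less familiar with the two classifiers, the objectives they optimize can be written compactly. The notation below (feature map \phi, penalty C, kernel width \gamma, prior scale \lambda, labels y_i in {-1, +1}) is assumed here for illustration and does not come from the cohort analysis. It is the textbook form of a soft-margin SVM with a radial basis function kernel and of logistic regression with a Laplace prior, the variants named in the Methods.

```latex
% Soft-margin SVM: separating hyperplane with slack variables \xi_i
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0

% Radial basis function (RBF) kernel replacing the inner product
K(x,x') \;=\; \phi(x)^{\top}\phi(x') \;=\; \exp\!\bigl(-\gamma\,\lVert x-x'\rVert^{2}\bigr)

% Logistic regression with a Laplace prior: maximum a posteriori estimate
\hat{\beta} \;=\; \arg\max_{\beta}\ \sum_{i=1}^{n}\log\sigma\!\bigl(y_i\,\beta^{\top}x_i\bigr)\;-\;\lambda\sum_{j}\lvert\beta_j\rvert,
\qquad \sigma(t)=\frac{1}{1+e^{-t}}
```

In this form, C and \gamma are the SVM parameters typically tuned by grid search, and the Laplace prior acts as an L1 penalty that shrinks the coefficients of weak predictors toward zero.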
Written informed consent was obtained from each participant, and the study was approved by the Committee for the Protection of Human Subjects of the University of Texas Health Science Center at Houston (UTHealth).
This study investigated 1532 adult individuals on whom we had complete data, recruited prospectively in the Cameron County Hispanic Cohort (CCHC). These individuals were from households randomly selected for recruitment on the basis of 2000 census tract data in the city of Brownsville, Cameron County, Texas (2). The general description of this cohort is in our previous report (2). Among these adult participants in this study, 39% have ALT >40.
Standard protocols for data entry, cleaning, and quality control of the data were applied throughout. Personal identifiers are secured separately with access limited to only those personnel needing to contact participants who had given prior consent to be recontacted. De-identified data are also secured behind the UTHealth firewall with access only to approved collaborators. Data weighting was performed as described (2).
Genotyping of rs738409 was performed using the Sequenom iPLEX assay (Sequenom, Cambridge, MA). The genotyping call rate was 100%. For the purpose of quality control, 93 DNA samples were genotyped in duplicate. The concordance rate between duplicates was 100%.
The following variables were included in our risk model: gender, age, rs738409 genotype, body mass index (BMI), body height, waist circumference, hip circumference, waist/hip ratio, pulse rate, blood pressure, physical activity, alcohol consumption, smoking, education levels, status of diabetes (diagnosed by the ADA 2010 guidelines) (6), history of hepatitis, medications, fasting plasma glucose (FPG), fasting plasma insulin level, homeostasis model assessment-estimated insulin resistance (HOMA-IR) (14), fasting lipids [serum triglycerides, high-density lipoprotein cholesterol (HDL-c), non-HDL-c, and low-density lipoprotein cholesterol (LDL-c)]. These input variables were linearly scaled to the range [0; +1] and were mapped into a high-dimensional feature space.
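Two of the preprocessing steps mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the HOMA-IR expression is the commonly used approximation with glucose in mg/dL and insulin in µU/mL, and the handling of constant columns is an assumption.

```python
def homa_ir(fpg_mg_dl, insulin_uU_ml):
    """HOMA-IR approximation: fasting glucose (mg/dL) x fasting insulin (uU/mL) / 405."""
    return fpg_mg_dl * insulin_uU_ml / 405.0


def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1].

    Assumption: a constant column carries no information and is mapped to 0.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative (made-up) BMI values for three participants.
bmi = [22.0, 31.0, 40.0]
print(min_max_scale(bmi))  # [0.0, 0.5, 1.0]
```

Scaling every predictor to the same range keeps variables with large numeric ranges (for example triglycerides) from dominating distance-based kernels.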
In this study, all classification tasks were performed by support vector machine (SVM) and Bayesian logistic regression (BLR). SVM is a very effective supervised machine learning classifier widely used in pattern recognition or classification. Our soft margin SVM model was implemented with the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm) (15). The radial basis function (RBF) kernel was chosen because it gave the highest accuracy in our tests: it performed better than the linear kernel in the SVM model (AUC scores: 0.743 vs. 0.735 in males; 0.690 vs. 0.664 in females). For parameter selection, a grid search was performed with 10-fold cross-validation. The weight (or relative importance) of each variable in the SVM model was assessed by the F-score. The F-score measures the discrimination between two groups; a larger F-score suggests that the feature (elevated ALT in this paper) is better discriminated by the variable (16). BLR is an extension of the binary logistic regression model. In contrast to a standard logistic regression model, the regression coefficients in BLR are estimated with a Bayesian prior density (13). Our BLR model was implemented with the Laplace prior in the Bayesian binary regression (BBR) software (http://code.google.com/p/bbrbmr/). The weight of each variable in the BLR model was assessed by the maximum likelihood β value. Comparisons between the two groups (normal ALT ≤40 vs. elevated ALT >40) were performed using Student t-tests for continuous variables and Pearson χ² tests for categorical variables. For the purposes of modeling, we chose this cut-off for ALT so that our results would be comparable with a previous study using data from a different population (17). This study used the receiver operating characteristic (ROC) curve to assess the model performance. 
The ROC curve plot was generated by calculating true-positive rates and false-positive rates over a relevant range of thresholds. For obtaining aggregate numbers of true vs. false positive rates, the thresholds of each ROC curve underwent stepwise variation from 0–1 in each 0.01 interval. Each threshold was assigned as the probability of an individual having liver injury. The area under the curve (AUC) was used as a measure of performance of the classifiers.
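The threshold sweep described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: predicted probabilities at or above the threshold are called positive, the AUC is obtained by the trapezoidal rule, and the function names and toy data are hypothetical.

```python
def roc_points(labels, probs, n_steps=100):
    """Return (FPR, TPR) pairs for thresholds 0.00, 0.01, ..., 1.00.

    labels: 1 = elevated ALT, 0 = normal ALT; probs: predicted probabilities.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for i in range(n_steps + 1):
        t = i / n_steps
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: one normal-ALT subject is scored above one elevated-ALT subject.
labels = [1, 0, 1, 0]
probs = [0.9, 0.8, 0.3, 0.1]
print(auc_trapezoid(roc_points(labels, probs)))  # 0.75
```

With a 0.01 grid, the curve is an approximation whenever several predicted probabilities fall between two adjacent thresholds; a finer grid or sorting by score removes that approximation.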
We modeled the risk of liver injury for males and females separately because the rate of elevated ALT is highly stratified according to gender (male = 54.7%, female = 27.4%). In the risk modeling, the performances of the SVM method and the BLR method were assessed based upon area under the receiver operator characteristic curve (AUROC) scores of 0.743 (males) and 0.690 (females) for SVM, 0.693 (males) and 0.670 (females) for BLR (Figure 1). Although SVM and BLR had different performances in our study, SVM results were largely supported by BLR results. The weights of predictor variables in the SVM model represented by the F-scores were validated by removal of a specific predictor variable and then reassessment of AUC scores. For example, consistent with the F-scores of the genetic marker rs738409, AUC scores before and after removing rs738409 in the SVM model were 0.743 vs. 0.743 in males and 0.690 vs. 0.669 in females. The risk factors contributing to elevated ALT in males and females are shown in Table 1. Our models identified four groups of risk factors in both genders: (1) abnormal fasting plasma insulin level and insulin resistance (HOMA-IR); (2) increased BMI and total body weight; (3) serum triglycerides and non-HDL-c; and (4) diastolic hypertension. Interestingly, a fifth factor, (5) genetic susceptibility tagged by the PNPLA3 SNP rs738409, was identified in females only. Alcohol consumption was not identified as a risk factor in this population. These risk factors of increased ALT in this community cohort are consistent with the known risk factors of NAFLD (17,18).
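The F-score used to weight predictors in the SVM model can be sketched as below. The sketch assumes the definition commonly used alongside LIBSVM for feature ranking (squared distances of the two group means from the overall mean, divided by the sum of the two within-group sample variances); whether reference (16) uses exactly this form is an assumption, and the function name and toy values are hypothetical.

```python
from statistics import fmean, variance


def f_score(pos_values, neg_values):
    """F-score of one variable for separating two groups.

    pos_values: variable values in the elevated-ALT group
    neg_values: variable values in the normal-ALT group
    Each group needs at least two observations (sample variance).
    """
    mean_all = fmean(list(pos_values) + list(neg_values))
    mean_pos = fmean(pos_values)
    mean_neg = fmean(neg_values)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = variance(pos_values) + variance(neg_values)
    return between / within


# Toy values: the two groups are well separated, so the score is high.
print(f_score([1, 2, 3], [4, 5, 6]))  # 2.25
```

Because the score is computed one variable at a time, it does not capture interactions between predictors, which is one reason to cross-check it by removing a variable and re-estimating the AUC.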
Increased glutamate oxaloacetic transaminase (also known as aspartate aminotransferase, AST) may also indicate liver injury. Elevated ALT with ALT/AST <1 is considered counter to the diagnosis of NAFLD (19). Among the 556 individuals with ALT >40, there were 93 individuals with ALT/AST <1. Our risk modeling in this subset of 93 subjects showed poor AUROC scores, suggesting that the risk factors in these participants were largely unrelated to metabolic syndrome and therefore could not be determined using the current set of variables. Nevertheless, our risk model in these 93 individuals did capture increased insulin level and HOMA-IR as the major risk factors in males, with history of hepatitis as the second risk. In females, our modeling highlights the major risk from BMI and waist circumference, whereas non-HDL-c is the second risk. Even in participants with elevated ALT and ALT/AST <1, metabolic syndrome remains a risk factor for liver injury.
This study highlights the major but under-appreciated health and economic threats of liver disease in disadvantaged populations with high rates of obesity and diabetes. Obesity induces NAFLD through dysfunctional adipose metabolism mediated by adipokines (20). Our results show that body weight consistently contributes particular risk of liver injury independent of BMI in both males and females. These observations suggest that total amount of body fat is of special importance in contributing to the risk of NAFLD. Furthermore, waist circumference and waist/hip ratio also contribute to liver injury, emphasizing the importance of central fat distribution. Because our observations are highly statistically significant, they underline recent data regarding the roles of central and generalized obesity in NAFLD (21). Although generalized obesity increases the risk of NAFLD, central obesity makes an additional and independent contribution to NAFLD. We confirm that insulin resistance and increased fasting plasma insulin are also important correlates of liver injury, most markedly in females. To date, it is still unclear whether NAFLD is caused by insulin resistance (22) or leads to insulin resistance because of critical disturbances in liver metabolism (23). However, dysfunctional adipose metabolism is a common pathological mechanism shared by insulin resistance and NAFLD (24,25).
Both serum triglycerides and non-HDL cholesterol contributed to the risk of liver injury in this population. The risk effect of non-HDL-c is especially obvious in males, as shown by its high rank among all the risk factors (Table 1). This differs from the previously reported correlation between hypertriglyceridemia (but not hypercholesterolemia) and fatty infiltrations of NAFLD (26), emphasizing the importance of separately analyzing the genders in order not to mask potential correlates that are gender dependent. Given the high rates of end-stage liver disease in our previous study where we found significantly higher rates in males than females (1), this gender difference may point to important metabolic pathways and/or behavioral differences in this population that lead to different rates of NAFLD in each gender (27,28). In addition, our analysis highlighted the critical role of non-HDL-c rather than total cholesterol in liver injury. Only a very minor risk effect of total cholesterol could be identified in our modeling when we did not separate non-HDL-c from total cholesterol, because HDL-c counteracted and diluted the risk of non-HDL-c. On the other hand, LDL-c calculated according to the formula of Friedewald et al. (29) is not associated with increased ALT in our study. This finding provides additional evidence that non-HDL-c, rather than calculated LDL-c, is a risk marker of metabolic syndrome (30) or of the liver injury of metabolic syndrome (NAFLD) in our population. The machine learning approach adds to our previous observations by identifying diastolic hypertension as an independent major risk factor for liver injury in this Hispanic community. Although diastolic hypertension is a common complication of the metabolic syndrome, it is not known whether there is an independent pathogenic mechanism associating it with NAFLD.
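The two lipid quantities contrasted above differ only in how triglycerides enter the calculation. A minimal sketch, with all values in mg/dL; the function names are hypothetical and the 400 mg/dL limit is the usual validity bound quoted for the Friedewald formula rather than something stated in this study:

```python
def non_hdl_c(total_chol, hdl_c):
    """Non-HDL cholesterol (mg/dL): total cholesterol minus HDL-c."""
    return total_chol - hdl_c


def friedewald_ldl_c(total_chol, hdl_c, triglycerides):
    """Calculated LDL-c (mg/dL) by the Friedewald formula.

    LDL-c = total cholesterol - HDL-c - triglycerides / 5.
    The estimate is generally considered unreliable at triglycerides >= 400 mg/dL.
    """
    if triglycerides >= 400:
        raise ValueError("Friedewald estimate not valid for triglycerides >= 400 mg/dL")
    return total_chol - hdl_c - triglycerides / 5.0


# Made-up fasting lipid panel.
print(non_hdl_c(200, 50))              # 150
print(friedewald_ldl_c(200, 50, 150))  # 120.0
```

Non-HDL-c therefore includes the triglyceride-rich lipoprotein cholesterol that the Friedewald estimate subtracts, which is one plausible reason the two markers behave differently in a population with frequent hypertriglyceridemia.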
A gender-specific genetic susceptibility was highlighted in this study. As shown by our study, the genetic susceptibility tagged by the PNPLA3 SNP rs738409 is only seen in females. The lack of association in males cannot be explained by sample size or statistical power. Comparing the genetic effect in males (OR [95% CI] = 1.171 [0.911, 1.505]) and females (OR [95% CI] = 1.640 [1.347, 1.996]), the heterogeneity is statistically significant (p = 0.038). In addition to our study, a similar gender-specific effect has been reported by meta-analysis of genetic association of rs738409 and NAFLD (31). The molecular mechanisms underlying this gender-specific effect remain unknown and will be investigated in our future work. In spite of the highly significant association, the role of rs738409 in the prediction model is minimal compared to other epidemiological risk factors.
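The text does not state which heterogeneity test was used. One common approach, sketched here as an assumption, compares the two log odds ratios with a z-test, recovering each standard error from the width of its 95% confidence interval. Applied to the odds ratios quoted above, it gives a two-sided p-value of about 0.038, in line with the reported figure. Function names are hypothetical.

```python
import math


def log_or_and_se(odds_ratio, ci_lower, ci_upper, z_crit=1.959964):
    """Log odds ratio and its standard error recovered from a 95% CI."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)
    return math.log(odds_ratio), se


def heterogeneity_test(or_a, ci_a, or_b, ci_b):
    """Two-sided z-test for a difference between two independent log odds ratios."""
    log_a, se_a = log_or_and_se(or_a, *ci_a)
    log_b, se_b = log_or_and_se(or_b, *ci_b)
    z = (log_b - log_a) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Odds ratios for rs738409 reported in the text: males, then females.
z, p = heterogeneity_test(1.171, (0.911, 1.505), 1.640, (1.347, 1.996))
print(round(z, 2), round(p, 3))
```

The test treats the male and female estimates as independent, which is reasonable here because the two strata contain different individuals.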
We found that history of hepatitis contributed only a minor risk for elevated ALT in males. Our previous unpublished data show very low rates of hepatitis C seropositivity (0/320) and hepatitis B (3/320) in randomly collected sera from this population (Fisher-Hoch, unpublished data). We also found that aging in itself was not a risk factor for NAFLD. On the contrary, evidence of liver injury was most marked in younger males who are also more likely to be severely obese (5). Neither drug nor alcohol consumption was an indicator of liver injury in this study; however, a history of medication (including any drugs prescribed by a physician) did show a protective effect in males. This may be correlated with receiving treatment for insulin resistance, hypertriglyceridemia, or hypertension. Although physical activity showed a trend towards a protective effect, it did not reach statistical significance, perhaps due to imprecise quantification of physical activity in this study. The finding that FPG does not contribute to the risk of elevated ALT in males suggests that intensive control of blood glucose alone may not be a viable therapeutic target for liver injury in metabolic syndrome.
We identified and stratified the most important and specific risk factors for liver injury in this cohort of Mexican-American subjects. These risk factors are largely concordant with those reported in previous studies of subjects with a clear diagnosis of NAFLD (17,18). In a group of Korean subjects with NAFLD, Oh et al. showed that increased ALT was associated with serum lipid (increased triglycerides and non-HDL-c, and decreased HDL-c), insulin resistance (fasting glucose, fasting insulin, and HOMA-IR), body fat (waist circumference and BMI), and hypertension (diastolic blood pressure) (17). In a group of Italian subjects with NAFLD, Bedogni et al. showed increased risk of NAFLD due to obesity, hyperglycemia, hypertriglyceridemia, and systolic hypertension (18). In comparison, the distribution pattern of risk factors identified by our study suggests NAFLD as likely the most important cause of increased ALT, and therefore of liver injury, in this Mexican-American population. Globally, NAFLD is a fast emerging disease and has recently received extensive attention.
The definitive diagnosis of NAFLD is difficult. Liver biopsy is the “gold standard” to diagnose NAFLD and to differentiate it from NASH because it determines the presence and extent of hepatic fibrosis (32). However, because of the potential serious complications (0.1% major hemorrhage, and 0.01% death) and its technical complexity, inconvenience to the participant and cost, liver biopsy is not suitable for use in a screening study at the population level (33) so was not considered appropriate in this study. Despite these limitations, our data identify NAFLD as a likely major contributor to the extraordinarily high rate of liver injury in this Mexican-American population with limited access to health care. Community-based efforts and simple preventive medicine to reduce NAFLD are critically needed to significantly lower the burden of liver injury, including end-stage liver disease, in disadvantaged populations.
We thank our cohort recruitment team, particularly Rocio Uribe, Elizabeth Braunstein and Julie Ramirez. We also thank Marcela Montemayor and other laboratory staff for their contribution and Christina Villarreal for administrative support. We thank Valley Baptist Medical Center, Brownsville for providing us space for our Center for Clinical and Translational Science Clinical Research Unit. We also thank the community of Brownsville and the participants who so willingly participated in this study in their city.
This work was supported by MD000170 P20 funded from the National Center on Minority Health and Health Disparities (NCMHD) and the Centers for Translational Science Award 1U54RR023417–01 from the National Center for Research Resources (NCRR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. HQQ is supported by intramural funding from the University of Texas, School of Public Health.