In this paper, we developed a weighted BNI model for pancreatic cancer prediction that combines PubMed knowledge with EHR data and uses the ratio of positive to negative evidence for the association between each risk factor and the target disease. The evaluation results indicate that the weighted BNI significantly outperformed the conventional BNI and two other classification models for pancreatic cancer prediction. This result can be explained by the following characteristics of the weighted BNI model. First, the posterior probabilities of the weighted BNI are determined by two data sources: the ratio of positive to negative evidence for the association between the risk factor and the disease, and the prior probability of each risk factor in the EHR. Both are important empirical evidence for disease risk prediction. The more frequently a risk factor is found in the EHR, the higher its posterior probability. The weighted BNI can distinguish clinically relevant variables from clinically irrelevant ones and weigh the relevant variables according to PubMed evidence. The conventional BNI, by contrast, can recommend risk factors only on the basis of a statistically high posterior probability. Moreover, some approaches simply eliminate irrelevant variables; we instead keep seemingly irrelevant variables in the model but use PubMed knowledge to avoid misusing their prior probabilities. Our design seems to be more realistic and sensitive than a simplified model that disregards such variables. To our knowledge, the weighted BNI is a novel approach to handling clinically irrelevant variables for disease risk prediction. Our results in and confirm our hypothesis that the weighted BNI model can overcome the limitations of the conventional BNI, which is based on pure probabilities.
The weighted BNI also outperformed KNN and SVM for pancreatic cancer prediction. Two reasons may explain this. First, the BNI model serves risk prediction better than other classification models when only a small number of variables is available. Our model contains only 20 variables. KNN and SVM, on the other hand, usually excel in a high-dimensional feature space, such as high-dimensional microarray datasets, and do not show advantages in a low-dimensional feature space. Second, the weighted BNI obtains information about the association between the variables and pancreatic cancer from the topology of the Bayesian network and from the weights derived from PubMed and EHR, while KNN and SVM have no such knowledge to support accurate prediction.
The combination of PubMed and EHR knowledge and information for weighing risk factors can be used to generate hypotheses about the clinical significance of an association between a variable and pancreatic cancer by a combined analysis of its frequency in patients with pancreatic cancer () and in patients without pancreatic cancer (). For example, glucose is top ranked in , but appears at the bottom in . This result indicates a positive association between glucose and pancreatic cancer, which is consistent with scientific knowledge in that glucose is the second most frequently studied variable in pancreatic cancer research due to the association between diabetes and pancreatic cancer.
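The combined analysis described above can be sketched as a comparison of frequency ranks in the two patient groups. The following is a minimal illustration only: the function names, the placeholder variable names, and the frequencies are hypothetical and are not taken from our data.

```python
def rank_by_frequency(freq):
    """Map each variable to its rank, where rank 1 is the most frequent."""
    ordered = sorted(freq, key=lambda v: freq[v], reverse=True)
    return {v: i + 1 for i, v in enumerate(ordered)}


def rank_shift(case_freq, control_freq):
    """Rank among controls minus rank among cases, per variable.

    A large positive value means the variable ranks much higher among
    cases than among controls, which suggests a positive association.
    """
    case_rank = rank_by_frequency(case_freq)
    control_rank = rank_by_frequency(control_freq)
    return {v: control_rank[v] - case_rank[v]
            for v in case_freq if v in control_rank}


# Made-up frequencies for three placeholder variables (illustration only).
case_freq = {"var_a": 0.60, "var_b": 0.30, "var_c": 0.10}
control_freq = {"var_a": 0.05, "var_b": 0.40, "var_c": 0.20}

shifts = rank_shift(case_freq, control_freq)
```

In this toy input, `var_a` is ranked first among cases and last among controls, so it receives the largest positive shift and would be the first candidate for a hypothesis about clinical significance.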
According to Eq (1.1) and (1.2), the weight $w_{oi}$ represents the ratio between the number of PubMed abstracts of positive association ($P_i$) and the number of PubMed abstracts of negative association ($N_i$) for each risk factor $i$. If $P_i$ is bigger than $N_i$, then the original weight $w_{oi}$ is bigger than 1, which means there is more positive than negative evidence showing variable $V_i$ is associated with pancreatic cancer; if $P_i$ is smaller than $N_i$, then the original weight $w_{oi}$ is smaller than 1, which means that there is more negative than positive evidence indicating that variable $V_i$ is associated with pancreatic cancer. In , almost all the original weights $w_{oi}$ are > 1, which may imply that most PubMed publications about risk factors are positive results and negative results are rare. Because we cannot tell if there is a publication bias toward only positive association, this warrants further study to guide the use of PubMed evidence.
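The interpretation of the original weight can be illustrated with a short sketch. It follows only the verbal description above (the ratio of positive to negative abstract counts) and does not reproduce Eq (1.1) and (1.2), so any further normalization defined there is omitted. The function names and the handling of a zero negative count are assumptions made for illustration.

```python
def original_weight(n_positive, n_negative):
    """Ratio P_i / N_i of positive to negative PubMed abstract counts.

    Returns None when N_i is 0; how that case should be handled is an
    assumption of this sketch, not something stated in the text.
    """
    if n_positive < 0 or n_negative < 0:
        raise ValueError("abstract counts must be non-negative")
    if n_negative == 0:
        return None
    return n_positive / n_negative


def evidence_direction(weight):
    """Read the weight as described: above 1 positive, below 1 negative."""
    if weight is None:
        return "undefined"
    if weight > 1:
        return "more positive evidence"
    if weight < 1:
        return "more negative evidence"
    return "balanced evidence"
```

For example, hypothetical counts of 30 positive and 10 negative abstracts give a weight of 3.0, which reads as more positive than negative evidence, whereas 5 positive and 20 negative abstracts give 0.25.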
We identified several tasks as future work to continuously improve the weighted BNI model for disease risk prediction. First, a highly accurate dataset is crucial to realizing the full potential of the weighted BNI and the software iDiagnosis for pancreatic cancer risk prediction. In this paper, we reused a dataset of 98 manually reviewed cases [21]; however, we faced significant challenges in verifying the completeness and accuracy of the information for the larger sample population, the 14,971-patient control group. Each variable entails laborious information extraction and summarization from PubMed and EHR. The same variable may appear in multiple formats in different data sources in the EHR (e.g., ICD-9 codes, various types of notes, and other structured data sources such as lab results). Our unstructured EHR data in the research data warehouse were pre-processed by MedLEE [10, 28-31], one of the best medical natural language processing systems, but the data accuracy was not close to 100%. Time was also a constraint in this study: for the smaller case sample, we used manual review to compensate for the NLP limitations, which was time-consuming. We also lacked a method to reconcile the inconsistencies between structured and unstructured data sources. Development, validation, and reuse of sophisticated phenotyping algorithms in the EHR are much needed to improve the efficiency and accuracy of EHR phenotyping.
Second, although the weighted BNI model improves the accuracy of pancreatic cancer prediction over the conventional BNI and the other two popular classification methods, it can be improved in multiple aspects, including the efficiency of variable generation and selection, prior probability calculation, and variable weight calculation. In this study, we selected 20 variables to predict pancreatic cancer risk. It is possible that unknown variables related to pancreatic cancer have not been included in our model. It is beyond our current capacity to define a model with hundreds or thousands of fine-grained phenotypic features related to pancreatic cancer. Therefore, efficient discovery of unknown disease features is a challenging research topic that needs further work.
Moreover, in this study, we used the batch processing mode to obtain data to calculate prior probabilities from EHR. It would be more efficient to support prior probability calculation using a real-time data warehouse to automatically update the parameters of the model online as the warehouse receives updates. An advanced analytical framework based on efficient EHR-phenotyping algorithms can be developed to increase the efficiency of dynamic prior probability calculation for each risk factor in vivo.
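One way to picture the online alternative to batch processing is a running estimate that is updated each time a new patient record arrives. The sketch below is a hypothetical illustration, not the implementation used in this study: the class name, the representation of a record as a collection of observed risk factor names, and the use of a simple relative frequency as the prior are all assumptions.

```python
from collections import defaultdict


class RunningPriors:
    """Incrementally maintained relative frequency of each risk factor."""

    def __init__(self):
        self.n_patients = 0
        self.counts = defaultdict(int)

    def update(self, present_factors):
        """Fold in one patient record (risk factors observed for that patient)."""
        self.n_patients += 1
        for factor in set(present_factors):  # count each factor once per patient
            self.counts[factor] += 1

    def prior(self, factor):
        """Current prior estimate, or None before any record has been seen."""
        if self.n_patients == 0:
            return None
        return self.counts.get(factor, 0) / self.n_patients
```

Each call to `update` costs time proportional to the number of factors in the new record, so the priors stay current without recomputing over the whole warehouse.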
Finally, in this weighted BNI network, we weighted nodes only. As an alternative, the causal edges between nodes can be weighted. In [32], Zhou et al. developed a causal-edge-weighted BNI for visual tracking and achieved better recognition results than with a conventional BNI. One direction for our future work is to investigate the efficacy of weighting causal edges for improving the predictive accuracy of BNI.