Family history is an important component of modern clinical care, especially in the era of precision medicine. Family history information in the Electronic Health Record (EHR) system is usually stored in structured format as well as in free-text format. In this study, we systematically analyzed a family history text corpus drawn from more than 3 million clinical notes for patients receiving their primary care at Mayo Clinic. Family members, medical problems, and their associations were analyzed and reported. Our findings showed high agreement between positive/negated medical problems mentioned in the diagnosis report and the family history, as measured by observed agreement and random agreement. We also found that the family history of some medical problems was documented up to 10 to 15 years prior to the diagnosis date of those problems. Finally, two patient cases were studied to show the medical problems in the diagnosis and family history along a timeline.
A large number of common diseases (e.g., cardiovascular diseases, cancers, and Alzheimer’s disease) have been shown to be familial 1–6. Family history information, which can reflect genetic susceptibility to familial conditions, is essential in clinical care in the era of precision medicine. Additionally, family history captures shared environmental and behavioral risk factors among family members, which are also important in clinical care 7–10. Family history information has been utilized for risk assessment and stratification, clinical decision support, and clinical research 11, 12.
Family history can be stored as structured and unstructured data in the Electronic Health Record (EHR) system. How to capture comprehensive family history information remains a research question 13. We hypothesize that systematically analyzing family history information in clinical notes can provide more insight into the underlying utility of family history in clinical care, because clinicians tend to document information that assists their decision-making process (i.e., family history information recorded by clinicians may be more valuable for patient care than the family history information stored systematically).
In this paper, we provide a systematic analysis of the family history section in clinical notes leveraging natural language processing (NLP). We utilized all clinical notes for a cohort of patients who received their primary care at Mayo Clinic in 2013 to perform the analysis. Family members, medical problem mentions and their associations within the corresponding family history were analyzed and reported. In addition, we used observed agreement and random agreement to measure the agreement between positive/negated medical problem mentions in the diagnosis report and those in the family history section. We then studied the predictive power of family history for diagnosis by analyzing the timeline of medical problems in family history and diagnosis.
Many studies focus on how to accurately extract information from free-text family history using NLP 14–18. Friedlin et al. developed a rule-based NLP system for extracting and coding family history data from hospital admission notes 16. Goryachev et al. developed a simple rule-based NLP algorithm to identify and extract family history from discharge summaries and outpatient clinical notes 17. Bill et al. proposed an Unstructured Information Management Application (UIMA)-based NLP module for automated extraction of family history 14.
Apart from extracting information from free-text family history, several studies have performed systematic analyses using structured family history information. For example, Marder et al. used structured family history information for Parkinson’s disease 19; Silverman et al. investigated the application of structured family history to genetic studies of Alzheimer’s disease 20; Endevelt et al. summarized the information in structured family history 8. However, few studies focus on systematically analyzing the information contained in free-text family history. The study conducted by Chen et al. reported an analysis of free-text family history comments in the EHR 21. Yet, their study has two shortcomings. First, the free-text family history data are auxiliary comments on the structured family history. Those data supplement the structured data and may not contain complete family history information. Second, only cancer-related information was extracted and studied in their work.
As we aim to systematically assess the associations between medical problems and family history information, the cohort we select includes patients who received their primary care at Mayo Clinic in 2013. We assume that their specialty care would also be provided by Mayo Clinic. The resultant corpus used in our study contains 3,224,427 clinical notes for a total of 115,710 patients with an average of 27.8 notes per patient. The dataset cannot be made public because it is private and contains protected health information (PHI).
In the Mayo Clinic EHR system, the free-text family history is recorded and stored in the “family history section” of each clinical note. The “family history section” can be identified by matching the section header “Family History:20109” in the clinical notes, where “20109” is the section ID. The NLP tool MedTagger 22, combined with a previously developed family history identification method 18, is utilized in this study to extract two major kinds of information from the family history section: (1) assembled medical problems, and (2) family members. The family members are extracted according to the list in Table 1 by using regular expression patterns. We omit the “paternal” and “maternal” modifiers for “grandfather”, “grandmother”, “aunt”, and “uncle” in this study. Since the accuracy of the NLP tools has been verified in previous studies 9, 18, we do not re-evaluate it in this study.
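The section matching and family member extraction described above can be sketched as follows. This is only an illustration, not the MedTagger implementation: the member list is a hypothetical stand-in for Table 1, the function names are ours, and the sketch assumes for simplicity that the family history section runs to the end of the note.

```python
import re

# Section header given in the text; "20109" is the section ID.
SECTION_HEADER = "Family History:20109"

# Hypothetical stand-in for the family member list of Table 1.
FAMILY_MEMBERS = [
    "grandfather", "grandmother", "father", "mother", "sister",
    "brother", "aunt", "uncle", "son", "daughter",
]

# An optional "paternal"/"maternal" modifier is matched but discarded,
# mirroring the decision to omit side-of-family information.
MEMBER_PATTERN = re.compile(
    r"\b(?:(?:paternal|maternal)\s+)?(" + "|".join(FAMILY_MEMBERS) + r")s?\b",
    re.IGNORECASE,
)


def family_history_text(note: str) -> str:
    """Return the text following the section header, or '' if it is absent.

    Assumption: the section extends to the end of the note.
    """
    _, found, rest = note.partition(SECTION_HEADER)
    return rest.strip() if found else ""


def extract_family_members(text: str) -> list:
    """Return normalized family member terms in order of appearance."""
    return [m.group(1).lower() for m in MEMBER_PATTERN.finditer(text)]


note = (
    "Chief Complaint: cough\n"
    "Family History:20109\n"
    "Maternal grandmother had breast cancer. "
    "Father and two brothers have hypertension."
)
members = extract_family_members(family_history_text(note))
```

The word-boundary anchors keep “father” from matching inside “grandfather” and “son” from matching inside unrelated words such as “person”.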
Hierarchical clustering is used to analyze the association between medical problems and family members in the family history 23. In this study, our analysis is based on the sentence-level co-occurrence of medical problems and family members. We calculate the frequency of co-occurrence of each medical problem and each family member. Suppose the frequency of the ith medical problem for the jth family member is $f_{ij}$; the frequency vector of the ith medical problem across the n family members can then be written as $f_i = (f_{i1}, f_{i2}, \ldots, f_{in})$. We then apply agglomerative clustering to $f_1$ through $f_m$, where m is the number of medical problems. In agglomerative clustering, the two closest frequency data points are merged into one cluster according to the Euclidean distance defined as below:
$$d(f_i, f_k) = \sqrt{\sum_{j=1}^{n} \left(f_{ij} - f_{kj}\right)^2} \tag{1}$$
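Subsequently, average linkage clustering is used to merge pairs of clusters according to the average distance between clusters. The average distance between clusters $C_1$ and $C_2$ is computed as follows:
$$d(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{f_i \in C_1} \sum_{f_k \in C_2} d(f_i, f_k) \tag{2}$$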
By iteratively doing so, all the data points can be merged into a single cluster. The same clustering methods are applied to the n frequency vectors of family members for m medical problems. Finally, the association between medical problems and family members can be observed through the cluster hierarchy.
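The clustering procedure can be illustrated with a minimal pure-Python sketch of agglomerative clustering under Euclidean distance and average linkage. The co-occurrence counts below are invented for illustration and are not taken from the FH Corpus; the function names are ours, and a production analysis would more likely rely on an established clustering library.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of row indices into `vectors`.
    """
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        distance = average_linkage(a, b, vectors)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, distance))
    return merges


# Invented co-occurrence counts; columns are (father, mother, sister).
problems = ["hypertension", "high blood pressure", "breast cancer"]
vectors = [[10, 9, 1], [9, 10, 1], [0, 8, 7]]
merges = agglomerate(vectors)
```

In this toy input the two blood-pressure terms merge first because their frequency profiles across family members are nearly identical, and breast cancer joins last. Clustering family members instead of medical problems amounts to running the same function on the transposed table.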
To analyze the association between medical problems in family history and those in diagnosis, we use the diagnosis sections of those patients who have family history. The same NLP algorithm is applied to extract medical problem mentions from the diagnosis sections. Observed agreement and random agreement are used to measure the agreement between the positive/negated medical problem mentions in diagnosis and those in family history. The observed agreement and random agreement measures are defined by:
$$\text{Observed agreement} = \frac{a + d}{a + b + c + d} \tag{3}$$
$$\text{Random agreement} = \frac{(a + b)(a + c) + (c + d)(b + d)}{(a + b + c + d)^2} \tag{4}$$
where a denotes the frequency of positive medical problems mentioned in family history while positive in diagnosis, b the frequency of positive medical problems mentioned in family history while negated in diagnosis, c the frequency of negated medical problems mentioned in family history while positive in diagnosis, and d the frequency of negated medical problems mentioned in family history while negated in diagnosis.
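As a small worked example of the two measures, the sketch below assumes their standard forms: observed agreement is the share of concordant cells, and random agreement is the concordance expected by chance from the row and column totals. The counts are invented and are not those of Table 4.

```python
def observed_agreement(a, b, c, d):
    """Share of mentions whose polarity matches in family history and diagnosis."""
    return (a + d) / (a + b + c + d)


def random_agreement(a, b, c, d):
    """Agreement expected by chance, computed from the marginal totals.

    (a + b) and (c + d) are the positive and negated totals in family history;
    (a + c) and (b + d) are the positive and negated totals in diagnosis.
    """
    n = a + b + c + d
    return ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2


# Invented counts for illustration only.
a, b, c, d = 40, 10, 5, 45
p_observed = observed_agreement(a, b, c, d)  # (40 + 45) / 100
p_random = random_agreement(a, b, c, d)      # (50 * 45 + 50 * 55) / 100 ** 2
```

Reporting both numbers side by side matters because the gap between them, rather than the observed value alone, shows how much of the agreement exceeds what the marginal totals would produce by chance.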
Out of 115,710 patients, 77,810 (67.2%) have family history sections in their EHRs. This proportion of patients with documented family history is much higher than the 12% reported in Endevelt et al.’s study 8, which suggests that physicians at Mayo Clinic pay considerable attention to family history information. Those patients form the Family History (FH) Cohort. We retrieved the EHRs of the FH Cohort from the Mayo Clinic data repository and extracted the family history sections. This resulted in a corpus of 278,918 family history documents, which is denoted as the FH Corpus hereafter. The family history is generally written as semi-structured text, short sentences and narratives. Table 2 lists a few examples of the family history from the FH Corpus.
Because the family history information recorded for male patients differs from that recorded for female patients, and also varies with patient age, the gender and age statistics reported in this section help characterize the study cohort and corpus. Our first observation is that family history is documented more frequently for females (58.5%) than for males (41.5%). This result is consistent with Endevelt et al.’s results 8. Female patients, according to one study, tend to “have longer visits, ask more questions, get more information, receive more counseling, send and receive more emotionally-concerned statements and appear more involved in the interaction than male patients” 24.
Figure 1 shows the age distribution of the cohort. About 33.3% of patients are younger than 30. Apart from this group, the 40 to 49 (15.7%) and 50 to 59 (15.5%) age ranges contain more patients than the other ranges.
Figure 2 shows the distribution of family members mentioned in the family history. Mentions of “father” (24.7%) and “mother” (23.7%) clearly outnumber those of other family members, which is expected because parental history of disease is highly associated with a patient’s health. Together with “father” and “mother”, “grandfather” (7.6%), “grandmother” (10.0%), “sister” (8.9%) and “brother” (8.0%) account for a total of 82.9% of family member mentions. These family members are of great importance for understanding family health genealogy.
A previous study found that the medical, developmental and pregnancy outcomes of first-, second-, and third-degree relatives are the most useful family history for a patient 25. First-degree relatives, including parents, offspring, and siblings, share 50% of their genes with the patient. Second-degree relatives, including aunts and uncles, grandparents, half siblings, nieces and nephews, share 25% of their genes with the patient. Third-degree relatives, including cousins and great-grandparents, share only 12.5% of their genes with the patient 26. Therefore, physicians would be expected to record more information about first-degree relatives in the family history. Figure 2 supports this expectation: first-degree relatives account for 71.2% of mentions, second-degree relatives for 27.8%, and third-degree relatives for only 1.0%.
We extracted 901,649 medical problem mentions corresponding to 9,646 unique medical problems. The ten most frequent medical problems are listed in Table 3. Among them, hypertension, high blood pressure, high cholesterol and CAD are common cardiovascular diseases for which family history has been verified as a risk factor 10. Diabetes and heart disease are early cardiovascular-related events 27. Cancers are usually phenotypic diseases that can be revealed by the family history 28, 29. Thus, physicians are interested in those medical problems in the family. In addition, we find that physicians also pay attention to mental disorders in the family history (depression and alcohol abuse). This is consistent with studies reporting that family history may be enough to predict mental disorders due to the shared environment 30, 31.
Family history may also improve the chances of early detection of rare diseases, since many rare diseases are gene-related medical problems. For example, hemophilia is an X-linked disease 32. Apart from the frequent medical problems, many rare diseases are also found in the free-text family history. For example, the frequencies of hemophilia and sickle cell anemia in the FH Corpus are 124 and 39, respectively. Though many standards and tools have been developed to gather information on common diseases in family history, there are currently no guidelines or standards on the collection of rare diseases at the point of care. However, the findings show that physicians at Mayo Clinic have paid attention to the family history of rare diseases.
Given the identified family members and medical problems in the FH Corpus (positive and negated), we study the association between them, i.e., which medical problems are most often considered for a specific family member. Applying hierarchical clustering to the co-occurrence frequencies of medical problems and family members, we plot a heat map along with the clusters in Figure 3.
We have the following observations: (1) There are roughly four clusters, indicated by green rectangles in Figure 3. (2) Almost all medical problems are considered for “father” and “mother”. (3) Breast cancer appears frequently for female family members (“mother”, “grandmother”, “sister”, and “aunt”), while alcohol abuse appears for both male and female family members (“father”, “mother”, “sister”, “grandfather”, “brother”). (4) Cancer and diabetes cluster together, as do high blood pressure, high cholesterol and hypertension. The former grouping is consistent with reports that cancer and diabetes are comorbidities 33, 34. (5) Interestingly, CAD, alcohol abuse, heart disease and high cholesterol are the most considered problems for “father”, while depression is relatively the most considered for “mother”.
To illustrate the medical problems of greatest concern across the genealogy, Figure 4 shows a family tree in which each family member is associated with the top 5 medical problems for that family member. Hypertension and depression are the two most frequently considered problems for each family member. CAD, diabetes, and MI are commonly mentioned problems for the patient’s siblings, parents and grandparents. Asthma is not among the top 10 medical problems in the FH Corpus, but it is one of the most frequent medical problems for the patient’s siblings and children. The same holds for colon cancer for grandfather, uncle and aunt; osteoporosis for mother; and ovarian cancer for aunt. Interestingly, alcohol abuse and mental illness are among the top 5 considered medical problems for the patient’s son and daughter. This may be due to the fact that people and their children usually live in a common environment, which is a key factor in both alcohol abuse and mental illness.
In this section, we study the association between the medical problems in family history and those in the patient’s diagnosis reports. For each patient in the FH Cohort, we count the positive and negated medical problems mentioned in diagnosis reports. For each positive or negated mention, we then check whether it is positive or negated in the family history and count the frequency. The accumulated results for the FH Cohort are summarized in a contingency table, as shown in Table 4. Observed agreement and random agreement are used to evaluate the agreement of positive and negated medical problems mentioned in diagnosis and family history. According to the definitions in Equations (3) and (4), the observed agreement and random agreement are 0.8100 and 0.7929, respectively. These measures indicate high agreement between medical problems mentioned in diagnosis and family history. This result suggests that the family history might have predictive power for the diagnosis.
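The counting step behind such a contingency table can be sketched as follows. The data layout (one dictionary per patient mapping a problem name to "positive" or "negated") and the function name are assumptions made for illustration, and problems are compared by exact string match.

```python
from collections import Counter


def tally_patient(diagnosis, family_history):
    """Count (family history status, diagnosis status) pairs for one patient.

    Both arguments map a medical problem name to "positive" or "negated".
    Diagnosis problems absent from the family history are counted under
    the family history status "not found".
    """
    counts = Counter()
    for problem, dx_status in diagnosis.items():
        fh_status = family_history.get(problem, "not found")
        counts[(fh_status, dx_status)] += 1
    return counts


# Two invented patients; the cohort table is the sum of per-patient counts.
patients = [
    (
        {"hypertension": "positive", "asthma": "positive", "sinusitis": "positive"},
        {"hypertension": "positive", "asthma": "negated"},
    ),
    (
        {"diabetes": "negated", "depression": "positive"},
        {"diabetes": "negated", "depression": "positive", "cancer": "positive"},
    ),
]
table = Counter()
for diagnosis, family_history in patients:
    table += tally_patient(diagnosis, family_history)
```

The four cells keyed by positive or negated status on both sides correspond to the counts a, b, c and d used by the agreement measures, while the "not found" entries are kept separate.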
Table 5 lists the twenty most frequent medical problems that are positive in diagnosis while negated in family history, and the twenty most frequent medical problems that are negated in diagnosis while positive in family history. As for the “not found” medical problems, it is notable that 95.9% of the positive medical problems mentioned in diagnosis are not found in family history. Those “not found” mentions might be positive or negated problems that physicians regard as irrelevant to the patient’s illness, or problems for which physician input is lacking.
To determine whether diagnosed medical problems are mentioned in family history prior to the diagnosis date, we extracted the patients who had identical medical problems in family history and diagnosis, and calculated the number of years between the diagnosis date of a medical problem and the date that problem was first mentioned in the patient’s family history. The results for the five most common medical problems, hypertension, hyperlipidemia, depression, asthma and cancer, are summarized in Figure 5. We observed that those medical problems were mentioned in the family history up to 15 years prior to the diagnosis date. Of these patients, 36.3%, 33.1%, 39.2%, 25.8% and 52.2% had a family history of hypertension, hyperlipidemia, depression, asthma and cancer, respectively, documented before they were diagnosed with those medical problems, and 3.6%, 2.7%, 2.2%, 1.9% and 6.1% had a family history of those problems documented 10 to 15 years prior to the diagnosis date.
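The lead-time calculation can be sketched as below. The 5-year bin width, the function names and the dates are assumptions made for illustration; the binning actually used for Figure 5 is not specified in the text.

```python
from datetime import date


def years_before_diagnosis(first_fh_mention: date, diagnosis_date: date) -> float:
    """Years from the first family history mention to the diagnosis date.

    Positive values mean the family history mention came first.
    """
    return (diagnosis_date - first_fh_mention).days / 365.25


def lead_time_bucket(years: float, width: int = 5) -> str:
    """Assign a lead time to a bin of `width` years (hypothetical binning)."""
    if years <= 0:
        return "not before diagnosis"
    lower = int(years // width) * width
    return f"{lower}-{lower + width} years"


# Invented dates for one patient and one medical problem.
lead = years_before_diagnosis(date(2003, 6, 1), date(2013, 6, 1))
bucket = lead_time_bucket(lead)
late = lead_time_bucket(years_before_diagnosis(date(2014, 1, 1), date(2013, 1, 1)))
```

Counting patients per bin for each medical problem would then give the kind of distribution summarized in the figure.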
To show the personalized association between family history and diagnosis, we took two patients as examples to illustrate their medical problems in family history and diagnosis. Figure 6 displays the medical problems in their family history and diagnosis along a timeline starting from the first clinical note. For Patient 1, asthma and obesity were found in family history about one year before the first diagnosis, and hypertension more than two years before the first diagnosis. Hyperlipidemia was found in family history after the first diagnosis. Pharyngitis, URI, chest pain, abdominal pain and sinusitis were not found in family history because these are specific problems for which information was not compiled in family history. For Patient 2, hypertension appeared in family history around 6 years before diagnosis, and obesity appeared slightly earlier than diagnosis. These results also suggest that family history has predictive power for diagnosis.
We have described a systematic analysis of family history information using a cohort of patients receiving their primary and specialty care at Mayo Clinic. We applied NLP to extract medical problems and family members from the free-text family history. We did not distinguish “maternal relative” and “paternal relative” in the analysis in spite of the importance of specification of side of family for familial disease study. The reason is that extraction of simple family member terms results in a higher accuracy. Future work would consider involving extraction of “maternal” and “paternal” information. In addition, this study focuses on analysis of unstructured free-text family history. A comparison of structured family history and unstructured free-text family history is also interesting and subject to a future study.
Semi-structured family history usually follows certain structures that are frequently used in clinical notes. For narrative family history, physicians spend time gathering relevant family history information, which can be more informative in clinical care. A comparison of semi-structured family history and narrative family history in supporting clinical decision-making would be of interest in a future study.
From our study, we observe that certain diseases are recorded in the family history while others are not. The reason is that recording the family history is highly influenced by the clinical context. For example, the patients with a specific familial condition will be asked by physicians about the relevant family history. Therefore, some rare diseases are included in the family history and some are not. What disease information should be considered and collected in the family history is still a challenge and needs future studies.
The study of agreement between medical problems in family history and those in diagnosis shows some evidence of using family history to predict a patient’s future health. Many researchers have found that it is possible to predict medical problems by joint use of family history and other factors 30, 31, 35–37. However, few studies utilize the free-text family history from EHR to predict a patient’s future health. An automatic system that utilizes NLP tools to extract information from family history and applies probabilistic models to calculate the probability of a patient’s future illness is our future study focus. In addition, a timeline visualization tool for showing the information in family history and diagnosis might also facilitate personalized health care, which requires further study. Note that we used exact matches in assessing agreement and did not take into consideration association among the medical problems. One of the future directions would be incorporating the association information leveraging ontologies or empirical data into our analysis.
Free-text family history contains important and valuable information for physicians and clinicians. This is the first systematic analysis of a large free-text family history data set. The aim of this study is to increase awareness of the importance of family history by analyzing the information contained in free-text family history. We reported the family members and top medical problems mentioned in the corpus as well as their associations. The analysis of patients’ diagnosed medical problems and the corresponding problems in family history implies the potential use of family history for predicting medical problems. The results also have implications for physicians’ training in and learning of family history.