One of our objectives was to propose a new method for measuring the diversity of medical reports written in any language. The method is based on categorized attributes recorded in medical reports and on international classification systems. The general concept of diversity is derived from f-diversity and its modifications (relative f-diversity, self f-diversity and marginal f-diversity). Here we use f-diversities of the Gini-Simpson and Number of categories types. The method can be applied to compare the diversities of two samples of medical reports. We compared the diversities of a sample of Czech narrative medical reports and a sample of Czech structured medical reports. Both samples were collected in two outpatient departments of preventive cardiology run by the Municipal Hospital in Čáslav; the first outpatient department was located in Prague and the second one in Čáslav. The Municipal Hospital in Čáslav approved the use of these medical reports for our research. We used categorized attributes selected from the Minimal Data Model for Cardiology (MDMC). The medical reports were recorded by four physicians. We analyzed 1119 structured medical reports collected in Prague and 110 narrative medical reports collected in Čáslav. We included only narrative medical reports from Čáslav recorded by the same physicians who also collected the data for the structured medical reports in Prague.
Minimal data model for cardiology
Nowadays, there is a big boom in the development of electronic health records (EHRs). There is a general agreement that the EHR has the potential to improve the quality of medical care [25]. The most important requirement on EHRs seems to be the exchange and management of structured health information. For our study the field of cardiology was chosen because since 1994 the EuroMISE Centre [26] has been running two outpatient departments of preventive cardiology under the auspices of the Municipal Hospital in Čáslav, and therefore we have access mainly to cardiological data and medical records focused on cardiology. In 2002 the Minimal Data Model for Cardiology (MDMC) was developed within this research centre [27]. MDMC is a set of approximately 150 attributes together with their categorization, mutual relations, integrity restrictions, units, etc. Prominent professionals in the field of Czech cardiology agreed on these attributes as the basic data necessary for an examination of a patient in cardiology.
MDMC consists of eight groups of attributes. The first one is the administrative part. Then there is a family history part with information on parents and siblings. The next part, social history and addictions, focuses on marital status, physical activities, mental stress, smoking levels and alcohol consumption rates. One part of MDMC is devoted to allergies, mainly drug allergies. The personal history part records the presence of diabetes mellitus, whether the patient has suffered a stroke, whether he/she is treated for ischemic disease of the peripheral arteries, and includes attributes related to aortic aneurysm, other relevant diseases and menopause in women. In the part called Current difficulties of a possible cardiological origin, physicians focus on shortness of breath, chest pain, palpitations, swellings, syncope, cough, hemoptysis, and claudication. Another part determines what kind of treatment the patient undergoes, what type of diet is prescribed and which medications he/she uses. In the physical examination part, the patient's weight, height, body temperature, BMI, WHR, blood pressure, pulse and breathing rates, and pathological findings are recorded. Laboratory testing is focused on blood glucose, uric acid, total cholesterol, HDL-cholesterol, LDL-cholesterol, and triacylglycerols. The last part is focused on attributes related to ECG: the heart rate, the average PQ and QRS intervals and the overall ECG findings are recorded there.
Traditional measures of diversity
Traditional measures of diversity are based on categorized attributes. For a given attribute we determine categories A1, ..., Ak-1. Then we summarize the remaining findings in the "others" category, which we denote Ak. The two best-known measures of diversity are the following.
The Gini-Simpson index H_GS(p) is calculated from the probability distribution p = (p1, ..., pk) of the k categories of a given attribute as

H_{GS}(p) = 1 - \sum_{i=1}^{k} p_i^2 = \sum_{i=1}^{k} p_i (1 - p_i).

The Gini-Simpson index takes values in the interval [0, (k - 1)/k], where the lower boundary 0 is reached if and only if there is only one category of the studied attribute, and the upper boundary (k - 1)/k is reached for the uniform probability distribution. Originally it was suggested as a measure of inequality in income by Gini [29] and later discussed by Simpson [30] as a measure of ecological diversity.
The second one, the Shannon information index H_S(p), is calculated from the probabilities p1, ..., pk of the k categories of a given attribute as

H_{S}(p) = - \sum_{i=1}^{k} p_i \log p_i.

The Shannon information index takes values in the interval [0, log k], where the lower boundary 0 is reached if and only if there is only one category of the attribute and the upper boundary log k is reached for the uniform probability distribution p = u = (1/k, ..., 1/k).
It is hard to give a universal preference to one of these two measures. Some researchers are more familiar with the Shannon entropy and it is easier for them to interpret particular numerical values of HS (p) than those of HGS (p). On the other hand, the Gini-Simpson index is a very well-known traditional measure of diversity.
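Both indices are straightforward to compute from a vector of category probabilities. A minimal sketch in Python (the function names are ours, not from the paper):

```python
import math

def gini_simpson(p):
    """Gini-Simpson index H_GS(p) = 1 - sum(p_i^2)."""
    return 1.0 - sum(pi * pi for pi in p)

def shannon(p):
    """Shannon information index H_S(p) = -sum(p_i log p_i), with 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A single category gives zero diversity; the uniform distribution is maximal.
print(gini_simpson([1.0]))        # 0.0
print(gini_simpson([0.25] * 4))   # 0.75, i.e. (k - 1)/k for k = 4
print(shannon([0.25] * 4))        # log 4 ≈ 1.386
```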
f-diversity and relative f-diversity
Shannon information I_S(X;Y) is defined in information theory as a measure of association between two attributes X and Y:

I_S(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)},

where p(x,y) are the joint probabilities and p(x), p(y) the marginal probabilities of the categories of the attributes X and Y.

Shannon information I_S(X;Y) is nonnegative and equal to zero if and only if the attributes are independent. The maximal information is the Shannon entropy, obtained when Y = X. In case that the attribute X has categories A1, A2, ..., Ak occurring with probabilities p1, p2, ..., pk respectively, the Shannon entropy of the attribute X is the same as the Shannon information index

H_S(X) = I_S(X;X) = - \sum_{i=1}^{k} p_i \log p_i.
This measure of diversity will be further called Shannon diversity.
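For illustration, I_S(X;Y) can be computed directly from a table of joint probabilities; a minimal Python sketch (the function name is ours):

```python
import math

def shannon_information(joint):
    """I_S(X;Y) = sum over x, y of p(x,y) log( p(x,y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]        # marginal distribution of X (rows)
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y (columns)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent attributes: I_S = 0.
print(shannon_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Y = X: I_S equals the Shannon entropy of X, here log 2.
print(shannon_information([[0.5, 0.0], [0.0, 0.5]]))      # ≈ 0.693
```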
Shannon information can be generalized to the f-information

I_f(X;Y) = \sum_{x} \sum_{y} p(x)\, p(y)\, f\left( \frac{p(x,y)}{p(x)\, p(y)} \right),

where f is a convex function on the interval [0, ∞), strictly convex at t = 1 with f(1) = 0. For more details about f-information derived from the concept of f-divergence see Vajda [31]. In the case f(t) = t log t, the f-information I_f(X;Y) reduces to the Shannon information I_S(X;Y) that is widely used in pattern recognition and decision support, see e.g. [32]. For the first time f-information was systematically studied by Zvárová [36], who proved the representation of maximal f-information and called it f-entropy. In case that X is an attribute with categories A1, A2, ..., Ak and probability distribution p = (p1, p2, ..., pk), the f-entropy of the attribute X is

H_f(p) = I_f(X;X) = \sum_{i=1}^{k} p_i^2\, f\left( \frac{1}{p_i} \right) + f(0) \sum_{i=1}^{k} p_i (1 - p_i).

The f-entropy H_f(p) can be interpreted as an average unpredictability of the individual categories Ai of the attribute X [36]. In this sense f-entropy H_f(p) is a measure of diversity depending on the distribution p. The f-entropy H_f(p) will be called f-diversity if it moreover satisfies the following conditions:
• Hf (p) is non-negative,
• Hf (p) reaches its minimal value in case that there is one category with probability 1,
• Hf (p) reaches its maximal value in case that p = u is the uniform distribution,
• Hf (p) is a symmetric function of p,
• Hf (p) is a concave function on the system of all probability distributions p.
We can see that H_f(p) is a sum of two expressions, where the second one is nothing but the well-known Gini-Simpson index H_GS(p) multiplied by the constant f(0). Further we will call the Gini-Simpson index the Gini-Simpson diversity. In the paper [36] it was proved that f-diversities can be found among the f-entropies satisfying the condition that

g(t) = \frac{f(t) - f(0)}{t}

is a concave function. Then the f-entropy H_f(p) of the attribute X will reach its maximal value for the uniform distribution of categories p = u. We can see that the Gini-Simpson diversity H_GS(p) is the f-diversity with f(t) = t - 1 for t > 1, otherwise f(t) = 0. Similarly, the Shannon diversity is the f-diversity with f(t) = t log t. The relative f-diversity h_f(p) was defined in [37] as the f-diversity H_f(p) divided by the f-diversity of the uniform distribution:

h_f(p) = \frac{H_f(p)}{H_f(u)}.
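The representation of the f-entropy can be checked numerically. In the sketch below (helper names are ours) we evaluate H_f(p) = Σ p_i² f(1/p_i) + f(0) H_GS(p) and confirm that the choices f(t) = t log t and f(t) = t − 1 for t > 1 (otherwise 0) reproduce the Shannon and Gini-Simpson diversities; the relative f-diversity is then obtained by dividing by the value at the uniform distribution:

```python
import math

def f_entropy(p, f):
    """H_f(p) = sum(p_i^2 f(1/p_i)) + f(0) * sum(p_i (1 - p_i))."""
    gs = sum(pi * (1 - pi) for pi in p)  # Gini-Simpson index H_GS(p)
    return sum(pi * pi * f(1 / pi) for pi in p if pi > 0) + f(0) * gs

def relative_f_diversity(p, f):
    """h_f(p) = H_f(p) / H_f(u), u being the uniform distribution."""
    k = len(p)
    return f_entropy(p, f) / f_entropy([1 / k] * k, f)

f_shannon = lambda t: t * math.log(t) if t > 0 else 0.0
f_gs = lambda t: t - 1 if t > 1 else 0.0

p = [0.5, 0.3, 0.2]
h_s = -sum(pi * math.log(pi) for pi in p)  # Shannon diversity computed directly
h_gs = 1 - sum(pi * pi for pi in p)        # Gini-Simpson diversity computed directly
print(abs(f_entropy(p, f_shannon) - h_s) < 1e-12)  # True
print(abs(f_entropy(p, f_gs) - h_gs) < 1e-12)      # True
print(relative_f_diversity(p, f_gs))               # between 0 and 1
```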
Measures of rarity, self and marginal f-diversity
In case that X is an attribute with categories A1, ..., Ak and a probability distribution p = (p1, ..., pk), then according to Patil and Taillie [38] the rarity R(i, p) of the category Ai depends only on the numerical value of pi. Denoting the rarity of the category Ai by R(pi), the diversity index Δ(p) associated with the measure of rarity R is its average rarity, calculated as

\Delta(p) = \sum_{i=1}^{k} p_i\, R(p_i).
Three widely used diversity indexes are: the Number of categories (Number of categories diversity) with rarity R(p_i) = 1/p_i - 1, giving

\Delta(p) = \sum_{i=1}^{k} p_i \left( \frac{1}{p_i} - 1 \right) = k - 1,

the Gini-Simpson index (Gini-Simpson diversity) with rarity R(p_i) = 1 - p_i, giving

\Delta(p) = \sum_{i=1}^{k} p_i (1 - p_i),

and the Shannon index (Shannon diversity) with rarity R(p_i) = - log p_i, giving

\Delta(p) = - \sum_{i=1}^{k} p_i \log p_i.

These three diversity indexes belong to the family of diversity indexes of order β [38] defined as

\Delta_\beta(p) = \sum_{i=1}^{k} p_i\, \frac{1 - p_i^{\beta}}{\beta}.
We can see that for β → 0 we obtain the Shannon diversity (as a limit), for β = 1 the Gini-Simpson diversity and for β = -1 the Number of categories diversity. As was shown above, all three of these diversity indexes belong to the family of f-diversities.
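The order-β family can be sketched as follows (the function name is ours; β = 0 is implemented directly as its Shannon limit):

```python
import math

def diversity_beta(p, beta):
    """Patil-Taillie diversity of order beta: sum(p_i (1 - p_i^beta) / beta).
    beta = 0 is handled as the limiting case, the Shannon diversity."""
    if beta == 0:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return sum(pi * (1 - pi ** beta) / beta for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
print(diversity_beta(p, -1))  # Number of categories diversity: k - 1 = 2 for k = 3
print(diversity_beta(p, 1))   # Gini-Simpson diversity: 1 - sum(p_i^2) = 0.62
print(diversity_beta(p, 0))   # Shannon diversity, about 1.0297
```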
Let us introduce the concept of self f-diversity [39], which is a generalization of the rarity introduced by Patil and Taillie [38]. The self f-diversity R_{f,i}(p) of the i-th category is defined as

R_{f,i}(p) = p_i\, f\left( \frac{1}{p_i} \right) + f(0)(1 - p_i).

Then it can be proved that the f-diversity can be calculated from the self f-diversities as

H_f(p) = \sum_{i=1}^{k} p_i\, R_{f,i}(p).

Therefore the f-diversity H_f(p) is the weighted average of the self f-diversities R_{f,i}(p). For the often used Shannon diversity the Shannon self diversity is equal to

R_{S,i}(p) = - \log p_i,

known in information theory also as self-information. Similarly, for the Gini-Simpson diversity the Gini-Simpson self diversity is equal to

R_{GS,i}(p) = 1 - p_i.
Another view of the impact of the i-th category opens if we do not distinguish among the other categories. In this case we formally work with two categories (a dichotomy) with probabilities p_i and 1 - p_i. Then the marginal f-diversity of the i-th category is defined as

M_{f,i}(p) = H_f\big( (p_i,\, 1 - p_i) \big).
Next, we introduce the relative self diversity and the relative marginal diversity, normalized analogously to the relative f-diversity by the corresponding values under the uniform distribution. We define the relative self-diversity of the i-th category as

r_{f,i}(p) = \frac{R_{f,i}(p)}{R_{f,i}(u)},

and the relative marginal diversity of the i-th category as

m_{f,i}(p) = \frac{M_{f,i}(p)}{H_f\big( (1/2,\, 1/2) \big)}.
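The self and marginal f-diversities can be sketched in Python as well (helper names are ours); the check below verifies that the weighted average of the self f-diversities recovers the f-diversity, and that the Shannon self diversity equals the self-information -log p_i:

```python
import math

def self_f_diversity(p, i, f):
    """Self f-diversity of the i-th category: p_i f(1/p_i) + f(0)(1 - p_i)."""
    pi = p[i]
    return pi * f(1 / pi) + f(0) * (1 - pi)

def f_entropy(p, f):
    """f-diversity as the weighted average of the self f-diversities."""
    return sum(p[i] * self_f_diversity(p, i, f) for i in range(len(p)))

def marginal_f_diversity(p, i, f):
    """Marginal f-diversity: the f-diversity of the dichotomy (p_i, 1 - p_i)."""
    return f_entropy([p[i], 1 - p[i]], f)

f_shannon = lambda t: t * math.log(t) if t > 0 else 0.0

p = [0.5, 0.3, 0.2]
# Shannon self diversity of the first category equals -log(0.5).
print(abs(self_f_diversity(p, 0, f_shannon) - (-math.log(0.5))) < 1e-12)  # True
# Weighted average of the self diversities equals the Shannon diversity.
h_s = -sum(pi * math.log(pi) for pi in p)
print(abs(f_entropy(p, f_shannon) - h_s) < 1e-12)  # True
# Marginal diversity of the first category: entropy of (0.5, 0.5) = log 2.
print(abs(marginal_f_diversity(p, 0, f_shannon) - math.log(2)) < 1e-12)  # True
```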
For comparing the diversities of the sample of 110 Czech narrative medical reports and the sample of 1119 Czech structured medical reports we used diversities of the Gini-Simpson type. The reason is that it was shown in [40] that an ideal estimator of the Shannon-type diversity does not exist.