This paper describes the use of a three-step clustering method to group cases of wild animals found dead with similar post-mortem findings, over a period of ten years in France, for syndrome definition.
The SAGIR network continuously collects data from investigations of causes of mortality in free-ranging animals in France. However, there is some variability in the intensity of surveillance both spatially and among species, which influences the representativeness of the database. The network provides a more accurate picture of health events for game species than for non-game animals [31]. Furthermore, the network's activity is uneven from one département to another. Nevertheless, these differences have been relatively stable over time, so the quantity and quality of data appeared suitable for trend analysis and detection of unusual health events [51].
Although the laboratory staff involved in the network have been regularly trained in post-mortem examination of wildlife cadavers, differences in the precision of descriptions contributed to the complexity of the database. Nevertheless, these descriptions were assumed to be more reliable than diagnostic conclusions, because the process of arriving at a cause of death did not follow a standardised procedure.
Methods of classifying qualitative variables are dependent on the number of occurrences for each modality, and small counts make a minor contribution to the variance of the factorial axes [38]. The number of terms used for coding the variables was reduced by preliminary work, and we tried to minimise the risk of misinterpretation by relying on the skill of experts and other sources of reference. For statistical reasons some categories had to be combined further (e.g. "genital organs" alone were mentioned only 193 times, so they were combined with "urinary organs"). For some other categories, the level of detail in the descriptions varied (e.g. "respiratory organs" instead of "lung" or "trachea"). We decided not to group these categories together, in order to keep as much precision as possible. These choices may have influenced the outcomes of the classification. However, results were consistent: "respiratory organs" together with "lung" and "trachea" were characteristic of Cluster 7, "lung" alone was characteristic of Clusters 4 and 5, and "trachea" alone of Cluster 9.
Variables were split into active and illustrative ones to avoid redundancy and limit insignificant noise, produced for example by information that was not necessarily linked to the case's cause of death. Noise reduction was also the reason for retaining only the coordinates on the first five axes of the MCA. These axes were used regardless of their rank, because each represented very different biological information that retained the most differentiating characteristics of the dataset.
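The step of keeping only the first five MCA axes can be illustrated with a minimal sketch. It assumes a textbook MCA computed as a correspondence analysis of the indicator (one-hot) matrix of the active variables; the function names `indicator_matrix` and `mca_row_coordinates` and the toy data are hypothetical, and this is not the software or exact procedure used in the study.

```python
import numpy as np

def indicator_matrix(data):
    """One-hot encode a 2-D array of integer category codes (one column per variable)."""
    blocks = []
    for j in range(data.shape[1]):
        levels = np.unique(data[:, j])
        # Compare each value with every level of variable j: shape (n_cases, n_levels)
        blocks.append((data[:, [j]] == levels[None, :]).astype(float))
    return np.hstack(blocks)

def mca_row_coordinates(data, n_axes=5):
    """Return case coordinates on the first n_axes axes and the matching eigenvalues."""
    Z = indicator_matrix(data)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (case) masses
    c = P.sum(axis=0)                    # column (modality) masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * sing) / np.sqrt(r)[:, None]     # principal coordinates of the cases
    return F[:, :n_axes], sing[:n_axes] ** 2

# Toy example: 50 hypothetical cases described by 4 active variables with 3 levels each
rng = np.random.default_rng(0)
toy_cases = rng.integers(0, 3, size=(50, 4))
coords, eigenvalues = mca_row_coordinates(toy_cases, n_axes=5)
```

Truncating to a few axes discards the low-inertia directions, which is the noise-reduction effect described above; illustrative variables would be left out of `data` and only examined after clustering.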
The statistical classification procedure used here showed its ability to handle large datasets and identify pathologically relevant characteristics. However, it should be noted that the cluster description does not address the full range of lesions found on an animal. It merely indicates features that are characteristic and allow clusters to be distinguished. As a result, the cases which were infrequent or poorly defined were gathered in a cluster (Cluster 3) that is difficult to qualify as an entity. Diseases that remained rare or those that induced only unspecific lesions, such as congestion of different organs, could not be highlighted by our approach.
The clusters obtained in this study were of three different types: those which were species- and disease-specific (Clusters 7, 8 and 9), those suggestive of specific conditions but not species-specific (Clusters 1, 5 and 6), and the others, covering a broad pathologic condition (Clusters 2 and 4). It might be interesting to group Clusters 7 and 9 for further epidemiological analysis as they seem to present two different views of the same disease.
The characteristics of the clusters derived from our analysis are consistent with features found in previous epidemiological studies on wildlife diseases in this country [42, 52-55]. The clusters reflect the most distinct and most frequent disease entities on which the surveillance network focused. The importance of investigations into VHD and EBHS, for example, which were emerging diseases in the early 1990s [50, 56], was decisive in defining two clusters.
The statistical classification of cases collected by the French SAGIR network could lead to the adoption by the surveillance community of eight distinct syndromes:
1) a hemorrhagic syndrome, interesting because it allows accidental wildlife intoxications to be monitored [42] and could potentially also detect anthrax cases [16];
2) an enteritic/diarrheic syndrome, which could reflect environmental constraints, such as changes in food supply [57] or density-related parasite burdens [58, 59];
3) a multifactorial (parasites and toxigenic bacteria) syndrome, more specific to the difficult living conditions of wild ruminants [55];
4) a respiratory syndrome, which is a disease complex that takes a regular toll on wildlife [44];
5) a trauma-related syndrome, representing one of the foremost causes of death in our database, but less interesting from an epidemiological point of view;
6) a syndrome of acute hepatitis-like diseases, which reflects the importance of EBHS and VHD, especially during the study period, and could be useful for other emerging forms of hepatitis;
7) a syndrome of subacute or chronic diseases of the liver, kidney and spleen, caused mostly by endemic bacteria. This syndrome could be useful for the monitoring of tularemia and salmonella outbreaks, potentially threatening public health [60, 61];
8) a miscellaneous syndrome; despite being difficult to interpret, this syndrome is worth considering, because an unknown disease would probably first inflate this group before being recognised as a distinct entity.
Future cases can be attributed to the defined syndromes by determining their MCA-derived representation and the cluster they belong to [40]. We used this procedure on the remaining 14,519 cases collected between 1998 and 2007. Missing information was completed statistically by multivariate imputation. MCA with the above-determined eigenvalues was used to calculate the coordinates of these additional cases in the five-dimensional space. These coordinates were used to determine the cluster to which each case belonged (smallest Euclidean distance to cluster centroid). Clustering quality of the whole dataset (R² = 0.605) was not substantially different from that of the initial dataset (R² = 0.62) (unpublished work).
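The assignment rule described here (smallest Euclidean distance to a cluster centroid in the five-dimensional MCA space) can be sketched as follows. The sketch assumes that the coordinates of the new cases have already been computed, and that R² denotes the share of total inertia explained by the partition (between-cluster over total sum of squares); the function names and toy values are hypothetical and do not come from the SAGIR data.

```python
import numpy as np

def assign_to_clusters(coords, centroids):
    """Return, for each case, the index of the nearest centroid (Euclidean distance)."""
    # diffs has shape (n_cases, n_clusters, n_axes)
    diffs = coords[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.argmin(axis=1)

def r_squared(coords, labels):
    """Share of total inertia explained by the partition (assumed definition of R^2)."""
    grand_mean = coords.mean(axis=0)
    total_ss = ((coords - grand_mean) ** 2).sum()
    within_ss = 0.0
    for k in np.unique(labels):
        members = coords[labels == k]
        within_ss += ((members - members.mean(axis=0)) ** 2).sum()
    return 1.0 - within_ss / total_ss

# Toy example: two hypothetical centroids and four cases in a five-dimensional space
toy_centroids = np.array([[0.0] * 5, [10.0] * 5])
toy_coords = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],
    [9.0, 10.0, 10.0, 10.0, 10.0],
])
toy_labels = assign_to_clusters(toy_coords, toy_centroids)
toy_r2 = r_squared(toy_coords, toy_labels)
```

Because the centroids stay fixed, adding cases does not change the syndrome definitions; comparing R² before and after the addition, as done above, indicates whether the new cases fit the existing partition about as well as the original ones.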
As new diseases with distinct pathological profiles emerge in free-ranging wild animals over time, the syndrome definition might evolve. The statistical classification could be revised in the future, and historical data could be integrated in the classification process, thus allowing the analysis of continuous time series.
For the epidemiological study of the syndromic time series, we will develop models and anomaly detection algorithms on the number of cases of each syndrome per time unit from the historical database [62].