The classification of bacteria by genomic methods or by expensive commercial biochemical kits is often beyond the reach of laboratories that need to classify large numbers of unknown bacterial strains in a fast, cheap, and reliable way. A new computer program, Identax, is presented for the computer-assisted identification of microorganisms by using only the results of conventional biochemical tests. Identax improves on current microbial identification software and provides a multiplatform, user-friendly program. It can be executed on any operating system and can be downloaded at no cost from the Identax website (www.identax.org).
Traditional identification and classification of microorganisms are usually based on numerical taxonomy that was introduced at the end of the 1950s (15, 16, 17). Numerical taxonomic methods were applied extensively for classification and identification in subsequent years (6, 18), and they were strengthened by the extensive use of computers in various research fields (1, 8, 9, 14, 20). Later, computer-assisted identification programs were developed for the identification of bacterial groups, based mostly on phenotypic data obtained by traditional methods or raw data from commercial kits (2, 4, 5, 7, 11, 13).
Recently, genomic analyses have proven successful in defining taxa within different microbial groups. The use of 16S rRNA has been proposed as the most reliable tool for allocating bacterial strains to families and genera. Moreover, other molecular methods, such as multilocus sequence typing, are also needed to assign strains to species (19). However, the use of these molecular methods for routine analyses and rapid diagnoses, which are usually required for clinical and environmental analyses involving high numbers of samples or strains, remains impractical. Identification procedures for routine practice need to be simple, low cost, and rapid in order to be effective. Conventional biochemical identification therefore provides the optimal approach for microbial identification in lieu of complete genomic characterization. Genomic analyses should, of course, still be carried out for studies of systematic bacteriology or biodiversity when molecular taxonomical criteria are needed.
Numerical approaches are currently applied to identify microorganisms, but there has been no wide-ranging software that could be applied to any microbial group and run on any computer system. Identax has been developed as a platform-independent computer program for the numerical identification of microorganisms. It is user-friendly and, because it relies on the flexibility and power of the Java platform, runs on almost all existing operating systems and computer platforms. It can be applied to any microbial group by creating, updating, or merging microbial databases with an unlimited number of phenotypic characteristics. The software allows the identification of any microorganism that is listed in an available database. It can also generate interactive dichotomous trees for rapid offline analysis.
In this study, the Identax software was developed and evaluated by using recently updated biochemical databases for the genus Vibrio (12). Later, it was also evaluated with other databases for different bacterial groups that have been defined by other authors in previous taxonomical studies (3, 10).
Identax has two main features. The first is the fast identification of unknown bacterial strains from phenotypic data, represented as the dichotomous results (positive or negative) of a set of biochemical tests. The second main feature is the generation of dichotomous trees that will allow the isolation of one taxon from the others with the lowest possible number of tests.
The first feature is the better suited to decision support, as the software recommends, in real time, the test with the greatest discriminative potential. It can also detect and handle false positives and show the current candidate taxa.
The second feature consists of the generation of a dichotomous tree. Each node represents a test, and its two branches correspond to a negative or to a positive result from the test. This tree offers an overview of the search space and allows rapid identification without the need for a computer.
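As an illustration of the tree structure described above, the following Python sketch builds a small dichotomous tree greedily. It is not the Identax implementation: it assumes that each taxon has a strictly positive or negative expected result for every test (probabilities are ignored), that the test separating the most taxon pairs is chosen at each node, and the names `build_tree`, `taxa`, and `tests` are hypothetical.

```python
from itertools import combinations


def build_tree(taxa, tests):
    """Build a dichotomous tree greedily (illustrative assumption, not Identax code).

    taxa  -- dict mapping taxon name -> dict mapping test name -> bool
             (the expected result of that test for that taxon)
    tests -- list of test names that may still be used
    Returns a leaf (sorted list of taxon names) or a node of the form
    {"test": name, "negative": subtree, "positive": subtree}.
    """
    names = sorted(taxa)
    if len(names) <= 1 or not tests:
        return names  # nothing left to separate, or no tests left

    def pairs_split(test):
        # Number of taxon pairs whose expected results differ for this test.
        return sum(1 for x, y in combinations(names, 2)
                   if taxa[x][test] != taxa[y][test])

    best = max(tests, key=pairs_split)
    if pairs_split(best) == 0:
        return names  # the remaining taxa cannot be told apart

    negative = {n: taxa[n] for n in names if not taxa[n][best]}
    positive = {n: taxa[n] for n in names if taxa[n][best]}
    remaining = [t for t in tests if t != best]
    return {
        "test": best,
        "negative": build_tree(negative, remaining),
        "positive": build_tree(positive, remaining),
    }


# Made-up profiles for three placeholder taxa and two placeholder tests.
example_taxa = {
    "A": {"T1": True, "T2": True},
    "B": {"T1": True, "T2": False},
    "C": {"T1": False, "T2": False},
}
example_tree = build_tree(example_taxa, ["T1", "T2"])
```

In this sketch each internal node holds one test and its two branches hold the subtrees for a negative and a positive result, so following the branches by hand reproduces the offline use of the tree.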
The core of the software consists of a simple and efficient model of conditional probabilities. This model is based on a Bayes' theorem approach (4), but a few optimizations have been applied to ensure the scalability and efficiency of the algorithm, as well as to avoid some inherent restrictions. To calculate the probability that an unknown isolate belongs to a given species or taxon, the following formula is used:

\[
P(t_i \mid R) = \frac{P(R \mid t_i)\, P(t_i)}{\sum_{j=1}^{n\,\mathrm{taxa}} P(R \mid t_j)\, P(t_j)}
\]

where P(x|y) is the conditional probability of event x, assuming the occurrence of event y, P(x) is the unconditional probability of event x, and j runs over the taxa. In this case, R represents a specific combination of results from all the experiments with the taxon in the data set used, t_i is a specific taxon, and “n taxa” is the total number of taxa available.
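The Bayesian calculation described above can be sketched in Python as follows. This is a minimal illustration, not the Identax code: it assumes that the tests are independent, so that P(R|t) is the product of the per-test probabilities, and that all taxa have equal prior probability unless priors are supplied. The optimizations mentioned in the text are not reproduced, and the function name `posterior` and the example values are hypothetical.

```python
from math import prod


def posterior(matrix, results, priors=None):
    """Return P(taxon | results) for every taxon via Bayes' theorem.

    matrix  -- dict taxon -> dict test -> probability of a positive result
    results -- dict test -> bool (observed result; True means positive)
    priors  -- optional dict taxon -> prior probability
               (assumption: uniform priors when omitted)
    """
    taxa = list(matrix)
    if priors is None:
        priors = {t: 1.0 / len(taxa) for t in taxa}

    def likelihood(taxon):
        # Assumption: tests are independent, so P(R | taxon) is a product.
        return prod(
            matrix[taxon][test] if positive else 1.0 - matrix[taxon][test]
            for test, positive in results.items()
        )

    joint = {t: likelihood(t) * priors[t] for t in taxa}
    total = sum(joint.values())  # denominator: sum over all taxa
    if total == 0.0:
        raise ValueError("observed results are impossible under every taxon")
    return {t: joint[t] / total for t in taxa}


# Made-up matrix for two placeholder taxa and two placeholder tests.
example_matrix = {
    "A": {"T1": 0.99, "T2": 0.01},
    "B": {"T1": 0.99, "T2": 0.99},
}
example_posterior = posterior(example_matrix, {"T1": True, "T2": True})
```

With these made-up values both taxa are equally likely to be positive for T1, so the positive T2 result alone drives the posterior towards taxon B.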
To score the discriminative power of each test, all possible pairs of taxa are evaluated to see if they are distinguished by this test (i.e., one is expected to return a positive and the other a negative result). The pseudocode for this algorithm is the following:
for each test t
    score[t] ← 0
    for each nondiscarded taxon pair <x, y>
        if t(x) differs from t(y) then
            score[t] ← score[t] + 1
Apart from this simple count, the absolute value of the difference between the prior probabilities that taxon x and taxon y will give positive results in test t is used as a secondary sorting factor, because the software considers that, for example, a discrimination between probabilities of 0 and 1 is better than one between 0.15 and 0.85. The application of industry-standard software-engineering practices, such as a three-layer architectural design and iterative development, contributes to the robustness of the software. The reliability of bacterial identification is directly linked to the reliability of the matrices, which have been published in other articles (3, 10, 12).
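The pairwise count and its secondary sorting factor can be sketched in Python as follows. The sketch rests on two assumptions that the text does not specify: a taxon is taken to be "expected positive" for a test when its probability is at least 0.5 (the hypothetical `cutoff` parameter), and the secondary factor is aggregated as the sum of the absolute probability differences over the distinguished pairs. The name `rank_tests` is also hypothetical.

```python
from itertools import combinations


def rank_tests(matrix, candidates, cutoff=0.5):
    """Rank tests by discriminative power among the candidate taxa.

    matrix     -- dict taxon -> dict test -> probability of a positive result
    candidates -- taxa that have not been discarded yet
    cutoff     -- assumption: probability >= cutoff means "expected positive"
    Returns a list of (test, pair_count, spread) tuples, best test first.
    """
    tests = list(next(iter(matrix.values())))
    scored = []
    for t in tests:
        count = 0     # primary score: number of taxon pairs the test separates
        spread = 0.0  # secondary score (assumed): summed |p_x - p_y| over those pairs
        for x, y in combinations(candidates, 2):
            px, py = matrix[x][t], matrix[y][t]
            if (px >= cutoff) != (py >= cutoff):
                count += 1
                spread += abs(px - py)
        scored.append((t, count, spread))
    # Sort by pair count first, then by spread, both in descending order.
    scored.sort(key=lambda s: (s[1], s[2]), reverse=True)
    return scored


# Made-up matrix mirroring the 0/1 versus 0.15/0.85 example in the text.
example_matrix = {
    "A": {"T1": 1.0, "T2": 0.85},
    "B": {"T1": 0.0, "T2": 0.15},
}
example_ranking = rank_tests(example_matrix, ["A", "B"])
```

Both tests separate the single pair of taxa, so the primary count ties and the secondary factor places T1 (probabilities 1 and 0) ahead of T2 (0.85 and 0.15).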
The basis of this program is the matrix (database), which can be retrieved from the literature or created by the users themselves. The matrix contains, at position XY, the probability of obtaining a positive result for test X on taxon Y. The variability of a test is included in this probability. Consequently, users should define databases in accordance with the findings of taxonomical studies of the bacterial group of interest. Identax permits the user to tune its operation by setting the desired confidence thresholds. Once a matrix has been imported into the software, the next step is to enter the results of the biochemical tests until the established confidence threshold is reached. If the identification threshold is not reached with the results entered, Identax suggests additional tests to be performed to reach it.
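The matrix layout and the threshold check described above can be illustrated with a short Python sketch. The CSV layout used here (a header row naming the taxa, then one row per test) is a hypothetical example and not the actual Identax file format; the function names `load_matrix` and `identified` and the threshold value are likewise assumptions.

```python
import csv
import io


def load_matrix(csv_text):
    """Parse a probability matrix from CSV text (hypothetical layout).

    Assumed layout: first row is "test,<taxon 1>,<taxon 2>,...", and each
    following row holds a test name and one probability per taxon.
    Returns dict taxon -> dict test -> probability of a positive result.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    taxa = header[1:]
    matrix = {taxon: {} for taxon in taxa}
    for row in body:
        test, values = row[0], row[1:]
        if len(values) != len(taxa):
            raise ValueError(f"test {test!r}: expected {len(taxa)} values")
        for taxon, raw in zip(taxa, values):
            p = float(raw)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"test {test!r}, taxon {taxon!r}: {p} is not a probability")
            matrix[taxon][test] = p
    return matrix


def identified(posteriors, threshold):
    """Return the most probable taxon if it reaches the threshold, else None."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else None


# Made-up matrix for two placeholder taxa and two placeholder tests.
example_csv = "test,A,B\nT1,0.99,0.01\nT2,0.50,0.85\n"
example_matrix = load_matrix(example_csv)
```

A result of `None` from `identified` corresponds to the situation in which further tests would have to be suggested before an identification is accepted.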
Owing to its architecture, Identax works on most operating systems. The supported operating systems and the minimum hardware requirements are those specified for the Java Virtual Machine, version 6 or higher (http://www.java.com). The software has been tested without any problems on Microsoft Windows XP/Vista, Apple Mac OS X 10.5, and Ubuntu Linux 9.04. In addition to the platform-independent Java executable, two system-specific installation packages, for Windows XP/Vista and Mac OS X, are provided.
Identax allows the import of reference matrices in XLS and CSV formats. Thus, any user can create his or her own data sets or use those published in the literature. The import of matrices from files of previous identification software (4) is also supported. The generation of work summaries can be customized through a system of templates, and the trees generated can be exported in XML, TreeML, or DOT format. Although many programs allow a custom representation of trees in these formats, Graphviz (http://www.graphviz.org/) is recommended for its capacity and ease of use. Identax also includes an interactive tree viewer that allows export to graphical formats (JPG, BMP, and PNG), although it may be difficult to customize for the representation of large trees. Finally, as Identax is an open-source project, its code can be adapted by any user to his or her needs, and for this reason, complete javadoc documentation is included with the source code.
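To illustrate what a dichotomous tree looks like in DOT format, the following Python sketch serializes a small tree. The nested-dictionary tree structure, the node naming, and the function name `to_dot` are assumptions made for this example; the DOT files written by Identax may be organized differently.

```python
def to_dot(tree):
    """Serialize a dichotomous tree to Graphviz DOT text (illustrative sketch).

    tree -- either a leaf (list of taxon names) or a node of the form
            {"test": name, "negative": subtree, "positive": subtree}
    Labels are written as-is; names containing double quotes are not escaped.
    """
    lines = ["digraph identification {"]
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        if isinstance(node, dict):
            test_name = node["test"]
            lines.append(f'  {node_id} [label="{test_name}"];')
            for branch, sign in (("negative", "-"), ("positive", "+")):
                child_id = visit(node[branch])
                lines.append(f'  {node_id} -> {child_id} [label="{sign}"];')
        else:
            label = ", ".join(node) or "(none)"
            lines.append(f'  {node_id} [label="{label}", shape=box];')
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Made-up tree with one placeholder test and two placeholder taxa.
example_dot = to_dot({"test": "T1", "negative": ["C"], "positive": ["A"]})
```

If the resulting text is saved to a file, for example `tree.dot`, Graphviz can render it with `dot -Tpng tree.dot -o tree.png`.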
For a detailed description of the program features, a users' manual is available at the project homepage at http://www.identax.org/. Identax can be downloaded at no cost from this website. Users can also find a set of matrices, based on data from previously published taxonomical studies (3, 10, 12), to identify different bacterial groups.
Identax is open-platform software that can be used by any laboratory for microbial identification and classification purposes. Users can easily adapt Identax to the analysis of the results of routine tests to identify microorganisms. There is no need to modify the traditional methods or commercial kits used in routine analyses. Historical data can be used to develop the most appropriate database to serve as a reference for Identax. Previously existing databases can be imported for use in Identax. Databases can also be shared among Identax users if permitted by the database creators. Consequently, Identax is a valuable tool to support and facilitate microbial identification by conventional biochemical methods, for medical, ecological, environmental, and other purposes, in any microbiology laboratory.
Published ahead of print on 14 October 2009.