In biology, experiments can generate data that fall into more than two categories. For example, you could count how many students in a class have each eye color; the categories might be blue, brown, green, and other. Categorical data with more than two categories are analyzed with statistical methods similar to those described above. However, determining how best to display data with multiple categories can be difficult. It is important to display your data in a way that is accessible to your readers and easy to interpret.
2.1 How to present three component categorical data in a bar graph or ternary diagram
Experimental biologists typically use some form of bar graph to show percentage categorical data. For example, in an experiment of 100 samples from a population, 53 show phenotype A, 16 show phenotype B, and 31 show phenotype C. Traditionally, the result would be shown as a bar graph.
Presenting categorical data with traditional bar graphs and a ternary diagram
However, when you need to compare multiple experiments across categories, bar graphs can become cluttered and complicated. A ternary diagram can present the same data more directly: the result above becomes a single point, and the results from many populations can easily be compared within one diagram. Ternary diagrams have been used in recent studies of genomics and bioinformatics (Raymond et al., 2003; Steinke et al., 2006; White et al., 2007). Here we introduce a similar diagram to developmental biologists in order to represent and analyze relative proportional data of three categorical components. The sum of the three percentages must be a constant, in this case 100%.
A ternary diagram depicts the three percentages by plotting a point (M) within an equilateral triangle ABC (see the figure "Interpretation of a ternary diagram"). First we determine the percentage of samples with phenotype A that M represents. A line is drawn from M parallel to BC, crossing line CA at point D (line MD). Then we calculate the length ratio of line segments CD/CA. This ratio equals the percentage of samples with phenotype A that point M represents. Using the percentage abundance scale (the group of lines parallel to BC that divide CA into segments of equal length), we can estimate that M represents 53% of samples with phenotype A. Every point along a line parallel to BC shares the same percentage of phenotype A, and the closer a point lies to A, the higher the percentage of A it represents. Thus, all the points on line BC represent 0% A, while point A represents 100% of samples with phenotype A. Similarly, to get the percentage of B that M represents, we draw line ME parallel to CA, which crosses AB at point E. The ratio AE/AB is the percentage of B that M represents. Points on line AC show 0% B, while point B means 100% B. The percentage of C is obtained in the same way as the ratio BF/BC, where F is the point at which the line from M parallel to AB crosses BC. The sum of the ratios CD/CA, AE/AB and BF/BC is 100%.
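The geometric reading described above can also be run in reverse to place a point in the triangle. The following is a minimal sketch, not the authors' plotting program: the function name `ternary_to_xy` and the vertex layout (B at the origin, C at (1, 0), A at the apex) are assumptions chosen for illustration.

```python
import math


def ternary_to_xy(a, b, c):
    """Map three percentages to a point in an equilateral triangle of side 1.

    Assumed layout (not taken from the original figure):
    B at (0, 0), C at (1, 0), A at the apex (0.5, sqrt(3)/2).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("the three values must sum to a positive constant")
    fa, fb, fc = a / total, b / total, c / total
    # The point is the average of the vertices weighted by the three shares;
    # B sits at the origin, so fb contributes nothing to x or y.
    x = fc * 1.0 + fa * 0.5
    y = fa * math.sqrt(3) / 2
    return x, y


# Point M from the example: 53% A, 16% B, 31% C.
x, y = ternary_to_xy(53, 16, 31)
print(f"M is plotted at x={x:.3f}, y={y:.3f}")
```

With this layout the height of a point above BC is proportional to its share of A, which matches the observation that every point on a line parallel to BC has the same percentage of A.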
Interpretation of a ternary diagram
Using a ternary diagram, one can easily plot proportional data of three categories and visualize the difference between data sets. For example, we have analyzed two alleles of the zebrafish pkd2 mutant cup: cuptc321 and cupty30b. These two alleles seem to differ in how they affect left-right patterning, as visualized by the placement of organs. In these mutants, the positioning of the heart, liver and pancreas can be divided into 3 mutually exclusive categories: situs solitus (ss; correct pattern), situs inversus (si; reversed pattern), and heterotaxia (ht; any pattern not ss or si) (Schottenfeld et al., 2007). In 97 mutant embryos from cuptc321, the percentages of ss, si and ht are 35.0%, 33.0% and 32.0% respectively. In 98 mutant embryos from cupty30b, the percentages of ss, si and ht are 37.8%, 45.9% and 16.3% respectively. These results can be displayed with a bar graph or a ternary diagram. A ternary diagram has two advantages over a bar graph in this example. First, the presentation is clearer when many results are displayed together. Imagine we want to compare the left-right patterning phenotypes of 10 different mutant alleles. We would need to draw 30 bars in a bar graph, but only 10 dots in a ternary diagram. Second, we can present the error region of our results, which can only be done accurately with a ternary diagram. Although error bars can be used in a bar graph when presenting data of two categories, they cannot accurately describe the error region associated with data of three categories.
The phenotypes of two cup alleles
Comparison of bar graph and ternary diagram
2.2 How to draw the error bar/region on your categorical data
We previously discussed how to calculate the error bar for proportional data with two categories. However, with three categories plotted in a ternary diagram, we cannot draw an error bar because the display is two dimensional and the bar is one dimensional. Thus, we need a two dimensional error region. Similar to the confidence interval in the previous section, the error region defines where the true result resides at a given CL. If we use statistical methods to draw a region of 95% CL around the point representing our result in the diagram, the region will contain the true result with 95% probability. Although this region can be calculated exactly, doing so is computationally intensive, especially when the sample is large. Watson and Nguyen have suggested using the chi-square method to approximate it (Watson and Nguyen, 1985).
Here is an example: In Human cytomegalovirus infected cells, three types of enveloped particles can be seen in the cytoplasm: virions, non-infectious enveloped particles (NIEPs) and dense bodies (DBs). Among 357 virus particles examined in BADwt virus–infected cells, 120 were virions, 150 were NIEPs and 87 were DBs, or 33.5%, 42% and 24.5% respectively (Feng et al., 2006). How can we draw the error region for this result in a ternary diagram at a given CL (for example 95%)? In our plot we denote A as virions, B as NIEPs and C as DBs. First we plot point M in the diagram to represent our result, 33.5% A, 42% B and 24.5% C. Then we look at every point in the diagram to see whether it is in the 95% confidence region of M. For example, point L represents 32% A, 45% B and 23% C in the diagram. If we assume the true result is L, then we can draw the region into which experimental results would fall with 95% probability when 357 samples are counted from a population with distribution L. This can be done exactly by computing the trinomial distribution, but here we use the chi-square test as a more efficient approximation. The result of this calculation is the region within the green boundary in the figure. Since point M resides inside the green line, point L is in the error region of M at 95% CL. Now let's look at point K, which represents 25% A, 40% B and 35% C. The blue line is the boundary of the region into which experimental results would fall with 95% probability when 357 samples are counted from a population with distribution K. Since point M is outside the blue line, point K is not in the error region of M at 95% CL. In general, for a given point X, representing xa of A, xb of B and (100%-xa-xb) of C in a population, we use the following formula to approximately determine whether M is inside the error region of X at 95% CL. Again, we are utilizing the chi-square method.
Calculation of error region from a given data set with 95% CL
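$$\chi^2 = \frac{(O_A - E_A)^2}{E_A} + \frac{(O_B - E_B)^2}{E_B} + \frac{(O_C - E_C)^2}{E_C}$$
Here $O_A$, $O_B$ and $O_C$ are the observed counts of A, B and C at point M, and $E_A$, $E_B$ and $E_C$ are the counts expected for the same sample size if the true distribution were X. We are comparing two sets of proportional data, each with 3 categories. Thus, the degree of freedom is (2-1)*(3-1) = 2. The critical value of the chi-square distribution with 2 degrees of freedom at 95% CL is 5.991. If $\chi^2$ is less than 5.991, then the 95% error region of point X includes M, and thus X is inside the error region of M at 95% CL. In the case of M and L, with the sample size of 357, M (33.5% A, 42% B, 24.5% C) is the observed value, while L (32% A, 45% B, 23% C) is the expected value. So $O_A = 120$, $O_B = 150$ and $O_C = 87$, while $E_A = 114$, $E_B = 161$ and $E_C = 82$ (357 multiplied by 32%, 45% and 23%, rounded to whole counts), which gives $\chi^2(M,L) \approx 1.37$.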
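For the chi-square value of M and K (25% A, 40% B, 35% C), we have $O_A = 120$, $O_B = 150$, $O_C = 87$; $E_A = 89$, $E_B = 143$, $E_C = 125$, so
$$\chi^2(M,K) = \frac{(120-89)^2}{89} + \frac{(150-143)^2}{143} + \frac{(87-125)^2}{125} \approx 22.69$$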
Because $\chi^2(M,L)$ < 5.991 while $\chi^2(M,K)$ > 5.991, L is inside the error region of M while K is not. In order to draw the boundary of the error region of point M, we need to draw the curve of points (xa, xb, 100%-xa-xb) defined by $\chi^2$ = 5.991. The analytical solution to this equation can be complicated. Therefore, we have designed a web-based program to calculate the error region for a ternary plot. We tested 20,000 points inside the diagram to evaluate $\chi^2$, and then drew the boundary around those points whose $\chi^2$ values are less than 5.991.
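The point-by-point test and the grid scan described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' web program: the function names, the grid step of 0.005 (which yields roughly 20,000 interior points) and the use of unrounded expected counts are all choices made here. Because the expected counts are not rounded, the chi-square values differ slightly from hand calculations with whole counts.

```python
CRITICAL_95_DF2 = 5.991  # critical value for 2 degrees of freedom at 95% CL


def chi_square(observed, proportions):
    """Pearson chi-square of observed counts against hypothesised proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, proportions))


def in_error_region(observed, proportions, critical=CRITICAL_95_DF2):
    """True if the candidate point lies inside the error region of the data."""
    return chi_square(observed, proportions) < critical


def scan_region(observed, step=0.005):
    """Test a regular grid of interior points; return those inside the region."""
    steps = round(1 / step)
    inside = []
    for i in range(1, steps):
        for j in range(1, steps - i):
            xa, xb = i * step, j * step
            xc = 1.0 - xa - xb  # strictly positive because i + j < steps
            if in_error_region(observed, (xa, xb, xc)):
                inside.append((xa, xb, xc))
    return inside


observed_m = (120, 150, 87)      # virions, NIEPs, DBs
point_l = (0.32, 0.45, 0.23)
point_k = (0.25, 0.40, 0.35)

print("L inside:", in_error_region(observed_m, point_l))
print("K inside:", in_error_region(observed_m, point_k))

region = scan_region(observed_m)
print("grid points inside the 95% region:", len(region))
```

The outermost points of `region` approximate the boundary curve; a finer step gives a smoother outline at the cost of more evaluations.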
With this website, after drawing the error region, one can use the mouse cursor to track the boundary of the region and read out the corresponding percentage values in the table below the plotting graph.
For proportional data with four categories, the data should be plotted in a three dimensional space. Instead of an equilateral triangle, we use a pyramid plot, and the error region is a three dimensional cloud. However, in order to illustrate the result on a two dimensional webpage, we project the data onto each of the four sides of the pyramid. For each set of four-category data, we have four ternary diagrams to illustrate the error region. A data point is considered to reside inside the confidence region in a pyramid plot if and only if this point is within the confidence region in all 4 triangle projections from the pyramid plot.
2.3 How to determine whether one set of data is significantly different from another
Statistical significance is very useful in comparing two sets of data, to judge whether or not they are different from each other. As mentioned above, the error regions drawn in a ternary diagram can be used to visualize differences between data sets; alternatively, the chi-square test can also be used to judge the significance.
In the example of Human cytomegalovirus infected cells, 3 types of particles can be seen in the cytoplasm: virions, NIEPs and DBs. Among 357 virus particles examined in BADwt virus–infected cells, 120 were virions, 150 were NIEPs and 87 were DBs. Among 320 virus particles examined in BADinUS24 virus–infected cells, 91 were virions, 154 were NIEPs and 75 were DBs (Feng et al., 2006). The two data sets can be plotted in a ternary diagram with their error regions of 95% CL, according to the procedures mentioned above. Since the two data points (red dot and green dot) reside within each other's error region of 95% CL, we cannot reject the hypothesis that the two different viruses affect virion particle formation in a similar way. Thus, we treat the particle phenotypes of these two viruses as similar. On the other hand, suppose we analyze another virus mutant and find that among 210 particles, 30 are virions, 40 are NIEPs and 140 are DBs. This data point (blue dot) does not reside within the wildtype virus error region of 95% CL, and there is no overlap between the two error regions (blue circle and red circle). We can therefore reject, at 95% CL, the hypothesis that the two viruses affect virion formation in a similar way, and state that they have different effects.
Using ternary diagram to determine the difference between data set
However, when the error regions overlap, the conclusion is less clear. In the example of pkd2 mutant embryos we used before, there are two alleles, tc321 and ty30b. Given the data presented (Schottenfeld et al., 2007), we ask whether or not these two alleles affect left-right organ patterning in different ways. As before, we plot the results in a ternary diagram. However, the pattern is different from that of the virus example. In this case, neither of the points resides within the other's error region of 95% CL, but the error regions overlap. The location of the data points seems to indicate that these alleles affect organ patterning differently, but since there is some overlap in their error regions, we should return to the chi-square test to be sure.
First we assume they are not different; in other words, they are just two sets of samples from the same population. This is our null hypothesis (H0). Now we calculate the chi-square value of our data and compare it with 5.991 (the critical value of the chi-square distribution with 2 degrees of freedom at 95% CL) as mentioned above. If the value is greater than 5.991, data this divergent would arise less than 5% of the time if the two sets of experimental data were drawn from populations with the same distribution. Thus, we would conclude that these two alleles behave differently in affecting left-right patterning with statistical significance. On the other hand, if the chi-square value of the data is less than 5.991, we cannot reject H0; that is to say, we cannot claim they are different in affecting left-right patterning.
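First we calculate the $\chi^2$ value. Since we start with the null hypothesis, we need to calculate the percentages of the whole population. As discussed in part 2, the percentages of ss, si and ht in the whole population (our best estimation from the two data sets) are the averages of the two experiments weighted by sample size:
$$\text{ss} = \frac{97 \times 35.0\% + 98 \times 37.8\%}{97 + 98} = 36.4\%$$
$$\text{si} = \frac{97 \times 33.0\% + 98 \times 45.9\%}{97 + 98} = 39.5\%$$
$$\text{ht} = \frac{97 \times 32.0\% + 98 \times 16.3\%}{97 + 98} = 24.1\%$$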
These are our expected values for the whole population under the hypothesis that the two data sets are drawn from the populations with the same left-right patterning distribution. Then we calculate the expected results in each experiment given these percentages. In other words, given the percentages of 36.4% ss, 39.5% si and 24.1% ht, what do we expect to observe in an experiment of 97 samples and what do we expect with 98 samples in another experiment?
Expected data in tc321
Expected data in ty30b
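The expected counts for each experiment follow directly from the pooled percentages and the two sample sizes. A minimal sketch, using only the numbers quoted in the text (the variable and function names are illustrative):

```python
# Pooled percentages under the null hypothesis, as quoted in the text.
pooled = {"ss": 0.364, "si": 0.395, "ht": 0.241}
sample_sizes = {"tc321": 97, "ty30b": 98}


def expected_counts(n, proportions):
    """Expected number of embryos per category for a sample of size n."""
    return {category: n * p for category, p in proportions.items()}


expected = {
    allele: expected_counts(n, pooled) for allele, n in sample_sizes.items()
}

for allele, counts in expected.items():
    rounded = {category: round(value, 1) for category, value in counts.items()}
    print(allele, rounded)
```

Expected counts are generally not whole numbers; they are kept as fractions when the chi-square contributions are computed.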
Now we sum the chi-square contributions, (observed - expected)²/expected, from all 6 cells in the table. Note that this method is mathematically equivalent to the method we used for two categories above. The resulting chi-square value is 7.14.
Since the chi-square value is larger than the critical value, it is unlikely that the two data sets are drawn from populations with the same left-right patterning distribution at 95% CL. Thus we reject the null hypothesis, and claim that the two alleles probably affect left-right patterning in different ways. To further illustrate this point, you can calculate the chi-square value in the virus example. In that case, the degree of freedom is again 2, so the critical value at 95% CL is 5.991. The chi-square value between the two virus infections is 2.91, which is less than the critical value. So the chi-square test gives the same result as the ternary diagram, though since the ternary diagram is clear, one does not have to compute the chi-square value for this set.
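The whole procedure (pool the data, derive expected counts, sum the contributions) can be written as one short function. The sketch below applies it to the virus particle counts given earlier; the function name `chi_square_contingency` is an assumption for illustration and does not refer to any library routine.

```python
def chi_square_contingency(table):
    """Pearson chi-square and degrees of freedom for a table of counts.

    `table` is a list of rows, one row per experiment, one column per category.
    Expected counts come from the pooled column proportions (null hypothesis).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Rows: BADwt and BADinUS24; columns: virions, NIEPs, DBs.
virus = [[120, 150, 87], [91, 154, 75]]
chi2, dof = chi_square_contingency(virus)
print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
print("reject H0 at 95% CL:", chi2 > 5.991)
```

For the virus data this gives a value of about 2.91, below the critical value of 5.991, so the null hypothesis is not rejected.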
Using Microsoft Excel, we can calculate the p-value for our data with the function CHIDIST(chi-square value, degrees of freedom). In a table of j rows and k columns, the degree of freedom is (j-1)*(k-1), so a 2 by 3 table has 2 degrees of freedom. We can calculate p = CHIDIST(7.14, 2) = 0.028. By convention, a p-value less than 0.05 is considered statistically significant, so we consider this result to be statistically significant. In other words, if the two alleles of cup do have the same left-right patterning distribution, the chance that such an event could occur is less than 1 in 20.
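For readers without Excel, the same upper-tail probability can be computed directly. The sketch below relies on a closed form that holds only for even degrees of freedom, which covers the 2 degrees of freedom used here; it is an illustration, not a general replacement for CHIDIST, and the function name is an assumption.

```python
import math


def chi_square_p_value(chi2, dof):
    """Upper-tail probability of the chi-square distribution (even dof only).

    For dof = 2m: p = exp(-x/2) * sum over i in [0, m) of (x/2)**i / i!
    With dof = 2 this reduces to p = exp(-x/2).
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this sketch handles even, positive degrees of freedom only")
    half = chi2 / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * series


p = chi_square_p_value(7.14, 2)
print(f"p = {p:.3f}")
print("significant at the 0.05 level:", p < 0.05)
```

Evaluating the function at the critical value 5.991 with 2 degrees of freedom returns approximately 0.05, which is a quick consistency check between the critical-value and p-value approaches.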
The chi-square test is not limited to a 2 by 3 table as mentioned here. However, the critical chi-square value at a given CL depends on the degrees of freedom (so it is not always 5.991). One can look up the critical value in a chi-square table. For a quick guide, at 95% CL, the critical value is 3.841 for one degree of freedom, 7.815 for three degrees of freedom and 9.488 for four degrees of freedom. One thing worth mentioning is that the chi-square test also has its limitations, as mentioned in part I. If more than 20% of the expected values in the cells of a data table are less than 5, Fisher's exact test should be used instead of chi-square (Norman and Streiner, 2000).
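The rule of thumb above is easy to check before running a chi-square test. A minimal sketch follows; the function names are illustrative, and the second table contains made-up counts used only to show a case where the rule is triggered.

```python
def expected_table(table):
    """Expected counts for each cell under the pooled (null) proportions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def prefer_fisher_exact(table, min_expected=5.0, max_fraction=0.20):
    """True if more than 20% of the expected counts fall below 5."""
    cells = [value for row in expected_table(table) for value in row]
    small = sum(1 for value in cells if value < min_expected)
    return small / len(cells) > max_fraction


virus = [[120, 150, 87], [91, 154, 75]]  # counts from the text
small_sample = [[3, 1, 2], [2, 4, 1]]    # hypothetical counts, illustration only

print("virus data needs Fisher's exact test:", prefer_fisher_exact(virus))
print("small sample needs Fisher's exact test:", prefer_fisher_exact(small_sample))
```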