Misled by animal studies and basic research? Whenever we take a closer look at the outcome of clinical trials in a field such as, most recently, stroke or septic shock, we see how limited the value of our preclinical models was. Across all indications, 95% of drugs that enter clinical trials do not make it to the market, despite all the promise of the (animal) models used to develop them. Drug development has already started to decrease its reliance on animal models: In Europe, for example, despite increasing R&D expenditure, animal use by pharmaceutical companies dropped by more than 25% from 2005 to 2008. In vitro studies are likewise limited: questionable cell authenticity, over-passaging, mycoplasma infections, and lack of differentiation as well as non-homeostatic and non-physiologic culture conditions endanger the relevance of these models. The standards of statistics and reporting often are poor, further impairing reliability. Alarming studies from industry show miserable reproducibility of landmark studies. This paper discusses factors contributing to the lack of reproducibility and relevance of pre-clinical research. The conclusion: Publish less but of better quality and do not rely on the face value of animal studies.
The prime goal of biomedicine is to understand, treat, and prevent diseases. Drug development represents a key goal of research and the pharmaceutical industry. A devastating attrition rate of more than 90% for substances entering clinical trials has received increasing attention. Obviously, we often are not putting our money on the right horses… Side effects not predicted in time from toxicology and safety pharmacology contribute 20-40% to these failures, indicating limitations of the toolbox, which is considerably larger than what is applied to environmental chemicals, with the exception of pesticides. Here, the question is raised whether quality problems of the disease models and basic (especially academic) research also contribute to this. In a simplistic view, clinical trials are based on the pillars of basic research/pre-clinical drug development, and toxicology (Fig.1).
What does this tell us for areas where we have few or no clinical trials to correct false conclusions? Toxicology is a prime example, where regulatory decisions for products traded at $10 trillion per year are taken only on the basis of such testing (Bottini and Hartung, 2009, 2010). Are we sorting out the wrong candidate substances? Aspirin likely would fail the preclinical stage today (Hartung, 2009c). Rats and mice predict each other for complex endpoints with only 60% accuracy and, taken together, predicted only 43% of the clinical toxicities of candidate drugs observed later (Olson et al., 2000). New approaches that rely on molecular pathways of human toxicity currently are emerging under the name “toxicology for the 21st Century”.
Doubt as to animal models also is increasing: A number of increasingly systematic reviews summarized here more and more show the limitations. A National Academy of Sciences panel recently analyzed the suitability of animal models to assess the human efficacy of countermeasures to bioterrorism: It could neither identify suitable models nor did it recommend their development; it did, however, call for the establishment of other human-relevant tools. In line with this, about $ 200 million have been made available by NIH, FDA, and DoD agencies over the last year to start developing a human-on-a-chip approach (Hartung and Zurlo, 2012).
Academic research represents a major stimulus for drug development. Obviously, basic research also is carried out in the pharmaceutical industry, but quality standards are different and the lesser degree of publication makes it less accessible for analysis. Obviously, academic research comes in many flavors, and when pinpointing some critical notions here, each and every one might be unfair and not hold for a given laboratory. Similarly, the author and his generations of students are not free from the alleged (mis)behaviors. It is the far too frequent, retrospective view, imprinted from experiences from quality assurance and validation that will be shared here.
The situation is clear: Companies spend more and more money on drug development, with an average of $ 4 and up to $ 11 billion quoted by Forbes for a successful launch to the market1. The number of substances making it to market launch is dropping, and their success does not necessarily compensate for the increased investment. The blockbuster model of drug industry seems largely busted.
The situation was characterized earlier (Hartung and Zurlo, 2012), and more recent figures do not suggest any turn for the better: Failure rates in the clinical phase of development now reach 95% (Arrowsmith, 2012). Analysis by the Centre for Medicines Research (CMR) of projects from a group of 16 companies (representing approximately 60% of global R&D spending) in the CMR International Global R&D database reveals that the Phase II success rates for new development projects have fallen from 28% (2006-2007) to 18% (2008-2009) (Arrowsmith, 2011a). 51% were due to insufficient efficacy, 29% were due to strategic reasons, and 19% were due to clinical or preclinical safety reasons. The average for the combined success rate at Phase III and submission has fallen to ~50% in recent years (Arrowsmith, 2011b). Taken together, clinical phases II & III now eliminate 95% of drug candidates.
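How per-phase success rates compound into an overall attrition figure can be made explicit with a few lines of arithmetic. The following is only a sketch: the Phase II (18%) and Phase III/submission (~50%) rates are the ones quoted above, whereas the Phase I success rate of 60% is a hypothetical value assumed here purely for illustration; it is not taken from the cited analyses.

```python
# Per-phase success rates. Phase II and Phase III/submission are quoted in the text.
# ASSUMPTION: the Phase I value is hypothetical and used for illustration only.
PHASE_I_ASSUMED = 0.60
PHASE_II = 0.18
PHASE_III_AND_SUBMISSION = 0.50


def overall_success(rates):
    """Multiply per-phase success rates to obtain the overall success rate."""
    result = 1.0
    for rate in rates:
        result *= rate
    return result


success_II_III = overall_success([PHASE_II, PHASE_III_AND_SUBMISSION])
success_all = overall_success([PHASE_I_ASSUMED, PHASE_II, PHASE_III_AND_SUBMISSION])

print(f"Success across Phases II and III: {success_II_III:.1%}")
print(f"Success including assumed Phase I: {success_all:.1%}")
print(f"Overall attrition under these assumptions: {1 - success_all:.1%}")
```

Under these assumptions, Phases II and III alone let about 9% of candidates through; an overall failure rate near 95% only results once losses in an earlier phase are included.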
This appeared to correspond to dropping numbers of new drugs, as observed between 1997 and 2006, as we have occasionally referenced (Bottini and Hartung, 2009, 2010), though this has been shown to be possibly largely an artifact (Ward et al., 2013). We also have to consider that attrition does not end with the market launch of drugs: Unexpected side effects lead to withdrawals – Wikipedia, who knows it all, lists 47 drugs withdrawn from the market since 19902, which represents roughly the number of new drug entities entering the market in two years. This does not even include the drugs for which indications had to be limited because of problems. There also are examples of drugs that made it through the trials to the market but, in retrospect, did not work (see for examples the AP press coverage in October 2009 following the US Government Accountability Office report analyzing 144 studies, and showing that the FDA has never pulled a drug off the market due to a lack of required follow-up about its actual benefits3).
At the same time, combining the results of 0.32% fatal adverse drug reactions (ADR) (Lazarou et al., 1998) (total 6.7% ADR) of all hospitalized patients in the US in 1998, with a 2.7-fold increase of fatal ADR from 1998-2005 (Moore et al., 2007), leads to about 1% of hospitalized patients in the US dying from ADR. This suggests that drugs are not very safe, even after all the precautionary tests, and corresponds to the relatively frequent market withdrawals.
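The estimate of about 1% can be retraced as a simple product. The sketch below assumes, as the argument above implicitly does, that the 2.7-fold increase in fatal ADR applies proportionally to hospitalized patients and that the number of hospitalized patients stayed roughly constant; both are simplifications rather than findings of the cited studies.

```python
# Figures quoted in the text.
fatal_adr_1998 = 0.0032   # 0.32% of hospitalized patients (Lazarou et al., 1998)
increase_factor = 2.7     # increase in fatal ADR, 1998-2005 (Moore et al., 2007)

# ASSUMPTION: the increase scales the hospital fatality share proportionally.
fatal_adr_2005 = fatal_adr_1998 * increase_factor

print(f"Estimated share of hospitalized patients dying from ADR: {fatal_adr_2005:.2%}")
```

The product is roughly 0.86%, which is the basis for rounding to "about 1%".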
The result of this disastrous situation is that pharma companies are eating each other up, often in the hope of acquiring a promising drug pipeline, only to find out that this was wishful thinking or losing so much time in the merger that the delay of development compromises the launch of the pipeline drugs.
A popular criticism of clinical drug development (as, e.g., prominently stressed in Ben Goldacre's recent book “Bad Pharma”, 2012) is the bias from the pressure to get drugs to the market. In fact, there is also a publication bias, i.e., the more successful a clinical study, the more likely it will be published. It has been shown that studies sponsored by industry are seven times more likely to have positive outcomes than those that are investigator-driven (Bekelman et al., 2003; Lexchin, 2003). However, this does not take into account how much more development efforts go into industrial preclinical drug development compared to what academic researchers have at their disposal.
Actually, clinical studies have extremely high quality standards: they are mostly randomized, double-blind, and placebo-controlled, as well as usually multi-centric. They require ethical review, follow Good Clinical Practice, and are carried out by skilled professionals. In recent years, the urge to publish and register has increased strongly. Clinical medicine also brought about evidence-based medicine (EBM), which we have several times praised as an objective, transparent, and conscientious way to condense information for a given controversial question (Hoffmann and Hartung, 2006; Hartung, 2009a, 2010). Taken together, these attributes are difficult to match in other fields.
So we might say that clinical research is pretty good; its biases, if any, lead to overestimating success. In a simple view, the clinical pipeline, despite enormous financial pressures, has very sophisticated tools to promote good science. If this is true, the candidates we send into clinical research are the wrong horses to begin with. We have to analyze the weaknesses of the preclinical phase to understand why we are not improving attrition rates.
Sure, to some extent. It is one purpose of this series of articles to collect arguments for transitioning to new tools. The quoted data from Arrowsmith would suggest that toxic side-effects contribute to 20% of attrition each in phase II and III. Probably, we need to add some percent for side-effects noted in phase I, i.e., first in humans, and post-market adverse reactions. Thus an overall figure of 30-40% seems realistic.
However, we first have to distinguish two matters: One is the observed effects in humans, which were not sufficiently anticipated. Another is the findings in animal toxicity studies done in parallel to the clinical studies. It is a common misunderstanding among lay audiences that clinical studies commence after toxicology has been completed. For reasons of timing, however, this is not possible, and the long-lasting studies are done at least in parallel to phase II. Currently, when first acquiring data on humans, animal toxicology is incomplete. The two types of toxicological data also are very different: the toxicological effects observed in human trials of necessarily short duration and little or no follow-up observation are necessarily different from the chronic systemic animal studies at higher doses. Fortunately, typical side-effects in clinical trials are mild; the most common one (about half of the cases) is drug-induced liver injury (DILI), observed as a painless and normally easily reversible increase in liver enzymes in blood work, though possibly extending to the more severe and life-threatening liver failure. The Innovative Medicines Initiative has tackled this problem in a project based on an initiative we started with industry at ECVAM: “Many medicines are harmful to the liver, and drug-induced liver injury (DILI) now ranks as the leading cause of liver failure and transplantation in western countries. However, predicting which drugs will prove toxic to the liver is extremely difficult, and often problems are not detected until a drug is already on the market.”4 The hallmark paper by Olson et al. (2000) gives us some idea of this and the retrospective value of animal models in identifying such problems: “Liver toxicity was only the fourth most frequent HT [human toxicity]…, yet it led to the second highest termination rate. There was also less concordance between animal and human toxicity with regard to liver function, despite liver toxicity being common in such studies. There was no relation between liver HTs and therapeutic class.”
A completely different question is: What animal findings obtained parallel to clinical trials lead to abandoning substances? Probably not that many. Cancer studies are notoriously false positive (Basketter et al., 2012), even for almost half of the tested drugs on the market; furthermore, genotoxicants usually have been excluded earlier. Reproductive toxicity will lead mainly to a warning against using the substance in pregnancy, which is a default for any new drug, as nobody dares to test on pregnant women. The acute and topical toxicities have been evaluated before being applied to humans. The same holds true for safety pharmacology, i.e., the assessment of cardiovascular, respiratory, and neurobehavioral effects, as well as excess target pharmacology. This leaves us with organ toxicities in chronic studies. In fact, if not sorted out by “investigative” toxicology, this can impede or delay drug development. “Fortunately,” different animal species often do not agree as to the organ of toxicity manifestation, leaving open a lot of room for discussion as to translation to humans.
Compared to clinical studies, toxicology has some advantages and some disadvantages as to quality: First, there are internationally harmonized protocols (especially ICH and OECD) and Good Laboratory Practice to quality-assure their execution. However, we use outdated methods, mainly introduced before 1970, which were systematically rendered precautionary/oversensitive, e.g., by using extremely high doses. The mechanistic thinking of a modern toxicology comes as “mustard after the meal,” mainly to argue why the findings are not relevant to humans. What is most evident when comparing approaches: clinical studies have one endpoint, good statistics, and hundreds to thousands of treated individuals with relevant exposures. Toxicology does just the opposite: Group sizes of identical twins (inbred strains) are minimal, and we study a large array of endpoints at often “maximum tolerated doses” without proper statistics. The only reason is feasibility, but these compromises combine in the end to determine the relevance of the prediction made. We have made these points in more detail earlier (Hartung, 2008a, 2009b). For a somewhat different presentation, please see Table 1, which combines arguments from different sources (Pound et al., 2004; Olson et al., 2000; Hartung, 2008a) showing reasons for differences between animal studies and human trials.
Perhaps the even more important question with regard to attrition is, which substances never make it to clinical trials, that would have succeeded but whose progress was hindered by wrong or precautionary toxicology? Again we have to ask, what findings lead to the abandonment of a substance. This is more complicated than it seems, because it depends on when in the development process such findings are obtained and what the indication of the drug is. To put it simply, a new chemotherapy will not be affected very much by any toxicological finding. In early screening, we tend to be generous in excluding substances that appear to have liabilities. An interesting case here is genotoxicity – due to the fear of contributing to cancer and the difficulty of identifying human carcinogens at all, this often is a brick wall. In addition, the relatively easy and cheap assessment of genotoxicity with a few in vitro tests allows front-loading of such tests. Typically, substances will be sorted out if found positive. The 2005 publication of Kirkland et al. gave the stunning result that while the combination of three genotoxicity tests achieves a reasonable sensitivity of 90+% for rat carcinogens, also more than 90% of non-carcinogens are false positive, i.e., a miserable specificity. Among the false positives are common table salt and sugar (Pottenger et al., 2007). With such a high false positive rate, we would eliminate an incredibly large part of the chemical universe at this stage.
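The consequence of pairing high sensitivity with very low specificity can be illustrated with a short calculation. This is a sketch under stated assumptions: sensitivity (90%) and specificity (10%, i.e., 90% of non-carcinogens testing positive) follow the figures reported above, while the 10% prevalence of true carcinogens among screened substances is a hypothetical value chosen only for illustration.

```python
def screen_outcomes(sensitivity, specificity, prevalence):
    """Return (fraction of substances flagged positive, positive predictive value)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    flagged = true_positives + false_positives
    ppv = true_positives / flagged
    return flagged, ppv


# Sensitivity and specificity as reported above for the three-test battery.
# ASSUMPTION: prevalence of true carcinogens is a hypothetical 10%.
flagged, ppv = screen_outcomes(sensitivity=0.90, specificity=0.10, prevalence=0.10)

print(f"Substances flagged positive: {flagged:.0%}")
print(f"Flagged substances that are true carcinogens: {ppv:.0%}")
```

With these inputs, about nine in ten substances would be sorted out, and only about one in ten of those would actually be a carcinogen; a different assumed prevalence shifts the numbers but not the basic pattern.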
This view has been largely adopted, leading to an ECVAM workshop (Kirkland et al., 2007) and follow-up work (Lorge et al., 2008; Fellows et al., 2008; Pfuhler et al., 2009, 2010; Kirkland, 2010a,b; Fowler et al., 2012a,b) financed by Cosmetics Europe and ECVAM, and finally changes in the International Conference on Harmonization (ICH) guidance, though not yet at the OECD, which did not go along with the suggested 10-fold reduction in test dose for the mammalian assays.
However, the “false positive” genotoxicity issue (Mouse lymphoma assay and Chromosomal Aberration assay) has been challenged more recently. Gollapudi et al. from Dow presented an analysis of the Mouse lymphoma Assay at SOT 2012. “Since the MLA has undergone significant procedural enhancements in recent years, a project was undertaken to reevaluate the NTP data according to the current standards (IWGT) to assess the assay performance capabilities. Data from more than 1900 experiments representing 342 chemicals were examined against acceptance criteria for background mutant frequency, cloning efficiency, positive control values, and appropriate dose selection. In this reanalysis, only 17% of the experiments and 40% of the “positive” calls met the current acceptance standards. Approximately 20% of the test chemicals required >1000 µg/mL to satisfy the criteria for the selection of the top concentration. When the concentration is expressed in molarity, approximately 58, 32, and 10% of the chemicals required ≤1 mM, >1 to ≤10 mM, and >10 mM, respectively, to meet the criteria for the top concentration. More than 60% of the chemicals were judged as having insufficient data to classify them as positive, negative, or equivocal. Of the 265 chemicals from this list evaluated by Kirkland et al. (2005, Mutat Res., 584, 1), there was agreement between Kirkland calls and our calls for 32% of the chemicals.”
Astra-Zeneca (Fellows et al., 2011) published their most recent assessment of 355 drugs and found 5% unexplained positives in the Mouse lymphoma Assay: “Of the 355 compounds tested, only 52 (15%) gave positive results so, even if it is assumed that all of these are non-carcinogens, the incidence of ‘false positive’ predictions of carcinogenicity is much lower than the 61% apparent from analysis of the literature. Furthermore, only 19 compounds (5%) were positive by a mechanism that could not be associated with the compounds primary pharmacological activity or positive responses in other genotoxicity assays.”
Snyder and Green (2001) earlier found less dramatic false positive rates for marketed drugs. FDA CDER did a survey on the most recent ~750 drugs and found that positive mammalian genotoxicity results (CA or MLA) did not affect drug approval substantially (Dr Rosalie Elesprue, personal communication). Only 1% was put on hold for this cause. However, this obviously addresses a much later stage of drug development, at which most genotoxic substances already have been excluded.
In contrast, an analysis by Dr Peter Kasper of nearly 600 pharmaceuticals submitted to the German medicines authority (BfArM) between 1995 and 2005, gave 25-36% positive results in one or more mammalian cell tests, and yet few were carcinogenic (Blakey et al., 2008). It is worth noting that an evaluation by the Scientific Committee on Consumer Products (SCCP) of genotoxicity/mutagenicity testing of cosmetic ingredients without animal experiments5 showed that 24 hair dyes tested positive in vitro were all then found negative in vivo. This would be very much in line with the Kirkland et al. analysis. However, we argued earlier (Hartung, 2008b): “The question might, however, be raised whether mutagenicity in human cells should be ruled out at all by an animal test. A genotoxic effect in vitro shows that the substance has a property, which could be hazardous. Differences in the in vivo test can be either species-specific (rat versus human) or due to kinetics (does not reach the tissue at sufficiently high concentrations). These do not necessarily rule out a hazard toward humans, especially in chronic situations or hypersensitive individuals. This means that the animal experiment may possibly hide a hazard for humans.”
In conclusion, flaws in the current genotoxicity test battery are obvious. There is promise of new methods, most obviously of the micronucleus test, which was formally validated and led to an OECD test guideline. There is some validation for the COMET assay (Ersson et al., 2013), which compared 27 samples in 14 laboratories using their own protocols; the variance observed was mainly between laboratories/protocols, i.e., 79%. Thus standardization of the COMET assay is essential, and we are desperately awaiting the results of the Japanese validation study for the COMET assay in vivo and in vitro. New assays based, e.g., on DNA repair measurement promise better accuracy (e.g., Walmsley, 2008; Moreno-Villanueva et al., 2009, 2011). Whether the current data justify eliminating the standard in vitro tests and adopting the in vivo comet assay as specified in the new ICH S2 guidance before validation can be debated. This guidance in fact decreases in vitro testing and increases in vivo testing (in its option 2 as it replaces in vitro mammalian tests entirely with two in vivo tests). It is claimed that they can be done within ongoing sub-chronic testing, but this still needs to be shown because the animal genotoxicity tests require a short term (2-3 day) high dose, while the sub-chronic testing necessitates lower doses.
What to do? We need an objective assessment of the evidence concerning the reality of “false positives.” This could be a very promising topic for an evidence-based toxicology collaboration (EBTC6) working group. Better still, we should try to find a better way to assess human cancer risk without animal testing. The animal tests are not sufficiently informative.
What does this mean in the context of the discussion here? It shows that even the most advanced use of in vitro assays to guide drug development is not really satisfactory. Though the extent of false positives, i.e., innocent substances not likely to be developed further to become drugs, is under debate, it appears that no definitive tool for such decisions is available. The respective animal experiment does not offer a solution to the problem, as it appears to lack sensitivity. Thus, the question remains whether genotoxicity as currently applied guides our drug development well enough.
A large part of biomedical research relies on animals. John Ioannidis recently showed that almost a quarter of the articles in PubMed show up with the search term “animal,” even a little more than with “patient” (Ioannidis, 2012). While there is increasing acknowledgement that animal tests have severe limitations for toxicity assessments, we do not see the same level of awareness for disease models. The hype about genetically modified animal models has fueled this naïve appreciation of the value of animal models.
The author had the privilege to serve on the National Academy of Sciences panel on animal models for countermeasures to bioterrorism. We have discussed this recently (Hartung and Zurlo, 2012): the problem for developing and stockpiling drugs for the event of biological/chemical terrorism or warfare is that (fortunately) there are no patients to test on. So, the question to the panel was how to substitute for them, in line with the Animal Rule of the FDA, with suitable animal models. In a nutshell, our answer is: there are no such things as sufficiently predictive animal models to substitute for clinical trials (NRC, 2011). Any drug company would long to have such models for drug development, as the bulk of development costs is incurred in the clinical phase; for countermeasures we have the even more difficult situation of unknown pathophysiology, limitations to experiment in biosafety facilities, disease agents potentially designed to resist interventions, and mostly peracute diseases to start with. So an important part of the committee's discussions dealt with the attrition (failure) rate of drugs entering clinical trials (see above), which does not encourage using animal models to substitute for clinical trials at all.
In line with this, a recent paper by Seok et al. (2013) showed the lack of correspondence of mouse and human responses in sepsis, probably the clinical condition closest to biological warfare and terrorism. We discussed this earlier (Leist and Hartung, 2013) and here only one point shall be repeated, i.e., though not necessarily as prominent and extensive, several assessments of animal models led to disappointing results, as referenced in the comment for stroke research.
In toxicology, we have seen that different laboratory species exposed to the same high doses predict each other no better than 60% – and there is no reason to assume that any of them predict humans better at low doses. We lack such analysis for drug efficacy models systematically comparing outcomes in different strains or species of laboratory animals. It is unlikely that results are much better.
In this series (Hartung, 2008a) we have addressed the shortcomings of animal tests in general terms. Since then, the weaknesses in quality and reporting of animal studies, especially, have been demonstrated (MacCallum, 2010; Macleod and van der Worp, 2010; Kilkenny et al., 2010; van der Worp and Macleod, 2011), further undermining their value. Randomization and blinding rarely are reported, which can have important implications, as it has been shown that animal experiments carried out without either are five times more likely to report a positive treatment effect (Bebarta et al., 2003). Baker et al. (2012) recently gave an illustration of poor reporting on animal experiments, stating that in “180 papers on multiple sclerosis listed on PubMed in the past 6 months, we found that only 40% used appropriate statistics to compare the effects of gene-knockout or treatment. Appropriate statistics were applied in only 4% of neuroimmunological studies published in the past two years in Nature Publishing Group journals, Science and Cell” (Baker et al., 2012).
Some more systematic reviews of the predictive value of animal models have been less than favorable, see Table 2 (Roberts, 2002; Pound et al., 2004; Hackam and Redelmeier, 2006; Perel et al., 2007; Hackam, 2007; van der Worp et al., 2010). Hackam and Redelmeier (2006), for example, found that of 76 highly cited animal studies, 28 (37%; 95% confidence interval [CI], 26%-48%) were replicated in human randomized trials, 14 (18%) were contradicted by randomized trials, and 34 (45%) remain untested. This is actually not too bad, but the bias to highly cited studies (range 639 to 2233) already indicates that these studies survived later repetitions and translation to humans. There are now even more or less “systematic” reviews of the systematic reviews (Pound et al., 2004; Mignini and Khan, 2006; Knight, 2007; Briel et al., 2013), showing that there is room for improvement. They definitely do not have the standard of evidence-based medicine. In the context of evidence-based medicine, “A systematic review involves the application of scientific strategies, in ways that limit bias, to the assembly, critical appraisal, and synthesis of all relevant studies that address a specific clinical question” (Cook et al., 1997). But the concept is maturing. See, for example, the NC3R whitepaper “Systematic reviews of animal research”7 or the “Montréal Declaration on Systematic Reviews of Animal Studies.”8 The ARRIVE guideline (Kilkenny et al., 2010) and the Gold Standard Publication Checklist (GSPC) to improve the quality of animal studies (Hooijmans et al., 2010) facilitate the evaluation and standardization of publications on animal studies.
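The confidence interval quoted above for the replication rate (28 of 76 studies) can be approximated with a standard normal-approximation (Wald) interval for a proportion. Which interval method the original authors used is not stated here, so the choice of the Wald formula is an assumption made for this sketch.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) 95% confidence interval.

    ASSUMPTION: the Wald method is used for illustration; the cited study
    may have used a different interval method.
    """
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p, p - z * standard_error, p + z * standard_error


p, lower, upper = wald_interval(successes=28, n=76)
print(f"Replicated: {p:.0%} (95% CI approx. {lower:.0%} to {upper:.0%})")
```

The width of the interval, more than 20 percentage points, is a reminder of how little can be concluded from 76 studies.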
No wonder that in vitro studies are increasingly considered: “According to a new market report by Transparency Market Research, the global in vitro toxicity testing market was worth $1,518.7 million in 2011 and is expected to reach $4,114.1 million in 2018, growing at a CAGR of 15.3 percent from 2013 to 2018.”9 Compare this to our estimate of $ 3 billion for in vivo toxicology (Bottini and Hartung, 2009). The quality problem, however, is no less for in vitro: Our attempts to establish Good Cell Culture Practice (GCCP; Coecke et al., 2005) and publication guidance for in vitro studies (Leist et al., 2010) desperately await broader implementation (see below).
Two recent publications by authors from two major pharmaceutical companies provided an epiphany: Both Amgen and Bayer HealthCare showed that they essentially could not reproduce the key findings of many studies that had prompted drug development. Prinz et al. (2011) from Bayer HealthCare stated in Nature Reviews in Drug Discovery “Believe it or not: how much can we rely on published data on potential drug targets? …data from 67 projects, … This analysis revealed that only in ~20-25% of the projects were the relevant published data completely in line with our in-house findings… In almost two-thirds of the projects, there were inconsistencies between published data and in-house data that either considerably prolonged the duration of the target validation process or, in most cases, resulted in termination of the projects.”
Similarly, Begley and Ellis (2012) from Amgen in Nature “Raise standards for preclinical cancer research … Fifty-three papers were deemed ‘landmark’ studies …scientific findings were confirmed in only 6 (11%) cases. Even knowing the limitations of preclinical research, this was a shocking result.”
How is this possible? Basic researchers seem to be even more naïve in the interpretation of their results than clinical researchers. In a comparison of 108 studies (Lumbreras et al., 2009), laboratory scientists were 19-fold more likely than clinical researchers to over-interpret the clinical utility of molecular diagnostic tests. Basic research, at least in academia, the source of most of such papers, is done mostly unblinded in a single laboratory. It is executed by students learning on the job, normally without any formal quality assurance scheme. Limited replicates due to limited resources and time as well as pressure to publish lead to publications that do not always withstand replication. Insufficient documentation aggravates the situation.
Figure 2 shows a cartoon of some of the problems. Having supervised some 50 PhD and a similar number of master and bachelor students, the author is not innocent of any of these misdoings.
The problem starts with setting the topic; this is rarely as precise as in drug development: Often it simply continues work of a previous student, who left uncompleted work behind after finishing a degree. In other cases it starts as pure exploration with the idea to go into a new direction. How often have we had to change topics or circumstances led us to take up new directions? Still, there is a desire to make use of the work done so far. It is always appealing to combine, reshuffle, etc. in order to make best use of the pieces. The quality of the pieces? Let's be honest: “A typical result out of three” usually means “the best I have achieved.” Especially critical is outlier removal: even if following a certain formal process, this is hardly ever properly documented. If things are not significant, we add more experiments, happily ignoring that this messes up the significance testing. Replications are a problem in themselves. How often are these just technical replicates, i.e., parallel experiments and not real reproductions on another day? If the reviewer is not very picky this will fly far too often. Who then combines the different independent experiments with an appropriate error propagation taking into account the variance of each reproduction? Even among seasoned researchers, I have met few who know how to do this.
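Two of the points above lend themselves to a small numerical illustration: adding experiments until significance is reached, and combining independent experiments while accounting for the variance of each. The sketch below is not a recommended analysis pipeline. It assumes normally distributed data with known standard deviation (so that a simple z-test can be used) and uses fixed-effect inverse-variance weighting as one common way of pooling independent estimates; the function names and all parameter values are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_start, n_max, alpha=0.05, n_sim=5000, seed=1):
    """Simulate testing after every added observation when no effect exists.

    Each simulated study draws observations from N(0, 1), tests the mean
    against zero (z-test, known SD = 1) at every sample size from n_start
    to n_max, and stops at the first p < alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if n >= n_start:
                z = total / math.sqrt(n)  # mean divided by its standard error
                if two_sided_p(z) < alpha:
                    hits += 1
                    break
    return hits / n_sim


def pool_independent(means, std_errors):
    """Fixed-effect inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled_mean = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_mean, pooled_se


fixed_n_rate = false_positive_rate(n_start=10, n_max=10)  # one planned test
peeking_rate = false_positive_rate(n_start=3, n_max=10)   # test after each addition
print(f"False positives, single planned test: {fixed_n_rate:.1%}")
print(f"False positives, adding until significant: {peeking_rate:.1%}")

# Hypothetical results of three independent repetitions (mean, standard error).
pooled_mean, pooled_se = pool_independent([1.2, 0.8, 1.0], [0.2, 0.4, 0.3])
print(f"Pooled estimate: {pooled_mean:.2f} +/- {pooled_se:.2f}")
```

In the simulation, the single planned test stays close to the nominal 5% error rate, whereas testing after each added observation produces clearly more false positives, even though no effect exists in the simulated data.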
Using spreadsheets and other interactive data manipulation and analysis tools we do not provide a usable audit trail of how results were obtained and how many attempts were made until significant results were obtained (Harrell, 2011). Poor statistics are a more widespread problem than outsiders might believe. They are a core part of the “Follies and Fallacies in Medicine” (Skrabanek and McCormick, 1990). Des McHale coined it: “The average human has one breast and one testicle.” Awareness is a little better in clinical research (Andersen, 1990; Altman, 1994, 2002), but as reviewers or readers we too often see papers without statistics or with inappropriate statistics (such as the promiscuous use of t-tests where not justified). Some common mistakes were illustrated in (Festing, 2003; Lang, 2004; Altman, 1998) (see also Tab. 3).
Douglas Altman (Altman, 1998) summarized thirteen previous analyses of the quality of statistics in medical journals (Tab. 4). Of the 1667 papers analyzed, only about 37% had acceptable statistics. No trend toward improvement is visible.
An example from environmental chemistry is the most commonly used method to deal with values below detection limits, which is to substitute a fraction of the detection limit for each non-detect (Helsel, 2006): “Two decades of research has shown that this fabrication of values produces poor estimates of statistics, and commonly obscures patterns and trends in the data. Papers using substitution may conclude that significant differences, correlations, and regression relationships do not exist, when in fact they do. The reverse may also be true.”
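Why substitution is problematic can be seen in a toy calculation. The sketch below is only an illustration under stated assumptions: the six concentrations, the detection limit of 1.0, and the function name are hypothetical, and in real data the values below the limit would of course be unknown. It shows that the summary statistics depend on the arbitrarily chosen fraction of the detection limit.

```python
from statistics import mean, stdev


def substitute_non_detects(values, detection_limit, fraction):
    """Replace every value below the detection limit by fraction * detection_limit."""
    return [v if v >= detection_limit else fraction * detection_limit
            for v in values]


# Hypothetical "true" concentrations; three of them fall below the limit.
TRUE_VALUES = [0.2, 0.4, 0.7, 1.5, 3.0, 6.0]
DETECTION_LIMIT = 1.0

if __name__ == "__main__":
    print("true data:     mean=%.3f  sd=%.3f"
          % (mean(TRUE_VALUES), stdev(TRUE_VALUES)))
    for fraction in (0.0, 0.5, 1.0):
        substituted = substitute_non_detects(TRUE_VALUES, DETECTION_LIMIT, fraction)
        print("fraction %.1f:  mean=%.3f  sd=%.3f"
              % (fraction, mean(substituted), stdev(substituted)))
```

The three substituted data sets share the same measured values, yet their means and standard deviations differ solely because of the substitution rule, which is the kind of fabricated structure Helsel warns about.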
When asking why many scientific papers are wrong, even if statistics are correctly applied, we also have to consider that a study usually does not depend on a single experiment. We report on a number of experiments that, when taken together, make the case. Even if we achieve a significance level of 95% in each given experiment, when combined, the probability of an error increases steadily (Fig. 3).
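The accumulation of error can be sketched with one line of arithmetic. Reading “a significance level of 95%” as a 5% chance of a false positive per experiment, and assuming (a simplification) that the experiments are independent and that no real effect exists in any of them, the chance that at least one of n experiments is falsely significant is 1 − (1 − 0.05)^n. The function name below is hypothetical.

```python
def prob_at_least_one_false_positive(n_experiments: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives among independent null experiments."""
    if n_experiments < 0:
        raise ValueError("n_experiments must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_experiments


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(prob_at_least_one_false_positive(n), 3))
    # prints:
    # 1 0.05
    # 5 0.226
    # 10 0.401
    # 20 0.642
```

With ten such experiments in one paper, the chance that at least one reported effect is spurious is already about 40% under these assumptions; correlated experiments would change the exact numbers but not the direction.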
The purpose of this article is not a review of statistics and statistical practice. It serves more as an illustration of yet another contributor to non-reproducibility of results. We might leave it with Andrew Lang: “He uses statistics as a drunken man uses lamp-posts – for support rather than illumination.”
The problem lies not only in the data generated, their statistical analysis, and the way we form an overall story from them: publication practices have their share in impeding objective science. In an interesting article, “Why current publication practices may distort science,” Young et al. (2008) use an economic view on scientific publication behaviors: “the small proportion of results chosen for publication are unrepresentative of scientists' repeated samplings of the real world. The self-correcting mechanism in science is retarded by the extreme imbalance between the abundance of supply (the output of basic science laboratories and clinical investigations) and the increasingly limited venues for publication (journals with sufficiently high impact). This system would be expected intrinsically to lead to the misallocation of resources. The scarcity of available outlets is artificial, based on the costs of printing in an electronic age and a belief that selectivity is equivalent to quality. Science is subject to great uncertainty: we cannot be confident now which efforts will ultimately yield worthwhile achievements. However, the current system abdicates to a small number of intermediates an authoritative prescience to anticipate a highly unpredictable future. In considering society's expectations and our own goals as scientists, we believe that there is a moral imperative to reconsider how scientific data are judged and disseminated.” The authors make a number of recommendations regarding how to improve the system:
Please note that the authors' involvement with ALTEX, most recently with Peer Journal (https://peerj.com), and especially with the Evidence-based Toxicology Collaboration (http://www.ebtox.com) promotes some of these goals. While the former two foster digital open-access publication with new financial models reducing the costs to readers and authors, the series of Food for thought … articles, commissioned t4 white papers, and the systematic reviews under development in the EBTC aim to be exactly the “critical reviews, digests, and summaries of the large amounts of biomedical data now generated.” The variety of initiatives for “quality of study methods” will add to this.
Earlier in this series of articles, the shortcomings of typical cell culture were discussed (Hartung, 2007). This article summed up experiences gained from the validation of in vitro systems and in the course of developing the Good Cell Culture Practice guidance (Coecke et al., 2005). Six years later the arguments are largely the same: We do not manage to obtain in-vivo-like differentiation because we often use tumor cells (tens of thousands of mutations, loss and duplications of chromosomes), over-passage with selection of subpopulations, use non-physiologic culture conditions (hardly any cell contact, low cell density, no polarization, limited oxygen supply, non-homeostatic media exchange, temperature and electrolyte concentrations reflective of humans not rodents), force growth (fetal calf serum, growth factors), do not demand cell functions due to over-pampering, do not follow the in vitro kinetics giving consideration to the fate of test substances in the culture, and do not represent cell type interactions. For most aspects there are technical solutions, but few are applied, and if so, they are applied in isolation, solving some but not all of the problems. Besides this, there is a lack of quality control. If we take the estimates below, probably only 60% of studies use the intended cells without mycoplasma infection. Documentation practices in laboratories and publications are often lousy. There is some guidance (GLP increasingly adapted, GCCP see below) but it is rarely applied. The more recent mushrooming of cell culture protocol collections is an important step, but it is still not common to stick to them or at least to be clear in publications about deviations from them: We tend to toy around with the models until they work for us, and too often only for us.
There is some movement with regard to cell line authentication (see below). The earlier article summarizing the history and core ideas of GCCP (Hartung and Zurlo, 2012) did not address mycoplasma infection, a problem far from being solved. There are also some new aspects coming from the booming field of stem cells.
Hello, HeLa… – the cell you see more often than you would believe. Since 1967, cell line contaminations have been evident, i.e., another cell type was accidentally introduced into a culture and slowly took over. The most promiscuous so far are HeLa cells, actually the first human tumor cell line. The line was derived from cervical cancer cells taken on February 8, 1951, from Henrietta Lacks, a patient at Johns Hopkins. The cells have contributed to more than 60,000 research papers and the development of a polio vaccine in the 1950s (more on the interesting history in (Skloot, 2010)). Recently the HeLa genome has been sequenced (Landry et al., 2013) (please note some controversy around the paper which is currently being sorted out). It is most interesting to see the genetic make-up of the cells as summarized by Ewen Callaway in Nature10: “HeLa cells contain one extra version of most chromosomes, with up to five copies of some. Many genes were duplicated even more extensively, with four, five or six copies sometimes present, instead of the usual two. Furthermore, large segments of chromosome 11 and several other chromosomes were reshuffled like a deck of cards, drastically altering the arrangement of the genes.” Do we really expect such a cell monster to show normal physiology? The cell line was found to be remarkably durable and prolific, as illustrated by its contamination of many other cell lines. It is assumed that, today, 10-20% of cell lines are actually HeLa cells and, in total, 18-36% of all cell lines are wrongly identified. Table 5 shows studies analyzing the problem over time extracted from (Hughes et al., 2007).
A very useful list of such mistaken cell lines is available.11 The problem has been raised several times (Macleod et al., 1999; Stacey, 2000; Buehring et al., 2004; Rojas et al., 2008; Dirks et al., 2010). A study (Buehring et al., 2004) from 2004 showed that HeLa contaminants were used unknowingly by 9% of survey respondents, likely underestimating the problem; only about a third of respondents were testing their lines for cell identity. More recently, a technical solution for cell line identification has been introduced by the leading cell banks (ATCC, CellBank Australia, DSMZ, ECACC, JCRB, and RIKEN), i.e., short tandem repeat (STR) microsatellite sequences. STRs are highly polymorphic in human populations, and their stability makes STR profiling (typing) ideal as a reference technique for identity control of human cell lines. We have to see how the scientific community takes this up. Isn't it a scandal that a large percentage of in vitro research is done on cells other than the supposed ones and misinterpreted this way?
Another type of contamination that is astonishingly frequent and has a serious impact on in vitro results is microbial infection, especially with mycoplasma (Langdon, 2003): Screening by the FDA for more than three decades showed that, of 20,000 cell cultures examined, more than 3000 (15%) were contaminated with mycoplasma (Rottem and Barile, 1993). Studies in Japan and Argentina reported mycoplasma contamination rates of 80% and 65%, respectively (Rottem and Barile, 1993). An analysis by the German Collection of Microorganisms and Cell Cultures (DSMZ) of 440 leukemia-lymphoma cell lines showed that 28% were mycoplasma positive (Drexler and Uphoff, 2002).
Laboratory personnel are the main sources of M. orale, M. fermentans, and M. hominis. These species of mycoplasma account for more than half of all mycoplasma infections in cell cultures and physiologically are found in the human oropharyngeal tract (Nikfarjam and Farzaneh, 2012). M. arginini and A. laidlawii are two other mycoplasmas contaminating cell cultures that originate from fetal bovine serum (FBS) or newborn bovine serum (NBS). Trypsin solutions derived from swine are a major source of M. hyorhinis. It is important to understand that the complete lack of a cell wall in mycoplasma implies resistance against penicillin (Bruchmüller et al., 2006), and they even pass 0.2 μm sterility filters, especially at higher pressure rates (Hay et al., 1989). Mycoplasma can have diverse negative effects on cell cultures (Tab. 6), and it is extremely difficult to eradicate this intracellular infection.
While there is good understanding in the respective fields of biotechnology, this is much less the case in basic research, where mycoplasma testing is neither internationally harmonized with validated methods nor common practice in all laboratories on a regular basis. The recent production of reference materials (Dabrazhynetskaya et al., 2011) offers hope for the respective validation attempts. The problem lies in the fact that at least 20 different species are found in cell culture, though 5 of them appear to be responsible for 95% of the cases (Bruchmüller et al., 2006). For a comparison of the different mycoplasma detection platforms see (Lawrence et al., 2010; Young et al., 2010), and Table 7.
The advent of human embryonic and, soon after, induced pluripotent stem cells, appears to be something of a game changer. First it promises to overcome the problems of availability of human primary cells, though a variety of commercial providers nowadays make almost all relevant human cells available in reasonable quality but at costs that are challenging, at least for academia. We have to see, however, that we do not yet really have protocols to achieve full differentiation of any cell type from stem cells. This is probably a matter of time, but many of the non-physiologic conditions taken from traditional cell culture contribute here. Stem cells have been praised for their genetic stability, which appears to be better than for other cell lines, but we increasingly learn of their limitations in that respect too (Mitalipova et al., 2005; Lund et al., 2012; Steinemann et al., 2013). The limitations experienced first are costs of culture and slow growth; many protocols require months, and labor, media, and supplement costs add up. The risk of infection unavoidably increases. Still we do not obtain pure cultures, often requiring cell sorting, which, however, implies detachment of cells with the respective disruption of culture conditions and physiology.
Owing to the author's own experience with non-reproducible in vitro papers during his own PhD, in 1996 the author started an initiative toward Good Cell Culture Practice (GCCP), which led in 1999 to a workshop and declaration in the general assembly of the Third World Congress on Alternatives and Animal Use in the Life Sciences in Bologna, Italy. We then established an ECVAM working group and finally produced GCCP guidance (Coecke et al., 2005). The details of this process recently were summarized in this series of articles (Hartung and Zurlo, 2012). Here, only a single epiphany shall be added: in the PhD thesis of my student Alessia Bogni we obtained commercial CHO cell lines declared to be only transfected with single CYP-450 enzymes. The karyograms in Figure 4 show the dramatic effects with losses and fusions of chromosomes, some of which, in the lower right corner, could not even be identified. We would have interpreted any differences in experimental results only by the presence or absence of a single gene product…
GCCP acknowledges the inherent variation of in vitro test systems calling for standardization. GLP gives only limited guidance for in vitro (Cooper-Hannan et al., 1999) though some parts of GCCP have been adapted into a GLP advisory document by OECD for in vitro studies (OECD, 2004). The topic of quality of the publication of in vitro studies in journal articles also has been addressed in our Food for thought … series earlier (Leist et al., 2010). GLP cannot normally be implemented in academia on the grounds of costs and lack of flexibility. For example, GLP requests that personnel be trained before they execute studies, while obviously students are “trained on the job.” We hope that GCCP also will be guidance for journals and funding bodies, thereby enforcing the use of these quality measures.
GCCP guidance was developed before the broad use of human stem cells. We attempted an update in a workshop, which, strangely, never has been published but was made available as a manuscript on the ECVAM website12: “hESC Technology for Toxicology and Drug Development: Summary of Current Status and Recommendations for Best Practice and Standardization. The Report and Recommendations of an ECVAM Workshop. Adler et al. Unpublished report.” We currently are aiming for an update workshop in early 2014 teaming up with FDA and the UK Stem Cell Bank.
Science is increasingly becoming aware of the shortcomings of its approaches. John Ioannidis has stirred us up with papers like those entitled “Why Most Published Research Findings Are False” (Ioannidis, 2005b) (“for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias”) or “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research” (Ioannidis, 2005a). As early as 1994 Altman wrote on “The scandal of poor medical research” (Altman, 1994). This does not even address the contribution of fraud (Fang et al., 2012). These early warnings now have been substantiated with the unsuccessful attempts by industry to reproduce important basic research. Drummond Rennie phrased it like this: “Despite this system, anyone who reads journals widely and critically is forced to realize that there are scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial, no literature too biased or too egotistical, no design too warped, no methodology too bungled, no presentation of results too inaccurate, too obscure, too contradictory, no analysis too self serving, no argument too trifling or too unjustified, and no grammar and syntax too offensive for a paper to end up in print. The function of peer review, then, may be to help decide not whether but where papers are published.”
The situation is not very different whether this is in vitro or in vivo work, which now often is combined anyway. Similar things can be said about in silico work (Hartung and Hoffmann, 2009), which is limited not only by the in vitro and in vivo data it is based on (trash in, trash out), but also by inherent problems of lack of data curation and overfitting.13 “Torture numbers, and they'll confess to anything” (Gregg Easterbrook). One difference is that in vitro approaches have developed the principles of validation. There is no field more self-critical than the area of alternative methods, where we spend half to one million $ and, on average, ten years to validate a method. Basic research could learn from this, not to go to the same extreme, which is becoming increasingly as much a burden as it is a solution to the problem, but to put sufficient effort into establishing the reproducibility and relevance of our methods. We are not calling for GLP for academia, but for the spirit of GLP to be embraced.
While this series of articles focuses mostly on toxicology, here we have attempted to extend some critical observations to research in general. This shall first of all show that toxicology is not different in its problems, and is perhaps even advanced with regard to internationally harmonized methods and quality assurance. It is perhaps too easy to just criticize. Henri Poincaré said “To know how to criticize is good, to know how to create is better.” A simple piece of advice: the changes that clinical research has undergone should be adopted by basic research and regulatory sciences, especially weighing of evidence, documentation, and quality assurance. Publish less, but of better quality, or as Altman (1994) put it: “We need less research, better research, and research done for the right reasons.”
Discussions with friends and colleagues shaped many of the arguments made here, especially Dr Marcel Leist, Dr Rosalie Elespuru, the GCCP taskforce, and the collaborators in the NIH transformative research grant “Mapping the Human Toxome by Systems Toxicology” (RO1ES020750) and FDA grant “DNTox-21c Identification of pathways of developmental neurotoxicity for high throughput testing by metabolomics” (U01FD004230) as well as NIH “A 3D model of human brain development for studying gene/environment interactions” (U18TR000547).
2http://en.wikipedia.org/wiki/List_of_withdrawn_drugs (Accessed June 21, 2013)
9Global in-vitro toxicity testing market to take off as push towards alternatives grows By Michelle Yeomans, 05-Nov-2012. http://bit.ly/12OlQs8
12Available at: http://ihcp.jrc.ec.europa.eu/our_labs/eurl-ecvam/archive-publications/workshop-reports (last accessed 9 June 2013)
13A wonderful illustration by David J. Leinweber, Caltech: Stupid data miner tricks: overfitting the S&P 500, showing that the stock market behavior over 12 years could be almost perfectly explained by three variables, i.e., butter production in Bangladesh, United States cheese production, and sheep population in Bangladesh and United States; available at: http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf