Over the past decade we have seen a growth in the provision of chemistry data and cheminformatics tools as either free websites or software as a service (SaaS) commercial offerings. These have transformed how we find molecule-related data and use such tools in our research. There have also been efforts to improve collaboration between researchers, either openly or through secure transactions using commercial tools. A major challenge in the future will be how such databases and software approaches handle the larger amounts of data accumulating from high throughput screening, and how they enable the user to draw insights, make predictions and move projects forward. We now discuss how information from some drug discovery datasets can be made more accessible, and how privacy of data should not overwhelm the desire to share it at an appropriate time with collaborators. We also discuss additional software tools that could be made available and provide our thoughts on the future of predictive drug discovery in this age of big data. We use some examples from our own research on neglected diseases, collaborations, mobile apps and algorithm development to illustrate these ideas.
A lot can happen in a decade. We have gone from having few if any free resources such as databases of small molecules or software for drug discovery on the web, to literally thousands. For example, databases like ChemSpider [2, 3] have grown to house not just tens of millions of molecules but have become repositories for reactions and a vast treasure trove of chemistry data. At the other extreme, commercial ventures offering software as a service (SaaS) are nothing new, and being on a vendor-hosted cloud or in an internal data center is not new either, even though companies continue to define their products by referring to it. What the pioneers of this approach do next will redefine the types of products we see created for drug discovery for years to come. As an example from our own experience, GeneGo (Thomson Reuters) was one of the earliest providers of drug discovery technologies (MetaDrug, MetaCore [4-10]) related to systems biology and integrated cheminformatics tools as SaaS. Collaborative Drug Discovery, Inc. (CDD) was likely the first to offer a private vault for storing chemistry and biology data as multitenant SaaS. Meanwhile, other larger software companies have acquired similar smaller companies to give them a presence 'on the cloud' and a collaborative software offering. Companies like CDD have built a viable business around the software and grants focused on using their technologies alongside other tools to advance research on neglected and other diseases [11, 13, 14].
Research collaborations are increasingly seen as key to accelerating biomedical research, and these are likely to be facilitated by computational methods. However, we have suggested previously that scientists are rarely collaborative or open with their data until publication or patenting, due to intellectual property (IP) concerns [13, 16]. Emerging collaborative software technologies allow researchers to draw the line between pre-competitive and competitive areas and data more precisely than was previously possible. Scientific collaborations are becoming increasingly important for the pharmaceutical industry, especially in difficult areas or those suggested to have less commercial viability. The industry has had to adapt by acquiring or partnering to bring in innovation or products, as well as by outsourcing many aspects of R&D. At some point companies have to share their data, whether during a collaboration, licensing, due diligence, pre-purchase or post-purchase. Each of these processes has challenges when it comes to sharing molecular structures and associated data representing the IP of the companies or research groups involved. Increasingly, such groups are involved in multi-organization collaborations such as public-private partnerships (PPPs).
As an example of PPPs, CDD is involved in several collaborations such as More Medicines for Tuberculosis (MM4TB), the Bill and Melinda Gates Foundation (BMGF) TB Accelerator and the NIH Blueprint for Neuroscience Research (BPN). In all of these initiatives, molecules and screening data with IP are securely shared between collaborators. Similar software is likely required, and available, to share genomic and proteomic data, but that is outside the scope of this discussion. Currently, data and associated molecules are selectively shared as a function of complex negotiations. Often, important information that could help the global goals of the project is then missing for the other groups involved. Ideally this could be shared too, in a way that did not interfere with other projects, IP or relationships outside the scope of the current project of interest. There is also a growing need for public collaborations through initiatives that require open data [3, 18], though some of these may not truly be open themselves, e.g. Open Source Drug Discovery. One European IMI-funded PPP initiative, the European Lead Factory, is focused on high throughput screening (analogous to what the NIH has funded at its many screening centers), and another initiative, Elixir, is a pan-European infrastructure for biological information. What cannot be denied is the growing mountain of data in the public domain and the likely growth in the need for collaboration to move projects rapidly and make sense of the accumulated information.
A decade ago the amount of HTS data available was just a fraction of what it is now. The arrival of PubChem, and the mandate to deposit NIH-funded experimental data into this database, has obviously had a big impact, putting thousands of assays and millions of data points onto the internet. But for such data to be valuable, the underlying data must be consistent, reliable and well-linked. The data also has to be of high quality, as errors in structure can multiply from database to database [23, 24]. Then we can apply or build algorithms to mine the data, find patterns in it and help make well-informed decisions. Can we really call this 'big data' though? It is all relative, as one scientist's big data is another's small data. Relative to many nonscientific fields, what cheminformatics data lacks in size, it makes up for in inconvenience and difficulty of handling. Perhaps we can just call this biomedical-related data "bigger data" compared with what we had access to in the past (e.g. tens to hundreds of compounds for quantitative structure-activity relationships).
One area in which we are seeing larger amounts of screening data become useful and more accessible is neglected disease research. Neglected diseases are a group of biologically unrelated diseases that are grouped together because they disproportionately affect marginalized populations, they lack effective treatments or vaccines, or the existing products to treat them are not accessible to the populations affected. While the definition of a neglected disease varies, the category generally includes tuberculosis (TB), malaria, Chagas disease, African sleeping sickness, schistosomiasis, leishmaniasis and others for which there is a lack of economic incentives or "market" to motivate product development [26-28]. Many of the pathogens involved, whether bacterial, parasitic or viral, have complex life cycles and diverse approaches for evading the host immune system, rendering the development of new drugs and vaccines all the more challenging. Furthermore, these neglected diseases receive a relatively small amount of research investment ($80M to approximately $500M) from governments and pharmaceutical companies in the developed world, when we know it costs over $1 billion to bring a drug to market. The scientific challenges and limited funding available for neglected disease drug discovery and development highlight the importance of doing as much as possible with the data. These diseases are not seen as commercially viable next to major diseases, so many companies donate patents, fund some limited research efforts and participate in PPPs. Currently available data relevant to neglected disease drug discovery is extremely diffuse, existing in an array of public or private databases (e.g. ChemSpider, PubChem, CDD, ChEMBL). One example is Mycobacterium tuberculosis (Mtb), the causative agent of TB, which has infected approximately 2 billion people and continues to kill 1.3 million people annually. We are seeing more companies make increasing quantities of screening data publicly accessible, as well as collaborate and share these data; for example, GlaxoSmithKline has made 177 compounds with Mtb activity and 14,000 compounds with antimalarial activity available. Surprisingly, we are still making very slow progress in finding novel therapeutics for TB, and the clinical pipeline is limited. Ideally we should be learning from past efforts in TB drug discovery, and yet we do not appear to be doing something simple yet effective: learning from the data that already exists. The current predominant method for identifying compounds active against Mtb is phenotypic high throughput screening (HTS) [36-39], and the hit rate of these screens tends to be in the low single digits as a percentage. We can estimate that upwards of 5 million compounds have been screened against Mtb over the last 5-10 years. There are around 1500 Mtb hits of interest from one laboratory alone [38-41]. Leveraging this prior knowledge (by curating the data) to produce validated computational models is an approach that can improve screening efficiency both in terms of cost and relative hit rates. Machine learning and classification methods have been used in TB drug discovery, and have enabled rapid virtual screening of compound libraries for novel chemotypes [43, 44].
The use of cheminformatics for tuberculosis drug discovery has been summarized [45-47]; it can be readily implemented early in the process as a means to limit the number of compounds needing to be screened, thereby saving time and money [48-52]. Recent publications in this area report hit rates >20% and focus on favorable compounds with low or no cytotoxicity [51, 52]. More recently, datasets have been combined so that all 350,000 molecules with in vitro data from a single laboratory could be used to build computational models. Interestingly, our recent data suggest that smaller models built with thousands of compounds may perform just as well as these "bigger data" models.
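To make that comparison concrete, the sketch below shows one way such a "small versus big" experiment can be run, assuming RDKit and scikit-learn. It is illustrative rather than our actual pipeline, and the dataset loader (`load_mtb_dataset`) is a hypothetical placeholder for a curated SMILES/activity file.

```python
# Illustrative learning-curve experiment (not the authors' actual pipeline):
# train a Bayesian classifier on Morgan (ECFP-like) fingerprints at several
# training-set sizes and score each on the same held-out compounds.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def featurize(smiles_list, n_bits=2048):
    """Convert SMILES to binary Morgan fingerprints (radius 3 ~ ECFP_6)."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)
        fps.append(np.array(fp))
    return np.array(fps)

# Hypothetical helper: returns SMILES strings and 0/1 activity labels.
smiles, labels = load_mtb_dataset()
X, y = featurize(smiles), np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Does a "bigger data" model beat one built from a few thousand compounds?
for n in (2000, 20000, len(X_train)):
    idx = np.random.RandomState(0).choice(
        len(X_train), min(n, len(X_train)), replace=False)
    model = BernoulliNB().fit(X_train[idx], y_train[idx])
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"training size {len(idx):>6d}: external ROC AUC = {auc:.3f}")
```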
Throughout all of this work using the Mtb datasets over more than 5 years, we have shown how additional value can be generated from such published data. Similar cheminformatics approaches have also been applied to other diseases [54-57]. Computational methods deliver cost savings by eliminating the need for some experiments and by allowing many hypotheses to be tested that would not be feasible without such models. While there has been considerable screening and identification of hits, a possible bottleneck is the progression of compounds and expansion of structure-activity relationships that could result in viable leads. To date we estimate that there are ca. 2000 in vitro Mtb hits that need prioritizing before progressing. The in vitro, in vivo and clinical data for TB do not exist in a single database. Our own efforts to collate mouse in vivo data for modeling took many months and were recently described. We see this lack of data coordination as a major limitation to progress. There is also no centralized organization for project management, and minimal collaboration or coordination in the field. This suggests that even though we are drowning in data, the bigger challenge is actually integrating and analyzing it before ultimately being able to use it for predictive models and prospective testing. These observations may be broadly applicable beyond Mtb, but they illustrate what can be achieved with much larger datasets than were available in the past.
Do we take privacy concerns for our data too far, or not far enough? Should we think more carefully about what the "real high-value" data is, and perhaps share more than we hoard? Should we simply find new ways to share data? For example, we have already seen several companies compare their compound libraries to each other, e.g. Bayer and Schering, Bayer and AstraZeneca, or, in the case of Pfizer, to the literature, using fingerprints, physicochemical properties and matching/similar compounds to show minimal overlap. While this is not the same as openly sharing molecules and their proprietary assay data, many companies are involved in PPPs like those described earlier. What steps could be taken to increase the amount of secure data sharing?
Finding new ways to share relevant chemical information about screening data while leaving structures blinded could open the door to increased collaboration. These methods include better strategies for identifying active molecules from primary screens, which leverage information from fingerprints, scaffold groupings [62, 63], economic modeling [64-66], and improved processing of raw data [67-69]. They also include automatic methods of organizing screening data into workflows and a series of approaches for visualizing how biological activity maps to chemical space [71-74]. Secure methods of sharing molecules and data could make outsourcing of chemical analysis possible (without sharing the structure itself). Outsourcing is increasingly important in drug discovery because it reduces the cost of many R&D efforts and enables centralization of expertise [75-77]. As more data are made available through these efforts, unexpected connections and patterns in the data may be identified that could have an impact on research; such connections are, by their nature, impossible to predict in advance. They include unexpected signals in screening data that indicate either specific molecules or mechanisms by which to treat human disease, or indications that might relate to adverse effects. Sharing large collections of proprietary assay data, with structures blinded, would enable researchers who were not part of the original data collection to potentially improve how we do drug discovery. For example, a recent study used a small dataset published in patents from AstraZeneca to show how different liquid dispensing methods can severely impact the IC50 data generated in high throughput screening, and in turn impact the computational models that are built and the decisions based on them. Collaboration across multiple pharmaceutical companies and academia could potentially address this on a much larger and more convincing scale, but it likely awaits the use of secure sharing methods that do not reveal structures.
Nearly a decade ago there were attempts at securely sharing molecule-related structure-activity relationship data, but these stalled when it was suggested that the proposed encryption methods were all fallible. For example, a 2005 American Chemical Society meeting, co-chaired by Dr. Christopher Lipinski and Dr. Tudor Oprea, included a session on securely sharing chemical information to support collaborative development of absorption, distribution, metabolism and excretion (ADME) predictors [79-89]. Swamidass and co-workers recently proposed several approaches to the problem of sharing molecules securely that may overcome the previous failings. First, they propose a new, secure method of sharing useful chemical information from small-molecule screens without revealing the structures of the molecules. The method generates scaffold networks for compounds, enabling sharing of: molecule identifiers with assay data; how molecules in a screen are connected to one another in a screening network; how molecules are grouped into scaffold groups; how these groups are connected into trees and networks; and how molecules are connected into R-group networks. Statistical analysis using the PubChem data also clearly demonstrated that scaffold networks do not convey enough information to reliably reveal chemical structure.
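The published method involves more machinery than can be reproduced here, but the minimal sketch below conveys the core idea of blinded scaffold groupings: only salted hashes of molecule and scaffold identities leave the data owner, yet collaborators can still see which actives fall into the same scaffold group. The example dataset and the salt are assumptions.

```python
# A minimal sketch (not the published algorithm itself): group molecules by
# Murcko scaffold, then release only salted hashes of molecule and scaffold
# identities together with assay results.
import hashlib
from rdkit.Chem.Scaffolds import MurckoScaffold

SALT = b"project-specific-secret"   # kept private by the data owner

def blind(text: str) -> str:
    """One-way, salted identifier that does not reveal the input."""
    return hashlib.sha256(SALT + text.encode()).hexdigest()[:16]

def blinded_record(mol_id: str, smiles: str, active: bool) -> dict:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles)
    return {
        "molecule": blind(mol_id),          # structure never shared
        "scaffold_group": blind(scaffold),  # same hash => same scaffold
        "active": active,
    }

# Hypothetical screen: only the blinded records leave the owner's firewall.
screen = [("CDD-001", "c1ccc2c(c1)ncc(CC(=O)O)n2", True),
          ("CDD-002", "c1ccc2c(c1)ncc(CCN)n2", True),
          ("CDD-003", "CCCCCC(=O)O", False)]
for rec in (blinded_record(*row) for row in screen):
    print(rec)
```

Here the first two (hypothetical) actives share a scaffold-group hash, so a collaborator learns that two actives belong to one chemotype without learning what that chemotype is.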
A second proposed approach from the same group uses a new, secure way of measuring the overlap between two private datasets. This method uses an algorithm to construct a shareable summary of a private dataset, called a "cryptoset". The overlap between two private datasets can be estimated by comparing their cryptosets; at the same time, it is not possible to determine which specific items are in a private dataset from its cryptoset. Unlike other approaches to this problem [92-94], the item-level security arises from the statistical properties of cryptosets rather than from the secrecy of the algorithm or computational difficulty, so cryptosets can be shared in public, untrusted environments.
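A simplified version of the idea can be sketched as follows; the published method includes refinements omitted here. Each private item is hashed into a fixed-length public count vector, and an overlap estimate falls out of the dot product after subtracting the expected background of random collisions.

```python
# Simplified cryptoset sketch (hedged: not the full published method).
# Only the count vector is shared; individual membership stays hidden.
import hashlib
import numpy as np

L = 1024  # public cryptoset length; larger L lowers estimation error

def cryptoset(items) -> np.ndarray:
    counts = np.zeros(L, dtype=int)
    for item in items:
        # A public hash function: both parties must use the same one.
        h = int(hashlib.sha256(item.encode()).hexdigest(), 16) % L
        counts[h] += 1
    return counts

def estimate_overlap(a: np.ndarray, b: np.ndarray) -> float:
    # Non-shared item pairs collide in the same bin with probability 1/L,
    # so subtract the expected background before reading off the overlap.
    return float(a @ b) - a.sum() * b.sum() / L

# Hypothetical private compound-ID sets with a true overlap of 500.
shared_ids = [f"INCHIKEY-{i}" for i in range(500)]
set_a = shared_ids + [f"A-only-{i}" for i in range(4500)]
set_b = shared_ids + [f"B-only-{i}" for i in range(2500)]
print("estimated overlap:",
      round(estimate_overlap(cryptoset(set_a), cryptoset(set_b))))
```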
We are aware of at least one company, MedChemica, which has successfully developed a business model around technology closely related (but not identical) to what Swamidass and co-workers propose above. MedChemica successfully negotiated agreements with three big pharma companies (AstraZeneca, Hoffmann-La Roche, Genentech) to share anonymized matched molecular pair data for the purpose of improving ADME optimization of lead compounds. MedChemica's partners pay them to provide software that processes the structures in internal ADME data into an anonymized form, very similar to the R-group networks described earlier. This anonymized data is then transferred to MedChemica, where it is analyzed, and specific rules to guide ADME optimization are extracted. These rules are then offered back to MedChemica's clients to aid in lead optimization.
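The aggregation step at the heart of such a scheme is conceptually simple, as the toy sketch below illustrates; this is our illustration, not MedChemica's software. Each anonymized record pairs a structural transformation with a measured property change, and rules emerge from transformations seen repeatedly with a consistent effect.

```python
# Toy sketch of extracting ADME rules from anonymized matched-pair records
# (illustrative only). Each record is (transformation, measured delta); the
# transformations and deltas below are invented.
from collections import defaultdict
from statistics import mean

records = [
    ("[*:1]H>>[*:1]F",      -0.20),
    ("[*:1]H>>[*:1]F",      -0.15),
    ("[*:1]CH3>>[*:1]OCH3", -0.45),
    ("[*:1]H>>[*:1]F",      -0.25),
    ("[*:1]CH3>>[*:1]OCH3", -0.50),
]

by_transform = defaultdict(list)
for transform, delta in records:
    by_transform[transform].append(delta)

# A "rule" is a transformation seen often enough with a consistent effect.
for transform, deltas in by_transform.items():
    if len(deltas) >= 3:
        print(f"{transform}: mean delta logD = {mean(deltas):+.2f} "
              f"(n={len(deltas)})")
```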
Approaches like these for secure data sharing need to be integrated into the software tools scientists use to store their data, to provide confidence when they do decide to share subsets of their data with different collaborators. This is becoming even more apparent as drug companies increasingly reach out to academics to fill internal research gaps by externalizing their fundamental chemistry, biology and screening research efforts.
One of the challenges after high throughput screening is to learn as much as possible about the hits or potential probe compounds being developed. Are they cytotoxic? What liabilities do they have? What off-targets do they hit? Could we predict as much as possible about the molecules before we invest more time and effort in them? This obviously assumes that the computational models for absorption, distribution, metabolism, excretion and toxicity (ADME/Tox) we use for particular properties are predictive and cover enough of chemistry space. A major property to understand is drug metabolism.
Some of the major issues in drug metabolism include identifying the enzyme(s) involved, the site(s) of metabolism, the resulting metabolite(s) and the rate of metabolism. Methods for predicting human drug metabolism from in vitro and computational methodologies, and for determining relationships between the structure and metabolic activity of molecules, are also critically important for understanding potential drug interactions and toxicity. The cytochrome P450 (P450) enzymes are of considerable interest both in terms of metabolism and drug-drug interactions. Computational methodologies can be used for prioritization and for uncovering the relationships between the structure and metabolic activity of novel molecules. A recent approach describes a method called XenoSite for building models that predict CYP-mediated sites of metabolism (SOM) for drug-like molecules, with predictive accuracies of 87% on average across nine distinct CYP substrate sets. While this approach focused on phase I metabolism, it is possible that such approaches could be applied to phase II enzymes as well.
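As a flavor of what per-atom SOM prediction involves, here is a deliberately minimal sketch in the spirit of, but far simpler than, XenoSite: each atom is described by a handful of descriptors, atoms known to be metabolized are labeled positive, and a classifier ranks the atoms of a new molecule. The training tuples shown are hypothetical.

```python
# Minimal per-atom site-of-metabolism sketch (far simpler than XenoSite):
# featurize every atom, label known sites, train a classifier, rank atoms.
import numpy as np
from rdkit import Chem
from sklearn.ensemble import RandomForestClassifier

def atom_features(atom):
    """Simple per-atom descriptors; real SOM models use far richer ones."""
    return [atom.GetAtomicNum(), atom.GetDegree(), atom.GetTotalNumHs(),
            int(atom.GetIsAromatic()), int(atom.IsInRing()),
            atom.GetFormalCharge()]

def featurize(smiles, som_indices=()):
    mol = Chem.MolFromSmiles(smiles)
    X = [atom_features(a) for a in mol.GetAtoms()]
    y = [int(a.GetIdx() in som_indices) for a in mol.GetAtoms()]
    return np.array(X), np.array(y)

# Hypothetical training set: (SMILES, atom indices metabolized by a CYP).
train = [("CCOc1ccccc1", {1}), ("CN(C)c1ccccc1", {0, 2}),
         ("CCCCc1ccccc1", {3})]
X = np.vstack([featurize(s, idx)[0] for s, idx in train])
y = np.concatenate([featurize(s, idx)[1] for s, idx in train])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank atoms of a new molecule by predicted probability of being a SOM.
Xq, _ = featurize("CCOc1ccc(CC)cc1")
scores = clf.predict_proba(Xq)[:, 1]
for idx in np.argsort(scores)[::-1][:3]:
    print(f"atom {idx}: P(SOM) = {scores[idx]:.2f}")
```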
Introducing such predictive approaches into software that stores screening data, or integrating with such tools, may be important for creating a pipeline process. This would enable the enzymes likely to be involved in a compound's metabolism to be predicted. That may be very important for avoiding specific patient populations that are poor or extensive metabolizers of a drug, which could present problems such as hepatotoxicity or lack of efficacy. Being able to provide this level of information on metabolism, and on other properties like toxicity, in software used for storing and sharing chemistry and biology data is likely to be of value in overall decision making. For example, there are already efforts like qsardb.org and ochem.eu which enable public model sharing and development [99, 100]. In addition, websites such as Chembench provide models and modeling tools to registered users. Our earlier work proposed that open source descriptors and algorithms may be comparable with some commercial software, and that this might facilitate more sharing of computational models. There have also been developments such as QSAR-ML, which was created to provide standards for the interoperability of QSAR models [103, 104]. One could imagine secure sharing of models being carried out similarly to the secure data sharing described earlier, such that models can be accessed by selected users. None of the websites for creating or storing QSAR models appear to offer this level of selectivity, and many companies may be wary of using them without some assurance of security. Vendors that can guarantee that a company's IP will remain secure are more likely to succeed in getting big pharmaceutical and biotechnology companies to use and share models in this way. One advantage of sharing models is that a collaborator can benefit from models developed with your proprietary data, which in turn benefits your shared goals. Sharing models openly with a community may encourage groups to add their own data to update the models and make them more relevant to internal projects, if indeed the data were generated under similar conditions. If you were sharing a model and wanted to ensure that the user could not identify compounds in the training set, you might disable any features that measure the distance or similarity to compounds in the training set, or at the very least make these outputs fuzzy. It is likely that more work and discussion on secure computational model sharing and development will happen in the future.
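The fuzzing idea raised above can be made concrete with a small sketch: instead of returning a raw maximum similarity to the (private) training set, the shared model returns only a coarse applicability-domain label, so even a query identical to a training compound reveals nothing more than "inside domain". The bin thresholds here are arbitrary assumptions.

```python
# Sketch of fuzzing a shared model's applicability-domain output so users
# cannot infer exact membership in the private training set.
import numpy as np

def fuzzy_domain_score(query_fp: np.ndarray, train_fps: np.ndarray) -> str:
    """Return a coarse domain label instead of a raw similarity value."""
    # Tanimoto similarity of a binary query fingerprint to each training one.
    inter = (train_fps & query_fp).sum(axis=1)
    union = (train_fps | query_fp).sum(axis=1)
    best = (inter / np.maximum(union, 1)).max()
    # Coarse bins: an exact match (1.0) is indistinguishable from 0.8+.
    if best >= 0.8:
        return "inside domain"
    if best >= 0.5:
        return "borderline"
    return "outside domain"

# Hypothetical binary fingerprints (rows = private training compounds).
rng = np.random.RandomState(0)
train_fps = rng.randint(0, 2, size=(100, 256)).astype(bool)
query = train_fps[0].copy()                  # even an exact training compound
print(fuzzy_domain_score(query, train_fps))  # only reveals "inside domain"
```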
We have previously suggested some of the needs and opportunities for cheminformatics, which we termed the "missing pieces". A decade ago, commercial and academic tools were virtually the only choices. In recent years we have seen a greater effort towards open source cheminformatics software [104, 106-108]. Also, a decade ago systems biology was piecing together small biology experiments such as protein-protein interactions to understand the "big picture". Now the amount of data available in some areas of biology (for many diseases or specific targets) is overwhelming. The challenge is knowing where to look for the data you need in the first place. It may be feasible to turn this around and say that databases or data sources should be more proactive about making their data accessible (or telling you what may be of interest). One way to do this is to use different avenues to create more value from the data.
Recently we have taken an approach we call "appification", that is, making a discrete molecule dataset available as a mobile application (app). This has become a common theme in the world of software, but is relatively new for structure-centric chemistry data. To our knowledge this was first achieved with the Green Solvents mobile app, which used the American Chemical Society Green Chemistry Institute Pharmaceutical Roundtable Solvent Selection Guide (a PDF document). This document lists the 60 solvents by chemical name (and excludes structures) and rates them against safety, health, air, water and waste categories with scores from 1 (few issues) to 10 (most concern), with additional color coding (green, yellow and red). The appification involved curation of the public data and development of a novel interface. The limitations in access and utility of the original document encouraged us to recast the content in a novel manner to greatly enhance its visibility and availability to practicing chemists. The data was also used to enable predictions for solvents outside the guide. A similar approach has been taken with data on 800 molecules with known targets in TB [112, 113] to create the app called TB Mobile. The data originated from a dataset in the CDD public database, but it was felt that the impact could be extended by creating a tool useful for scientists and educators. The resultant app enables the user to view the molecules and known targets alongside other data related to the biology of each target. This represents one relatively simple way to bring cheminformatics and bioinformatics together. We have recently also implemented naïve Bayesian models, using our own implementation of open source ECFP_6 descriptors in the app, to enable an alternative approach to target prediction as well as clustering of molecules.
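A minimal sketch of this style of target prediction is shown below, assuming RDKit and scikit-learn in place of the app's own fingerprint implementation; the molecule-target training pairs are invented for illustration.

```python
# Sketch of fingerprint-plus-naive-Bayes target ranking in the spirit of
# TB Mobile (illustrative; the app uses its own open source ECFP_6 code).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.naive_bayes import BernoulliNB

def fp(smiles, n_bits=1024):
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 3,
                                                          nBits=n_bits))

# Hypothetical training data: molecules annotated with known Mtb targets.
train = [("CC(=O)Nc1ccc(O)cc1",            "InhA"),
         ("c1ccc2[nH]ccc2c1",              "InhA"),
         ("OC(=O)c1ccccc1O",               "DprE1"),
         ("Nc1ccc(S(N)(=O)=O)cc1",         "DprE1"),
         ("CCN(CC)CCNc1ccnc2cc(Cl)ccc12",  "QcrB")]
X = np.vstack([fp(s) for s, _ in train])
model = BernoulliNB().fit(X, [t for _, t in train])

# Rank targets for a new molecule by posterior probability.
probs = model.predict_proba(fp("CC(=O)Nc1ccc(OC)cc1").reshape(1, -1))[0]
for target, p in sorted(zip(model.classes_, probs), key=lambda x: -x[1]):
    print(f"{target}: {p:.2f}")
```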
A further novel approach to creating open chemistry and biology databases builds on tools we take for granted, like Twitter and RSS feeds. A mobile app called Open Drug Discovery Teams (ODDT) harvests Twitter feeds on several hashtags (e.g., #malaria, #tuberculosis, #huntingtons, #hivaids, #greenchemistry, #chagas, #leishmaniasis and #sanfilipposyndrome, among many other rare diseases and other topics). Harvesting in this way enables open data and molecules to be collected in an app. You could also think of this as a database with each topic being a subsection (e.g., a database on tuberculosis, a database on malaria, etc.). The architecture of the currently deployed ODDT project is shown in Figure 1. The cheminformatics framework that powers the molsync.com web service has been extended to continually query Twitter and RSS feeds for relevant content and collect it in a database. We and others have tweeted links to molecules, data and papers into these topics. We then added the ability to endorse or reject tweets. In addition, the ability to display a thumbnail image for each tweet was added, as well as recognition of molecule images and a summary ticker tape.
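A stripped-down version of such a harvesting loop might look like the following, with placeholder feed URLs standing in for the real sources; the deployed system also queries the Twitter API and performs molecule-image recognition, which are omitted here.

```python
# Sketch of an ODDT-style harvesting loop: poll RSS feeds for disease topics
# and store new items in SQLite. Feed URLs are placeholders.
import sqlite3
import time
import feedparser  # pip install feedparser

TOPICS = {"tuberculosis": "https://example.org/feeds/tuberculosis.rss",
          "malaria":      "https://example.org/feeds/malaria.rss"}

db = sqlite3.connect("oddt.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS items (
                  link TEXT PRIMARY KEY, topic TEXT, title TEXT,
                  published TEXT, endorsements INTEGER DEFAULT 0)""")

def harvest_once():
    for topic, url in TOPICS.items():
        for entry in feedparser.parse(url).entries:
            link = entry.get("link")
            if not link:
                continue
            # PRIMARY KEY on link makes re-inserting duplicates a no-op.
            db.execute("INSERT OR IGNORE INTO items "
                       "(link, topic, title, published) VALUES (?, ?, ?, ?)",
                       (link, topic, entry.get("title", ""),
                        entry.get("published", "")))
    db.commit()

while True:               # continual querying, as in the deployed architecture
    harvest_once()
    time.sleep(15 * 60)   # poll every 15 minutes
```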
The ODDT app can also be used to manage multiple Twitter accounts for the user (Figure 2). The entry screen of the app displays the topics ranked by use. Tapping an image opens a topic on the incoming page, with the content listed on the right. Each tweet can be endorsed and its hyperlinks followed. The "recent" content page in the ODDT app shows entries with at least one endorsement, while the content section shows the most popular voted content in rank order. Molecules can be tapped to open in other apps and could be the start of a workflow [116, 117]. If you consider that one of the hurdles to putting data in public databases is the upload of data files, ODDT represents a simple approach enabling true one-click upload of molecules and data via a tweet. Perhaps this approach could be used for secure upload via other messaging systems or direct messaging. It could also be an approach from which the bigger web-based databases could learn.
From our experiences in neglected disease research, we think there is an opportunity to bring together a range of data and tools (Figure 3) that would facilitate and catalyze the identification of novel therapeutic candidates by combining bioinformatics and cheminformatics data, publications, models and data visualization tools, and curated in vitro and in vivo data. This would enable novel algorithms to be developed to infer candidate drug molecules, targets and mechanisms of drug action. This may in turn allow scientists to generate hypotheses in a single interface. The scientific challenges and limited funding available for neglected disease drug discovery and development highlight the importance of exploring alternative, lower cost approaches to advance drug discovery using cheminformatics, and of maximizing the data in the public domain.
Other challenges we see as commercial opportunities include how to turn databases and tools into assistants that make you aware of the data you might want to know about. For example, how can you find collaborators who might have interesting molecules or data? Methods like those described earlier for encrypting or sharing data securely might be valuable here, helping you find the data or alerting you to its availability. Designing algorithms that can discern the most useful data for connecting researchers could reduce the serendipity involved in building collaborations. A tool that uses social networking features for serious applications, such that users can "like" a molecule rather than a person, might be appealing for finding researchers with orthogonal preliminary results. Such a system could hasten the pace of research and allow for the sharing of negative data, which is often not published.
Our laboratories (if we still have them) may become like our homes, that is, an "internet of things". Our databases and software tools should be able to talk wirelessly to devices such as analytical instruments and automatically upload data (which we term "no click upload"). Perhaps, more likely, all of our science will happen outside our offices. We can leverage contract research organizations (CROs) and other contractors via sources like Assay Depot and Science Exchange, and our personal connections and networks of collaborators can do the science we need, following our extensive mining of published data [121, 122] and predictions, perhaps even using virtual screening to decide which compounds to test.
How can we use the available published data to help tailor medicines to overcome our own genetic variability and side effects? For example, variability in metabolism is one issue, but what about variability in transporters and in the regulation of different proteins that can impact drug disposition? We are at a stage where there is increasing interest in computational models for human drug transporters, which could be used proactively in the same way that we use models for P450s. Such metabolism and transporter models should probably be used in parallel to profile compounds and predict liabilities, drug-drug interactions and drug-transporter interactions.
Thinking about what is feasible by integrating data on diseases, or at least making it available alongside tools that facilitate collaboration and drug discovery, you can begin to imagine how non-scientists or non-specialists could leverage them too. For example, can we bring non-scientists in to help us develop "outside the box" thinking to tackle tough problems, whether in the design of molecules or in biological problems, to help cure rare diseases [124, 125]? We need to think about developing new tools that leverage the crowd (Table 1).
In summary, collaboration and tools that enable data sharing in drug discovery are likely to continue to grow in importance. Therefore some of the developments we propose for enabling secure or encrypted sharing methods may be important to consider. As databases are integrated or linked together, how we handle and license the data will be key, and some simple rules have already been proposed. The mountain of data available across public and private databases will undoubtedly continue to grow, and this will present challenges we will need to overcome in order to manipulate, mine and model it. We will need some creativity to develop new visualization paradigms that enable insights and lead to the next experiment. On the other hand, as mobile devices continue to expand their utility, useful tools and ways to interact with data are possible, as are extended workflows. While such devices may not yet be able to handle massive datasets on-device, they do present an access point to databases and more powerful tools in the cloud. Being able to take your data with you and explore it on a tablet has clear advantages. As we have shown here and elsewhere, mobile devices also represent a way to prototype how we can use published data and cheminformatics tools in new ways. The future may not look at all like the past: we may now be able to make cheminformatics more accessible to the masses, which is essential if we are to turn our accumulated data into something of real value that leads to biomedical advances. Our efforts in applying these various approaches to neglected diseases are just one example. That impact of cheminformatics is, in itself, an accomplishment worthy of more support, whether governmental or otherwise.
S.E. gratefully acknowledges colleagues at CDD, Dr. Joel S. Freundlich (Rutgers), Dr. Malabika Sarker (SRI) and Dr. Katalin Nadassy (Accelrys) for valuable discussions and assistance in developing some of the projects discussed. S.E. acknowledges that the Bayesian models were developed with support from Award Number R43 LM011152-01 "Biocomputation across distributed private datasets to enhance drug discovery" from the National Library of Medicine. TB Mobile and the associated datasets were made possible with funding from Award Number 2R42AI088893-02 "Identification of novel therapeutics for tuberculosis combining cheminformatics, diverse databases and logic based pathway analysis" from the National Institute of Allergy and Infectious Diseases.
Competing Financial Interests
NL is an employee and SE is a consultant for CDD Inc. SE is on the advisory board for Assay Depot. AJW is an employee of the Royal Society of Chemistry. AMC is the founder of Molecular Materials Informatics and a consultant for CDD.