Diversity-oriented organic synthesis (DOS) is a strategy to make compound collections to probe biological systems1-7. Designing better DOS libraries requires having methods to assess the consequences of different synthesis decisions on the biological performance of resulting library members8. Since we are particularly interested in how stereochemistry affects performance in biological assays, we prepared a disaccharide library containing systematic stereochemical variations, assayed the library for different biological effects, and developed methods to assess the similarity of performance between members across multiple assays. These methods allow us to ask which subsets of stereochemical features best predict similarity in patterns of biological performance between individual members and which features produce the greatest variation of outcomes. We anticipate that the data-analysis approach presented here can be generalized to other sets of biological assays and other chemical descriptors. Methods to assess which structural features of library members produce the greatest similarity in performance for a given set of biological assays should help prioritize synthesis decisions in second-generation library development targeting the underlying cell-biological processes. Methods to assess which structural features of library members produce the greatest variation in performance should help guide decisions about what synthetic methods need to be developed to make optimal small-molecule screening collections.
There is a growing interest in making small-molecule libraries with diverse three-dimensional structures. Stereochemical features of small molecules affect their biological performance8-11, but efforts to quantify the roles of such features have been limited. In order to enable a rigorous study of the effects of stereochemistry on biological performance, a collection of small molecules containing systematic variation of multiple stereocenters is required. Carbohydrates offer an opportunity to vary individual stereocenters independently of changes in physical properties, topology, and appendage diversity, but like many complex scaffolds with three-dimensionality, it is not easy to make a large number of different oligosaccharide frameworks. Chemists must make choices about what to synthesize.
In practice, the biological performance of collections of oligosaccharides has been underexplored, in part due to the difficulty in synthesizing all possible stereochemical variants, but also due to lengthy syntheses required to make even monomers12. Another possible reason is that oligosaccharides are often thought to be unsuitable as small-molecule probes of cell biology, conceivably due to a lack of cell permeability or to metabolic transformation within the cell13. Given the synthetic difficulties in making these molecules, we wondered whether this impression may actually result from insufficient biological testing of oligosaccharide collections. In this study, we wanted to see if we could make a relatively small number of different disaccharide skeletons and determine which structural features correspond to similarity and variation in biological performance. These experiments would teach us the stereochemical features on which to focus in second-generation libraries intended to be enriched in likelihood and diversity of biological activities.
We first tested a subset of disaccharides in several cell-biological assays that we had previously developed, representing different cellular states, and chose two readouts to optimize our observation of dose-dependent biological effects. We defined the glycosidic bond combinatorially based on regiochemistry and relative stereochemistry, synthesized a library of 64 disaccharides, and assessed biological activities of these molecules at multiple concentrations. In order to derive relationships more sophisticated than those determined by analyzing individual compound effects, we developed methods to match biological performance similarity across multiple assays with chemical structure similarity using a stereochemical description of the library. Unsupervised clustering of the resulting biological measurements revealed patterns corresponding to particular stereochemical features of the disaccharides. To refine these relationships, we used an optimization algorithm to determine subsets of stereochemical features most important to biological performance similarity. Our results suggest that sets of stereocenters responsible for activity patterns can be determined in a systematic fashion. This approach allows data-driven decisions about synthetic choices for follow-up chemistry.
We sought to represent maximal disaccharide-based structural diversity, using a minimal set of disaccharide skeletons, by focusing on variations of the glycosidic bond. The effect of the glycosidic bond on the overall structural diversity of disaccharides has been well-appreciated for decades14,15. The conformation of a disaccharide is largely determined by the glycosidic linkage between the two monosaccharide rings. Endo-/exo-anomeric effects result in a relatively rigid conformation of the glycosidic bond that controls the overall shape of the disaccharide16-19, yielding compounds that are not merely flat conjugations of ring systems. We varied the following structural features (Figure 1A): the anomeric bond configuration (α or β), the linkage position on the reducing-end sugar monomer (1,2-, 1,3-, or 1,4-linked), the chirality of sugar monomers around the glycosidic bond (DD, LD, DL, or LL), and the stereochemistry of the acceptor hydroxyl group (equatorial or axial). These structural definitions enabled us to represent the full diversity of 48 combinations describing the glycosidic bond using a limited number of disaccharide skeletons.
We chose four commercially available monosaccharides as monomers (D-glucose, D-ribose, D-galactose, D-mannose), two corresponding enantiomers (L-glucose and L-ribose), and two 6-deoxy-L-variants (L-fucose or 6-deoxy-L-galactose, and L-rhamnose or 6-deoxy-L-mannose). To assess the importance of substituent effects, we also included monosaccharides containing an alternative functional group, in this case two amino sugars (D-N-acetyl-glucosamine and D-N-acetyl-galactosamine), for a total of 10 monosaccharide subunits. Recognizing that many biologically active molecules contain both hydrophilic and hydrophobic moieties, we chose to use the p-methoxyphenyl (OMP) group at the reducing end of the disaccharides, since OMP is easy to introduce or remove and is stable under most protecting-group manipulations. We chose the sulfoxide glycosylation methodology (Figure 1B)20,21 and used a total of 9 different protected sulfoxide donors representing 8 monosaccharide subunits (Figure 2), and 14 different protected acceptors representing 9 monosaccharide subunits (Figure 3), in a total of 40 glycosylation reactions (Supplementary Table T1) to synthesize 59 compounds. We supplemented these products with 5 additional compounds (1, 23, 48, 49, 50) prepared from commercially available disaccharides (Sigma-Aldrich; see Supplementary Information for CAS registry numbers and synthetic details) for a total library of 64 disaccharides (Table 1). Our library synthesis strategy relied on post-synthetic chromatography to separate some mixtures, resulting in a total number of compounds (64) greater than the number of design subsets (48). Each disaccharide was characterized by 1H and 13C NMR and by ESIMS (compound characterization provided as Supplementary Information).
We initially tested a subset of these disaccharides for general effects such as cell viability and cellular metabolism (Supplementary Figure S1). While none of the compounds was overtly cytotoxic, a few compounds caused an increase in mitochondrial membrane potential (ΔΨm), measured by JC-1 dye22,23, in murine preadipocytes. The maintenance of ΔΨm is essential for cellular ATP synthesis. One possible cause of an increased ΔΨm is enhanced oxidative metabolism of benefit to the organism; for example, an increase in ΔΨm is strongly correlated with glucose-induced insulin secretion in pancreatic beta cells24. Because mitochondrial biogenesis occurs during adipocyte differentiation in these cells25, we reasoned that compounds that affect mitochondrial function in preadipocytes might also affect the differentiation process itself. We had previously developed a plate-reader assay for the differentiation of pre-adipocytes to mature adipocyte cells, measured by staining lipid droplets with the fluorescent dye Nile Red26-28. To characterize the full 64-compound collection, we focused on these two readouts (Figure 4) in white preadipocytes and brown preadipocytes isolated from different murine genetic backgrounds29-31. We treated cells in duplicate with eight concentrations of each compound, ranging from 20μM to 6.3nM, for a total of ten parallel cell-biological assays (Figure 5).
Assay data were first scored as described for high-throughput screening data in ChemBank32. We transformed these initial scores to p-values representing confidence in signal (relative to mock-treated cells) to determine the top-performing compounds, as judged by the greatest concentration-dependent change (increase or decrease) in signal amplitude (Figure 6). The biological effects of some top-scoring compounds were verified individually. The observations of dose-dependent effects and assay-specific activities increased our confidence that the observed activities are compound-induced outcomes. For example, compounds 54, 26, and 18 demonstrated dose-dependent increases in signal in assay 1, and these activities were confirmed in follow-up experiments at the highest concentrations tested (Supplementary Figure S2). These values compare favorably to previous reports; for example, 1mM oleate treatment33 or a combination of 100nM H2O2 and 100μM GDP34 both increase the mitochondrial membrane potential in 3T3-L1 adipocytes. While we are interested in individual compound performances in these cells, in this study we chose to focus on methods to link stereochemical features with patterns of performance among these primary assay results. Further biological testing of individual compound effects will be described in the future.
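The ranking by concentration-dependent change (detailed in the Methods as a signed log(p) score, a three-point moving average with each endpoint counted twice, and a slope) can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names are hypothetical, the conversion from ChemBank scores to p-values is not reproduced, and the half-log dilution series in `LOG_CONC` is an assumption that is consistent with eight doses spanning 6.3nM to 20μM but is not stated explicitly in the text.

```python
import math

def signed_log_p(p_value, decreased):
    """Return -log10(p), negated when the signal fell below the vehicle controls."""
    magnitude = -math.log10(p_value)
    return -magnitude if decreased else magnitude

def smooth_three_point(scores):
    """Three-point moving average; each endpoint is counted twice in its own window."""
    padded = [scores[0]] + list(scores) + [scores[-1]]
    return [sum(padded[i:i + 3]) / 3.0 for i in range(len(scores))]

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def dose_response_slope(scores, log_concentrations):
    """Slope of the smoothed scores along the (log) concentration axis."""
    return slope(log_concentrations, smooth_three_point(scores))

# Assumed half-log dilution series, lowest to highest dose (log10 of molar concentration).
LOG_CONC = [-8.2 + 0.5 * i for i in range(8)]
```

Compounds would then be rank-ordered by the absolute value of `dose_response_slope`, so that both dose-dependent increases and decreases rise to the top of the list.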
We sought to identify stereochemical features that explain the observed patterns of biological performance across all assays. For this analysis, we represented the measurements made for each compound as an 80-dimensional vector comprising variation in both concentration and assay identity. This dataset (Figure 7A) contains the same scores as the full set of dose-responses (exemplified in Figure 6), but arranged into a single vector of measurements across all assays (a biological performance profile for each compound). To detect relationships between these profiles, we performed hierarchical clustering and applied a threshold to group small molecules into discrete clusters representing distinct biological performance patterns (Figure 7B), at least some of which contain structures closely related in stereochemistry among their members (Figure 8). This analysis also affords the complete matrix of pairwise similarities in biological performance among all 64 compounds (Figure 9).
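To make the profile construction concrete, the sketch below concatenates ten assays of eight doses each into one 80-element profile and groups profiles by agglomerative clustering cut at a distance threshold. It is illustrative only: the distance metric (one minus Pearson correlation), the average-linkage rule, and all function names are assumptions, since the text does not specify these choices, and the clustering in the study was performed in commercial software.

```python
import math

def build_profile(scores_by_assay):
    """Concatenate per-assay dose series into one flat performance profile."""
    profile = []
    for assay_scores in scores_by_assay:
        profile.extend(assay_scores)
    return profile

def pearson_distance(a, b):
    """1 - Pearson correlation (an assumed metric, not specified in the text)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # treat a flat profile as uncorrelated with everything
    return 1.0 - cov / math.sqrt(var_a * var_b)

def cluster_profiles(profiles, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        dists = [pearson_distance(profiles[i], profiles[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            ((linkage(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > threshold:
            break  # remaining clusters are further apart than the cut
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The threshold plays the role of the cut applied to the dendrogram; raising or lowering it changes the number of discrete clusters obtained.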
As a preliminary attempt to uncover relationships between stereochemistry and biological profiles, we examined the members of the two most active clusters (Figures 7B and 9; clusters I and V). We observed an enrichment in cluster V for disaccharides containing rhamnose at either monomer position (all but one member contains rhamnose), at the expense of disaccharides containing rhamnose in cluster I (no members contain rhamnose). We established the significance of this observation by chi-squared hypothesis testing against expectations resulting from a random distribution of rhamnose-containing disaccharides across all five clusters (p < 0.009).
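One way such a test could be set up is as a chi-squared goodness-of-fit comparison of observed rhamnose counts per cluster against counts expected if rhamnose-containing compounds were spread in proportion to cluster size. The sketch below is an assumption about the form of the test, not a reproduction of it: the function names are hypothetical, and the cluster sizes and counts in the usage note are placeholders, not the data from this study. With five clusters there are four degrees of freedom, for which the chi-squared survival function has a simple closed form.

```python
import math

def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic for matched observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_df4(x):
    """Survival function of the chi-squared distribution with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def enrichment_p_value(cluster_sizes, feature_counts):
    """p-value for a feature being distributed across five clusters by chance.

    Expected counts assume the feature is spread in proportion to cluster size.
    """
    if len(cluster_sizes) != 5 or len(feature_counts) != 5:
        raise ValueError("this sketch is hard-wired to five clusters (df = 4)")
    total = sum(cluster_sizes)
    total_feature = sum(feature_counts)
    expected = [total_feature * size / total for size in cluster_sizes]
    return chi2_sf_df4(chi_squared_statistic(feature_counts, expected))
```

For example, `enrichment_p_value([12, 20, 15, 8, 9], [0, 4, 3, 1, 8])` evaluates placeholder counts in which one cluster is depleted and another enriched. With expected counts this small, the chi-squared approximation is rough, and an exact test may be preferable.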
To refine the relationship between biological performance and stereochemistry, we created stereochemical descriptors for the disaccharide library that encode individual stereocenters contained within the monomer building blocks, rather than simply considering monomer identities. We represented these descriptors as a 20-dimensional binary fingerprint of features (Figure 10A; see also Supplementary Figure S3) for each disaccharide, with bits representing the (L-/D-) chirality of the sugar monomers (4 bits), the anomeric bond configurations (4 bits), and the relative stereochemistry of each additional stereocenter in the molecule (12 bits).
To explain similarities and differences among biological performance profiles (Figure 7B or 9) using these stereochemical descriptors, we designed and implemented an optimization algorithm to identify stereochemical features important in determining similarity of biological performance. Analyzing pairwise stereochemical similarity using these 20-dimensional descriptors results in a pairwise similarity matrix (Figure 10B), where each element is defined by the Tanimoto coefficient35-37 for chemical similarity using the two 20-dimensional fingerprints for that pair of compounds.
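For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch of the pairwise similarity matrix follows; the function names are hypothetical, and the convention that two all-zero fingerprints have similarity 1 is an assumption made here to avoid division by zero.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption: two fingerprints with no bits set are treated as identical.
    return both / either if either else 1.0

def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto coefficients."""
    n = len(fingerprints)
    return [[tanimoto(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]
```

Applied to 64 fingerprints of 20 bits each, `similarity_matrix` would return a 64 × 64 matrix with ones on the diagonal.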
Starting with this description of stereochemical features, we performed an optimization focused on biological cluster V. We considered two distributions of stereochemical similarities: those between pairs of library members both within cluster V, and those between pairs of library members, at least one of which is not in cluster V. We sought a subset of the 20 stereochemical features that maximized the difference between these distributions (Figure 10C). In other words, we picked a subset of the 20 features as a candidate description of relevant stereochemistry, and computed stereochemical similarity using only this subset of the 20 bits as the chemical “fingerprint” of each library member. We scored the candidate description by how well it distinguished similarities among cluster V members from other similarities. Our optimization then sought to add or remove stereochemical features, one at a time, to improve this score.
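Scoring one candidate description can be sketched as follows: restrict every fingerprint to the chosen bits, split the pairwise similarities into within-cluster pairs and all other pairs, and measure the separation of the two distributions. The sketch uses the two-sample Kolmogorov-Smirnov statistic, one of the two metrics named in the Methods; the function names and data layout are hypothetical, and the cluster must contain at least two members for the within-cluster distribution to be non-empty.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for binary fingerprints (empty vs. empty assumed to be 1)."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    values = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in values)

def score_subset(fingerprints, cluster_members, feature_subset):
    """Score a feature subset by how well it separates within-cluster pairs from the rest."""
    members = set(cluster_members)
    reduced = [[fp[k] for k in feature_subset] for fp in fingerprints]
    within, other = [], []
    for i, j in combinations(range(len(reduced)), 2):
        similarity = tanimoto(reduced[i], reduced[j])
        if i in members and j in members:
            within.append(similarity)   # both compounds are in the cluster
        else:
            other.append(similarity)    # at least one compound is outside it
    return ks_statistic(within, other)
```

Note that the K-S statistic is unsigned, so in practice one would also confirm that the within-cluster similarities are the higher of the two distributions.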
By focusing on individual stereocenters, rather than arbitrary monomer building blocks, we identified a set of stereochemical features enriched (p < 7.6 × 10⁻¹⁵; see Methods) among the members of biological cluster V (Figure 11A). Specifically, we identified that the syn configuration of substituents at C3 and C5 of the acceptor monomer is present in all members of cluster V. Additionally, the absolute configuration of C5 in both donor and acceptor monomers was enriched for the L-configuration; every member of biological cluster V contains an L-sugar monomer, and two-thirds of the members are LL-disaccharides. To understand the role of individual stereocenters in explaining differences in biological performance among library members, we performed a similar optimization seeking a subset of the 20 stereochemical features that maximized intra-cluster similarities (“signal”) for multiple biological clusters simultaneously, while minimizing inter-cluster similarities (“noise”). When we considered the most-active clusters (I and V) together, we identified a second subset of features that best explain (p < 0.007; see Methods) similarities within both of these clusters and the differences between their memberships (Figure 11B). Scoring candidate stereochemical descriptions in this way allows biological data to reveal which combinations of stereocenters might explain similarities and differences in biological performance.
Carbohydrates provide access to shape diversity. They are rigid and allow control over even distant relative substituent orientations by exercising synthetic control over stereochemical diversity among monosaccharide subunits. We have shown that systematic synthesis and biological testing of collections of disaccharides can identify compounds with specific, dose-dependent cellular activities. Conventional wisdom holds that disaccharides will have little activity due either to cell impermeability or to metabolism as nutrient sources13. In contrast, we observed multiple dose-dependent effects in a small collection of assays, suggesting that some of these compounds may have other activities as well. Ongoing screening activities with these compounds in our laboratories will determine whether this proves to be the case.
That rhamnose-containing compounds appear to have more similarity between biological activities in this set of assays is of potential clinical interest; test subjects who consumed L-rhamnose for four weeks experienced a significant reduction in serum triglyceride levels38. Further, L-rhamnose-rich polysaccharides have been shown to have proliferation-inducing effects on human skin fibroblasts39. L-Rhamnose was originally thought to be an inert sugar, though recent studies have suggested that it is indeed metabolized in vivo40. These results and observations should encourage future consideration of disaccharides as candidate compounds for cell-biological assays directed at probe discovery.
In addition to detecting enrichment at the level of individual monomer identities (e.g., rhamnose), we can refine this description to individual stereocenters by identifying enrichments among subsets of stereochemical features that nature has embedded in the monomers. Such a mapping between stereochemical features and biological performance similarity can readily form the basis for hypothesis generation. For example, our 64 library compounds were selected as a subset of a 540-member “virtual library” of molecules (Supplementary Table T2) we thought were accessible using our chemistry. Based on the foregoing analysis and a desire for more compounds that behave like those in cluster V, we could choose from the remaining prospective members to synthesize only LL-disaccharides with the syn configuration of C3 in acceptor sugars, of which there are only 22 examples (4.6%) among the 476 remaining compounds. Had we applied such a rule prospectively to the compounds we did synthesize, we would have made only 14 of our 64 compounds, of which 6 (42.9%) would have exhibited the pattern of biological performance corresponding to cluster V. This example illustrates the improvements in synthetic efficiency that might be achieved by testing in multiple assays relatively small subsets of larger prospective libraries before investing in the larger library synthesis effort. Importantly, this is just one illustration of the type of question to which our overall approach allows access. Each observed pattern of biological activity could form the basis for an optimization to find a relevant subset of stereochemical features, and compounds from each pair of patterns could yield information about the determinants of differences between their memberships. 
While it is unlikely that each test would result in a statistically significant set of features, the fact that such questions can be asked (and significance tested) in an automated and systematic fashion is an important step toward informing synthetic decision-making with biological performance data.
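The library-triage figures quoted above follow from simple arithmetic on the counts given in the text, which the short check below reproduces. The variable names are illustrative; all numbers are taken directly from the preceding discussion.

```python
virtual_library = 540            # prospective "virtual library" members
synthesized = 64                 # compounds actually made and tested
remaining = virtual_library - synthesized

rule_matches_remaining = 22      # remaining compounds satisfying the cluster V rule
rule_matches_synthesized = 14    # synthesized compounds satisfying the rule
cluster_v_hits = 6               # of those 14, compounds showing the cluster V pattern

fraction_remaining = rule_matches_remaining / remaining
hit_rate = cluster_v_hits / rule_matches_synthesized

print(f"{remaining} remaining; {fraction_remaining:.1%} match the rule")
print(f"hit rate under the rule: {hit_rate:.1%}")
```

The rule therefore narrows 476 candidates to 22, while 6 of the 14 rule-matching compounds already tested showed the cluster V pattern.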
We have shown that, by testing compounds in even a small and focused set of cell-biological assays, we can find statistically significant enrichments in structural features among members with similar biological performance, even within a relatively small disaccharide library. We concede that our studies do not provide any causal link between specific stereocenters and biological performance, and certainly provide no mechanistic information about the action of these molecules. Nevertheless, the correlations we observe can be used as guidelines to improve the efficiency of future library synthesis relative to desired biological outcomes or the diversity of such outcomes. The difficulty of synthesizing large numbers of disaccharides, and the barriers to high-throughput split-and-pool methods, meant that we were required to find alternative ways to restrict the scope of stereochemical space. Our solution was to probe widely and shallowly; we reduced the complexity of carbohydrate chemistry by synthesizing and testing one or two representatives of each subset comprising our combinatorial definition of the glycosidic bond (Figure 1A). As a result, we minimized the synthetic effort while attempting to maximize the biological information obtained. In spite of this design, our initial finding that monomer identities were enriched or depleted among compounds sharing biological activity patterns was fortuitous. Many fewer monomers (10) were used than subsets in our glycosidic bond definition (48), so many library members from different subsets shared monomer constituents, allowing us to assess the statistical significance of enrichment.
Our use of disaccharides was particularly well-suited to address questions about the importance of stereochemical features. The library members are effectively identical in chemical composition, differing primarily in stereochemistry. We were thus able to focus purely on stereochemical diversity, uncontaminated by considerations such as molecular weight or appendage diversity. While this focus on disaccharides allowed us to simplify the structural representation of the molecules considerably, it also perforce limited the application of our precise findings to disaccharides, and to the assays under consideration, but our analysis methods can readily be generalized to other sets of compounds and assays. To make this approach more generally useful, our future work will focus on discovering the substructure features systematically from a collection of structures, allowing patterns of activity to be predicted for any prospective small molecule. Moreover, the limited size of the library in the present work made impractical an independent “test set” of compounds to validate choices of stereochemical descriptors resulting from optimization, but future applications of the methods (e.g., in mining databases of high-throughput screening data) could readily include this important step. Similarly, we acknowledge that we limited our choice of assays by our initial observations with candidate disaccharides, and that our choice to resolve the biological assay data into five patterns of activity was arbitrary. Work currently underway will address the multiple possibilities for this choice explicitly, with the aim of finding significant enrichments in structural features at multiple levels of resolution of biological performance patterns. Nevertheless, this study provides a clear path forward for thinking about how biological activity patterns can be used to identify responsible chemical substructure features.
In summary, we have described a generalizable method for taking primary data from multiple cell-biological assays of a small-molecule library and determining structural features enriched in compounds sharing a pattern of biological activities. These results have implications for the recommendation of particular subsets of structures for resynthesis and second-generation library design, a highly sought-after ability when considering compound collections for high-throughput screening activities. If it were feasible to synthesize or select only small numbers of representatives from many defined areas of “chemical space”, a minimal number of compounds could be sufficient in initial screening experiments. Subsequent syntheses and biological assays could then be chosen to test specific hypotheses emerging from this data-analysis approach.
Compound characterization is provided in the Supplementary Information, including all library compounds and examples for key monosaccharide building blocks and intermediates. Unless stated otherwise, all reactions were performed under an argon atmosphere using dry, deoxygenated solvents either distilled or passed through an activated alumina column under argon. All other commercially obtained reagents were used without purification, unless otherwise noted. Reactions were monitored by thin-layer chromatography (TLC) using silica gel 60 F254 precoated plates (250 μm thickness, Sorbent Technologies). Flash chromatography was carried out on silica gel (particle size 40-75 μm, 60Å porosity, Sorbent Technologies). NMR spectra were obtained from a Varian Inova 500 (500 MHz for 1H, 125 MHz for 13C) or a Varian Inova 600 (600 MHz for 1H) or a Bruker DMX500 (125 MHz for 13C). Proton chemical shifts are reported in parts per million (ppm) with solvent residual peaks (D2O: 4.79 ppm, CD3OD: 3.31 ppm) as internal standards. Carbon chemical shifts are reported in ppm with solvent residual peak (CD3OD: 49.0 ppm) as an internal standard. Coupling constants (J) are given in hertz (Hz) and the abbreviations s, d, t, q, br and m refer to singlet, doublet, triplet, quartet, broad and multiplet, respectively. All assignments were made based on 1H-1H correlation spectroscopy (COSY), heteronuclear single quantum coherence (HSQC), selective 1D total correlation spectroscopy (TOCSY), and gradient-enhanced 1D nuclear Overhauser enhancement (nOe) methods. Low-resolution mass spectra (LRMS) were obtained on an Agilent Technologies LC/MSD instrument using electrospray ionization (ESI).
3T3-L1 preadipocytes (ATCC; Manassas, VA) were grown in Dulbecco's Modified Eagle's Medium (DMEM, Mediatech; Herndon, VA) supplemented with 10% donor calf serum and antibiotics (100 μg/mL penicillin/streptomycin mix) in a humidified atmosphere at 37 °C with 5% CO2. Immortalized brown preadipocytes were cultured similarly, in media containing 10% fetal bovine serum.
For all assays, 5,000 cells were seeded per well of black 384-well optical bottom plates (Nunc; Rochester, NY) at 50μL/well. The following day, 100nL compound was pin-transferred in duplicate into fresh media with a steel pin array, using the CyBi-Well robot (CyBio; Woburn, MA). In order to increase the number of mock-treated wells included in the control distribution, we pin-transferred DMSO vehicle to every well of an additional, parallel assay plate. All assay measurements were performed using the EnVision plate reader (PerkinElmer; Waltham, MA).
In polarized mitochondria, the JC-1 dye is converted from a diffuse green form to red fluorescent J-aggregates22,23. The ratio of red to green fluorescence serves as a readout of the mitochondrial membrane potential. After either a 1-h (acute) or 24-h (long-term) incubation with compound, media was aspirated from plates, and 20μL/well 3.25μM JC-1 (Molecular Probes; Invitrogen Corp.; Carlsbad, CA) in phenol red-free media was added. Plates were incubated for 2h at 37 °C and washed three times with 50μL/well PBS. Fluorescence was measured, first at ex/em 530nm/580nm (“red”), followed by ex/em 485nm/530nm (“green”).
Upon differentiation, cells accumulate intracellular lipid droplets, which can be stained due to their hydrophobic properties. After 48-h incubation with compound, cells were washed once with PBS, stained for 1h at room temperature with 1μM Nile Red in PBS, washed once with PBS, and fluorescence measured at 485nm/530nm.
ChemBank scores reflecting compound performance as compared to a mock-treated (DMSO) distribution were calculated as described32, and reflect background subtraction and normalization based on assay noise, which is represented by a distribution of vehicle-control wells. We converted these scores to p-values using a conservative estimate of confidence for unimodal distributions41, then negative-logarithm-transformed the p-values to reflect orders of confidence. We applied a negative algebraic sign to those values in the low-signal tail of the distribution of vehicle-control measurements, to give signed log(p) values as the scores for each well, with negative values representing significant decreases in signal and positive values representing significant increases in signal. Dose-dependent effects were identified using a three-point moving average of scores along the concentration axis, and compounds were rank-ordered by the slope of the relationship between concentration and these averaged values; each concentration extreme was counted twice in calculating the average values of the endpoints.
Our optimization method begins by choosing at random a subset of stereochemical features to include in a pairwise stereochemical similarity calculation. This candidate design is evaluated using a quantitative metric to distinguish the distance distributions, either the Kolmogorov-Smirnov (K-S) statistic or a signal-to-noise measure, as described in the text. Next, a single stereochemical feature is either added or removed (again randomly), and the new candidate design is tested; if the new design receives a better score than the previous design, the change is kept; otherwise, it is discarded before a new change in the design is tried. Iteration proceeds until no candidate change results in further improvement to the scoring metric. Multiple applications of this procedure result in a set of candidate designs, each representing the “local” best score for that set of iterations, and from this list the design with the best overall score is selected as the winner.
Hierarchical clustering visualization was performed in Spotfire DecisionSite (Spotfire, Inc.; Somerville, MA). Distance calculations and all optimizations were performed in MATLAB (The Mathworks, Inc.; Natick, MA).
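The add-or-remove search with repeated random starts can be sketched in a few lines. This is an illustration in Python, not the MATLAB code used in the study: the function names, the number of restarts, the fixed random seed, and the rule that an empty feature set is never scored are all assumptions. The `score` argument stands in for whichever metric is being maximized (for example, a K-S statistic computed on the reduced fingerprints).

```python
import random

def local_search(n_features, score, rng):
    """One run: random start, then single-feature changes kept only if the score improves."""
    current = {k for k in range(n_features) if rng.random() < 0.5}
    if not current:
        current = {rng.randrange(n_features)}
    best = score(current)
    improved = True
    while improved:
        improved = False
        candidates = list(range(n_features))
        rng.shuffle(candidates)        # try single-feature changes in random order
        for k in candidates:
            trial = current ^ {k}      # toggle feature k: add if absent, remove if present
            if not trial:
                continue               # assumption: an empty feature set is never scored
            trial_score = score(trial)
            if trial_score > best:
                current, best = trial, trial_score
                improved = True
                break                  # keep the change and start a new round
    return current, best               # no single change improves the score further

def optimize(n_features, score, restarts=20, seed=0):
    """Repeat the local search from random starts and return the best local optimum."""
    rng = random.Random(seed)
    runs = [local_search(n_features, score, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])
```

Because each run stops at a local optimum, the restarts matter: different starting subsets can converge on different feature sets, and only the best-scoring one is reported.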
The authors thank Stephanie Norton, Jason Burbank, and Nicky Tolliday for assistance with performing biological assays, and Nicole Bodycombe, Hyman Carrinski, Joshua Gilbert, and J. Anthony Wilson for helpful discussions about computational methods. Immortalized brown preadipocytes were generously provided by Yu-Hua Tseng and C. Ronald Kahn. Tony Stapon provided monomer precursors for library synthesis. This work was supported by the Broad Institute Center of Excellence in Chemical Methodology and Library Development (P50-GM069721) and the Broad Institute Exploratory Center for Cheminformatics Research (P20-HG003895). I.C.J. was supported by the Broad Institute's Summer Research Program in Genomics.
Supplementary Information. Additional details of the synthetic methods and compound characterization, the cell-biological and cheminformatic data, a description of the source code organization, and supplementary Figures and Tables are provided as Supplementary Information; the source code itself is available on request.