The cores of globular proteins are densely packed, resulting in complicated networks of structural interactions. These interactions in turn give rise to dynamic structural correlations over a wide range of time scales. Accurate analysis of these complex correlations is crucial for understanding biomolecular mechanisms and for relating structure to function. Here we report a highly accurate technique for inferring the major modes of structural correlation in macromolecules using likelihood-based statistical analysis of sets of structures. This method is generally applicable to any ensemble of related molecules, including families of nuclear magnetic resonance (NMR) models, different crystal forms of a protein, and structural alignments of homologous proteins, as well as molecular dynamics trajectories. Dominant modes of structural correlation are determined using principal components analysis (PCA) of the maximum likelihood estimate of the correlation matrix. The correlations we identify are inherently independent of the statistical uncertainty and dynamic heterogeneity associated with the structural coordinates. We additionally present an easily interpretable method (“PCA plots”) for displaying these positional correlations by color-coding them onto a macromolecular structure. Maximum likelihood PCA of structural superpositions, and the structural PCA plots that illustrate the results, will facilitate the accurate determination of dynamic structural correlations analyzed in diverse fields of structural biology.
Biological macromolecules comprise extensive networks of interconnected atoms. These complex coupled networks result in correlated structural dynamics, where atoms and residues move and evolve together as concerted conformational changes. The availability of a wealth of macromolecular structures necessitates the use of robust strategies for analyzing the correlated modes of motion found in molecular ensembles. Current strategies use a combination of least-squares superpositions and statistical analysis of the structural covariance matrix. However, the least-squares treatment implicitly requires that atoms are uncorrelated and that each atom has the same positional uncertainty, two assumptions which are violated in structural ensembles. For example, the atoms in the proteins are connected by chemical bonds, covalent and non-covalent, resulting in strong correlations. Furthermore, different atoms have different variances, because some atoms are known with less precision or have greater mobility. Using maximum likelihood (ML) analysis, we have developed a technique that is markedly more accurate than the classical least-squares approach by accounting for both correlations and heterogeneous variances. The improved ability to accurately analyze the major modes of dynamic structural correlations will benefit a diverse range of biological disciplines, including nuclear magnetic resonance (NMR) spectroscopy, crystallography, molecular dynamics, and molecular evolution.
The large number of available HIV-1 protease structures provides a remarkable sampling of conformations of the different conformational states, which can be viewed as direct structural information about its dynamics. After structure matching, we apply principal component analysis (PCA) to obtain the important apparent motions, including bound and unbound structures. There are significant similarities between the first few key motions and the first few low-frequency normal modes calculated from a static representative structure with an elastic network model (ENM), strongly suggesting that the variations among the observed structures and the corresponding conformational changes are facilitated by the low-frequency, global motions intrinsic to the structure. Similarities are also found when the approach is applied to an NMR ensemble, as well as to molecular dynamics (MD) trajectories. Thus, a sufficiently large number of experimental structures can directly provide important information about protein dynamics, but ENM can also provide similar sampling of conformations.
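The PCA step described above can be sketched as follows. This is a minimal illustration, assuming the structures have already been matched (superposed) and that every structure contains the same atoms in the same order. The function name `ensemble_pca` and the synthetic ensemble are hypothetical, and the plain sample covariance is used here rather than any specific estimator from the studies above.

```python
import numpy as np

def ensemble_pca(coords):
    """PCA of an ensemble of superposed structures.

    coords: array of shape (n_structures, n_atoms, 3), assumed superposed.
    Returns eigenvalues, eigenvectors (as columns) and the fraction of
    variance per mode, all sorted by decreasing eigenvalue.
    """
    n_structures = coords.shape[0]
    flat = coords.reshape(n_structures, -1)        # (n_structures, 3 * n_atoms)
    centered = flat - flat.mean(axis=0)            # remove the mean structure
    cov = centered.T @ centered / (n_structures - 1)
    evals, evecs = np.linalg.eigh(cov)             # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    evals = np.clip(evals[order], 0.0, None)       # clip tiny negative round-off
    evecs = evecs[:, order]
    return evals, evecs, evals / evals.sum()

# Synthetic ensemble (illustration only): one collective displacement plus noise.
rng = np.random.default_rng(0)
n_atoms = 5
reference = rng.normal(size=(n_atoms, 3))
mode = np.zeros((n_atoms, 3))
mode[:, 0] = np.linspace(-1.0, 1.0, n_atoms)       # stretch-like motion along x
amplitudes = rng.normal(scale=2.0, size=40)
noise = rng.normal(scale=0.05, size=(40, n_atoms, 3))
ensemble = reference + amplitudes[:, None, None] * mode + noise

evals, evecs, frac = ensemble_pca(ensemble)
planted = mode.ravel() / np.linalg.norm(mode)
overlap_with_planted = abs(evecs[:, 0] @ planted)  # close to 1 if PC1 recovers the motion
```

In this toy case the first principal component should carry almost all of the variance and align with the planted displacement; with real ensembles the spectrum is usually spread over several modes.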
Prion Proteins (PrP) are among a small number of proteins for which large numbers of NMR ensembles have been resolved for sequence mutants and diverse species. Here, we perform a comprehensive principal components analysis (PCA) on the tertiary structures of PrP globular proteins to discern PrP subdomains that exhibit conformational change in response to point mutations and clade-specific evolutionary sequence mutation trends. This is to our knowledge the first such large-scale analysis of multiple NMR ensembles of protein structures, and the first study of its kind for PrPs. We conducted PCA on human (n = 11), mouse (n = 14), and wildtype (n = 21) sets of PrP globular structures, from which we identified five conformationally variable subdomains within PrP. PCA shows that different non-local patterns and rankings of variable subdomains arise for different pathogenic mutants. These subdomains may thus be key areas for initiating PrP conversion during disease. Furthermore, we have observed the conformational clustering of divergent TSE-non-susceptible species pairs; these non-phylogenetic clusterings indicate structural solutions towards TSE resistance that do not necessarily coincide with evolutionary divergence. We discuss the novelty of our approach and the importance of PrP subdomains in structural conversion during disease.
Prion Proteins (PrP) cause a variety of incurable TSE diseases, and are among a small number of proteins for which large numbers of NMR ensembles have been resolved for sequence mutants and diverse species. Here, we perform a comprehensive PCA study to assess conformational variation and discern the landscape of the PrP structural response to sequence mutation. This is to our knowledge the first large-scale analysis of multiple NMR ensembles for a specific protein, and the first study to perform a multivariate PCA on the native globular structures of PrP. We conducted exhaustive PCA on three PrP subsets: human and mouse subsets that include structures of sequence mutants, and the set of wild-type PrP (16 PrP species). PCA shows that different non-local patterns of variable subdomains arise for different pathogenic mutants. These subdomains may thus be key areas for initiating PrP conversion during disease. Furthermore, we observed that some evolutionarily divergent species that are non-susceptible to TSEs have surprising structural similarities in their PrPs. We discuss the novelty of our approach with respect to prions, and the advantage of this analysis as a fast, reliable starting point to identify interesting domains that may warrant further experimental and computational analysis.
We present the codimensional PCA, a novel and straightforward method for resolving sample heterogeneity within a set of cryo-EM 2D projection images of macromolecular assemblies. The method employs Principal Component Analysis (PCA) of resampled 3D structures computed using subsets of 2D data obtained with a novel hypergeometric sampling scheme. PCA provides us with a small subset of dominating “eigenvolumes” of the system, whose reprojections are compared with experimental projection data to yield their factorial coordinates constructed in a common framework of the 3D space of the macromolecule. Codimensional PCA is unique in the dramatic reduction of dimensionality of the problem, which facilitates rapid determination of both the plausible number of conformers in the sample and their 3D structures. We applied the codimensional PCA to a complex data set of T. thermophilus 70S ribosome, and we identified four major conformational states and visualized high mobility of the stalk base region.
Protein folding is considered here by studying the dynamics of the folding of the triple β-strand WW domain from the Formin binding protein 28 (FBP). Starting from the unfolded state and ending either in the native or nonnative conformational states, trajectories are generated with the coarse-grained united residue (UNRES) force field. The effectiveness of principal component analysis (PCA), an already-established mathematical technique for finding global, correlated motions in atomic simulations of proteins, is evaluated here for coarse-grained trajectories. The problems related to PCA and their solutions are discussed. The folding and non-folding of proteins are examined with free energy landscapes. Detailed analyses of many folding and non-folding trajectories at different temperatures show that PCA is very efficient for characterizing the general folding and non-folding features of proteins. It is shown that the first principal component captures and describes in detail the dynamics of the system. Anomalous diffusion in the folding/non-folding dynamics is examined using the mean-square displacement (MSD) and the fractional diffusion and fractional kinetic equations. The collision-less (or ballistic) behavior of a polypeptide undergoing Brownian motion along the first few principal components is accounted for.
principal component analysis; 1E0L; UNRES force field; folding dynamics; anomalous diffusion
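The MSD analysis mentioned above can be illustrated with a short sketch. It assumes a one-dimensional trajectory (for example, the projection onto one principal component) sampled at uniform time steps, and uses the standard scaling MSD(t) ~ t^alpha, where alpha = 1 corresponds to normal diffusion and alpha = 2 to ballistic motion. The function names and the toy trajectories are hypothetical and are not taken from the study.

```python
import math
import random

def mean_square_displacement(x, max_lag):
    """Time-averaged MSD of a 1D trajectory x for lags 1..max_lag."""
    msd = []
    for lag in range(1, max_lag + 1):
        diffs = [(x[i + lag] - x[i]) ** 2 for i in range(len(x) - lag)]
        msd.append(sum(diffs) / len(diffs))
    return msd

def diffusion_exponent(msd):
    """Least-squares slope of log(MSD) against log(lag)."""
    log_t = [math.log(lag) for lag in range(1, len(msd) + 1)]
    log_m = [math.log(m) for m in msd]
    mean_t = sum(log_t) / len(log_t)
    mean_m = sum(log_m) / len(log_m)
    num = sum((t - mean_t) * (m - mean_m) for t, m in zip(log_t, log_m))
    den = sum((t - mean_t) ** 2 for t in log_t)
    return num / den

# Ballistic toy trajectory: constant velocity, so MSD grows as lag squared.
ballistic = [0.5 * t for t in range(200)]
alpha_ballistic = diffusion_exponent(mean_square_displacement(ballistic, 20))

# Unbiased random walk: MSD grows roughly linearly with lag.
rng = random.Random(0)
position, walk = 0, []
for _ in range(20000):
    position += rng.choice((-1, 1))
    walk.append(position)
alpha_diffusive = diffusion_exponent(mean_square_displacement(walk, 20))
```

An exponent between these two limits, or below 1, would indicate anomalous diffusion over the fitted range of lags.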
The GeoPCA package is the first tool developed for multivariate analysis of dihedral angles based on principal component geodesics. Principal component geodesic analysis provides a natural generalization of principal component analysis for data distributed in non-Euclidean space, as in the case of angular data. GeoPCA presents projection of angular data on a sphere composed of the first two principal component geodesics, allowing clustering based on dihedral angles as opposed to Cartesian coordinates. It also provides a measure of the similarity between input structures based on only dihedral angles, in analogy to the root-mean-square deviation of atoms based on Cartesian coordinates. The principal component geodesic approach is shown herein to reproduce clusters of nucleotides observed in an η–θ plot. GeoPCA can be accessed via http://pca.limlab.ibms.sinica.edu.tw.
Ensemble based virtual screening refers to the use of conformational ensembles from crystal structures, NMR studies or molecular dynamics simulations. It has gained greater acceptance as advances in the theoretical framework, computational algorithms, and software packages enable simulations at longer time scales. Here we focus on the use of computationally generated conformational ensembles and emerging methods that use these ensembles for discovery, such as the Relaxed Complex Scheme or Dynamic Pharmacophore Model. We also discuss the more rigorous physics-based computational techniques such as accelerated molecular dynamics and thermodynamic integration and their applications in improving conformational sampling or the ranking of virtual screening hits. Finally, technological advances that will help make virtual screening tools more accessible to a wider audience in computer aided drug design are discussed.
Conformational ensembles are increasingly recognized as a useful representation to describe fundamental relationships between protein structure, dynamics and function. Here we present an ensemble of ubiquitin in solution that is created by sampling conformational space without experimental information using “Backrub” motions inspired by alternative conformations observed in sub-Angstrom resolution crystal structures. Backrub-generated structures are then selected to produce an ensemble that optimizes agreement with nuclear magnetic resonance (NMR) Residual Dipolar Couplings (RDCs). Using this ensemble, we probe two proposed relationships between properties of protein ensembles: (i) a link between native-state dynamics and the conformational heterogeneity observed in crystal structures, and (ii) a relation between dynamics of an individual protein and the conformational variability explored by its natural family. We show that the Backrub motional mechanism can simultaneously explore protein native-state dynamics measured by RDCs, encompass the conformational variability present in ubiquitin complex structures and facilitate sampling of conformational and sequence variability matching those occurring in the ubiquitin protein family. Our results thus support an overall relation between protein dynamics and conformational changes enabling sequence changes in evolution. More practically, the presented method can be applied to improve protein design predictions by accounting for intrinsic native-state dynamics.
Knowledge of protein properties is essential for enhancing the understanding and engineering of biological functions. One key property of proteins is their flexibility—their intrinsic ability to adopt different conformations. This flexibility can be measured experimentally but the measurements are indirect and computational models are required to interpret them. Here we develop a new computational method for interpreting these measurements of flexibility and use it to create a model of flexibility of the protein ubiquitin. We apply our results to show relationships between the flexibility of one protein and the diversity of structures and amino acid sequences of the protein's evolutionary family. Thus, our results show that more accurate computational modeling of protein flexibility is useful for improving prediction of a broader range of amino acid sequences compatible with a given protein. Our method will be helpful for advancing methods to rationally engineer protein functions by enabling sampling of conformational and sequence diversity similar to that of a protein's evolutionary family.
Catalytic loop motions facilitate substrate recognition and binding in many enzymes. While these motions appear to be highly flexible, their functional significance suggests that structure-encoded preferences may play a role in selecting particular mechanisms of motions. We performed an extensive study on a set of enzymes to assess whether the collective/global dynamics, as predicted by elastic network models (ENMs), facilitates or even defines the local motions undergone by functional loops. Our dataset includes a total of 117 crystal structures for ten enzymes of different sizes and oligomerization states. Each enzyme contains a specific functional/catalytic loop (10–21 residues long) that closes over the active site during catalysis. Principal component analysis (PCA) of the available crystal structures (including apo and ligand-bound forms) for each enzyme revealed the dominant conformational changes taking place in these loops upon substrate binding. These experimentally observed loop reconfigurations are shown to be predominantly driven by energetically favored modes of motion intrinsically accessible to the enzyme in the absence of its substrate. The analysis suggests that robust global modes cooperatively defined by the overall enzyme architecture also entail local components that assist in suitable opening/closure of the catalytic loop over the active site.
Protein loops have critical roles in ligand binding and catalysis. An unresolved issue in this context is the extent to which the intrinsic dynamics of proteins predispose loops to perform their molecular function. In this work, we (i) critically examine the structural changes undergone by functional/catalytic loops based on a set of enzyme crystal structures in the presence/absence of a ligand, and (ii) examine to what extent those motions are correlated with, or driven by, the global modes that are predictable using simplified, physics-based models. Using a dataset of 117 structures for ten enzymes of different sizes and oligomerization states, we show that the collective modes defined by the protein topology favor loop rearrangements in reasonable agreement with those experimentally observed upon activation. These results suggest that simple but robust motions encoded by the entire architecture, not the local binding site only, assist in binding of the ligand, positioning of the catalytic loop, and/or sequestration of the catalytic site, which in turn, enable efficient catalysis.
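The comparison between ENM modes and an observed conformational change can be sketched as below. This is a minimal example assuming the anisotropic network model (ANM) variant of the ENM, one node per residue, a uniform spring constant and a distance cutoff. The function names, the cutoff values and the random toy coordinates are assumptions for illustration, not the setup used in the study.

```python
import numpy as np

def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Build the 3N x 3N ANM Hessian for node coordinates of shape (N, 3)."""
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue                              # no spring beyond the cutoff
            block = -gamma * np.outer(d, d) / dist2   # off-diagonal super-element
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block    # diagonal blocks balance the rows
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian

def anm_modes(coords, n_modes=3, cutoff=15.0, gamma=1.0):
    """Return the n_modes lowest-frequency internal modes (as columns)."""
    evals, evecs = np.linalg.eigh(anm_hessian(coords, cutoff, gamma))
    # The first six eigenvalues are ~0: rigid-body translations and rotations.
    return evals[6:6 + n_modes], evecs[:, 6:6 + n_modes]

def overlap(displacement, mode):
    """Absolute cosine between a displacement vector and a mode."""
    return abs(displacement @ mode) / (np.linalg.norm(displacement) * np.linalg.norm(mode))

# Toy structure: 20 random nodes in a 10 A box, fully connected at cutoff 20 A.
rng = np.random.default_rng(1)
toy = rng.uniform(0.0, 10.0, size=(20, 3))
freqs, modes = anm_modes(toy, n_modes=3, cutoff=20.0)

# A made-up "observed" change built from the two slowest modes.
observed = 2.0 * modes[:, 0] + 1.0 * modes[:, 1]
overlaps = [overlap(observed, modes[:, k]) for k in range(3)]
```

With real data, `observed` would be a flattened coordinate difference (or a PCA mode) from the crystal structures, and large overlaps with the slowest modes would indicate that the change is accessible through the global dynamics.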
In conjunction with the recognition of the functional role of internal dynamics of proteins at various timescales, there is an emerging use of dynamic structural ensembles instead of individual conformers. These ensembles are usually substantially more diverse than conventional NMR ensembles and eliminate the expectation that a single conformer should fulfill all NMR parameters originating from 10^16 to 10^17 molecules in the sample tube. Thus, the accuracy of dynamic conformational ensembles should be evaluated differently from that of single conformers.
We constructed the web application CoNSEnsX (Consistency of NMR-derived Structural Ensembles with eXperimental data) allowing fast, simple and convenient assessment of the correspondence of the ensemble as a whole with diverse independent NMR parameters available. We have chosen different ensembles of three proteins, human ubiquitin, a small protease inhibitor and a disordered subunit of cGMP phosphodiesterase 5/6 for detailed evaluation and demonstration of the capabilities of the CoNSEnsX approach.
Our results present a new conceptual method for the evaluation of dynamic conformational ensembles resulting from NMR structure determination. The designed CoNSEnsX approach gives a complete evaluation of these ensembles and is freely available as a web service at http://consensx.chem.elte.hu.
A key question when analyzing high throughput data is whether the information provided by the measured biological entities (gene, metabolite expression for example) is related to the experimental conditions, or, rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to better understand the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data.
We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA).
In simulation studies and on real data sets, we showed that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach is well able to highlight relevant genes in the data with respect to the biological experiment.
IPCA and sIPCA are both implemented in the R package mixOmics dedicated to the analysis and exploration of high dimensional biological data sets, and on mixOmics' web-interface.
The prostate cancer antigen 3 (PCA3/DD3) gene is a highly specific biomarker upregulated in prostate cancer (PCa). In order to understand the importance of PCA3 in PCa we investigated the organization and evolution of the PCA3 gene locus.
We have employed cDNA synthesis, RT-PCR and DNA sequencing to identify 4 new transcription start sites, 4 polyadenylation sites and 2 new differentially spliced exons in an extended form of PCA3. Primers designed from these novel PCA3 exons greatly improve RT-PCR based discrimination between PCa, PCa metastases and BPH specimens. Comparative genomic analyses demonstrated that PCA3 has only recently evolved in an anti-sense orientation within a second gene, BMCC1/PRUNE2. BMCC1 has been shown previously to interact with RhoA and RhoC, determinants of cellular transformation and metastasis, respectively. Using RT-PCR we demonstrated that the longer BMCC1-1 isoform, like PCA3, is upregulated in PCa tissues and metastases and in PCa cell lines. Furthermore, PCA3 and BMCC1-1 levels are responsive to dihydrotestosterone treatment.
Upregulation of two new PCA3 isoforms in PCa tissues improves discrimination between PCa and BPH. The functional relevance of this specificity is now of particular interest given PCA3's overlapping association with a second gene BMCC1, a regulator of Rho signalling. Upregulation of PCA3 and BMCC1 in PCa has potential for improved diagnosis.
Principal component analysis (PCA) enables the building of statistical shape models of bones and joints. This has been used in conjunction with computer assisted surgery in the past. However, PCA of the clavicle has not been performed. Using PCA, we present a novel method that examines the major modes of size and three-dimensional shape variation in male and female clavicles and suggests a method of grouping the clavicle into size and shape categories.
Materials and methods
Twenty-one high-resolution computerized tomography scans of the clavicle were reconstructed and analyzed using a specifically developed statistical software package. After performing statistical shape analysis, PCA was applied to study the factors that account for anatomical variation.
The first principal component representing size accounted for 70.5 percent of anatomical variation. The addition of a further three principal components accounted for almost 87 percent. Using statistical shape analysis, clavicles in males have a greater lateral depth and are longer, wider and thicker than in females. However, the sternal angle in females is larger than in males. PCA confirmed these differences between genders but also noted that men exhibit greater variance and classified clavicles into five morphological groups.
Discussion And Conclusions
This unique approach is the first that standardizes a clavicular orientation. It provides information that is useful to both the biomedical engineer and the clinician. Other applications include implant design with regard to modifying current or designing future clavicle fixation devices. Our findings support the need for further development of clavicle fixation devices and raise the question of whether gender-specific devices are necessary.
Motivation: Nuclear magnetic resonance (NMR) spectroscopy has been used to study mixtures of metabolites in biological samples. This technology produces a spectrum for each sample depicting the chemical shifts at which an unknown number of latent metabolites resonate. The interpretation of this data with common multivariate exploratory methods such as principal components analysis (PCA) is limited due to high-dimensionality, non-negativity of the underlying spectra and dependencies at adjacent chemical shifts.
Results: We develop a novel modification of PCA that is appropriate for analysis of NMR data, entitled Sparse Non-Negative Generalized PCA. This method yields interpretable principal components and loading vectors that select important features and directly account for both the non-negativity of the underlying spectra and dependencies at adjacent chemical shifts. Through the reanalysis of experimental NMR data on five purified neural cell types, we demonstrate the utility of our methods for dimension reduction, pattern recognition, sample exploration and feature selection. Our methods lead to the identification of novel metabolites that reflect the differences between these cell types.
Supplementary Information: Supplementary data are available at Bioinformatics online.
This paper presents a complete implementation of the Principal Component Analysis (PCA) algorithm in Field Programmable Gate Array (FPGA) devices applied to high rate background segmentation of images. The classical sequential execution of different parts of the PCA algorithm has been parallelized. This parallelization has led to the specific development and implementation in hardware of the different stages of PCA, such as computation of the correlation matrix, matrix diagonalization using the Jacobi method and subspace projections of images. On the application side, the paper presents a motion detection algorithm, also entirely implemented on the FPGA, and based on the developed PCA core. This consists of dynamically thresholding the differences between the input image and the one obtained by expressing the input image using the PCA linear subspace previously obtained as a background model. The proposal achieves a high ratio of processed images (up to 120 frames per second) and high quality segmentation results, with a completely embedded and reliable hardware architecture based on commercial CMOS sensors and FPGA devices.
FPGA; PCA; CMOS sensor; object detection; image processing
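The background-model idea described above (express the input image in a PCA subspace learned from background frames, then threshold the difference) can be sketched in software. This is an illustration only: it assumes flattened grayscale frames, uses NumPy's SVD in place of the correlation matrix and Jacobi diagonalization implemented on the FPGA, and applies a fixed threshold instead of the dynamic thresholding of the paper. All names and the synthetic frames are hypothetical.

```python
import numpy as np

def fit_background(frames, n_components):
    """Learn a PCA background model from frames of shape (n_frames, n_pixels)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def foreground_mask(image, mean, axes, threshold):
    """Flag pixels that the background subspace cannot reconstruct."""
    coeffs = axes @ (image - mean)             # project onto the subspace
    reconstruction = mean + axes.T @ coeffs    # back-project to pixel space
    return np.abs(image - reconstruction) > threshold

# Synthetic 8x8 background whose overall brightness varies from frame to frame.
rng = np.random.default_rng(2)
base = rng.uniform(50.0, 100.0, size=64)
gains = rng.uniform(0.8, 1.2, size=30)
frames = gains[:, None] * base + rng.normal(scale=0.5, size=(30, 64))
mean, axes = fit_background(frames, n_components=1)

# New frame: brighter background plus a 2x2 object at rows 1-2, columns 2-3.
object_pixels = [10, 11, 18, 19]
image = 1.1 * base
image[object_pixels] += 80.0
mask = foreground_mask(image, mean, axes, threshold=30.0)
```

Because the single principal axis absorbs the global brightness change, only the object pixels should exceed the threshold in this toy case.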
Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
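The abstract does not spell out the selection rule, so the following sketch rests on an assumption: each SNP is scored by its squared weight in the top k principal axes (a column leverage score), and the highest-scoring SNPs are kept. The function names, the choice of k and the simulated genotypes are hypothetical and serve only to show how an unsupervised, PCA-based marker selection of this kind could work.

```python
import numpy as np

def pca_scores(genotypes, k):
    """Score each SNP by its squared weight in the top k principal axes.

    genotypes: array of shape (n_individuals, n_snps) with 0/1/2 allele counts.
    """
    centered = genotypes - genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:k] ** 2).sum(axis=0)

def select_snps(genotypes, k, n_select):
    """Return the indices of the n_select highest-scoring SNPs."""
    scores = pca_scores(genotypes, k)
    return np.argsort(scores)[::-1][:n_select]

# Simulated data: two populations of 200 individuals and 100 SNPs.
# Only the first 10 SNPs differ in allele frequency between the populations.
rng = np.random.default_rng(3)
n_per_pop, n_snps, n_informative = 200, 100, 10
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
freq_a[:n_informative] = 0.1
freq_b[:n_informative] = 0.9
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

# No population labels are passed to the selection step.
selected = select_snps(genotypes, k=1, n_select=n_informative)
```

In this simulation the selected indices should coincide with the planted informative SNPs; on real data the number of components k and the panel size would need to be chosen and validated, as the abstract describes for the HapMap and Puerto Rican datasets.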
Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics such as population genetics, and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure informative markers demand prior knowledge of the membership of the studied individuals to predefined populations. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. Finally, we show that our algorithm can also be successfully applied for the identification of structure informative markers when studying populations of complex ancestry.
The purpose of this research is to gain a greater insight into the hydrate formation processes of different carbamazepine (CBZ) anhydrate forms in aqueous suspension, where principal component analysis (PCA) was applied for data analysis. The capability of PCA to visualize and to reveal simplified structures that often underlie large data sets is explored. Different CBZ polymorphs were dispersed separately in aqueous solution, and then recovered and measured by FT-Raman spectroscopy. PCA was employed for visualizing the dynamics of the phase transformation from each CBZ polymorph to the dihydrate (DH). As a comparison to PCA visualization, the transformation process of each CBZ polymorph was quantified using PLS modeling. The results demonstrated that PCA has advantages in presenting the original data in terms of the differences and similarities, and also directly identifies the statistical patterns in the data even when the data set is large. These advantages provided greater insight into the measured Raman spectra as well as the phase transformation process of CBZ polymorphs to the DH in aqueous environment.
carbamazepine; dihydrate; principal component analysis
Cell penetrating peptides (CPPs) have attracted recent interest as drug delivery tools, although the mechanisms by which CPPs are internalized by cells are not well defined. Here we report a new experimental approach for the detection and secondary structure determination of CPPs in live cells using Raman microscopy with heavy isotope labeling of the peptide. As a first demonstration of principle, penetratin, a sixteen-residue CPP derived from the Antennapedia homeodomain protein of Drosophila, was measured in single, living melanoma cells. Carbon-13 labeling of the Phe residue of penetratin was used to shift the intense aromatic ring-breathing vibrational mode from 1003 cm−1 to 967 cm−1, thereby enabling the peptide to be traced in cells. Difference spectroscopy and principal components analysis (PCA) were used independently to resolve the Raman spectrum of the peptide from the background cellular Raman signals. Based on the position of the amide I vibrational band in the Raman spectra, the secondary structure of the peptide was found to be mainly random coil and β-strand in the cytoplasm, and possibly assembling as β-sheets in the nucleus. The rapid entry and almost uniform cellular distribution of the peptide, as well as the lack of correlation between peptide and lipid Raman signatures, indicated that the mechanism of internalization under the conditions of study was probably non-endocytotic. This experimental approach can be used to study a wide variety of CPPs as well as other classes of peptides in living cells.
The dynamics of macromolecular conformations are critical to the action of cellular networks. Solution X-ray scattering studies, in combination with macromolecular X-ray crystallography (MX) and nuclear magnetic resonance (NMR), strive to determine complete and accurate states of macromolecules, providing novel insights describing allosteric mechanisms, supramolecular complexes, and dynamic molecular machines. This review addresses theoretical and practical concepts, concerns, and considerations for using these techniques in conjunction with computational methods to productively combine solution-scattering data with high-resolution structures. I discuss the principal means of direct identification of macromolecular flexibility from SAXS data followed by critical concerns about the methods used to calculate theoretical SAXS profiles from high-resolution structures. The SAXS profile is a direct interrogation of the thermodynamic ensemble and techniques such as, for example, minimal ensemble search (MES), enhance interpretation of SAXS experiments by describing the SAXS profiles as population-weighted thermodynamic ensembles. I discuss recent developments in computational techniques used for conformational sampling, and how these techniques provide a basis for assessing the level of the flexibility within a sample. Although these approaches sacrifice atomic detail, the knowledge gained from ensemble analysis is often appropriate for developing hypotheses and guiding biochemical experiments. Examples of the use of SAXS and combined approaches with X-ray crystallography, NMR, and computational methods to characterize dynamic assemblies are presented.
Small-angle X-ray scattering (SAXS); Macromolecular flexibility; Rigid-body modeling; Ensemble analysis
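The description of a SAXS profile as a population-weighted ensemble can be illustrated with a small sketch. This is not the MES algorithm itself: it assumes only two-conformer ensembles and replaces the search with a brute-force scan over conformer pairs and weights. The function names, the synthetic Guinier-type profiles, and the radii of gyration are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def chi2(i_exp, sigma, i_model):
    """Chi-squared per data point after fitting one overall scale factor."""
    # Scale factor c that minimizes sum(((i_exp - c * i_model) / sigma)**2).
    c = np.sum(i_exp * i_model / sigma**2) / np.sum(i_model**2 / sigma**2)
    return np.mean(((i_exp - c * i_model) / sigma) ** 2)

def best_two_state_ensemble(i_exp, sigma, profiles, n_weights=101):
    """Scan all conformer pairs (i, j) and weights w; return (chi2, i, j, w).

    The model profile is the population-weighted average
    w * profiles[i] + (1 - w) * profiles[j].
    """
    best = None
    for i, j in itertools.combinations(range(len(profiles)), 2):
        for w in np.linspace(0.0, 1.0, n_weights):
            model = w * profiles[i] + (1.0 - w) * profiles[j]
            score = chi2(i_exp, sigma, model)
            if best is None or score < best[0]:
                best = (score, i, j, w)
    return best

# Toy data (assumption): Guinier-type profiles I(q) = exp(-(q * Rg)**2 / 3)
# for three hypothetical conformers with different radii of gyration.
q = np.linspace(0.01, 0.3, 50)
profiles = [np.exp(-(q * rg) ** 2 / 3.0) for rg in (15.0, 25.0, 40.0)]

# Noise-free "experimental" curve: 30% conformer 0 plus 70% conformer 2.
i_exp = 0.3 * profiles[0] + 0.7 * profiles[2]
sigma = np.full_like(q, 0.01)

score, i, j, w = best_two_state_ensemble(i_exp, sigma, profiles)
```

With real data the theoretical profiles would come from a SAXS calculator applied to atomic models, and the weights would carry uncertainty that a grid scan does not quantify.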
Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. Therefore, populations are structured into genetically distinct subpopulations. As genotypic datasets become ever larger, it is increasingly difficult to correctly estimate the number of subpopulations and assign individuals to them. Computationally efficient non-parametric methods, chiefly those based on Principal Components Analysis (PCA), are thus increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, their accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework which addresses this shortcoming.
A novel population structure analysis algorithm called iterative pruning PCA (ipPCA) was developed which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structures, the subpopulation assignments of individuals made by ipPCA were largely consistent with the STRUCTURE, BAPS and AWclust algorithms. On the other hand, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by other methods.
The algorithm is computationally efficient and not constrained by the dataset complexity. This systematic subpopulation assignment approach removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogeneous population.
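The general idea of iterative pruning, in which PCA is applied to a set of individuals, the set is split in two, and the procedure recurses on each half until no further structure is detected, can be sketched as follows. This is a simplified illustration and not the published ipPCA implementation: the stopping rule (top eigenvalue compared with a Marchenko-Pastur-style bound scaled by an arbitrary `factor`) and the split rule (largest gap along the first principal component) are stand-ins chosen for brevity, and all function names and simulation settings are hypothetical.

```python
import numpy as np

def standardize(g):
    """Center and scale each SNP column; drop SNPs monomorphic in this subset."""
    sd = g.std(axis=0)
    keep = sd > 0
    return (g[:, keep] - g[:, keep].mean(axis=0)) / sd[keep]

def top_eigen_ratio(z):
    """Return (largest eigenvalue / mean nonzero eigenvalue, PC1 scores)."""
    n, p = z.shape
    gram = z @ z.T / p                    # individual-by-individual matrix
    evals, evecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
    mean_eval = evals.sum() / (n - 1)     # centering removes one dimension
    return evals[-1] / mean_eval, evecs[:, -1]

def ip_pca(genotypes, indices=None, factor=1.5, min_size=5):
    """Recursively split individuals; return a list of index arrays."""
    if indices is None:
        indices = np.arange(genotypes.shape[0])
    if len(indices) < 2 * min_size:
        return [indices]
    z = standardize(genotypes[indices].astype(float))
    n, p = z.shape
    ratio, pc1 = top_eigen_ratio(z)
    # Assumed heuristic: for unstructured data the top eigenvalue stays near
    # (1 + sqrt(n / p))**2, so only a clearly larger value triggers a split.
    if ratio < factor * (1.0 + np.sqrt(n / p)) ** 2:
        return [indices]
    # Assumed split rule: cut the sorted PC1 scores at their largest gap.
    order = np.argsort(pc1)
    cut = np.argmax(np.diff(pc1[order])) + 1
    left, right = indices[order[:cut]], indices[order[cut:]]
    if min(len(left), len(right)) < min_size:
        return [indices]
    return (ip_pca(genotypes, left, factor, min_size)
            + ip_pca(genotypes, right, factor, min_size))

# Toy simulation (assumption): three well-separated subpopulations with
# independent allele frequencies, 30 individuals each, 500 unlinked SNPs.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 30, 500
freqs = rng.uniform(0.1, 0.9, size=(3, n_snps))
genotypes = np.vstack(
    [rng.binomial(2, f, size=(n_per_pop, n_snps)) for f in freqs]
)
clusters = ip_pca(genotypes)
```

Each returned array holds the row indices of one inferred subpopulation, so no prior population labels are needed. Closely related subpopulations would call for a more careful test for remaining structure than the crude threshold used here.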
While ground state structures combined with chemical tools and enzyme kinetics deliver useful information on possible chemical mechanisms of enzyme catalysis, they do not unravel the finely balanced energy inventory to explain the impressive rate enhancement of enzymes. For this goal, a complete description of enzyme catalysis in the form of an energy landscape is needed. Since the rate of catalysis is determined by the climb over a sequence of energy barriers, we focus here on the critical question of transition pathways. A combination of time-resolved NMR and simulation delivers a glimpse into how proteins can so efficiently move within the ensemble of the native conformations while avoiding unfolding during that journey. The loss of energy due to breakage of native contacts is compensated by non-native transient hydrogen bonds during the transition, thereby “holding on” to the energy until the new native contacts that define the alternate functional state are formed. The use of kinetic isotope effects (KIE) to study the chemical step shows that coordinated atomic fluctuations of the protein component dictate the probability of “correct” distance and orientation, due to its extreme sensitivity to distance. The examples here stress the point that highly choreographed conformational sampling together with chemical integrity is a prerequisite for efficient enzyme catalysis.
Successful implementation of feature selection in nuclear magnetic resonance (NMR) spectra not only improves classification ability, but also simplifies the entire modeling process and, thus, reduces computational and analytical efforts. Principal component analysis (PCA) and partial least squares (PLS) have been widely used for feature selection in NMR spectra. However, extracting meaningful metabolite features from the reduced dimensions obtained through PCA or PLS is complicated because these reduced dimensions are linear combinations of a large number of the original features. In this paper, we propose a multiple testing procedure controlling false discovery rate (FDR) as an efficient method for feature selection in NMR spectra. The procedure clearly compensates for the limitation of PCA and PLS and identifies individual metabolite features necessary for classification. In addition, we present orthogonal signal correction to improve classification and visualization by removing unnecessary variations in NMR spectra. Our experimental results with real NMR spectra showed that classification models constructed with the features selected by our proposed procedure yielded smaller misclassification rates than those with all features.
false discovery rate; metabolomics; nuclear magnetic resonance; orthogonal signal correction; feature selection
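Feature selection by FDR-controlled multiple testing can be sketched in a few lines. The abstract does not say which FDR procedure is used, so the example below assumes the Benjamini-Hochberg step-up procedure. It also assumes that one p-value per spectral feature has already been computed (for example from a two-group test on each feature). The function name and the p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of features selected at FDR level q.

    Step-up rule: sort the p-values, find the largest rank k with
    p_(k) <= q * k / m, and select the k smallest p-values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        selected[order[:k + 1]] = True      # everything up to that rank is kept
    return selected

# Hypothetical p-values, one per NMR spectral feature.
p_values = [0.02, 0.021, 0.03, 0.9]
mask = benjamini_hochberg(p_values, q=0.05)
```

Unlike a principal component, each selected entry corresponds to a single original feature, which is what makes the result directly interpretable. The first p-value in the toy input exceeds its own threshold but is still selected because a larger p-value passes at a higher rank, which is the defining behaviour of a step-up procedure.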
We describe a promoter recognition method named PCA-HPR to locate eukaryotic promoter regions and predict transcription start sites (TSSs). We computed codon (3-mer) and pentamer (5-mer) frequencies and created codon and pentamer frequency feature matrices to extract informative and discriminative features for effective classification. Principal component analysis (PCA) is applied to the feature matrices and a subset of principal components (PCs) is selected for classification. Our system uses three neural network classifiers to distinguish promoters versus exons, promoters versus introns, and promoters versus 3' untranslated regions (3'UTRs). We compared PCA-HPR with three well-known existing promoter prediction systems: DragonGSF, Eponine and FirstEF. Validation on three test sets shows that PCA-HPR achieves the best performance among all four predictive systems.
promoter recognition; sequence feature; CpG islands; transcription start sites; principal component analysis
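The feature-extraction step described above (k-mer frequency matrices followed by PCA) can be sketched as follows. The abstract does not state whether 3-mers are counted in overlapping windows or in a fixed reading frame, so overlapping windows are assumed here. The function names and toy sequences are hypothetical, and the neural network classifiers are not covered.

```python
from itertools import product
import numpy as np

def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers (overlapping windows)."""
    kmers = ["".join(t) for t in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])  # skips windows with N etc.
        if pos is not None:
            counts[pos] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def pca_project(features, n_components):
    """Project rows of a feature matrix onto the leading principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical toy sequences standing in for promoter and non-promoter regions.
sequences = [
    "ACGACGTTGACGCGCGATATAAGGCGC",
    "TTTTAAATATATACGCGTATAAATTTA",
    "GCGCGCCGGCGCATGCGCGGCCGCGCA",
    "ATGGCCATTGTAATGGGCCGCTGAAAG",
]

# One feature matrix per k-mer size; PCA is applied to each separately.
codon_matrix = np.array([kmer_frequencies(s, 3) for s in sequences])
pentamer_matrix = np.array([kmer_frequencies(s, 5) for s in sequences])
codon_pcs = pca_project(codon_matrix, n_components=2)
pentamer_pcs = pca_project(pentamer_matrix, n_components=2)
```

The reduced matrices (`codon_pcs`, `pentamer_pcs`) are what a downstream classifier would take as input. How many components to keep is a tuning choice that the abstract does not specify.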
Summary: We developed a Python package, ProDy, for structure-based analysis of protein dynamics. ProDy allows for quantitative characterization of structural variations in heterogeneous datasets of structures experimentally resolved for a given biomolecular system, and for comparison of these variations with the theoretically predicted equilibrium dynamics. Datasets include structural ensembles for a given family or subfamily of proteins, their mutants and sequence homologues, in the presence/absence of their substrates, ligands or inhibitors. Numerous helper functions enable comparative analysis of experimental and theoretical data, and visualization of the principal changes in conformations that are accessible in different functional states. The ProDy application programming interface (API) has been designed so that users can easily extend the software and implement new methods.
Availability: ProDy is open source and freely available under GNU General Public License from http://www.csb.pitt.edu/ProDy/.
Contact: email@example.com; firstname.lastname@example.org
The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis.
A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA.
The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.