1.  Evaluation of the power and type I error of recently proposed family-based tests of association for rare variants 
BMC Proceedings  2014;8(Suppl 1):S36.
Until very recently, few methods existed to analyze rare-variant association with binary phenotypes in complex pedigrees. We consider a set of recently proposed methods applied to the simulated and real hypertension phenotype as part of the Genetic Analysis Workshop 18. Minimal power of the methods is observed for genes containing variants with weak effects on the phenotype. Application of the methods to the real hypertension phenotype yielded no genes meeting a strict Bonferroni cutoff of significance. Some prior literature connects 3 of the 5 most associated genes (p < 1 × 10^-4) to hypertension or related phenotypes. Further methodological development is needed to extend these methods to handle covariates, and to explore more powerful test alternatives.
PMCID: PMC4143711  PMID: 25519321
2.  Evaluating the concordance between sequencing, imputation and microarray genotype calls in the GAW18 data 
BMC Proceedings  2014;8(Suppl 1):S22.
Genotype errors are well known to increase type I errors and/or decrease power in related tests of genotype-phenotype association, depending on whether the genotype error mechanism is associated with the phenotype. These relationships hold for both single and multimarker tests of genotype-phenotype association. To assess the potential for genotype errors in Genetic Analysis Workshop 18 (GAW18) data, where no gold standard genotype calls are available, we explored concordance rates between sequencing, imputation, and microarray genotype calls. Our analysis shows that missing data rates for sequenced individuals are high and that there is a modest amount of called genotype discordance between the 2 platforms, with discordance most common for lower minor allele frequency (MAF) single-nucleotide polymorphisms (SNPs). Some evidence for discordance rates that were different between phenotypes was observed, and we identified a number of cases where different technologies identified different bases at the variant site. Type I errors and power loss are possible in downstream analysis of GAW18 data as a result of missing genotypes and errors in called genotypes.
PMCID: PMC4143748  PMID: 25519374
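The concordance and missing-data rates discussed in this abstract can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0, 1, or 2 copies of the minor allele and that a missing call is `None`; the function name `concordance` and this coding are illustrative assumptions, not the authors' pipeline.

```python
from typing import List, Optional, Tuple


def concordance(calls_a: List[Optional[int]],
                calls_b: List[Optional[int]]) -> Tuple[float, float]:
    """Return (concordance_rate, missing_rate) for two sets of genotype calls.

    Assumed coding: 0/1/2 = copies of the minor allele, None = missing call.
    Concordance is computed only over genotypes called by both sources.
    """
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("call lists must be non-empty and of equal length")

    # Keep only the positions where both sources produced a call.
    both_called = [(a, b) for a, b in zip(calls_a, calls_b)
                   if a is not None and b is not None]
    missing_rate = 1.0 - len(both_called) / len(calls_a)

    if not both_called:
        return float("nan"), missing_rate

    agreements = sum(1 for a, b in both_called if a == b)
    return agreements / len(both_called), missing_rate


# Hypothetical calls for six individuals at one SNP.
sequencing = [0, 1, None, 2, 0, 1]
microarray = [0, 1, 1, 2, 1, None]
rate, missing = concordance(sequencing, microarray)
```

Restricting the comparison to jointly called genotypes keeps the two quantities separate, which matters here because the abstract reports high missingness and modest discordance as distinct findings.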
3.  Application of family-based tests of association for rare variants to pathways 
BMC Proceedings  2014;8(Suppl 1):S105.
Pathway analysis approaches for sequence data typically either operate in a single stage (all variants within all genes in the pathway are combined into a single, very large set of variants that can then be analyzed using standard "gene-based" test statistics) or in 2 stages (gene-based p values are computed for all genes in the pathway, and then the gene-based p values are combined into a single pathway p value). To date, little consideration has been given to the performance of gene-based tests (typically designed for a smaller number of single-nucleotide variants [SNVs]) when the number of SNVs in the gene or in the pathway is very large and the genotypes come from sequence data organized in large pedigrees. We consider recently proposed gene-based tests for rare variants from complex pedigrees that test for association between a large set of SNVs and a qualitative phenotype of interest (1-stage analyses) as well as 2-stage approaches. We find that many of these methods show inflated type I errors when the number of SNVs in the gene or the pathway is large (>200 SNVs) and when using standard approaches to estimate the genotype covariance matrix. Alternative methods are needed when testing very large sets of SNVs in 1-stage approaches.
PMCID: PMC4143675  PMID: 25519359
4.  Genetic Analysis Workshop 18: Methods and strategies for analyzing human sequence and phenotype data in members of extended pedigrees 
BMC Proceedings  2014;8(Suppl 1):S1.
Genetic Analysis Workshop 18 provided a platform for developing and evaluating statistical methods to analyze whole-genome sequence data from a pedigree-based sample. In this article we present an overview of the data sets and the contributions that analyzed these data. The family data, donated by the Type 2 Diabetes Genetic Exploration by Next-Generation Sequencing in Ethnic Samples Consortium, included sequence-level genotypes based on sequencing and imputation, genome-wide association genotypes from prior genotyping arrays, and phenotypes from longitudinal assessments. The contributions from individual research groups were extensively discussed before, during, and after the workshop in theme-based discussion groups before being submitted for publication.
PMCID: PMC4143625  PMID: 25519310
5.  A geometric framework for evaluating rare variant tests of association 
Genetic epidemiology  2013;37(4):10.1002/gepi.21722.
The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to non-causal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
PMCID: PMC3718063  PMID: 23526307
rare variants; sequencing; burden tests
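The geometric picture in this abstract, with rare allele counts in cases and controls treated as vectors in Euclidean space, can be sketched in a few lines of Python. The sketch computes only the two vector lengths and the angle between them; it does not reproduce any specific test statistic from the paper, and the function names and example counts are illustrative assumptions.

```python
import math
from typing import Sequence


def vector_length(counts: Sequence[float]) -> float:
    """Euclidean length of a vector of per-site rare allele counts."""
    return math.sqrt(sum(c * c for c in counts))


def vector_angle(case_counts: Sequence[float],
                 control_counts: Sequence[float]) -> float:
    """Angle in radians between the case and control count vectors."""
    if len(case_counts) != len(control_counts):
        raise ValueError("vectors must cover the same variant sites")
    len_case = vector_length(case_counts)
    len_control = vector_length(control_counts)
    if len_case == 0 or len_control == 0:
        raise ValueError("angle is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(case_counts, control_counts))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (len_case * len_control)))
    return math.acos(cosine)


# Hypothetical rare allele counts at three variant sites.
cases = [3, 0, 4]
controls = [0, 0, 5]
length_difference = vector_length(cases) - vector_length(controls)
angle = vector_angle(cases, controls)
```

In this hypothetical example the two vectors have the same length but point in different directions, the kind of pattern that a test based on length alone would not register but a joint test of length and angle could.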
6.  Evaluating the impact of genotype errors on rare variant tests of association 
The new class of rare variant tests has usually been evaluated assuming perfect genotype information. In reality, rare variant genotypes may be incorrect, and so rare variant tests should be robust to imperfect data. Errors and uncertainty in SNP genotyping are already known to dramatically impact statistical power for single marker tests on common variants and, in some cases, inflate the type I error rate. Recent results show that uncertainty in genotype calls derived from sequencing reads is dependent on several factors, including read depth, calling algorithm, number of alleles present in the sample, and the frequency at which an allele segregates in the population. We have recently proposed a general framework for the evaluation and investigation of rare variant tests of association, classifying most rare variant tests into one of two broad categories (length or joint tests). We use this framework to relate factors affecting genotype uncertainty to the power and type I error rate of rare variant tests. We find that non-differential genotype errors (an error process that occurs independent of phenotype) decrease power, with larger decreases for extremely rare variants, and for the common homozygote to heterozygote error. Differential genotype errors (an error process that is associated with phenotype status) lead to inflated type I error rates, which are more likely to occur at sites with more common homozygote to heterozygote errors than vice versa. Finally, our work suggests that certain rare variant tests and study designs may be more robust to the inclusion of genotype errors. Further work is needed to directly integrate genotype calling algorithm decisions, study costs and test statistic choices to provide comprehensive design and analysis advice which appropriately accounts for the impact of genotype errors.
PMCID: PMC3978329  PMID: 24744770
SKAT; gene-based; genotype uncertainty; misclassification; dosage
7.  Optimal methods for using posterior probabilities in association testing 
Human heredity  2013;75(1):2-11.
The use of haplotypes to impute the genotypes of unmeasured single nucleotide variants continues to rise in popularity. Simulation results suggest that the use of the dosage as a one-dimensional summary statistic of imputation posterior probabilities may be optimal both in terms of statistical power and computational efficiency; however, little theoretical understanding is available to explain and unify these simulation results. In our analysis, we provide a theoretical foundation for the use of the dosage as a one-dimensional summary statistic of genotype posterior probabilities from any technology.
We analytically evaluate the dosage, mode and the more general set of all one-dimensional summary statistics of two-dimensional (three posterior probabilities that must sum to 1) genotype posterior probability vectors.
We prove that the dosage is an optimal one-dimensional summary statistic under a typical linear disease model and is robust to violations of this model. Simulation results confirm our theoretical findings.
Our analysis provides a strong theoretical basis for the use of the dosage as a one-dimensional summary statistic of genotype posterior probability vectors in related tests of genetic association across a wide variety of genetic disease models.
PMCID: PMC3706784  PMID: 23548776
Imputation; dosage; genome-wide association studies
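The two summary statistics compared in this abstract, the dosage and the mode, can be written down directly. The Python sketch below assumes the posterior probabilities are ordered as (0, 1, 2 copies of the counted allele); the function names and the example vector are illustrative.

```python
from typing import Sequence


def _check(posterior: Sequence[float]) -> None:
    """Validate a genotype posterior probability vector (p0, p1, p2)."""
    if len(posterior) != 3:
        raise ValueError("expected three posterior probabilities")
    if any(p < 0 for p in posterior) or abs(sum(posterior) - 1.0) > 1e-6:
        raise ValueError("probabilities must be non-negative and sum to 1")


def dosage(posterior: Sequence[float]) -> float:
    """Expected allele count: 0*p0 + 1*p1 + 2*p2."""
    _check(posterior)
    return posterior[1] + 2.0 * posterior[2]


def mode(posterior: Sequence[float]) -> int:
    """Most probable genotype (ties resolve to the lower allele count)."""
    _check(posterior)
    return max(range(3), key=lambda g: posterior[g])


# Hypothetical imputation output for one individual at one variant.
post = (0.1, 0.6, 0.3)
d = dosage(post)   # 0.6 + 2 * 0.3
m = mode(post)     # genotype with the largest posterior probability
```

The dosage keeps information about how uncertain the call is, while the mode discards it by committing to a single genotype, which is one intuition for the comparison the abstract describes.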
8.  Assessing Methods for Assigning SNPs to Genes in Gene-Based Tests of Association Using Common Variants 
PLoS ONE  2013;8(5):e62161.
Gene-based tests of association are frequently applied to common SNPs (MAF>5%) as an alternative to single-marker tests. In this analysis we conduct a variety of simulation studies applied to five popular gene-based tests investigating general trends related to their performance in realistic situations. In particular, we focus on the impact of non-causal SNPs and a variety of LD structures on the behavior of these tests. Ultimately, we find that non-causal SNPs can significantly impact the power of all gene-based tests. On average, we find that the “noise” from 6–12 non-causal SNPs will cancel out the “signal” of one causal SNP across five popular gene-based tests. Furthermore, we find complex and differing behavior of the methods in the presence of LD within and between non-causal and causal SNPs. Ultimately, better approaches for a priori prioritization of potentially causal SNPs (e.g., predicting functionality of non-synonymous SNPs), application of these methods to sequenced or fully imputed datasets, and limited use of window-based methods for assigning inter-genic SNPs to genes will improve power. However, significant power loss from non-causal SNPs may remain unless alternative statistical approaches robust to the inclusion of non-causal SNPs are developed.
PMCID: PMC3669368  PMID: 23741293
9.  Assessing the Impact of Differential Genotyping Errors on Rare Variant Tests of Association 
PLoS ONE  2013;8(3):e56626.
Genotyping errors are well-known to impact the power and type I error rate in single marker tests of association. Genotyping errors that happen according to the same process in cases and controls are known as non-differential genotyping errors, whereas genotyping errors that occur with different processes in the cases and controls are known as differential genotype errors. For single marker tests, non-differential genotyping errors reduce power, while differential genotyping errors increase the type I error rate. However, little is known about the behavior of the new generation of rare variant tests of association in the presence of genotyping errors. In this manuscript we use a comprehensive simulation study to explore the effects of numerous factors on the type I error rate of rare variant tests of association in the presence of differential genotyping error. We find that increased sample size, decreased minor allele frequency, and an increased number of single nucleotide variants (SNVs) included in the test all increase the type I error rate in the presence of differential genotyping errors. We also find that the greater the relative difference in case-control genotyping error rates the larger the type I error rate. Lastly, as is the case for single marker tests, genotyping errors classifying the common homozygote as the heterozygote inflate the type I error rate significantly more than errors classifying the heterozygote as the common homozygote. In general, our findings are in line with results from single marker tests. To ensure that type I error inflation does not occur when analyzing next-generation sequencing data, careful consideration of study design (e.g. use of randomization), caution in meta-analysis and in using publicly available controls, and the use of standard quality control metrics are critical.
PMCID: PMC3589406  PMID: 23472072
10.  Assessing the Impact of Non-Differential Genotyping Errors on Rare Variant Tests of Association 
Human Heredity  2011;72(3):152-159.
We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful.
We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates.
Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power.
Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.
PMCID: PMC3214826  PMID: 22004945
Sequencing data; Power; Case-control; Misclassification
11.  Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data 
BMC Bioinformatics  2012;13:193.
Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed.
We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix® gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size.
Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve statistical power and utility of expression data.
PMCID: PMC3462729  PMID: 22873695
Gene ontology; KEGG; SEED; Operons; Consistency
12.  Identification of Genetic Association of Multiple Rare Variants Using Collapsing Methods 
Genetic Epidemiology  2011;35(Suppl 1):S101-S106.
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
PMCID: PMC3289287  PMID: 22128049
1000 Genomes Project; association; collapsing methods; next-generation sequencing
13.  The Cost-Effectiveness of Reclassification Sampling for Prevalence Estimation 
PLoS ONE  2012;7(2):e32058.
Typically, a two-phase (double) sampling strategy is employed when classifications are subject to error and there is a gold standard (perfect) classifier available. Two-phase sampling involves classifying the entire sample with an imperfect classifier, and a subset of the sample with the gold-standard.
Methodology/Principal Findings
In this paper we consider an alternative strategy termed reclassification sampling, which involves classifying individuals using the imperfect classifier more than one time. Estimates of sensitivity, specificity and prevalence are provided for reclassification sampling, when either one or two binary classifications of each individual using the imperfect classifier are available. Robustness of estimates and design decisions to model assumptions are considered. Software is provided to compute estimates and provide advice on the optimal sampling strategy.
Reclassification sampling is shown to be cost-effective (lower standard error of estimates for the same cost) for estimating prevalence as compared to two-phase sampling in many practical situations.
PMCID: PMC3278465  PMID: 22348146
14.  Inference of the Transcriptional Regulatory Network in Staphylococcus aureus by Integration of Experimental and Genomics-Based Evidence 
Journal of Bacteriology  2011;193(13):3228-3240.
Transcriptional regulatory networks are fine-tuned systems that help microorganisms respond to changes in the environment and cell physiological state. We applied the comparative genomics approach implemented in the RegPredict Web server combined with SEED subsystem analysis and available information on known regulatory interactions for regulatory network reconstruction for the human pathogen Staphylococcus aureus and six related species from the family Staphylococcaceae. The resulting reference set of 46 transcription factor regulons contains more than 1,900 binding sites and 2,800 target genes involved in the central metabolism of carbohydrates, amino acids, and fatty acids; respiration; the stress response; metal homeostasis; drug and metal resistance; and virulence. The inferred regulatory network in S. aureus includes ∼320 regulatory interactions between 46 transcription factors and ∼550 candidate target genes comprising 20% of its genome. We predicted ∼170 novel interactions and 24 novel regulons for the control of the central metabolic pathways in S. aureus. The reconstructed regulons are largely variable in the Staphylococcaceae: only 20% of S. aureus regulatory interactions are conserved across all studied genomes. We used a large-scale gene expression data set for S. aureus to assess relationships between the inferred regulons and gene expression patterns. The predicted reference set of regulons is captured within the Staphylococcus collection in the RegPrecise database (
PMCID: PMC3133287  PMID: 21531804
15.  Inflated Type I Error Rates When Using Aggregation Methods to Analyze Rare Variants in the 1000 Genomes Project Exon Sequencing Data in Unrelated Individuals: Summary Results from Group 7 at Genetic Analysis Workshop 17 
Genetic epidemiology  2011;35(Suppl 1):S56-S60.
As part of Genetic Analysis Workshop 17 (GAW17), our group considered the application of novel and standard approaches to the analysis of genotype-phenotype association in next-generation sequencing data. Our group identified a major issue in the analysis of the GAW17 next-generation sequencing data: type I error and false-positive report probability rates higher than those expected based on empirical type I error levels (as high as 90%). Two main causes emerged: population stratification and long-range correlation (gametic phase disequilibrium) between rare variants. Population stratification was expected because of the diverse sample. Correlation between rare variants was attributable to both random causes (e.g., nearly 10,000 of 25,000 markers were private variants, and the sample size was small [n = 697]) and nonrandom causes (more correlation was observed than was expected by random chance). Principal components analysis was used to control for population structure and helped to minimize type I errors, but this was at the expense of identifying fewer causal variants. A novel multiple regression approach showed promise to handle correlation between markers. Further work is needed, first, to identify best practices for the control of type I errors in the analysis of sequencing data and then to explore and compare the many promising new aggregating approaches for identifying markers associated with disease phenotypes.
PMCID: PMC3249221  PMID: 22128060
population structure; correlated markers; next-generation sequencing
16.  Identifying rare variants from exome scans: the GAW17 experience 
BMC Proceedings  2011;5(Suppl 9):S1.
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
PMCID: PMC3287821  PMID: 22373325
17.  Evaluating methods for the analysis of rare variants in sequence data 
BMC Proceedings  2011;5(Suppl 9):S119.
A number of rare variant statistical methods have been proposed for analysis of the impending wave of next-generation sequencing data. To date, there are few direct comparisons of these methods on real sequence data. Furthermore, there is a strong need for practical advice on the proper analytic strategies for rare variant analysis. We compare four recently proposed rare variant methods (combined multivariate and collapsing, weighted sum, proportion regression, and cumulative minor allele test) on simulated phenotype and next-generation sequencing data as part of Genetic Analysis Workshop 17. Overall, we find that all analyzed methods have serious practical limitations on identifying causal genes. Specifically, no method has more than a 5% true discovery rate (percentage of truly causal genes among all those identified as significantly associated with the phenotype). Further exploration shows that all methods suffer from inflated false-positive error rates (chance that a noncausal gene will be identified as associated with the phenotype) because of population stratification and gametic phase disequilibrium between noncausal SNPs and causal SNPs. Furthermore, observed true-positive rates (chance that a truly causal gene will be identified as significantly associated with the phenotype) for each of the four methods were very low (<19%). The combination of larger than anticipated false-positive rates, low true-positive rates, and only about 1% of all genes being causal yields poor discriminatory ability for all four methods. Gametic phase disequilibrium and population stratification are important areas for further research in the analysis of rare variant data.
PMCID: PMC3287843  PMID: 22373354
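The way a low true-positive rate, an inflated false-positive rate, and a small fraction of causal genes combine into a low true discovery rate can be checked with a short calculation. The Python sketch below uses expected proportions only; the 1% causal fraction and the 19% true-positive bound come from the abstract, while the 5% false-positive rate is an assumed value chosen purely for illustration.

```python
def true_discovery_rate(causal_fraction: float,
                        true_positive_rate: float,
                        false_positive_rate: float) -> float:
    """Expected share of significant genes that are truly causal.

    TDR = pi * TPR / (pi * TPR + (1 - pi) * FPR), where pi is the
    fraction of genes that are causal.
    """
    for value in (causal_fraction, true_positive_rate, false_positive_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be proportions in [0, 1]")
    true_hits = causal_fraction * true_positive_rate
    false_hits = (1.0 - causal_fraction) * false_positive_rate
    if true_hits + false_hits == 0.0:
        raise ValueError("no genes are expected to be declared significant")
    return true_hits / (true_hits + false_hits)


# 1% causal genes and TPR = 0.19 (from the abstract);
# FPR = 0.05 is an assumption for illustration.
tdr = true_discovery_rate(0.01, 0.19, 0.05)  # about 0.037
```

Under these inputs, fewer than 4 in 100 significant genes would be truly causal, which is consistent with the abstract's report that no method exceeded a 5% true discovery rate.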
18.  Evaluating methods for combining rare variant data in pathway-based tests of genetic association 
BMC Proceedings  2011;5(Suppl 9):S48.
Analyzing sets of genes in genome-wide association studies is a relatively new approach that aims to capitalize on biological knowledge about the interactions of genes in biological pathways. This approach, called pathway analysis or gene set analysis, has not yet been applied to the analysis of rare variants. Applying pathway analysis to rare variants offers two competing approaches. In the first approach rare variant statistics are used to generate p-values for each gene (e.g., combined multivariate collapsing [CMC] or weighted-sum [WS]) and the gene-level p-values are combined using standard pathway analysis methods (e.g., gene set enrichment analysis or Fisher’s combined probability method). In the second approach, rare variant methods (e.g., CMC and WS) are applied directly to sets of single-nucleotide polymorphisms (SNPs) representing all SNPs within genes in a pathway. In this paper we use simulated phenotype and real next-generation sequencing data from Genetic Analysis Workshop 17 to analyze sets of rare variants using these two competing approaches. The initial results suggest substantial differences in the methods, with Fisher’s combined probability method and the direct application of the WS method yielding the best power. Evidence suggests that the WS method works well in most situations, although Fisher’s method was more likely to be optimal when the number of causal SNPs in the set was low but the risk of the causal SNPs was high.
PMCID: PMC3287885  PMID: 22373429
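Fisher's combined probability method, used in the first approach above to merge gene-level p-values into one pathway p-value, can be sketched in plain Python. The sketch assumes the gene-level p-values are independent, which may not hold for genes in the same pathway; the function name is illustrative. It uses the closed-form chi-square tail probability that applies when the degrees of freedom are even.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of
    freedom the upper tail has the closed form
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # Accumulate the series term by term to avoid large factorials.
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return min(1.0, math.exp(-half_stat) * series)


# Hypothetical gene-level p-values for a three-gene pathway.
pathway_p = fisher_combined_p([0.04, 0.20, 0.01])
```

With a single p-value the formula returns that p-value unchanged, which is a convenient sanity check on the implementation.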
19.  Inclusion of a Priori Information in Genome-Wide Association Analysis 
Genetic epidemiology  2009;33(Suppl 1):S74-S80.
Genome-wide association studies (GWAS) continue to gain in popularity. To utilize the wealth of data created more effectively, a variety of methods have recently been proposed to include a priori information (e.g., biologically interpretable sets of genes, candidate gene information, or gene expression) in GWAS analysis. Six contributions to Genetic Analysis Workshop 16 Group 11 applied novel or recently proposed methods to GWAS of rheumatoid arthritis and heart disease related phenotypes. The results of these analyses were a variety of novel candidate genes and sets of genes, in addition to the validation of well known genotype-phenotype associations. However, because many methods are relatively new, they would benefit from further methodological research to ensure that they maintain type I error rates while increasing power to find additional associations. When methods have been adapted from other study types (e.g., gene expression data analysis or linkage analysis) the lessons learned there should be used to guide implementation of techniques. Lastly, many open research questions exist concerning the logistic details of the origin of the a priori information and the way to incorporate it. Overall, our group has demonstrated a strong potential for identifying novel genotype-phenotype relationships by including a priori data in the analysis of GWAS, while also uncovering a series of questions requiring further research.
PMCID: PMC2922922  PMID: 19924705
gene set analysis; external information; gene expression; hierarchical Bayesian model; candidate regions; candidate genes; pathway
20.  Incorporating Duplicate Genotype Data into Linear Trend Tests of Genetic Association: Methods and Cost-Effectiveness 
The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though it is typical for these data to be ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the relative cost of genotyping to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate it is more powerful to duplicate genotype the entire sample instead of spending the same money to increase the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation based tests of association using duplicate genotype data as well as to aid in the duplicate genotyping design decision.
PMCID: PMC2861316  PMID: 19492982
21.  Smoking initiation and nicotine dependence symptoms in Ukraine: Findings from the Ukraine World Mental Health survey 
Public health  2007;121(9):663-672.
Cigarette smoking is a major cause of morbidity and mortality in former Soviet countries. This study examined the personal, familial and psychiatric risk factors for smoking initiation and development of nicotine dependence symptoms in Ukraine.
Study Design
Cross-sectional survey.
Smoking history and dependence symptoms were ascertained from N=1,711 adults in Ukraine as part of a national mental health survey conducted in 2002. Separate analyses were conducted for men and women.
The prevalence of lifetime regular smoking was 80.5% in men and 18.7% in women, with median ages at initiation among smokers of 17 and 18, respectively. Furthermore, 61.2% of men and 11.9% of women were current smokers; among the subgroup of lifetime smokers, 75.9% of men and 63.1% of women currently smoked. The youngest female cohort (born 1965–1984) was 26 times more likely to start smoking than the oldest. Smoking initiation was also linked to childhood externalizing behaviors and antecedent use of alcohol in both genders, as well as marital status and personal alcohol abuse in men, and childhood urbanicity and birth cohort in women. Dependence symptoms developed in 61.7% of male and 47.1% of female smokers. The rate increased sharply in the first four years after smoking initiation. Dependence symptoms were related to birth cohort and alcohol abuse in both genders, as well as growing up in a suburb or town and childhood externalizing behaviors in men, and parental antisocial behavior in women.
Increased smoking in young women heralds a rising epidemic in Ukraine and underscores the need for primary prevention programs, especially in urban areas. Our findings support the importance of childhood and alcohol-related risk factors, especially in women, while pre-existing depression and anxiety disorders were only weakly associated with starting to smoke or developing dependence symptoms.
PMCID: PMC2793595  PMID: 17544466
Ukraine; Soviet; smoking; nicotine dependence; risk factors; alcohol
22.  Comparing gene set analysis methods on single-nucleotide polymorphism data from Genetic Analysis Workshop 16 
BMC Proceedings  2009;3(Suppl 7):S96.
Recently, gene set analysis (GSA) has been extended from use on gene expression data to use on single-nucleotide polymorphism (SNP) data in genome-wide association studies. When GSA has been demonstrated on SNP data, two popular statistics from gene expression data analysis (gene set enrichment analysis [GSEA] and Fisher's exact test [FET]) have been used. However, GSEA and FET have shown a lack of power and robustness in the analysis of gene expression data. The purpose of this work is to investigate whether the same issues are also true for the analysis of SNP data. Ultimately, we conclude that GSEA and FET are not optimal for the analysis of SNP data when compared with the SUMSTAT method. In analysis of real SNP data from the Framingham Heart Study, we find that SUMSTAT finds many more gene sets to be significant when compared with other methods. In an analysis of simulated data, SUMSTAT demonstrates high power and better control of the type I error rate. GSA is a promising approach to the analysis of SNP data in GWAS and use of the SUMSTAT statistic instead of GSEA or FET may increase power and robustness.
PMCID: PMC2796000  PMID: 20018093
23.  The cost effectiveness of duplicate genotyping for testing genetic association 
Annals of human genetics  2009;73(Pt 3):370-378.
We consider a modification to the traditional genome-wide association (GWA) study design: duplicate genotyping. Duplicate genotyping (re-genotyping some of the samples) has long been suggested for quality control reasons, but it has not been evaluated for its statistical cost-effectiveness. We demonstrate that when genotyping error rates are at least m%, duplicate genotyping provides a cost-effective (more statistical power for the same price) design alternative when relative genotype to phenotype/sample acquisition costs are no more than m%. Beyond cost and error rate, minor allele frequency also matters: duplicate genotyping is most cost-effective for SNPs with low minor allele frequency. In general, relative genotype to phenotype/sample acquisition costs will be low when following up a limited number of SNPs in the second stage of a two-stage GWA study design, and, thus, duplicate genotyping may be useful in these situations. In cases where many SNPs are being followed up at the second stage, duplicate genotyping only low-quality SNPs with low minor allele frequency may be cost-effective. We also find that in almost all cases where duplicate genotyping is cost-effective, the most cost-effective design strategy involves duplicate genotyping all samples. Free software is provided which evaluates the cost-effectiveness of duplicate genotyping based on user inputs.
PMCID: PMC2739690  PMID: 19344449
Genotype Error; Genome-wide Association Study; Re-genotype; Two-Stage; Power; Duplicate genotype
24.  Gene set analyses for interpreting microarray experiments on prokaryotic organisms 
BMC Bioinformatics  2008;9:469.
Despite the widespread usage of DNA microarrays, questions remain about how best to interpret the wealth of gene-by-gene transcriptional levels that they measure. Recently, methods have been proposed which use biologically defined sets of genes in interpretation, instead of examining results gene-by-gene. Despite a serious limitation, a method based on Fisher's exact test remains one of the few plausible options for gene set analysis when an experiment has few replicates, as is typically the case for prokaryotes.
We extend five methods of gene set analysis from use on experiments with multiple replicates to use on experiments with few replicates. We then use simulated and real data to compare these methods with each other and with the Fisher's exact test (FET) method. In the simulation, we find that a method named MAXMEAN-NR maintains the nominal rate of false positive findings (type I error rate) while offering good statistical power and robustness to a variety of gene set distributions for set sizes of at least 10. Other methods (ABSSUM-NR or SUM-NR) are shown to be powerful for set sizes less than 10. Analysis of three sets of experimental data shows similar results. Furthermore, the MAXMEAN-NR method is shown to be able to detect biologically relevant sets as significant when other methods (including FET) cannot. We also find that the popular GSEA-NR method performs poorly when compared to MAXMEAN-NR.
MAXMEAN-NR is a method of gene set analysis for experiments with few replicates, as is common for prokaryotes. Results of simulation and real data analysis suggest that the MAXMEAN-NR method offers increased robustness and biological relevance of findings as compared to FET and other methods, while maintaining the nominal type I error rate.
PMCID: PMC2587482  PMID: 18986519
25.  Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and Center for Inherited Disease Research 
BMC Genetics  2005;6(Suppl 1):S154.
Genetic Analysis Workshop 14 provided re-genotyped single-nucleotide polymorphism (SNP) data. Specifically, both Center for Inherited Disease Research (CIDR) and Affymetrix genotyped the same 11,560 SNPs from the Affymetrix GeneChip Mapping 10K Array marker set on the same 184 individuals from the Collaborative Study on the Genetics of Alcoholism database. While the inconsistency rate between CIDR and Affymetrix (two different genotypes for the same subject) was low (0.2%), the non-replication rate (two different genotypes for the same subject or one identified genotype and one missing genotype) was substantial (9.5%). The missing data could be from no-call regions, which is inconsistent with recent recommendations about the use of no-call regions in association tests. In addition, no-call regions would suggest that the actual inconsistency rate is higher than reported. A high inconsistency rate has a significant impact on power in related hypothesis tests. In addition, the data are consistent with assumptions made in a recently proposed likelihood ratio test of association for re-genotyped data.
PMCID: PMC1866780  PMID: 16451615
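The two rates defined in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `replication_rates`, the use of `None` for a missing call, the choice of all genotype pairs as the denominator for both rates, and the decision to count a pair missing on both platforms toward neither rate are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Optional, Sequence, Tuple


def replication_rates(
    calls_a: Sequence[Optional[str]],
    calls_b: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (inconsistency_rate, non_replication_rate) for paired calls.

    Each position holds the genotype called by one platform for the same
    subject and SNP; None marks a missing call (an assumed encoding).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both platforms must report the same number of calls")
    total = len(calls_a)
    if total == 0:
        raise ValueError("no genotype pairs supplied")

    inconsistent = 0  # both platforms called a genotype, and the calls differ
    one_missing = 0   # exactly one platform called a genotype
    for a, b in zip(calls_a, calls_b):
        if a is not None and b is not None:
            if a != b:
                inconsistent += 1
        elif (a is None) != (b is None):
            one_missing += 1
        # Pairs missing on both platforms count toward neither rate (assumption).

    # Assumption: every pair, called or not, is in the denominator.
    return inconsistent / total, (inconsistent + one_missing) / total


if __name__ == "__main__":
    platform_a = ["AA", "AB", "BB", None, "AB"]
    platform_b = ["AA", "BB", "BB", "AB", None]
    inconsistency, non_replication = replication_rates(platform_a, platform_b)
    print(f"inconsistency rate:   {inconsistency:.2f}")
    print(f"non-replication rate: {non_replication:.2f}")
```

Because the non-replication count adds the one-missing pairs to the inconsistent pairs, it can never fall below the inconsistency rate. That is how a low inconsistency rate and a much higher non-replication rate can coexist when missing calls are common.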
