Alu elements are trans-mobilized by the autonomous non-LTR retroelement, LINE-1 (L1). Alu-induced insertion mutagenesis contributes to about 0.1% of human genetic disease and is responsible for the majority of the documented instances of human retroelement insertion-induced disease. Here we introduce a SINE recovery method that provides a complementary approach for comprehensive analysis of the impact and biological mechanisms of Alu retrotransposition. Using this approach, we recovered 226 de novo tagged Alu inserts in HeLa cells. Our analysis reveals that, in human cells, marked Alu inserts driven by either exogenously supplied full-length L1 or ORF2 protein are indistinguishable. Four percent of de novo Alu inserts were associated with genomic deletions and rearrangements and lacked the hallmarks of retrotransposition. In contrast to L1 inserts, 5′ truncations of Alu inserts are rare, as most of the recovered inserts (96.5%) are full length. De novo Alus show a random pattern of insertion across chromosomes, but further characterization revealed an insertion bias favoring sites near other SINEs and highly conserved elements, with almost 60% landing within genes. De novo Alu inserts show no evidence of RNA editing. Priming for reverse transcription rarely occurred within the first 20 bp (most 5′) of the A-tail. The A-tails of recovered inserts show significant expansion, with many at least doubling in length. Sequence manipulation of the construct demonstrated that the A-tail expansion likely occurs during insertion due to slippage by the L1 ORF2 protein. We postulate that the A-tail expansion directly impacts Alu evolution by reintroducing new active source elements to counteract the natural loss of active Alus and minimize Alu extinction.
SINEs are mobile elements that are found ubiquitously throughout a large diversity of genomes from plants to mammals. The human SINE, Alu, is among the most successful mobile elements, with more than one million copies in the genome. Due to its high activity and ability to insert throughout the genome, Alu retrotransposition is responsible for the majority of diseases reported to be caused by mobile element activity. To further evaluate the genomic impact of SINEs, we recovered and characterized over 200 de novo Alu inserts under controlled conditions. Our data reinforce observations on the mutagenic potential of Alu, with newly retrotransposed Alu elements favoring insertion into genic and highly conserved elements. Alu-mediated deletions and rearrangements are infrequent and lack the typical hallmarks of TPRT retrotransposition, suggesting the use of an alternate method for resolving retrotransposition intermediates or an atypical insertion mechanism. Our data also provide novel insights into SINE retrotransposition biology. We found that slippage of L1 ORF2 protein during reverse transcription expands the A-tails of de novo insertions. We propose that the L1 ORF2 protein plays a major role in minimizing Alu extinction by reintroducing active Alu elements to counter the natural loss of Alu source elements.
Autism spectrum disorders (ASD) represent a group of neurodevelopmental disorders characterized by a core set of social-communicative and behavioral impairments. Gamma-aminobutyric acid (GABA) is the major inhibitory neurotransmitter in the brain, acting primarily via the GABA receptors (GABR). Multiple lines of evidence, including altered GABA and GABA receptor expression in autistic patients, indicate that the GABAergic system may be involved in the etiology of autism.
As copy number variations (CNVs), particularly rare and de novo CNVs, have now been implicated in ASD risk, we examined the GABA receptors and genes in related pathways for structural variation that may be associated with autism. We further extended our candidate gene set to include 19 genes and regions that had either been directly implicated in the autism literature or were directly related (via function or ancestry) to these primary candidates. For the high resolution CNV screen we employed custom-designed 244 k comparative genomic hybridization (CGH) arrays. Collectively, our probes spanned a total of 11 Mb of GABA-related and additional candidate regions with a density of approximately one probe every 200 nucleotides, allowing a theoretical resolution for detection of CNVs of approximately 1 kb or greater on average. One hundred and sixty-eight autism cases and 149 control individuals were screened for structural variants. Prioritized CNV events were confirmed using quantitative PCR, and confirmed loci were evaluated on an additional set of 170 cases and 170 control individuals that were not included in the original discovery set. Loci that remained interesting were subsequently screened via quantitative PCR on an additional set of 755 cases and 1,809 unaffected family members.
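The array design figures quoted above can be checked with simple arithmetic. The following is a minimal sketch that assumes perfectly uniform probe spacing; the function names are illustrative and not taken from the study, and real probe layouts are unlikely to be this regular.

```python
def probes_needed(target_bp: int, spacing_bp: int) -> int:
    """Probes required to tile target_bp at one probe per spacing_bp (uniform spacing assumed)."""
    return target_bp // spacing_bp


def probes_in_event(event_bp: int, spacing_bp: int) -> int:
    """Probes expected to fall inside a CNV of event_bp under the same assumption."""
    return event_bp // spacing_bp


TARGET_BP = 11_000_000   # ~11 Mb of candidate regions (from the text)
SPACING_BP = 200         # ~one probe every 200 nucleotides (from the text)
MIN_CNV_BP = 1_000       # ~1 kb theoretical resolution (from the text)

n_probes = probes_needed(TARGET_BP, SPACING_BP)
n_per_cnv = probes_in_event(MIN_CNV_BP, SPACING_BP)
print(f"probes to tile the target: {n_probes}")
print(f"probes inside a minimal CNV: {n_per_cnv}")
```

Under these assumptions a 1 kb event is supported by about five consecutive probes, which is consistent with the stated theoretical resolution.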
Results include rare deletions in autistic individuals at JAKMIP1, NRXN1, Neuroligin4Y, OXTR, and ABAT. Common insertion/deletion polymorphisms were detected at several loci, including GABBR2 and NRXN3. Overall, statistically significant enrichment in affected vs. unaffected individuals was observed for NRXN1 deletions.
These results provide additional support for the role of rare structural variation in ASD.
AUTISM; CGH; CNV; GABA; NRXN1
Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder with a strong genetic component. The skewed prevalence toward males and evidence suggestive of linkage to the X chromosome in some studies suggest the presence of X-linked susceptibility genes in people with ASD.
We analyzed genome-wide association study (GWAS) data on the X chromosome in three independent autism GWAS data sets: two family data sets and one case-control data set. We performed meta- and joint analyses on the combined family and case-control data sets. In addition to the meta- and joint analyses, we performed replication analysis by using the two family data sets as a discovery data set and the case-control data set as a validation data set.
One SNP, rs17321050, in the transducin β-like 1X-linked (TBL1X) gene [OMIM:300196] showed chromosome-wide significance in the meta-analysis (P value = 4.86 × 10⁻⁶) and joint analysis (P value = 4.53 × 10⁻⁶) in males. The SNP was also close to the replication threshold of 0.0025 in the discovery data set (P = 5.89 × 10⁻³) and passed the replication threshold in the validation data set (P = 2.56 × 10⁻⁴). Two other SNPs in the same gene in linkage disequilibrium with rs17321050 also showed significance close to the chromosome-wide threshold in the meta-analysis.
TBL1X is in the Wnt signaling pathway, which has previously been implicated as having a role in autism. Deletions in the Xp22.2 to Xp22.3 region containing TBL1X and surrounding genes are associated with several genetic syndromes that include intellectual disability and autistic features. Our results, based on meta-analysis, joint analysis and replication analysis, suggest that TBL1X may play a role in ASD risk.
Despite the ever-increasing throughput and steadily decreasing cost of next generation sequencing (NGS), whole genome sequencing of humans is still not a viable option for the majority of genetics laboratories. This is particularly true in the case of complex disease studies, where large sample sets are often required to achieve adequate statistical power. To fully leverage the potential of NGS technology on large sample sets, several methods have been developed to selectively enrich for regions of interest. Enrichment reduces both monetary and computational costs compared to whole genome sequencing, while allowing researchers to take advantage of NGS throughput. Several targeted enrichment approaches are currently available, including molecular inversion probe ligation sequencing (MIPS), oligonucleotide hybridization based approaches, and PCR-based strategies. To assess how these methods performed when used in conjunction with the ABI SOLiD 3+, we investigated three enrichment techniques: Nimblegen oligonucleotide hybridization array-based capture; Agilent SureSelect oligonucleotide hybridization solution-based capture; and Raindance Technologies' multiplexed PCR-based approach. Target regions were selected from exons and evolutionarily conserved areas throughout the human genome. Probe and primer pair design was carried out for all three methods using their respective informatics pipelines. In all, approximately 0.8 Mb of target space was identical for all three methods. SOLiD sequencing results were analyzed for several metrics, including consistency of coverage depth across samples, on-target versus off-target efficiency, allelic bias, and genotype concordance with array-based genotyping data. Agilent SureSelect exhibited superior on-target efficiency and correlation of read depths across samples. Nimblegen performance was similar at read depths of 20× and below. Both Raindance and Nimblegen SeqCap exhibited tighter distributions of read depth around the mean, but both suffered from lower on-target efficiency in our experiments. Raindance demonstrated the highest versatility in assay design.
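One of the metrics named above, on-target efficiency, can be illustrated with a short sketch. It assumes that a read counts as on-target when it overlaps any target interval by at least one base; that criterion, the half-open coordinate convention, and the toy intervals are all assumptions for illustration, not the definitions used in the study.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed convention.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def on_target_fraction(reads: List[Interval], targets: List[Interval]) -> float:
    """Fraction of reads overlapping any target interval (0.0 if there are no reads)."""
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return hits / len(reads)


# Toy data, purely illustrative.
targets = [("chr1", 100, 200), ("chr2", 500, 600)]
reads = [
    ("chr1", 150, 200),  # inside a target
    ("chr1", 190, 240),  # partially overlaps a target
    ("chr1", 300, 350),  # off target
    ("chr2", 590, 640),  # partially overlaps a target
]
frac = on_target_fraction(reads, targets)
print(f"on-target fraction: {frac:.2f}")
```

The pairwise scan is quadratic and only suitable for toy inputs; a real pipeline would use sorted intervals or an interval tree.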
The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about population relationships from the principal component (PC) scatter plot. Here, to better understand the working mechanism of the Eigenstrat method, we consider its theoretical or “population” formulation. The eigen-equation for samples from an arbitrary number of populations is reduced to that of a matrix of smaller dimension, the elements of which are determined by the variance-covariance matrix for the random vector of the allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that there exists an asymptotically stable pattern of the PC plot for large sample size. Our results provide theoretical guidance for interpreting the pattern of the PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts and works as well as or better than the Eigenstrat method.
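The idea that the leading PC captures the axis separating populations, and that removing per-population means of allele counts eliminates that axis, can be sketched on simulated genotypes. This is an illustrative toy, not the authors' derivation: the two-population setup, the allele frequencies, the sample sizes, and the simplified Eigenstrat-style scaling by sqrt(p(1-p)) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 500

# Hypothetical allele frequencies for two populations (illustrative values).
p1 = rng.uniform(0.1, 0.9, n_snps)
p2 = np.clip(p1 + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)

# Genotypes coded as allele counts 0/1/2; rows are individuals.
g1 = rng.binomial(2, p1, size=(n_per_pop, n_snps))
g2 = rng.binomial(2, p2, size=(n_per_pop, n_snps))
G = np.vstack([g1, g2]).astype(float)
labels = np.repeat([0, 1], n_per_pop)

# Drop SNPs that are monomorphic in the pooled sample to avoid dividing by zero.
freq = G.mean(axis=0) / 2.0
keep = (freq > 0) & (freq < 1)
G, freq = G[:, keep], freq[keep]
scale = np.sqrt(freq * (1.0 - freq))


def top_pc(X: np.ndarray) -> np.ndarray:
    """Leading left singular vector, i.e. sample scores on PC1 up to scale."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0]


# Simplified Eigenstrat-style normalisation: centre each SNP, then scale.
X = (G - G.mean(axis=0)) / scale
corr_before = abs(np.corrcoef(top_pc(X), labels)[0, 1])

# Remove the population mean of allele counts for each SNP, then repeat the PCA.
R = G.copy()
for k in (0, 1):
    R[labels == k] -= R[labels == k].mean(axis=0)
corr_after = abs(np.corrcoef(top_pc(R / scale), labels)[0, 1])

print(f"|corr(PC1, population)| before: {corr_before:.3f}, after: {corr_after:.2e}")
```

After the population means are removed, every SNP column is orthogonal to each population's indicator vector, so no principal component with a non-zero singular value can align with population membership.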
Although autism is one of the most heritable neuropsychiatric disorders, its underlying genetic architecture has largely eluded description. To comprehensively examine the hypothesis that common variation is important in autism, we performed a genome-wide association study (GWAS) using a discovery dataset of 438 autistic Caucasian families and the Illumina Human 1M beadchip. Ninety-six single nucleotide polymorphisms (SNPs) demonstrated strong association with autism risk (p-value < 0.0001). The validation of the top 96 SNPs was performed using an independent dataset of 487 Caucasian autism families genotyped on the 550K Illumina BeadChip. A novel region on chromosome 5p14.1 showed significance in both the discovery and validation datasets. Joint analysis of all SNPs in this region identified 8 SNPs with better p-values (3.24E-04 to 3.40E-06) than in either dataset alone. Our findings demonstrate that in addition to multiple rare variations, part of the complex genetic architecture of autism involves common variation.
We recently used computational phylogenetic methods on lexical data to test between two scenarios for the peopling of the Pacific. Our analyses of lexical data supported a pulse-pause scenario of Pacific settlement in which the Austronesian speakers originated in Taiwan around 5,200 years ago and rapidly spread through the Pacific in a series of expansion pulses and settlement pauses. We claimed that there was high congruence between traditional language subgroups and those observed in the language phylogenies, and that the estimated age of the Austronesian expansion at 5,200 years ago was consistent with the archaeological evidence. However, the congruence between the language phylogenies and the evidence from historical linguistics was not quantitatively assessed using tree comparison metrics. The robustness of the divergence time estimates to different calibration points was also not investigated exhaustively. Here we address these limitations by using a systematic tree comparison metric to calculate the similarity between the Bayesian phylogenetic trees and the subgroups proposed by historical linguistics, and by re-estimating the age of the Austronesian expansion using only the most robust calibrations. The results show that the Austronesian language phylogenies are highly congruent with the traditional subgroupings, and the date estimates are robust even when calculated using a restricted set of historical calibrations.
Recombination rates vary widely across the human genome, but little of that variation is correlated with known DNA sequence features. The genome contains more than one million Alu mobile element insertions, and these insertions have been implicated in non-homologous recombination, modulation of DNA methylation, and transcriptional regulation. If individual Alu insertions have even modest effects on local recombination rates, they could collectively have a significant impact on the pattern of linkage disequilibrium in the human genome and on the evolution of the Alu family itself.
We carried out sequencing, SNP identification, and SNP genotyping around 19 AluY insertion loci in 347 individuals sampled from diverse populations, then used the SNP genotypes to estimate local recombination rates around the AluY loci. The loci and SNPs were chosen so as to minimize other factors (such as SNP ascertainment bias and SNP density) that could influence recombination rate estimates. We detected a significant increase in recombination rate within ~2 kb of the AluY insertions in our African population sample. To test this observation against a larger set of AluY insertions, we applied our locus- and SNP-selection design and analyses to the HapMap Phase II data. In that data set, we observed a significantly increased recombination rate near AluY insertions in both the CEU and YRI populations.
We show that the presence of a fixed AluY insertion is significantly predictive of an elevated local recombination rate within 2 kb of the insertion, independent of other known predictors. The magnitude of this effect, approximately a 6% increase, is comparable to the effects of some recombinogenic DNA sequence motifs identified via their association with recombination hot spots.
The identification of repeat structure in eukaryotic genomes can be time-consuming and difficult because of the large amount of information (~3×10⁹ bp) that needs to be processed and compared. We introduce a new approach based on exact word counts to evaluate, de novo, the repeat structure present within large eukaryotic genomes. This approach avoids sequence alignment and similarity search, two of the most time-consuming components of traditional methods for repeat identification. Algorithms were implemented to efficiently calculate exact counts for any length oligonucleotide in large genomes. Based on these oligonucleotide counts, oligonucleotide excess probability clouds, or “P-clouds”, were constructed. P-clouds are composed of clusters of related oligonucleotides that occur, as a group, more often than expected by chance. After construction, P-clouds were mapped back onto the genome, and regions of high P-cloud density were identified as repetitive regions based on a sliding window approach. This efficient method is capable of analyzing the repeat content of the entire human genome on a single desktop computer in less than half a day, at least 10-fold faster than current approaches. The predicted repetitive regions strongly overlap with known repeat elements, as well as other repetitive regions such as gene families, pseudogenes and segmental duplicons. This method should be extremely useful as a tool for use in de novo identification of repeat structure in large newly sequenced genomes.
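The two alignment-free ingredients described above, exact oligonucleotide counting and a sliding-window density scan, can be sketched as follows. This is a deliberately simplified stand-in: it flags individual high-count words rather than building P-clouds (clusters of related oligonucleotides), and the word length, count threshold, window size, and density cutoff are illustrative assumptions.

```python
from collections import Counter
from typing import List


def kmer_counts(seq: str, k: int) -> Counter:
    """Exact counts of every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_windows(seq: str, k: int, min_count: int,
                       window: int, min_density: float) -> List[int]:
    """Start positions of windows whose fraction of high-count words meets min_density."""
    counts = kmer_counts(seq, k)
    # flags[i] is 1 when the word starting at position i occurs at least min_count times.
    flags = [1 if counts[seq[i:i + k]] >= min_count else 0
             for i in range(len(seq) - k + 1)]
    hits = []
    for start in range(len(flags) - window + 1):
        density = sum(flags[start:start + window]) / window
        if density >= min_density:
            hits.append(start)
    return hits


# Toy sequence: four copies of "ACG" followed by a short unique tail.
seq = "ACG" * 4 + "TTGCAT"
hits = repetitive_windows(seq, k=3, min_count=3, window=4, min_density=1.0)
print(hits)
```

On the toy sequence the flagged windows cover only the tandem "ACG" copies. Storing every word in a dictionary is fine at this scale, but a genome-sized input would need a more compact counting structure.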
alignment; complete genome annotation; oligonucleotide counts; P-clouds; repeat structure
The human Long Interspersed Element-1 (LINE-1) and the Short Interspersed Element (SINE) Alu comprise 28% of the human genome. They share the same L1-encoded endonuclease for insertion, which recognizes an A+T-rich sequence. Under a simple model of insertion distribution, this nucleotide preference would lead to the prediction that the populations of both elements would be biased towards A+T-rich regions. Genomic L1 elements do show an A+T-rich bias. In contrast, Alu is biased towards G+C-rich regions when compared to the genome average. Several analyses have demonstrated that relatively recent insertions of both elements show less G+C content bias relative to older elements. We have analyzed the repetitive element and G+C composition of more than 100 pre-insertion loci derived from de novo L1 insertions in cultured human cancer cells, which should represent an evolutionarily unbiased set of insertions. An A+T-rich bias is observed in the 50 bp flanking the endonuclease target site, consistent with the known target site for the L1 endonuclease. The L1, Alu, and G+C content of 20 kb of the de novo pre-insertion loci show a different set of biases than those observed for fixed L1s in the human genome. In contrast to the insertion sites of genomic L1s, the de novo L1 pre-insertion loci are relatively L1-poor, Alu-rich and G+C-neutral. Finally, a statistically significant cluster of de novo L1 insertions was localized in the vicinity of the c-myc gene. These results suggest that the initial insertion preference of L1, while A+T-rich in the initial vicinity of the break site, can be influenced by the broader content of the flanking genomic region and have implications for understanding the dynamics of L1 and Alu distributions in the human genome.
LINE; Retrotransposition; Alu; SINE
The long interspersed element-1 (LINE-1 or L1) is a highly successful retrotransposon in mammals. L1 elements have continued to actively propagate subsequent to the human–chimpanzee divergence, ~6 million years ago, resulting in species-specific inserts. Here, we report a detailed characterization of chimpanzee-specific L1 subfamily diversity and a comparison with their human-specific counterparts. Our results indicate that L1 elements have experienced different evolutionary fates in humans and chimpanzees within the past ~6 million years. Although the species-specific L1 copy numbers are on the same order in both species (1200–2000 copies), the number of retrotransposition-competent elements appears to be much higher in the human genome than in the chimpanzee genome. Also, while human L1 subfamilies belong to the same lineage, we identified two lineages of recently integrated L1 subfamilies in the chimpanzee genome. The two lineages seem to have coexisted for several million years, but only one shows evidence of expansion within the past three million years. These differential evolutionary paths may be the result of random variation, or the product of competition between L1 subfamily lineages. Our results suggest that the coexistence of several L1 subfamily lineages within a species may be resolved in a very short evolutionary period of time, perhaps in just a few million years. Therefore, the chimpanzee genome constitutes an excellent model in which to analyze the evolutionary dynamics of L1 retrotransposons.
L1 elements; Retrotransposons; Human; Chimpanzee; Species-specific; Polymorphism
Long interspersed element-1 (L1) elements compose on average one-fifth of mammalian genomes. The expression and retrotransposition of L1 are restricted by a number of cellular mechanisms in order to limit their damage in both germ-line and somatic cells. L1 transcription is largely suppressed in most tissues, but L1 mRNA and/or proteins are still detectable in testes, a number of specific somatic cell types, and malignancies. Down-regulation of L1 expression via premature polyadenylation has been found to be a secondary mechanism of limiting L1 expression. We demonstrate that mammalian L1 elements contain numerous functional splice donor and acceptor sites. Efficient usage of some of these sites results in extensive and complex splicing of L1. Several splice variants of both the human and mouse L1 elements undergo retrotransposition. Some of the spliced L1 mRNAs can potentially contribute to expression of open reading frame 2-related products and therefore have implications for the mobility of SINEs even if they are incompetent for L1 retrotransposition. Analysis of the human EST database revealed that L1 elements also participate in splicing events with other genes. Such contribution of functional splice sites by L1 may result in disruption of normal gene expression or formation of alternative mRNA transcripts.
Retrotransposons have had a considerable impact on the overall architecture of the human genome. Currently, there are three lineages of retrotransposons (Alu, L1, and SVA) that are believed to be actively replicating in humans. While estimates of their copy number, sequence diversity, and levels of insertion polymorphism can readily be obtained from existing genomic sequence data and population sampling, a detailed understanding of the temporal pattern of retrotransposon amplification remains elusive. Here we pose the question of whether, using genomic sequence and population frequency data from extant taxa, one can adequately reconstruct historical amplification patterns. To this end, we developed a computer simulation that incorporates several known aspects of primate Alu retrotransposon biology and accommodates sampling effects resulting from the methods by which mobile elements are typically discovered and characterized. By modeling a number of amplification scenarios and comparing simulation-generated expectations to empirical data gathered from existing Alu subfamilies, we were able to statistically reject a number of amplification scenarios for individual subfamilies, including that of a rapid expansion or explosion of Alu amplification at the time of human–chimpanzee divergence.
Nearly 50% of the human genome is composed of mobile elements. While much of this sequence consists of inactive “fossil” elements that are no longer actively moving or generating new copies, three families are currently proliferating in human genomes. Among these, the Alu lineage has reached a copy number of over 1 million and alone accounts for approximately 10% of the genome. While considerable evidence has been gathered concerning the underlying biological mechanisms of Alu mobilization and proliferation, a detailed understanding of Alu amplification history is currently lacking. Researchers are aware, for example, that several thousand Alu elements have inserted within the human genome since the divergence of humans and chimpanzees, but how those insertions were distributed over this ~6-million-year time period is currently unknown. In this work, the authors introduce a simulation framework that seeks to incorporate both sequence diversity and empirically gathered population data from human Alu elements, in order to provide a better understanding of the last several million years of human Alu evolution. The results suggest that a rapid explosion of Alu amplification at the time of the human–chimpanzee divergence is unlikely. Therefore, it is improbable that an increase in Alu retrotransposition activity was involved in the speciation of humans and chimpanzees.
Alu elements are short (~300 bp) interspersed elements that amplify in primate genomes through a process termed retroposition. The expansion of these elements has had a significant impact on the structure and function of primate genomes. Approximately 10% of the mass of the human genome is composed of Alu elements, making them the most abundant short interspersed element (SINE) in our genome. The majority of Alu amplification occurred early in primate evolution, and the current rate of Alu retroposition is at least 100-fold slower than the peak of amplification that occurred 30–50 million years ago. Alu elements are therefore a rich source of inter- and intra-species primate genomic variation.
A total of 153 Alu elements from the Ye subfamily were extracted from the draft sequence of the human genome. Analysis of these elements resulted in the discovery of two new Alu subfamilies, Ye4 and Ye6, complementing the previously described Ye5 subfamily. DNA sequence analysis of each of the Alu Ye subfamilies yielded average age estimates of ~14, ~13 and ~9.5 million years old for the Alu Ye4, Ye5 and Ye6 subfamilies, respectively. In addition, 120 Alu Ye4, Ye5 and Ye6 loci were screened using polymerase chain reaction (PCR) assays to determine their phylogenetic origin and levels of human genomic diversity.
The Alu Ye lineage appears to have started amplifying relatively early in primate evolution and continued propagating at a low level as many of its members are found in a variety of hominoid (humans, greater and lesser ape) genomes. Detailed sequence analysis of several Alu pre-integration sites indicated that multiple types of events had occurred, including gene conversions, near-parallel independent insertions of different Alu elements and Alu-mediated genomic deletions. A potential hotspot for Alu insertion in the Fer1L3 gene on chromosome 10 was also identified.
The Alu Yb-lineage is a 'young' primarily human-specific group of short interspersed element (SINE) subfamilies that have integrated throughout the human genome. In this study, we have computationally screened the draft sequence of the human genome for Alu Yb-lineage subfamily members present on autosomal chromosomes. A total of 1,733 Yb Alu subfamily members have integrated into human autosomes. The average ages of Yb-lineage subfamilies, Yb7, Yb8 and Yb9, are estimated as 4.81, 2.39 and 2.32 million years, respectively. In order to determine the contribution of the Alu Yb-lineage to human genomic diversity, 1,202 loci were analysed using polymerase chain reaction (PCR)-based assays, which amplify the genomic regions containing individual Yb-lineage subfamily members. Approximately 20 per cent of the Yb-lineage Alu elements are polymorphic for insertion presence/absence in the human genome. Fewer than 0.5 per cent of the Yb loci also demonstrate insertions at orthologous positions in non-human primate genomes. Genomic sequencing of these unusual loci demonstrates that each of the orthologous loci from non-human primate genomes contains older Y, Sg and Sx Alu family members that have been altered, through various mechanisms, into Yb8 sequences. These data suggest that Alu Yb-lineage subfamily members are largely restricted to the human genome. The high copy number, level of insertion polymorphism and estimated age indicate that members of the Alu Yb elements will be useful in a wide range of genetic analyses.
mobile elements; SINEs
Non-syndromic cleft lip/palate (NSCL/P) is a complex, frequent congenital malformation, determined by the interplay between genetic and environmental factors during embryonic development. Previous findings have pointed to an aetiological overlap between NSCL/P and cancer, and alterations in similar biological pathways may underpin both conditions. Here, using a combination of transcriptomic profiling and functional approaches, we report that NSCL/P dental pulp stem cells exhibit dysregulation of a co-expressed gene network mainly associated with DNA double-strand break repair and cell cycle control (p = 2.88×10⁻² to 5.02×10⁻⁹). This network included important genes for these cellular processes, such as BRCA1, RAD51, and MSH2, which are predicted to be regulated by transcription factor E2F1. Functional assays support these findings, revealing that NSCL/P cells accumulate DNA double-strand breaks upon exposure to H2O2. Furthermore, we show that E2f1, Brca1 and Rad51 are co-expressed in the developing embryonic orofacial primordia, and may act as a molecular hub playing a role in lip and palate morphogenesis. In conclusion, we show for the first time that cellular defences against DNA damage may take part in determining the susceptibility to NSCL/P. These results are in accordance with the hypothesis of aetiological overlap between this malformation and cancer, and suggest a new pathogenic mechanism for the disease.
Piwi-interacting RNAs (piRNAs) are a recently discovered class of small non-coding RNA found in animals. PiRNAs are primarily expressed in the germline where their best understood function is to repress transposable elements. Unlike previous studies that investigated the evolution of piRNA-generating loci at the level of nucleotide substitutions, here we studied the evolution of piRNA-generating loci at the level of copy number variation (i.e. duplications and deletions) using genome-wide copy number variation data from three human populations. Our analysis shows that at the level of copy number variation there is strong selective constraint and a very high mutation rate in human piRNA-generating loci. Our results differ from a model of positive selection on copy number variation in piRNA-generating loci previously proposed in rodents. We discuss possible reasons for this difference based on the transposable element insertion histories in the rodent and primate lineages.
Previous evidence from tooth agenesis studies suggested IRF6 and TGFA interact. Since tooth agenesis is commonly found in individuals with cleft lip/palate (CL/P), we used four large cohorts to evaluate if IRF6 and TGFA interaction contributes to CL/P. Markers within and flanking IRF6 and TGFA genes were tested using Taqman or SYBR green chemistries for case-control analyses in 1,000 Brazilian individuals. We looked for evidence of gene-gene interaction between IRF6 and TGFA by testing if markers associated with CL/P were overtransmitted together in the case-control Brazilian dataset and in the additional family datasets. Genotypes for an additional 142 case-parent trios from South America drawn from the Latin American Collaborative Study of Congenital Malformations (ECLAMC), 154 cases from Latvia, and 8,717 individuals from several cohorts were available for replication of tests for interaction. Tgfa and Irf6 expression at critical stages during palatogenesis was analyzed in wild type and Irf6 knockout mice. Markers in and near IRF6 and TGFA were associated with CL/P in the Brazilian cohort (p < 10⁻⁶). IRF6 was also associated with cleft palate (CP) with impaction of permanent teeth (p < 10⁻⁶). Statistical evidence of interaction between IRF6 and TGFA was found in all data sets (p = 0.013 for Brazilians; p = 0.046 for ECLAMC; p = 10⁻⁶ for Latvians, and p = 0.003 for the 8,717 individuals). Tgfa was not expressed in the palatal tissues of Irf6 knockout mice. IRF6 and TGFA contribute to subsets of CL/P with specific dental anomalies. Moreover, this potential IRF6-TGFA interaction may account for as much as 1% to 10% of CL/P cases. The Irf6-knockout model further supports the evidence of IRF6-TGFA interaction found in humans.
Several computer programs are available for detecting copy number variants (CNVs) using genome-wide SNP arrays. We evaluated the performance of four CNV detection software suites (Birdsuite, Partek, HelixTree, and PennCNV-Affy) in the identification of both rare and common CNVs. Each program's performance was assessed in two ways. The first was its recovery rate, i.e., its ability to call 893 CNVs previously identified in eight HapMap samples by paired-end sequencing of whole-genome fosmid clones, and 51,440 CNVs identified by array Comparative Genomic Hybridization (aCGH) followed by validation procedures, in 90 HapMap CEU samples. The second evaluation was program performance calling rare and common CNVs in the Bipolar Genome Study (BiGS) data set (1001 bipolar cases and 1033 controls, all of European ancestry) as measured by the Affymetrix SNP 6.0 array. Accuracy in calling rare CNVs was assessed by positive predictive value, based on the proportion of rare CNVs validated by quantitative real-time PCR (qPCR), while accuracy in calling common CNVs was assessed by false positive/false negative rates based on qPCR validation results from a subset of common CNVs. Birdsuite recovered the highest percentages of known HapMap CNVs containing >20 markers in two reference CNV datasets. The recovery rate increased with decreased CNV frequency. In the tested rare CNV data, Birdsuite and Partek had higher positive predictive values than the other software suites. In a test of three common CNVs in the BiGS dataset, Birdsuite's call was 98.8% consistent with qPCR quantification in one CNV region, but the other two regions showed an unacceptable degree of accuracy. We found relatively poor consistency between the two “gold standards,” the sequence data of Kidd et al., and aCGH data of Conrad et al. Algorithms for calling CNVs, especially common ones, need substantial improvement, and a “gold standard” for detection of CNVs remains to be established.
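The two accuracy measures used above, recovery rate against a reference set and positive predictive value from validation, can be written down in a few lines. The sketch below assumes that a reference CNV counts as recovered when a single call covers at least half of its length; that 50% criterion, the function names, and the toy numbers are illustrative assumptions rather than the matching rules used in the study.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed convention.
Cnv = Tuple[str, int, int]


def overlap_bp(a: Cnv, b: Cnv) -> int:
    """Number of bases shared by two CNVs (0 if on different chromosomes)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def recovery_rate(reference: List[Cnv], calls: List[Cnv], min_frac: float = 0.5) -> float:
    """Fraction of reference CNVs covered by some call over at least min_frac of their length."""
    if not reference:
        return 0.0
    recovered = sum(
        1 for ref in reference
        if any(overlap_bp(ref, c) >= min_frac * (ref[2] - ref[1]) for c in calls)
    )
    return recovered / len(reference)


def positive_predictive_value(n_validated: int, n_tested: int) -> float:
    """Proportion of tested calls confirmed by an orthogonal assay such as qPCR."""
    return n_validated / n_tested


# Toy data, purely illustrative.
reference = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 50, 150)]
calls = [("chr1", 120, 220), ("chr1", 650, 900), ("chr2", 40, 160)]
rate = recovery_rate(reference, calls)
ppv = positive_predictive_value(n_validated=18, n_tested=20)
print(f"recovery rate: {rate:.3f}, PPV: {ppv:.2f}")
```

Changing `min_frac` or requiring reciprocal overlap would shift the recovery rate, which is one reason results from different reference sets can be hard to compare.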