Completely penetrant mutations in the surfactant protein B gene (SFTPB) and ≥75% reduction of SFTPB expression disrupt pulmonary surfactant function and cause neonatal respiratory distress syndrome. To inform studies of genetic regulation of SFTPB expression, we created a catalogue of SFTPB variants by comprehensive resequencing from an unselected, population-based cohort (N=1,116). We found an excess of low frequency variation (81 SNPs and 5 small insertion/deletions). Despite its small genomic size (9.7 kb), SFTPB was characterized by weak linkage disequilibrium (LD) and high haplotype diversity. Using the HapMap Yoruban and European populations, we identified a recombination hot spot that spans SFTPB, was not detectable in our focused resequencing data, and accounts for weak LD. Using homology based software tools, we discovered no definitively damaging exonic variants. We conclude that excess low frequency variation, intragenic recombination, and lack of common, disruptive exonic variants favor complete resequencing as the optimal approach for genetic association studies to identify regulatory SFTPB variants that cause neonatal respiratory distress syndrome in genetically diverse populations.
The 9.7 kb surfactant protein B gene (SFTPB) (GeneID: 6439 Locus tag: HGNC:10801; MIM: 178640) encodes a 79-amino acid, hydrophobic protein that is critical for function of the pulmonary surfactant (1). Functional pulmonary surfactant, a phospholipid-protein mixture that lines alveoli at the air-liquid interface, maintains alveolar patency at end expiration and is required for successful fetal-neonatal pulmonary transition. Studies in human newborn infants with rare, recessive loss of function SFTPB mutations have demonstrated that genetic disruption of SFTPB expression is completely penetrant and lethal due to dysfunction of the pulmonary surfactant (2, 3). Studies in conditionally regulated murine lineages and human infants indicate that >75% reduction in SFTPB expression is sufficient to cause surfactant dysfunction and respiratory distress (4, 5). To provide a catalogue of SFTPB variants (single nucleotide polymorphisms (SNPs) or insertion-deletions (in/dels)) for use in statistical and functional studies of SFTPB regulation, we used high throughput, comprehensive resequencing of SFTPB in a cohort of sufficient size (N=1,116) to detect low frequency variants. We report an excess of low frequency variation, high rates of intragenic recombination, and a lack of common, damaging exonic variants. Our results suggest that comprehensive resequencing will likely be advantageous over tagSNP genotyping approaches in genetic association analysis of SFTPB.
We extracted genomic DNA from 1,116 Guthrie cards collected for newborn screening by the Missouri Department of Health and Senior Services (DHSS) (6). We linked each DNA sample anonymously to clinical characteristics in a vital statistics (birth-death certificate) database maintained by the Missouri DHSS to determine ethnicity. Using small amplicons (<500 basepairs), robotic, high throughput automated processes, and BigDye terminator sequencing chemistry (7), we bidirectionally sequenced SFTPB, including 1.8 kb of the promoter region, 1.1 kb of exonic sequence (all 10 translated exons), and 5.9 kb that includes all intervening intronic sequence except 380 base pairs (genomic position 1649-2028) in intron 4. We omitted part of intron 4 due to the inability of BigDye terminator sequencing chemistry to resolve variable numbers of dinucleotide repeats in this region (8). We also omitted 1 untranslated exon (exon 11) and its preceding intron (intron 10). All amplification and sequencing primers and conditions are available at http://genome.wustl.edu/activity/med_seq/primers.cgi. We used software applications (Phred, Phrap, PolyPhred, and Consed) to call bases, assemble contigs, and scan sequencing chromatograms for variation (http://www.phrap.org/phredphrapconsed.html). To assess overall sequence quality, we used a quality averaging program (J. Sloan, University of Washington) to quantify the Phred score at each base across SFTPB (Figure 1). Because of variation in trace file quality, analysts reviewed and confirmed or edited all polymorphic sites identified by PolyPhred, sites with in/dels, and all sites previously identified as polymorphic in dbSNP in each individual. After manual polymorphism validation, we extracted genotypes for each DNA sample at the confirmed polymorphic sites for analysis. An average of 90% of genotypes were called in each individual using a minimum Phred score of 20.
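For readers unfamiliar with the Phred scale, the threshold of 20 can be read as an estimated base-call error probability. The short Python sketch below applies the standard definition Q = -10 log10(P); the function names are illustrative and are not part of the Phred software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score that corresponds to an estimated error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Threshold used for genotype calling, and the mean score near polymorphic sites.
    for q in (20, 37):
        print(f"Phred {q}: error probability {phred_to_error_prob(q):.5f}")
```

Under this definition a Phred score of 20 corresponds to an estimated error of 1 in 100 base calls, and each additional 10 points lowers the estimated error tenfold.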
Because of the high proportion of sequence variation attributable to rare, polymorphic sites, we were concerned that SNP detection errors might bias our analysis. Systematic comparison of the results from 2 independent analysts identified 0.99% of calls as discrepant (452/45,505 genotypes): 67% of these were judged as false positive calls (301/452) in low quality (Phred score <20) data, and all discrepant calls were classified as missing data. Using an independent genotyping method, Taqman (9), we compared genotypes at 5 high frequency, polymorphic sites in 558 individuals to the genotypes called from sequence data and found 27 discrepant calls in 2,790 genotypes, with 10 confirmed Taqman heterozygotes, for a false negative heterozygote detection rate of 0.36%. Next, we reamplified and resequenced all heterozygous sites identified in <3 individuals (41 genotypes in 49 individuals) with different primer sets and confirmed genotypes at all of these sites. Finally, we examined base calls and sequence quality (Phred score) at 42 sites polymorphic in other cohorts but not in this cohort (45,780 genotypes). Of the 41,555 genotypes with high quality (Phred score >20) sequence, we found no rare alleles missed by chromatogram analysis (0%). We could not call the remaining 5,317 genotypes (11.6%) due to low quality chromatograms in those specific samples. These results suggest false positive and negative rates of less than 1%.
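The percentages quoted above follow directly from the reported counts. The Python sketch below simply recomputes them; the variable names are illustrative, and the denominators are those stated in the text.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator


# Discrepant calls between 2 independent analysts.
analyst_discrepancy = percent(452, 45_505)

# Share of discrepant calls judged to be false positives in low quality data.
false_positive_share = percent(301, 452)

# Heterozygotes confirmed by Taqman but missed in sequence data.
false_negative_het_rate = percent(10, 2_790)

if __name__ == "__main__":
    print(f"Analyst discrepancy:        {analyst_discrepancy:.2f}%")
    print(f"False positive share:       {false_positive_share:.0f}%")
    print(f"False negative heterozygote: {false_negative_het_rate:.2f}%")
```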
Linkage disequilibrium (LD) is a measure of the allelic correlation between two SNPs. Several LD statistics are available (10); D′ is the ratio of the observed LD to the strongest possible LD given the allele frequencies of the SNPs. |D′|=1 when there is no detectable recombination between SNPs. Haplotypes are patterns of alleles across multiple SNPs along a single chromosome. We used PHASE (v. 2.1) to infer haplotypes computationally from genotypes within each racial group (11, 12). To assess whether haplotypes of common variants (minor allele frequency (MAF) >5%) can predict genotype at low frequency SFTPB alleles, we used HAPLOVIEW (v. 3.31) (http://www.broad.mit.edu/mpg/haploview/) in aggressive mode to select a minimal set of tagSNPs such that all other SNPs were strongly correlated (r² ≥ 0.8) with either a tagSNP or a haplotype of several tagSNPs (13). We used PHASE to estimate background recombination rate, determine hot spot location, and compute Bayes factors (BFs) as previously described (14) for either intragenic SNPs with MAF >5% or for HapMap SNPs (MAF >5%) within 50 kb of SFTPB (data release #21 as of July, 2006)(http://www.hapmap.org). BFs are likelihood ratios: the probability of the observed data assuming a recombination hot spot divided by the probability of the data assuming uniform recombination across the region. A BF of 10 suggests that the haplotype data at a genomic location are 10 times more likely under the presence of a hot spot than under its absence, and a BF of >10 is substantive evidence for the presence of a recombination hot spot.
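The two LD statistics mentioned above can be illustrated with a minimal Python sketch that computes D′ and r² for a pair of biallelic SNPs from phased haplotype counts. This is not the HAPLOVIEW or PHASE implementation; the function name and the example counts are hypothetical.

```python
def ld_statistics(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> tuple[float, float]:
    """Return (D', r^2) for two biallelic SNPs from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at SNP 2
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic")

    # D: observed haplotype frequency minus the frequency expected under independence.
    d = n_AB / total - p_A * p_B

    # Largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical counts: one of the four haplotypes is never observed.
    d_prime, r2 = ld_statistics(40, 10, 0, 50)
    print(f"D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

In the hypothetical example only three of the four possible haplotypes occur, so |D′| is 1 (no detectable recombination) even though r² is about 0.67 and would fall below the r² ≥ 0.8 tagging threshold. This is why the two statistics are reported for different purposes.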
Discovery of genomic regions under selective pressure may help inform genetic association studies, because evolutionarily constrained sequences are presumably functional. We used 3 statistical strategies to screen SFTPB for selective pressure. To assess whether genetic variation in regions of SFTPB was consistent with neutral evolution, we used two statistical tests of observed sequence diversity against theoretical predictions for neutral sequence, Tajima's D (15) and Fu and Li's D* (16). Tajima's D compares 2 descriptive statistics (theta and pi) for sequence diversity: theta (θ) is based on the number of chromosomes screened and the number of polymorphisms observed in SFTPB (17), while pi (π) is based upon the number of chromosomes screened and the average allele frequency of the polymorphisms identified (18, 19). We used SLIDER (http://genapps.uchicago.edu/slider/index.html) to calculate Tajima's D. Fu and Li's D* compares π against a third sequence diversity statistic derived from the number of singleton polymorphisms observed (SNPs with the rare allele observed only once in the data) (19).
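As an illustration of how θ and π enter Tajima's D, the Python sketch below computes the statistic from per-site minor allele counts using the standard formulas from Tajima (15). It is not the SLIDER implementation; the function name and inputs are hypothetical, and π is computed as the mean number of pairwise differences, which is one common way of expressing the frequency-based estimator described above.

```python
import math


def tajimas_d(minor_allele_counts: list[int], n_chromosomes: int) -> float:
    """Tajima's D from the minor allele count at each segregating site."""
    n = n_chromosomes
    s = len(minor_allele_counts)          # number of segregating sites
    if s == 0 or n < 2:
        raise ValueError("need at least one segregating site and two chromosomes")

    # pi: expected pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in minor_allele_counts)

    # Watterson's theta: number of segregating sites scaled by a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = s / a1

    # Variance terms from Tajima's derivation.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Hypothetical data: 10 chromosomes, 5 sites.
    print("all singletons:        ", round(tajimas_d([1] * 5, 10), 2))
    print("intermediate frequency:", round(tajimas_d([5] * 5, 10), 2))
```

When most variants are rare, π falls below θ and D is negative, which is the pattern described as an excess of low frequency variation; variants at intermediate frequency push D above zero.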
We also characterized selection pressure by using the ratio of non-synonymous to synonymous substitution rates (dN/dS) calculated from the observed SNPs using SNAP (Synonymous/Non-synonymous Analysis Program) (http://www.hiv.lanl.gov/content/hiv-db/SNAP/WEBSNAP/SNAP.html) (20, 21). A dN/dS ratio >1 suggests more non-synonymous substitutions than expected under the neutral model and is evidence for positive selection, whereas a dN/dS ratio <1 is evidence for purifying selection against some amino acid replacement mutations.
The third statistic we used was the McDonald-Kreitman test (22), which compares the within-species dN/dS ratio for polymorphism in our sample against the between-species ratio for fixed differences (23) (http://www.ebi.ac.uk/clustalw/).
We analyzed all data using Statistical Analysis System (v. 9.3.1)(SAS, Inc., Cary, N.C.). The Human Research Protection Office at the Washington University Medical Center and the Institutional Review Board at the Missouri DHSS reviewed and approved this study.
We were unable to screen 380 bp of intron 4 due to a highly polymorphic repeat region. In the remaining sequence, we found 86 polymorphic sites including 81 SNPs and 5 small in/dels (9.8 polymorphic sites per 1,000 basepairs of SFTPB reference sequence), with similar frequencies in the promoter (8 per 1,000 basepairs), introns (10 per 1,000 basepairs), and exons (12 per 1,000 basepairs)(χ2 analysis, P=0.7) (Table 1). The overall SNP density was 9.2/1,000 basepairs. The Phred scores within 10 base pairs of each polymorphic site (37 ± 6)(mean ± S.D.) were excellent, suggesting that sequence quality did not limit genetic variant discovery (Figure 1). The average number of polymorphic sites per individual was greater in African-Americans than other races (all P<.01) (Table 1). The race-specific, relative genotype frequencies at each polymorphic site did not differ significantly from Hardy-Weinberg prediction (all P>.05). The majority of variant sites in SFTPB were low frequency: 67 of 86 sites had MAF <5%. Potentially disruptive variants were also rare: 8 of 9 nonsynonymous variants and 6 of 7 intronic SNPs within 20 base pairs of an intron-exon junction were rare. To determine whether nonsynonymous SNPs might disrupt surfactant protein B function, we used 2 homology-based software tools, SIFT (Sorting Intolerant from Tolerant)(24) and PolyPhen (25). We found that 8 of 9 sites were not classified as intolerant or damaging. One site (genomic position 2558) in exon 5 that encodes either glycine or glutamic acid (G183E) was classified as probably damaging by PolyPhen, but tolerated by SIFT, and is rare (MAF 0.1%). The lack of definitively damaging or intolerant SNPs in this large cohort suggests strong purifying selective pressure against rare variants that encode dysfunctional surfactant protein B, likely due to the critical role of the encoded protein in successful fetal-neonatal pulmonary transition (26). Despite a much larger cohort size evaluated (1,116 vs. 90 individuals from the Polymorphism Resource Discovery panel), these estimates are considerably lower than estimates of damaging exonic variants in 213 environmental genes (27).
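The comparison of genotype frequencies with Hardy-Weinberg prediction can be sketched as a 1 degree of freedom χ2 goodness-of-fit test. The text does not state which test was used, so the Python example below is an assumption, with hypothetical genotype counts; for rare alleles an exact test is often preferred because expected counts are small.

```python
import math


def hardy_weinberg_chi2(n_AA: int, n_Aa: int, n_aa: int) -> tuple[float, float]:
    """Return (chi-square, P value) for departure from Hardy-Weinberg proportions."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)       # frequency of allele A
    q = 1 - p
    if p in (0.0, 1.0):
        raise ValueError("site is monomorphic")

    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts at one site in 200 individuals.
    chi2, p_value = hardy_weinberg_chi2(120, 70, 10)
    print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```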
To determine whether variants at intron-exon junctions might disrupt expression, we used a neural network application (http://www.fruitfly.org/seq_tools/splice-instrucs.html) trained to recognize potential human splice sites on the basis of a large training set of known human splice sites. We found that the only common intron-exon junction SNP (genomic position 4550, rs893159) was predicted to alter RNA splicing by creating a second acceptor site for exon 8. The score for a second acceptor site increased from 0.47 to 0.78 when the minor allele was substituted, while the score for the predicted exon 8 acceptor site is 0.65. This finding suggests that RNA splicing may be altered by this SNP.
To validate experimentally a published mathematical simulation of the number of haploid genomes required to detect SNPs with MAF greater than a given frequency (28), we performed 1,000 race-stratified sampling iterations for SFTPB (Table 2). Our data for SFTPB confirm the theoretical prediction based on the standard neutral model of population genetics and show that a cohort size of ≤48 haploid genomes will miss 11% to 18% of SNPs with frequencies of ≥1%, providing direct evidence of the influence of population history on estimates of cohort size necessary to detect rare SNPs.
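The dependence of SNP detection on cohort size can be illustrated with a simple closed-form calculation. The Python sketch below is not the cited simulation (28) or the resampling procedure used here; it assumes independently sampled chromosomes and a single SNP at a fixed allele frequency, so that the probability of seeing the minor allele at least once is 1 - (1 - MAF)^n. Function names are illustrative.

```python
import math


def detection_probability(maf: float, n_chromosomes: int) -> float:
    """Probability that the minor allele appears at least once in the sample."""
    return 1 - (1 - maf) ** n_chromosomes


def min_chromosomes(maf: float, power: float) -> int:
    """Smallest number of haploid genomes that reaches the requested detection probability."""
    return math.ceil(math.log(1 - power) / math.log(1 - maf))


if __name__ == "__main__":
    for n in (48, 200, 2232):  # 2232 = 2 x 1,116 individuals
        print(f"n = {n:5d}: P(detect MAF 1%) = {detection_probability(0.01, n):.3f}")
    print("haploid genomes for 95% detection at MAF 1%:", min_chromosomes(0.01, 0.95))
```

This per-SNP figure is not directly comparable to the 11% to 18% reported above, which aggregates over all SNPs with frequencies of 1% or more, most of which are well above 1%.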
Statistical power of genetic association studies may be increased, and genotyping costs decreased by identifying highly correlated tagSNPs. Linkage disequilibrium (LD) is a statistical measure of allelic correlation between polymorphisms. Using common genotypes (MAF>5%), we detected weak LD across SFTPB despite its small genomic size (Figure 2). In view of the effect of cohort size on LD, we randomly selected European-American cohorts similar in size to the African-American cohort and found similar results (29, 30). Using the tagger function in HAPLOVIEW, we were unable to capture rare variants when using common markers as tagSNPs. Using the Genome Variation Server maintained by Seattle SNPs (http://gvs.gs.washington.edu/GVS), we found weak LD within SFTPB. Weak LD suggests that the genomic region that includes SFTPB spans a recombination hot spot (14).
We used PHASE with common genotypes (MAF >5%) to infer haplotypes (Figure 3) and observed high haplotype diversity consistent with intragenic recombination. To determine whether SFTPB includes a recombination hot spot, we estimated recombination parameters with PHASE and calculated Bayes factors (BF), a measurement of the strength of the evidence for a recombination hot spot (14). In the resequencing data alone, the intragenic recombination rate over background (Figure 4a) and BF values (5.9 in European-American, 2.2 in African-American) did not suggest a recombination hot spot. However, when we calculated recombination rate and BFs for a 107 kb window flanking SFTPB in HapMap data, we found a 20 fold to 80 fold increase in recombination rate within SFTPB (Figure 4b), and BF values of 1353 in both populations. As suggested by comparison of BFs with background recombination rates in each of these cohorts (Figure 5), the high intragenic recombination rate was not detected in the resequencing data because the recombination hot spot spans most of the resequenced region.
To test whether SFTPB variation is consistent with predictions from the neutral theory of molecular evolution, we used Tajima's D and Fu and Li's D* (Table 3). Both measures were consistently negative for both African-Americans and European-Americans, suggesting an excess of low frequency variation in SFTPB, although this trend was not significant. Using a sliding window approach (Figure 6) (19), we found that the genomic region that encodes mature surfactant protein B (exons 6 and 7) had the most negative values, consistent with negative selection against variation in these exons.
To evaluate conservation across species, we compared dN/dS in this cohort with SFTPB in Mus musculus (GenBank number: NM147779). The overall dN/dS ratio for this cohort was 2.0 (8 non-synonymous and 4 synonymous sites). In a human-mouse comparison, SNAP determined the dN/dS ratio to be 0.94 (2 non-synonymous and 2.12 synonymous) across these two species, consistent with neutral evolution over time. The McDonald-Kreitman test was also consistent with neutral evolution (χ2 = 0.43, P-value = 0.51). These results suggest that although much of the variation in SFTPB is selectively neutral, the excess of low frequency variation near the exons containing mature SFTPB may be attributable to the presence of a modest number of mildly deleterious polymorphisms subject to negative selective pressure.
Because neonatal respiratory distress syndrome is unambiguously associated with rare, recessive SFTPB mutations and is observed when SFTPB expression is reduced by >75% (2-5), SFTPB is a candidate gene for neonatal respiratory distress syndrome. Previous studies using unrelated, case-control designs or family-based association tests with genotypes at high frequency polymorphic sites have suggested association between genotypes or haplotypes and neonatal respiratory distress (31-33). To inform studies of genetic regulation of SFTPB, we adapted production level, PCR-based sequencing technology for comprehensive genetic variant discovery (7). We found high SNP density (28, 34), weak LD, and, using data from the HapMap Project, strong evidence for a recombination hot spot within SFTPB. The coincidence of high SNP density, excess low frequency sites, and high recombination rate has been observed at other loci in Drosophila and humans (35-37), consistent with an elevated mutation rate within recombination hotspots. These characteristics suggest that use of common SFTPB haplotypes or tagSNPs will not capture statistically robust associations with disease causing alleles in unrelated, genetically diverse, case-control cohorts (38). Genetic bottlenecks in small populations will increase LD, but typically do so only for rare subsets of SNPs. LD between the higher frequency SNPs will not be substantially altered by bottlenecks in founder populations. Thus, at SFTPB, comprehensive resequencing in large case-control cohorts is advantageous for genetic association studies of neonatal respiratory distress syndrome, because the elevated mutation rate enhances the frequency of rare, deleterious mutations, while the high recombination rate makes LD between common SNPs too low for useful tagSNP selection.
In view of the lack of common, damaging exonic SNPs observed in SFTPB, association studies of neonatal respiratory distress syndrome will need to focus on regulatory variation. For example, our data using a neural network application trained to recognize potential human splice sites suggest that the intron-exon junction SNP at genomic position 4550 (rs893159) may alter RNA splicing, resulting in misprocessed or misdirected surfactant protein B, and disrupting surfactant function. Our results also suggest the value of mechanistic studies in the genetic pathogenesis of SFTPB mutations. A second SNP in intron 2 (SNP 1013, rs3024798) may affect recombination rates within SFTPB, because it disrupts a motif in intron 2 (CCTCCCT > CCTCCAT) that has been associated with recombination hotspot activity (39). Recombination rates correlate positively with mutation rates, so high recombination rate alleles may be more prone to the de novo SFTPB mutations seen in severe neonatal respiratory distress syndrome.
The authors thank members of the Seattle SNPs team (D. A. Nickerson, D. C. Crawford, M. J. Rieder, J. Sloan, and M. Eberle), R. H. Waterston, and H. R. Colten for helpful suggestions, and members of the Missouri DHSS (J. Eckstein, G. Land, M. Mosley, and J. Stockbauer) for collaboration.
Statement of financial support:
This work was supported by grants from the National Heart, Lung, and Blood Institute (RO1 HL 065174 to F.S.C., RO1 HL 065385 to A.H.), from the National Human Genome Research Institute (U54 HG 003079 to RKW), from the Children's Discovery Institute of St. Louis Children's Hospital (FSC and AH), and from the Saigh Foundation (FSC and AH).
Prior presentation of data:
These data have been presented in abstract form at the Pediatric Academic Societies Meeting (2005, 2006), at the NHLBI Program For Genomic Applications Meeting (2005), and at the Pulmonary Surfactant Meeting of the Federation of American Societies for Experimental Biology (2006).