HaploReg consists of a PHP interface to a MySQL database. The initial database table was populated using genomic coordinates and sequences for 16 841 biallelic SNPs and small indels from the pilot release of the 1000 Genomes Project (17). In some cases, such as novel indels, the variant call format (VCF) file from the pilot release did not have a RefSNP identifier (rsid); to create a unique identifier for this database, these variants were assigned the label ‘chromosome:position’ in hg18 coordinates. To provide backward compatibility with obsolete rsids, dbSNP release 132 was checked for variants at the same position as 1000 Genomes pilot variants with multiple rsids (18). In addition, annotations of functional consequences were extracted from dbSNP.
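The identifier fallback described above can be sketched as follows. This is a minimal illustration, not the database's actual code: the function name `variant_id` is hypothetical, and the chromosome naming (e.g. `chr1`) is an assumption. The only VCF detail relied on is that a missing ID field is written as `.`.

```python
def variant_id(chrom, pos, rsid=None):
    """Return the rsid when one exists, otherwise a 'chromosome:position' label.

    `chrom` and `pos` are assumed to be hg18 coordinates; a missing VCF ID
    is represented as '.' (or None/empty here).
    """
    if rsid and rsid != ".":
        return rsid
    # Fallback label for variants without an rsid, e.g. novel indels.
    return f"{chrom}:{pos}"
```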
A variety of functional annotations were then intersected with the set of variants using the BEDTools package (19), including the chromatin state segmentation of Ernst et al., and conserved regions by GERP (20) and SiPhy (21). To obtain gene annotations, RefSeq genes (23) were downloaded from the UCSC Genome Browser and GENCODE version 7 (24) was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene in either annotation, as well as the orientation (3′ or 5′) relative to the nearest end of the gene, based on the strand of the gene.
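The strand-aware orientation can be illustrated with a short sketch. It does not reproduce the BEDTools call used here; the function name `gene_proximity`, the 1-based inclusive coordinates and the tie-breaking rule (ties go to the 5′ end) are all assumptions made for illustration.

```python
def gene_proximity(pos, gene_start, gene_end, strand):
    """Return (distance, nearest_end) for a variant relative to one gene.

    Coordinates are assumed 1-based and inclusive, with gene_start <= gene_end.
    Distance is 0 when the variant lies inside the gene.
    """
    d_start = abs(pos - gene_start)
    d_end = abs(pos - gene_end)
    # On the '+' strand the 5' end is gene_start; on the '-' strand it is gene_end.
    if strand == "+":
        d_five, d_three = d_start, d_end
    else:
        d_five, d_three = d_end, d_start
    nearest_end = "5'" if d_five <= d_three else "3'"
    if gene_start <= pos <= gene_end:
        return 0, nearest_end
    return min(d_five, d_three), nearest_end
```

For example, a variant 100 bp before the start coordinate of a gene is reported as 5′ if the gene is on the plus strand and as 3′ if it is on the minus strand.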
To annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and scored on genomic sequences as described previously (6). Briefly, a set of PWMs was collected from TRANSFAC (25), JASPAR (26), and protein-binding microarray (PBM) experiments (27–29). The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg18 sequence obtained from the UCSC Genome Browser (30). PWMs were then scored for instances that passed either of two P-value thresholds, one stringent and one less stringent. Only instances where a motif in the sequence (i) passed the stringent threshold of a PWM in either the reference or the alternate genomic sequence, and (ii) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated. In cases where the weaker match did not pass the less-stringent threshold, an approximate minimum change in LOD score was reported, corresponding to the difference between the score of the stronger match and the score required to pass the less-stringent threshold. In cases where both allelic variants surpassed the less-stringent threshold, the exact difference in score was reported.
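The reporting rule for the LOD change can be sketched as below. This is a hedged illustration under several assumptions: the two P-value thresholds are taken to have already been converted into per-PWM score cutoffs (`stringent_cutoff` and `lenient_cutoff`, hypothetical names), "passing" is taken to mean a score at or above the cutoff, the sign convention (alternate minus reference) is ours, and the overlap condition (ii) is assumed to have been checked by the caller.

```python
def lod_change(ref_score, alt_score, stringent_cutoff, lenient_cutoff):
    """Return (delta, is_exact) for one motif instance, or None if it is filtered out.

    delta is signed as alternate minus reference (an assumed convention).
    is_exact is False when only a minimum change can be reported.
    """
    strong = max(ref_score, alt_score)
    weak = min(ref_score, alt_score)
    # Condition (i): at least one allele must pass the stringent cutoff.
    if strong < stringent_cutoff:
        return None
    sign = 1 if alt_score >= ref_score else -1
    if weak >= lenient_cutoff:
        # Both alleles pass the less-stringent cutoff: report the exact difference.
        return sign * (strong - weak), True
    # Weaker allele fails the less-stringent cutoff: report the minimum change,
    # i.e. the stronger score minus the score needed to pass that cutoff.
    return sign * (strong - lenient_cutoff), False
```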
GWAS results were obtained from the table curated by NHGRI (32) (accessed June 29, 2011). In cases where multiple studies were annotated as pertaining to the same phenotype, unique independent SNPs were consolidated into a single list.
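A minimal sketch of the per-phenotype consolidation follows. The input layout (tuples of phenotype, study and rsid) and the function name `consolidate_by_phenotype` are assumptions for illustration. The sketch only removes duplicate SNPs across studies; it does not attempt to determine whether SNPs are independent of one another.

```python
from collections import defaultdict


def consolidate_by_phenotype(rows):
    """Merge SNPs from multiple studies of the same phenotype into one list.

    rows: iterable of (phenotype, study_id, rsid) tuples (assumed layout).
    Returns a dict mapping each phenotype to a sorted list of unique rsids.
    """
    snps = defaultdict(set)
    for phenotype, _study_id, rsid in rows:
        snps[phenotype].add(rsid)
    return {phenotype: sorted(ids) for phenotype, ids in snps.items()}
```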
LD was calculated using the phased genotype information accompanying the 1000 Genomes Project pilot release (17). VCFTools (33) was used to perform the calculation, using an r² threshold of 0.80 and a maximum distance between variants of 200 kb. Results from VCFTools were then consolidated such that, for every variant in our database, a list of linked variants is accessible for each of the three populations, along with an r² value.
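For readers unfamiliar with the statistic, r² between two biallelic variants can be computed from phased haplotypes as sketched below. This is a textbook calculation, not the VCFTools implementation; the function names are hypothetical, alleles are assumed to be coded 0 (reference) and 1 (alternate), and the filter assumes the thresholds are inclusive (r² at or above 0.80, distance at most 200 kb).

```python
def r_squared(hap_a, hap_b):
    """r^2 between two variants from phased haplotypes.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per haplotype.
    Returns 0.0 if either variant is monomorphic (r^2 is undefined there).
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # alternate allele frequency at A
    p_b = sum(hap_b) / n                                  # alternate allele frequency at B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # frequency of the alt-alt haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    d = p_ab - p_a * p_b                                  # linkage disequilibrium coefficient D
    return d * d / denom


def is_linked(r2, pos_a, pos_b, min_r2=0.80, max_dist=200_000):
    """Apply the r^2 and distance thresholds (inclusive bounds are an assumption)."""
    return r2 >= min_r2 and abs(pos_a - pos_b) <= max_dist
```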
To perform enhancer enrichment analysis on sets of variants, tables of common array designs were obtained from the UCSC Table Browser (34) and lists were constructed of 1000 Genomes SNPs segregating in each of the three pilot populations, as well as all SNPs in the database. Then, a background frequency of coverage was calculated for variants annotated as overlapping a strong enhancer state in each cell type. When a user submits a query list of variants, the coverage of strong enhancers in each cell type is calculated. If the coverage exceeds that of the background set selected by the user, a binomial test is performed, and enrichment is reported if it passes an uncorrected significance threshold of 0.05.
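The enrichment test for a single cell type can be sketched as follows. This is an illustration under stated assumptions rather than the server's code: the function names are hypothetical, the test is assumed to be one-sided (upper tail), and the P-value is assumed to be compared strictly against 0.05.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enhancer_enrichment(n_overlap, n_query, background_freq, alpha=0.05):
    """Test whether query variants overlap strong enhancers more often than background.

    n_overlap: query variants overlapping a strong enhancer in this cell type.
    n_query: total query variants.
    background_freq: fraction of background variants overlapping strong enhancers.
    Returns None when coverage does not exceed background (no test is run),
    otherwise (p_value, is_enriched).
    """
    if n_query == 0:
        return None
    coverage = n_overlap / n_query
    if coverage <= background_freq:
        return None
    p_value = binomial_upper_tail(n_overlap, n_query, background_freq)
    # Uncorrected threshold: no adjustment for the number of cell types tested.
    return p_value, p_value < alpha
```

Because the threshold is uncorrected, testing many cell types for one query list is expected to produce some nominally significant results by chance.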