Centromeres are essential for proper chromosome segregation during mitosis and meiosis. All normal human centromeres are defined by the presence of a predominant satellite DNA family called alpha satellite (1); however, the functional interplay between genome sequences and the epigenetic network involved in kinetochore assembly is poorly understood (2). Efforts to explore the nature of such genomic signals have relied on the ability to study representative functional centromere sequences that colocalize with kinetochore proteins (6), combined with assessment of de novo centromere formation in artificial chromosome assays (7). Previous studies of particular alpha satellite sequence DNAs have supported a sequence-based model of centromere identity (7). However, such studies have been limited to a small number of well-characterized alpha satellite families, and the vast majority of such sequences in the genome have not been evaluated.
The human genome assembly (13) provides the largest available collection of alpha satellite sequences assigned to individual chromosomes and, in concert with extensive experimental evidence, contributes to current models of human centromere sequence organization (7). Well-characterized and assembled alpha satellite DNAs are defined by a highly divergent 171-bp monomer repeat unit, with pairwise sequence identities on the order of 60 to 80% within and between chromosomal subsets (14). This level of sequence divergence within the genome-wide collection of alpha satellite sequences provides an inventory of sequence features for studying CENP-A association and centromere function. Nonetheless, our understanding of the range of sequences capable of de novo centromere formation is limited to a small number of highly characterized alpha satellite DNAs (18), restricting the opportunity to discern genome-wide signals of centromere competency within the majority of assembled alpha satellite sequences.
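As an illustration of the pairwise identity measure referred to above, the following is a minimal sketch and not the procedure used in this study. It assumes that the monomers have already been aligned to a common length, with `-` marking gaps, and it counts a column with a gap in only one sequence as a mismatch. The function names and the short toy sequences are hypothetical; they are not real alpha satellite monomers.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity between two pre-aligned, equal-length sequences.

    Assumption for this sketch: columns where both sequences carry a gap
    ('-') are skipped; a gap opposite a base counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def pairwise_identities(monomers):
    """Map each unordered pair of monomer names to its percent identity."""
    names = sorted(monomers)
    return {
        (n1, n2): percent_identity(monomers[n1], monomers[n2])
        for n1, n2 in combinations(names, 2)
    }


# Hypothetical 10-column toy alignment (not real alpha satellite sequence).
toy_monomers = {
    "m1": "ACGTACGTAC",
    "m2": "ACGTTCGTAC",
    "m3": "ACGAACG-AC",
}

if __name__ == "__main__":
    for pair, identity in pairwise_identities(toy_monomers).items():
        print(pair, round(identity, 1))
```

In practice, full-length 171-bp monomers would first be aligned with a dedicated alignment tool, and the treatment of gap columns is a design choice that shifts the reported identity values.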
In this study, to overcome these limitations, we apply a novel strategy for extracting functional satellite sequence information from assembled human centromeric regions. To achieve this, we provide an annotation of all assembled alpha satellite sequences, reporting sites of intra- and interchromosomal homogenization among assembled monomers. These alpha satellite sequence features are evaluated in the context of a global alpha satellite database from a single individual genome (20), resulting in an informed centromere mappability track with which we monitor cell line-matched CENP-A enrichment patterns in endogenous assembled human regions. From this combined analysis, we classify alpha satellite sequences in human centromeric regions as either functioning or nonfunctioning. Next, to evaluate alpha satellite monomers that are not enriched for CENP-A in the genome, yet have monomer content and organization similar to those of satellite sequences classified as functioning, we selected collections of alpha satellite DNA (in total, comprising ~1 Mb) to test for de novo centromere formation in human artificial chromosome assays, thus identifying sequences that, while not currently functioning in the particular genome tested, might be competent for centromere function in other settings.
This combination of genomic and functional strategies has allowed us to develop an initial epigenomic and functionally annotated map of human assembled centromeric regions, which provides a genetic and epigenetic foundation for further study of these regions of the human genome, their variation, and their underlying biology and function.